Pandas is an excellent Python library to manage data. I have barely started using it, and it already solves some of my recurring needs. Matplotlib lets you create advanced charts, and Openpyxl is very useful to manage Excel files, for example to read or generate them.
So here are some tips, for my own memory and maybe for you.
Let's learn data analysis with these 3 beautiful tools. First install or check Python and pip, then the libraries.
pip install pandas
pip install matplotlib
pip install openpyxl
Work with data files (CSV, Excel, MySQL, Postgres, JSON...)
Read and display an Excel file
import pandas as pd
from tabulate import tabulate

df = pd.read_excel('C:/Users/Your file.xlsx', sheet_name='Export', engine='openpyxl', usecols=['My first column', 'My second column'])
print(tabulate(df.head(35), headers='keys', tablefmt='psql', showindex=True))
I set engine='openpyxl' because it is an XLSX file. sheet_name specifies the sheet to import; specify your columns with usecols. Then I use tabulate to display my data properly, in a Postgres way (tablefmt='psql'). With head(35) I only display the first 35 lines, but the dataframe, here named df, contains all the data.
If you want to re-organize the columns, please see the Miscellaneous chapter.
Read a CSV file
import pandas as pd
from tabulate import tabulate as tab

df_subscriber = pd.read_csv(workDirectory+'joo_acymailing_subscriber.csv', sep=';', usecols=['subid', 'source', 'email'])
print(tab(df_subscriber.head(10), headers='keys', tablefmt='psql', showindex=False))
Add column names to a CSV without column names
Use names=[]:
import pandas as pd
from tabulate import tabulate

df = pd.read_csv(inputMyFilePath, names=['Mon Champ 1', 'Mon Champ 2', 'Mon Champ 3'], sep=',', keep_default_na=False).astype(str)
print(tabulate(df.head(10), headers='keys', tablefmt='psql', showindex=False))
Read a non-standard Excel file
Sometimes a file has junk top lines, or you cannot trust the future column names.
For example, to read a file from a specific row while skipping the last rows:
import pandas as pd
from tabulate import tabulate

# header=5-1: the header is on the 5th line in Excel, and pandas counts from 0
df = pd.read_excel(r'C:/Users/Georges/Downloads/test excel.xlsx', sheet_name='Sheet1', engine='openpyxl', usecols=[0,1], header=5-1, skipfooter=1, names=['id', 'value'])
print(tabulate(df.head(35), headers='keys', tablefmt='psql', showindex=False))
Or skip a specific line (with skiprows=[...]):
...
df = pd.read_excel(my_file, sheet_name='Feuil1', engine='openpyxl', skiprows=[2])
...
Read from MySQL
Installing sqlalchemy (the best way with pandas):
from tabulate import tabulate as tab
import pandas as pd
from sqlalchemy import create_engine

username = 'root'
password = ''
port = 3308
database = 'my_beautiful_db'
host = 'localhost'

engine = create_engine('mysql+mysqldb://%s:%s@%s:%i/%s' % (username, password, host, port, database))

sql = 'SELECT COUNT(*) AS total FROM communes ;'
df = pd.read_sql_query(sql, engine).set_index('total')
print(tab(df.head(5), headers='keys', tablefmt='psql'))
Or installing mysqlclient:
import MySQLdb
from tabulate import tabulate
import pandas as pd

mydb = MySQLdb.connect(user='root', password='', database='MyDb', host='localhost', port=3306)

query = 'SELECT name FROM users ;'
df = pd.read_sql(query, mydb)
mydb.close()

print(tabulate(df.head(10), headers='keys', tablefmt='psql'))

number = df.shape[0]
print(number)
Or, below, installing mysql.connector:
import mysql.connector as connection
import pandas as pd
from tabulate import tabulate as tab

mydb = connection.connect(host='XXX.XX.XXX.XX', port=3308, database='MyDb', user='Me', passwd='pwdXXX!456', use_pure=True)

try:
    query = "SELECT firstname, lastname FROM customer ;"
    df = pd.read_sql(query, mydb)
    print(tab(df.head(5), headers='keys', tablefmt='psql'))
except Exception as e:
    print(str(e))
finally:
    mydb.close()
Read from Postgres
import pandas as pd
from tabulate import tabulate
import psycopg2

con = psycopg2.connect(database='mytransformations', user='postgres', password='', host='localhost', port='5432')
print("OK, we are connected to mytransformations")

# DISPLAY WITH PANDAS
df = pd.read_sql("SELECT * FROM myparcels ;", con)
df.drop('geom', axis=1, inplace=True)

print('\nResults:')
print(tabulate(df.head(10), headers='keys', tablefmt='psql', showindex=False))
print('Number of records:', df.shape[0])
Read from a JSON
import pandas as pd
from tabulate import tabulate
import json
import requests

response = requests.get(MON_URL)
my_file = json.loads(response.text)
my_json = my_file['data']
mon_df = pd.DataFrame.from_dict(my_json)
Export a data-frame to an Excel file
df.to_excel("C:/Users/Your new file.xlsx", index=False, sheet_name='Tab 1')
index=False to not get the index column.
Export a data-frame to an existing Excel sheet
Use mode='a' and if_sheet_exists in a real writer:
import pandas as pd

My_File_Path = 'C:/My/File.xlsx'

# if_sheet_exists='replace' (pandas >= 1.3) replaces the sheet if it already exists
with pd.ExcelWriter(My_File_Path, engine='openpyxl', mode='a', if_sheet_exists='replace') as writer:
    df.to_excel(writer, sheet_name='My sheet', index=False)
Export several dataframes to Excel sheets
If you have several data-frames to export into one Excel file, each in its own sheet, you should open an ExcelWriter. And close it.
writer = pd.ExcelWriter('C:/Users/Your new file.xlsx', engine='xlsxwriter')
df_Country.to_excel(writer, index=False, sheet_name='Country', header=['Country', 'Total'])
df_Emails.to_excel(writer, index=False, sheet_name='Emails', header=['Emails', 'Total'])
writer.close()  # writer.save() in old pandas versions
Export several dataframes to the same sheet
# Reuse the writer from the previous example; startcol/startrow place the dataframe in the sheet
df.to_excel(writer, index=False, sheet_name='Registrations', startcol=3, startrow=34, header=['Field', 'Total', '%'])
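For context, a minimal sketch of the whole pattern, assuming two hypothetical dataframes df_Summary and df_Details:

import pandas as pd

# Sketch: two hypothetical dataframes written to the same sheet,
# the second one offset with startcol/startrow.
with pd.ExcelWriter('C:/Users/Your new file.xlsx', engine='xlsxwriter') as writer:
    df_Summary.to_excel(writer, index=False, sheet_name='Registrations')
    df_Details.to_excel(writer, index=False, sheet_name='Registrations', startcol=3, startrow=34)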
Export a dataframe as a txt file, with separator and ANSI encoding
df.to_csv(my_file_import, header=None, index=None, sep='\t', encoding='mbcs')
Add filters to all columns of one sheet in an Excel file
import openpyxl
from openpyxl.utils import get_column_letter

outputExcelFile = r'C:/User/Your file.xlsx'
workbook = openpyxl.load_workbook(outputExcelFile)
worksheet = workbook['Sheet name']

FullRange = 'A1:' + get_column_letter(worksheet.max_column) + str(worksheet.max_row)
worksheet.auto_filter.ref = FullRange

workbook.save(outputExcelFile)
Add filters to all columns of all sheets in an Excel file
import openpyxl
from openpyxl.utils import get_column_letter

outputExcelFile = r'C:/User/Your file.xlsx'
workbook = openpyxl.load_workbook(outputExcelFile)

sheetsList = workbook.sheetnames
for sheet in sheetsList:
    worksheet = workbook[sheet]
    FullRange = 'A1:' + get_column_letter(worksheet.max_column) + str(worksheet.max_row)
    worksheet.auto_filter.ref = FullRange

workbook.save(outputExcelFile)
Add colors to all column names in a sheet of an Excel file
import openpyxl
from openpyxl.styles import PatternFill

workbook = openpyxl.load_workbook(My_File_Path)
worksheet = workbook['My sheet']

for cell in worksheet[1]:
    worksheet[cell.coordinate].fill = PatternFill(fgColor='FFC6C1C1', fill_type='solid')

workbook.save(My_File_Path)
Add colors to all column names in all sheets of an Excel file
import openpyxl
from openpyxl.styles import PatternFill

outputExcelFile = r'C:/Users/Your File.xlsx'
workbook = openpyxl.load_workbook(outputExcelFile)

sheetsList = workbook.sheetnames
for sheet in sheetsList:
    worksheet = workbook[sheet]
    for cell in workbook[sheet][1]:
        worksheet[cell.coordinate].fill = PatternFill(fgColor='FFC6C1C1', fill_type='solid')

workbook.save(outputExcelFile)
Modify a column width in one sheet in an Excel file
import openpyxl

outputExcelFile = r'C:/Users/Your File.xlsx'
workbook = openpyxl.load_workbook(outputExcelFile)

workbook['Your sheet'].column_dimensions['A'].width = 75

workbook.save(outputExcelFile)
Modify all column widths in all sheets of an Excel file
The code below sets the width of column A to 30, and the others to 10.
import openpyxl
from openpyxl.utils import get_column_letter

outputExcelFile = r'C:/Users/Your File.xlsx'
workbook = openpyxl.load_workbook(outputExcelFile)

sheetsList = workbook.sheetnames
for sheet in sheetsList:
    for cell in workbook[sheet][1]:
        if get_column_letter(cell.column) == 'A':
            workbook[sheet].column_dimensions[get_column_letter(cell.column)].width = 30
        else:
            workbook[sheet].column_dimensions[get_column_letter(cell.column)].width = 10

workbook.save(outputExcelFile)
Add color to some lines, recovering their indexes from a condition
list_duplicate = df.index[df['Ligne en doublon'] == 'DOUBLON !'].tolist()

# index+2: one for the header row, one because Excel rows are 1-based
for index in list_duplicate:
    for cell in worksheet[index+2]:
        worksheet[cell.coordinate].fill = PatternFill(fgColor='FF6666', fill_type='solid')

workbook.save(My_File_Path)
Freeze the top row of an Excel file
worksheet.freeze_panes = 'A2'
Change font color and style in an Excel file
from openpyxl.styles import Font worksheet['A1'].font = Font(color='FFFFFF', italic=False, bold=True)
Work on several Excel files
Here we freeze the top row, add filters, and set the color and width of all columns in 3 Excel files.
import openpyxl
from openpyxl.utils import get_column_letter
from openpyxl.styles import PatternFill

File1 = r'D:/Work/My file 1.xlsx'
File2 = r'D:/Work/My file 2.xlsx'
File3 = r'D:/Work/My file 3.xlsx'
ExcelList = [File1, File2, File3]

for ExcelFile in ExcelList:
    workbook = openpyxl.load_workbook(ExcelFile)
    sheetsList = workbook.sheetnames
    for sheet in sheetsList:
        worksheet = workbook[sheet]
        FullRange = 'A1:' + get_column_letter(worksheet.max_column) + str(worksheet.max_row)
        worksheet.auto_filter.ref = FullRange
        worksheet.freeze_panes = 'A2'
        for cell in workbook[sheet][1]:
            worksheet[cell.coordinate].fill = PatternFill(fgColor='FFC6C1C1', fill_type='solid')
            if get_column_letter(cell.column) == 'A':
                workbook[sheet].column_dimensions[get_column_letter(cell.column)].width = 12
            else:
                workbook[sheet].column_dimensions[get_column_letter(cell.column)].width = 20
    workbook.save(ExcelFile)
Insert a picture in an Excel file
And delete it afterwards.
import os
import openpyxl

# Assumes workbook and outputExcelFile from the previous examples
img = openpyxl.drawing.image.Image('C:/Users/myplot.png')
img.anchor = 'E6'
workbook['Your sheet'].add_image(img)
workbook.save(outputExcelFile)

os.remove('C:/Users/myplot.png')
Open a box to search for a file and get its path
Use Tkinter.
from tkinter import Tk
from tkinter.filedialog import askopenfilename

Tk().withdraw()
filename = askopenfilename()
print(filename)
Open a box to save an Excel file
from tkinter import Tk
from tkinter.filedialog import asksaveasfilename

fileName = asksaveasfilename(filetypes=(("Excel files", "*.xlsx"), ("All files", "*.*")), defaultextension='.xlsx')
df.to_excel(fileName, index=False, sheet_name='Tab 1', header=None)
Use the first row as column labels
my_labels = df.iloc[0]
df = df[1:]
df.columns = my_labels
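A hedged alternative when you read the file yourself: ask pandas to use that row as labels at read time (the path below is an assumption).

import pandas as pd

# Sketch: header=0 makes pandas take the first row as column labels while reading
df = pd.read_csv('C:/_gh/0/export.csv', header=0)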
Check if a file exists
And use it only if it exists:
import os
import pandas as pd

my_file = 'C:/_gh/0/export.xlsx'

if os.path.isfile(my_file):
    df_ExportFeuil1 = pd.read_excel(my_file, sheet_name='Feuil1', engine='openpyxl')
Excel conditional formatting
If a number is equal to, in just one cell
Here we add, in one cell (B3), a rule to fill it in blue if it equals 2:
import openpyxl
from openpyxl import styles, formatting

myFile = 'C:/Path/test.xlsx'
myDF.to_excel(myFile, index=False, sheet_name='Tab 1')

workbook = openpyxl.load_workbook(myFile)
worksheet = workbook['Tab 1']

myBlue = styles.PatternFill(start_color='0088cc', end_color='0088cc', fill_type='solid')
worksheet.conditional_formatting.add('B3', formatting.rule.CellIsRule(operator='equal', formula=['2'], fill=myBlue))

workbook.save(myFile)
If a number is greater than, in a whole column
Same as above; we just change the operator to greaterThan and target the whole column B as Excel does, with $1:$1048576:
import openpyxl
from openpyxl import styles, formatting

myFile = 'C:/Path/test.xlsx'
myDF.to_excel(myFile, index=False, sheet_name='Tab 1')

workbook = openpyxl.load_workbook(myFile)
worksheet = workbook['Tab 1']

myBlue = styles.PatternFill(start_color='0088cc', end_color='0088cc', fill_type='solid')
worksheet.conditional_formatting.add('B1:B1048576', formatting.rule.CellIsRule(operator='greaterThan', formula=['2'], fill=myBlue))

workbook.save(myFile)
Operators
The above method supports several operators:
{'beginsWith', 'endsWith', 'notBetween', 'notContains', 'lessThan', 'greaterThanOrEqual', 'notEqual', 'between', 'lessThanOrEqual', 'containsText', 'greaterThan', 'equal'}
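For example, a sketch with the between operator, reusing workbook, worksheet, myBlue and myFile from the snippets above (the bounds 2 and 10 are arbitrary):

# Fill the cells of column B whose value lies between 2 and 10 (bounds included)
worksheet.conditional_formatting.add('B1:B1048576', formatting.rule.CellIsRule(operator='between', formula=['2', '10'], fill=myBlue))
workbook.save(myFile)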
If contains a string
Matching strings is a little more specific, with redundant code and building the column_slot ourselves, using the size of our dataframe +1.
Here, for example, we fill in red the cells of column A when they contain the string CSS.
import pandas as pd
import openpyxl
from openpyxl import styles, formatting
from openpyxl.styles.differential import DifferentialStyle
from openpyxl.formatting.rule import Rule

myFile = 'C:/Path/test.xlsx'
myDF.to_excel(myFile, index=False, sheet_name='Tab 1')

workbook = openpyxl.load_workbook(myFile)
worksheet = workbook['Tab 1']

myRed = styles.PatternFill(start_color='ffc7ce', end_color='ffc7ce', fill_type='solid')

column_slot = 'A2:A' + str(myDF.shape[0]+1)

styleRed = DifferentialStyle(fill=myRed)
rule = Rule(type='containsText', operator='containsText', text='css', dxf=styleRed)
rule.formula = ['SEARCH("css",' + column_slot + ")"]
worksheet.conditional_formatting.add(column_slot, rule)

workbook.save(myFile)
Split file
What follows is not strictly about Pandas, but useful in some Pandas contexts.
First we count the number of lines in the file (from the occurrences of \n), then we split it into files of 50,000 lines if it exceeds 50,000 lines.
import os
from math import ceil

with open(workDirectory+'FixQueries.sql', 'r') as myfile:
    data = myfile.read()

taille_fichier = data.count("\n")
lines_max = 50000
numero_fichier = 0

if taille_fichier > lines_max:
    print('Warning: the output file is '+str(taille_fichier)+' lines long! Please wait...')
    smallfile = None
    with open(workDirectory+'FixQueries.sql') as bigfile:
        for lineno, line in enumerate(bigfile):
            if lineno % lines_max == 0:
                if smallfile:
                    smallfile.close()
                numero_fichier += 1
                small_filename = workDirectory + 'FixQueries {}.sql'.format(numero_fichier)
                smallfile = open(small_filename, "w")
            smallfile.write(line)
        if smallfile:
            smallfile.close()
    print('We split it into', ceil(taille_fichier/lines_max), 'files!\n')
    os.remove(workDirectory+'FixQueries.sql')
And then merge the files:
filenames = ['C:/_gh/0/file_25000.txt', 'C:/_gh/0/file_50000.txt', 'C:/_gh/0/file_75000.txt', 'C:/_gh/0/file_100000.txt', 'C:/_gh/0/file_125000.txt']

with open('C:/_gh/0/CUMUL1.txt', 'w') as outfile:
    for names in filenames:
        with open(names) as infile:
            outfile.write(infile.read())
Maybe you will need to list the names of the files before merging them, with PowerShell (Alt+F+R):
get-childitem | select-object -expandproperty name > _files_list.txt
Be careful to keep the order of the files, and to keep the last line empty.
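A hedged Python alternative to build the list, sorting the parts numerically (the path and pattern are assumptions), since a plain alphabetical sort would put file_100000 before file_25000:

import glob
import re

# Sketch: collect the split files and sort them by the number in their name
filenames = sorted(glob.glob('C:/_gh/0/file_*.txt'),
                   key=lambda f: int(re.search(r'(\d+)', f).group(1)))
print(filenames)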
Replace text in a file
my_file = r'C:/_gh/0/_Urbanisme/test.txt'

replacements = [
    ('heros', 'héros'),
    ('Batm?n', 'Batman'),
    ('télévisee', 'télévisée'),
    (' s attaquent', " s'attaquent")
]

with open(my_file, 'r') as file:
    data = file.read()
    for a, b in replacements:
        data = data.replace(a, b)

with open(my_file, 'w') as file:
    file.write(data)

# Confirm
print("OK !")
Advanced write in a TXT file
This is not directly related to Pandas, but it is very useful to store some special notes during a data process, and to automate it.
Create/replace a TXT file and write something into it
Use the Write option (w):
MyTxtFile = 'C:/_gh/0/My beautiful file.txt'

with open(MyTxtFile, 'w') as file:
    file.write('Blablabla...\n')
    file.write(MyVariable)
Add some lines in an existing file
Use the Append option (a):
MyTxtFile = 'C:/_gh/0/My beautiful file.txt'

with open(MyTxtFile, 'a') as file:
    file.write('Blablabla...\n')
    file.write(MyVariable)
Replace some lines in an existing file
Here we first Read (r option) and store the content of the file, and then re-create it, removing the lines we want to delete:
MyTxtFile = 'C:/_gh/0/My beautiful file.txt'

with open(MyTxtFile, 'r') as f:
    lines = f.readlines()

# startswith accepts a tuple of prefixes
with open(MyTxtFile, 'w') as f:
    for line in lines:
        if not line.strip('\n').startswith(('First line to delete...', 'Second line to delete...', 'Third line to delete...')):
            f.write(line)

with open(MyTxtFile, 'a') as file:
    file.write('New line 1')
    file.write('New line 2')
    file.write('New line 3')
Replace a simple string in a TXT file
MyTxtFile = 'C:/_gh/0/My beautiful file.txt'

with open(MyTxtFile, 'r') as file:
    filedata = file.read()

filedata = filedata.replace('String 1', 'String 2')

with open(MyTxtFile, 'w') as file:
    file.write(filedata)
Replace a string with a regular expression (regex) in a TXT file
import re

FichierTest = 'C:/_gh/0/FichierTest.txt'

# Find a number : r'[0-9]'
# Find a line break : r'[\n]'
# Find a double line break : r'[\r\n]{2,}'
with open(FichierTest, 'r+') as file:
    text = file.read()
    text = re.sub(r'[\r\n]{2,}', '\n\n', text)
    file.seek(0, 0)  # seek to beginning
    file.write(text)
    file.truncate()  # get rid of any trailing characters
Test if a comma is used as decimal separator in a TXT file
from termcolor import colored
import regex  # third-party module; the built-in re does not support \K

with open(MyTxtFile, 'r') as myfile:
    data = myfile.read()

pb_regex = regex.findall(r'(?:\t|^)\d+\K,(?=\d+(?:\t|$))', data)

if pb_regex:
    print(colored('\nWarning, a comma is used as decimal separator in a numeric value!\n'
                  'Find them in Notepad with:\n' + r'(?:\t|^)\d+\K,(?=\d+(?:\t|$))', 'yellow'))
Analyze data from a multi-valued field
Suppose you have a field containing values coming from a multi-valued drop-down list, and now you want to count how many times each value occurs.
If values were unique in each cell, a simple GROUP BY in SQL would do the job.
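In pandas, that simple case would be a one-liner (a sketch, assuming the field is named 'Your field'):

# If each cell held a single value, a plain group-by count would be enough
df_Count = df.groupby(['Your field']).size()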
But here we first have to split each cell to recover the unique values; then we will be able to count them.
Split the data
Start by splitting the data with str.split, specifying the separator and expanding, to create new columns storing each unique value.
import pandas as pd

df = pd.read_excel('C:/Users/Georges/Downloads/My Excel file.xlsx', sheet_name='Sheet1', engine='openpyxl')

serie = df['Your field'].str.split(', ', expand=True)
print(serie)
Great! Now we get virtual columns containing only one value each, with as many columns as necessary.
Melt the virtual data
Now let's melt our virtual columns into just one.
import pandas as pd

df = pd.read_excel('C:/Users/Georges/Downloads/My Excel file.xlsx', sheet_name='Sheet1', engine='openpyxl')

serieMelt = pd.melt(df['Your field'].str.split(', ', expand=True))['value']
print(serieMelt)
pd.melt labels the melted column value, and we can use that label.
Count unique values
Now just put our melted data in a new data-frame, before applying a sorted groupby with size() (a COUNT in SQL).
The full code will be:
import pandas as pd

df = pd.read_excel('C:/Users/Georges/Downloads/My Excel file.xlsx', sheet_name='Sheet1', engine='openpyxl')

temp_df = pd.DataFrame(pd.melt(df['Your field'].str.split(', ', expand=True))['value'])
df_Count = pd.DataFrame(temp_df.groupby(['value']).size(), columns=['Total']).reset_index()
print(df_Count)
We add a reset_index() to recover the column names properly, to use them later.
Percentage
OK, but suppose you also want the recurrence percentages of your unique values. The usual method would take the total from the melted data, giving strange results. Indeed, the melted data cumulates each value of the multi-valued field in new rows, so a lot of them!
To get real percentages you have to use shape[0] to count the real number of rows of your original data-frame, and use it in the percentage calculation.
We start by creating a new field in the results data-frame, then we calculate the percentages, rounding them.
df_Count['Percent'] = (df_Count['Total'] / df.shape[0]) * 100
df_Count['Percent'] = df_Count['Percent'].round(decimals=2)
Of course, you cannot sum the percentages: the total will exceed 100%, and that is normal.
Empty values management
Good. But what about our empty values? Just add dropna=False to see them.
import pandas as pd

df = pd.read_excel('C:/Users/Georges/Downloads/My Excel file.xlsx', sheet_name='Sheet1', engine='openpyxl')

temp_df = pd.DataFrame(pd.melt(df['Your field'].str.split(', ', expand=True))['value'])
df_Count = pd.DataFrame(temp_df.groupby(['value'], dropna=False).size(), columns=['Total']).reset_index()

df_Count['Percent'] = (df_Count['Total'] / df.shape[0]) * 100
df_Count['Percent'] = df_Count['Percent'].round(decimals=2)

print(df_Count)
Any problem?
Arrgg! The groupby counted empty values from the melted data, and melting added as many new columns as necessary (based on the record with the highest number of unique values), each of them containing empty values... It skews the count and percentage.
We have to cheat to fix that.
import pandas as pd

df = pd.read_excel('C:/Users/Georges/Downloads/My Excel file.xlsx', sheet_name='Sheet1', engine='openpyxl')

temp_df = pd.DataFrame(pd.melt(df['Your field'].str.split(', ', expand=True))['value'])
df_Count = pd.DataFrame(temp_df.groupby(['value'], dropna=False).size(), columns=['Total']).reset_index()

df_Count = df_Count.fillna('AZERTY')

df_Count['Percent'] = (df_Count['Total'] / df.shape[0]) * 100
df_Count['Percent'] = df_Count['Percent'].round(decimals=2)

valuesEmpty = df['Your field'].isna().sum()
valuesEmptyPercent = round((valuesEmpty / df.shape[0]) * 100, 2)

df_Count.loc[(df_Count['value'] == 'AZERTY')] = [['Unknown', valuesEmpty, valuesEmptyPercent]]

df_Count = df_Count.sort_values(['Total'], ascending=False)
print(df_Count)
The fillna('AZERTY') line adds a sentinel value in the results data-frame to target the wrong empty-values calculation.
The isna().sum() and round() lines count the empty values and their percentage aside, and store the results in dedicated variables.
Then the loc line replaces the empty-values calculation with our new value and variables, using loc to identify the line to fix.
Finally we can sort our data-frame with sort_values(['Total']).
Work with dates
Group by months
If you have a data-frame df with a field Created at containing dates like:
2021-02-05 10:34:21 UTC
You can work with it; for example, to group and sort by month, you can cheat by doing:
df_Temp = pd.DataFrame(df['Created at'].str[:7])
df_Created_count = pd.DataFrame(df_Temp.groupby(['Created at'], dropna=False).size(), columns=['Total']).sort_values(['Created at'], ascending=True).reset_index()
Indeed, with str[:7] we extract the years and months from the strings. Then a usual groupby.
OK, but it is more pandas-idiomatic to convert your field to a true date format, in a new virtual field. Then you can extract real months and store them in a proper data-frame.
df['Created'] = pd.to_datetime(df['Created at'])
df['Created'] = df['Created'].dt.to_period("M")
df_TempCreated = pd.DataFrame(df['Created'])
Then it is easy to group and count them.
df_Created_count = pd.DataFrame(df_TempCreated.groupby(['Created'], dropna=False).size(), columns=['Total']).sort_values(['Created'], ascending=True).reset_index()
As you now use a real date format, you can display the full month names with dt.strftime.
df_Created_count['Created'] = df_Created_count['Created'].dt.strftime('%B %Y')
Format date
If you get dates like 2022-10-24 and you want 24/10/2022:
df['My date'] = pd.to_datetime(df['My date'], format='%Y-%m-%d')
df['My date'] = df['My date'].dt.strftime("%d/%m/%Y")
Get a simple date (YYYYMMDD)
# SIMPLE DATE
df_DataUnion['date_simple'] = df_DataUnion['date_submit'].astype(str).str[:10]
df_DataUnion['date_simple'] = df_DataUnion['date_simple'].replace({'-': ''}, regex=True)
Extract the previous month
today = pd.Timestamp.today()
PreviousMonth = today - pd.DateOffset(months=1)
print(PreviousMonth.strftime("%Y-%m"))
Convert false date field from Excel
Excel can mix a true date format and strings in the same column, arrgh! Problems come later, when you use it in Pandas...
I use this tip to manage this kind of malicious field in pandas:
df.loc[~df['date'].str.contains(' ', na=False), 'date'] = df['date'].astype(str).str.split(expand=True)[0]
df.loc[~df['date'].str.contains(' ', na=False), 'date'] = \
    df['date'].astype(str).str.split('-', expand=True)[2] + '/' +\
    df['date'].astype(str).str.split('-', expand=True)[1] + '/' +\
    df['date'].astype(str).str.split('-', expand=True)[0]
Sometimes I put the code above in a try:... except:... block, and sometimes I convert the whole field to string first, with astype(str), as in the sketch below.
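A minimal sketch of that wrapping; the column name 'date' is carried over from the snippet above:

# Sketch: force strings first, then guard the conversion with try/except
df['date'] = df['date'].astype(str)
try:
    df.loc[~df['date'].str.contains(' ', na=False), 'date'] = \
        df['date'].str.split('-', expand=True)[2] + '/' + \
        df['date'].str.split('-', expand=True)[1] + '/' + \
        df['date'].str.split('-', expand=True)[0]
except Exception as e:
    print('Date fix skipped:', e)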
Manage numeric date from Excel
Sometimes dates from Excel come as integers, arrgghhh!
import xlrd

def read_date(date):
    return xlrd.xldate.xldate_as_datetime(date, 0)

df.loc[df['My date'].astype(str).str.isdigit(), 'My date'] = \
    pd.to_datetime(df['My date'].loc[df['My date'].astype(str).str.isdigit()].astype(int).apply(read_date), errors='coerce')
Check date format
Sometimes (especially when data comes from Excel...) it is good to check your dates after treatment.
Below we search dates not starting with DD/MM/YYYY:
df2 = df[~df['my date'].str.contains('^[0-9][0-9]/[0-9][0-9]/[0-9][0-9][0-9][0-9]', regex=True)]
print(tabulate(df2, headers='keys', tablefmt='psql', showindex=False))
Pie Chart
Simple pie
If you have a data-frame with fields Categories and Total, you can create a pie chart.
import matplotlib.pyplot as plt

chartLabel = df_Categories_count['Categories'].tolist()
chartValue = df_Categories_count['Total'].tolist()

colors = plt.rcParams['axes.prop_cycle'].by_key()['color']

fig1 = plt.figure()
plt.pie(chartValue, labels=chartLabel, colors=colors, autopct=None, shadow=False, startangle=90)
plt.axis('equal')
plt.title('Categories', pad=20, fontsize=15)

plt.show()
plt.clf()
Display percentages in the chart
Use autopct.
plt.pie(chartValue, labels=chartLabel, colors=colors, autopct='%1.1f%%', shadow=False, startangle=90)
Change the labels font size
plt.rcParams['font.size'] = 5
Hide a label in chart
Just replace the value with an empty string.
chartLabel[-1] = ''
Add a legend in your chart
plt.legend(chartLabel, loc='best', fontsize=8)
Explode
To explode your chart, pass values to explode; for example, if you have 4 values to plot:
plt.pie(myValue, labels=myLabels, colors=colors, explode=[0.05, 0.05, 0.05, 0.05], autopct=None, shadow=False, startangle=90)
If you want to explode all parts by default, just create a list first:
explodeValues = []
for i in myValue:
    explodeValues.append(0.05)

plt.pie(myValue, labels=myLabels, colors=colors, explode=explodeValues, autopct=None, shadow=False, startangle=90)
Add a legend with several fields
For example, you want to display labels and percentages in the legend. First calculate the percentages in another field of your data-frame, then:
import matplotlib.pyplot as plt

chartLabel = df_Categories_count['Categories'].tolist()
chartLegendLabel = df_Categories_count['Categories'].tolist()
chartValue = df_Categories_count['Total'].tolist()
chartLegendPercent = df_Categories_count['Percent'].tolist()

legendLabels = []
for i, j in zip(chartLegendLabel, map(str, chartLegendPercent)):
    legendLabels.append(i + ' (' + j + ' %)')

colors = plt.rcParams['axes.prop_cycle'].by_key()['color']

fig1 = plt.figure()
plt.pie(chartValue, labels=chartLabel, colors=colors, autopct=None, shadow=False, startangle=90)
plt.axis('equal')
plt.title('Categories', pad=20, fontsize=15)
plt.legend(legendLabels, loc='best', fontsize=8)

plt.show()
plt.clf()
Bar chart
Vertical bar chart
chartLabel = df_Created_count['Labels field'].tolist()
chartValue = df_Created_count['Values field'].tolist()

fig5 = plt.figure(figsize=(13,6))
bar_plot = plt.bar(chartLabel, chartValue)

# plt.ylabel('yyy')
# plt.xlabel('xxx')
plt.xticks(rotation=30, ha='right')

# HIDE BORDERS
plt.gca().spines['left'].set_color('none')
plt.gca().spines['right'].set_color('none')
plt.gca().spines['top'].set_color('none')

# HIDE TICKS
plt.tick_params(axis='y', labelsize=0, length=0)
plt.yticks([])

# ADD VALUE ON THE END OF HORIZONTAL BARS
# for index, value in enumerate(chartValue):
#     plt.text(value, index, str(value))

# ADD VALUE ON THE TOP OF VERTICAL BARS
def autolabel(rects):
    for idx, rect in enumerate(rects):
        height = rect.get_height()
        plt.text(rect.get_x() + rect.get_width()/2, height, chartValue[idx], ha='center', va='bottom', rotation=0)

autolabel(bar_plot)

plt.title('Registrations by month', pad=20, fontsize=15)

plt.show()
plt.clf()
Work with shape files
Here we will use numpy and basemap.
pip install numpy
pip install basemap
To install basemap on Windows, download a package here, according to your versions, and install it with wheel.
Thanks to ramiro.org: I got the code below on his website, I just adapted it.
Suppose you have an Excel file with a country field: you can count the countries and map them with a country shapefile. The country names in the Excel file and in the shapefile must be the same.
import matplotlib as mpl
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
from matplotlib.patches import Polygon
from matplotlib.collections import PatchCollection
from mpl_toolkits.basemap import Basemap
from PIL import Image, ImageOps

shp_simple_countries = r'C:/Users/Downloads/simple_countries/simple_countries'
inputExcelFile = r'C:/Users/Downloads/My file.xlsx'

df = pd.read_excel(inputExcelFile, sheet_name='Export', engine='openpyxl', usecols=['ID', 'My Country Field'])

# Count the countries
df_Country_count = pd.DataFrame(df.groupby(['My Country Field'], dropna=True).size(), columns=['Total']).sort_values(['Total'], ascending=False).reset_index()
df_Country_count = df_Country_count.fillna('Unknown')
df_Country_count['Percent'] = (df_Country_count['Total'] / df_Country_count['Total'].sum()) * 100
df_Country_count['Percent'] = df_Country_count['Percent'].round(decimals=2)

# Set countries as index
df_Country_count.set_index('My Country Field', inplace=True)

my_values = df_Country_count['Percent']

num_colors = 30
cm = plt.get_cmap('Blues')
scheme = [cm(i / num_colors) for i in range(num_colors)]
my_range = np.linspace(my_values.min(), my_values.max(), num_colors)

# -1 TO AVOID SEARCHES IN A PANDAS DATA-FRAME INCLUDING START AND STOP VALUE (I think ...)
df_Country_count['Percent'] = np.digitize(my_values, my_range) - 1

map1 = plt.figure(figsize=(14, 8))
ax = map1.add_subplot(111, frame_on=False)
map1.suptitle('Countries', fontsize=30, y=.95)

m = Basemap(lon_0=0, projection='robin')
m.drawmapboundary(color='w')
m.readshapefile(shp_simple_countries, 'units', color='#444444', linewidth=.2, default_encoding='iso-8859-15')

# Create the choropleth map
for info, shape in zip(m.units_info, m.units):
    shp_ctry = info['COUNTRY_HB']
    if shp_ctry not in df_Country_count.index:
        color = '#dddddd'
    else:
        color = scheme[df_Country_count.loc[shp_ctry]['Percent']]
    patches = [Polygon(np.array(shape), True)]
    pc = PatchCollection(patches)
    pc.set_facecolor(color)
    ax.add_collection(pc)

# Cover up Antarctica so the legend can be placed over it
ax.axhspan(0, 1000 * 1800, facecolor='w', edgecolor='w', zorder=2)

# Draw color legend
ax_legend = map1.add_axes([0.2, 0.14, 0.6, 0.03], zorder=3)
cmap = mpl.colors.ListedColormap(scheme)
cb = mpl.colorbar.ColorbarBase(ax_legend, cmap=cmap, ticks=my_range, boundaries=my_range, orientation='horizontal')
cb.ax.set_xticklabels([str(round(i, 1)) for i in my_range])
cb.ax.tick_params(labelsize=7)
cb.set_label('Percentage', rotation=0)
# cb.remove()

# Set the map footer
# description = 'Bla bla bla'
# plt.annotate(description, xy=(-.8, -3.2), size=14, xycoords='axes fraction')

map1.savefig('C:/Users/Downloads/mymap1.png', dpi=100, bbox_inches='tight')
plt.show()
plt.clf()

im = Image.open('C:/Users/Downloads/mymap1.png')
bordered = ImageOps.expand(im, border=1, fill=(0, 0, 0))
bordered.save('C:/Users/Downloads/mymap1.png')
Joins
SQL LEFT JOIN
df_Join = pd.merge(df1, df2, left_on='Email', right_on='email', how='left')[['Email', 'source']]
Another example:
df_Merged = df1.merge(df2, on='id', how='left')
Another example to find the non-matching records on 2 fields:
df_join = pd.merge(df1, df2, left_on=['id1', 'field1'], right_on=['id2', 'field2'], how='left')[['id1', 'field1', 'field2']]
df_NoMatch = df_join[df_join['field2'].isna()]
SQL INNER JOIN
df_Join = pd.merge(df1, df2, on='Email')
Or:
df_Merged = df1.merge(df2, on='id', how='inner')
Find difference
df_difference = pd.DataFrame(df_First[~df_First['Email'].isin(df_Second['email'])])
Group, count and sort
Count rows
number = df.shape[0]
print(number)
Count values (group by)
If you have a field named Country in a data-frame named df:
CountCountry = df.groupby(['Country'], dropna=False).size().sort_values(ascending=False)
print(CountCountry)
Count values (group by) and get it in a dataframe
CountCommune = pd.DataFrame(df_GR_Reglement.groupby(['id_commune'], dropna=False).size(), columns=['Total']).sort_values(['Total'], ascending=False).reset_index()
print(tabulate(CountCommune.head(10), headers='keys', tablefmt='psql', showindex=False))
Count values grouping by several fields to find duplicates
df_DuplicatePages = pd.DataFrame(df_GR_Reglement.groupby(['id_commune', 'PAGES'], dropna=False).size(), columns=['Total'])\
    .sort_values(['Total'], ascending=False).reset_index()
df_DuplicatePages = df_DuplicatePages[(df_DuplicatePages['Total'] > 1)]
print(tabulate(df_DuplicatePages.head(10), headers='keys', tablefmt='psql', showindex=False))
Group concat
df = df.groupby(['country'])['city'].apply(lambda x: ', '.join(x)).reset_index()
Count a specific value
print(len(df[df['Your field'] == 'United States']))
Count with condition
myCount = df[df['Field where search the condition'] == 'A value'].count()['Field to count']
Count empty values
number = df['Your field'].isna().sum()
Keep empty values in a group by and fill them
newDf = pd.DataFrame(df.groupby(['Your field'], dropna=False).size(), columns=['Total']).sort_values(['Total'], ascending=False).reset_index()
newDf = newDf.fillna('Unknown')
Count email domains (group by)
You have an Email field with clean emails. Thanks to Pandas it is easy to group and count their domains.
First create a virtual field by splitting the emails and keeping the 2nd part of the split. Then just group on your new field.
df['Domain'] = df['Email'].str.split('@').str[1]
df_Email_DNS_count = pd.DataFrame(df.groupby(['Domain']).size(), columns=['Total'])\
    .sort_values(['Total'], ascending=False).reset_index()
Calculate percentages
According to your needs, you can use the sum of a Total field.
df['Percent'] = (df['Total'] / df['Total'].sum()) * 100
df['Percent'] = df['Percent'].round(decimals=2)
Or the number of rows of your data-frame (useful if you work with a multi-valued field; see the Analyze data from a multi-valued field chapter above).
df['Percent'] = (df['Total'] / df.shape[0]) * 100
df['Percent'] = df['Percent'].round(decimals=2)
Sort on a field
df.sort_values(by=['Field'], ascending=False, inplace=True)

# Recreate the index if necessary (if you will use the index later)
df = df.reset_index(drop=True)
Sort on a field containing both text and numerical values
df = df.loc[pd.to_numeric(df['My field'], errors='coerce').sort_values().index]
Replace, remove, edit with conditions
Replace empty cells
df['speciality'] = df['speciality'].fillna('Other')
Replace a value (the full value)
df['Your field'] = df['Your field'].replace(['Old value'],'New value')
Replace a part of string
df['Your field'] = df['Your field'].replace({'Old value': 'New value'}, regex=True)
Replace a string if it contains
df.loc[df['speciality'].str.contains('Researcher'), 'speciality'] = 'Research Scientist'
If it does not contain
Add ~:
df.loc[~df['speciality'].str.contains('Researcher'), 'speciality'] = 'Research Scientist'
Localize and remove a row
ind_drop = df[df['Your field'].apply(lambda x: x == 'A value')].index
df = df.drop(ind_drop)
Localize and remove a row starting with ...
ind_drop = df[df['Your field'].apply(lambda x: x.startswith('A value'))].index
df = df.drop(ind_drop)
Localize and remove a row ending with ...
ind_drop = df[df['Your field'].apply(lambda x: x.endswith('A value'))].index
df = df.drop(ind_drop)
Localize and replace a full row
df.loc[(df['A field'] == 'TARGET')] = [[NewValue1, NewValue2, NewValue3]]
Localize a row according to a regex and another field
df.loc[df['Field1'].str.contains(pat='^place ', regex=True), 'Field2'] = 'Yes'
Remove the first characters
Here we delete the first 2 characters if the cell starts with a comma followed by a space.
df['Field'] = df['Field'].apply(lambda x: x[2:] if x.startswith(', ') else x)
Remove the content from a field in another field
df['NewField'] = df.apply(lambda x : x['FieldToWork'].replace(str(x['FieldWithStringToRemove']), ''), axis=1)
Or with a regex, for example to remove the content only if it is at the beginning of the field:
import re

df['NewField'] = df.apply(lambda x: re.sub('^'+str(x['StringToRemove']), '', str(x['FieldToWork'])) if str(x['FieldToWork']).startswith(str(x['StringToRemove'])) else str(x['FieldToWork']), axis=1)
Test if a field contains another field
df['A contains B'] = [x[0] in x[1] for x in zip(df['FieldB'], df['FieldA'])]
Edit with a condition
Increment a field if another field is not empty.
df.loc[df['My field maybe empty'].notna(), 'Field to increment'] += 1
Fill a field if a field is greater than or equal to another field.
df.loc[df['Field A'] >= df['Field B'], 'Field to fill'] = 'Yes'
Edit several fields at the same time.
df.loc[df['Field A'] >= df['Field B'], ['Field A to fill', 'Field B to fill']] = ['Yes', 'No']
Edit with several conditions
Condition "AND" (&
)
df.loc[(df['My field maybe empty'].notna()) & (df['An integer field'] == 1) & (df['An string field'] != 'OK'), 'Field to increment'] += 1
Please replace "&" with a simple &
.
Condition "OR" (|
)
df.loc[(df['My field maybe empty'].notna()) | (df['An integer field'] == 1) | (df['An string field'] != 'OK'), 'Field to fill'] = 'Yes'
Edit with an IN or NOT IN condition (as in SQL)
Just use isin:
df.loc[df['Id field'].isin([531733,569732,652626]), 'Field to edit'] = 'Yes'
And for NOT IN, negate with ~:
df.loc[~df['Id field'].isin([531733,569732,652626]), 'Field to edit'] = 'No'
Replace string beginning with
df['id_commune'] = df['id_commune'].str.replace(r'(^75.*$)', '75056', regex=True)
Remove letters
df['mobile'] = df['mobile'].str.extract(r'(\d+)', expand=False).fillna('')
Extract before or after a string
For example, if Job='IT: DBA':
df['type'] = df['Job'].str.split(': ').str[0]
df['speciality'] = df['Job'].str.split(': ').str[1]
Remove all after a string
df_Files['new field'] = df_Files['old field'].str.replace("(StringToRemoveWithAfterToo).*","", regex=True)
Remove all before a string
df_Files['file'] = df_Files['file'].str.replace("^.*?_","_", regex=True)
Get in title case
df['firstname'] = df['firstname'].str.title()
Remove if it contains fewer than n characters (length)
df.loc[df['mobile'].str.len() < 6, 'mobile'] = ''
Filters
Get lines with a field containing a string
df2 = df1[df1['My field'].str.contains('AAA').fillna(False)]
print(tabulate(df2.head(5), headers='keys', tablefmt='psql'))
Get lines with a field not containing a string
Just use ~:
df2 = df1[~df1['My field'].str.contains('AAA').fillna(False)]
print(tabulate(df2.head(5), headers='keys', tablefmt='psql'))
Get lines with a field matching a regex
Here we search where the field is a number with a comma as decimal separator.
dfTest = df_ImportTxt[df_ImportTxt['My field'].str.match(pat=r'^\d+,\d+$')]
Get lines searching several strings in a field
For example, searching for a space, a star and parentheses:
MySearch = [r'\*', ' ', r'\(', r'\)']

df2 = df1[df1['zone'].str.contains('|'.join(MySearch)).fillna(False)]
print(tabulate(df2.head(40), headers='keys', tablefmt='psql'))
Get lines where a field contains only digits
df2 = df1.loc[df1['zone'].astype(str).str.isdigit()]
print(tabulate(df2.head(5), headers='keys', tablefmt='psql'))
Get lines according to several conditions
Below we get lines where a field has null values or another field has empty values:
df2 = df1[(df1['My field'].isna()) | (df1['My other field'] == '')]
print(tabulate(df2.head(5), headers='keys', tablefmt='psql', showindex=False))
Get lines where values are IN a list
searchIn = ['AAA', 'BBB', 'CCC']

df2 = df1[df1['My field'].isin(searchIn)]
print(tabulate(df2.head(5), headers='keys', tablefmt='psql', showindex=False))
Get lines with a field not ending with a concatenation of other fields
Suppose you have this dataframe df:
Language | PDF |
Python | docA-Python.pdf |
SQL | docF-SQ.pdf |
PHP | docS-PHP.pdf |
CSS | docG-CSS.pdf |
Javascript | docR-Javascript.pdf |
HTML | docW-HTML.pdf |
Now you would like to extract the lines where the PDF field does not end with [Language field] + '.pdf':
Compare = df[['Language', 'PDF']].apply(lambda x: x['PDF'].endswith(x['Language'] + '.pdf'), axis=1)

df_temp = pd.DataFrame(Compare)
df_temp.columns = ['True or false']

df_concat = pd.concat([df, df_temp], axis=1)
df_filtered = df_concat[['Language', 'PDF']][df_concat['True or false'] == False]

print(tabulate(df_filtered.head(5), headers='keys', tablefmt='psql', showindex=False))
Language | PDF |
SQL | docF-SQ.pdf |
Merge/concatenate
Merge 2 lists in 2 columns and properly display them in the terminal
import pandas as pd
from tabulate import tabulate

LanguageList = ['Python', 'SQL', 'PHP', 'CSS', 'Javascript', 'HTML']
LevelList = ['Fun', 'OK', 'Arghh', 'OK', 'Arghh', 'Easy']

LanguageLevel = list(zip(LanguageList, LevelList))

df = pd.DataFrame(LanguageLevel, columns=['Language', 'Level'])

print(tabulate(df, headers='keys', tablefmt='psql', showindex=False))
zip links the values from our lists into tuples, and list encloses the tuples in a list.
pd.DataFrame then creates a data-frame from our new list (now containing tuples), specifying the column labels.
The print uses tabulate to display our fields with a beautiful Postgres style (psql).
Split a dataframe with conditions, then merge the splits according to other conditions
firstSelection = df.loc[df['Field'] == 'Yes'].head(100)
secondSelection = df.loc[df['Field'] == 'No'].head(200)

# Merge the 2 selections.
df_Selections = pd.concat([firstSelection, secondSelection], axis=0)
Warnings
Avoid indexation problems after dataframes concat
Sometimes after a dataframe concatenation, the indexes are melted together and can cause problems. Just ignore the index in your new dataframe:
df_Selections = pd.concat([firstSelection, secondSelection], axis=0, ignore_index=True)
Avoid copy warning
import warnings
from pandas.core.common import SettingWithCopyWarning  # pd.errors.SettingWithCopyWarning in recent pandas

warnings.simplefilter(action='ignore', category=SettingWithCopyWarning)
Avoid user warning
import warnings

warnings.simplefilter(action='ignore', category=UserWarning)
Miscellaneous
Create a dataframe from a previous dataframe (copy)
df2 = df1.copy()
Copy data from a dataframe in the clipboard
Very convenient to quickly copy-paste into a text file, CSV, Excel...
df.to_clipboard(sep=',', index=False, header=None)
Create a dataframe from a list
import pandas as pd
from tabulate import tabulate

MyList = ['7878', 'Bbbb', 'azerfg', '545', 'XXX']

df = pd.DataFrame(MyList)
df.columns = ['My field']

number = df.shape[0]

print(tabulate(df.head(10), headers='keys', tablefmt='psql', showindex=False))
print(number)
Dedupe
To dedupe a dataframe on an Email field, keeping only the first duplicate record, use drop_duplicates.
df['Email'] = df['Email'].str.lower()
dfDeduped = df.drop_duplicates(subset=['Email'], keep='first')
Find duplicates
Use duplicated:
df2 = df1[df1.duplicated(['My field'], keep=False)]
print(tabulate(df2.head(10), headers='keys', tablefmt='psql', showindex=False))
Export a picture from a chart
And add a border.
from PIL import Image, ImageOps

fig1.savefig('C:/Users/myplot.png', dpi=100)

im = Image.open('C:/Users/myplot.png')
bordered = ImageOps.expand(im, border=1, fill=(0, 0, 0))
bordered.save('C:/Users/myplot.png')
Re-organize columns
df = df[['Field 1', 'Adresse mail', 'Note algorithmique', 'Substitutions']]
Re-name columns
From their original name:
df.rename(columns={'Old name 1': 'New name 1', 'Old name 2': 'New name 2'}, inplace=True)
From their position:
df.columns.values[12] = 'New name'
Delete column
df.drop('My field', axis=1, inplace=True)
Format number with space thousand separator
number = df.shape[0]
print(format(number, ',').replace(',', ' '), 'customers')
Or:
f"{number_of_rows:,}".replace(',', ' ')
Ranking
To rank according to a field, from highest to lowest:
df['My rank'] = df['My field to rank'].rank(ascending=False)
To avoid duplicate ranks:
df['My rank'] = df['My field to rank'].rank(ascending=False, method='first')
Generate a unique random integer
If you want unique random integers from 1 to the number of records (both included):
import numpy as np

howManyRecord = df.shape[0]

df['Random number'] = np.random.choice(range(howManyRecord), howManyRecord, replace=False)
# np.random.choice(range(n), ...) draws from 0 to n-1, so swap the 0 for n
df.loc[df['Random number'] == 0, 'Random number'] = howManyRecord
Select with condition
df = df.loc[df['Field to filter'] == 'Yes']
Prevent nan
Use keep_default_na=False:
Data = pd.read_excel(inputExcelFile, sheet_name='Feuil1', engine='openpyxl', keep_default_na=False)
Get tuples from a file (in a SQL UPDATE shape)
...df.itertuples()...
Very useful to build SQL UPDATE queries from a file with the primary key in the first column:
file1 = open(MyPath+'FixQueries.sql', 'w', encoding='cp1252')

# WRITE THE FILE
for i in df.itertuples(index=False):
    file1.write('UPDATE MyTable SET ' +
                str(i)[7:-1].split(', ', 1)[-1] +
                " WHERE id = '" + i[0] + "' ;\n")

file1.close()
Start the index at a special value
Very useful to then use it in Excel formulas (starting at 2...).
df.index = df.index + 2
Use index in concatenation
# Cast the index to string so it can be concatenated with other strings
df.index.astype(str)
Reset index
df.reset_index(drop=True, inplace=True)
Open a box to enter a string and use it as a variable
import tkinter as tk
from tkinter import simpledialog

window = tk.Tk()
window.eval('tk::PlaceWindow . center')
window.withdraw()

repertoireCarto = simpledialog.askstring("Input", "Directory?\t\t\t\t", parent=window)
print('The directory is', repertoireCarto)
Get some line indexes according condition
list_duplicate = df.index[df['My field'] == 'Text'].tolist()
Import external scripts
Your working script needs to know where to find the external scripts:
import sys

sys.path.append('C:\\Users\\Georges\\PycharmProjects\\Your_Directory')
from YourScriptWithoutExtension import SomeVariables, ...
Subprocess
Not related to Pandas, but very useful to run shell commands, bash, console scripts...
import subprocess

subprocess.run('psql -h localhost -p 5432 -d work -U postgres -c "SELECT NOW() ;"', shell=True)
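If you need the command's output in Python instead of letting it print, a hedged variant (capture_output requires Python 3.7+):

import subprocess

# Sketch: capture stdout as text instead of letting it go to the console
result = subprocess.run('psql -h localhost -p 5432 -d work -U postgres -c "SELECT NOW() ;"',
                        shell=True, capture_output=True, text=True)
print(result.stdout)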
Strings comparison
Compare the similarity of 2 strings, to get an indicator in a new field:
from difflib import SequenceMatcher

df['indicator'] = df[['Field1', 'Field2']].apply(lambda x: SequenceMatcher(lambda y: y == " ", x[0], x[1]).ratio(), axis=1)
Lists
What follows is not strictly about Pandas, but useful in some Pandas contexts.
Search if elements from a list are in another list
Check if the elements of List2 are in List1:
List1 = ['Aaa', 'Bbbb', 'Cccc', 'Dddd', 'Eeee', 'Ffff', 'Gggg', 'Hhhh']
List2 = ['Eeee', 'Cccc']

print(List1)
print(List2)

check = all(item in List1 for item in List2)
print(check)
Or:
import collections

List1 = ['Aaa', 'Bbbb', 'Cccc', 'Dddd', 'Eeee', 'Ffff', 'Gggg', 'Hhhh']
List2 = ['Bbbb', 'Aaa', 'Cccc', 'Dddd', 'Eeee', 'Ffff', 'Hhhh', 'Gggg']

print(List1)
print(List2)

# This variant checks that both lists contain exactly the same elements, in any order
def iCompare(c1, c2):
    if collections.Counter(c1) == collections.Counter(c2):
        return 'Yes'
    else:
        return 'No'

info = iCompare(List1, List2)
print(info)
Extract common elements from several lists
List1 = ['Aaa', 'Aaa', 'Bbbb', 'Cccc', 'Dddd', 'Eeee', 'Ffff', 'Gggg', 'Hhhh']
List2 = ['Eeee', 'Cccc', 'Cccc']
List3 = ['Cccc', 'Bbbb', 'Eeee']

print('List1:', List1)
print('List2:', List2)
print('List3:', List3)

Compare = sorted(list(set(List1) & set(List2) & set(List3)))
print('\nCompare:', Compare)
Extract non-common elements from several lists
List1 = ['Aaa', 'Aaa', 'Bbbb', 'Cccc', 'Dddd', 'Eeee', 'Ffff', 'Gggg', 'Hhhh']
List2 = ['Eeee', 'Cccc', 'Cccc']

print('List1:', List1)
print('List2:', List2)

Compare = sorted(list(set(List1) - set(List2)))
print('\nCompare:', Compare)
Concatenate 2 fields into a list/sentence
Assemblage = list(zip(df['Field 1'].tolist(), df['Field 2'].tolist()))

b = []
for a in Assemblage:
    b.append(str(list(a)[0]) + ' (' + list(a)[1] + ')')

print(str(b).replace('[', '').replace(']', '').replace("'", ''), '\n')
Another example:
Assemblage = list(zip(df['field1'].tolist(), df['field2'].tolist()))

b = []
for a in Assemblage:
    b.append('Blablabla... ' + str(list(a)[0]) + ' ...Blablabla... ' + list(a)[1] + ' ...Blablabla.')

for i in b:
    print(i)