Pandas is an excellent Python library for managing data. Matplotlib lets you create advanced charts, and Openpyxl is very useful for reading and writing Excel files. These three tools, combined with other classic Python features, cover most data analysis and engineering tasks.
First install or check Python and pip, then install the three libraries:
pip install pandas
pip install matplotlib
pip install openpyxl
Read data files (CSV, Excel, MySQL, Postgres, JSON, shapefile...)
Read and display an Excel file
import pandas as pd from tabulate import tabulate df = pd.read_excel('C:/Users/Your file.xlsx', sheet_name='Export', engine='openpyxl', usecols=['My first column', 'My second column']) print(tabulate(df.head(35), headers='keys', tablefmt='psql', showindex=True))
I set engine='openpyxl' because it is an XLSX file. sheet_name specifies the sheet to import, and usecols selects the columns. Then I use tabulate to display my data nicely in a Postgres style (tablefmt='psql'). With head(35) I only display the first 35 lines, but the dataframe (here named df) contains all the data.
If you want to re-organize the columns, see the Miscellaneous chapter, or the quick sketch below.
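A minimal sketch (the column names are the placeholders from the snippet above): re-select the columns in the order you want; any column you leave out is dropped.

df = df[['My second column', 'My first column']]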
Read a CSV file
import pandas as pd
from tabulate import tabulate

df_subscriber = pd.read_csv(workDirectory+'joo_acymailing_subscriber.csv', sep=';', usecols=['subid', 'source', 'email'])
print(tabulate(df_subscriber.head(10), headers='keys', tablefmt='psql', showindex=False))
Add textual column names from a CSV without column names
Use names=[]:
import pandas as pd
from tabulate import tabulate

df = pd.read_csv(inputMyFilePath, names=['My Field 1', 'My Field 2', 'My Field 3'], sep=',', keep_default_na=False).astype(str)
print(tabulate(df.head(10), headers='keys', tablefmt='psql', showindex=False))
Read a non-standard Excel file
Sometimes a file has junk in its top lines, or you cannot trust the column names it would produce.
For example, to read a file starting from a specific row and skipping the last row:
import pandas as pd from tabulate import tabulate df = pd.read_excel(r'C:/Users/Georges/Downloads/test excel.xlsx', sheet_name='Sheet1', engine='openpyxl', usecols=[0,1], header=5-1, skipfooter=1, names=['id', 'value']) print(tabulate(df.head(35), headers='keys', tablefmt='psql', showindex=False))
Or skip a specific line (with skiprows=[...]):
... df = pd.read_excel(my_file, sheet_name='Feuil1', engine='openpyxl', skiprows=[2]) ...
Use the first row as column labels
my_labels = df.iloc[0] df = df[1:] df.columns = my_labels
Read from MySQL
Install sqlalchemy (the best way with pandas):
from tabulate import tabulate as tab import pandas as pd from sqlalchemy import create_engine username = 'root' password = '' port = 3308 database = 'my_beautiful_db' host ='localhost' engine = create_engine('mysql+mysqldb://%s:%s@%s:%i/%s' % (username, password, host, port, database)) sql = 'SELECT COUNT(*) AS total FROM communes ;' df = pd.read_sql_query(sql, engine).set_index('total') print(tab(df.head(5), headers='keys', tablefmt='psql'))
Specify database encoding
engine = create_engine('mysql+mysqldb://%s:%s@%s:%i/%s?charset=utf8mb4' % (username, password, host, port, database))
Or install mysqlclient:
import MySQLdb from tabulate import tabulate import pandas as pd mydb = MySQLdb.connect(user='root', password='', database='MyDb', host='localhost', port=3306) query = 'SELECT name FROM users ;' df = pd.read_sql(query, mydb) mydb.close() print(tabulate(df.head(10), headers='keys', tablefmt='psql')) number = df.shape[0] print(number)
Or, as below, install mysql.connector:
import mysql.connector as connection import pandas as pd from tabulate import tabulate as tab mydb = connection.connect(host='XXX.XX.XXX.XX', port=3308, database='MyDb',user='Me', passwd='pwdXXX!456', use_pure=True) try: query = "SELECT firstname, lastname FROM customer ;" df = pd.read_sql(query, mydb) except Exception as e: mydb.close() print(str(e)) print(tab(df.head(5), headers='keys', tablefmt='psql'))
Read from Postgres
import pandas as pd
from tabulate import tabulate
import psycopg2

con = psycopg2.connect(database='mytransformations', user='postgres', password='', host='localhost', port='5432')
print("OK, we are connected to mytransformations")

# DISPLAY WITH PANDAS
df = pd.read_sql("SELECT * FROM myparcels ;", con)
df.drop('geom', axis=1, inplace=True)
print('\nResults:')
print(tabulate(df.head(10), headers='keys', tablefmt='psql', showindex=False))
print('Number of records:', df.shape[0])
Read from a JSON
import pandas as pd from tabulate import tabulate import json import requests response = requests.get(MON_URL) my_file = json.loads(response.text) my_json = my_file['data'] mon_df = pd.DataFrame.from_dict(my_json)
Read from a shape
Use geopandas:
import geopandas as gpd from tabulate import tabulate myShape = 'C:\\Path\\Of\\My\\Shape.shp' print('\n' + myShape) df = gpd.read_file(myShape) df['type'] = df['geometry'].astype(str).str.replace(r' .*|\(.*', '', regex=True) df = df[['id', 'type', 'geometry']] print(tabulate(df.head(10), headers='keys', tablefmt='psql', showindex=True)) print(df.shape[0]) MyFieldList = df['id'].drop_duplicates().dropna().sort_values(ascending=False).tolist() print('\n' + str(MyFieldList)) MyGeomTypeList = df['type'].drop_duplicates().dropna().sort_values(ascending=False).tolist() print('\n' + str(MyGeomTypeList))
Export a data-frame in a new Excel file
df.to_excel("C:/Users/Your new file.xlsx", index=False, sheet_name='Tab 1')
Use index=False to avoid writing the index column.
Export a data-frame in an existing Excel sheet
Use mode='a' and if_sheet_exists:
import pandas as pd
import openpyxl

My_File_Path = 'C:/My/File.xlsx'

with pd.ExcelWriter(My_File_Path, engine='openpyxl', mode='a', if_sheet_exists='replace') as writing:
    df.to_excel(writing, 'My sheet', index=False)
Export several dataframes in Excel sheets
If you have several data-frames to export into one Excel file, each in its own sheet, you should open an ExcelWriter and close it at the end.
writer = pd.ExcelWriter('C:/Users/Your new file.xlsx', engine='xlsxwriter') df_Country.to_excel(writer, index=False, sheet_name='Country', header=['Country', 'Total']) df_Emails.to_excel(writer, index=False, sheet_name='Emails', header=['Emails', 'Total']) writer.save()
Export several dataframes in the same sheet
df.to_excel(writer, index=False, sheet_name='Registrations', startcol=3, startrow=34, header=['Filed', 'Total', '%'])
Export a dataframe as a txt file, with separator and ANSI encoding
df.to_csv(my_file_import, header=None, index=None, sep='\t', encoding='mbcs')
Replace a string with a dataframe in an existing txt file
Do a replace with df.to_string(), maybe with a little regex to remove unwanted spaces.
import re

with open(MtTxtFile, 'r') as file:
    MyTxtData = file.read()

MyTxtData = MyTxtData.replace('STRING TO REPLACE', df.to_string(header=False, index=False, justify='left'))
MyTxtData = re.sub(r'^ *', '', MyTxtData, flags=re.MULTILINE)

with open(MtTxtFile, 'w') as file:
    file.write(MyTxtData)
Add filters in all columns in one sheet in an Excel file
import openpyxl from openpyxl.utils import get_column_letter outputExcelFile = r'C:/User/Your file.xlsx' workbook = openpyxl.load_workbook(outputExcelFile) worksheet = workbook['Sheet name'] FullRange = 'A1:' + get_column_letter(worksheet.max_column) + str(worksheet.max_row) worksheet.auto_filter.ref = FullRange workbook.save(outputExcelFile)
Add filters in all columns in all sheets in an Excel file
import openpyxl from openpyxl.utils import get_column_letter outputExcelFile = r'C:/User/Your file.xlsx' workbook = openpyxl.load_workbook(outputExcelFile) sheetsLits = workbook.sheetnames for sheet in sheetsLits: worksheet = workbook[sheet] FullRange = 'A1:' + get_column_letter(worksheet.max_column) + str(worksheet.max_row) worksheet.auto_filter.ref = FullRange workbook.save(outputExcelFile)
Add colors in all column names in a sheet in an Excel file
import openpyxl from openpyxl.styles import PatternFill workbook = openpyxl.load_workbook(My_File_Path) worksheet = workbook['My sheet'] for cell in worksheet[1]: worksheet[cell.coordinate].fill = PatternFill(fgColor='FFC6C1C1', fill_type='solid') workbook.save(My_File_Path)
Add colors in all column names in all sheets in an Excel file
import openpyxl from openpyxl.styles import PatternFill outputExcelFile = r'C:/Users/Your File.xlsx' workbook = openpyxl.load_workbook(outputExcelFile) sheetsLits = workbook.sheetnames for sheet in sheetsLits: worksheet = workbook[sheet] for cell in workbook[sheet][1]: worksheet[cell.coordinate].fill = PatternFill(fgColor = 'FFC6C1C1', fill_type = 'solid') workbook.save(outputExcelFile)
Modify a column width in one sheet in an Excel file
import openpyxl outputExcelFile = r'C:/Users/Your File.xlsx' workbook = openpyxl.load_workbook(outputExcelFile) sheetsLits = workbook.sheetnames workbook['Your sheet'].column_dimensions['A'].width = 75 workbook.save(outputExcelFile)
Modify all columns width in all sheets in an Excel file
The code below sets column A's width to 30, and the others to 10.
import openpyxl from openpyxl.utils import get_column_letter outputExcelFile = r'C:/Users/Your File.xlsx' workbook = openpyxl.load_workbook(outputExcelFile) sheetsLits = workbook.sheetnames for sheet in sheetsLits: for cell in workbook[sheet][1]: if get_column_letter(cell.column) == 'A': workbook[sheet].column_dimensions[get_column_letter(cell.column)].width = 30 else: workbook[sheet].column_dimensions[get_column_letter(cell.column)].width = 10 workbook.save(outputExcelFile)
Add color to some lines, recovering their indexes from a condition
list_duplicate = df.index[df['Ligne en doublon'] == 'DOUBLON !'].tolist() for index in list_duplicate: for cell in worksheet[index+2]: worksheet[cell.coordinate].fill = PatternFill(fgColor='FF6666', fill_type='solid') workbook.save(My_File_Path)
Freeze the top row of an Excel file
worksheet.freeze_panes = 'A2'
Freeze the top row and the first column of an Excel file
worksheet.freeze_panes = 'B2'
Freeze the top row and the 2 first columns of an Excel file
worksheet.freeze_panes = 'C2'
Change font color and style in an Excel file
from openpyxl.styles import Font worksheet['A1'].font = Font(color='FFFFFF', italic=False, bold=True)
Work on several Excel files
Here we freeze the top row, add filters, color the header row and change the width of all columns in 3 Excel files.
import openpyxl from openpyxl.utils import get_column_letter from openpyxl.styles import PatternFill File1 = r'D:/Work/My file 1.xlsx' File2 = r'D:/Work/My file 2.xlsx' File3 = r'D:/Work/My file 3.xlsx' ExcelList = [File1, File2, File3] for ExcelFile in ExcelList: workbook = openpyxl.load_workbook(ExcelFile) sheetsLits = workbook.sheetnames for sheet in sheetsLits: worksheet = workbook[sheet] FullRange = 'A1:' + get_column_letter(worksheet.max_column) + str(worksheet.max_row) worksheet.auto_filter.ref = FullRange worksheet.freeze_panes = 'A2' for cell in workbook[sheet][1]: worksheet[cell.coordinate].fill = PatternFill(fgColor='FFC6C1C1', fill_type='solid') if get_column_letter(cell.column) == 'A': workbook[sheet].column_dimensions[get_column_letter(cell.column)].width = 12 else: workbook[sheet].column_dimensions[get_column_letter(cell.column)].width = 20 workbook.save(ExcelFile)
Insert a picture in an Excel file
And delete it afterwards.
img = openpyxl.drawing.image.Image('C:/Users/myplot.png') img.anchor = 'E6' workbook['Your sheet'].add_image(img) workbook.save(outputExcelFile) os.remove('C:/Users/myplot.png')
Open a dialog box to browse for a file and get its path
Use Tkinter.
from tkinter import Tk from tkinter.filedialog import askopenfilename Tk().withdraw() filename = askopenfilename() print(filename)
Open a dialog box to save an Excel file
from tkinter import Tk
from tkinter.filedialog import asksaveasfilename

fileName = asksaveasfilename(filetypes=(("Excel files", "*.xlsx"), ("All files", "*.*")), defaultextension='.xlsx')
df.to_excel(fileName, index=False, sheet_name='Tab 1', header=None)
Check if a file exists
And use it only if it exists:
import os my_file = 'C:/_gh/0/export.xlsx' if os.path.isfile(my_file) == True: df_ExportFeuil1 = pd.read_excel(my_file, sheet_name='Feuil1', engine='openpyxl')
Rename a file
import os os.rename('My file.xlsx', 'My file renamed.xlsx')
Get dataframe from lists
One list, one field
import pandas as pd from tabulate import tabulate print('\nONE LIST, ONE FIELD') MyList = ['7878', 'Bbbb', 'azerfg', '545', 'XXX'] df = pd.DataFrame(MyList) df.columns = ['My field'] number = df.shape[0] print(tabulate(df.head(10), headers='keys', tablefmt='psql', showindex=False)) print(number)
Several lists, each in its own field, all lists with the same length
import pandas as pd from tabulate import tabulate print('\nSEVERAL LISTS IN THEIR OWN FIELD') print('EACH LIST WITH SAME LENGHT') MyList1 = ['7878', 'Bbbb', 'azerfg', '545', 'XXX'] MyList2 = ['Bruno', 'Constance', 'Mathieu', 'Abbes', 'Georges'] df = pd.DataFrame( {'My field 1': MyList1, 'My field 2': MyList2 }) number = df.shape[0] print(tabulate(df.head(10), headers='keys', tablefmt='psql', showindex=False)) print(number)
Several lists in the same field, lists of any length
import pandas as pd from tabulate import tabulate print('\nSEVERAL LISTS IN THE SAME FIELD') print('LISTS WITH ANY LENGHT') MyList1 = ['7878', 'Bbbb', 'azerfg', '545', 'XXX'] MyList2 = ['Bruno', 'Constance', 'Mathieu', 'Abbes'] df = pd.DataFrame(MyList1 + MyList2) df.columns = ['My field'] number = df.shape[0] print(tabulate(df.head(10), headers='keys', tablefmt='psql', showindex=False)) print(number)
Several lists from dictionaries, each list element as a field, lists of any length
import pandas as pd from tabulate import tabulate print('\nSEVERAL LISTS FROM DICTIONARIES') print('EACH LIST ELEMENT AS A FIELD') print('LISTS WITH ANY LENGHT') MyList1 = ['Bruno', '11111', 'Rouge'] MyList2 = ['Constance', '22222', 'Jaune'] MyList3 = ['Mathieu', '33333', 'Bleu'] MyList4 = ['Abbes', '44444'] df = pd.DataFrame([MyList1] + [MyList2] + [MyList3] + [MyList4]) df.columns = ['My field 1', 'My field 2', 'My field 3'] number = df.shape[0] print(tabulate(df.head(10), headers='keys', tablefmt='psql', showindex=False)) print(number)
A list from several lists from dictionaries, each sub-list as a row, each element of a sub-list as a field, lists of any length
import pandas as pd from tabulate import tabulate print('\n LIST FROM SEVERAL LISTS FROM DICTIONARIES') print('EACH SUB-LIST AS A ROW') print('EACH ELEMENT FROM SUB-LIST AS A FIELD') print('LISTS WITH ANY LENGHT') MyList = [ ['Bruno', '11111', 'Rouge'], ['Constance', '22222', 'Jaune'], ['Mathieu', '33333', 'Bleu'], ['Abbes', '44444'] ] df = pd.DataFrame(columns=['My field 1', 'My field 2', 'My field 3'], data=MyList) number = df.shape[0] print(tabulate(df.head(10), headers='keys', tablefmt='psql', showindex=False)) print(number)
Work with lists
The tips below are not strictly about Pandas, but they are useful in some Pandas contexts.
Search if elements from a list are in another list
Check whether the elements of List2 are all in List1:
import collections List1 = ['Aaa', 'Bbbb', 'Cccc', 'Dddd', 'Eeee', 'Ffff', 'Gggg', 'Hhhh'] List2 = ['Eeee', 'Cccc'] print(List1) print(List2) check = all(item in List1 for item in List2) print(check)
Or:
import collections List1 = ['Aaa', 'Bbbb', 'Cccc', 'Dddd', 'Eeee', 'Ffff', 'Gggg', 'Hhhh'] List2 = ['Bbbb', 'Aaa', 'Cccc', 'Dddd', 'Eeee', 'Ffff', 'Hhhh', 'Gggg'] print(List1) print(List2) def iCompare(c1, c2): if(collections.Counter(c1)==collections.Counter(c2)): return 'Yes' else: return 'No' info = iCompare(List1, List2) print(info)
Extract common elements from several lists
List1 = ['Aaa', 'Aaa', 'Bbbb', 'Cccc', 'Dddd', 'Eeee', 'Ffff', 'Gggg', 'Hhhh'] List2 = ['Eeee', 'Cccc', 'Cccc'] List3 = ['Cccc', 'Bbbb', 'Eeee'] print('List1:', List1) print('List2:', List2) print('List3:', List3) Compare = sorted(list(set(List1) & set(List2) & set(List3) )) print('\nCompare:', Compare)
Extract no-common elements from several lists
List1 = ['Aaa', 'Aaa', 'Bbbb', 'Cccc', 'Dddd', 'Eeee', 'Ffff', 'Gggg', 'Hhhh'] List2 = ['Eeee', 'Cccc', 'Cccc'] print('List1:', List1) print('List2:', List2) Compare = sorted(list(set(List1) - set(List2) )) print('\nCompare:', Compare)
Concatenate 2 fields in list/sentence
Assemblage = list(zip(df['field1'].tolist(), df['field2'].tolist())) b = [] for a in Assemblage: b.append('Blablabla... ' + str(list(a)[0]) + ' ...Blablabla... ' + list(a)[1] + ' ...Blablabla.') for i in b: print(i)
List all files from a directory in a Python list
import os

repertoireCheck = 'E:/_CheckTables/'
listTxtFile = []

for path, subdirs, files in os.walk(repertoireCheck):
    for name in files:
        print(os.path.join(path, name))
        listTxtFile.append(os.path.join(path, name))

print(listTxtFile)
print(len(listTxtFile))
Split a big list in smaller sub-lists (1)
SubListsSize = 750 SubLists = [MyBigList[x:x+SubListsSize] for x in range(0, len(MyBigList), SubListsSize)] for SubList in SubLists: print(SubList)
Split a big list in smaller sub-lists (2)
def split_list(my_list, size_max):
    sublists = []
    for i in range(0, len(my_list), size_max):
        sublist = my_list[i:i+size_max]
        sublists.append(sublist)
    return sublists

sublists = split_list(MyPersonalList, 10)
Advanced read/write in Excel
Formulas
Write formula
You can create Excel formulas with Pandas without problems, but follow the tips below.
- As you will use both single and double quotes in your formulas, wrap your string in triple quotes.
- Write all Excel keywords in English (IF, CHAR, SUBSTITUTE...).
- Use a plain comma to separate your parameters (English convention).
- Re-sort your dataframe and reset your index before the formula:
df.sort_values(by=['Your Id'], ascending=True, inplace=True) df = df.reset_index(drop=True)
- Shift the index to start at a chosen value (here 2, since data usually starts on row 2 in Excel):
df.index = df.index + 2
- Then use the index as a string in your formula:
'''...''' + df.index.astype(str) + '''...'''
- Do not add other sorts after creating the formulas; it causes index issues and breaks your formulas. A complete sketch follows the list below.
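A minimal sketch putting these tips together (the formula, the Check column and the file path are hypothetical):

df.sort_values(by=['Your Id'], ascending=True, inplace=True)
df = df.reset_index(drop=True)
df.index = df.index + 2  # Excel data usually starts on row 2

# Build one formula per row, using the index as the Excel row number
df['Check'] = '''=IF(B''' + df.index.astype(str) + '''="","Empty","OK")'''

df.to_excel('C:/Path/test.xlsx', index=False, sheet_name='Tab 1')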
Read formula
Sometimes you want to read the computed value of a formula rather than the formula string. If the Excel file was created by Excel itself, no problem, just use data_only=True:
wb = openpyxl.load_workbook('My file.xlsx', data_only=True)
ws = wb['My tab']
df = pd.DataFrame(ws.values)
print(tabulate(df.head(5), headers='keys', tablefmt='psql', showindex=False))

But if the file, and the formulas, were created with Python, the Excel cache does not know your formulas yet and... oops! So just open and close your file (here with xlwings) before using data_only=True:
import xlwings ... excel_app = xlwings.App(visible=False) excel_book = excel_app.books.open('My file.xlsx') excel_book.save() excel_book.close() excel_app.quit()
Conditional formatting
If a number is equal to, in just one cell
Here we add a rule to one cell (B3) that fills it in blue if its value equals 2:
import openpyxl
from openpyxl import styles, formatting

myFile = 'C:/Path/test.xlsx'
myDF.to_excel(myFile, index=False, sheet_name='Tab 1')

workbook = openpyxl.load_workbook(myFile)
worksheet = workbook['Tab 1']
myBlue = styles.PatternFill(start_color='0088cc', end_color='0088cc', fill_type='solid')
worksheet.conditional_formatting.add('B3', formatting.rule.CellIsRule(operator='equal', formula=['2'], fill=myBlue))
workbook.save(myFile)
If a number is greater than, in whole column
Same as above; we just change the operator to greaterThan and target the whole column B, as Excel does, with B1:B1048576:
import openpyxl
from openpyxl import styles, formatting

myFile = 'C:/Path/test.xlsx'
myDF.to_excel(myFile, index=False, sheet_name='Tab 1')

workbook = openpyxl.load_workbook(myFile)
worksheet = workbook['Tab 1']
myBlue = styles.PatternFill(start_color='0088cc', end_color='0088cc', fill_type='solid')
worksheet.conditional_formatting.add('B1:B1048576', formatting.rule.CellIsRule(operator='greaterThan', formula=['2'], fill=myBlue))
workbook.save(myFile)
Operators
The above method supports several operators:
{'beginsWith', 'endsWith', 'notBetween', 'notContains', 'lessThan', 'greaterThanOrEqual', 'notEqual', 'between', 'lessThanOrEqual', 'containsText', 'greaterThan', 'equal'}
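For instance, a between rule takes two values in its formula list (a small sketch reusing the myBlue fill and worksheet from the example above):

worksheet.conditional_formatting.add('B1:B1048576', formatting.rule.CellIsRule(operator='between', formula=['2', '5'], fill=myBlue))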
If contains a string
Matching strings is a little more specific, with some redundant code and building the column_slot ourselves, using the size of our dataframe + 1.
Here, for example, we fill in red the cells of column A that contain the string CSS:
import pandas as pd
import openpyxl
from openpyxl import styles, formatting
from openpyxl.styles.differential import DifferentialStyle
from openpyxl.formatting.rule import Rule

myFile = 'C:/Path/test.xlsx'
myDF.to_excel(myFile, index=False, sheet_name='Tab 1')

workbook = openpyxl.load_workbook(myFile)
worksheet = workbook['Tab 1']
myRed = styles.PatternFill(start_color='ffc7ce', end_color='ffc7ce', fill_type='solid')

column_slot = 'A2:A' + str(myDF.shape[0]+1)
styleRed = DifferentialStyle(fill=myRed)
rule = Rule(type='containsText', operator='containsText', text='css', dxf=styleRed)
rule.formula = ['SEARCH("css",' + column_slot + ')']
worksheet.conditional_formatting.add(column_slot, rule)
workbook.save(myFile)
Fix Excel percentage
When your data comes from an Excel file and contains percentages, something like dtype (see the Miscellaneous chapter) will not be efficient, because Excel stores percentages as numbers between 0 and 1; the display with % is just a view. For example, 35% is actually stored as 0.35.
In addition, these percentages may be mixed with true strings or numbers in the same column, sometimes with or without decimals... Argghhh 🥵
But just relax:
import re ... df[MyField] = df[MyField].apply(lambda x: str(round(float(x) * 100, 2)).rstrip('0').rstrip('.') + ' %' if re.search(r'^0\.\d+$|^0$|^1$',x) else x)
- Attention
- If the Excel file is badly built and mixes true numbers between 0 and 1 with percentages, those numbers will be converted too. So maybe add red alerts to manage these special cases, as in the sketch below.
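A minimal sketch of such an alert (just one possible approach; MyField is the same placeholder as in the snippet above): list the suspicious values before converting them.

# Values matching the percentage pattern will be converted, so print them first for a manual check
suspects = df[df[MyField].astype(str).str.match(r'^0\.\d+$|^0$|^1$', na=False)]
if not suspects.empty:
    print('Attention, these values will be treated as percentages:')
    print(suspects[MyField].tolist())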
Advanced write in a TXT file
Split a file according to line numbers
The snippet below is not strictly about Pandas, but it is useful in some Pandas contexts.
First we count the number of lines in the file (from the occurrences of \n), then we split it into files of 50,000 lines if it exceeds 50,000 lines.
import os
from math import ceil

with open(workDirectory+'FixQueries.sql', 'r') as myfile:
    data = myfile.read()

taille_fichier = data.count("\n")
lines_max = 50000
numero_fichier = 0

if taille_fichier > lines_max:
    print('Attention: the output file is ' + str(taille_fichier) + ' lines long! Please wait...')
    smallfile = None
    with open(workDirectory+'FixQueries.sql') as bigfile:
        for lineno, line in enumerate(bigfile):
            if lineno % lines_max == 0:
                if smallfile:
                    smallfile.close()
                numero_fichier += 1
                small_filename = workDirectory + 'FixQueries {}.sql'.format(numero_fichier)
                smallfile = open(small_filename, "w")
            smallfile.write(line)
        if smallfile:
            smallfile.close()
    print('We split it into', ceil(taille_fichier/lines_max), 'files!\n')
    os.remove(workDirectory+'FixQueries.sql')
And then merge the files:
filenames = ['C:/_gh/0/file_25000.txt', 'C:/_gh/0/file_50000.txt', 'C:/_gh/0/file_75000.txt', 'C:/_gh/0/file_100000.txt', 'C:/_gh/0/file_125000.txt'] with open('C:/_gh/0/CUMUL1.txt', 'w') as outfile: for names in filenames: with open(names) as infile: outfile.write(infile.read())
Maybe you will need to list the file names before merging them, with PowerShell (Alt+F+R):
get-childitem | select-object -expandproperty name > _files_list.txt
- Attention 👈
- Be careful to keep the order of the files, and to keep the last line empty.
Replace text in a file
my_file = r'C:/_gh/0/_Urbanisme/test.txt' replacements = [ ('heros', 'héros'), ('Batm?n', 'Batman'), ('télévisee', 'télévisée'), (' s attaquent', ' s\'attaquent') ] with open(my_file, 'r') as file: data = file.read() for a, b in replacements: data = data.replace(a, b) with open(my_file, 'w') as file: file.write(data) # Confirm print("OK !")
This is not directly related to Pandas, but it is very useful for storing special notes to remember during a data process, and automating that.
Create/replace a TXT file and write something into it
Use the Write option (w):
MyTxtFile = 'C:/_gh/0/My beautiful file.txt' with open(MyTxtFile, 'w') as file: file.write('Blablabla...\n') file.write(MyVariable)
Add some lines at the start of an existing file
MyTxtFile = 'C:/_gh/0/My beautiful file.txt' with open(MyTxtFile, 'r') as original: data = original.read() with open(MyTxtFile, 'w') as modified: modified.write('XPTDR\n' + data)
Add some lines at the end of an existing file
Use the Append option (a):
MyTxtFile = 'C:/_gh/0/My beautiful file.txt' with open(MyTxtFile, 'a') as file: file.write('Blablabla...\n') file.write(MyVariable)
Replace some lines in an existing file
Here we first Read (r option) and store the content of the file, then re-create it, removing the lines we want to delete:
MyTxtFile = 'C:/_gh/0/My beautiful file.txt' with open(MyTxtFile, 'r') as f: lines = f.readlines() with open(MyTxtFile, 'w') as f: for line in lines: if not line.strip('\n').startswith('First line to delete...') : if not line.strip('\n').startswith('Second line to delete...') : if not line.strip('\n').startswith('Third line to delete...') : f.write(line) with open(MyTxtFile, 'a') as file: file.write('New line 1') file.write('New line 2') file.write('New line 3')
Replace a simple string in a TXT file
MyTxtFile = 'C:/_gh/0/My beautiful file.txt' with open(MyTxtFile, 'r') as file : filedata = file.read() filedata = filedata.replace('String 1', 'String 2') with open(MyTxtFile, 'w') as file: file.write(filedata)
Replace a string with a regular expression (regex) in a TXT file
FichierTest = 'C:/_gh/0/FichierTest.txt' import re # Find number : r'[0-9]' # Find line break : r'[\n]' # Find double line break : r'[\r\n]{2,}' with open(FichierTest, 'r+') as file: text = file.read() text = re.sub(r'[\r\n]{2,}', '\n\n', text) file.seek(0, 0) # seek to beginning file.write(text) file.truncate() # get rid of any trailing characters
Merge 2 txt files
MyTxtFile1 = 'C:/_gh/0/My beautiful file 1.txt' MyTxtFile2 = 'C:/_gh/0/My beautiful file 2.txt' MyTxtFileMerged = 'C:/_gh/0/My beautiful file merged.txt' with open(MyTxtFile1, 'r') as _MyTxtFile1: dataMyTxtFile1 = _MyTxtFile1.read() with open(MyTxtFile2, 'r') as _MyTxtFile2: dataMyTxtFile2 = _MyTxtFile2.read() with open(MyTxtFileMerged, 'w', encoding='utf-8') as _MyTxtFileMerged: _MyTxtFileMerged.write(dataMyTxtFile1) _MyTxtFileMerged.write(dataMyTxtFile2) os.remove(MyTxtFile1) os.remove(MyTxtFile2)
Test if a comma is used as decimal separator in a TXT file
import re
from termcolor import colored

with open(MyTxtFile, 'r') as myfile:
    data = myfile.read()

# Python's re module does not support \K, so match the whole "digits,digits" pattern instead
pb_regex = re.findall(r'(?:\t|^)\d+,\d+(?:\t|$)', data, flags=re.MULTILINE)

if pb_regex:
    print(colored('\nAttention, a comma is used as decimal separator in numeric values!\n'
                  'Find them in Notepad++ with: (?:\\t|^)\\d+\\K,(?=\\d+(?:\\t|$))', 'yellow'))
Write sentences from dataframe
MyTxtFile = WorkDirectory + ' - My file.txt' with open(MyTxtFile, 'w') as file: file.write('\nSpecial cases:\n\n') Assemblage = list(zip(df['Field1'].tolist(), df['Field2'].tolist(), df['Field3'].tolist(), df['Field4'].tolist())) b = [] for a in Assemblage: b.append('- ' + str(list(a)[0]) + ' ' + str(list(a)[1]) + ' (' + str(list(a)[2]) + ') in ' + str(list(a)[3])) for i in b: file.write(i + '\n')
Convert encoding
With chardet:
import os
from chardet import detect
from termcolor import colored

def get_encoding_type(file):
    with open(file, 'rb') as f:
        rawdata = f.read()
    return detect(rawdata)['encoding']

from_codec = get_encoding_type(MyOriginalFile)

if from_codec != 'utf-8':
    print(colored('Attention, the file is ' + from_codec + '!', 'yellow'))
    try:
        with open(MyOriginalFile, 'r', encoding=from_codec) as f, open(MyNewFile, 'w', encoding='utf-8') as e:
            text = f.read()
            e.write(text)
        os.remove(MyOriginalFile)
        os.rename(MyNewFile, MyOriginalFile)
        print(colored('We converted the file to UTF-8.', 'yellow'))
    except UnicodeDecodeError:
        print('Decode Error')
    except UnicodeEncodeError:
        print('Encode Error')
With io:
import io with io.open(OriginalTxtFile, mode='r', encoding='utf8') as fd: content = fd.read() with io.open(FinalTxtFile, mode='w', encoding='cp1252') as fd: fd.write(content)
Advanced read in a TXT file
Work with data in a TXT file according to the line number
You can grab data in a TXT file according to the line number, or by counting a number of lines before or after a specific string.
file = open('C:/Users/Georges/Downloads/MyFile.txt', 'r') lines = file.readlines() listSubject = [] listFirstname = [] for num, x in enumerate(lines): if x.startswith('Subject:\t'): listSubject.append(x) listFirstname.append(lines[num+6]) MergeLists = list(zip(listSubject, listFirstname)) df = pd.DataFrame(MergeLists, columns=['field Subject', 'field Firstname'])
Get encoding of a file
from chardet import detect def get_encoding_type(file): with open(file, 'rb') as f: rawdata = f.read() return detect(rawdata)['encoding'] from_codec = get_encoding_type(MyFile) print('from_codec') print(from_codec)
Merge, concatenate, append
Merge 2 lists in 2 columns and properly display them in the terminal
import pandas as pd from tabulate import tabulate LanguageList = ['Python', 'SQL', 'PHP', 'CSS', 'Javascript', 'HTML'] LevelList = ['Fun', 'OK', 'Arghh', 'OK', 'Arghh', 'Easy'] LanguageLevel = list(zip(LanguageList, LevelList)) df = pd.DataFrame(LanguageLevel, columns =['Language', 'Level']) print(tabulate(df, headers='keys', tablefmt='psql', showindex=False))
zip links the values from our lists into tuples, and list encloses the tuples in a list. pd.DataFrame then creates a data-frame from our new list (now containing tuples), specifying the column labels. Finally, the print uses tabulate to display our fields in a nice Postgres style (psql).
Split a dataframe with conditions, then merge the splits according to other conditions
firstSelection = df.loc[df['Field'] == 'Yes'].head(100) secondSelection = df.loc[df['Field'] == 'No'].head(200) # Merge the 2 selections. df_Selections = pd.concat([firstSelection, secondSelection], axis=0)
Concatenate dataframes in a loop
df_Concatenation = pd.DataFrame() for ... : df... df_Concatenation = pd.concat([df_Concatenation, df], axis=0)
Aggregate files
Imagine that you want to iterate through a directory, which itself contains other directories, to aggregate Excel files contained in each of the subdirectories:
Context:
- The main directory is named _ForAnalysis, at the root of C:.
- The name of each subdirectory contains the words typical directory.
- The files you want to aggregate are named Typical file.xlsx.
- Each Excel file contains a sheet named Typical sheet with a Field1 and a Field2.
Code:
from termcolor import colored import os import pandas as pd from tabulate import tabulate # workDirectory workDirectory = 'C:/_ForAnalysis/' print('\nWork directory: ' + workDirectory) TypicalFile = 'Typical file.xlsx' NumberTypicalFile = 0 df_CumulatedFiles = pd.DataFrame() print(colored('\nCheck typical files (' + str(len(os.listdir(workDirectory))) + '):', 'blue')) for rep in os.listdir(workDirectory): if 'typical directory' in rep: NumberTypicalFile += 1 TypicalPath = os.path.join(workDirectory, rep) TypicalPath = TypicalPath + '/' + TypicalFile print(TypicalPath) df_TypicalFile = pd.read_excel(TypicalPath, sheet_name='Typical sheet', engine='openpyxl', usecols=['Field1', 'Field2']) # DISPLAY ONE BY ONE print(tabulate(df_TypicalFile.head(10), headers='keys', tablefmt='psql', showindex=False)) print(str(df_TypicalFile.shape[0]), 'commune(s)') print('') # FILES CONCATENATION df_CumulatedFiles = pd.concat([df_CumulatedFiles, df_TypicalFile], axis=0) print(colored('\n' + str(NumberTypicalFile) + ' file(s) have been aggregated:', 'blue')) print(tabulate(df_CumulatedFiles.head(10), headers='keys', tablefmt='psql', showindex=False)) print(str(df_CumulatedFiles.shape[0]), 'lines')
Fill a dataframe from a list (append)
df = pd.DataFrame() for l in MyList: dfTemp = pd.DataFrame({ 'Field 1': 'Georges', 'Field 2': [l] }) df = pd.concat([df, dfTemp], axis=0) print('\ndf :') print(tabulate(df.head(2), headers='keys', tablefmt='psql', showindex=False)) print(df.shape[0])
Analyze data from a multi-valued field
Suppose you have a field containing values coming from a multi-valued drop-down list. But now you want to count how many times each value exists.
If values were unique in each cell, a simple GROUP BY in SQL would do the job.
But here we first have to split each cell to recover unique values, then we will be able to count them.
Split the data
Start by splitting the data with str.split, specifying the separator and expand=True, to create new columns that store each unique value.
import pandas as pd df = pd.read_excel('C:/Users/Georges/Downloads/My Excel file.xlsx', sheet_name='Sheet1', engine='openpyxl') serie = df['Your field'].str.split(', ', expand=True) print(serie)
Great! Now we get virtual columns each containing a single value, with as many columns as necessary.
Melt the virtual data
Now let's melt our virtual columns into just one.
import pandas as pd df = pd.read_excel('C:/Users/Georges/Downloads/My Excel file.xlsx', sheet_name='Sheet1', engine='openpyxl') serieMelt = pd.melt(df['Your field'].str.split(', ',expand=True))['value'] print(serieMelt)
pd.melt labels the melted column value, and we can use that name.
Count unique values
Now just put our melted data in a new data-frame, then apply a sorted groupby with size() (a COUNT in SQL).
The full code will be:
import pandas as pd df = pd.read_excel('C:/Users/Georges/Downloads/My Excel file.xlsx', sheet_name='Sheet1', engine='openpyxl') temp_df = pd.DataFrame(pd.melt(df['Your field'].str.split(', ', expand=True))['value']) df_Count = pd.DataFrame(temp_df.groupby(['value']).size(), columns=['Total']).reset_index() print(df_Count)
We add a reset_index() to recover proper column names, so we can use them later.
Percentage
OK, but suppose you also want the recurrence percentages of your unique values. The usual method would take the total from the melted data, giving strange results: indeed the melt stacks every value of the multi-valued field into new rows, so there are a lot of them!
To get real percentages you have to use shape[0] to count the real number of rows of your original data-frame, and use that in the percentage calculation.
We start by creating a new field in the results data-frame, then we calculate the percentages and round them.
df_Count['Percent'] = (df_Count['Total'] / df.shape[0]) * 100 df_Count['Percent'] =df_Count['Percent'].round(decimals=2)
Of course, you cannot sum the percentage results: the total will exceed 100%, and that is normal.
Empty values management
Good. But what about our empty values? Just add dropna=False to see them.
import pandas as pd df = pd.read_excel('C:/Users/Georges/Downloads/My Excel file.xlsx', sheet_name='Sheet1', engine='openpyxl') temp_df = pd.DataFrame(pd.melt(df['Your field'].str.split(', ', expand=True))['value']) df_Count = pd.DataFrame(temp_df.groupby(['value'], dropna=False).size(), columns=['Total']).reset_index() df_Count['Percent'] = (df_Count['Total'] / df.shape[0]) * 100 df_Count['Percent'] =df_Count['Percent'].round(decimals=2) print(df_Count)
Any problem?
Arrgg! The groupby counted empty values from the melted data, and the melt added as many new columns as necessary (based on the record with the highest number of unique values), each of them containing empty values... It skews the count and the percentage.
We have to cheat to fix that.
import pandas as pd

df = pd.read_excel('C:/Users/Georges/Downloads/My Excel file.xlsx', sheet_name='Sheet1', engine='openpyxl')

temp_df = pd.DataFrame(pd.melt(df['Your field'].str.split(', ', expand=True))['value'])
df_Count = pd.DataFrame(temp_df.groupby(['value'], dropna=False).size(), columns=['Total']).reset_index()
df_Count = df_Count.fillna('AZERTY')

df_Count['Percent'] = (df_Count['Total'] / df.shape[0]) * 100
df_Count['Percent'] = df_Count['Percent'].round(decimals=2)

valuesEmpty = df['Your field'].isna().sum()
valuesEmptyPercent = round((valuesEmpty / df.shape[0]) * 100, 2)

df_Count.loc[(df_Count['value'] == 'AZERTY')] = [['Unknown', valuesEmpty, valuesEmptyPercent]]

df_Count = df_Count.sort_values(['Total'], ascending=False)
print(df_Count)
The fillna('AZERTY') puts a specific marker value in the results data-frame, to target the wrong empty-values row.
Then we count the empty values and their percentage separately, storing the results in dedicated variables.
Next we replace the wrong row with our marker value and these variables, using loc to identify the line to fix.
Finally we can sort our data-frame with sort_values(['Total']).
Work with dates
Get true date format from string data
For example, you have dates like 21/09/2022 but you want a true date format to work with:
df['TrueDate'] = pd.to_datetime(df['OriginalDate'], infer_datetime_format=True).dt.strftime('%Y-%m-%d')
Group by months
If you have a data-frame df with a field Created at containing dates like:
2021-02-05 10:34:21 UTC
You can work with it. For example, to group and sort by month, you can cheat by doing:
df_Temp = pd.DataFrame(df['Created at'].str[:7]) df_Created_count = pd.DataFrame(df_Temp.groupby(['Created at'], dropna=False).size(), columns=['Total']).sort_values(['Created at'], ascending=True).reset_index()
Indeed, with str[:7] we extract the year and month from the strings. Then a usual groupby.
OK, but it is more pandas-like to convert your field to a true date format, in a new virtual field. Then you can extract real months and store them in a proper data-frame.
df['Created'] = pd.to_datetime(df['Created at']) df['Created'] = df['Created'].dt.to_period("M") df_TempCreated = pd.DataFrame(df['Created'])
Then it is easy to group and count them.
df_Created_count = pd.DataFrame(df_TempCreated.groupby(['Created'], dropna=False).size(), columns=['Total']).sort_values(['Created'], ascending=True).reset_index()
Since you now use a real date format, you can display the full month names with dt.strftime.
df_Created_count['Created'] = df_Created_count['Created'].dt.strftime('%B %Y')
Format date
If you get dates like 2022-10-24 and you want 24/10/2022:
df['My date'] = pd.to_datetime(df['My date'], format='%Y-%m-%d') df['My date'] = df['My date'].dt.strftime("%d/%m/%Y")
Format dates from the French format
If you get dates like 11/05/2023 and you want 2023-05-11:
df['My date'] = df['French date'].str.split('/').str[2] + '-' + df['French date'].str.split('/').str[1] + '-' + df['French date'].str.split('/').str[0]
Get simple date (YYYYMMDD)
# SIMPLE DATE df_DataUnion['date_simple'] = df_DataUnion['date_submit'].astype(str).str[:10] df_DataUnion['date_simple'] = df_DataUnion['date_simple'].replace({'-': ''}, regex=True)
Extract the previous month
PreviousMonth = pd.Timestamp.today() - pd.DateOffset(months=1)
print(PreviousMonth.strftime("%Y-%m"))
Convert false date field from Excel
Excel can mix a true date format and strings in the same column, arrgh! That hurts later when you want to use the column in Pandas...
I use this tip to manage this kind of malicious field in pandas:
df.loc[~df['date'].str.contains(' ', na=False), 'date'] = df['date'].astype(str).str.split(expand=True)[0]
df.loc[~df['date'].str.contains(' ', na=False), 'date'] = \
    df['date'].astype(str).str.split('-', expand=True)[2] + '/' + \
    df['date'].astype(str).str.split('-', expand=True)[1] + '/' + \
    df['date'].astype(str).str.split('-', expand=True)[0]
If needed, I wrap the code above in a try:... except:... block.
If needed, I also convert the whole field to string first, with astype(str). See the sketch below.
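A minimal sketch of both precautions (the date field and the French-style split are only an example, reusing the snippet above):

df['date'] = df['date'].astype(str)
try:
    df.loc[~df['date'].str.contains(' ', na=False), 'date'] = \
        df['date'].str.split('-', expand=True)[2] + '/' + \
        df['date'].str.split('-', expand=True)[1] + '/' + \
        df['date'].str.split('-', expand=True)[0]
except Exception:
    pass  # keep the original values if the split fails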
Manage numeric date from Excel
Sometimes dates from Excel come as integers, arrgghhh!
import xlrd def read_date(date): return xlrd.xldate.xldate_as_datetime(date, 0) df.loc[df['My date'].astype(str).str.isdigit() == True, 'My date'] = \ pd.to_datetime(df['My date'].loc[df['My date'].astype(str).str.isdigit()].astype(int).apply(read_date), errors='coerce')
Check date format
Sometimes (especially when the data comes from Excel...) it is good to check your dates after processing.
Below we search for dates that do not start with the DD/MM/YYYY pattern:
df2 = df[~df['my date'].str.contains('^[0-9][0-9]/[0-9][0-9]/[0-9][0-9][0-9][0-9]', regex=True)] print(tabulate(df2, headers='keys', tablefmt='psql', showindex=False))
Check if a date is in the future
There are many ways; here we convert the dates to numbers:
from datetime import date, datetime today = date.today() df_ProblemeDate = df[['My date']][pd.to_numeric(df['My date'].str.split('/').str[2] + df['My date'].str.split('/').str[1] + df['My date'].str.split('/').str[0]) > int(str(today).replace('-', ''))]
Count opening days between two consecutive lines of the same date column
Imagine each line of your dataframe is a delivery, with a Date field:
import pandas as pd from tabulate import tabulate from datetime import date import numpy as np import datetime as dt import warnings from pandas.core.common import SettingWithCopyWarning warnings.simplefilter(action='ignore', category=SettingWithCopyWarning) ... df['Year'] = df['Date'].str.split('/').str[2] df['Month'] = df['Date'].str.split('/').str[1] df['Day'] = df['Date'].str.split('/').str[0] df['Opening days'] = '' df.reset_index(drop=True, inplace=True) for i, row in df.iterrows(): dateDeliveryYear = int(row['Year']) dateDeliveryMonth = int(row['Month']) dateDeliveryDay = int(row['Day']) if i == 0: df.at[i, 'Opening days'] = 'First delivery' if i > 0: beforeDateDeliveryYear = int(df.iloc[[i-1]]['Year'].item()) beforeDateDeliveryMonth = int(df.iloc[[i-1]]['Month'].item()) beforeDateDeliveryDay = int(df.iloc[[i-1]]['Day'].item()) start = dt.date(beforeDateDeliveryYear,beforeDateDeliveryMonth,beforeDateDeliveryDay) end = dt.date(dateDeliveryYear,dateDeliveryMonth,dateDeliveryDay) df.at[i, 'Opening days'] = np.busday_count(start, end) print('\n' + u) print(tabulate(df.head(10), headers='keys', tablefmt='psql', showindex=True)) print(df.shape[0])
Work with phones
phonenumbers library
If you have a phone field and a country code field:
... import phonenumbers ... df['Phone'] = df['Phone'].fillna('') df['CountryCode'] = df['CountryCode'].fillna('') df['TestPhone'] = df.apply( lambda x: phonenumbers.parse(str(x['Phone']), str(x['CountryCode'])) if str(x['Phone']) != '' and str(x['CountryCode']) != '' else '', axis='columns' )
Full phone fix
# FIX PHONE df['Phone'] = df['Phone'].str.strip() df['Phone'] = df['Phone'].replace({'-': ' '}, regex=True) df['Phone'] = df['Phone'].replace({'_': ' '}, regex=True) df['Phone'] = df['Phone'].replace({'\.': ' '}, regex=True) df['Phone'] = df['Phone'].replace({',': ' '}, regex=True) df['Phone'] = df['Phone'].replace({' ': ' '}, regex=True) df['Phone'] = df['Phone'].replace({'[a-zA-Z]': ''}, regex=True) df['Phone'] = df['Phone'].replace({' ': ' '}, regex=True) df['Phone'] = df['Phone'].replace({'^ ': ''}, regex=True) df['Phone'] = df['Phone'].replace({'\*': ''}, regex=True) df['Phone'] = df['Phone'].replace({'\(': ''}, regex=True) df['Phone'] = df['Phone'].replace({'\)': ''}, regex=True) df['Phone'] = df['Phone'].replace({'\$': ''}, regex=True) df['Phone'] = df['Phone'].str.strip() df['Phone'] = df['Phone'].fillna('') df.loc[df['Phone'].str.replace(' ', '').str.len() < 4, 'Phone'] = '' df['CountryCode'] = df['CountryCode'].fillna('') # CREATE TESTPHONE FIELD df['TestPhone'] = '' df = df.reset_index(drop=True) for i, row in df.iterrows(): myPhone = row['Phone'] myCountryCode = row['CountryCode'] MyTest = '' try: # IF PHONE EXISTS AND START WITH "+" AND COUNTRY EXISTS if myPhone != '' and myPhone.startswith('+') and myCountryCode != '': MyTest = phonenumbers.parse(myPhone, myCountryCode) # IF PHONE EXISTS AND NOT START WITH "+" AND COUNTRY EXISTS if myPhone != '' and not myPhone.startswith('+') and myCountryCode != '': MyTest = phonenumbers.parse(myPhone, myCountryCode) # IF PHONE EXISTS AND START WITH "+" AND COUNTRY NOT EXISTS if myPhone != '' and myPhone.startswith('+') and myCountryCode == '': MyTest = phonenumbers.parse(myPhone, myCountryCode) # IF PHONE EXISTS AND NOT START WITH "+" AND COUNTRY NOT EXISTS if myPhone != '' and not myPhone.startswith('+') and myCountryCode == '': MyTest = phonenumbers.parse('+' + myPhone, myCountryCode) except: pass df.at[i,'TestPhone'] = MyTest df['AutoCode'] = '' df.loc[df['TestPhone'] != '', 'AutoCode'] = '+' + df['TestPhone'].astype(str).replace({'Country Code: ': ''}, regex=True).replace({'( National Number: ).*': ''}, regex=True) df['AutoNationalPhone'] = '' df.loc[df['TestPhone'] != '', 'AutoNationalPhone'] = df['TestPhone'].astype(str).replace({'^.*? 
National Number: ': ''}, regex=True).astype(str).replace({'( Leading Zero).*': ''}, regex=True) # CONCATENATION df['Formatted Phone'] = df[['AutoCode', 'AutoNationalPhone']].apply(lambda x: ' '.join(x.dropna()), axis=1).fillna('') df.loc[~df['Formatted Phone'].str.startswith(('+')), 'Formatted Phone'] = df['Phone'] # AUTOCOUNTRYCODE df['AutoCountryCode'] = '' df = df.reset_index(drop=True) for i, row in df.iterrows(): myCountryCode = row['CountryCode'] MyTest = '' try: MyTest = phonenumbers.country_code_for_region(myCountryCode) except: pass df.at[i,'AutoCountryCode'] = MyTest # LAST df.loc[(~df['Formatted Phone'].str.startswith(('+'))) & (df['AutoCountryCode'] != 0) & (df['Phone'] != ''), 'Formatted Phone'] = '+' + df['AutoCountryCode'].astype(str) + ' ' + df['Phone'].replace({'^0': ''}, regex=True) # IS_POSSIBLE_NUMBER df['is_possible_number'] = '' for i, row in df.iterrows(): myPhone = row['Formatted Phone'] MyTest = '' try: MyTest = phonenumbers.is_possible_number(phonenumbers.parse(myPhone, None)) except: pass df.at[i, 'is_possible_number'] = MyTest # IS_VALID_NUMBER df['is_valid_number'] = '' for i, row in df.iterrows(): myPhone = row['Formatted Phone'] MyTest = '' try: MyTest = phonenumbers.is_valid_number(phonenumbers.parse(myPhone, None)) except: pass df.at[i, 'is_valid_number'] = MyTest # DELETE WRONG NUMBER df.loc[(df['is_valid_number'] == False) & (df['is_possible_number'] == False), 'Formatted Phone'] = '' # RE-ORDER df = df[['Phone', 'CountryCode', 'Formatted Phone']] print(tabulate(df.head(30), headers='keys', tablefmt='psql', showindex=False))
Pie Chart
Simple pie
If you have a data-frame with fields Categories and Total, you can create a pie chart.
import matplotlib.pyplot as plt chartLabel = df_Categories_count['Categories'].tolist() chartValue = df_Categories_count['Total'].tolist() colors = plt.rcParams['axes.prop_cycle'].by_key()['color'] fig1 = plt.figure() plt.pie(chartValue, labels=chartLabel, colors=colors, autopct=None, shadow=False, startangle=90) plt.axis('equal') plt.title('Categories', pad=20, fontsize=15) plt.show() plt.clf()
Display percentages in the chart
Use autopct.
plt.pie(chartValue, labels=chartLabel, colors=colors, autopct='%1.1f%%', shadow=False, startangle=90)
Change the labels font size
plt.rcParams['font.size'] = 5
Hide a label in chart
Just replace the value with an empty string.
chartLabel[-1] = ''
Add a legend in your chart
plt.legend(chartLabel, loc='best', fontsize=8)
Explode
To explode your chart, you need to pass values to explode; for example, if you have 4 values to plot:
plt.pie(myValue, labels=myLabels, colors=colors, explode=[0.05, 0.05, 0.05, 0.05], autopct=None, shadow=False, startangle=90)
If you want to explode all parts by default, just build the list first:
explodeValues = [] for i in myValue: explodeValues.append(0.05) plt.pie(myValue, labels=myLabels, colors=colors, explode=explodeValues, autopct=None, shadow=False, startangle=90)
Add a legend with several fields
For example, you want to display labels and percentages in the legend. First calculate the percentages in another field of your data-frame, then:
import matplotlib.pyplot as plt chartLabel = df_Categories_count['Categories'].tolist() chartLegendLabel = df_Categories_count['Categories'].tolist() chartValue = df_Categories_count['Total'].tolist() chartLegendPercent = df_Categories_count['Percent'].tolist() legendLabels = [] for i, j in zip(chartLegendLabel, map(str, chartLegendPercent)): legendLabels.append(i + ' (' + j + ' %)') colors = plt.rcParams['axes.prop_cycle'].by_key()['color'] fig1 = plt.figure() plt.pie(chartValue, labels=chartLabel, colors=colors, autopct=None, shadow=False, startangle=90) plt.axis('equal') plt.title('Categories', pad=20, fontsize=15) plt.legend(legendLabels, loc='best', fontsize=8) plt.show() plt.clf()
Bar chart
Vertical bar chart
chartLabel = df_Created_count['Labels field'].tolist() chartValue = df_Created_count['Values field'].tolist() fig5 = plt.figure(figsize=(13,6)) bar_plot = plt.bar(chartLabel, chartValue) # plt.ylabel('yyy') # plt.xlabel('xxx') plt.xticks(rotation=30, ha='right') # HIDE BORDERS plt.gca().spines['left'].set_color('none') plt.gca().spines['right'].set_color('none') plt.gca().spines['top'].set_color('none') # HIDE TICKS plt.tick_params(axis='y', labelsize=0, length=0) plt.yticks([]) # ADD VALUE ON THE END OF HORIZONTAL BARS # for index, value in enumerate(chartValue): # plt.text(value, index, str(value)) # ADD VALUE ON THE TOP OF VERTICAL BARS def autolabel(rects): for idx, rect in enumerate(bar_plot): height = rect.get_height() plt.text(rect.get_x() + rect.get_width()/2, height, chartValue[idx], ha='center', va='bottom', rotation=0) autolabel(bar_plot) plt.title('Registrations by month', pad=20, fontsize=15) plt.show() plt.clf()
Work with shapefiles (without Geopandas)
Here we will use numpy and basemap.
pip install numpy
pip install basemap
To install basemap on Windows, download a package here according to your versions, and install it from the wheel file.
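For example (the wheel file name below is hypothetical; use the one matching your Python version and architecture):

pip install wheel
pip install C:/Users/Downloads/basemap-1.3.2-cp39-cp39-win_amd64.whl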
Thanks to ramiro.org: I got the code below from that website and just adapted it.
Suppose you have an Excel file with a country field: you can count the countries and map them with a country shapefile. The country names in the Excel file and in the shapefile must match.
import matplotlib as mpl import matplotlib.pyplot as plt import numpy as np import pandas as pd from matplotlib.patches import Polygon from matplotlib.collections import PatchCollection from mpl_toolkits.basemap import Basemap from PIL import Image, ImageOps shp_simple_countries = r'C:/Users/Downloads/simple_countries/simple_countries' inputExcelFile = r'C:/Users/Downloads/My file.xslx' df = pd.read_excel(inputExcelFile, sheet_name='Export', engine='openpyxl', usecols=['ID', 'My Country Field']) # Count the countries df_Country_count = pd.DataFrame(df.groupby(['My Country Field'], dropna=True).size(), columns=['Total']).sort_values(['Total'], ascending=False).reset_index() df_Country_count = df_Country_count.fillna('Unknow') df_Country_count['Percent'] = (df_Country_count['Total'] / df_Country_count['Total'].sum()) * 100 df_Country_count['Percent'] = df_Country_count['Percent'].round(decimals=2) # Set countries as index df_Country_count.set_index('My Country Field', inplace=True) my_values = df_Country_count['Percent'] num_colors = 30 cm = plt.get_cmap('Blues') scheme = [cm(i / num_colors) for i in range(num_colors)] my_range = np.linspace(my_values.min(), my_values.max(), num_colors) # -1 TO AVOID SEARCHS IN A PANDAS DATA-FRAME INCLUDING START AND STOP VALUE (I think ...) df_Country_count['Percent'] = np.digitize(my_values, my_range) - 1 map1 = plt.figure(figsize=(14, 8)) ax = map1.add_subplot(111, frame_on=False) map1.suptitle('Countries', fontsize=30, y=.95) m = Basemap(lon_0=0, projection='robin') m.drawmapboundary(color='w') m.readshapefile(shp_simple_countries, 'units', color='#444444', linewidth=.2, default_encoding='iso-8859-15') # Create the chloro map for info, shape in zip(m.units_info, m.units): shp_ctry = info['COUNTRY_HB'] if shp_ctry not in df_Country_count.index: color = '#dddddd' else: color = scheme[df_Country_count.loc[shp_ctry]['Percent']] patches = [Polygon(np.array(shape), True)] pc = PatchCollection(patches) pc.set_facecolor(color) ax.add_collection(pc) # Cover up Antarctica so legend can be placed over it ax.axhspan(0, 1000 * 1800, facecolor='w', edgecolor='w', zorder=2) # Draw color legend ax_legend = map1.add_axes([0.2, 0.14, 0.6, 0.03], zorder=3) cmap = mpl.colors.ListedColormap(scheme) cb = mpl.colorbar.ColorbarBase(ax_legend, cmap=cmap, ticks=my_range, boundaries=my_range, orientation='horizontal') cb.ax.set_xticklabels([str(round(i, 1)) for i in my_range]) cb.ax.tick_params(labelsize=7) cb.set_label('Percentage', rotation=0) #cb.remove() # Set the map footer # description = 'Bla bla bla' # plt.annotate(description, xy=(-.8, -3.2), size=14, xycoords='axes fraction') map1.savefig('C:/Users/Downloads/mymap1.png', dpi=100, bbox_inches='tight') plt.show() plt.clf() im = Image.open('C:/Users/Downloads/mymap1.png') bordered = ImageOps.expand(im, border=1, fill=(0, 0, 0)) bordered.save('C:/Users/Downloads/mymap1.png')
Zoom on France
map1.suptitle('Académies', fontsize=20, y=.87)
m = Basemap(projection='merc', resolution='l',
            llcrnrlon=-7,    # Minimum longitude: extends to the east
            llcrnrlat=39.5,  # Minimum latitude: extends to the south
            urcrnrlon=13,    # Maximum longitude: extends to the west
            urcrnrlat=52)    # Maximum latitude: extends to the north
Add labels or POI
ax.text(0.05,  # To the right
        0.59,  # Upwards
        'Guadeloupe', fontsize=10, ha='center', transform=ax.transAxes)
Another example
If you use the attached shapefile named France-Departements-Deformation.shp:
############## Carte df_DepEtablissement # Set academies as index df_DepEtablissement.set_index('nom_departement_etablissement', inplace=True) my_values = df_DepEtablissement['Pourcentage'] num_colors = 30 cm = plt.get_cmap('Blues') scheme = [cm(i / num_colors) for i in range(num_colors)] my_range = np.linspace(my_values.min(), my_values.max(), num_colors) # -1 TO AVOID SEARCHS IN A PANDAS DATA-FRAME INCLUDING START AND STOP VALUE (I think ...) df_DepEtablissement['Pourcentage'] = np.digitize(my_values, my_range) - 1 map2 = plt.figure(figsize=(14, 10)) ax = map2.add_subplot(111, frame_on=False) map2.suptitle('Départements', fontsize=20, y=.87) m = Basemap(projection='merc', resolution='l', llcrnrlon=-9, \ # Longitude minimale : étend vers l'est llcrnrlat=39.5, \ # Latitude minimale : étend vers le sud urcrnrlon=15, \ # Longitude maximale : étend vers l'ouest urcrnrlat=52) \ # Latitude maximale : étend vers le nord m.drawmapboundary(color='w') m.readshapefile(shp_departements, 'units', color='#444444', linewidth=.2, default_encoding='utf-8') # Create the chloro map for info, shape in zip(m.units_info, m.units): shp_departements = info['dep_name'] if shp_departements not in df_DepEtablissement.index: color = '#dddddd' else: color = scheme[df_DepEtablissement.loc[shp_departements]['Pourcentage']] patches = [Polygon(np.array(shape), True)] pc = PatchCollection(patches) pc.set_facecolor(color) ax.add_collection(pc) # Cover up Antarctica so legend can be placed over it # ax.axhspan(0, 1000 * 1800, facecolor='w', edgecolor='w', zorder=2) # Draw color legend ax_legend = map2.add_axes([0.2, 0.14, 0.6, 0.03], zorder=3) cmap = mpl.colors.ListedColormap(scheme) cb = mpl.colorbar.ColorbarBase(ax_legend, cmap=cmap, ticks=my_range, boundaries=my_range, orientation='horizontal') cb.ax.set_xticklabels([str(round(i, 1)) for i in my_range]) cb.ax.tick_params(labelsize=7) cb.set_label('Pourcentage', rotation=0) # cb.remove() ax.text(0.125, 0.565,'Guadeloupe', fontsize=10, ha='center', transform=ax.transAxes) ax.text(0.175, 0.46, 'Martinique', fontsize=10, ha='center', transform=ax.transAxes) ax.text(0.1, 0.18, 'Guyane', fontsize=10, ha='center', transform=ax.transAxes) ax.text(0.42, 0.155, 'Mayotte', fontsize=10, ha='center', transform=ax.transAxes) ax.text(0.6, 0.155, 'La Réunion', fontsize=10, ha='center', transform=ax.transAxes) ax.text(0.73, 0.15, 'Corse', fontsize=10, ha='center', transform=ax.transAxes) ax.text(0.83, 0.41, 'Nouvelle-Calédonie', fontsize=10, ha='center', transform=ax.transAxes) ax.text(0.8, 0.515, 'Polynésie Française', fontsize=10, ha='center', transform=ax.transAxes) ax.text(0.836, 0.69, 'Inconnu', fontsize=8, ha='center', transform=ax.transAxes) ax.text(0.86, 0.903, 'Étranger', fontsize=8, ha='center', transform=ax.transAxes) map2.savefig(downloadsDir + 'mymap2.png', dpi=80, bbox_inches='tight') # plt.show() # plt.clf() im = Image.open(downloadsDir + 'mymap2.png') bordered = ImageOps.expand(im, border=1, fill=(0, 0, 0)) bordered.save(downloadsDir + 'mymap2.png') # INSERT IN EXCEL img = openpyxl.drawing.image.Image(downloadsDir+'mymap2.png') img.anchor = 'E4' workbook['Départements établissements'].add_image(img) workbook.save(statsFile) ################# REMOVE PICTURES os.remove(downloadsDir + 'mymap2.png')
Geopandas
Read a shape
import geopandas as gpd from tabulate import tabulate myShape = 'C:\\Path\\Of\\My\\Shape.shp' print('\n' + myShape) df = gpd.read_file(myShape) df['type'] = df['geometry'].astype(str).str.replace(r' .*|\(.*', '', regex=True) df = df[['id', 'type', 'geometry']] print(tabulate(df.head(10), headers='keys', tablefmt='psql', showindex=True)) print(df.shape[0]) MyFieldList = df['id'].drop_duplicates().dropna().sort_values(ascending=False).tolist() print('\n' + str(MyFieldList)) MyGeomTypeList = df['type'].drop_duplicates().dropna().sort_values(ascending=False).tolist() print('\n' + str(MyGeomTypeList))
Display a map
import geopandas as gpd
import matplotlib.pyplot as plt
MyLayer = 'E:\\Path\\to\\shape.shp'
df = gpd.read_file(MyLayer)
df.plot()
plt.title('My layer', pad=10, fontsize=10)
plt.show()
Display the Coordinate Reference Systems (CRS)
print('Proj : ' + str(df.crs))
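If you need to change the CRS rather than only display it, GeoPandas provides to_crs; a minimal sketch, where the target EPSG:4326 is only an example:
# Reproject the layer (EPSG:4326 is just an example target)
df = df.to_crs('EPSG:4326')
print('Proj : ' + str(df.crs))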
Check geometry
df['valid ?'] = df.is_valid
df = df[df['valid ?'] == False]
print(tabulate(df.head(5), headers='keys', tablefmt='psql', showindex=False))
print(df.shape[0])
Create an empty shape
import geopandas as gpd myShape = 'C:\\Path\\beautiful shape.shp' schema = {"geometry": "Polygon", "properties": {"myid": "int"}} crs = "EPSG:2154" dfEmpty = gpd.GeoDataFrame(geometry=[]) dfEmpty.to_file(myShape, driver='ESRI Shapefile', schema=schema, crs=crs)
Joins
SQL JOIN LEFT
df_Join = pd.merge(df1, df2, left_on='Email', right_on='email', how='left')[['Email', 'source']]
Another example:
df_Merged = df1.merge(df2, on='id', how='left')
Another example to find the non-matching records on 2 fields:
df_join = pd.merge(df1, df2, left_on=['id1', 'field1'], right_on=['id2', 'field2'], how='left')[['id1', 'field1', 'field2']] df_NoMatch = df_join[df_join['field2'].isna()]
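Below, a small self-contained sketch of the left join and of the non-matching filter, using two hypothetical toy dataframes:
import pandas as pd
from tabulate import tabulate

df1 = pd.DataFrame({'Email': ['a@a.com', 'b@b.com', 'c@c.com']})
df2 = pd.DataFrame({'email': ['a@a.com', 'c@c.com'], 'source': ['web', 'import']})

df_Join = pd.merge(df1, df2, left_on='Email', right_on='email', how='left')[['Email', 'source']]
print(tabulate(df_Join, headers='keys', tablefmt='psql', showindex=False))

# Rows of df1 with no match in df2 (source is NaN after the left join)
df_NoMatch = df_Join[df_Join['source'].isna()]
print(tabulate(df_NoMatch, headers='keys', tablefmt='psql', showindex=False))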
SQL INNER JOIN
df_Join = pd.merge(df1, df2, on='Email')
Or:
df_Merged = df1.merge(df2, on='id', how='inner')
Find difference
df_difference = pd.DataFrame(df_First[~df_First['Email'].isin(df_Second['email'])])
Group, count and sort
List unique values
print(df_Deduped.devise.unique())
Count rows
number = df.shape[0] print(number)
Count values (group by)
If you have a field named Country in a data-frame named df:
CountCountry = df.groupby(['Country'], dropna=False).size().sort_values(ascending=False) print(CountCountry)
Count values (group by) and get it in a dataframe
CountCommune = pd.DataFrame(df.groupby(['id_commune'], dropna=False).size(), columns=['Total']).sort_values(['Total'], ascending=False).reset_index() print(tabulate(CountCommune.head(10), headers='keys', tablefmt='psql', showindex=False))
Count values grouping by several fields to find duplicates
df_Duplicates = pd.DataFrame(df.groupby(['id_commune', 'PAGES'], dropna=False).size(), columns=['Total'])\ .sort_values(['Total'], ascending=False).reset_index() df_Duplicates = df_Duplicates[(df_Duplicates['Total'] > 1)] print(tabulate(df_Duplicates.head(10), headers='keys', tablefmt='psql', showindex=False))
Group concat
df = df.groupby(['country'])['city'].apply(lambda x: ', '.join(x)).reset_index()
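Note that ', '.join fails as soon as the grouped field contains NaN or non-string values; a hedged variant on the same columns, dropping empties and casting to string first:
df = df.groupby(['country'])['city'].apply(lambda x: ', '.join(x.dropna().astype(str))).reset_index()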
Count a specific value
print (len(df[df['Your field'] == 'United States']))
Count with condition
myCount = df[df['Field where search the condition'] == 'A value'].count()['Field to count']
Count empty values
number = df['Your field'].isna().sum()
Keep empty values in a group by and fill them
newDf = pd.DataFrame(df.groupby(['Your field'], dropna=False).size(), columns=['Total']).sort_values(['Total'], ascending=False).reset_index()
newDf = newDf.fillna('Unknown')
Count email domains (group by)
You have an Email field with clean emails. Thanks to Pandas it is easy to group and count their domains.
First create a virtual field to split the emails and recover the 2nd part of the split. Then just group your new field.
df['Domain'] = df['Email'].str.split('@').str[1]
df_Email_DNS_count = pd.DataFrame(df.groupby(['Domain']).size(), columns=['Total']).sort_values(['Total'], ascending=False).reset_index()
Calculate percentages
According to your needs, you can use the sum of a Total field.
df['Percent'] = (df['Total'] / df['Total'].sum()) * 100
df['Percent'] = df['Percent'].round(decimals=2)
Or the number of rows in your data-frame (useful if you work with a multi-valued field, see the Analyze data from a multi-valued field chapter above).
df['Percent'] = (df['Total'] / df.shape[0]) * 100 df['Percent'] = df['Percent'].round(decimals=2)
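For clarity, a small self-contained sketch combining the group by count and the percentage calculation on a toy Country column (the values are only examples):
import pandas as pd
from tabulate import tabulate

df = pd.DataFrame({'Country': ['FR', 'FR', 'US', 'FR', 'US', 'DE']})

CountCountry = pd.DataFrame(df.groupby(['Country'], dropna=False).size(), columns=['Total']).reset_index()
CountCountry['Percent'] = (CountCountry['Total'] / CountCountry['Total'].sum()) * 100
CountCountry['Percent'] = CountCountry['Percent'].round(decimals=2)

print(tabulate(CountCountry.sort_values(['Total'], ascending=False), headers='keys', tablefmt='psql', showindex=False))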
Sum group by year
df['Year'] = df['DateInvoice'].dt.to_period('Y') df_AmountYear = df.groupby(['Year'], dropna=False)['AmountEuro'].sum().reset_index() df_AmountYear = df_AmountYear.sort_values(['Year'], ascending=False) print('\nAmount per year:') print(tabulate(df_AmountYear, headers='keys', tablefmt='psql', showindex=False))
Sort on a field
df.sort_values(by=['Field'], ascending=False, inplace=True)

# Recreate the index if necessary (if you will use the index later)
df = df.reset_index(drop=True)
Sort on a field containing both text and numerical values
df = df.loc[pd.to_numeric(df['My field'], errors='coerce').sort_values().index]
Sum and count
If you want to sum an Amount field per country while counting the number of lines:
df_AmountCountries = df.groupby(['Country'], dropna=False)\ .agg({'ID': 'count', 'AmountEuro': 'sum'})\ .reset_index()\ .rename(columns={'ID': 'Number'})
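The same result can also be written with pandas named aggregation, which avoids the rename step; a sketch assuming the same column names as above:
df_AmountCountries = df.groupby(['Country'], dropna=False).agg(
    Number=('ID', 'count'),
    AmountEuro=('AmountEuro', 'sum')
).reset_index()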
Replace, remove, edit with conditions
Replace empty cells
df['speciality'] = df['speciality'].fillna('Other')
Replace a value (the full value)
df['Your field'] = df['Your field'].replace(['Old value'],'New value')
Replace a part of string
df['Your field'] = df['Your field'].replace({'Old value': 'New value'}, regex=True)
Replace a string if it contains
df.loc[df['speciality'].str.contains('Researcher'), 'speciality'] = 'Research Scientist'
If not contains
Add ~:
df.loc[~df['speciality'].str.contains('Researcher'), 'speciality'] = 'Research Scientist'
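In both cases, if the field can contain NaN, str.contains returns NaN for those rows and the boolean mask may raise an error; a hedged variant passing na=False:
# Treat NaN as "does not contain" to keep the mask strictly boolean
df.loc[df['speciality'].str.contains('Researcher', na=False), 'speciality'] = 'Research Scientist'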
Localize and remove a row
ind_drop = df[df['Your field'].apply(lambda x: x == ('A value'))].index df = df.drop(ind_drop)
Localize and remove a row starting with ...
ind_drop = df[df['Your field'].apply(lambda x: x.startswith('A value'))].index df = df.drop(ind_drop)
Localize and remove a row ending with ...
ind_drop = df[df['Your field'].apply(lambda x: x.endswith('A value'))].index df = df.drop(ind_drop)
Localize and replace a full row
df.loc[(df['A field'] == 'TARGET')] = [[NewValue1, NewValue2, NewValue3]]
Localize a row according a regex and edit another field
df.loc[df['Field1'].str.contains(pat='^place ', regex=True), 'Field2'] = 'Yes'
Replace a field with values from another field with a condition
df['Field1'] = np.where(df['Field1'] .apply(lambda x: x.startswith('A string in condition...')), df['Field2'], df['Field1'])
Remove some first characters
Here we delete the first 2 characters if the cell starts with a comma followed by a space.
df['Field'] = df['Field'].apply(lambda x: x[2:] if x.startswith(', ') else x)
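The same cleanup can be done with a vectorized regex replace, removing the prefix only when the cell starts with a comma and a space; a minimal sketch on the same column:
df['Field'] = df['Field'].str.replace(r'^, ', '', regex=True)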
Keep only some first characters
df['Field'] = df['Field'].apply(lambda x: x[:10])
Remove some last characters
df['DateInvoice'] = df['DateInvoice'].apply(lambda x: x[:-4] if x.endswith(' UTC') else x)
Remove the content from a field in another field
df['NewField'] = df.apply(lambda x : x['FieldToWork'].replace(str(x['FieldWithStringToRemove']), ''), axis=1)
Or with a regex, example to remove the content only if it is at the beginning of the field:
df['NewField'] = df.apply(lambda x : re.sub('^'+str(x['StringToRemove']), '', str(x['FieldToWork'])) if str(x['FieldToWork']).startswith(str(x['StringToRemove'])) else str(x['FieldToWork']), axis=1)
Edit with a condition
Increment a field if another field is not empty.
df.loc[df['My field maybe empty'].notna(), 'Field to increment'] += 1
Fill a field if a field is greater or equal to another field.
df.loc[df['Field A'] >= df['Field B'], 'Field to fill'] = 'Yes'
Edit several fields at the same time.
df.loc[df['Field A'] >= df['Field B'], ['Field A to fill', 'Field B to fill']] = ['Yes', 'No']
Edit with several conditions
Condition "AND" (&
)
df.loc[(df['My field maybe empty'].notna()) & (df['An integer field'] == 1) & (df['An string field'] != 'OK'), 'Field to increment'] += 1
Condition "OR" (|
)
df.loc[(df['My field maybe empty'].notna()) | (df['An integer field'] == 1) | (df['An string field'] != 'OK'), 'Field to fill'] = 'Yes'
Edit with IN or NOT IN condition (as SQL)
Just use isin:
df.loc[df['Id field'].isin([531733,569732,652626]), 'Field to edit'] = 'Yes'
And for NOT IN:
df.loc[df['Id field'].isin([531733,569732,652626]) == False, 'Field to edit'] = 'No'
Replace string beginning with
df['id_commune'] = df['id_commune'].str.replace(r'(^75.*$)', '75056', regex=True)
Does not start with
~df['phone'].str.startswith('33')
Remove letters
df['mobile'] = df['mobile'].str.extract('(\d+)', expand=False).fillna('')
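Note that str.extract keeps only the first run of digits; if the goal is really to strip every non-digit character (letters, spaces, dots), a hedged alternative is:
# Remove everything that is not a digit
df['mobile'] = df['mobile'].str.replace(r'\D', '', regex=True).fillna('')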
Extract before or after a string
Example if Job='IT: DBA'
df['type'] = df['Job'].str.split(': ').str[0] df['speciality'] = df['Job'].str.split(': ').str[1]
Remove all after a string
df_Files['new field'] = df_Files['old field'].str.replace("(StringToRemoveWithAfterToo).*","", regex=True)
Remove all before a string
df_Files['file'] = df_Files['file'].str.replace("^.*?_","_", regex=True)
Get in title case
df['firstname'] = df['firstname'].str.title()
Remove if it contains fewer than n characters (length)
df.loc[df['mobile'].str.len() < 6, 'mobile'] = ''
Remove potential spaces before and after a string (trim)
Use .str.strip(), for example:
df.loc[df['My field'].astype(str).str.isdigit() == False, 'My field'] = df['My field'].astype(str).str.strip()
Remove with a function (def)
def MyDeletion():
    # global is required here because df is read and then reassigned inside the function
    global df
    ind_drop = df[df['My field'].apply(lambda x: x == ('My value'))].index
    df = df.drop(ind_drop)
...
MyDeletion()
Decode HTML special char
import html df['My field'] = df['My field'].apply(lambda x: html.unescape(x))
Search
Search for a character, getting its column and row:
from termcolor import colored

for row in df_ExportFeuil1.index:
    for col in df_ExportFeuil1.columns:
        cell_value = str(df_ExportFeuil1.at[row, col])
        if '\u2264' in cell_value:
            print(colored(f'Warning, less-than-or-equal sign at row {row}, column {col}!', 'red'))
Filters
Field contains string
df2 = df1[df1['My field'].str.contains('AAA').fillna(False)] print(tabulate(df2.head(5), headers='keys', tablefmt='psql'))
Field does not contain a string
Just use ~:
df2 = df1[~df1['My field'].str.contains('AAA').fillna(False)] print(tabulate(df2.head(5), headers='keys', tablefmt='psql'))
Field not empty, not NaN, not null
df_notEmpty= df[ (df['My field'].notnull()) & (df['My field'].notna()) & (df['My field'] != '') & (df['My field'] != 'nan') ]
Field is a numeric value
import re

# Function to extract numeric values
def extractNumericValue(my_text):
    my_values = re.findall(r'^-?\d+\.\d+|-?\d+$', str(my_text))
    if my_values:
        return my_values
    else:
        return None

# Apply
df['SearchNumericValue'] = df['My column'].apply(extractNumericValue)

dfResult = df[
    (df['SearchNumericValue'].notna())
    & (~df['My column'].astype(str).str.contains('[a-zA-Z]', regex=True))
]

print(tabulate(dfResult.head(10), headers='keys', tablefmt='psql', showindex=False))
Field matches regex
Here we search where the field is a number with a comma as decimal separator.
dfTest = df_ImportTxt[df_ImportTxt['My field'].str.match(pat='^\d+,\d+$')==True]
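Once those rows are identified, the comma decimal separator can be converted to a real float; a sketch assuming the matched rows only contain numbers like above:
dfTest = dfTest.copy()
dfTest['My field'] = dfTest['My field'].str.replace(',', '.', regex=False).astype(float)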
Field contains an element from a list
Example searching space, star and parenthesis:
MySearch = ['\*', ' ', '\(', '\)'] df2 = df1[df1['zone'].str.contains('|'.join(MySearch)).fillna(False)] print(tabulate(df2.head(40), headers='keys', tablefmt='psql', showindex=False))
Field equals an element from a list
MySearch = ['Bbbb', 'XXX'] df2 = df1[df1['zone'].str.fullmatch('|'.join(MySearch)).fillna(False)] print(tabulate(df2.head(40), headers='keys', tablefmt='psql', showindex=False))
Field contains only digits
df2 = df1.loc[df1['zone'].astype(str).str.isdigit()] print(tabulate(df2.head(5), headers='keys', tablefmt='psql'))
Several conditions
Below we get lines where a field has null values and another field has empty values:
df2 = df1[(df1['My field'].isna()) | (df1['My other field'] == '')] print(tabulate(df2.head(5), headers='keys', tablefmt='psql', showindex=False))
Get lines searching if values are IN a list
searchIn= ['AAA', 'BBB', 'CCC'] df2 = df1[df1['My field'].isin(searchIn)] print(tabulate(df2.head(5), headers='keys', tablefmt='psql', showindex=False))
Field does not end with a concatenation of other fields
Suppose you have this dataframe df:
Language | PDF
Python | docA-Python.pdf
SQL | docF-SQ.pdf
PHP | docS-PHP.pdf
CSS | docG-CSS.pdf
Javascript | docR-Javascript.pdf
HTML | docW-HTML.pdf
Now you would like to extract the lines where the PDF field does not end with [Language field] + '.pdf':
Compare = df[['Language', 'PDF']].apply(lambda x: x['PDF'].endswith(x['Language'] + '.pdf'), axis=1) df_temp = pd.DataFrame(Compare) df_temp.columns =['True or false'] df_concat = pd.concat([df, df_temp], axis=1) df_filtered = df_concat[['Language', 'PDF']][df_concat['True or false'] == False] print(tabulate(df_filtered.head(5), headers='keys', tablefmt='psql', showindex=False))
Language | PDF
SQL | docF-SQ.pdf
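The same filter can be written in one step without the temporary dataframe, by using the boolean Series directly as a mask; a sketch on the same df:
mask = df.apply(lambda x: x['PDF'].endswith(x['Language'] + '.pdf'), axis=1)
df_filtered = df[~mask][['Language', 'PDF']]
print(tabulate(df_filtered.head(5), headers='keys', tablefmt='psql', showindex=False))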
Field does not contain an email 😀
... df_pb_Email = df[df['Email field'].str.contains(r'[^@]+@[^@]+\.[^@]+') == False] ...
Field does not fully match an email
... df_pb_Email = df[df['Email field'].str.fullmatch(r'\b[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Z|a-z]{2,}\b') == False] ...
Field is not a number
dfProblem = df[(df['My field'].astype(str).str.match(pat=r'^[-+]?[0-9]+$|^[-+]?[0-9]+\.[0-9]+$') == False)]
Field contains another field
df['A contains B'] = [x[0] in x[1] for x in zip(df['FieldB'], df['FieldA'])] df= df[df['A contains B'] == False]
Field ends like another field
df['A finish with B'] = [x[1].endswith(x[0]) for x in zip(df['FieldB'], df['FieldA'])] df= df[df['A finish with B'] == False]
Field with a length different from a given number
dfProbleme = df[ df['Field1'].str.len() != 7 ]
Count a string in a field
dfCount = df[(df['Substitutions'].str.count('\\,') > 1)]
Split a column and get an element
Here, splitting with a comma (str.split(',')) and getting the first element (.str[0]):
df['element'] = df['Myfield'].str.split(',').str[0]
Split a column without knowing how many columns will result
df = pd.concat([ df['id'], df['Field1'], df['Field1'].str.split(';', expand=True).add_prefix('Field1_'), df['Field2'], df['Field2'].str.split(';', expand=True).add_prefix('Field2_') ], axis=1 )
Warnings
Avoid index problems after a dataframe concat
Sometimes after a dataframe concatenation, the indexes get mixed up and can cause problems. Just ignore the index in your new dataframe:
df_Selections = pd.concat([firstSelection, secondSelection], axis=0, ignore_index=True)
Avoid copy warning (example 1)
import warnings
from pandas.errors import SettingWithCopyWarning
warnings.simplefilter(action='ignore', category=SettingWithCopyWarning)
Avoid copy warning (example 2)
# Below importations, above your code pd.options.mode.chained_assignment = None
Avoid user warning
import warnings warnings.simplefilter(action='ignore', category=UserWarning)
Avoid future warning
import warnings warnings.simplefilter(action='ignore', category=FutureWarning)
Avoid VisibleDeprecationWarning
import warnings import numpy warnings.filterwarnings('ignore', category=numpy.VisibleDeprecationWarning)
Miscellaneous
Where is Python? Which version?
Some possibilities, for example:
import platform print('Python version:\n', platform.python_version()) import sys print('\nWhere is Python:') print(sys.executable) import sys print('\nPython version:') print(sys.version)
Create a dataframe from a previous dataframe (copy)
df2 = df1.copy()
Copy data from a dataframe to the clipboard
Very convenient to quickly paste into a text file, CSV, Excel, etc.
df.to_clipboard(sep=',', index=False, header=None)
Get a list from a field
MyList = df['My field'].drop_duplicates().dropna().sort_values(ascending=True).tolist() print(MyList)
Get column names in a list
listColumn = df.columns.tolist()
Dedupe
To dedupe a dataframe on an Email field keeping only the first duplicate record, use drop_duplicates.
df['Email'] = df['Email'].str.lower() dfDeduped = df.drop_duplicates(subset=['Email'], keep='first')
Find duplicates
Use duplicated:
df2 = df1[df1.duplicated(['My field'], keep=False)] print(tabulate(df2.head(10), headers='keys', tablefmt='psql', showindex=False))
Find non-duplicates
df2 = df1[~df1.duplicated(['My field'], keep=False)] print(tabulate(df2.head(10), headers='keys', tablefmt='psql', showindex=False))
Export a picture from a chart
And add a border.
from PIL import Image, ImageOps

fig1.savefig('C:/Users/myplot.png', dpi=100)

im = Image.open('C:/Users/myplot.png')
bordered = ImageOps.expand(im, border=1, fill=(0, 0, 0))
bordered.save('C:/Users/myplot.png')
Re-organize columns
df = df[['Field 1', 'Adresse mail', 'Note algorithmique', 'Substitutions']]
Rename dataframe
New_df = pd.DataFrame(Old_df)
Rename columns
From their original name:
df.rename(columns={'Old name 1': 'New name 1', 'Old name 2': 'New name 2'}, inplace=True)
From their position:
df.columns.values[12] = 'New name'
Delete column
df.drop('My field', axis=1, inplace=True)
Format a number with a space as thousands separator
number = df.shape[0] print(format(number, ',').replace(',', ' '), 'customers')
Or:
f"{number_of_rows:,}".replace(',', ' ')
Ranking
To rank according to a field, from highest to lowest:
df['My rank'] = df['My field to rank'].rank(ascending=False)
To avoid duplicate ranks:
df['My rank'] = df['My field to rank'].rank(ascending=False, method='first')
Generate unique random integers
If you want unique random integers from 1 to the number of records (both included):
howManyRecord = df.shape[0] df['Random number'] = np.random.choice(range(howManyRecord), howManyRecord, replace=False) df.loc[df['Random number'] == 0, 'Random number'] = howManyRecord
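An equivalent and slightly shorter way is numpy's permutation, which shuffles the integers directly; a hedged alternative giving the same kind of result:
import numpy as np

howManyRecord = df.shape[0]
# np.random.permutation(n) returns the shuffled values 0..n-1, so add 1
df['Random number'] = np.random.permutation(howManyRecord) + 1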
Select with condition
df = df.loc[df['Field to filter'] == 'Yes']
Prevent nan
Use keep_default_na=False:
Data = pd.read_excel(inputExcelFile, sheet_name='Feuil1', engine='openpyxl', keep_default_na=False)
Get tuples from a file (to build SQL UPDATE queries)
...df.itertuples()...
Very useful to build SQL UPDATE queries from a file with the primary key in the first column:
file1 = open(MyPath+'FixQueries.sql', 'w', encoding='cp1252')

# WRITE THE FILE
for i in df.itertuples(index=False):
    # str(i) looks like "Pandas(col1='x', col2='y', ...)": [7:-1] keeps the inside,
    # split(', ', 1)[-1] drops the first column (the primary key) from the SET clause
    file1.write(
        'UPDATE MyTable SET ' +
        str(i)[7:-1].split(', ', 1)[-1] +
        " WHERE id = '" + i[0] + "' ;\n"
    )
Use index in concatenation
df.index.astype(str)
Reset index
df.reset_index(drop=True, inplace=True)
Drop index
To drop the index, you have to use another field as index.
df = df.set_index('my field')
Open a box to enter a string and use it as a variable
import tkinter as tk from tkinter import simpledialog window = tk.Tk() window.eval('tk::PlaceWindow . center') window.withdraw() repertoireCarto = simpledialog.askstring("Saisie", "Répertoire ?\t\t\t\t", parent=window) print('Le répertoire est', repertoireCarto)
Get some line indexes according to a condition
list_duplicate = df.index[df['My field'] == 'Text'].tolist()
Import external scripts
Your working script needs to know where to find the external scripts:
import sys

sys.path.append('C:\\Users\\Georges\\PycharmProjects\\Your_Directory')
from YourScriptWithoutExtension import SomeVariables, ...
Subprocess
Not related to Pandas but very useful to run shell commands, bash, console scripts, etc.
import subprocess subprocess.run('psql -h localhost -p 5432 -d work -U postgres -c "SELECT NOW() ;"', shell=True)
And to use pgsql2shp in subprocess for a shape export:
subprocess.run( \ '''pgsql2shp -f E:\\Path\\to\\MyShape -u postgres -h localhost work "SELECT * FROM MyTable ;"''', \ shell=False, \ executable='E:\\Path\\to\\pgsql2shp.exe' \ )
And to use PowerShell or OGR2OGR in subprocess:
subprocess.run( \ '''OGR2OGR...''', \ shell=False, \ executable='E:\\Path\\to\\powershell.exe' \ )
Strings comparison
Compare the similarity of 2 strings, to get an indicator in a new field:
import difflib from difflib import SequenceMatcher df['indicator'] = df[['Field1', 'Field2']].apply(lambda x: SequenceMatcher(lambda y: y == " ", x[0], x[1]).ratio(), axis=1)
Measure execution time
import time
...
print('Time for now is %s seconds' % round(time.process_time(), 1))
timeNow = time.process_time()
...
print('Time now is %s seconds' % round(time.process_time() - timeNow, 1))
timeNow = time.process_time()
...
Include external Python code (without a real import)
... YourFileCode = r'C:/Users/Georges/PycharmProjects/WorkEMC/YourFileCode.py' ... exec(compile(open(YourFileCode, 'rb').read(), YourFileCode, 'exec'))
Concatenate fields without empty values
df['adresse'] = df[['Field1', 'Field2', 'Field3']].apply(lambda x: ' '.join(x.dropna()), axis=1)
Force data types when creating the dataframe
df = pd.read_csv(workDirectory + FileName, sep=';', usecols=fieldsExisting, dtype={'firstname': 'str', 'lastname': 'str'} )
- Tips:
- Sometimes when your data comes from an Excel file and contains percentages, dtype will not help because Excel stores percentages as plain numbers. There is a solution in the chapter Advanced read/write in Excel of this article, just search for Fix Excel percentages; a minimal sketch of the idea follows below.
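A minimal sketch of the idea only (the real fix is in the chapter mentioned above): Excel stores 25% as 0.25, so a percentage column read by pandas can be rescaled and formatted by hand. The column name Rate is hypothetical.
import pandas as pd

df = pd.DataFrame({'Rate': [0.25, 0.5, 0.075]})  # values as Excel stores them

# Rescale to 0-100 and format as text with a percent sign
df['Rate'] = (df['Rate'] * 100).round(2).astype(str) + ' %'
print(df)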
Fix scientific notation for phone numbers in Pandas, Tabulate and Excel export
Example for phone numbers:
df['Mobile Phone'] = df['Mobile Phone'].astype(str) df['Mobile Phone'] = df['Mobile Phone'].fillna('') df['Mobile Phone'] = df['Mobile Phone'].replace(['nan'], '') df['Mobile Phone'] = df['Mobile Phone'].apply(lambda x: x[:-2] if x.endswith('.0') else x)
Get a file/directory creation date
import os import time from time import gmtime, strftime myRep = 'C://Path/of/my/directory' myDateCreation = time.ctime(os.path.getctime(myRep)) print('\n' + str(myDateCreation)) myConvertDateCreation = time.strptime(myDateCreation) print("\nFull:", myConvertDateCreation) print("\nDay:", myConvertDateCreation.tm_mday) print("Month:", myConvertDateCreation.tm_mon) print("Year:", myConvertDateCreation.tm_year) print("Hour:", myConvertDateCreation.tm_hour) print("Minutes:", myConvertDateCreation.tm_min) print(strftime("\n%A %d %B %Y %H:%M", myConvertDateCreation ))
Use a dataframe not yet created in a function
Pass it as global:
def MyDeletion(): global df ind_drop = df[df['My field'].apply(lambda x: x == ('My value'))].index df = df.drop(ind_drop)
Organize your screens
Attention: wait for the end of the process to get properly arranged windows!
import subprocess
import clipboard
import pyautogui
import pygetwindow as gw
import win32con
import win32gui
from screeninfo import get_monitors

# Callback function to find file explorer windows
def enum_windows_callback(hwnd, results):
    if win32gui.GetClassName(hwnd) == 'CabinetWClass':
        results.append(hwnd)
    return True

# Get all File Explorer windows
explorer_windows = []
win32gui.EnumWindows(enum_windows_callback, explorer_windows)

# Close every File Explorer window
for hwnd in explorer_windows:
    # Activate window
    win32gui.ShowWindow(hwnd, win32con.SW_SHOW)
    win32gui.SetForegroundWindow(hwnd)
    # Use pyautogui to send the Alt+F4 key combination to close the window
    pyautogui.hotkey('alt', 'f4')

# Open File Explorer
subprocess.call("start explorer", shell=True)

# Wait a short while for File Explorer to fully open
pyautogui.sleep(1)

# Select the desired screen (for example, the first screen)
monitors = get_monitors()
screen = monitors[0]

# Get the coordinates of the top-left corner of the screen
left = screen.x
top = screen.y

# Find the File Explorer window
explorer_window = gw.getWindowsWithTitle('Explorateur de fichiers')[0]

# Desired position on the screen
pyautogui.sleep(0.5)
x_position = 10  # To the right
y_position = 15  # Towards the bottom
explorer_window.moveTo(x_position, y_position)

# Size
pyautogui.sleep(0.5)
width = 600
height = 600
explorer_window.resizeTo(width, height)

# Directory
pyautogui.sleep(0.5)
directory_path = 'C:\\Users\\Georges\\Downloads'
clipboard.copy(directory_path)
pyautogui.hotkey('alt', 'a')
# pyautogui.typewrite(directory_path)
pyautogui.hotkey('ctrl', 'v')
pyautogui.press('enter')
# pyautogui.click(button='right')
# subprocess.call("start notepad", shell=True)
Search a string in database
from tabulate import tabulate as tab
import pandas as pd
from sqlalchemy import create_engine

SearchWord = 'anyxxxtube'

username = 'XXXXX'
password = 'XXXXX'
port = 3306
database = 'XXXXX'
host = 'XXXXX'

engine = create_engine('mysql+mysqldb://%s:%s@%s:%i/%s' % (username, password, host, port, database))

# SEARCH TABLES AND FIELDS
sql = 'SELECT table_name, column_name FROM INFORMATION_SCHEMA.COLUMNS WHERE table_name LIKE "prime_%%" ;'
df = pd.read_sql_query(sql, engine)
print(tab(df.head(5), headers='keys', tablefmt='psql', showindex=False))
print(df.shape[0])

# BUILD QUERIES
Assemblage = list(zip(df['table_name'].tolist(), df['column_name'].tolist()))
# for a, b in Assemblage:
#     print(a, ' - ', b)

for a, b in Assemblage:
    sql = 'SELECT `' + b + '` FROM `' + a + '` WHERE `' + b + '` LIKE "%%' + SearchWord + '%%" ;'
    df = pd.read_sql_query(sql, engine)
    if df.shape[0] > 0:
        print('In table "' + a + '" :')
        print(tab(df.head(5), headers='keys', tablefmt='psql', showindex=False))
        print(df.shape[0])
Name dataframes dynamically
In a loop for example, with globals():
ListFiles = ['Stats Name1.xlsx', 'Stats Name2.xlsx']

for f in ListFiles:
    dfTemp = pd.read_excel(outputDir + f, sheet_name='Sexe', engine='openpyxl')
    name = f.replace('.xlsx', '').replace('Stats ', '')
    globals()[name] = dfTemp
    print('\n' + name + ' :')
    print(tabulate(globals()[name].head(35), headers='keys', tablefmt='psql', showindex=False))