
Pandas is an excellent Python library to handle data. Although I have barely started using it, it already solves some of my recurring needs. Matplotlib allows you to create advanced charts, and Openpyxl is very useful to manage Excel files, for example to read or generate them.

So here are some tips, for my own memory and maybe for you.

Let's learn data analysis with these 3 beautiful tools. First install or check Python and pip, then the libraries.

pip install pandas
pip install matplotlib
pip install openpyxl
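
To quickly check that everything is in place, you can print the installed versions (a minimal sketch; the import names are the standard ones):

import pandas as pd
import matplotlib
import openpyxl

# Print the versions to confirm the installs
print(pd.__version__)
print(matplotlib.__version__)
print(openpyxl.__version__)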

Read and display an Excel file

import pandas as pd
from tabulate import tabulate
 
df = pd.read_excel('C:/Users/Your file.xlsx',
                   sheet_name='Export', engine='openpyxl', usecols=['My first column', 'My second column'])
 
print(tabulate(df.head(35), headers='keys', tablefmt='psql'))

I set engine='openpyxl' because the source is an XLSX file. sheet_name specifies the sheet to import, and usecols selects the columns to keep.

Then I use tabulate to display my data properly, in a Postgres way (tablefmt='psql').

With head(35) I display only the first 35 rows, but the dataframe (here named df) contains all the data.

If you want to re-organize the columns, please see the Miscellaneous chapter.

Read a CSV file

df_subscriber = pd.read_csv(workDirectory+'joo_acymailing_subscriber.csv', sep=';', usecols=['subid', 'source', 'email'])
print(tabulate(df_subscriber.head(10), headers='keys', tablefmt='psql', showindex=False))

Read a non-standard file

Sometimes a file has a wrong structure, or you cannot trust the future column names.

For example, to read a file starting from a specific row and skipping the last rows:

import pandas as pd
from tabulate import tabulate
 
df = pd.read_excel(r'C:/Users/Georges/Downloads/temp_QGIS/test excel.xlsx',
                   sheet_name='Sheet1', engine='openpyxl', usecols=[0,1], header=5-1, skipfooter=1, names=['id', 'value'])
 
print(tabulate(df.head(35), headers='keys', tablefmt='psql', showindex=False))

Export a data-frame in an Excel file

df.to_excel("C:/Users/Your new file.xlsx", index=False, sheet_name='Tab 1', header=None)

index=False to not get the index column, header=None to skip the header row.

Export several dataframes in Excel sheets

If you have several data-frames to export into one Excel file, each in its own sheet, you should open an ExcelWriter. And close it.

writer = pd.ExcelWriter('C:/Users/Your new file.xlsx', engine='xlsxwriter')
 
df_Country.to_excel(writer, index=False, sheet_name='Country', header=['Country', 'Total'])
df_Emails.to_excel(writer, index=False, sheet_name='Emails', header=['Emails', 'Total'])
 
writer.save()
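
Note that writer.save() was removed in pandas 2.0. With a recent pandas, a context manager closes the writer for you; a minimal sketch of the same export, assuming the same data-frames:

with pd.ExcelWriter('C:/Users/Your new file.xlsx', engine='xlsxwriter') as writer:
    df_Country.to_excel(writer, index=False, sheet_name='Country', header=['Country', 'Total'])
    df_Emails.to_excel(writer, index=False, sheet_name='Emails', header=['Emails', 'Total'])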

Export several dataframes in the same sheet

df.to_excel(writer, index=False, sheet_name='Registrations', startcol=3, startrow=34, header=['Field', 'Total', '%'])

Add filters in all columns in one sheet in an Excel file

import openpyxl
from openpyxl.utils import get_column_letter
 
outputExcelFile = r'C:/Users/Your file.xlsx'
workbook = openpyxl.load_workbook(outputExcelFile)
 
worksheet = workbook['Sheet name']
FullRange = 'A1:' + get_column_letter(worksheet.max_column) + str(worksheet.max_row)
worksheet.auto_filter.ref = FullRange
 
workbook.save(outputExcelFile)

Add filters in all columns in all sheets in an Excel file

import openpyxl
from openpyxl.utils import get_column_letter
 
outputExcelFile = r'C:/Users/Your file.xlsx'
 
workbook = openpyxl.load_workbook(outputExcelFile)
sheetsList = workbook.sheetnames

for sheet in sheetsList:
    worksheet = workbook[sheet]
    FullRange = 'A1:' + get_column_letter(worksheet.max_column) + str(worksheet.max_row)
    worksheet.auto_filter.ref = FullRange

# Save once, after all sheets are processed
workbook.save(outputExcelFile)

Add colors to all column names in all sheets in an Excel file

import openpyxl
 
from openpyxl.styles import PatternFill
 
outputExcelFile = r'C:/Users/Your File.xlsx'
 
workbook = openpyxl.load_workbook(outputExcelFile)
sheetsList = workbook.sheetnames

for sheet in sheetsList:
    worksheet = workbook[sheet]
    for cell in worksheet[1]:
        cell.fill = PatternFill(fgColor='FFC6C1C1', fill_type='solid')

# Do not forget to save, otherwise the styling is lost
workbook.save(outputExcelFile)

Modify a column width in one sheet in an Excel file

import openpyxl
 
outputExcelFile = r'C:/Users/Your File.xlsx'
 
workbook = openpyxl.load_workbook(outputExcelFile)
 
workbook['Your sheet'].column_dimensions['A'].width = 75
workbook.save(outputExcelFile)

Modify all columns width in all sheets in an Excel file

The code below sets the width of column A to 30, and the others to 10.

import openpyxl
from openpyxl.utils import get_column_letter
 
outputExcelFile = r'C:/Users/Your File.xlsx'
 
workbook = openpyxl.load_workbook(outputExcelFile)
sheetsList = workbook.sheetnames

for sheet in sheetsList:
    for cell in workbook[sheet][1]:
        if get_column_letter(cell.column) == 'A':
            workbook[sheet].column_dimensions[get_column_letter(cell.column)].width = 30
        else:
            workbook[sheet].column_dimensions[get_column_letter(cell.column)].width = 10

# Save once at the end
workbook.save(outputExcelFile)

Freeze the top row of an Excel file

worksheet.freeze_panes = 'A2'

Change font color and style in an Excel file

from openpyxl.styles import Font
 
worksheet['A1'].font = Font(color='FFFFFF', italic=False, bold=True)
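
Both snippets above assume a workbook is already open. A minimal end-to-end sketch (the path and sheet name are placeholders):

import openpyxl
from openpyxl.styles import Font

outputExcelFile = r'C:/Users/Your File.xlsx'
workbook = openpyxl.load_workbook(outputExcelFile)
worksheet = workbook['Your sheet']

# Freeze the top row and style the first header cell
worksheet.freeze_panes = 'A2'
worksheet['A1'].font = Font(color='FFFFFF', italic=False, bold=True)

workbook.save(outputExcelFile)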

Work on several Excel files

Here we freeze the top row, add filters, color the headers and change the width of all columns in 3 Excel files.

import openpyxl
from openpyxl.utils import get_column_letter
from openpyxl.styles import PatternFill
 
 
File1 = r'D:/Work/My file 1.xlsx'
File2 = r'D:/Work/My file 2.xlsx'
File3 = r'D:/Work/My file 3.xlsx'
 
ExcelList = [File1, File2, File3]
 
for ExcelFile in ExcelList:
    workbook = openpyxl.load_workbook(ExcelFile)
    sheetsList = workbook.sheetnames

    for sheet in sheetsList:
        worksheet = workbook[sheet]
        FullRange = 'A1:' + get_column_letter(worksheet.max_column) + str(worksheet.max_row)
        worksheet.auto_filter.ref = FullRange
        worksheet.freeze_panes = 'A2'
        for cell in worksheet[1]:
            cell.fill = PatternFill(fgColor='FFC6C1C1', fill_type='solid')
            if get_column_letter(cell.column) == 'A':
                worksheet.column_dimensions[get_column_letter(cell.column)].width = 12
            else:
                worksheet.column_dimensions[get_column_letter(cell.column)].width = 20

    # Save each file once, after all its sheets are processed
    workbook.save(ExcelFile)

Insert a picture in an Excel file

And delete the picture afterwards.

import os
import openpyxl

img = openpyxl.drawing.image.Image('C:/Users/myplot.png')
img.anchor = 'E6'

workbook['Your sheet'].add_image(img)
workbook.save(outputExcelFile)

os.remove('C:/Users/myplot.png')

Open a box to search a file and get its path

Use Tkinter.

from tkinter import Tk
from tkinter.filedialog import askopenfilename
 
Tk().withdraw()
filename = askopenfilename()
print(filename)

Open a box to save an Excel file

from tkinter import Tk
from tkinter.filedialog import asksaveasfilename
 
fileName = asksaveasfilename(filetypes=(("Excel files", "*.xlsx"), ("All files", "*.*")), defaultextension='.xlsx')
df.to_excel(fileName, index=False, sheet_name='Tab 1', header=None)

 


Analyze data from a multi-valued field

Suppose you have a field containing values coming from a multi-valued drop-down list, and you want to count how many times each value exists.

If values were unique in each cell, a simple GROUP BY in SQL would do the job.

But here we first have to split each cell to recover the unique values; then we will be able to count them.
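
To make it concrete, here is a made-up sample (the column name and values are invented) and the counts we are aiming for:

import pandas as pd

df = pd.DataFrame({'Your field': ['Red, Blue', 'Blue', 'Red, Green, Blue']})
print(df)
# Expected counts: Blue 3, Red 2, Green 1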


Split the data

Start by splitting the data with str.split, specifying the separator and expand=True, to create new columns storing each unique value.

import pandas as pd
 
df = pd.read_excel('C:/Users/Georges/Downloads/My Excel file.xlsx',
                   sheet_name='Sheet1', engine='openpyxl')
 
serie = df['Your field'].str.split(', ', expand=True)
 
print(serie)

Great! Now we get virtual columns containing only one value each, with as many columns as necessary.

Melt the virtual data

Now let's melt our virtual columns into just one.

import pandas as pd
 
df = pd.read_excel('C:/Users/Georges/Downloads/My Excel file.xlsx',
                   sheet_name='Sheet1', engine='openpyxl')
 
serieMelt = pd.melt(df['Your field'].str.split(', ',expand=True))['value']
 
print(serieMelt)

pd.melt names the melted column value, which we can then select.

Count unique values

Now just put our melted data in a new data-frame, before applying a groupby with size() (a COUNT in SQL).

The full code will be:

import pandas as pd
 
df = pd.read_excel('C:/Users/Georges/Downloads/My Excel file.xlsx', sheet_name='Sheet1', engine='openpyxl')
 
temp_df = pd.DataFrame(pd.melt(df['Your field'].str.split(', ', expand=True))['value'])
 
df_Count = pd.DataFrame(temp_df.groupby(['value']).size(), columns=['Total']).reset_index()
 
print(df_Count)

We add a reset_index() to recover the column names properly, to use them later.

Percentage

OK, but suppose you also want the recurrence percentage of each unique value. The usual method would take the total from the melted data, giving strange results: indeed, melting stacked every value from the multi-valued field into new rows, so the total is far too high!

To get real percentages you have to use shape[0] to count the real number of rows in your original data-frame, and use that in the percentage calculation.

We start by creating a new field in the results data-frame, then we calculate the percentages and round them.

df_Count['Percent'] = (df_Count['Total'] / df.shape[0]) * 100
df_Count['Percent'] = df_Count['Percent'].round(decimals=2)

Of course you cannot sum the percentages: the total will exceed 100%, and that is normal.

Empty values management

Good. But what about our empty values? Just add dropna=False to see them.

import pandas as pd
 
df = pd.read_excel('C:/Users/Georges/Downloads/My Excel file.xlsx', sheet_name='Sheet1', engine='openpyxl')
 
temp_df = pd.DataFrame(pd.melt(df['Your field'].str.split(', ', expand=True))['value'])
df_Count = pd.DataFrame(temp_df.groupby(['value'], dropna=False).size(), columns=['Total']).reset_index()
 
df_Count['Percent'] = (df_Count['Total'] / df.shape[0]) * 100
df_Count['Percent'] = df_Count['Percent'].round(decimals=2)
 
print(df_Count)

Any problem?

Arrgh! The groupby counted empty values from the melted data, and melting added as many new columns as the record with the highest number of unique values, each of them padded with empty values... This skews the count and the percentage.

We have to cheat to fix that.

  1. import pandas as pd
  2.  
  3. df = pd.read_excel('C:/Users/Georges/Downloads/My Excel file.xlsx', sheet_name='Sheet1', engine='openpyxl')
  4.  
  5. temp_df = pd.DataFrame(pd.melt(df['Your field'].str.split(', ', expand=True))['value'])
  6. df_Count = pd.DataFrame(temp_df.groupby(['value'], dropna=False).size(), columns=['Total']).reset_index()
  7. df_Count = df_Count.fillna('AZERTY')
  8.  
  9. df_Count['Percent'] = (df_Count['Total'] / df.shape[0]) * 100
  10. df_Count['Percent'] = df_Count['Percent'].round(decimals=2)
  11.  
  12. valuesEmpty = df['Your field'].isna().sum()
  13. valuesEmptyPercent = round((valuesEmpty / df.shape[0]) * 100, 2)
  14.  
  15. df_Count.loc[(df_Count['value'] == 'AZERTY')] = [['Unknown', valuesEmpty, valuesEmptyPercent]]
  16. df_Count = df_Count.sort_values(['Total'], ascending=False)
  17.  
  18. print(df_Count)

Line 7 adds a specific placeholder value (AZERTY) in the results data-frame, to mark the row with the wrong empty-values calculation.

Lines 12 and 13 count the empty values and their percentage separately, and store the results in dedicated variables.

Then line 15 replaces the wrong row with our new value and variables, using loc to identify the line to fix.

Finally we can sort our data-frame with sort_values(['Total']).


Work with dates

Group by months

If you have a data-frame df with a field Created at containing dates like:

2021-02-05 10:34:21 UTC

You can work with it. For example, to group and sort by month, you can cheat by doing:

df_Temp = pd.DataFrame(df['Created at'].str[:7])
df_Created_count = pd.DataFrame(df_Temp.groupby(['Created at'], dropna=False).size(), columns=['Total']).sort_values(['Created at'], ascending=True).reset_index()

Indeed, with str[:7] we extract the year and month from the strings. Then a usual groupby.

OK, but it is more pandas-idiomatic to convert your field to a true date format, in a new virtual field. Then you can extract real months and store them in a proper data-frame.

df['Created'] = pd.to_datetime(df['Created at'])
df['Created'] = df['Created'].dt.to_period("M")
df_TempCreated = pd.DataFrame(df['Created'])

Then easy to group and count them.

df_Created_count = pd.DataFrame(df_TempCreated.groupby(['Created'], dropna=False).size(), columns=['Total']).sort_values(['Created'], ascending=True).reset_index()

As now you use a real date format, you can display the full months with dt.strftime.

df_Created_count['Created'] = df_Created_count['Created'].dt.strftime('%B %Y')

Extract the previous month

today = pd.Timestamp.today()
PreviousMonth = today - pd.DateOffset(months=1)
print(PreviousMonth.strftime("%Y-%m"))

Pie Chart

Simple pie

If you have a data-frame with fields Categories and Total, you can create a pie chart.

import matplotlib.pyplot as plt
 
chartLabel = df_Categories_count['Categories'].tolist()
chartValue = df_Categories_count['Total'].tolist()
 
colors = plt.rcParams['axes.prop_cycle'].by_key()['color']
 
fig1 = plt.figure()
plt.pie(chartValue, labels=chartLabel, colors=colors, autopct=None, shadow=False, startangle=90)
plt.axis('equal')
plt.title('Categories', pad=20, fontsize=15)
 
plt.show()
plt.clf()

Display percentages in the chart

Use autopct.

plt.pie(chartValue, labels=chartLabel, colors=colors, autopct='%1.1f%%', shadow=False, startangle=90)

Change the labels font size

plt.rcParams['font.size'] = 5

Hide a label in chart

Just replace the value with an empty string.

chartLabel[-1] = ''

Add a legend in your chart

plt.legend(chartLabel, loc='best', fontsize=8)

Explode

To explode your chart, you need to pass values to explode. For example, if you have 4 values to plot:

plt.pie(myValue, labels=myLabels, colors=colors, explode=[0.05, 0.05, 0.05, 0.05], autopct=None, shadow=False, startangle=90)

If you want to explode all parts by default, just build the list first:

explodeValues = []
for i in myValue:
    explodeValues.append(0.05)
 
plt.pie(myValue, labels=myLabels, colors=colors, explode=explodeValues, autopct=None, shadow=False, startangle=90)

Add a legend with several fields

For example, you want to display labels and percentages in the legend. First calculate the percentages in another field of your data-frame, then:

import matplotlib.pyplot as plt
 
chartLabel = df_Categories_count['Categories'].tolist()
chartLegendLabel = df_Categories_count['Categories'].tolist()
chartValue = df_Categories_count['Total'].tolist()
chartLegendPercent = df_Categories_count['Percent'].tolist()
 
legendLabels = []
for i, j in zip(chartLegendLabel, map(str, chartLegendPercent)):
    legendLabels.append(i + ' (' + j + ' %)')
 
colors = plt.rcParams['axes.prop_cycle'].by_key()['color']
 
fig1 = plt.figure()
plt.pie(chartValue, labels=chartLabel, colors=colors, autopct=None, shadow=False, startangle=90)
plt.axis('equal')
 
plt.title('Categories', pad=20, fontsize=15)
plt.legend(legendLabels, loc='best', fontsize=8)
 
plt.show()
plt.clf()

Bar chart

Vertical bar chart

chartLabel = df_Created_count['Labels field'].tolist()
chartValue = df_Created_count['Values field'].tolist()
 
fig5 = plt.figure(figsize=(13,6))
bar_plot = plt.bar(chartLabel, chartValue)
 
# plt.ylabel('yyy')
# plt.xlabel('xxx')
plt.xticks(rotation=30, ha='right')
 
# HIDE BORDERS
plt.gca().spines['left'].set_color('none')
plt.gca().spines['right'].set_color('none')
plt.gca().spines['top'].set_color('none')
 
# HIDE TICKS
plt.tick_params(axis='y', labelsize=0, length=0)
plt.yticks([])
 
# ADD VALUE ON THE END OF HORIZONTAL BARS
# for index, value in enumerate(chartValue):
#     plt.text(value, index, str(value))
 
# ADD VALUE ON THE TOP OF VERTICAL BARS
def autolabel(rects):
    # Iterate over the bars passed as a parameter, not the outer variable
    for idx, rect in enumerate(rects):
        height = rect.get_height()
        plt.text(rect.get_x() + rect.get_width()/2, height, chartValue[idx], ha='center', va='bottom', rotation=0)

autolabel(bar_plot)
 
plt.title('Registrations by month', pad=20, fontsize=15)
 
plt.show()
plt.clf()

Work with shape files

Here we will use numpy and basemap.

pip install numpy
pip install basemap

To install basemap on Windows, download a package here, according to your versions, and install it with the wheel.
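
Installing a downloaded wheel is just a matter of pointing pip at the file; the file name below is only an example, yours depends on your Python version:

pip install C:/Users/Downloads/basemap-1.3.2-cp39-cp39-win_amd64.whl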

Thanks to ramiro.org: I got the code below on that website and just adapted it.

Suppose you have an Excel file with a country field: you can count the countries and map them with a country shapefile. The country names in the Excel file and in the shapefile must be the same.

import matplotlib as mpl
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

from matplotlib.patches import Polygon
from matplotlib.collections import PatchCollection
from mpl_toolkits.basemap import Basemap

from PIL import Image, ImageOps

shp_simple_countries = r'C:/Users/Downloads/simple_countries/simple_countries'

inputExcelFile = r'C:/Users/Downloads/My file.xlsx'

df = pd.read_excel(inputExcelFile, sheet_name='Export', engine='openpyxl', usecols=['ID', 'My Country Field'])

# Count the countries
df_Country_count = pd.DataFrame(df.groupby(['My Country Field'], dropna=True).size(), columns=['Total']).sort_values(['Total'], ascending=False).reset_index()
df_Country_count = df_Country_count.fillna('Unknown')

df_Country_count['Percent'] = (df_Country_count['Total'] / df_Country_count['Total'].sum()) * 100
df_Country_count['Percent'] = df_Country_count['Percent'].round(decimals=2)

# Set countries as index
df_Country_count.set_index('My Country Field', inplace=True)

my_values = df_Country_count['Percent']

num_colors = 30
cm = plt.get_cmap('Blues')

scheme = [cm(i / num_colors) for i in range(num_colors)]
my_range = np.linspace(my_values.min(), my_values.max(), num_colors)

# np.digitize NUMBERS THE BINS FROM 1, SO -1 KEEPS THE INDEXES INSIDE THE scheme LIST
df_Country_count['Percent'] = np.digitize(my_values, my_range) - 1

map1 = plt.figure(figsize=(14, 8))
ax = map1.add_subplot(111, frame_on=False)

map1.suptitle('Countries', fontsize=30, y=.95)

m = Basemap(lon_0=0, projection='robin')
m.drawmapboundary(color='w')

m.readshapefile(shp_simple_countries, 'units', color='#444444', linewidth=.2, default_encoding='iso-8859-15')

# Create the choropleth map
for info, shape in zip(m.units_info, m.units):
    shp_ctry = info['COUNTRY_HB']
    if shp_ctry not in df_Country_count.index:
        color = '#dddddd'
    else:
        color = scheme[df_Country_count.loc[shp_ctry]['Percent']]

    patches = [Polygon(np.array(shape), True)]
    pc = PatchCollection(patches)
    pc.set_facecolor(color)
    ax.add_collection(pc)

# Cover up Antarctica so legend can be placed over it
ax.axhspan(0, 1000 * 1800, facecolor='w', edgecolor='w', zorder=2)

# Draw color legend
ax_legend = map1.add_axes([0.2, 0.14, 0.6, 0.03], zorder=3)
cmap = mpl.colors.ListedColormap(scheme)

cb = mpl.colorbar.ColorbarBase(ax_legend, cmap=cmap, ticks=my_range, boundaries=my_range, orientation='horizontal')
cb.ax.set_xticklabels([str(round(i, 1)) for i in my_range])
cb.ax.tick_params(labelsize=7)
cb.set_label('Percentage', rotation=0)
#cb.remove()

# Set the map footer
# description = 'Bla bla bla'
# plt.annotate(description, xy=(-.8, -3.2), size=14, xycoords='axes fraction')

map1.savefig('C:/Users/Downloads/mymap1.png', dpi=100, bbox_inches='tight')

plt.show()
plt.clf()

im = Image.open('C:/Users/Downloads/mymap1.png')
bordered = ImageOps.expand(im, border=1, fill=(0, 0, 0))
bordered.save('C:/Users/Downloads/mymap1.png')

Relations

SQL LEFT JOIN

df_Join = pd.merge(df1, df2, left_on='Email', right_on='email', how='left')[['Email', 'source']]
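
A quick self-contained check with made-up data (df1 and df2 are invented here) to see the LEFT JOIN behavior:

import pandas as pd

df1 = pd.DataFrame({'Email': ['a@x.com', 'b@y.com', 'c@z.com']})
df2 = pd.DataFrame({'email': ['a@x.com', 'c@z.com'], 'source': ['web', 'import']})

# All rows of df1 are kept; rows without a match get NaN in source
df_Join = pd.merge(df1, df2, left_on='Email', right_on='email', how='left')[['Email', 'source']]
print(df_Join)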

SQL INNER JOIN

df_Join = pd.merge(df1, df2, on='Email')

Find difference

df_difference = pd.DataFrame(df_First[~df_First['Email'].isin(df_Second['email'])])

 


Group, count and sort

Count rows

number = df.shape[0]
print(number)

Count values (group by)

If you have a field named Country in a data-frame named df:

CountCountry = df.groupby(['Country']).size().sort_values(ascending=False)
print(CountCountry)

Count a specific value

print(len(df[df['Your field'] == 'United States']))

Count with condition

myCount = df[df['Field where search the condition'] == 'A value'].count()['Field to count']

Count empty values

number = df['Your field'].isna().sum()

Keep empty values in a group by and fill them

newDf = pd.DataFrame(df.groupby(['Your field'], dropna=False).size(), columns=['Total']).sort_values(['Total'], ascending=False).reset_index()
newDf = newDf.fillna('Unknown')

Count email domains (group by)

You have an Email field with clean emails. Thanks to Pandas it is easy to group and count their domains.

First create a virtual field splitting the emails and keeping the 2nd part of the split. Then just group by your new field.

df['Domain'] = df['Email'].str.split('@').str[1]
df_Email_DNS_count = pd.DataFrame(df.groupby(['Domain']).size(), columns=['Total'])\
    .sort_values(['Total'], ascending=False).reset_index()

Calculate percentages

According to your needs, you can use the sum of a Total field.

df['Percent'] = (df['Total'] / df['Total'].sum()) * 100
df['Percent'] = df['Percent'].round(decimals=2)

Or the number of rows in your data-frame (useful if you work with a multi-valued field; see the Analyze data from a multi-valued field chapter above).

df['Percent'] = (df['Total'] / df.shape[0]) * 100
df['Percent'] = df['Percent'].round(decimals=2)

Sort on a field 

df.sort_values(by=['Field'], inplace=True)

 


Replace, remove, edit with conditions

Replace a value (the full value)

df['Your field'] = df['Your field'].replace(['Old value'],'New value')

Replace a part of string

df['Your field'] = df['Your field'].replace({'Old value': 'New value'}, regex=True)

Replace a string if it contains

df.loc[df['speciality'].str.contains('Researcher'), 'speciality'] = 'Research Scientist'

Localize and remove a row

ind_drop = df[df['Your field'].apply(lambda x: x == 'A value')].index
df = df.drop(ind_drop)

Localize and remove a row starting with ...

ind_drop = df[df['Your field'].apply(lambda x: x.startswith('A value'))].index
df = df.drop(ind_drop)

Localize and remove a row ending with ...

ind_drop = df[df['Your field'].apply(lambda x: x.endswith('A value'))].index
df = df.drop(ind_drop)

Localize and replace a row

df.loc[(df['A field'] == 'TARGET')] = [[NewValue1, NewValue2, NewValue3]]

Remove some first characters

Here we delete the first 2 characters if the cell starts with a comma followed by a space.

df['Field'] = df['Field'].apply(lambda x: x[2:] if x.startswith(', ') else x)

Edit with a condition

Increment a field if another field is not empty.

df.loc[df['My field maybe empty'].notna(), 'Field to increment'] += 1

Fill a field if a field is greater than or equal to another field.

df.loc[df['Field A'] >= df['Field B'], 'Field to fill'] = 'Yes'

Edit several fields at the same time.

df.loc[df['Field A'] >= df['Field B'], ['Field A to fill', 'Field B to fill']] = ['Yes', 'No']

Edit with several conditions

Condition "AND" (&)
df.loc[(df['My field maybe empty'].notna()) & (df['An integer field'] == 1) & (df['A string field'] != 'OK'), 'Field to increment'] += 1

Condition "OR" (|)
df.loc[(df['My field maybe empty'].notna()) | (df['An integer field'] == 1) | (df['A string field'] != 'OK'), 'Field to fill'] = 'Yes'

Edit with IN or NOT IN condition (as SQL)

Just use isin:

df.loc[df['Id field'].isin([531733,569732,652626]), 'Field to edit'] = 'Yes'

And for NOT IN, negate the condition with ~:

df.loc[~df['Id field'].isin([531733,569732,652626]), 'Field to edit'] = 'No'

 


Merge

Merge 2 lists into 2 columns and properly display them in the terminal


import pandas as pd
from tabulate import tabulate
 
LanguageList = ['Python', 'SQL', 'PHP', 'CSS', 'Javascript', 'HTML']
LevelList = ['Fun', 'OK', 'Arghh', 'OK', 'Arghh', 'Easy']
 
LanguageLevel = list(zip(LanguageList, LevelList))
 
df = pd.DataFrame(LanguageLevel, columns =['Language', 'Level'])
 
print(tabulate(df, headers='keys', tablefmt='psql', showindex=False))

The 7th line uses zip to link the values of our lists into tuples; list encloses those tuples in a list.

The 9th line creates a data-frame (pd.DataFrame) from our new list (now containing tuples), specifying the column labels.

The print uses tabulate to display our fields with a beautiful Postgres style (psql).

Split a dataframe with conditions, then merge the splits according to other conditions

firstSelection = df.loc[df['Field'] == 'Yes'].head(100)
 
secondSelection = df.loc[df['Field'] == 'No'].head(200)
 
# Merge the 2 selections.
df_Selections = pd.concat([firstSelection, secondSelection], axis=0)

 


Warnings

Avoid indexation problems after dataframes concat

Sometimes after a dataframe concatenation the indexes are jumbled and can cause problems. Just ignore the index in your new dataframe:

df_Selections = pd.concat([firstSelection, secondSelection], axis=0, ignore_index=True)

Avoid copy warning after a field concatenation

import warnings
from pandas.errors import SettingWithCopyWarning  # in older pandas: from pandas.core.common import SettingWithCopyWarning

warnings.simplefilter(action='ignore', category=SettingWithCopyWarning)

 


Miscellaneous

Copy data from a dataframe in the clipboard

Very convenient to quickly copy-paste into a text file, a CSV, Excel...

df.to_clipboard(sep=',', index=False, header=None)

Dedupe

To dedupe a dataframe on an Email field, keeping only the first of the duplicated records.

dfDeduped = df.drop_duplicates(subset=['Email'], keep='first')

Export a picture from a chart

And add a border.

from PIL import Image, ImageOps
 
fig1.savefig('C:/Users/myplot.png', dpi=100)
 
# Open the file we just saved, then overwrite it with the bordered version
im = Image.open('C:/Users/myplot.png')
bordered = ImageOps.expand(im, border=1, fill=(0, 0, 0))
bordered.save('C:/Users/myplot.png')

Re-organize columns

df = df[['Field 1', 'Adresse mail', 'Note algorithmique', 'Substitutions']]

Re-name columns

df.rename(columns={'Old name 1': 'New name 1', 'Old name 2': 'New name 2'}, inplace=True)

Ranking

To rank according to a field, from highest to lowest:

df['My rank'] = df['My field to rank'].rank(ascending=False)

To avoid duplicated ranks:

df['My rank'] = df['My field to rank'].rank(ascending=False, method='first')

Generate a unique random integer

If you want a unique random integer per row, from 1 to the number of records, both included.

import numpy as np

howManyRecord = df.shape[0]
# choice draws each integer from 0 to n-1 exactly once; swap the single 0 for n to get 1..n
df['Random number'] = np.random.choice(range(howManyRecord), howManyRecord, replace=False)
df.loc[df['Random number'] == 0, 'Random number'] = howManyRecord

Select with condition

df = df.loc[df['Field to filter'] == 'Yes']