
Pandas is an excellent Python library to manage data. Having barely started using it, it already solves some of my recurring needs. Matplotlib allows you to create advanced charts, and Openpyxl is very useful to manage Excel files, for example to read or generate them.

So here are some tips, for my own memory and maybe for you.

Let's learn data analysis with these 3 beautiful tools. First install or check Python and Pip, then the libraries.

pip install pandas
pip install matplotlib
pip install openpyxl
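
If you want to check that everything is in place afterwards, a quick sketch (all three libraries expose __version__):

import pandas as pd
import matplotlib
import openpyxl
 
# Print the installed versions to confirm the setup.
print(pd.__version__, matplotlib.__version__, openpyxl.__version__)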

Work with data files (CSV, Excel, MySQL, Postgres, JSON...)

Read and display an Excel file

import pandas as pd
from tabulate import tabulate
 
df = pd.read_excel('C:/Users/Your file.xlsx',
                   sheet_name='Export', engine='openpyxl', usecols=['My first column', 'My second column'])
 
print(tabulate(df.head(35), headers='keys', tablefmt='psql', showindex=True))

I set engine='openpyxl' because it is an XLSX file. sheet_name specifies the sheet to import; select your columns with usecols.

Then I use tabulate to display my data properly, in a Postgres style (tablefmt='psql').

With head(35) I display only the first 35 lines, but the dataframe, here named df, contains all the data.

If you want to re-organize the columns, please see the Miscellaneous chapter.

Read a CSV file

import pandas as pd
from tabulate import tabulate
 
df_subscriber = pd.read_csv(workDirectory+'joo_acymailing_subscriber.csv', sep=';', usecols=['subid', 'source', 'email'])
 
print(tabulate(df_subscriber.head(10), headers='keys', tablefmt='psql', showindex=False))

Add textual column names from a CSV without column names

Use names=[] :

import pandas as pd
from tabulate import tabulate
 
df = pd.read_csv(inputMyFilePath, names=['My Field 1', 'My Field 2', 'My Field 3'], sep=',', keep_default_na=False).astype(str)
 
print(tabulate(df.head(10), headers='keys', tablefmt='psql', showindex=False))

Read a non-standard Excel file

Sometimes a file has junk top lines, or you cannot trust the future column names.

For example, to read a file from a specific row, skipping the last row (skipfooter=1):

import pandas as pd
from tabulate import tabulate
 
df = pd.read_excel(r'C:/Users/Georges/Downloads/test excel.xlsx',
                   sheet_name='Sheet1', engine='openpyxl', usecols=[0,1], header=5-1, skipfooter=1, names=['id', 'value'])
 
print(tabulate(df.head(35), headers='keys', tablefmt='psql', showindex=False))

Or skip a specific line (with skiprows=[...]):

...
df = pd.read_excel(my_file, sheet_name='Feuil1', engine='openpyxl', skiprows=[2])
...

Read from MySQL

After installing mysqlclient:

import MySQLdb
from tabulate import tabulate
import pandas as pd
 
mydb = MySQLdb.connect(user='root', password='', database='MyDb', host='localhost', port=3306)
query = 'SELECT name FROM users ;'
 
df = pd.read_sql(query, mydb)
mydb.close()
 
print(tabulate(df.head(10), headers='keys', tablefmt='psql'))
number = df.shape[0]
print(number)

Or below, after installing mysql.connector:

import mysql.connector as connection
import pandas as pd
from tabulate import tabulate as tab
 
mydb = connection.connect(host='XXX.XX.XXX.XX', database='MyDb',user='Me', passwd='pwdXXX!456', use_pure=True)
 
try:
    query = "SELECT firstname, lastname FROM customer ;"
    df = pd.read_sql(query, mydb)
except Exception as e:
    mydb.close()
    print(str(e))
 
print(tab(df.head(5), headers='keys', tablefmt='psql'))

Read from Postgres

import pandas as pd
from tabulate import tabulate
import psycopg2
 
con = psycopg2.connect(database='mytransformations', user='postgres', password='', host='localhost', port='5432')
print("OK, we are connect to mytransformations")
 
# DISPLAY WITH PANDAS
df = pd.read_sql("SELECT * FROM myparcels ;", con)
df.drop('geom', axis=1, inplace=True)
 
print('\nResults:')
print(tabulate(df.head(10), headers='keys', tablefmt='psql', showindex=False))
print('Number of records:', df.shape[0])

Read from a JSON

import pandas as pd
from tabulate import tabulate
import json
import requests
 
response = requests.get(MON_URL)
my_file = json.loads(response.text)
my_json = my_file['data']
 
mon_df = pd.DataFrame.from_dict(my_json)
 
print(tabulate(mon_df.head(10), headers='keys', tablefmt='psql', showindex=False))

Export a data-frame in an Excel file

df.to_excel("C:/Users/Your new file.xlsx", index=False, sheet_name='Tab 1')

index=False avoids writing the index column.

Export a data-frame in an existing Excel sheet

Use mode='a' and if_sheet_exists in an ExcelWriter:

import pandas as pd
 
My_File_Path = 'C:/My/File.xlsx'
 
# With if_sheet_exists='replace', the target sheet is overwritten; no need to
# pre-fill writer.sheets (which is read-only in recent pandas).
with pd.ExcelWriter(My_File_Path, engine='openpyxl', mode='a', if_sheet_exists='replace') as writer:
    df.to_excel(writer, sheet_name='My sheet', index=False)

Export several dataframes in Excel sheets

If you have several data-frames to export to one Excel file, each in its own sheet, you should open an ExcelWriter and close it; a with block does both.

with pd.ExcelWriter('C:/Users/Your new file.xlsx', engine='xlsxwriter') as writer:
    df_Country.to_excel(writer, index=False, sheet_name='Country', header=['Country', 'Total'])
    df_Emails.to_excel(writer, index=False, sheet_name='Emails', header=['Emails', 'Total'])

Export several dataframes in the same sheet

df.to_excel(writer, index=False, sheet_name='Registrations', startcol=3, startrow=34, header=['Field', 'Total', '%'])
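
For context, a minimal sketch assuming two hypothetical dataframes df_Main and df_Stats, both written to the same Registrations sheet:

import pandas as pd
 
# df_Main and df_Stats are hypothetical; the second one lands at column D, row 35.
with pd.ExcelWriter('C:/Users/Your new file.xlsx', engine='xlsxwriter') as writer:
    df_Main.to_excel(writer, index=False, sheet_name='Registrations')
    df_Stats.to_excel(writer, index=False, sheet_name='Registrations', startcol=3, startrow=34)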

Add filters in all columns in one sheet in an Excel file

import openpyxl
from openpyxl.utils import get_column_letter
 
outputExcelFile = r'C:/User/Your file.xlsx'
workbook = openpyxl.load_workbook(outputExcelFile)
 
worksheet = workbook['Sheet name']
FullRange = 'A1:' + get_column_letter(worksheet.max_column) + str(worksheet.max_row)
worksheet.auto_filter.ref = FullRange
 
workbook.save(outputExcelFile)

Add filters in all columns in all sheets in an Excel file

import openpyxl
from openpyxl.utils import get_column_letter
 
outputExcelFile = r'C:/User/Your file.xlsx'
 
workbook = openpyxl.load_workbook(outputExcelFile)
sheetsLits = workbook.sheetnames
 
for sheet in sheetsLits:
    worksheet = workbook[sheet]
    FullRange = 'A1:' + get_column_letter(worksheet.max_column) + str(worksheet.max_row)
    worksheet.auto_filter.ref = FullRange
 
workbook.save(outputExcelFile)

Add colors in all column names in a sheet in an Excel file

import openpyxl
from openpyxl.styles import PatternFill
 
workbook = openpyxl.load_workbook(My_File_Path)
worksheet = workbook['My sheet']
 
for cell in worksheet[1]:
    worksheet[cell.coordinate].fill = PatternFill(fgColor='FFC6C1C1', fill_type='solid')
 
workbook.save(My_File_Path)

Add colors in all column names in all sheets in an Excel file

import openpyxl
from openpyxl.styles import PatternFill
 
outputExcelFile = r'C:/Users/Your File.xlsx'
 
workbook = openpyxl.load_workbook(outputExcelFile)
sheetsLits = workbook.sheetnames
 
for sheet in sheetsLits:
    worksheet = workbook[sheet]
    for cell in worksheet[1]:
        worksheet[cell.coordinate].fill = PatternFill(fgColor='FFC6C1C1', fill_type='solid')
 
workbook.save(outputExcelFile)

Modify a column width in one sheet in an Excel file

import openpyxl
 
outputExcelFile = r'C:/Users/Your File.xlsx'
 
workbook = openpyxl.load_workbook(outputExcelFile)
 
workbook['Your sheet'].column_dimensions['A'].width = 75
workbook.save(outputExcelFile)

Modify all columns width in all sheets in an Excel file

The code below sets the width of column A to 30, and the others to 10.

import openpyxl
from openpyxl.utils import get_column_letter
 
outputExcelFile = r'C:/Users/Your File.xlsx'
 
workbook = openpyxl.load_workbook(outputExcelFile)
sheetsLits = workbook.sheetnames
 
for sheet in sheetsLits:
    for cell in workbook[sheet][1]:
        if get_column_letter(cell.column) == 'A':
            workbook[sheet].column_dimensions[get_column_letter(cell.column)].width = 30
        else:
            workbook[sheet].column_dimensions[get_column_letter(cell.column)].width = 10
 
workbook.save(outputExcelFile)

Add color to some lines, recovering their indexes from a condition

list_duplicate = df.index[df['Duplicate row'] == 'DUPLICATE!'].tolist()
 
for index in list_duplicate:
    for cell in worksheet[index+2]:
        worksheet[cell.coordinate].fill = PatternFill(fgColor='FF6666', fill_type='solid')
 
workbook.save(My_File_Path)

Freeze the top row of an Excel file

worksheet.freeze_panes = 'A2'
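
In context, a minimal sketch (file path assumed) that freezes the top row of every sheet:

import openpyxl
 
outputExcelFile = r'C:/Users/Your File.xlsx'
workbook = openpyxl.load_workbook(outputExcelFile)
 
# Freeze everything above A2, i.e. the first row, in each sheet.
for sheet in workbook.sheetnames:
    workbook[sheet].freeze_panes = 'A2'
 
workbook.save(outputExcelFile)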

Change font color and style in an Excel file

from openpyxl.styles import Font
 
worksheet['A1'].font = Font(color='FFFFFF', italic=False, bold=True)

Work on several Excel files

Here we freeze the top row, add filters, color the headers and change the width of all columns in 3 Excel files.

import openpyxl
from openpyxl.utils import get_column_letter
from openpyxl.styles import PatternFill
 
 
File1 = r'D:/Work/My file 1.xlsx'
File2 = r'D:/Work/My file 2.xlsx'
File3 = r'D:/Work/My file 3.xlsx'
 
ExcelList = [File1, File2, File3]
 
for ExcelFile in ExcelList:
    workbook = openpyxl.load_workbook(ExcelFile)
    sheetsLits = workbook.sheetnames
 
    for sheet in sheetsLits:
        worksheet = workbook[sheet]
        FullRange = 'A1:' + get_column_letter(worksheet.max_column) + str(worksheet.max_row)
        worksheet.auto_filter.ref = FullRange
        worksheet.freeze_panes = 'A2'
        for cell in workbook[sheet][1]:
            worksheet[cell.coordinate].fill = PatternFill(fgColor='FFC6C1C1', fill_type='solid')
            if get_column_letter(cell.column) == 'A':
                workbook[sheet].column_dimensions[get_column_letter(cell.column)].width = 12
            else:
                workbook[sheet].column_dimensions[get_column_letter(cell.column)].width = 20
 
    workbook.save(ExcelFile)

Insert a picture in an Excel file

And delete it afterwards.

import os
import openpyxl
 
img = openpyxl.drawing.image.Image('C:/Users/myplot.png')
img.anchor = 'E6'
 
workbook['Your sheet'].add_image(img)
workbook.save(outputExcelFile)
 
os.remove('C:/Users/myplot.png')

Open a box to search for a file and get its path

Use Tkinter.

from tkinter import Tk
from tkinter.filedialog import askopenfilename
 
Tk().withdraw()
filename = askopenfilename()
print(filename)

Open a box to save an Excel file

from tkinter import Tk
from tkinter.filedialog import asksaveasfilename
 
fileName = asksaveasfilename(filetypes=(("Excel files", "*.xlsx"), ("All files", "*.*")), defaultextension='.xlsx')
df.to_excel(fileName, index=False, sheet_name='Tab 1', header=None)

Use the first row as column labels

my_labels = df.iloc[0]
df = df[1:]
df.columns = my_labels

Check if a file exists

And use it only if it exists:

import os
import pandas as pd
 
my_file = 'C:/_gh/0/export.xlsx'
if os.path.isfile(my_file):
    df_ExportFeuil1 = pd.read_excel(my_file, sheet_name='Feuil1', engine='openpyxl')

 


Excel conditional formatting

If a number is equal to, in just one cell

Here we add to one cell (B3) a rule to fill it in blue if it equals 2:

import openpyxl
from openpyxl import styles, formatting
 
myFile = 'C:/Path/test.xlsx'
 
myDF.to_excel(myFile, index=False, sheet_name='Tab 1')
 
workbook = openpyxl.load_workbook(myFile)
worksheet = workbook['Tab 1']
 
myBlue = styles.PatternFill(start_color='0088cc', end_color='0088cc', fill_type='solid')
worksheet.conditional_formatting.add('B3', formatting.rule.CellIsRule(operator='equal', formula=['2'], fill=myBlue))
 
workbook.save(myFile)

If a number is greater than, in whole column

Same as above; we just change the operator to greaterThan, and we target the whole column B as Excel does, with B1:B1048576:

import openpyxl
from openpyxl import styles, formatting
 
myFile = 'C:/Path/test.xlsx'
 
myDF.to_excel(myFile, index=False, sheet_name='Tab 1')
 
workbook = openpyxl.load_workbook(myFile)
worksheet = workbook['Tab 1']
 
myBlue = styles.PatternFill(start_color='0088cc', end_color='0088cc', fill_type='solid')
worksheet.conditional_formatting.add('B1:B1048576', formatting.rule.CellIsRule(operator='greaterThan', formula=['2'], fill=myBlue))
 
workbook.save(myFile)

Operators

The above method supports several operators:

{'beginsWith', 'endsWith', 'notBetween', 'notContains', 'lessThan', 'greaterThanOrEqual', 'notEqual', 'between', 'lessThanOrEqual', 'containsText', 'greaterThan', 'equal'}
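
For example, between takes two formula values; a sketch reusing the workbook and worksheet objects from above:

# Fill in green the B cells between 2 and 10 (bounds included).
myGreen = styles.PatternFill(start_color='c6efce', end_color='c6efce', fill_type='solid')
worksheet.conditional_formatting.add('B1:B1048576',
    formatting.rule.CellIsRule(operator='between', formula=['2', '10'], fill=myGreen))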

If contains a string

Matching strings is a little more specific, with redundant code and building the column_slot ourselves, using the size of our dataframe + 1:

Here for example we fill in red the cells of column A when they contain the string CSS.

import pandas as pd
import openpyxl
from openpyxl import styles, formatting
from openpyxl.styles.differential import DifferentialStyle
from openpyxl.formatting.rule import Rule
 
myFile = 'C:/Path/test.xlsx'
 
myDF.to_excel(myFile, index=False, sheet_name='Tab 1')
 
workbook = openpyxl.load_workbook(myFile)
worksheet = workbook['Tab 1']
 
myRed = styles.PatternFill(start_color='ffc7ce', end_color='ffc7ce', fill_type='solid')
 
column_slot = 'A2:A'+str(myDF.shape[0]+1)
 
styleRed = DifferentialStyle(fill=myRed)
rule = Rule(type='containsText', operator='containsText', text='css', dxf=styleRed)
rule.formula = ['SEARCH("css",' + column_slot + ")"]
worksheet.conditional_formatting.add(column_slot, rule)
 
workbook.save(myFile)

 


Split file

What follows is not strictly about Pandas, but useful in some Pandas contexts.

First we count the number of lines in the file (from the occurrences of \n), then we split it into files of 50,000 lines if it exceeds that.

import os
from math import ceil
 
with open(workDirectory+'FixQueries.sql', 'r') as myfile:
    data = myfile.read()
    taille_fichier = data.count("\n")
 
lines_max = 50000
numero_fichier = 0
if taille_fichier > lines_max:
    print('Warning: the output file is '+str(taille_fichier)+' lines long! Please wait...')
    smallfile = None
    with open(workDirectory+'FixQueries.sql') as bigfile:
        for lineno, line in enumerate(bigfile):
            if lineno % lines_max == 0:
                if smallfile:
                    smallfile.close()
                numero_fichier += 1
                small_filename = workDirectory + 'FixQueries {}.sql'.format(numero_fichier)
                smallfile = open(small_filename, "w")
            smallfile.write(line)
        if smallfile:
            smallfile.close()
            print('We split it into', ceil(taille_fichier/lines_max), 'files!\n')
    os.remove(workDirectory+'FixQueries.sql')

 

And then merge the files:

filenames = ['C:/_gh/0/file_25000.txt', 'C:/_gh/0/file_50000.txt', 'C:/_gh/0/file_75000.txt', 'C:/_gh/0/file_100000.txt', 'C:/_gh/0/file_125000.txt']
 
with open('C:/_gh/0/CUMUL1.txt', 'w') as outfile:
    for names in filenames:
        with open(names) as infile:
            outfile.write(infile.read())

 

Maybe you will need to list the file names before merging them, with PowerShell (Alt+F+R):

get-childitem | select-object -expandproperty name > _files_list.txt

Be careful to keep the order of the files, and to keep the last line empty.

 


Advanced write in a TXT file

This is not directly related to Pandas, but it is very useful to store special notes to remember during a data process, and to automate it.

Create/replace a TXT file and write something into

Use the Write option (w):

MyTxtFile = 'C:/_gh/0/My beautiful file.txt'
 
with open(MyTxtFile, 'w') as file:
    file.write('Blablabla...\n')
    file.write(MyVariable)

Add some lines in an existing file

Use the Append option (a):

MyTxtFile = 'C:/_gh/0/My beautiful file.txt'
 
with open(MyTxtFile, 'a') as file:
    file.write('Blablabla...\n')
    file.write(MyVariable)

Replace some lines in an existing file

Here we first read (r option) and store the content of the file, then re-create it, removing the lines we want to delete:

MyTxtFile = 'C:/_gh/0/My beautiful file.txt'
 
with open(MyTxtFile, 'r') as f:
    lines = f.readlines()
 
with open(MyTxtFile, 'w') as f:
    for line in lines:
        # startswith accepts a tuple of prefixes, so one test is enough.
        if not line.strip('\n').startswith(('First line to delete...', 'Second line to delete...', 'Third line to delete...')):
            f.write(line)
 
with open(MyTxtFile, 'a') as file:
    file.write('New line 1\n')
    file.write('New line 2\n')
    file.write('New line 3\n')

Replace a simple string in a TXT file

MyTxtFile = 'C:/_gh/0/My beautiful file.txt'
 
with open(MyTxtFile, 'r') as file :
    filedata = file.read()
 
filedata = filedata.replace('String 1', 'String 2')
 
with open(MyTxtFile, 'w') as file:
    file.write(filedata)

Replace a string with a regular expression (regex) in a TXT file

FichierTest = 'C:/_gh/0/FichierTest.txt'
 
import re
 
# Find number : r'[0-9]'
# Find line break : r'[\n]'
# Find double line break : r'[\r\n]{2,}'
 
with open(FichierTest, 'r+') as file:
    text = file.read()
    text = re.sub(r'[\r\n]{2,}', '\n\n', text)
    file.seek(0, 0) # seek to beginning
    file.write(text)
    file.truncate() # get rid of any trailing characters

Test if a comma is used as decimal separator in a TXT file

import re
from termcolor import colored  # pip install termcolor
 
with open(MyTxtFile, 'r') as myfile:
    data = myfile.read()
    # Python's re does not support \K, so we use a lookahead-only pattern here.
    pb_regex = re.findall(r'(?m)(?:\t|^)\d+,(?=\d+(?:\t|$))', data)
    if pb_regex:
        print(colored('\nWarning, a comma is used as a decimal separator in a numeric value!\n' +
        'Find them in Notepad++ with:\n(?:\\t|^)\\d+\\K,(?=\\d+(?:\\t|$))', 'yellow'))

 

 


Analyze data from a multi-valued field

Suppose you have a field containing values coming from a multi-valued drop-down list. But now you want to count how many times each value appears.

If the values were unique in each cell, a simple GROUP BY in SQL would do the job.

But here we first have to split each cell to recover the unique values; then we will be able to count them.
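
To make the steps below concrete, here is a minimal hypothetical sample you can test with (the real data would come from the Excel file):

import pandas as pd
 
# One multi-valued cell per row, plus an empty one.
df = pd.DataFrame({'Your field': ['CSS, HTML', 'Python', 'CSS, Python, SQL', None]})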


Split the data

Start by splitting the data with str.split, specifying the separator and expanding, to create new columns that store each unique value.

import pandas as pd
 
df = pd.read_excel('C:/Users/Georges/Downloads/My Excel file.xlsx',
                   sheet_name='Sheet1', engine='openpyxl')
 
serie = df['Your field'].str.split(', ', expand=True)
 
print(serie)

Great! Now we get virtual columns containing only one value each, with as many columns as necessary.

Melt the virtual data

Now let's melt our virtual columns into just one.

import pandas as pd
 
df = pd.read_excel('C:/Users/Georges/Downloads/My Excel file.xlsx',
                   sheet_name='Sheet1', engine='openpyxl')
 
serieMelt = pd.melt(df['Your field'].str.split(', ',expand=True))['value']
 
print(serieMelt)

melt gives the resulting column the label value, which we can then select.

Count unique values

Now just put our melted data in a new data frame, before applying a sorted groupby with size() (a COUNT in SQL).

The full code will be:

import pandas as pd
 
df = pd.read_excel('C:/Users/Georges/Downloads/My Excel file.xlsx', sheet_name='Sheet1', engine='openpyxl')
 
temp_df = pd.DataFrame(pd.melt(df['Your field'].str.split(', ', expand=True))['value'])
 
df_Count = pd.DataFrame(temp_df.groupby(['value']).size(), columns=['Total']).reset_index()
 
print(df_Count)

We add a reset_index() to recover the column names properly, so we can use them later.

Percentage

Ok, but suppose you also want the recurrence percentages of your unique values. The usual method would take the total from the melted data, giving strange results. Indeed, the melt stacked every value of the multi-valued field into new rows, so a lot of them!

To get real percentages you have to use shape[0] to count the real number of rows from your original data-frame, and use it in the percentage calculation.

We start creating a new field in the results data-frame, then we calculate percentages, rounding them.

df_Count['Percent'] = (df_Count['Total'] / df.shape[0]) * 100
df_Count['Percent'] = df_Count['Percent'].round(decimals=2)

Of course you cannot sum the percentages; the total will exceed 100%, and that is normal.

Empty values management

Good. But what about our empty values? Just add dropna=False to see them.

import pandas as pd
 
df = pd.read_excel('C:/Users/Georges/Downloads/My Excel file.xlsx', sheet_name='Sheet1', engine='openpyxl')
 
temp_df = pd.DataFrame(pd.melt(df['Your field'].str.split(', ', expand=True))['value'])
df_Count = pd.DataFrame(temp_df.groupby(['value'], dropna=False).size(), columns=['Total']).reset_index()
 
df_Count['Percent'] = (df_Count['Total'] / df.shape[0]) * 100
df_Count['Percent'] = df_Count['Percent'].round(decimals=2)
 
print(df_Count)

Any problem?

Arrgg! The groupby counted empty values from the melted data, and the melt added as many new columns as necessary (based on the record with the highest number of unique values), each of them containing empty values... It skews the count and the percentage.

We have to cheat to fix that.

  1. import pandas as pd
  2.  
  3. df = pd.read_excel('C:/Users/Georges/Downloads/My Excel file.xlsx', sheet_name='Sheet1', engine='openpyxl')
  4.  
  5. temp_df = pd.DataFrame(pd.melt(df['Your field'].str.split(', ', expand=True))['value'])
  6. df_Count = pd.DataFrame(temp_df.groupby(['value'], dropna=False).size(), columns=['Total']).reset_index()
  7. df_Count = df_Count.fillna('AZERTY')
  8.  
  9. df_Count['Percent'] = (df_Count['Total'] / df.shape[0]) * 100
  10. df_Count['Percent'] = df_Count['Percent'].round(decimals=2)
  11.  
  12. valuesEmpty = df['Your field'].isna().sum()
  13. valuesEmptyPercent = round((valuesEmpty / df.shape[0]) * 100, 2)
  14.  
  15. df_Count.loc[(df_Count['value'] == 'AZERTY')] = [['Unknown', valuesEmpty, valuesEmptyPercent]]
  16. df_Count = df_Count.sort_values(['Total'], ascending=False)
  17.  
  18. print(df_Count)

The 7th line adds a specific placeholder value (AZERTY) in the results data-frame, to target the wrong empty-values calculation.

The 12th and 13th lines count the empty values and their percentage separately, and store the results in dedicated variables.

Then the 15th line replaces the wrong empty-values row with our new value and variables, using loc to identify the line to fix.

Finally we can sort our data-frame with sort_values(['Total']).


Work with dates

Group by months

If you have a data-frame df with a field Created at containing dates like:

2021-02-05 10:34:21 UTC

You can work with it. For example, to group and sort by month, you can cheat by doing:

df_Temp = pd.DataFrame(df['Created at'].str[:7])
df_Created_count = pd.DataFrame(df_Temp.groupby(['Created at'], dropna=False).size(), columns=['Total']).sort_values(['Created at'], ascending=True).reset_index()

Indeed, with str[:7] we extract the year and month from the strings. Then a usual groupby.

Ok, but it is more pandas-like to convert your field to a true date format, in a new virtual field. Then you can extract real months and store them in a proper data-frame.

df['Created'] = pd.to_datetime(df['Created at'])
df['Created'] = df['Created'].dt.to_period("M")
df_TempCreated = pd.DataFrame(df['Created'])

Then it is easy to group and count them.

df_Created_count = pd.DataFrame(df_TempCreated.groupby(['Created'], dropna=False).size(), columns=['Total']).sort_values(['Created'], ascending=True).reset_index()

Since you now use a real date format, you can display the full month names with dt.strftime.

df_Created_count['Created'] = df_Created_count['Created'].dt.strftime('%B %Y')

Format date

If you get date like 2022-10-24 and you would 24/10/2022:

df['My date'] = pd.to_datetime(df['My date'], format='%Y-%m-%d')
df['My date'] = df['My date'].dt.strftime("%d/%m/%Y")

Get a simple date (YYYYMMDD)

# SIMPLE DATE
df_DataUnion['date_simple'] = df_DataUnion['date_submit'].astype(str).str[:10]
df_DataUnion['date_simple'] = df_DataUnion['date_simple'].replace({'-': ''}, regex=True)

Extract the previous month

today = pd.Timestamp.today()
PreviousMonth = today - pd.DateOffset(months=1)
print(PreviousMonth.strftime("%Y-%m"))

Convert false date field from Excel

Excel can mix a true date format and strings in the same column, arrgh! Trouble comes later, when you use it in Pandas...

I use this tip to manage this kind of malicious field in pandas:

df.loc[~df['date'].str.contains(' ', na=False), 'date'] = df['date'].astype(str).str.split(expand=True)[0]
df.loc[~df['date'].str.contains(' ', na=False), 'date'] = \
    df['date'].astype(str).str.split('-', expand=True)[2] + '/' +\
    df['date'].astype(str).str.split('-', expand=True)[1] + '/' +\
    df['date'].astype(str).str.split('-', expand=True)[0]

Optionally, I put the code above in a try:... except:...

Optionally, I convert the whole field to string first, with astype(str).
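
A minimal sketch of that try/except wrapper, assuming the same df and date column as above:

try:
    df['date'] = df['date'].astype(str)
    df.loc[~df['date'].str.contains(' ', na=False), 'date'] = \
        df['date'].str.split('-', expand=True)[2] + '/' + \
        df['date'].str.split('-', expand=True)[1] + '/' + \
        df['date'].str.split('-', expand=True)[0]
except Exception as e:
    # If the column does not have the expected shape, keep it unchanged.
    print('Date fix skipped:', e)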

Manage numeric date from Excel

Sometimes dates from Excel come as integers, arrgghhh!

import xlrd
 
def read_date(date):
    return xlrd.xldate.xldate_as_datetime(date, 0)
 
df.loc[df['My date'].astype(str).str.isdigit() == True, 'My date'] = \
pd.to_datetime(df['My date'].loc[df['My date'].astype(str).str.isdigit()].astype(int).apply(read_date), errors='coerce')

Check date format

Sometimes (especially when it comes from Excel...) it is good to check your dates after treatments.

Below we search for dates not starting with DD/MM/YYYY:

df2 = df[~df['my date'].str.contains('^[0-9][0-9]/[0-9][0-9]/[0-9][0-9][0-9][0-9]', regex=True)]
 
print(tabulate(df2, headers='keys', tablefmt='psql', showindex=False))

 


Pie Chart

Simple pie

If you have a data-frame with fields Categories and Total, you can create a pie chart.

import matplotlib.pyplot as plt
 
chartLabel = df_Categories_count['Categories'].tolist()
chartValue = df_Categories_count['Total'].tolist()
 
colors = plt.rcParams['axes.prop_cycle'].by_key()['color']
 
fig1 = plt.figure()
plt.pie(chartValue, labels=chartLabel, colors=colors, autopct=None, shadow=False, startangle=90)
plt.axis('equal')
plt.title('Categories', pad=20, fontsize=15)
 
plt.show()
plt.clf()

Display percentages in the chart

Use autopct.

plt.pie(chartValue, labels=chartLabel, colors=colors, autopct='%1.1f%%', shadow=False, startangle=90)

Change the labels font size

plt.rcParams['font.size'] = 5

Hide a label in chart

Just replace the value with an empty string.

chartLabel[-1] = ''

Add a legend in your chart

plt.legend(chartLabel, loc='best', fontsize=8)

Explode

To explode your chart, you need to pass values in explode. For example, if you have 4 values to plot:

plt.pie(myValue, labels=myLabels, colors=colors, explode=[0.05, 0.05, 0.05, 0.05], autopct=None, shadow=False, startangle=90)

If you want to explode all parts by default, just build the list first:

explodeValues = []
for i in myValue:
    explodeValues.append(0.05)
 
plt.pie(myValue, labels=myLabels, colors=colors, explode=explodeValues, autopct=None, shadow=False, startangle=90)

Add a legend with several fields

For example, you want to display labels and percentages in the legend. First calculate the percentages in another field of your data-frame, then:

import matplotlib.pyplot as plt
 
chartLabel = df_Categories_count['Categories'].tolist()
chartLegendLabel = df_Categories_count['Categories'].tolist()
chartValue = df_Categories_count['Total'].tolist()
chartLegendPercent = df_Categories_count['Percent'].tolist()
 
legendLabels = []
for i, j in zip(chartLegendLabel, map(str, chartLegendPercent)):
    legendLabels.append(i + ' (' + j + ' %)')
 
colors = plt.rcParams['axes.prop_cycle'].by_key()['color']
 
fig1 = plt.figure()
plt.pie(chartValue, labels=chartLabel, colors=colors, autopct=None, shadow=False, startangle=90)
plt.axis('equal')
 
plt.title('Categories', pad=20, fontsize=15)
plt.legend(legendLabels, loc='best', fontsize=8)
 
plt.show()
plt.clf()

Bar chart

Vertical bar chart

chartLabel = df_Created_count['Labels field'].tolist()
chartValue = df_Created_count['Values field'].tolist()
 
fig5 = plt.figure(figsize=(13,6))
bar_plot = plt.bar(chartLabel, chartValue)
 
# plt.ylabel('yyy')
# plt.xlabel('xxx')
plt.xticks(rotation=30, ha='right')
 
# HIDE BORDERS
plt.gca().spines['left'].set_color('none')
plt.gca().spines['right'].set_color('none')
plt.gca().spines['top'].set_color('none')
 
# HIDE TICKS
plt.tick_params(axis='y', labelsize=0, length=0)
plt.yticks([])
 
# ADD VALUE ON THE END OF HORIZONTAL BARS
# for index, value in enumerate(chartValue):
#     plt.text(value, index, str(value))
 
# ADD VALUE ON THE TOP OF VERTICAL BARS
def autolabel(rects):
    for idx, rect in enumerate(rects):
        height = rect.get_height()
        plt.text(rect.get_x() + rect.get_width()/2, height, chartValue[idx], ha='center', va='bottom', rotation=0)
 
autolabel(bar_plot)
 
plt.title('Registrations by month', pad=20, fontsize=15)
 
plt.show()
plt.clf()

Work with shape files

Here we will use numpy and basemap.

pip install numpy
pip install basemap

To install basemap on Windows, download a prebuilt package matching your versions, and install it with wheel.

Thanks to ramiro.org: I got the code below on his website and just adapted it.

Suppose you have an Excel file with a country field: you can count the countries and map them with a country shapefile. The country names in the Excel file and in the shapefile must be the same.

import matplotlib as mpl
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
 
from matplotlib.patches import Polygon
from matplotlib.collections import PatchCollection
from mpl_toolkits.basemap import Basemap
 
from PIL import Image, ImageOps
 
shp_simple_countries = r'C:/Users/Downloads/simple_countries/simple_countries'
 
inputExcelFile = r'C:/Users/Downloads/My file.xlsx'
 
df = pd.read_excel(inputExcelFile, sheet_name='Export', engine='openpyxl', usecols=['ID', 'My Country Field'])
 
# Count the countries
df_Country_count = pd.DataFrame(df.groupby(['My Country Field'], dropna=True).size(), columns=['Total']).sort_values(['Total'], ascending=False).reset_index()
df_Country_count = df_Country_count.fillna('Unknown')
 
df_Country_count['Percent'] = (df_Country_count['Total'] / df_Country_count['Total'].sum()) * 100
df_Country_count['Percent'] = df_Country_count['Percent'].round(decimals=2)
 
# Set countries as index
df_Country_count.set_index('My Country Field', inplace=True)
 
my_values = df_Country_count['Percent']
 
num_colors = 30
cm = plt.get_cmap('Blues')
 
scheme = [cm(i / num_colors) for i in range(num_colors)]
my_range = np.linspace(my_values.min(), my_values.max(), num_colors)
 
# -1 BECAUSE np.digitize RETURNS INDICES STARTING AT 1, AND WE INDEX THE COLOR LIST FROM 0
df_Country_count['Percent'] = np.digitize(my_values, my_range) - 1
 
map1 = plt.figure(figsize=(14, 8))
ax = map1.add_subplot(111, frame_on=False)
 
map1.suptitle('Countries', fontsize=30, y=.95)
 
m = Basemap(lon_0=0, projection='robin')
m.drawmapboundary(color='w')
 
m.readshapefile(shp_simple_countries, 'units', color='#444444', linewidth=.2, default_encoding='iso-8859-15')
 
# Create the choropleth map
for info, shape in zip(m.units_info, m.units):
    shp_ctry = info['COUNTRY_HB']
    if shp_ctry not in df_Country_count.index:
        color = '#dddddd'
    else:
        color = scheme[df_Country_count.loc[shp_ctry]['Percent']]
 
    patches = [Polygon(np.array(shape), True)]
    pc = PatchCollection(patches)
    pc.set_facecolor(color)
    ax.add_collection(pc)
 
# Cover up Antarctica so legend can be placed over it
ax.axhspan(0, 1000 * 1800, facecolor='w', edgecolor='w', zorder=2)
 
# Draw color legend
ax_legend = map1.add_axes([0.2, 0.14, 0.6, 0.03], zorder=3)
cmap = mpl.colors.ListedColormap(scheme)
 
cb = mpl.colorbar.ColorbarBase(ax_legend, cmap=cmap, ticks=my_range, boundaries=my_range, orientation='horizontal')
cb.ax.set_xticklabels([str(round(i, 1)) for i in my_range])
cb.ax.tick_params(labelsize=7)
cb.set_label('Percentage', rotation=0)
#cb.remove()
 
# Set the map footer
# description = 'Bla bla bla'
# plt.annotate(description, xy=(-.8, -3.2), size=14, xycoords='axes fraction')
 
map1.savefig('C:/Users/Downloads/mymap1.png', dpi=100, bbox_inches='tight')
 
plt.show()
plt.clf()
 
im = Image.open('C:/Users/Downloads/mymap1.png')
bordered = ImageOps.expand(im, border=1, fill=(0, 0, 0))
bordered.save('C:/Users/Downloads/mymap1.png')

Joins

SQL JOIN LEFT

df_Join = pd.merge(df1, df2, left_on='Email', right_on='email', how='left')[['Email', 'source']]

Another example:

df_Merged = df1.merge(df2, on='id', how='left')

Another example to find the non-matching records on 2 fields:

df_join = pd.merge(df1, df2, left_on=['id1', 'field1'], right_on=['id2', 'field2'], how='left')[['id1', 'field1', 'field2']]
df_NoMatch = df_join[df_join['field2'].isna()]

 

SQL INNER JOIN

df_Join = pd.merge(df1, df2, on='Email')

Or:

df_Merged = df1.merge(df2, on='id', how='inner')

Find difference

df_difference = pd.DataFrame(df_First[~df_First['Email'].isin(df_Second['email'])])

 


Group, count and sort

Count rows

number = df.shape[0]
print(number)

Count values (group by)

If you have a field named Country in a data-frame named df:

CountCountry = df.groupby(['Country'], dropna=False).size().sort_values(ascending=False)
print(CountCountry)

Count values (group by) and get it in a dataframe

CountCommune = pd.DataFrame(df_GR_Reglement.groupby(['id_commune'], dropna=False).size(), columns=['Total']).sort_values(['Total'], ascending=False).reset_index()
 
print(tabulate(CountCommune.head(10), headers='keys', tablefmt='psql', showindex=False))

Count values grouping by several fields to find duplicates

df_DuplicatePages = pd.DataFrame(df_GR_Reglement.groupby(['id_commune', 'PAGES'], dropna=False).size(), columns=['Total'])\
.sort_values(['Total'], ascending=False).reset_index()
df_DuplicatePages = df_DuplicatePages[(df_DuplicatePages['Total'] > 1)]
 
print(tabulate(df_DuplicatePages.head(10), headers='keys', tablefmt='psql', showindex=False))

Group concat

df = df.groupby(['country'])['city'].apply(lambda x: ', '.join(x)).reset_index()
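
A tiny worked example with hypothetical data, to show the shape of the result:

import pandas as pd
 
# Two cities for FR collapse into one comma-separated cell.
df = pd.DataFrame({'country': ['FR', 'FR', 'US'], 'city': ['Paris', 'Lyon', 'Boston']})
df = df.groupby(['country'])['city'].apply(lambda x: ', '.join(x)).reset_index()
print(df)
#   country         city
# 0      FR  Paris, Lyon
# 1      US       Boston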

Count a specific value

print(len(df[df['Your field'] == 'United States']))

Count with condition

myCount = df[df['Field where search the condition'] == 'A value'].count()['Field to count']

Count empty values

number = df['Your field'].isna().sum()

Keep empty values in a group by and fill them

newDf = pd.DataFrame(df.groupby(['Your field'], dropna=False).size(), columns=['Total']).sort_values(['Total'], ascending=False).reset_index()
newDf = newDf.fillna('Unknown')

Count email domains (group by)

You have an Email field with clean emails. Thanks to Pandas it is easy to group and count their domains.

First create a virtual field to split the emails and recover the 2nd part of the split. Then just group on your new field.

df['Domain'] = df['Email'].str.split('@').str[1]
df_Email_DNS_count = pd.DataFrame(df.groupby(['Domain']).size(), columns=['Total'])\
    .sort_values(['Total'], ascending=False).reset_index()

Calculate percentages

According to your needs, you can use the sum of a Total field.

df['Percent'] = (df['Total'] / df['Total'].sum()) * 100
df['Percent'] = df['Percent'].round(decimals=2)

Or the number of rows in your data-frame (useful if you work with a multi-valued field; see the Analyze data from a multi-valued field chapter above).

df['Percent'] = (df['Total'] / df.shape[0]) * 100
df['Percent'] = df['Percent'].round(decimals=2)

Sort on a field 

df.sort_values(by=['Field'], ascending=False, inplace=True)
# Recreate the index if necessary (if you will use the index later)
df = df.reset_index(drop=True)

Sort on a field containing both text and numerical values

df = df.loc[pd.to_numeric(df['My field'], errors='coerce').sort_values().index]
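
A worked example with hypothetical data: to_numeric with errors='coerce' turns the text into NaN, so the numbers sort first and the text lands at the end:

import pandas as pd
 
df = pd.DataFrame({'My field': ['10', 'abc', '2', '33']})
df = df.loc[pd.to_numeric(df['My field'], errors='coerce').sort_values().index]
print(df['My field'].tolist())  # ['2', '10', '33', 'abc']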

 


Replace, remove, edit with conditions

Replace empty cells

df['speciality'] = df['speciality'].fillna('Other')

Replace a value (the full value)

df['Your field'] = df['Your field'].replace(['Old value'],'New value')

Replace a part of string

df['Your field'] = df['Your field'].replace({'Old value': 'New value'}, regex=True)

Replace a string if it contains

df.loc[df['speciality'].str.contains('Researcher'), 'speciality'] = 'Research Scientist'

If not contains

Add ~:

df.loc[~df['speciality'].str.contains('Researcher'), 'speciality'] = 'Research Scientist'

Localize and remove a row

ind_drop = df[df['Your field'].apply(lambda x: x == ('A value'))].index
df = df.drop(ind_drop)

Localize and remove a row starting with ...

ind_drop = df[df['Your field'].apply(lambda x: x.startswith('A value'))].index
df = df.drop(ind_drop)

Localize and remove a row ending with ...

ind_drop = df[df['Your field'].apply(lambda x: x.endswith('A value'))].index
df = df.drop(ind_drop)

Localize and replace a row

df.loc[(df['A field'] == 'TARGET')] = [[NewValue1, NewValue2, NewValue3]]

Remove some first characters

Here we delete the first 2 characters if the cell starts with a comma followed by a space.

df['Field'] = df['Field'].apply(lambda x: x[2:] if x.startswith(', ') else x)

Edit with a condition

Increment a field if another field is not empty.

df.loc[df['My field maybe empty'].notna(), 'Field to increment'] += 1

Fill a field if one field is greater than or equal to another field.

df.loc[df['Field A'] >= df['Field B'], 'Field to fill'] = 'Yes'

Edit several fields at the same time.

df.loc[df['Field A'] >= df['Field B'], ['Field A to fill', 'Field B to fill']] = ['Yes', 'No']

Edit with several conditions

Condition "AND" (&)
df.loc[(df['My field maybe empty'].notna()) &amp; (df['An integer field'] == 1) &amp; (df['An string field'] != 'OK'), 'Field to increment'] += 1

Please replace "&amp;" with a simple &.

Condition "OR" (|)
df.loc[(df['My field maybe empty'].notna()) | (df['An integer field'] == 1) | (df['A string field'] != 'OK'), 'Field to fill'] = 'Yes'

Edit with IN or NOT IN condition (as SQL)

Just use isin:

df.loc[df['Id field'].isin([531733,569732,652626]), 'Field to edit'] = 'Yes'

And for NOT IN:

df.loc[~df['Id field'].isin([531733,569732,652626]), 'Field to edit'] = 'No'

Replace string beginning with

df['id_commune'] = df['id_commune'].str.replace(r'(^75.*$)', '75056', regex=True)

Remove letters

df['mobile'] = df['mobile'].str.extract(r'(\d+)', expand=False).fillna('')

Extract before or after a string

Example if Job='IT: DBA'

df['type'] = df['Job'].str.split(': ').str[0]
df['speciality'] = df['Job'].str.split(': ').str[1]

Get in title case

df['firstname'] = df['firstname'].str.title()

Remove if it contains fewer than n characters (length)

df.loc[df['mobile'].str.len() < 6, 'mobile'] = ''

 


Filters

Get lines with a field containing a string

df2 = df1[df1['My field'].str.contains('AAA').fillna(False)]
 
print(tabulate(df2.head(5), headers='keys', tablefmt='psql'))

Get lines with a field not containing a string

Just use ~:

df2 = df1[~df1['My field'].str.contains('AAA').fillna(False)]
 
print(tabulate(df2.head(5), headers='keys', tablefmt='psql'))

Get lines with a field matching a regex

Here we search for rows where the field is a number with a comma as decimal separator.

dfTest = df_ImportTxt[df_ImportTxt['My field'].str.match(pat=r'^\d+,\d+$') == True]

Get lines searching several strings in a field

Example searching for space, star and parentheses:

MySearch = ['\*', ' ', '\(', '\)']
 
df2 = df1[df1['zone'].str.contains('|'.join(MySearch)).fillna(False)]
 
print(tabulate(df2.head(40), headers='keys', tablefmt='psql'))

Get lines where a field contains only digits

df2 = df1.loc[df1['zone'].astype(str).str.isdigit()]
 
print(tabulate(df2.head(5), headers='keys', tablefmt='psql'))

Get lines according to several conditions

Below we get lines where a field has null values or another field has empty values:

df2 = df1[(df1['My field'].isna()) | (df1['My other field'] == '')]
 
print(tabulate(df2.head(5), headers='keys', tablefmt='psql', showindex=False))

Get lines searching if values are IN a list

searchIn = ['AAA', 'BBB', 'CCC']
 
df2 = df1[df1['My field'].isin(searchIn)]
 
print(tabulate(df2.head(5), headers='keys', tablefmt='psql', showindex=False))

Get lines with a field not ending with a concatenation of other fields

Suppose you have this dataframe df:

Language    PDF
Python      docA-Python.pdf
SQL         docF-SQ.pdf
PHP         docS-PHP.pdf
CSS         docG-CSS.pdf
Javascript  docR-Javascript.pdf
HTML        docW-HTML.pdf

 

Now you would like to extract the lines where the PDF field does not end with [Language field] + '.pdf':

Compare = df[['Language', 'PDF']].apply(lambda x: x['PDF'].endswith(x['Language'] + '.pdf'), axis=1)
 
df_temp = pd.DataFrame(Compare)
df_temp.columns =['True or false']
 
df_concat = pd.concat([df, df_temp], axis=1)
df_filtered = df_concat[['Language', 'PDF']][df_concat['True or false'] == False]
 
print(tabulate(df_filtered.head(5), headers='keys', tablefmt='psql', showindex=False))
Language  PDF
SQL       docF-SQ.pdf

 


Merge/concatenate

Merge 2 lists into 2 columns and display them properly in the terminal


import pandas as pd
from tabulate import tabulate
 
LanguageList = ['Python', 'SQL', 'PHP', 'CSS', 'Javascript', 'HTML']
LevelList = ['Fun', 'OK', 'Arghh', 'OK', 'Arghh', 'Easy']
 
LanguageLevel = list(zip(LanguageList, LevelList))
 
df = pd.DataFrame(LanguageLevel, columns =['Language', 'Level'])
 
print(tabulate(df, headers='keys', tablefmt='psql', showindex=False))

The 7th line uses zip to link the values from our lists into tuples; list encloses the tuples in a list.

The 9th line creates a data-frame (pd.DataFrame) from our new list (now containing tuples), specifying the column labels.

The print uses tabulate to display our fields with a beautiful Postgres style (psql).

Split a dataframe with conditions, then merge the splits according to other conditions

firstSelection = df.loc[df['Field'] == 'Yes'].head(100)
 
secondSelection = df.loc[df['Field'] == 'No'].head(200)
 
# Merge the 2 selections.
df_Selections = pd.concat([firstSelection, secondSelection], axis=0)

 


Warnings

Avoid indexing problems after a dataframe concat

Sometimes after a dataframe concatenation, the indexes are jumbled and can cause problems. Just ignore the index in your new dataframe:

df_Selections = pd.concat([firstSelection, secondSelection], axis=0, ignore_index=True)

Avoid copy warning

import warnings
from pandas.errors import SettingWithCopyWarning  # older pandas: pandas.core.common
 
warnings.simplefilter(action='ignore', category=SettingWithCopyWarning)

Avoid user warning 

import warnings
 
warnings.simplefilter(action='ignore', category=UserWarning)

 


Miscellaneous

Create a dataframe from a previous dataframe (copy)

df2 = df1.copy()

Copy data from a dataframe to the clipboard

Very convenient to copy-paste quickly in a text file, CSV, Excel ...

df.to_clipboard(sep=',', index=False, header=None)

Create a dataframe from a list

import pandas as pd
from tabulate import tabulate
 
MyList = ['7878', 'Bbbb', 'azerfg', '545', 'XXX']
 
df = pd.DataFrame(MyList)
df.columns =['My field']
number = df.shape[0]
 
print(tabulate(df.head(10), headers='keys', tablefmt='psql', showindex=False))
print(number)

Dedupe

To dedupe a dataframe on an Email field keeping only the first duplicate record, use drop_duplicates.

df['Email'] = df['Email'].str.lower()
dfDeduped = df.drop_duplicates(subset=['Email'], keep='first')

Find duplicates

Use duplicated:

df2 = df1[df1.duplicated(['My field'], keep=False)]
 
print(tabulate(df2.head(10), headers='keys', tablefmt='psql', showindex=False))

Export a picture from a chart

And add a border.

from PIL import Image, ImageOps
 
fig1.savefig('C:/Users/myplot.png', dpi=100)
 
im = Image.open('C:/Users/myplot.png')
bordered = ImageOps.expand(im, border=1, fill=(0, 0, 0))
bordered.save('C:/Users/myplot.png')

Re-organize columns

df = df[['Field 1', 'Adresse mail', 'Note algorithmique', 'Substitutions']]

Re-name columns

From their original name:

df.rename(columns={'Old name 1': 'New name 1', 'Old name 2': 'New name 2'}, inplace=True)

From their position:

df.columns.values[12] = 'New name'

Delete column

df.drop('My field', axis=1, inplace=True)

Format a number with a space as thousands separator

number = df.shape[0]
print(format(number, ',').replace(',', ' '), 'customers')

Or:

f"{number_of_rows:,}".replace(',', ' ')

Ranking

To rank according to a field, from highest to lowest:

df['My rank'] = df['My field to rank'].rank(ascending=False)

To avoid duplicate ranks:

df['My rank'] = df['My field to rank'].rank(ascending=False, method='first')
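
A quick hypothetical example showing the difference with ties:

import pandas as pd
 
df = pd.DataFrame({'My field to rank': [30, 20, 30]})
print(df['My field to rank'].rank(ascending=False).tolist())                   # [1.5, 3.0, 1.5]
print(df['My field to rank'].rank(ascending=False, method='first').tolist())  # [1.0, 3.0, 2.0]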

Generate a unique random integer

If you want unique random integers from 1 to the number of records (both included):

import numpy as np
 
# range(howManyRecord) gives 0..n-1; the 0 is then replaced with n to get 1..n.
howManyRecord = df.shape[0]
df['Random number'] = np.random.choice(range(howManyRecord), howManyRecord, replace=False)
df.loc[df['Random number'] == 0, 'Random number'] = howManyRecord

Select with condition

df = df.loc[df['Field to filter'] == 'Yes']

Prevent nan

Use keep_default_na=False :

Data = pd.read_excel(inputExcelFile, sheet_name='Feuil1', engine='openpyxl', keep_default_na=False)

Get tuples from a file (in a SQL UPDATE shape)

...df.itertuples()...

Very useful to build SQL UPDATE queries from a file with the primary key in the first column:

file1 = open(MyPath+'FixQueries.sql', 'w', encoding='cp1252')
 
# WRITE THE FILE
for i in df.itertuples(index=False):
    # str(i) looks like "Pandas(col1=..., col2=...)"; [7:-1] strips that wrapper.
    file1.write('UPDATE MyTable SET ' +\
        str(i)[7:-1].split(', ', 1)[-1] +\
        " WHERE id = '" + i[0] + "' ;\n")
 
file1.close()

Start the index at a special value

Very useful to then use it in Excel formulas (which start at row 2...).

df.index = df.index + 2

Use the index in a concatenation

df.index.astype(str)
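
For example (the Row ref column name is just an assumption for the sketch):

# Build a string column from the index.
df['Row ref'] = 'Row ' + df.index.astype(str)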

Reset index

df.reset_index(drop=True, inplace=True)

Open a box to enter a string and use it as a variable

import tkinter as tk
from tkinter import simpledialog
 
window = tk.Tk()
window.eval('tk::PlaceWindow . center')
window.withdraw()
repertoireCarto = simpledialog.askstring("Input", "Directory?\t\t\t\t", parent=window)
print('The directory is', repertoireCarto)

Get some line indexes according to a condition

list_duplicate = df.index[df['My field'] == 'Text'].tolist()

Import external scripts

Your working script needs to know where to find the external scripts:

import sys
 
sys.path.append('C:\\Users\\Georges\\PycharmProjects\\Your_Directory')
 
from YourScriptWithoutExtension import SomeVariables, ...

Subprocess

Not related to Pandas but very useful!

import subprocess
 
subprocess.run('psql -h localhost -p 5432 -d work -U postgres -c "SELECT NOW() ;"', shell=True)
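
If you need the output in Python instead of the terminal, a sketch with standard subprocess.run parameters (Python 3.7+):

import subprocess
 
# capture_output collects stdout/stderr; text decodes them to strings.
result = subprocess.run('psql -h localhost -p 5432 -d work -U postgres -c "SELECT NOW() ;"',
                        shell=True, capture_output=True, text=True)
print(result.stdout)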

 


Lists

What follows is not strictly about Pandas, but useful in some Pandas contexts.

Search if elements from a list are in another list

Check if the elements from List2 are in List1:

List1 = ['Aaa', 'Bbbb', 'Cccc', 'Dddd', 'Eeee', 'Ffff', 'Gggg', 'Hhhh']
List2 = ['Eeee', 'Cccc']
 
print(List1)
print(List2)
 
check = all(item in List1 for item in List2)
 
print(check)

Or:

import collections
 
List1 = ['Aaa', 'Bbbb', 'Cccc', 'Dddd', 'Eeee', 'Ffff', 'Gggg', 'Hhhh']
List2 = ['Bbbb', 'Aaa', 'Cccc', 'Dddd', 'Eeee', 'Ffff', 'Hhhh', 'Gggg']
 
print(List1)
print(List2)
 
def iCompare(c1, c2):
    if(collections.Counter(c1)==collections.Counter(c2)):
        return 'Yes'
    else:
        return 'No'
 
info = iCompare(List1, List2)
 
print(info)

Extract common elements from several lists

List1 = ['Aaa', 'Aaa', 'Bbbb', 'Cccc', 'Dddd', 'Eeee', 'Ffff', 'Gggg', 'Hhhh']
List2 = ['Eeee', 'Cccc', 'Cccc']
List3 = ['Cccc', 'Bbbb', 'Eeee']
 
print('List1:', List1)
print('List2:', List2)
print('List3:', List3)
 
Compare = sorted(list(set(List1) &amp; set(List2) &amp; set(List3) ))
 
print('\nCompare:', Compare)

Extract non-common elements from several lists

List1 = ['Aaa', 'Aaa', 'Bbbb', 'Cccc', 'Dddd', 'Eeee', 'Ffff', 'Gggg', 'Hhhh']
List2 = ['Eeee', 'Cccc', 'Cccc']
 
print('List1:', List1)
print('List2:', List2)
 
Compare = sorted(list(set(List1) - set(List2) ))
 
print('\nCompare:', Compare)

Concatenate 2 fields into a list/sentence

Assemblage = list(zip(df['Field 1'].tolist(), df['Field 2'].tolist()))
b = []
for a in Assemblage:
    b.append(str(list(a)[0]) + ' (' + list(a)[1] + ')')
print(str(b).replace('[', '').replace(']', '').replace("'", ''), '\n')

 Another example:

Assemblage = list(zip(df['field1'].tolist(), df['field2'].tolist()))
b = []
for a in Assemblage:
    b.append('Blablabla...  ' + str(list(a)[0]) + ' ...Blablabla... ' + list(a)[1] + ' ...Blablabla.')
for i in b:
    print(i)