Index de l'article

Work with dates

Get true date format from string data

Example, you have dates like 21/09/2022 but you would a true date format to work with it:

df['TrueDate'] = pd.to_datetime(df['OriginalDate'], infer_datetime_format=True).dt.strftime('%Y-%m-%d')

Group by months

If you have a data-frame df with a field Created at containing dates like:

2021-02-05 10:34:21 UTC

You can work with it, example to group and sort by month, you can cheat doing:

df_Temp = pd.DataFrame(df['Created at'].str[:7])
df_Created_count = pd.DataFrame(df_Temp.groupby(['Created at'], dropna=False).size(), columns=['Total']).sort_values(['Created at'], ascending=True).reset_index()

indeed with str[:7] we extract years and months from strings. Then a usual groupby.

Ok but it is more a pandas-way to convert your field in a true date format, in a new virtual field. Then you can extract real months and store them in a proper data-frame.

df['Created'] = pd.to_datetime(df['Created at'])
df['Created'] = df['Created'].dt.to_period("M")
df_TempCreated = pd.DataFrame(df['Created'])

Then easy to group and count them.

df_Created_count = pd.DataFrame(df_TempCreated.groupby(['Created'], dropna=False).size(), columns=['Total']).sort_values(['Created'], ascending=True).reset_index()

As now you use a real date format, you can display the full months with dt.strftime.

df_Created_count['Created'] = df_Created_count['Created'].dt.strftime('%B %Y')

Format date

If you get date like 2022-10-24 and you would 24/10/2022:

df['My date'] = pd.to_datetime(df['My date'], format='%Y-%m-%d')
df['My date'] = df['My date'].dt.strftime("%d/%m/%Y")

Format date from french format

If you get date like 11/05/2023 and you would 2023-05-11:

df['My date'] = df['French date'].str.split('/').str[2] + '-' + df['French date'].str.split('/').str[1] + '-' + df['French date'].str.split('/').str[0]

Get simple date (DDDDMMDD)

# SIMPLE DATE
df_DataUnion['date_simple'] = df_DataUnion['date_submit'].astype(str).str[:10]
df_DataUnion['date_simple'] = df_DataUnion['date_simple'].replace({'-': ''}, regex=True)

Extract the previous month

PreviousMonth = today - pd.DateOffset(months=1)
print(PreviousMonth.strftime("%Y-%m"))

Convert false date field from Excel

Excel can concatenate true date format and string in the same column, arrgh! Later when you would use it in Pandas...

I use this tip to manage these kind of malicious fields in pandas:

df.loc[~df_GDC['date'].str.contains(' ', na=False), 'date'] = df['date'].astype(str).str.split(expand=True)[0]
df.loc[~df_GDC['date'].str.contains(' ', na=False), 'date'] = \
    df['date'].astype(str).str.split('-', expand=True)[2] + '/' +\
    df['date'].astype(str).str.split('-', expand=True)[1] + '/' +\
    df['date'].astype(str).str.split('-', expand=True)[0]

Eventually I put the code above in a try:... except:...

Eventually I convert all my field in string before, with astype(str)

Manage numeric date from Excel

Sometimes dates from Excel come as integer, arrgghhh!

import xlrd
 
def read_date(date):
    return xlrd.xldate.xldate_as_datetime(date, 0)
 
df.loc[df['My date'].astype(str).str.isdigit() == True, 'My date'] = \
pd.to_datetime(df['My date'].loc[df['My date'].astype(str).str.isdigit()].astype(int).apply(read_date), errors='coerce')

Check date format

Sometimes (especilally when it comes from Excel...) it is good to check your dates after treatments.

Below we search dates not starting with DD/MM/YYYY:

df2 = df[~df['my date'].str.contains('^[0-9][0-9]/[0-9][0-9]/[0-9][0-9][0-9][0-9]', regex=True)]
 
print(tabulate(df2, headers='keys', tablefmt='psql', showindex=False))

Check if a date is in the future

There are many ways, but here converting in number:

from datetime import date, datetime
 
today = date.today()
 
df_ProblemeDate = df[['My date']][pd.to_numeric(df['My date'].str.split('/').str[2] + df['My date'].str.split('/').str[1] + df['My date'].str.split('/').str[0]) > int(str(today).replace('-', ''))]

Most recent date

mostRecentDate = df['Date'].max()

Most old date

mostOldDate = df['Date'].min()

Period between 2 dates

periodFromToday = today - mostOldDate
print(periodFromToday.days)

Count opening days with the same date column between the 2 following lines 

Imagine each line of your dataframe is a delivery, with a Date field:

import pandas as pd
from tabulate import tabulate
 
from datetime import date
import numpy as np
import datetime as dt
 
...
df['Year'] = df['Date'].str.split('/').str[2]
df['Month'] = df['Date'].str.split('/').str[1]
df['Day'] = df['Date'].str.split('/').str[0]
 
df['Opening days'] = ''
 
df.reset_index(drop=True, inplace=True)
 
for i, row in df.iterrows():
    dateDeliveryYear = int(row['Year'])
    dateDeliveryMonth = int(row['Month'])
    dateDeliveryDay = int(row['Day'])
 
    if i == 0:
        df.at[i, 'Opening days'] = 'First delivery'
 
    if i > 0:
        beforeDateDeliveryYear = int(df.iloc[[i-1]]['Year'].item())
        beforeDateDeliveryMonth = int(df.iloc[[i-1]]['Month'].item())
        beforeDateDeliveryDay = int(df.iloc[[i-1]]['Day'].item())
 
        start = dt.date(beforeDateDeliveryYear,beforeDateDeliveryMonth,beforeDateDeliveryDay)
        end = dt.date(dateDeliveryYear,dateDeliveryMonth,dateDeliveryDay)
 
        df.at[i, 'Opening days'] = np.busday_count(start, end)
 
print('\n' + u)
print(tabulate(df.head(10), headers='keys', tablefmt='psql', showindex=True))
print(df.shape[0])

 

Liens ou pièces jointes
Télécharger ce fichier (France-Departements-Deformation.zip)France-Departements-Deformation.zip[France-Departements-Deformation]335 Ko
Télécharger ce fichier (simple_countries.zip)simple_countries.zip[simple_countries]1880 Ko