Page 18 sur 25
Group, count and sort
List unique values
print(df_Deduped.devise.unique())
Count rows
number = df.shape[0] print(number)
Count values (group by)
If you have a field named Country in a data-frame named df
:
CountCountry = df.groupby(['Country'], dropna=False).size().sort_values(ascending=False) print(CountCountry)
Count values (group by) and get it in a dataframe
CountCommune = pd.DataFrame(df.groupby(['id_commune'], dropna=False).size(), columns=['Total']).sort_values(['Total'], ascending=False).reset_index() print(tabulate(CountCommune.head(10), headers='keys', tablefmt='psql', showindex=False))
Count values grouping by several fields to find duplicates
df_Duplicates = pd.DataFrame(df.groupby(['id_commune', 'PAGES'], dropna=False).size(), columns=['Total'])\ .sort_values(['Total'], ascending=False).reset_index() df_Duplicates = df_Duplicates[(df_Duplicates['Total'] > 1)] print(tabulate(df_Duplicates.head(10), headers='keys', tablefmt='psql', showindex=False))
Group concat
The field to concat (here below city) have to be as text (so maybe do a little .astype(str)
juste before).
df = df.groupby(['country'])['city'].apply(lambda x: ', '.join(x)).reset_index()
And to sort your field, just do a little sorted(x)
in the lambda
function.
Count a specific value
print (len(df[df['Your field'] == 'United States']))
Count with condition
myCount = df[df['Field where search the condition'] == 'A value'].count()['Field to count']
Count empty values
number = df['Your field'].isna().sum()
Keep empty values in a group by and fill them
newDf = pd.DataFrame(df.groupby(['Your field'], dropna=False).size(), columns=['Total']).sort_values(['Total'], ascending=False).reset_index() newDf = newDf.fillna('Unknow')
Count email domains (group by)
You have a Email field with clean emails. Thanks to Pandas it is easy to group and count their DNS.
First create a virtual field to split the emails and recover the 2nd part of the split. Then just group your new field.
df['Domain'] = df['Email'].str.split(<a href="mailto:'@').str">'@').str</a>[1] df_Email_DNS_count = pd.DataFrame(df.groupby(['Domain']).size(), columns=['Total'])\ .sort_values(['Total'], ascending=False).reset_index()
Count the recurrence of a character in a field
df['my_count'] = df['my_field'].str.count(';')
Calculate percentages
Acording your needs, you can use the sum
from a Total field.
df['Percent'] = (df['Total'] / df['Total'].sum()) * 100 df['Percent'] = df['Percent'].round(decimals=2)
Or the number of rows in your data-frame (useful if you work with a multi-valued field, see above the Analyze data from a multi-valued field chapter) .
df['Percent'] = (df['Total'] / df.shape[0]) * 100 df['Percent'] = df['Percent'].round(decimals=2)
Sum group by year
df['Year'] = df['DateInvoice'].dt.to_period('Y') df_AmountYear = df.groupby(['Year'], dropna=False)['AmountEuro'].sum().reset_index() df_AmountYear = df_AmountYear.sort_values(['Year'], ascending=False) print('\nAmount per year:') print(tabulate(df_AmountYear, headers='keys', tablefmt='psql', showindex=False))
Sort on a field
df.sort_values(by=['Field'], ascending=False, inplace=True) # Recreate the index if necessary (if you will use the index later) df = df.reset_index(drop=True)
Sort on a field containing both text and numerical values
df = df.loc[pd.to_numeric(df['My field'], errors='coerce').sort_values().index]
Sum and count
If you want sum an Amount field per country counting the number of lines:
df_AmountCountries = df.groupby(['Country'], dropna=False)\ .agg({'ID': 'count', 'AmountEuro': 'sum'})\ .reset_index()\ .rename(columns={'ID': 'Number'})