
Analyze data from a multi-valued field

Suppose you have a field containing values coming from a multi-valued drop-down list, and you want to count how many times each value occurs.

If each cell held a single value, a simple GROUP BY in SQL would do the job.

But here we first have to split each cell to recover the individual values; only then can we count them.
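Before walking through the steps, here is a minimal end-to-end sketch of the idea, using a small hypothetical DataFrame in place of the Excel file (the column name 'Your field' and the sample values are made up):

```python
import pandas as pd

# Hypothetical sample standing in for the Excel data: each cell may hold
# several comma-separated values.
df = pd.DataFrame({'Your field': ['A, B', 'B', 'A, C']})

# Split each cell into columns, melt those columns into one, then count.
temp_df = pd.DataFrame(pd.melt(df['Your field'].str.split(', ', expand=True))['value'])
counts = temp_df.groupby(['value']).size()
print(counts)
```

Here 'A' and 'B' each appear twice across the records, and 'C' once. The rest of the article details each step.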

Pandas

Split the data

Start by splitting the data with str.split, specifying the separator and expand=True, to create new columns holding one value each.

import pandas as pd
 
df = pd.read_excel('C:/Users/Georges/Downloads/My Excel file.xlsx',
                   sheet_name='Sheet1', engine='openpyxl')
 
serie = df['Your field'].str.split(', ', expand=True)
 
print(serie)

Or, to simply join the split columns to the current data:

df_Split = df.join(df['Your field'].str.split(', ', expand=True))

Great! Now we get virtual columns containing only one value each, with as many columns as necessary.
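To see what that padding looks like, here is a quick sketch on a hypothetical sample (the column name and values are made up): the record with the most values dictates the number of columns, and shorter records are padded with missing values.

```python
import pandas as pd

# Hypothetical sample: the first record holds three values, the others fewer.
df = pd.DataFrame({'Your field': ['A, B, C', 'B', 'A, C']})

# Three columns (0, 1, 2) are created; shorter records are padded with None.
serie = df['Your field'].str.split(', ', expand=True)
print(serie)
```

Keep that padding in mind: it will come back to bite us when we handle empty values later.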

Melt the virtual data

Now let's melt our virtual columns into a single one.

import pandas as pd
 
df = pd.read_excel('C:/Users/Georges/Downloads/My Excel file.xlsx',
                   sheet_name='Sheet1', engine='openpyxl')
 
serieMelt = pd.melt(df['Your field'].str.split(', ',expand=True))['value']
 
print(serieMelt)

pd.melt labels its output columns variable and value; the value column is the one we keep.
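As a quick check, a hypothetical two-record sample shows the column labels melt produces (sample data is made up):

```python
import pandas as pd

# Hypothetical sample to inspect melt's output columns.
df = pd.DataFrame({'Your field': ['A, B', 'C']})
melted = pd.melt(df['Your field'].str.split(', ', expand=True))

# melt produces two columns: 'variable' (the source column) and 'value'.
print(melted.columns.tolist())
```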

Count unique values

Now just put our melted data in a new data frame, then apply a groupby with size() (the equivalent of a COUNT in SQL).

The full code will be:

import pandas as pd
 
df = pd.read_excel('C:/Users/Georges/Downloads/My Excel file.xlsx', sheet_name='Sheet1', engine='openpyxl')
 
temp_df = pd.DataFrame(pd.melt(df['Your field'].str.split(', ', expand=True))['value'])
 
df_Count = pd.DataFrame(temp_df.groupby(['value']).size(), columns=['Total']).reset_index()
 
print(df_Count)

We add a reset_index() to recover proper column names, so we can use them later.
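To illustrate the effect, here is a sketch on hypothetical melted data: without reset_index(), the grouped value stays in the index; with it, we get two clean columns.

```python
import pandas as pd

# Hypothetical melted data, as produced by the split-and-melt step.
temp_df = pd.DataFrame({'value': ['A', 'A', 'B']})

# Without reset_index(), 'value' would stay in the index instead of being a column.
df_Count = pd.DataFrame(temp_df.groupby(['value']).size(), columns=['Total']).reset_index()
print(df_Count.columns.tolist())
```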

Percentage

OK, but suppose you also want the recurrence percentage of each unique value. The usual method would take the total from the melted data, giving strange results. Indeed the melted data holds one row per value occurrence from the multi-valued field, so the denominator is far too large!

To get real percentages you have to use shape[0] to count the actual number of rows in your original data frame, and use that in the percentage calculation.

We start by creating a new field in the results data frame, then we calculate the percentages and round them.

df_Count['Percent'] = (df_Count['Total'] / df.shape[0]) * 100
df_Count['Percent'] = df_Count['Percent'].round(decimals=2)

Of course, the percentages will not add up to 100%; they can exceed it, which is normal since one record can hold several values.
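A hypothetical two-record sample makes this concrete: 'A' appears in both records (100%) and 'B' in one (50%), so the percentages sum to 150% (sample data is made up):

```python
import pandas as pd

# Hypothetical sample: 2 records, 'A' appears in both, 'B' in one.
df = pd.DataFrame({'Your field': ['A, B', 'A']})
temp_df = pd.DataFrame(pd.melt(df['Your field'].str.split(', ', expand=True))['value'])
df_Count = pd.DataFrame(temp_df.groupby(['value']).size(), columns=['Total']).reset_index()

# Divide by the number of original records (df.shape[0]), not by len(temp_df).
df_Count['Percent'] = ((df_Count['Total'] / df.shape[0]) * 100).round(2)
print(df_Count)
```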

Empty values management

Good. But what about our empty values? Just add dropna=False to see them.

import pandas as pd
 
df = pd.read_excel('C:/Users/Georges/Downloads/My Excel file.xlsx', sheet_name='Sheet1', engine='openpyxl')
 
temp_df = pd.DataFrame(pd.melt(df['Your field'].str.split(', ', expand=True))['value'])
df_Count = pd.DataFrame(temp_df.groupby(['value'], dropna=False).size(), columns=['Total']).reset_index()
 
df_Count['Percent'] = (df_Count['Total'] / df.shape[0]) * 100
df_Count['Percent'] = df_Count['Percent'].round(decimals=2)
 
print(df_Count)

Any problem? 

Arrgg! The groupby counted the empty values from the melted data. But the split created as many columns as the record with the most values, padding shorter records with empty cells, and melt turned every one of those padding cells into a row. This skews both the count and the percentage.

We have to cheat to fix that.

  1. import pandas as pd
  2.  
  3. df = pd.read_excel('C:/Users/Georges/Downloads/My Excel file.xlsx', sheet_name='Sheet1', engine='openpyxl')
  4.  
  5. temp_df = pd.DataFrame(pd.melt(df['Your field'].str.split(', ', expand=True))['value'])
  6. df_Count = pd.DataFrame(temp_df.groupby(['value'], dropna=False).size(), columns=['Total']).reset_index()
  7. df_Count = df_Count.fillna('AZERTY')
  8.  
  9. df_Count['Percent'] = (df_Count['Total'] / df.shape[0]) * 100
  10. df_Count['Percent'] = df_Count['Percent'].round(decimals=2)
  11.  
  12. valuesEmpty = df['Your field'].isna().sum()
  13. valuesEmptyPercent = round((valuesEmpty / df.shape[0]) * 100, 2)
  14.  
  15. df_Count.loc[(df_Count['value'] == 'AZERTY')] = [['Unknown', valuesEmpty, valuesEmptyPercent]]
  16. df_Count = df_Count.sort_values(['Total'], ascending=False)
  17.  
  18. print(df_Count)

The 7th line adds a placeholder value (AZERTY) in the results data frame to mark the row with the wrong empty-values calculation.

Lines 12 and 13 count the empty values and their percentage separately, from the original column, and store the results in dedicated variables.

Then line 15 replaces the wrong row with our new label and variables, using loc to identify the line to fix.

Finally we can sort our data frame with sort_values(['Total'], ascending=False).
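The whole fix can be sketched on a small hypothetical sample, where the last record is genuinely empty (column name and values are made up): the melted data reports 3 empty cells (one padding cell plus two from the empty record), but the corrected row shows the real count of 1 empty record.

```python
import pandas as pd

# Hypothetical sample: the last record has no value at all.
df = pd.DataFrame({'Your field': ['A, B', 'A', None]})
temp_df = pd.DataFrame(pd.melt(df['Your field'].str.split(', ', expand=True))['value'])
df_Count = pd.DataFrame(temp_df.groupby(['value'], dropna=False).size(), columns=['Total']).reset_index()
df_Count = df_Count.fillna('AZERTY')  # mark the skewed empty-values row

df_Count['Percent'] = ((df_Count['Total'] / df.shape[0]) * 100).round(2)

# Count the truly empty records from the original column.
valuesEmpty = df['Your field'].isna().sum()
valuesEmptyPercent = round((valuesEmpty / df.shape[0]) * 100, 2)

# Replace the skewed row with the real count and percentage.
df_Count.loc[df_Count['value'] == 'AZERTY'] = [['Unknown', valuesEmpty, valuesEmptyPercent]]
df_Count = df_Count.sort_values(['Total'], ascending=False)
print(df_Count)
```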
