Filter/Extract
Field contains string
df2 = df1[df1['My field'].str.contains('AAA').fillna(False)]
print(tabulate(df2.head(5), headers='keys', tablefmt='psql'))
Field does not contain a string
Just use ~:
df2 = df1[~df1['My field'].str.contains('AAA').fillna(False)]
print(tabulate(df2.head(5), headers='keys', tablefmt='psql'))
Field not empty, not NaN, not null
df_notEmpty = df[
    (df['My field'].notna())       # notnull() is an alias of notna(), one check suffices
    & (df['My field'] != '')       # drop empty strings
    & (df['My field'] != 'nan')    # drop the literal string 'nan'
]
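A minimal check of the chained conditions on made-up data (the sample values below are hypothetical, chosen to cover each case: a real value, an empty string, a true NaN, and the literal string 'nan' as it can appear after a CSV round-trip):

```python
import pandas as pd

df = pd.DataFrame({'My field': ['AAA', '', None, 'nan', 'BBB']})

df_notEmpty = df[
    (df['My field'].notna())       # drops the None/NaN row
    & (df['My field'] != '')       # drops the empty string
    & (df['My field'] != 'nan')    # drops the literal string 'nan'
]
print(df_notEmpty['My field'].tolist())  # ['AAA', 'BBB']
```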
Field starts with
df2 = df1[df1['My field'].str.startswith('A')]
print(tabulate(df2.head(5), headers='keys', tablefmt='psql'))
Or:
df2 = df1[df1['My field'].str.startswith(('A', 'B'))]
print(tabulate(df2.head(5), headers='keys', tablefmt='psql'))
Field is a numeric value
import re

# Function to extract numeric values
def extractNumericValue(my_text):
    my_values = re.findall(r'^-?\d+\.\d+|-?\d+$', str(my_text))
    if my_values:
        return my_values
    else:
        return None

# Apply
df['SearchNumericValue'] = df['My column'].apply(extractNumericValue)
dfResult = df[
    (df['SearchNumericValue'].notna())
    & (~df['My column'].astype(str).str.contains('[a-zA-Z]', regex=True))
]
print(tabulate(dfResult.head(10), headers='keys', tablefmt='psql', showindex=False))
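A quick sketch of the same extraction logic on made-up data, to see which values survive the filter (mixed strings with digits, letters and empty cells are all hypothetical examples):

```python
import re
import pandas as pd

# Same extraction logic as above, demonstrated on toy data.
def extractNumericValue(my_text):
    my_values = re.findall(r'^-?\d+\.\d+|-?\d+$', str(my_text))
    return my_values if my_values else None

df = pd.DataFrame({'My column': ['12.5', '-3', 'abc', '7 dogs', '']})
df['SearchNumericValue'] = df['My column'].apply(extractNumericValue)
dfResult = df[
    df['SearchNumericValue'].notna()
    & ~df['My column'].astype(str).str.contains('[a-zA-Z]', regex=True)
]
print(dfResult['My column'].tolist())  # ['12.5', '-3']
```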
Field contains element from a list
Example searching for space, star and parentheses (raw strings keep the backslashes intact):
MySearch = [r'\*', ' ', r'\(', r'\)']
df2 = df1[df1['zone'].str.contains('|'.join(MySearch)).fillna(False)]
print(tabulate(df2.head(40), headers='keys', tablefmt='psql', showindex=False))
Field equals element from a list
MySearch = ['Bbbb', 'XXX']
df2 = df1[df1['zone'].str.fullmatch('|'.join(MySearch)).fillna(False)]
print(tabulate(df2.head(40), headers='keys', tablefmt='psql', showindex=False))
Or, like a SQL IN:
searchIn = ['AAA', 'BBB', 'CCC']
df2 = df1[df1['My field'].isin(searchIn)]
print(tabulate(df2.head(5), headers='keys', tablefmt='psql', showindex=False))
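A short demo on made-up data, also showing the SQL NOT IN equivalent with `~`:

```python
import pandas as pd

df1 = pd.DataFrame({'My field': ['AAA', 'BBB', 'DDD', 'CCC', 'EEE']})
searchIn = ['AAA', 'BBB', 'CCC']

df_in = df1[df1['My field'].isin(searchIn)]       # SQL IN
df_notin = df1[~df1['My field'].isin(searchIn)]   # SQL NOT IN
print(df_in['My field'].tolist())     # ['AAA', 'BBB', 'CCC']
print(df_notin['My field'].tolist())  # ['DDD', 'EEE']
```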
Field contains only digits
df2 = df1.loc[df1['zone'].astype(str).str.isdigit()]
print(tabulate(df2.head(5), headers='keys', tablefmt='psql'))
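Note that `str.isdigit()` is True only for unsigned integer strings; signs and decimal points make it fail. A sketch on hypothetical values:

```python
import pandas as pd

df1 = pd.DataFrame({'zone': ['123', '-5', '3.14', 'abc', '42']})
df2 = df1.loc[df1['zone'].astype(str).str.isdigit()]
print(df2['zone'].tolist())  # ['123', '42'] — '-5' and '3.14' do not pass
```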
Several conditions
Below we get the lines where one field has null values or another field has empty values:
df2 = df1[(df1['My field'].isna()) | (df1['My other field'] == '')]
print(tabulate(df2.head(5), headers='keys', tablefmt='psql', showindex=False))
Field does not end with a concatenation of other fields
Suppose you have this dataframe df:

Language   | PDF
Python     | docA-Python.pdf
SQL        | docF-SQ.pdf
PHP        | docS-PHP.pdf
CSS        | docG-CSS.pdf
Javascript | docR-Javascript.pdf
HTML       | docW-HTML.pdf
Now you would like to extract the lines where the PDF field does not end with [Language field] + '.pdf':
Compare = df[['Language', 'PDF']].apply(lambda x: x['PDF'].endswith(x['Language'] + '.pdf'), axis=1)
df_temp = pd.DataFrame(Compare)
df_temp.columns = ['True or false']
df_concat = pd.concat([df, df_temp], axis=1)
df_filtered = df_concat[['Language', 'PDF']][df_concat['True or false'] == False]
print(tabulate(df_filtered.head(5), headers='keys', tablefmt='psql', showindex=False))
Language | PDF
SQL      | docF-SQ.pdf
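The same filter can be written more directly with a row-wise mask, skipping the intermediate concat; a sketch on a reduced version of the toy dataframe above:

```python
import pandas as pd

df = pd.DataFrame({
    'Language': ['Python', 'SQL', 'PHP'],
    'PDF': ['docA-Python.pdf', 'docF-SQ.pdf', 'docS-PHP.pdf'],
})

# Row-wise check: does PDF end with '<Language>.pdf'?
mask = df.apply(lambda x: x['PDF'].endswith(x['Language'] + '.pdf'), axis=1)
df_filtered = df[~mask]
print(df_filtered['Language'].tolist())  # ['SQL']
```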
Field does not contain an email 😀
...
df_pb_Email = df[df['Email field'].str.contains(r'[^@]+@[^@]+\.[^@]+') == False]
...
Field does not equal an email
...
df_pb_Email = df[df['Email field'].str.fullmatch(r'\b[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}\b') == False]
...
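A quick check of the fullmatch pattern on made-up addresses. Note that inside a character class `|` is a literal pipe, so `[A-Za-z]` is the safe spelling for the TLD part:

```python
import pandas as pd

df = pd.DataFrame({'Email field': ['john.doe@example.com', 'not-an-email', 'a@b.c']})
pattern = r'[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}'

# Keep the rows that do NOT fully match the email pattern
df_pb_Email = df[df['Email field'].str.fullmatch(pattern) == False]
print(df_pb_Email['Email field'].tolist())  # ['not-an-email', 'a@b.c']
```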
Field is not a number
dfProblem = df[df['My field'].astype(str).str.match(pat=r'^[-+]?[0-9]+$|^[-+]?[0-9]+\.[0-9]+$') == False]
Field contains another field
df['A contains B'] = [b in a for b, a in zip(df['FieldB'], df['FieldA'])]
df = df[df['A contains B'] == False]
Field ends like another field
df['A finish with B'] = [a.endswith(b) for b, a in zip(df['FieldB'], df['FieldA'])]
df = df[df['A finish with B'] == False]
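A small demonstration of the endswith comparison on hypothetical columns:

```python
import pandas as pd

df = pd.DataFrame({'FieldA': ['report_2024', 'summary'],
                   'FieldB': ['2024', 'draft']})

# True where FieldA ends with FieldB
df['A finish with B'] = [a.endswith(b) for b, a in zip(df['FieldB'], df['FieldA'])]
df_bad = df[df['A finish with B'] == False]
print(df_bad['FieldA'].tolist())  # ['summary']
```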
Field with length equal to a number
dfProbleme = df[df['Field1'].str.len() != 7]
Filter counting a string
How many times a string appears in a cell (str.count() takes a regex, so a plain comma needs no escaping):
dfCount = df[df['Substitutions'].str.count(',') > 1]
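A sketch on made-up values, keeping only the cells with more than one comma:

```python
import pandas as pd

df = pd.DataFrame({'Substitutions': ['a,b', 'a,b,c', 'a,b,c,d', 'abc']})
dfCount = df[df['Substitutions'].str.count(',') > 1]
print(dfCount['Substitutions'].tolist())  # ['a,b,c', 'a,b,c,d']
```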
Regex
If the field is equal to (^...$) a number with a comma as decimal separator:
dfTest = df[df['My field'].str.match(pat=r'^\d+,\d+$') == True]
If the field starts (^) with different strings (...|...), themselves allowing several forms ([...]):
dfTest = df[
    (df['My field'].str.contains(pat=r'^[tT]he animal') == True)
    | (df['My field'].str.contains(pat=r'^[aA]ny animal') == True)
    | (df['My field'].str.contains(pat=r'^([sS]ad|[hH]appy) animal') == True)
]
Extract a specific string
# Delete what is before SpecificString
df['NewField'] = df['OldField'].replace({'.*(?=SpecificString)': ''}, regex=True)
# Delete what is after SpecificString
df['NewField'] = df['NewField'].replace({'(?<=SpecificString).*': ''}, regex=True)
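A sketch of the two lookaround replacements on made-up data; 'KEEP' stands in for SpecificString, and a row without the marker is left untouched:

```python
import pandas as pd

df = pd.DataFrame({'OldField': ['junk-before KEEP junk-after', 'no marker here']})

# Delete what is before KEEP (lookahead), then what is after it (lookbehind)
df['NewField'] = df['OldField'].replace({'.*(?=KEEP)': ''}, regex=True)
df['NewField'] = df['NewField'].replace({'(?<=KEEP).*': ''}, regex=True)
print(df['NewField'].tolist())  # ['KEEP', 'no marker here']
```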
Extract names from email
from email_decomposer.EmailDecomposer import EmailDecomposer

email_decomposer = EmailDecomposer()
df[['first_name', 'last_name', 'host']] = df['email'].apply(
    lambda x: pd.Series(email_decomposer.decompose(data=x, get_host=True)))