Data Cleaning Part 2 Notes 4 String Parsing Advance

October 02, 2020

More String Parsing

Sometimes we want to do analysis on numbers that are hidden within string values. We can use regex to extract this numerical data from the strings they are trapped in. Suppose we had this DataFrame df representing a workout regimen:

date	exerciseDescription
10/18/2018	“lunges - 30 reps”
10/18/2018	“squats - 20 reps”
10/18/2018	“deadlifts - 25 reps”
10/18/2018	“jumping jacks - 30 reps”
10/19/2018	“lunges - 40 reps”
10/19/2018	“chest flyes - 15 reps”
…	…

It would be helpful to separate out data like "30 lunges" into 2 columns with the number of reps, "30", and the type of exercise, "lunges". Then, we could compare the increase in the number of lunges done over time, for example.

To extract the numbers from the string we can use pandas’ .str.split() function:

split_df = df['exerciseDescription'].str.split('(\d+)', expand=True)

which would result in this DataFrame split_df:

* *	0	1	2
0	“lunges - “	“30”	“reps”
1	“squats - “	“20”	“reps”
2	“deadlifts - “	“25”	“reps”
3	“jumping jacks - “	“30”	“reps”
4	“lunges - “	“40”	“reps”
5	“chest flyes - “	“15”	“reps”
…	…	…	…

Then, we can assign columns from this DataFrame to the original df:

df.reps = pd.to_numeric(split_df[1])
df.exercise = split_df[2].replace('[\- ]', '', regex=True)





print(students.grade.head())
students['grade'] = students['grade'].str.split('(\d+)',expand=True)[1]

students.grade = pd.to_numeric(students.grade)
print(students.dtypes)
avg_grade = students.grade.mean()
print(avg_grade)
print(students.head())

Missing Values
We often have data with missing elements, as a result of a problem with the data collection process or errors in the way the data was stored. The missing elements normally show up as NaN (or Not a Number) values:
day bill tip num_guests
“Mon” 10.1 1 1
“Mon” 20.75 5.5 2
“Tue” 19.95 5.5 NaN
“Wed” 44.10 15 3
“Wed” NaN 1 1

The num_guests value for the 3rd row is missing, and the bill value for the 5th row is missing. Some calculations we do will just skip the NaN values, but some calculations or visualizations we try to perform will break when a NaN is encountered.
Most of the time, we use one of two methods to deal with missing values.
Method 1: drop all of the rows with a missing value
We can use .dropna() to do this:
bill_df = bill_df.dropna()
This command will result in the DataFrame without the incomplete rows:
day bill tip num_guests
“Mon” 10.1 1 1
“Mon” 20.75 5.5 2
“Wed” 44.10 15 3

If we wanted to remove every row with a NaN value in the num_guests column only, we could specify a subset:
bill_df = bill_df.dropna(subset=['num_guests'])
Method 2: fill the missing values with the mean of the column, or with some other aggregate value.
We can use .fillna() to do this:
bill_df = bill_df.fillna(value={"bill":bill_df.bill.mean(), "num_guests":bill_df.num_guests.mean()})

import codecademylib3_seaborn
import pandas as pd
from students import students

score_mean = students.score.mean()

print(score_mean)

students = students.fillna(0)

score_mean_2 = students.score.mean()

print(score_mean_2)

Search This Blog

Python Data Science Coding Reference

Data Cleaning Part 2 Notes 4 String Parsing Advance

Method 1: drop all of the rows with a missing value

Method 2: fill the missing values with the mean of the column, or with some other aggregate value.

Comments

Post a Comment

Popular posts from this blog

Binomial Test in Python

Slicing and Indexing in Python Pandas

Python Syntax and Functions Part2 (Summary Statistics)

day	bill	tip	num_guests
“Mon”	10.1	1	1
“Mon”	20.75	5.5	2
“Tue”	19.95	5.5	NaN
“Wed”	44.10	15	3
“Wed”	NaN	1	1