Data Cleaning Part 2 Notes 4 String Parsing Advance
More String Parsing
Sometimes we want to do analysis on numbers that are hidden within string values. We can use regex to extract this numerical data from the strings they are trapped in. Suppose we had this DataFrame df
representing a workout regimen:
date | exerciseDescription |
---|---|
10/18/2018 | “lunges - 30 reps” |
10/18/2018 | “squats - 20 reps” |
10/18/2018 | “deadlifts - 25 reps” |
10/18/2018 | “jumping jacks - 30 reps” |
10/19/2018 | “lunges - 40 reps” |
10/19/2018 | “chest flyes - 15 reps” |
… | … |
It would be helpful to separate out data like "30 lunges"
into 2 columns with the number of reps, "30"
, and the type of exercise, "lunges"
. Then, we could compare the increase in the number of lunges done over time, for example.
To extract the numbers from the string we can use pandas’ .str.split()
function:
split_df = df['exerciseDescription'].str.split('(\d+)', expand=True)
which would result in this DataFrame split_df
:
* * | 0 | 1 | 2 |
---|---|---|---|
0 | “lunges - “ | “30” | “reps” |
1 | “squats - “ | “20” | “reps” |
2 | “deadlifts - “ | “25” | “reps” |
3 | “jumping jacks - “ | “30” | “reps” |
4 | “lunges - “ | “40” | “reps” |
5 | “chest flyes - “ | “15” | “reps” |
… | … | … | … |
Then, we can assign columns from this DataFrame to the original df
:
df.reps = pd.to_numeric(split_df[1])
df.exercise = split_df[2].replace('[\- ]', '', regex=True)
print(students.grade.head())students['grade'] = students['grade'].str.split('(\d+)',expand=True)[1]
students.grade = pd.to_numeric(students.grade)print(students.dtypes)avg_grade = students.grade.mean()print(avg_grade)print(students.head())
Missing ValuesWe often have data with missing elements, as a result of a problem with the data collection process or errors in the way the data was stored. The missing elements normally show up as NaN
(or Not a Number) values:
day bill tip num_guests “Mon” 10.1 1 1 “Mon” 20.75 5.5 2 “Tue” 19.95 5.5 NaN “Wed” 44.10 15 3 “Wed” NaN 1 1
The num_guests
value for the 3rd row is missing, and the bill
value for the 5th row is missing. Some calculations we do will just skip the NaN
values, but some calculations or visualizations we try to perform will break when a NaN
is encountered.
Most of the time, we use one of two methods to deal with missing values.
Method 1: drop all of the rows with a missing value
We can use .dropna()
to do this:
bill_df = bill_df.dropna()
This command will result in the DataFrame without the incomplete rows:
day bill tip num_guests “Mon” 10.1 1 1 “Mon” 20.75 5.5 2 “Wed” 44.10 15 3
If we wanted to remove every row with a NaN
value in the num_guests column only, we could specify a subset
:
bill_df = bill_df.dropna(subset=['num_guests'])
Method 2: fill the missing values with the mean of the column, or with some other aggregate value.
We can use .fillna()
to do this:
bill_df = bill_df.fillna(value={"bill":bill_df.bill.mean(), "num_guests":bill_df.num_guests.mean()})
import codecademylib3_seabornimport pandas as pdfrom students import students
score_mean = students.score.mean()
print(score_mean)
students = students.fillna(0)
score_mean_2 = students.score.mean()
print(score_mean_2)
Comments
Post a Comment