Data Cleaning Part 2 Notes 4 String Parsing Advance

 More String Parsing

Sometimes we want to do analysis on numbers that are hidden within string values. We can use regex to extract this numerical data from the strings they are trapped in. Suppose we had this DataFrame df representing a workout regimen:

dateexerciseDescription
10/18/2018“lunges - 30 reps”
10/18/2018“squats - 20 reps”
10/18/2018“deadlifts - 25 reps”
10/18/2018“jumping jacks - 30 reps”
10/19/2018“lunges - 40 reps”
10/19/2018“chest flyes - 15 reps”

It would be helpful to separate out data like "30 lunges" into 2 columns with the number of reps, "30", and the type of exercise, "lunges". Then, we could compare the increase in the number of lunges done over time, for example.

To extract the numbers from the string we can use pandas’ .str.split() function:

split_df = df['exerciseDescription'].str.split('(\d+)', expand=True)

which would result in this DataFrame split_df:

* *012
0“lunges - ““30”“reps”
1“squats - ““20”“reps”
2“deadlifts - ““25”“reps”
3“jumping jacks - ““30”“reps”
4“lunges - ““40”“reps”
5“chest flyes - ““15”“reps”

Then, we can assign columns from this DataFrame to the original df:

df.reps = pd.to_numeric(split_df[1]) df.exercise = split_df[2].replace('[\- ]', '', regex=True)





print(students.grade.head())
students['grade'] = students['grade'].str.split('(\d+)',expand=True)[1]

students.grade = pd.to_numeric(students.grade)
print(students.dtypes)
avg_grade = students.grade.mean()
print(avg_grade)
print(students.head())

Missing Values

We often have data with missing elements, as a result of a problem with the data collection process or errors in the way the data was stored. The missing elements normally show up as NaN (or Not a Number) values:

daybilltipnum_guests
“Mon”10.111
“Mon”20.755.52
“Tue”19.955.5NaN
“Wed”44.10153
“Wed”NaN11

The num_guests value for the 3rd row is missing, and the bill value for the 5th row is missing. Some calculations we do will just skip the NaN values, but some calculations or visualizations we try to perform will break when a NaN is encountered.

Most of the time, we use one of two methods to deal with missing values.

Method 1: drop all of the rows with a missing value

We can use .dropna() to do this:

bill_df = bill_df.dropna()

This command will result in the DataFrame without the incomplete rows:

daybilltipnum_guests
“Mon”10.111
“Mon”20.755.52
“Wed”44.10153

If we wanted to remove every row with a NaN value in the num_guests column only, we could specify a subset:

bill_df = bill_df.dropna(subset=['num_guests'])

Method 2: fill the missing values with the mean of the column, or with some other aggregate value.

We can use .fillna() to do this:

bill_df = bill_df.fillna(value={"bill":bill_df.bill.mean(), "num_guests":bill_df.num_guests.mean()})

import codecademylib3_seaborn
import pandas as pd
from students import students

score_mean = students.score.mean()

print(score_mean)

students = students.fillna(0)

score_mean_2 = students.score.mean()

print(score_mean_2)

Comments

Popular posts from this blog

Binomial Test in Python

Slicing and Indexing in Python Pandas

Python Syntax and Functions Part2 (Summary Statistics)