Data Cleaning Part 2 Notes 3 Splitting the Character

 Splitting by Character

Let’s say we have a column called “type” with data entries in the format "admin_US" or "user_Kenya". Just like we saw before, this column actually contains two types of data. One seems to be the user type (with values like “admin” or “user”) and one seems to be the country this user is in (with values like “US” or “Kenya”).

We can no longer just split along the first 4 characters because admin and user are of different lengths. Instead, we know that we want to split along the "_". Using that, we can split this column into two separate, cleaner columns:

# Create the 'str_split' column df['str_split'] = df.type.str.split('_') # Create the 'usertype' column df['usertype'] = df.str_split.str.get(0) # Create the 'country' column df['country'] = df.str_split.str.get(1)

This would transform a table like:

idtype
1011“user_Kenya”
1112“admin_US”
1113“moderator_UK”

into a table like:

idtypecountryusertype
1011“user_Kenya”“Kenya”“user”
1112“admin_US”“US”“admin”
1113“moderator_UK”“UK”“moderator”


Looking at Types

Each column of a DataFrame can hold items of the same data type or dtype. The dtypes that pandas uses are: float, int, bool, datetime, timedelta, category and object. Often, we want to convert between types so that we can do better analysis. If a numerical category like "num_users" is stored as a Series of objects instead of ints, for example, it makes it more difficult to do something like make a line graph of users over time.

To see the types of each column of a DataFrame, we can use:

print(df.dtypes)

For a DataFrame like this:

itempricecalories
“banana”“$1”105
“apple”“$0.75”95
“peach”“$3”55
“clementine”“$2.5”35

the .dtypes attribute would be:

item object price object calories int64 dtype: object

We can see that the dtype of the dtypes attribute itself is an object! It is a Series object, which you have already worked with. Series objects compose all DataFrames.

We can see that the price column is made up of objects, which will probably make our analysis of price more difficult. We’ll look at how to convert columns into numeric values in the next few exercises.

name_split = students['full_name'].str.split(" ")
students['first_name'] = name_split.str.get(0)
students['last_name'] = name_split.str.get(1)

print(students.head())

String Parsing

Sometimes we need to modify strings in our DataFrames to help us transform them into more meaningful metrics. For example, in our fruits table from before:

itempricecalories
“banana”“$1”105
“apple”“$0.75”95
“peach”“$3”55
“peach”“$4”55
“clementine”“$2.5”35

We can see that the 'price' column is actually composed of strings representing dollar amounts. This column could be much better represented in floats, so that we could take the mean, calculate other aggregate statistics, or compare different fruits to one another in terms of price.

First, we can use what we know of regex to get rid of all of the dollar signs:

fruit.price = fruit['price'].replace('[\$,]', '', regex=True)

Then, we can use the pandas function .to_numeric() to convert strings containing numerical values to integers or floats:

fruit.price = pd.to_numeric(fruit.price)

Comments

Popular posts from this blog

Binomial Test in Python

Slicing and Indexing in Python Pandas

Python Syntax and Functions Part2 (Summary Statistics)