Data Cleaning Part 2 Notes 3 Splitting the Character
Splitting by Character
Let’s say we have a column called “type” with data entries in the format "admin_US"
or "user_Kenya"
. Just like we saw before, this column actually contains two types of data. One seems to be the user type (with values like “admin” or “user”) and one seems to be the country this user is in (with values like “US” or “Kenya”).
We can no longer just split along the first 4 characters because admin
and user
are of different lengths. Instead, we know that we want to split along the "_"
. Using that, we can split this column into two separate, cleaner columns:
# Create the 'str_split' column
df['str_split'] = df.type.str.split('_')
# Create the 'usertype' column
df['usertype'] = df.str_split.str.get(0)
# Create the 'country' column
df['country'] = df.str_split.str.get(1)
This would transform a table like:
id | type |
---|---|
1011 | “user_Kenya” |
1112 | “admin_US” |
1113 | “moderator_UK” |
into a table like:
id | type | country | usertype |
---|---|---|---|
1011 | “user_Kenya” | “Kenya” | “user” |
1112 | “admin_US” | “US” | “admin” |
1113 | “moderator_UK” | “UK” | “moderator” |
Each column of a DataFrame can hold items of the same data type or dtype. The dtypes that pandas uses are: float, int, bool, datetime, timedelta, category and object. Often, we want to convert between types so that we can do better analysis. If a numerical category like "num_users"
is stored as a Series of object
s instead of int
s, for example, it makes it more difficult to do something like make a line graph of users over time.
To see the types of each column of a DataFrame, we can use:
print(df.dtypes)
For a DataFrame like this:
item | price | calories |
---|---|---|
“banana” | “$1” | 105 |
“apple” | “$0.75” | 95 |
“peach” | “$3” | 55 |
“clementine” | “$2.5” | 35 |
the .dtypes
attribute would be:
item object
price object
calories int64
dtype: object
We can see that the dtype
of the dtypes
attribute itself is an object
! It is a Series
object, which you have already worked with. Series objects compose all DataFrames.
We can see that the price
column is made up of object
s, which will probably make our analysis of price more difficult. We’ll look at how to convert columns into numeric values in the next few exercises.
Sometimes we need to modify strings in our DataFrames to help us transform them into more meaningful metrics. For example, in our fruits table from before:
item | price | calories |
---|---|---|
“banana” | “$1” | 105 |
“apple” | “$0.75” | 95 |
“peach” | “$3” | 55 |
“peach” | “$4” | 55 |
“clementine” | “$2.5” | 35 |
We can see that the 'price'
column is actually composed of strings representing dollar amounts. This column could be much better represented in floats, so that we could take the mean, calculate other aggregate statistics, or compare different fruits to one another in terms of price.
First, we can use what we know of regex to get rid of all of the dollar signs:
fruit.price = fruit['price'].replace('[\$,]', '', regex=True)
Then, we can use the pandas function .to_numeric()
to convert strings containing numerical values to integers or floats:
fruit.price = pd.to_numeric(fruit.price)
Comments
Post a Comment