Data Cleaning Part 2 Notes 2 Duplicated and Str
Dealing with Duplicates
Often we see duplicated rows of data in the DataFrames we are working with. This could happen due to errors in data collection or in saving and loading the data.
To check for duplicates, we can use the pandas function .duplicated()
, which will return a Series telling us which rows are duplicate rows.
We can use the pandas .drop_duplicates()
function to remove all rows that are duplicates of another row.
fruits.drop_duplicates()
, we would get the table:If we wanted to remove every row with a duplicate value in the item column, we could specify a subset
:
fruits = fruits.drop_duplicates(subset=['item'])Splitting by IndexIn trying to get clean data, we want to make sure each column represents one type of measurement. Often, multiple measurements are recorded in the same column, and we want to separate these out so that we can do individual analysis on each variable.
Let’s say we have a column “birthday” with data formatted in MMDDYYYY format. In other words, “11011993” represents a birthday of November 1, 1993. We want to split this data into day, month, and year so that we can use these columns as separate features.
In this case, we know the exact structure of these strings. The first two characters will always correspond to the month, the second two to the day, and the rest of the string will always correspond to year. We can easily break the data into three separate columns by splitting the strings using .str
:
# Create the 'month' column
df['month'] = df.birthday.str[0:2]
# Create the 'day' column
df['day'] = df.birthday.str[2:4]
# Create the 'year' column
df['year'] = df.birthday.str[4:]
The first command takes the first two characters of each value in the birthday
column and puts it into a month
column. The second command takes the second two characters of each value in the birthday
column and puts it into a day
column. The third command takes the rest of each value in the birthday
column and puts it into a year
column.
This would transform a table like:
id birthday 1011 “12241989” 1112 “10311966” 1113 “01052011”
into a table like:
id birthday month day year 1011 “12241989” “12” “24” “1989” 1112 “10311966” “10” “31” “1966” 1113 “01052011” “01” “05” “2011”
We will practice changing string columns into numerical columns (like converting "10"
to 10
) in a future exercise.
print(students.columns)print(students.head())students['gender'] = students.gender_age.str[0]students['age'] = students.gender_age.str[1:]
students = students[['full_name','gender','age','grade','exam','score']]print(students.columns)print(students.head())
Comments
Post a Comment