Data Cleaning Part 2 Notes 2 Duplicated and Str

October 02, 2020

Dealing with Duplicates

Often we see duplicated rows of data in the DataFrames we are working with. This could happen due to errors in data collection or in saving and loading the data.

To check for duplicates, we can use the pandas function .duplicated(), which will return a Series telling us which rows are duplicate rows.

We can use the pandas .drop_duplicates() function to remove all rows that are duplicates of another row.

If we call fruits.drop_duplicates(), we would get the table:

fruits = df

If we wanted to remove every row with a duplicate value in the item column, we could specify a subset:

fruits = fruits.drop_duplicates(subset=['item'])
Splitting by Index
In trying to get clean data, we want to make sure each column represents one type of measurement. Often, multiple measurements are recorded in the same column, and we want to separate these out so that we can do individual analysis on each variable.
Let’s say we have a column “birthday” with data formatted in MMDDYYYY format. In other words, “11011993” represents a birthday of November 1, 1993. We want to split this data into day, month, and year so that we can use these columns as separate features.
In this case, we know the exact structure of these strings. The first two characters will always correspond to the month, the second two to the day, and the rest of the string will always correspond to year. We can easily break the data into three separate columns by splitting the strings using .str:
# Create the 'month' column
df['month'] = df.birthday.str[0:2]

# Create the 'day' column
df['day'] = df.birthday.str[2:4]

# Create the 'year' column
df['year'] = df.birthday.str[4:]
The first command takes the first two characters of each value in the birthday column and puts it into a month column. The second command takes the second two characters of each value in the birthday column and puts it into a day column. The third command takes the rest of each value in the birthday column and puts it into a year column.
This would transform a table like:
id birthday
1011 “12241989”
1112 “10311966”
1113 “01052011”

into a table like:
id birthday month day year
1011 “12241989” “12” “24” “1989”
1112 “10311966” “10” “31” “1966”
1113 “01052011” “01” “05” “2011”

We will practice changing string columns into numerical columns (like converting "10" to 10) in a future exercise.

print(students.columns)
print(students.head())
students['gender'] = students.gender_age.str[0]
students['age'] = students.gender_age.str[1:]


students = students[['full_name','gender','age','grade','exam','score']]
print(students.columns)
print(students.head())

Search This Blog

Python Data Science Coding Reference

Data Cleaning Part 2 Notes 2 Duplicated and Str

Comments

Post a Comment

Popular posts from this blog

Binomial Test in Python

Slicing and Indexing in Python Pandas

Python Syntax and Functions Part2 (Summary Statistics)

id	birthday	month	day	year
1011	“12241989”	“12”	“24”	“1989”
1112	“10311966”	“10”	“31”	“1966”
1113	“01052011”	“01”	“05”	“2011”