Data Cleaning Part 2 Notes 1
You’ve seen most of the functions we often use to diagnose a dataset for cleaning. Some of the most useful ones are:
.head(): display the first 5 rows of the table
.info(): display a summary of the table
.describe(): display the summary statistics of the table
.columns: display the column names of the table
.value_counts(): display the distinct values for a column, with a count of each
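As a quick illustration of these diagnostic calls, here is a minimal sketch on a made-up table. The DataFrame and its column names (name, grade) are assumptions for the example, not data from the lesson.

```python
import pandas as pd

# Hypothetical toy table, invented for this example.
df = pd.DataFrame({
    "name": ["Ana", "Ben", "Cy", "Dee", "Eli", "Fay"],
    "grade": [9, 10, 9, 11, 10, 9],
})

first_rows = df.head()                      # first 5 rows by default
df.info()                                   # prints dtypes and non-null counts
summary = df.describe()                     # summary statistics for numeric columns
column_names = list(df.columns)             # column labels (an attribute, not a method)
grade_counts = df["grade"].value_counts()   # count of each distinct value

print(first_rows)
print(summary)
print(column_names)
print(grade_counts)
```

Note that .columns is an attribute, so it is written without parentheses, while the others are method calls.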
Often, you have the same data separated out into multiple files.
Let’s say that we have a ton of files following the filename structure: 'file1.csv', 'file2.csv', 'file3.csv', and so on. The power of pandas is mainly in being able to manipulate large amounts of structured data, so we want to be able to get all of the relevant information into one table so that we can analyze the aggregate data.
We can combine glob, a Python library for working with files, with pandas to organize this data better. glob can find multiple files by matching filenames against a shell-style wildcard pattern (such as *, which is not full regex) and returning the list of matching names:
import codecademylib3_seaborn  # only available inside the Codecademy environment
import pandas as pd
import glob

df_list = []
student_files = glob.glob("exams*.csv")
for filename in student_files:
    data = pd.read_csv(filename)
    df_list.append(data)

students = pd.concat(df_list)
print(students)
print(len(students))
This code goes through every file whose name matches the pattern passed to glob.glob(): here, any file that starts with 'exams' and has an extension of .csv. It reads each file into a DataFrame, and then concatenates all of those DataFrames together.
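The glob-then-concatenate pattern can be tried end to end without any existing files. The sketch below writes three small CSV files into a temporary directory first; the file contents and column names are made up for illustration. The sorted() call and ignore_index=True are additions of this sketch, not part of the lesson's code.

```python
import glob
import os
import tempfile

import pandas as pd

# Write three small hypothetical CSV files into a temporary directory.
tmp_dir = tempfile.mkdtemp()
for i in range(1, 4):
    part = pd.DataFrame({
        "full_name": [f"Student {i}a", f"Student {i}b"],
        "grade": [9 + i, 9 + i],
    })
    part.to_csv(os.path.join(tmp_dir, f"file{i}.csv"), index=False)

# "*" is a shell-style wildcard that matches any run of characters.
pattern = os.path.join(tmp_dir, "file*.csv")
matched_files = sorted(glob.glob(pattern))  # glob returns names in arbitrary order

# Read every matching file into its own DataFrame.
df_list = [pd.read_csv(path) for path in matched_files]

# ignore_index=True renumbers the rows 0..n-1 in the combined table.
combined = pd.concat(df_list, ignore_index=True)

print(combined)
print(len(combined))
```

Without ignore_index=True, each file's rows keep their original index, so the combined table has repeated index labels (0, 1, 0, 1, ...). Whether that matters depends on how the table is used afterwards.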
Since we want:
- Each variable as a separate column
- Each row as a separate observation
we can use pd.melt() to do this transformation. pd.melt() takes in a DataFrame and the columns to unpack:
pd.melt(frame=df, id_vars='name', value_vars=['Checking','Savings'], value_name="Amount", var_name="Account Type")
The parameters you provide are:
frame: the DataFrame you want to melt
id_vars: the column(s) of the old DataFrame to preserve
value_vars: the column(s) of the old DataFrame that you want to turn into variables
value_name: what to call the column of the new DataFrame that stores the values
var_name: what to call the column of the new DataFrame that stores the variables
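To see what these parameters do, here is a minimal sketch that applies the same pd.melt() call to a small made-up table. The names and amounts are assumptions for illustration only.

```python
import pandas as pd

# Hypothetical wide table: one column per account type.
df = pd.DataFrame({
    "name": ["Ana", "Ben"],
    "Checking": [100, 250],
    "Savings": [500, 50],
})

# Unpack the Checking and Savings columns into rows, keeping name.
long_df = pd.melt(
    frame=df,
    id_vars="name",
    value_vars=["Checking", "Savings"],
    value_name="Amount",
    var_name="Account Type",
)

print(long_df)
```

Each person now appears once per account type: the two value columns have become two rows per name, so the result has four rows and the columns name, Account Type, and Amount.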
The default names ('variable' and 'value') may work in certain situations, but it’s best to always have data that is self-explanatory. Thus, we often assign a new list of names to the .columns attribute to rename the columns after melting:
df.columns = ["Account", "Account Type", "Amount"]
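A short sketch of the renaming step, assuming a made-up one-row table: melting without value_name and var_name produces the default column names, which are then replaced by assigning a list to the .columns attribute.

```python
import pandas as pd

# Hypothetical wide table, invented for this example.
df = pd.DataFrame({"name": ["Ana"], "Checking": [100], "Savings": [500]})

# No value_name / var_name given, so pandas uses its defaults.
melted = pd.melt(frame=df, id_vars="name", value_vars=["Checking", "Savings"])
default_names = list(melted.columns)

# .columns is an attribute: rename by assignment, not by calling it.
melted.columns = ["Account", "Account Type", "Amount"]

print(default_names)
print(melted)
```

The new list must have exactly one name per column, in the existing column order. Passing value_name and var_name directly to pd.melt() avoids the separate renaming step.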
students = pd.melt(
    frame=students,
    id_vars=['full_name', 'gender_age', 'grade'],
    value_vars=['fractions', 'probability'],
    value_name='score',
    var_name='exam'
)
print(students.head())
print(students.exam.value_counts())