Data Cleaning Part 2 Notes 1

 You’ve seen most of the functions we often use to diagnose a dataset for cleaning. Some of the most useful ones are:

  • .head() — display the first 5 rows of the table
  • .info() — display a summary of the table
  • .describe() — display the summary statistics of the table
  • .columns — display the column names of the table
  • .value_counts() — display each distinct value in a column and how often it occurs
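
As a quick illustration of the functions listed above, here is a minimal sketch run on a small made-up table. The DataFrame and its column names (name, grade, score) are hypothetical and exist only for this example.

```python
import pandas as pd

# Hypothetical toy data, made up purely for illustration.
df = pd.DataFrame({
    "name": ["Ana", "Ben", "Cy", "Dee", "Eli", "Fay"],
    "grade": ["9th", "9th", "10th", "10th", "10th", "11th"],
    "score": [88, 92, 75, 60, 95, 81],
})

first_rows = df.head()                      # first 5 rows by default
print(first_rows)

df.info()                                   # prints dtypes and non-null counts

print(df.describe())                        # summary statistics for numeric columns

print(df.columns)                           # the column labels

grade_counts = df["grade"].value_counts()   # how often each distinct value appears
print(grade_counts)
```

Note that .columns is an attribute rather than a method, so it is written without parentheses.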

Dealing with Multiple Files

Often, you have the same data separated out into multiple files.

Let’s say that we have a ton of files following the filename structure: 'file1.csv', 'file2.csv', 'file3.csv', and so on. The power of pandas is mainly in being able to manipulate large amounts of structured data, so we want to be able to get all of the relevant information into one table so that we can analyze the aggregate data.

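We can combine glob, a Python standard-library module for finding files whose names match a pattern, with pandas to organize this data better. glob does not open the files itself. It matches filenames using shell-style wildcards such as * (not regular expressions) and returns a list of the matching filenames: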

import glob
import pandas as pd

# Collect every filename that matches the wildcard pattern
student_files = glob.glob("exams*.csv")

# Read each file into its own DataFrame
df_list = []
for filename in student_files:
  data = pd.read_csv(filename)
  df_list.append(data)

# Stack all of the DataFrames into a single table
students = pd.concat(df_list)

print(students)
print(len(students))

This code goes through every file whose name starts with 'exams' and has an extension of .csv. It reads each file into a DataFrame and then concatenates all of those DataFrames together.
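
To try the glob-and-concatenate pattern without any real exam files, the sketch below first writes three tiny CSV files into a temporary folder and then combines them. The file contents and the column names (full_name, fractions) are assumptions invented for this example. Sorting the filenames and passing ignore_index=True are optional choices made here for a reproducible order and a clean row index.

```python
import glob
import os
import tempfile

import pandas as pd

# Write three tiny CSV files into a temporary folder (hypothetical data).
tmp_dir = tempfile.mkdtemp()
for i in range(1, 4):
    sample = pd.DataFrame({
        "full_name": [f"Student {i}a", f"Student {i}b"],
        "fractions": [70 + i, 80 + i],
    })
    sample.to_csv(os.path.join(tmp_dir, f"exams{i}.csv"), index=False)

# "*" is a shell-style wildcard that matches any run of characters.
pattern = os.path.join(tmp_dir, "exams*.csv")

# glob returns paths in arbitrary order, so sort them for a stable result.
student_files = sorted(glob.glob(pattern))

# Read every matching file into its own DataFrame.
df_list = [pd.read_csv(filename) for filename in student_files]

# ignore_index=True renumbers the rows 0..n-1 instead of repeating
# each file's own 0, 1 row labels.
students = pd.concat(df_list, ignore_index=True)

print(student_files)
print(students)
print(len(students))
```

Without ignore_index=True the combined table keeps each file's original row labels, so the same label can appear several times.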


Reshaping your Data

For analysis, we want our data to have:

  • Each variable as a separate column
  • Each row as a separate observation

We can use pd.melt() to do this transformation. pd.melt() takes in a DataFrame and the columns to unpack:

pd.melt(frame=df, id_vars='name', value_vars=['Checking','Savings'], value_name="Amount", var_name="Account Type")

The parameters you provide are:

  • frame: the DataFrame you want to melt
  • id_vars: the column(s) of the old DataFrame to preserve
  • value_vars: the column(s) of the old DataFrame to unpack into rows; their names become the entries of the new variable column
  • value_name: what to call the column of the new DataFrame that stores the values
  • var_name: what to call the column of the new DataFrame that stores the variables
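
To make the parameters above concrete, here is a small sketch that applies the same pd.melt() call to a made-up table. The two customers and their balances are hypothetical values chosen only for illustration.

```python
import pandas as pd

# Hypothetical wide table: one row per person, one column per account type.
df = pd.DataFrame({
    "name": ["Ana", "Ben"],
    "Checking": [100, 250],
    "Savings": [900, 40],
})

# Unpack the Checking and Savings columns into rows.
long_df = pd.melt(
    frame=df,
    id_vars="name",
    value_vars=["Checking", "Savings"],
    value_name="Amount",
    var_name="Account Type",
)

print(long_df)
```

The result has one row per (name, account type) pair: four rows in total, with the columns name, Account Type, and Amount.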

The default names may work in certain situations, but it’s best to always have data that is self-explanatory. Thus, we often assign a new list of names to the .columns attribute to rename the columns after melting:

df.columns = ["Account", "Account Type", "Amount"]
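
The sketch below shows what the default names look like when value_name and var_name are left out, and two ways to rename the columns afterwards. The one-row table is hypothetical, and the .rename() variant is an alternative added here for illustration, not something the notes above use.

```python
import pandas as pd

# Hypothetical one-row table, for illustration only.
df = pd.DataFrame({"name": ["Ana"], "Checking": [100], "Savings": [900]})

# With no value_vars, every column other than id_vars is melted.
melted = pd.melt(frame=df, id_vars="name")
default_names = list(melted.columns)
print(default_names)

# Option 1: replace all column names at once by assigning to .columns.
melted.columns = ["Account", "Account Type", "Amount"]
print(melted)

# Option 2: rename only selected columns by their current names.
renamed = pd.melt(frame=df, id_vars="name").rename(
    columns={"variable": "Account Type", "value": "Amount"}
)
print(renamed)
```

Assigning to .columns relies on the order of the columns, while .rename() matches them by name, which can be safer when the column order is not certain.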

# Unpack the two exam columns of the students table into rows
students = pd.melt(frame=students, id_vars=['full_name', 'gender_age', 'grade'],
                   value_vars=['fractions', 'probability'],
                   value_name='score', var_name='exam')
print(students.head())
print(students.exam.value_counts())
