Data Cleaning Part 2 Notes 1

October 01, 2020

You’ve seen most of the functions we often use to diagnose a dataset for cleaning. Some of the most useful ones are:

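.head(): display the first 5 rows of the table (pass a number to show more or fewer)
.info(): display a summary of the table, including column names, data types, and non-null counts
.describe(): display summary statistics (count, mean, quartiles, and so on) for the numeric columns
.columns: the column names of the table
.value_counts(): display how many times each distinct value appears in a column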
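
As a quick illustration of these diagnostics, here is a minimal sketch run on a small table. The DataFrame and its column names (name, grade, score) are invented for this example and are not part of any real dataset.

```python
import io

import pandas as pd

# Hypothetical sample data, invented purely for illustration
df = pd.DataFrame({
    "name": ["Ana", "Ben", "Cho", "Dev", "Eli", "Fay"],
    "grade": [9, 10, 9, 11, 10, 9],
    "score": [81.0, 74.5, None, 90.0, 68.0, 77.5],
})

first_rows = df.head()                      # first 5 rows by default
column_names = list(df.columns)             # column labels
stats = df.describe()                       # summary statistics for the numeric columns
grade_counts = df["grade"].value_counts()   # how often each distinct grade appears

# .info() prints its summary instead of returning it, so capture it in a buffer
buffer = io.StringIO()
df.info(buf=buffer)
info_text = buffer.getvalue()

print(first_rows)
print(column_names)
print(stats)
print(grade_counts)
print(info_text)
```

Note that .describe() reports a count of 5 for score here, because the missing value is skipped; comparing that count with the number of rows is a quick way to spot gaps.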

Dealing with Multiple Files

Often, you have the same data separated out into multiple files.

Let’s say that we have a ton of files following the filename structure: 'file1.csv', 'file2.csv', 'file3.csv', and so on. The power of pandas is mainly in being able to manipulate large amounts of structured data, so we want to be able to get all of the relevant information into one table so that we can analyze the aggregate data.

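We can combine glob, a module in the Python standard library for finding files, with pandas to organize this data better. glob does not open the files itself; it returns the list of filenames that match a shell-style wildcard pattern (such as *), which is not the same thing as a regular expression. We can then read each matching file with pandas: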

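# import codecademylib3_seaborn  # Codecademy-specific helper; not needed elsewhere
import glob
import pandas as pd

# Collect every filename that matches the wildcard pattern
student_files = glob.glob("exams*.csv")

# Read each file into its own DataFrame
df_list = []
for filename in student_files:
    data = pd.read_csv(filename)
    df_list.append(data)

# Stack all of the DataFrames into one table
students = pd.concat(df_list)

print(students)
print(len(students))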

This code goes through any file whose name starts with 'exams' and ends with .csv (for the 'file1.csv', 'file2.csv' naming described earlier, the pattern would be "file*.csv"). It opens each file, reads the data into a DataFrame, and then concatenates all of those DataFrames together.
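
To try this approach without any existing files, here is a self-contained sketch that first writes a few small CSV files to a temporary directory and then combines them. The file contents and the extra notes.csv file are invented for illustration. Two choices here are additions rather than part of the code above: the matches are sorted because glob does not promise any particular order, and ignore_index=True is passed to pd.concat so that the combined table gets a fresh 0..n-1 index instead of repeating each file's own row numbers.

```python
import glob
import os
import tempfile

import pandas as pd

# Write three small hypothetical CSV files into a temporary directory
tmp_dir = tempfile.mkdtemp()
for i in range(1, 4):
    sample = pd.DataFrame({
        "student": [f"s{i}a", f"s{i}b"],
        "score": [70 + i, 80 + i],
    })
    sample.to_csv(os.path.join(tmp_dir, f"exams{i}.csv"), index=False)

# A file that should NOT match the pattern
pd.DataFrame({"x": [1]}).to_csv(os.path.join(tmp_dir, "notes.csv"), index=False)

# "*" is a wildcard that matches any run of characters in the filename
pattern = os.path.join(tmp_dir, "exams*.csv")
matched = sorted(glob.glob(pattern))

# Read every matching file and stack the results into one table
combined = pd.concat(
    [pd.read_csv(path) for path in matched],
    ignore_index=True,
)

print([os.path.basename(path) for path in matched])
print(combined)
print(len(combined))
```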

Reshaping your Data

For analysis, we want our data in a form that has:

Each variable as a separate column
Each row as a separate observation

When a table instead spreads one variable across several columns, we can use pd.melt() to reshape it. pd.melt() takes a DataFrame and the columns to unpack:

pd.melt(frame=df, id_vars='name', value_vars=['Checking','Savings'], value_name="Amount", var_name="Account Type")

The parameters you provide are:

frame: the DataFrame you want to melt
id_vars: the column(s) of the old DataFrame to preserve
value_vars: the column(s) of the old DataFrame that you want to turn into variables
value_name: what to call the column of the new DataFrame that stores the values
var_name: what to call the column of the new DataFrame that stores the variables
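
To make these parameters concrete, here is a minimal sketch that applies the same call to a tiny table. The two account holders and their balances are invented for illustration.

```python
import pandas as pd

# Hypothetical wide table: one row per person, one column per account type
df = pd.DataFrame({
    "name": ["Ana", "Ben"],
    "Checking": [100, 250],
    "Savings": [500, 75],
})

# Unpack the Checking and Savings columns into (Account Type, Amount) pairs
long_df = pd.melt(
    frame=df,
    id_vars="name",
    value_vars=["Checking", "Savings"],
    value_name="Amount",
    var_name="Account Type",
)

print(long_df)
```

The result has one row per person per account type (four rows in this case), with the columns name, Account Type, and Amount.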

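If you leave out var_name and value_name, pandas names the new columns "variable" and "value". Those defaults may work in certain situations, but it’s best to always have data that is self-explanatory. Thus, we often rename the columns after melting. Note that .columns is an attribute, not a method, so we assign a list of names to it rather than calling it:

df.columns = ["Account", "Account Type", "Amount"]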
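
Here is a minimal sketch of renaming after a melt, using an invented account table. It shows the default column names, then two ways to replace them: assigning a full list to the .columns attribute, and calling DataFrame.rename() with a mapping, which only touches the columns you name.

```python
import pandas as pd

# Hypothetical wide table, invented for illustration
df = pd.DataFrame({
    "name": ["Ana", "Ben"],
    "Checking": [100, 250],
    "Savings": [500, 75],
})

# Melt without var_name/value_name to see the default column names
melted = pd.melt(frame=df, id_vars="name", value_vars=["Checking", "Savings"])
default_names = list(melted.columns)

# Option 1: assign a complete list of new names (one per column, in order)
melted.columns = ["Account", "Account Type", "Amount"]

# Option 2: rename only selected columns with a mapping
renamed = pd.melt(
    frame=df, id_vars="name", value_vars=["Checking", "Savings"]
).rename(columns={"variable": "Account Type", "value": "Amount"})

print(default_names)
print(list(melted.columns))
print(list(renamed.columns))
```

Assigning to .columns requires exactly one name per column, so the mapping form is often the safer choice when only a few columns need new names.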

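Applying the same idea to the students table, where fractions and probability each hold the score for one exam:

students = pd.melt(
    frame=students,
    id_vars=['full_name', 'gender_age', 'grade'],
    value_vars=['fractions', 'probability'],
    value_name='score',
    var_name='exam',
)
print(students.head())
# Each exam name should now appear once per student
print(students.exam.value_counts())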
