Getting Started
Before we begin our interactive session, please follow these steps to set up your Jupyter Notebook:
Open JupyterLab and create a new notebook:
Click on the + button in the top left corner
Select Python 3.10.0 from the Notebook options
Rename your notebook:
Right-click on the Untitled.ipynb tab
Select “Rename”
Name your notebook with the format: Session_XY_Topic.ipynb (Replace X with the day number and Y with the session number)
Add a title cell:
In the first cell of your notebook, change the cell type to “Markdown”
Add the following content (replace the placeholders with the actual information):
# Day X: Session Y - [Session Topic]
[Link to session webpage]
Date: [Current Date]
Add a code cell:
Below the title cell, add a new cell
Ensure it’s set as a “Code” cell
This will be where you start writing your Python code for the session
Throughout the session:
Take notes in Markdown cells
Copy or write code in Code cells
Run cells to test your code
Ask questions if you need clarification
Caution
Remember to save your work frequently by clicking the save icon or using the keyboard shortcut (Ctrl+S or Cmd+S).
Let’s begin our interactive session!
Introduction to Data Cleaning
Data cleaning is a crucial step in the data science workflow. It involves identifying and correcting errors, inconsistencies, and inaccuracies in datasets to ensure the quality and reliability of your analysis.
In this session, we’ll explore common issues in dataframes and learn how to address them using pandas.
Instructions
We will work through this material together, writing a new notebook as we go.
✏️ This symbol designates code you should add to your notebook and run.
🤓 Where useful, this session contains links to Pandas Tutor, which helps you to visualize the chained functions in the accompanying code block.
Let’s start by importing pandas and creating a sample dataframe with some issues we’ll need to clean:
✏️ Try it. Add the cell below to your notebook and run it.
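The original code for this cell isn't reproduced here, but based on the printed output below, a reconstruction of the sample dataframe looks roughly like this (the exact values and column order in your session materials may differ slightly):
Code
import pandas as pd

# Sample tree dataset with issues to clean: a missing species, a missing
# diameter, an exact duplicate row (row 3), a location with trailing
# whitespace, and height/date columns stored as text so we can practice
# converting types later.
df = pd.DataFrame({
    'species': ['Oak', 'Pine', 'Maple', 'Oak', 'Pine', None],
    'height_m': ['5.2', '12.0', '7.5', '5.2', '15.0', '8.1'],
    'diameter_cm': [20.0, 35.0, 25.0, 20.0, 40.0, None],
    'location': ['Park A', 'Park B', 'Park A', 'Park A', 'Park B', 'Park C '],
    'date_planted': ['2020-01-15', '2019-05-20', '2020-03-10',
                     '2020-01-15', '2018-11-30', '2021-07-05'],
})
print(df)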
species height_m diameter_cm location date_planted
0 Oak 5.2 20.0 Park A 2020-01-15
1 Pine 12.0 35.0 Park B 2019-05-20
2 Maple 7.5 25.0 Park A 2020-03-10
3 Oak 5.2 20.0 Park A 2020-01-15
4 Pine 15.0 40.0 Park B 2018-11-30
5 None 8.1 NaN Park C 2021-07-05
Handling Missing Values
Identifying Missing Values
First, let’s check for missing values in our dataframe. For this we use the isnull() method on the dataframe.
✏️ Try it. Add the cell below to your notebook and run it.
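The code for this cell isn't shown above; a minimal version is simply:
Code
# Return a Boolean (True/False) value for every cell: True where the value is missing
df.isnull()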
You can see that the isnull() command returns a Boolean (True or False) value for each item in the dataframe. If the value at a location (row, column) is missing, then the isnull() command returns True; otherwise it returns False.
We can apply the sum() method to the result of df.isnull() to see what columns have empty values in them.
Important
The axis argument is often used in pandas and numpy to indicate how an aggregation (e.g. sum()) should be applied. You should read this argument as an answer to the question:
What should I apply this aggregation across, rows (axis 0) or columns (axis 1)?
df.sum(axis=0) adds up all the rows and returns a single sum for each column.
df.sum(axis=1) adds up all the columns and returns a single sum for each row.
Generally, aggregations over all rows are more useful than aggregations across all columns, so pandas and numpy aggregations and dataframe operations default to axis=0. However, as we'll see, other commands default to axis=1.
Some commands also accept string aliases for the axis argument ('rows' and 'columns'), but this isn't universal across the library.
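As a quick illustration (a toy example, not part of the original session materials), here is a tiny numeric dataframe showing the difference between the two axes:
Code
import pandas as pd

toy = pd.DataFrame({'a': [1, 2], 'b': [10, 20]})

# axis=0: sum down the rows, one result per column -> a: 3, b: 30
print(toy.sum(axis=0))

# axis=1: sum across the columns, one result per row -> 0: 11, 1: 22
print(toy.sum(axis=1))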
✏️ Try it. Add the cell below to your notebook and run it.
Code
# Use the axis argument as an integer (0)
null_values = df.isnull()
print("Using `axis=0`:\n", null_values.sum(axis=0))

# Show that this is the same as using the `axis='rows'` argument:
print("\nUsing `axis='rows':\n", null_values.sum(axis='rows'))

# And that this is the same as the default behavior:
print("\nUsing default arguments:\n", null_values.sum())
Using `axis=0`:
species 1
height_m 0
diameter_cm 1
location 0
date_planted 0
dtype: int64
Using `axis='rows':
species 1
height_m 0
diameter_cm 1
location 0
date_planted 0
dtype: int64
Using default arguments:
species 1
height_m 0
diameter_cm 1
location 0
date_planted 0
dtype: int64
As we requested, this command sums up all the rows in each column of null_values. Any False is a 0 and any True is a 1, so the result is the number of null values in each column of the dataframe.
Method chaining allows us to do both the finding of null values and the summing of values for all rows in each column with a single line of code.
✏️ Try it. Add the cell below to your notebook and run it.
Code
# Use method chaining to make our code more concise.
df.isnull().sum(axis='rows')
We can drop rows with missing values using the dropna() method:
Code
df_dropped = df.dropna()
print(df_dropped)
species height_m diameter_cm location date_planted
0 Oak 5.2 20.0 Park A 2020-01-15
1 Pine 12.0 35.0 Park B 2019-05-20
2 Maple 7.5 25.0 Park A 2020-03-10
3 Oak 5.2 20.0 Park A 2020-01-15
4 Pine 15.0 40.0 Park B 2018-11-30
Notice how we didn’t need to specify an axis - by default, dropna() operates on each row and removes any rows that have missing values (the default is axis='rows').
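If you instead wanted to drop any column that contains missing values, you could pass the other axis (a quick illustration, not part of the original session materials):
Code
# Drop columns (rather than rows) that contain any missing values
df_cols_dropped = df.dropna(axis='columns')
print(df_cols_dropped)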
Filling Missing Values (Imputation)
When you drop data (using methods like dropna() or drop()), you’re permanently removing information from your dataset.
This can potentially lead to:
Loss of important insights
Biased results
Reduced statistical power
Smaller sample size, which can affect the reliability of your analysis
Imputation is the process of replacing missing values with substituted values. Instead of dropping rows or columns with missing data, you fill in the gaps.
Common imputation techniques include:
Mean/median/mode imputation
Forward fill or backward fill
Interpolation
Using machine learning models to predict missing values
Other techniques:
Creating a “missing” category for categorical variables
Using algorithms that can handle missing data (like some decision tree-based methods)
Multiple imputation for more rigorous statistical analysis
When to consider alternatives:
When data is not missing completely at random (MCAR)
When you have a small dataset and can’t afford to lose samples
When the missing data might contain important information about your problem
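As a brief aside (a sketch, not from the original session materials), forward fill and interpolation from the lists above look like this on a toy Series:
Code
import pandas as pd
import numpy as np

s = pd.Series([1.0, np.nan, np.nan, 4.0])

# Forward fill: carry the last observed value forward -> [1.0, 1.0, 1.0, 4.0]
print(s.ffill())

# Linear interpolation: estimate values between observations -> [1.0, 2.0, 3.0, 4.0]
print(s.interpolate())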
Instead of dropping data, you can fill missing values with fillna(), using either a fixed value or a value computed with one of the pandas Series aggregation commands.
✏️ Try it. Add the cell below to your notebook and run it.
Code
# Fill missing values with a specific value
df['species'] = df['species'].fillna('unknown')
print(df)
species height_m diameter_cm location date_planted
0 Oak 5.2 20.0 Park A 2020-01-15
1 Pine 12.0 35.0 Park B 2019-05-20
2 Maple 7.5 25.0 Park A 2020-03-10
3 Oak 5.2 20.0 Park A 2020-01-15
4 Pine 15.0 40.0 Park B 2018-11-30
5 unknown 8.1 NaN Park C 2021-07-05
We can even use aggregations to fill with values derived from our dataframe.
For example, let’s replace missing values of diameter_cm with the average value across all the rows.
✏️ Try it. Add the cell below to your notebook and run it.
Code
# Fill missing numeric values with the mean of the column
df['diameter_cm'] = df['diameter_cm'].fillna(df['diameter_cm'].mean())
Handling Duplicates
It looks like row 3 is a duplicate (it is the same as row 0). As before, we can see how many rows are duplicated by applying the sum() command to the result of df.duplicated().
✏️ Try it. Add the cell below to your notebook and run it.
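The original code cell isn't reproduced here; based on the output below, it likely looked something like this (the name df_no_duplicates comes from the text that follows):
Code
# Count how many rows are exact duplicates of an earlier row
print(df.duplicated().sum())

# Create a new dataframe with the duplicate rows removed
df_no_duplicates = df.drop_duplicates()
print(df_no_duplicates)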
species height_m diameter_cm location date_planted
0 Oak 5.2 20.0 Park A 2020-01-15
1 Pine 12.0 35.0 Park B 2019-05-20
2 Maple 7.5 25.0 Park A 2020-03-10
4 Pine 15.0 40.0 Park B 2018-11-30
5 unknown 8.1 28.0 Park C 2021-07-05
The extra entry for Oak no longer appears in df_no_duplicates.
What if we wanted to simply get rid of the duplicates in our original df without having to make an entirely new dataframe? The inplace option allows for this with many pandas methods:
df.drop_duplicates(inplace=True)
Important
While inplace=True can be useful when making changes to a dataframe without having to worry about creating a copy, you can’t do method chaining when using this argument.
✏️ Try it. Add the cell below to your notebook and run it.
Code
# Make a copy of our dataframe
df2 = df.copy()

# Remove the duplicates from df2 without making a new dataframe (save results back into df2)
df2.drop_duplicates(inplace=True)
print(df2)
species height_m diameter_cm location date_planted
0 Oak 5.2 20.0 Park A 2020-01-15
1 Pine 12.0 35.0 Park B 2019-05-20
2 Maple 7.5 25.0 Park A 2020-03-10
4 Pine 15.0 40.0 Park B 2018-11-30
5 unknown 8.1 28.0 Park C 2021-07-05
Handling Duplicates Based on Specific Columns
We can also remove duplicates based on specific columns, in this case removing any rows that share the same species and location.
✏️ Try it. Add the cell below to your notebook and run it.
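Again, the original cell isn't shown; a reconstruction based on the output below looks like this (the variable name is illustrative):
Code
# Keep only the first row for each (species, location) combination
df_unique_locations = df.drop_duplicates(subset=['species', 'location'])
print(df_unique_locations)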
species height_m diameter_cm location date_planted
0 Oak 5.2 20.0 Park A 2020-01-15
1 Pine 12.0 35.0 Park B 2019-05-20
2 Maple 7.5 25.0 Park A 2020-03-10
5 unknown 8.1 28.0 Park C 2021-07-05
Although our two Pines weren’t duplicates (their height_m, diameter_cm, and date_planted were different), we still dropped one of them from the dataframe based on the subset of columns (species and location).
Converting Data Types
The object datatype is a generic term meaning “something, I don’t know what” in Python (remember, in Python everything is an object).
Generally, we want our data types to be something more specific, like a floating point number, an integer, a string, or a date. We can use the astype() method to coerce our data into a specific kind of thing.
In older versions of pandas, string columns were always listed as type object. They are functionally str objects, but pandas isn’t storing them in any special “pandas” way, so they are just generic Python objects. Newer versions of pandas allow you to create string (note: not the same as str) data types, which are optimized for use in pandas. Although you will rarely see any difference in performance, it’s good practice to use them when you can.
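For example (a minimal sketch, not necessarily the exact cell from the session), you can inspect the current types and opt in to the pandas string dtype:
Code
# Inspect the current data types
print(df.dtypes)

# Convert the text columns to pandas' dedicated string dtype
df['species'] = df['species'].astype('string')
df['location'] = df['location'].astype('string')
print(df.dtypes)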
Let’s convert height_m to float.
✏️ Try it. Add the cell below to your notebook and run it.
Code
# Convert 'height_m' to float
df['height_m'] = df['height_m'].astype(float)
print(df.dtypes)
Converting generic objects to datetime is more complicated. In fact, we’ll have an entire session later in this class on working with dates. Pandas has a helper function - pd.to_datetime() - that tries to infer dates from the values in a column.
✏️ Try it. Add the cell below to your notebook and run it.
Code
# Convert 'date_planted' to datetime
df['date_planted'] = pd.to_datetime(df['date_planted'])
print(df.dtypes)
String Manipulation and Formatting
We can use string methods to clean text data. We access these methods using the .str attribute that is part of every pandas Series.
Note
Remember, every column in a DataFrame is a Series
✏️ Try it. Add the cell below to your notebook and run it.
Code
print("'unkown' should be capitalized")print(df)# Capitalize species names (unknown -> Unknown)df['species'] = df['species'].str.capitalize()print("\nFixed it!")print(df)
'unknown' should be capitalized
species height_m diameter_cm location date_planted
0 Oak 5.2 20.0 Park A 2020-01-15
1 Pine 12.0 35.0 Park B 2019-05-20
2 Maple 7.5 25.0 Park A 2020-03-10
3 Oak 5.2 20.0 Park A 2020-01-15
4 Pine 15.0 40.0 Park B 2018-11-30
5 unknown 8.1 28.0 Park C 2021-07-05
Fixed it!
species height_m diameter_cm location date_planted
0 Oak 5.2 20.0 Park A 2020-01-15
1 Pine 12.0 35.0 Park B 2019-05-20
2 Maple 7.5 25.0 Park A 2020-03-10
3 Oak 5.2 20.0 Park A 2020-01-15
4 Pine 15.0 40.0 Park B 2018-11-30
5 Unknown 8.1 28.0 Park C 2021-07-05
✏️ Try it. Add the cell below to your notebook and run it.
Code
print("'Park C ' should be 'Park C'")print(df)# Remove leading/trailing whitespace from location# "Park C " -> "Park C"df['location'] = df['location'].str.strip()print("\nFixed it!")print(df)
'Park C ' should be 'Park C'
species height_m diameter_cm location date_planted
0 Oak 5.2 20.0 Park A 2020-01-15
1 Pine 12.0 35.0 Park B 2019-05-20
2 Maple 7.5 25.0 Park A 2020-03-10
3 Oak 5.2 20.0 Park A 2020-01-15
4 Pine 15.0 40.0 Park B 2018-11-30
5 Unknown 8.1 28.0 Park C 2021-07-05
Fixed it!
species height_m diameter_cm location date_planted
0 Oak 5.2 20.0 Park A 2020-01-15
1 Pine 12.0 35.0 Park B 2019-05-20
2 Maple 7.5 25.0 Park A 2020-03-10
3 Oak 5.2 20.0 Park A 2020-01-15
4 Pine 15.0 40.0 Park B 2018-11-30
5 Unknown 8.1 28.0 Park C 2021-07-05
Wrap-up and Best Practices
In this session, we’ve covered essential techniques for cleaning dataframes in pandas:
Handling missing values
Dealing with duplicates
Converting data types
String manipulation and formatting
Remember these best practices:
Always examine your data before and after cleaning steps.
Remember that the default for most operations is to act across all rows (axis=0).
Document your cleaning steps for reproducibility.
Be cautious when dropping data - sometimes imputation or other techniques might be more appropriate.