Interactive Session 5B

A cartoon panda is getting a bubble bath. MidJourney 5

Getting Started

Before we begin our interactive session, please follow these steps to set up your Jupyter Notebook:

Open JupyterLab and create a new notebook:
- Click on the + button in the top left corner
- Select Python 3.10.0 from the Notebook options
Rename your notebook:
- Right-click on the Untitled.ipynb tab
- Select “Rename”
- Name your notebook with the format: Session_XY_Topic.ipynb (Replace X with the day number and Y with the session number)
Add a title cell:
- In the first cell of your notebook, change the cell type to “Markdown”
- Add the following content (replace the placeholders with the actual information):

# Day X: Session Y - [Session Topic]

[Link to session webpage]

Date: [Current Date]

Add a code cell:
- Below the title cell, add a new cell
- Ensure it’s set as a “Code” cell
- This will be where you start writing your Python code for the session
Throughout the session:
- Take notes in Markdown cells
- Copy or write code in Code cells
- Run cells to test your code
- Ask questions if you need clarification

Caution

Remember to save your work frequently by clicking the save icon or using the keyboard shortcut (Ctrl+S or Cmd+S).

Let’s begin our interactive session!

Introduction to Data Cleaning

Data cleaning is a crucial step in the data science workflow. It involves identifying and correcting errors, inconsistencies, and inaccuracies in datasets to ensure the quality and reliability of your analysis.

In this session, we’ll explore common issues in dataframes and learn how to address them using pandas.

Instructions

We will work through this material together, writing a new notebook as we go.

✏️ This symbol designates code you should add to your notebook and run.

🤓 Where useful, this session contains links to Pandas Tutor, which helps you to visualize the chained functions in the accompanying code block.

Let’s start by importing pandas and creating a sample dataframe with some issues we’ll need to clean:

✏️ Try it. Add the cell below to your notebook and run it.

Code

import pandas as pd
import numpy as np

# Create a sample dataframe with issues
data = {
    'species': ['Oak', 'Pine', 'Maple', 'Oak', 'Pine', None],
    'height_m': [5.2, 12.0, '7.5', 5.2, 15.0, 8.1],
    'diameter_cm': [20, 35, 25, 20, 40, np.nan],
    'location': ['Park A', 'Park B', 'Park A', 'Park A', 'Park B', 'Park C '],
    'date_planted': ['2020-01-15', '2019-05-20', '2020-03-10', '2020-01-15', '2018-11-30', '2021-07-05']
}

df = pd.DataFrame(data)
print(df)

  species height_m  diameter_cm location date_planted
0     Oak      5.2         20.0   Park A   2020-01-15
1    Pine     12.0         35.0   Park B   2019-05-20
2   Maple      7.5         25.0   Park A   2020-03-10
3     Oak      5.2         20.0   Park A   2020-01-15
4    Pine     15.0         40.0   Park B   2018-11-30
5    None      8.1          NaN  Park C    2021-07-05

Handling Missing Values

Identifying Missing Values

First, let’s check for missing values in our dataframe. For this we use the isnull() method on the dataframe.

✏️ Try it. Add the cell below to your notebook and run it.

Code

print(df.isnull())

   species  height_m  diameter_cm  location  date_planted
0    False     False        False     False         False
1    False     False        False     False         False
2    False     False        False     False         False
3    False     False        False     False         False
4    False     False        False     False         False
5     True     False         True     False         False

You can see that the isnull() command returns a Booelan (True or False) value for each item in the dataframe. If the location (row, column) is empty, then the isnull() command will return True, otherwise it returns False.

We can apply the sum() method to the result of df.isnull() to see what columns have empty values in them.

Important

The axis argument is often used in pandas and numpy to indicate how an aggregation (e.g. sum()) should be applied. You should read this argument as an answer to the question:

What should I apply this aggregation across, rows (axis 0) or columns (axis 1)?

df.sum(axis=0) adds up all the rows and returns a single sum for each column.

df.sum(axis=1) adds up all the columns and returns a single sum for each row.

Generally, aggregations over all rows are more useful than aggregations across all columns, so the default for pandas and numpy aggregations is to apply aggregations and dataframe operations assuming axis=0. However, as we’ll see, other commands default to axis=1.

Some commands allow you to use alias string arguments (rows and columns), but this isn’t universal across the libarary.

✏️ Try it. Add the cell below to your notebook and run it.

Code

# Use the axis argument as an integer (0)
null_values = df.isnull()
print("Using `axis=0`:\n",
        null_values.sum(axis=0))

# Show that this is the same as using the `axis='rows'` argument:
print("\nUsing `axis='rows':\n",
        null_values.sum(axis='rows'))

# And that this is the same as the default behavior:
print("\nUsing default arguments:\n",
        null_values.sum())

Using `axis=0`:
 species         1
height_m        0
diameter_cm     1
location        0
date_planted    0
dtype: int64

Using `axis='rows':
 species         1
height_m        0
diameter_cm     1
location        0
date_planted    0
dtype: int64

Using default arguments:
 species         1
height_m        0
diameter_cm     1
location        0
date_planted    0
dtype: int64

As we requested, this command sums up all the rows in each column of null_values. Any False is a 0 and any True is a 1, so the result is the number of null values in each column of the dataframe.

Method chaining allows us to do both the finding of null values and the summing of values for all rows in each column with a single line of code

✏️ Try it. Add the cell below to your notebook and run it.

Code

# Use method chaining to make our code more concise.
df.isnull().sum(axis='rows')

species         1
height_m        0
diameter_cm     1
location        0
date_planted    0
dtype: int64

🤓 Pandas Tutor

Dropping Missing Values

We can drop rows with missing values using the dropna() function:

Code

df_dropped = df.dropna()
print(df_dropped)

  species height_m  diameter_cm location date_planted
0     Oak      5.2         20.0   Park A   2020-01-15
1    Pine     12.0         35.0   Park B   2019-05-20
2   Maple      7.5         25.0   Park A   2020-03-10
3     Oak      5.2         20.0   Park A   2020-01-15
4    Pine     15.0         40.0   Park B   2018-11-30

Notice how we didn’t need to specify an axis - by default, dropna() operates on each row and removes and rows that are any missing values (the default is axis='rows').

Filling Missing Values (Imputation)

When you drop data (using methods like dropna() or drop()), you’re permanently removing information from your dataset.

This can potentially lead to:

Loss of important insights
Biased results
Reduced statistical power
Smaller sample size, which can affect the reliability of your analysis

Imputation is the process of replacing missing values with substituted values. Instead of dropping rows or columns with missing data, you fill in the gaps.

Common imputation techniques include:

Mean/median/mode imputation
Forward fill or backward fill
Interpolation
Using machine learning models to predict missing values

Other techniques:

Creating a “missing” category for categorical variables
Using algorithms that can handle missing data (like some decision tree-based methods)
Multiple imputation for more rigorous statistical analysis

When to consider alternatives:

When missing data is not completely at random (MCAR)
When you have a small dataset and can’t afford to lose samples
When the missing data might contain important information about your problem

You can use any of the pandas Series aggregation commands to fill missing values instead of dropping the data.

✏️ Try it. Add the cell below to your notebook and run it.

Code

# Fill missing values with a specific value
df['species'] = df['species'].fillna('unknown')

print(df)

   species height_m  diameter_cm location date_planted
0      Oak      5.2         20.0   Park A   2020-01-15
1     Pine     12.0         35.0   Park B   2019-05-20
2    Maple      7.5         25.0   Park A   2020-03-10
3      Oak      5.2         20.0   Park A   2020-01-15
4     Pine     15.0         40.0   Park B   2018-11-30
5  unknown      8.1          NaN  Park C    2021-07-05

We can even use aggregations to fill with values derived from our dataframe.

For example, let’s replace missing values of diameter_cm with the average value across all the rows.

✏️ Try it. Add the cell below to your notebook and run it.

Code

# Fill missing numeric values with the mean of the column
df['diameter_cm'] = df['diameter_cm'].fillna(df['diameter_cm'].mean())

🤓 Pandas Tutor

Dealing with Duplicates

Identifying and Removing Duplicate Rows

Let’s check for and remove duplicate rows:

✏️ Try it. Add the cell below to your notebook and run it.

Code

# Check for duplicates
print(df.duplicated())

0    False
1    False
2    False
3     True
4    False
5    False
dtype: bool

It looks like row 3 is a duplicate (it is the same as row 0). As before, we can see how many rows are duplicated by applying the sum command to the result of df.duplicated()

✏️ Try it. Add the cell below to your notebook and run it.

Code

df.duplicated().sum()

np.int64(1)

🤓 Pandas Tutor

The drop_duplicates() method returns a new dataframe that only contains the first row of any duplicated rows.

✏️ Try it. Add the cell below to your notebook and run it.

Code

df_no_duplicates = df.drop_duplicates()
print(df_no_duplicates)

   species height_m  diameter_cm location date_planted
0      Oak      5.2         20.0   Park A   2020-01-15
1     Pine     12.0         35.0   Park B   2019-05-20
2    Maple      7.5         25.0   Park A   2020-03-10
4     Pine     15.0         40.0   Park B   2018-11-30
5  unknown      8.1         28.0  Park C    2021-07-05

The extra entry for Oak no longer appears in df_no_duplicates.

What if we wanted to simply get rid of the duplicates in our original df without having to make an entirely new dataframe? the inplace option allows for this with many pandas methods:

df.drop_duplicates(inplace=True)

Important

While inplace=True can be useful when making changes to a dataframe without having to worry about creating a copy, you can’t do method chaining when using this argument.

✏️ Try it. Add the cell below to your notebook and run it.

Code

# Make a copy of our dataframe
df2 = df.copy()
# Remove the duplicates from df2 without making a new dataframe (save results back into df2)
df2.drop_duplicates(inplace=True)
print(df2)

   species height_m  diameter_cm location date_planted
0      Oak      5.2         20.0   Park A   2020-01-15
1     Pine     12.0         35.0   Park B   2019-05-20
2    Maple      7.5         25.0   Park A   2020-03-10
4     Pine     15.0         40.0   Park B   2018-11-30
5  unknown      8.1         28.0  Park C    2021-07-05

Handling Duplicates Based on Specific Columns

We can also remove duplicates based on specific columns, in this case removing any rows that share the same species and location.

✏️ Try it. Add the cell below to your notebook and run it.

Code

df_unique_species = df.drop_duplicates(subset=['species', 'location'])
print(df_unique_species)

   species height_m  diameter_cm location date_planted
0      Oak      5.2         20.0   Park A   2020-01-15
1     Pine     12.0         35.0   Park B   2019-05-20
2    Maple      7.5         25.0   Park A   2020-03-10
5  unknown      8.1         28.0  Park C    2021-07-05

Although our two Pines weren’t duplciates (their height_m, diameter_cm, and date_planted were different), we still dropped them from the dataframe based on the subset of columns (species and location)

🤓 Pandas Tutor

Data Type Conversion and Consistency

Checking and Changing Data Types

Often datasets - especially those collected by surveys or forms - contain a mixture of data types (i.e. some strings mixed in with mostly numbers)

Let’s check the data types of our columns and convert them as needed:

✏️ Try it. Add the cell below to your notebook and run it.

Code

print(df.dtypes)

species          object
height_m         object
diameter_cm     float64
location         object
date_planted     object
dtype: object

The object datatype is a generic term meaning “something, I don’t know what” in python (remember, in python everything is an object).

Generally, we want our data types to be something more specific, like a floating point number, an integer, a string, or a date. We can use the astype() method to coerce our data into a specific kind of thing.

In older versions of pandas, string columns were always still listed as type object. They are functionally str objects, but pandas isn’t storing them in any special “pandas” way, so they are just generic python objects. Newer versions of pandas allow you to create string (note: not the same as str) data types. They are optimized for use in pandas, although you will rarely see any difference in performance, it’s good practice to use them when you can.

Let’s convert height_m to float.

✏️ Try it. Add the cell below to your notebook and run it.

Code

# Convert 'height_m' to float
df['height_m'] = df['height_m'].astype(float)
print(df.dtypes)

species          object
height_m        float64
diameter_cm     float64
location         object
date_planted     object
dtype: object

Converting generic objects to datetime is more complicated. In fact, we’ll have an entire session later this class on working with dates. Pandas has a helper function - pd.to_datetime() - that tries to infer dates from values in columns.

✏️ Try it. Add the cell below to your notebook and run it.

Code

# Convert 'date_planted' to datetime
df['date_planted'] = pd.to_datetime(df['date_planted'])

print(df.dtypes)

species                 object
height_m               float64
diameter_cm            float64
location                object
date_planted    datetime64[ns]
dtype: object

String Manipulation and Formatting

We can use string methods to clean text data. We access these methods using the .str attribute that is part of every pandas Series.

Note

Remember, every column in a DataFrame is a Series

✏️ Try it. Add the cell below to your notebook and run it.

Code

print("'unkown' should be capitalized")
print(df)

# Capitalize species names (unknown -> Unknown)
df['species'] = df['species'].str.capitalize()

print("\nFixed it!")
print(df)

'unkown' should be capitalized
   species  height_m  diameter_cm location date_planted
0      Oak       5.2         20.0   Park A   2020-01-15
1     Pine      12.0         35.0   Park B   2019-05-20
2    Maple       7.5         25.0   Park A   2020-03-10
3      Oak       5.2         20.0   Park A   2020-01-15
4     Pine      15.0         40.0   Park B   2018-11-30
5  unknown       8.1         28.0  Park C    2021-07-05

Fixed it!
   species  height_m  diameter_cm location date_planted
0      Oak       5.2         20.0   Park A   2020-01-15
1     Pine      12.0         35.0   Park B   2019-05-20
2    Maple       7.5         25.0   Park A   2020-03-10
3      Oak       5.2         20.0   Park A   2020-01-15
4     Pine      15.0         40.0   Park B   2018-11-30
5  Unknown       8.1         28.0  Park C    2021-07-05

✏️ Try it. Add the cell below to your notebook and run it.

Code

print("'Park C ' should be 'Park C'")
print(df)

# Remove leading/trailing whitespace from location
# "Park C " -> "Park C"
df['location'] = df['location'].str.strip()

print("\nFixed it!")
print(df)

'Park C ' should be 'Park C'
   species  height_m  diameter_cm location date_planted
0      Oak       5.2         20.0   Park A   2020-01-15
1     Pine      12.0         35.0   Park B   2019-05-20
2    Maple       7.5         25.0   Park A   2020-03-10
3      Oak       5.2         20.0   Park A   2020-01-15
4     Pine      15.0         40.0   Park B   2018-11-30
5  Unknown       8.1         28.0  Park C    2021-07-05

Fixed it!
   species  height_m  diameter_cm location date_planted
0      Oak       5.2         20.0   Park A   2020-01-15
1     Pine      12.0         35.0   Park B   2019-05-20
2    Maple       7.5         25.0   Park A   2020-03-10
3      Oak       5.2         20.0   Park A   2020-01-15
4     Pine      15.0         40.0   Park B   2018-11-30
5  Unknown       8.1         28.0   Park C   2021-07-05

Wrap-up and Best Practices

In this session, we’ve covered essential techniques for cleaning dataframes in pandas: - Handling missing values - Dealing with duplicates - Converting data types - String manipulation and formatting

Remember these best practices:

Always examine your data before and after cleaning steps.
Remember that the default for most operations is to act across all rows (axis=0).
Document your cleaning steps for reproducibility.
Be cautious when dropping data - sometimes imputation or other techniques might be more appropriate.

For more advanced cleaning techniques and in-depth explanations, refer to the pandas documentation Pandas Documentation, the master Pandas Cheat Sheet, or our class Cleaning Data Cheatsheet

End interactive session 5B