EDS 217 - Lecture

Missing Data in Pandas: To Drop or to Impute?

Introduction

  • Data cleaning is crucial in data analysis
  • Missing data is a common challenge
  • Two main approaches:
    1. Dropping missing data
    2. Imputation
  • Understanding the nature of missingness is key

Types of Missing Data

  1. Missing Completely at Random (MCAR)
  2. Missing at Random (MAR)
  3. Missing Not at Random (MNAR)

Missing Completely at Random (MCAR)

  • No relationship between missingness and any values
  • Example: Survey responses lost due to a computer glitch
  • Least problematic type of missing data
  • Dropping MCAR data is generally safe but reduces sample size

MCAR Example (Assigning nan randomly)

import pandas as pd
import numpy as np

# Create sample data with MCAR
np.random.seed(42)
df = pd.DataFrame({'A': np.random.rand(100), 'B': np.random.rand(100)})
# Randomly set 10 of the 100 values in column A to NaN (unrelated to any values)
df.loc[np.random.choice(df.index, 10, replace=False), 'A'] = np.nan
print(df.isnull().sum())
A    10
B     0
dtype: int64

Missing at Random (MAR)

  • Missingness is related to other observed variables
  • Example: Older participants more likely to skip income questions
  • More common in real-world datasets
  • Dropping MAR data can introduce bias

MAR Example (Assigning nan randomly, conditional on another column's value)

# Create sample data with MAR: Income is missing more often for older participants
np.random.seed(42)
df = pd.DataFrame({
    'Age': np.random.randint(18, 80, 100),
    # Store Income as float so np.nan can be assigned without dtype warnings
    'Income': np.random.randint(20000, 100000, 100).astype(float)
})
# For participants over 60, set Income to missing with ~30% probability
df.loc[df['Age'] > 60, 'Income'] = np.where(
    np.random.rand(len(df[df['Age'] > 60])) < 0.3,
    np.nan,
    df.loc[df['Age'] > 60, 'Income']
)
# Fraction of missing incomes among participants over 60
print(df[df['Age'] > 60]['Income'].isnull().sum() / len(df[df['Age'] > 60]))
0.2972972972972973

Missing Not at Random (MNAR)

  • Missingness is related to the missing values themselves
  • Example: People with high incomes more likely to skip income questions
  • Most problematic type of missing data
  • Neither dropping nor simple imputation may be appropriate
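
MNAR Example (Assigning nan based on the value itself)

A minimal sketch in the spirit of the earlier examples: the top-quartile cutoff, the ~40% missingness rate, and the name df_mnar are illustrative choices.

# Create sample data with MNAR: high incomes are more likely to be missing
np.random.seed(42)
df_mnar = pd.DataFrame({'Income': np.random.randint(20000, 100000, 100).astype(float)})

# Incomes in the top quartile go missing with ~40% probability,
# so the missingness depends on the (unobserved) value itself
high_income = df_mnar['Income'] > df_mnar['Income'].quantile(0.75)
df_mnar.loc[high_income & (np.random.rand(len(df_mnar)) < 0.4), 'Income'] = np.nan
print(df_mnar['Income'].isnull().sum())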

Dropping Missing Data

Pros:

  • Simple and quick
  • Maintains the distribution of complete cases
  • Appropriate for MCAR data

Cons:

  • Reduces sample size
  • Can introduce bias for MAR or MNAR data
  • May lose important information

Drop Example

# Dropping missing data
df_dropped = df.dropna()
print(f"Original shape: {df.shape}, After dropping: {df_dropped.shape}")
Original shape: (100, 2), After dropping: (89, 2)
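
Drop Example (Checking for bias)

Because Income is missing mainly for participants over 60 (MAR), dropping those rows under-represents older participants. A quick sketch of that check (exact values depend on the seed):

# Compare the age distribution before and after dropping rows with missing Income
print(f"Mean Age (all rows):     {df['Age'].mean():.1f}")
print(f"Mean Age (after dropna): {df_dropped['Age'].mean():.1f}")
# Expect the complete-case mean to be somewhat lower, since participants
# over 60 were the ones most likely to lose their Income values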

Imputation

Pros:

  • Preserves sample size
  • Can reduce bias for MAR data
  • Allows use of all available information

Cons:

  • Can introduce bias if done incorrectly
  • May underestimate variability
  • Can be computationally intensive for complex methods

Imputation Example

# Simple mean imputation
df_imputed = df.fillna(df.mean())
print(f"Original missing: {df['Income'].isnull().sum()}, After imputation: {df_imputed['Income'].isnull().sum()}")
Original missing: 11, After imputation: 0
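
Imputation Example (Checking variability)

Mean imputation fills every gap with a single value, which is one way it can understate variability. A quick sketch of that check (exact values depend on the seed):

# Compare the spread of Income before and after mean imputation
print(f"Income std (observed values only):  {df['Income'].std():.0f}")
print(f"Income std (after mean imputation): {df_imputed['Income'].std():.0f}")
# Expect a somewhat smaller standard deviation after imputation,
# because the repeated mean values add no spread of their own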

Imputation Methods

  1. Simple imputation:
    • Mean, median, mode
    • Last observation carried forward (LOCF)
  2. Advanced imputation (example on the next slide):
    • Multiple Imputation
    • K-Nearest Neighbors (KNN)
    • Regression imputation
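
Imputation Methods Example

A minimal sketch of a few of the methods listed above, assuming scikit-learn is installed; n_neighbors=5 and the DataFrame names are illustrative choices.

# Simple alternatives to mean imputation
df_median = df.fillna(df.median())  # median is more robust to skewed values
df_locf = df.ffill()                # LOCF: carry the previous observed value forward
                                    # (assumes the row order is meaningful)

# Advanced imputation: K-Nearest Neighbors via scikit-learn
from sklearn.impute import KNNImputer

imputer = KNNImputer(n_neighbors=5)
df_knn = pd.DataFrame(imputer.fit_transform(df), columns=df.columns)
print(df_knn.isnull().sum())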

Best Practices

  1. Understand your data and the missingness mechanism
  2. Visualize patterns of missingness (example on the next slide)
  3. Consider the impact on your analysis
  4. Use appropriate methods based on the type of missingness
  5. Conduct sensitivity analyses
  6. Document your approach and assumptions
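
Visualizing Missingness

A minimal sketch using pandas and matplotlib; the plot styling is an illustrative choice, and dedicated packages such as missingno offer richer views.

import matplotlib.pyplot as plt

# Count missing values per column
print(df.isnull().sum())

# Heatmap-style view: one row per observation, dark cells mark missing entries
plt.imshow(df.isnull(), aspect='auto', interpolation='nearest', cmap='gray_r')
plt.xticks(range(df.shape[1]), df.columns)
plt.xlabel('Column')
plt.ylabel('Row index')
plt.title('Missing-value pattern')
plt.show()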

Conclusion

  • Understanding the nature of missingness is crucial
  • Both dropping and imputation have pros and cons
  • Choose the appropriate method based on:
    • Type of missingness (MCAR, MAR, MNAR)
    • Sample size
    • Analysis goals
  • Always document your approach and conduct sensitivity analyses

Questions?

Thank you for your attention!