Efficient Data Workflows in Pandas

An 8-Step Data Science Workflow in Pandas

When doing data science, you usually conduct at least 3 or 4 of the following steps:

Load ➡️ Clean ➡️ Transform ➡️ Combine ➡️ Group/Aggregate ➡️ Summarize ➡️ Export ➡️ Visualize

The pandas library contains many different functions corresponding to each of these steps.

As you learn data science, it helps to think about what types of functions are most essential for each step.

As you encounter new pandas functions, it will also be helpful to think about which of these steps each function was created to assist with.

Let’s walk through each of these eight steps, take a look at the main pandas functions used in each one, and see some examples of how they are used.

We will use Pandas Tutor links to get deeper insight into how these functions affect our dataframes. The “🐍🧠” links in our slides will take us to a Pandas Tutor explanation of the code.

At the end of this lecture, we’ll learn about a concept called “method chaining” that allows you to combine these steps as part of your pandas data science workflow.

1. Data Loading

Importance

Loading data properly is essential! It serves as the foundation for all subsequent operations.

Common Operations

  • read_csv(): Load data from a CSV file or URL.
  • read_excel(): Load data from an Excel workbook (sketched below).
  • read_sql(): Load the results of a SQL query or database table (sketched below).
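The example below covers read_csv(). For the other two loaders, here are minimal sketches; the file, sheet, and table names are hypothetical and used only for illustration:

import pandas as pd
import sqlite3

# Load from an Excel workbook (requires an engine such as openpyxl)
df_excel = pd.read_excel('pets.xlsx', sheet_name='Sheet1')

# Load the result of a SQL query through a DB-API connection
conn = sqlite3.connect('pets.db')
df_sql = pd.read_sql('SELECT * FROM pets', conn)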

Example: Loading a Dataframe

import pandas as pd

# URL of the CSV file (adjacent string literals inside parentheses are concatenated)
url = ('https://archive.ics.uci.edu/ml/machine-learning-databases'
       '/wine-quality/winequality-red.csv')

# Use read_csv to load the data directly from the URL
# Note: The dataset uses ';' as a separator
df = pd.read_csv(url, sep=';')
# Display the first few rows of the DataFrame
print(df.head())
   fixed acidity  volatile acidity  citric acid  residual sugar  chlorides  \
0            7.4              0.70         0.00             1.9      0.076   
1            7.8              0.88         0.00             2.6      0.098   
2            7.8              0.76         0.04             2.3      0.092   
3           11.2              0.28         0.56             1.9      0.075   
4            7.4              0.70         0.00             1.9      0.076   

   free sulfur dioxide  total sulfur dioxide  density    pH  sulphates  \
0                 11.0                  34.0   0.9978  3.51       0.56   
1                 25.0                  67.0   0.9968  3.20       0.68   
2                 15.0                  54.0   0.9970  3.26       0.65   
3                 17.0                  60.0   0.9980  3.16       0.58   
4                 11.0                  34.0   0.9978  3.51       0.56   

   alcohol  quality  
0      9.4        5  
1      9.8        5  
2      9.8        5  
3      9.8        6  
4      9.4        5  

2. Data Cleaning

Importance

Data cleaning is crucial for correcting or removing inaccurate, corrupted, or irrelevant data from the dataset.

Common Operations

  • dropna(): Removes rows or columns with missing values.
  • fillna(): Fills missing values.
  • astype(): Converts column data types.
  • rename(): Renames columns or index names.
  • drop(): Removes rows or columns that match given labels.

An example DataFrame

import pandas as pd

data = {
    'type': ['Dog', 'Cat', 'Dog', 'Cat', 'Dog'],
    'name': ['Rex', 'Whiskers', 'Buddy', 'Mittens', 'Spot'],
    'age': [5, 5, 2, 4, 6],
    'weight': [20, None, 15, 4, 25]
}

df = pd.DataFrame(data)
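Before the rename example, here are one-line sketches of the other cleaning operations from the list above, applied to this df. These are illustrative only and do not change the running example:

df.dropna()                    # drop rows with any missing value (here, Whiskers)
df.dropna(subset=['weight'])   # drop rows where 'weight' specifically is missing
df['age'].astype(float)        # convert the integer ages to floats
df.drop(columns=['weight'])    # remove the 'weight' column entirely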

Example: Renaming a Column

Code:

cleaned = df.rename(columns={'weight': 'weight_kg'})

🐍🧠

Result:

  type      name  age  weight_kg
0  Dog       Rex    5       20.0
1  Cat  Whiskers    5        NaN
2  Dog     Buddy    2       15.0
3  Cat   Mittens    4        4.0
4  Dog      Spot    6       25.0

Example: Filling missing data

Code:

cleaned = cleaned.fillna({'weight_kg': 7})

🐍🧠

Result:

  type      name  age  weight_kg
0  Dog       Rex    5       20.0
1  Cat  Whiskers    5        7.0
2  Dog     Buddy    2       15.0
3  Cat   Mittens    4        4.0
4  Dog      Spot    6       25.0

3. Data Transformation

Importance

Data transformation involves modifying data to prepare it for analysis, which may include filtering, sorting, or adding new columns.

Common Operations

  • query(): Filter DataFrame using a query expression string.
  • assign(): Add new columns or overwrite existing ones.
  • apply(): Apply a function to rows or columns.
  • sort_values(): Sort by the values of columns.

Transformation Example: Adding a New Column for Age in Months

Code:

transformed = cleaned.assign(age_months=cleaned['age'] * 12)

🐍🧠

Result:

  type      name  age  weight_kg  age_months
0  Dog       Rex    5       20.0          60
1  Cat  Whiskers    5        7.0          60
2  Dog     Buddy    2       15.0          24
3  Cat   Mittens    4        4.0          48
4  Dog      Spot    6       25.0          72
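The list above also mentions query(), sort_values(), and apply(). Here are minimal sketches on the transformed DataFrame; they are illustrative only and not part of the running example:

transformed.query('age >= 5')                          # keep pets aged 5 or older
transformed.sort_values('weight_kg', ascending=False)  # heaviest pets first
transformed['name'].apply(str.upper)                   # uppercase every name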

4. Combining Data

Importance

Combining data is essential when you need to enrich or expand your dataset through additions from other data sources.

Common Operations

  • merge(): Combines DataFrames based on keys.
  • join(): Joins DataFrames using index or key.
  • concat(): Concatenates DataFrames along an axis.

Combining Example: Merging with a Second Dataframe

Code:

toys = pd.DataFrame({
    'name': ['Spot', 'Mittens', 'Buddy', 'Whiskers', 'Rex'],
    'number_of_toys': [6, 3, 5, 8, 2]
})

combined = transformed.merge(toys, on='name', how='left')

🐍🧠

Result:

  type      name  age  weight_kg  age_months  number_of_toys
0  Dog       Rex    5       20.0          60               2
1  Cat  Whiskers    5        7.0          60               8
2  Dog     Buddy    2       15.0          24               5
3  Cat   Mittens    4        4.0          48               3
4  Dog      Spot    6       25.0          72               6
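concat() stacks DataFrames along an axis. Here is a minimal sketch with a hypothetical second batch of pets; the new row is invented purely for illustration:

# Hypothetical extra row, for illustration only
more_pets = pd.DataFrame({
    'type': ['Dog'], 'name': ['Fido'], 'age': [3],
    'weight_kg': [18.0], 'age_months': [36], 'number_of_toys': [4]
})
pd.concat([combined, more_pets], ignore_index=True)  # stack rows, renumber the index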

5. Grouping and Aggregation

Importance

Grouping and aggregation are critical for summarizing data, which can help in identifying patterns or performing segment-wise analysis.

Common Operations

  • groupby(): Group data by columns for aggregation.
  • sum(): Sum values across rows/columns.
  • mean(): Calculate mean of values across rows/columns.
  • aggregate(): Apply functions to groups, reducing dimensions.

Grouping Example: Grouping by Type

Code:

grouped = combined.groupby('type')

🐍🧠

Result:

Note: A GroupBy object is not itself a table; printing it shows only its type and memory location. To see the underlying groups, you can iterate over it.
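A sketch of the loop that can produce the group printout below:

for group_name, group_df in grouped:
    print(f"Group: {group_name}")
    print(group_df)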

Group: Cat
  type      name  age  weight_kg  age_months  number_of_toys
1  Cat  Whiskers    5        7.0          60               8
3  Cat   Mittens    4        4.0          48               3


Group: Dog
  type   name  age  weight_kg  age_months  number_of_toys
0  Dog    Rex    5       20.0          60               2
2  Dog  Buddy    2       15.0          24               5
4  Dog   Spot    6       25.0          72               6

Aggregation Example: Who Has More Toys? Cats or Dogs?

aggregated = grouped.agg(average_toys=('number_of_toys', 'mean'))

🐍🧠

Result:

      average_toys
type              
Cat       5.500000
Dog       4.333333
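agg() also accepts several named aggregations at once. Here is a quick sketch on the same grouped object; the output column names are our own choices:

grouped.agg(
    average_toys=('number_of_toys', 'mean'),  # mean toy count per type
    total_weight=('weight_kg', 'sum'),        # combined weight per type
    oldest=('age', 'max'),                    # oldest pet per type
)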

6. Data Summarization

Importance

Data summarization provides a quick look into the dataset, which is helpful for initial analyses and decision-making.

Common Operations

  • describe(): Compute count, mean, standard deviation, quartiles, min, and max for each numeric column.
  • value_counts(): Count how often each unique value occurs.

Example: Descriptive Statistics, describe()

Code

summary_stats = combined.describe()

Result

            age  weight_kg  age_months  number_of_toys
count  5.000000   5.000000    5.000000        5.000000
mean   4.400000  14.200000   52.800000        4.800000
std    1.516575   8.757854   18.198901        2.387467
min    2.000000   4.000000   24.000000        2.000000
25%    4.000000   7.000000   48.000000        3.000000
50%    5.000000  15.000000   60.000000        5.000000
75%    5.000000  20.000000   60.000000        6.000000
max    6.000000  25.000000   72.000000        8.000000

Example: Counting Categories, value_counts()

Code

counts = combined['type'].value_counts()

Result

type
Dog    3
Cat    2
Name: count, dtype: int64
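value_counts() can also report proportions instead of raw counts. A minimal sketch; the commented output assumes the five-pet DataFrame above:

combined['type'].value_counts(normalize=True)
# Dog    0.6
# Cat    0.4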

7. Output/Export

Importance

Near the end of any data analysis workflow, you will want to save or export your results, making them available for sharing, visualization, or further processing.

Common Operations

  • to_csv(): Write the DataFrame to a CSV file.
  • to_excel(): Write the DataFrame to an Excel workbook.
  • to_json(): Write the DataFrame to a JSON file.

Example: Exporting to CSV

# index=False omits the row index from the output file
combined.to_csv('processed_pets_data.csv', index=False)
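For completeness, here are sketches of the other two exporters listed above; the file names are illustrative:

combined.to_excel('processed_pets_data.xlsx', index=False)      # requires an engine such as openpyxl
combined.to_json('processed_pets_data.json', orient='records')  # one JSON object per row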

Combining the Workflow with Method Chaining

Introduction to Method Chaining

Method chaining combines multiple operations into a single, coherent expression. It improves readability and avoids cluttering your code with intermediate variables.

A Complete Workflow Example Using “Method Chaining”

1. Setting up our data (you should already have created these)

import pandas as pd
from pandas import DataFrame

data = {
    'type': ['Dog', 'Cat', 'Dog', 'Cat', 'Dog'],
    'name': ['Rex', 'Whiskers', 'Buddy', 'Mittens', 'Spot'],
    'age': [5, 5, 2, 4, 6],
    'weight': [20, None, 15, 4, 25]
}

toys = DataFrame({
    'name': ['Spot', 'Mittens', 'Buddy', 'Whiskers', 'Rex'],
    'number_of_toys': [6, 3, 5, 8, 2]
})

2. Using method chaining to combine operations

df = DataFrame(data)  # create the DataFrame (stands in for the load step)

processed = (
    df
    .rename(columns={'weight': 'weight_kg'})        # clean
    .fillna({'weight_kg': 7})                       # clean
    .assign(age_months=lambda d: d['age'] * 12)     # transform
    .merge(toys, on='name', how='left')             # combine
    .groupby('type')                                # group
    .agg(average_toys=('number_of_toys', 'mean'))   # aggregate
)

🐍🧠

Result

      average_toys
type              
Cat       5.500000
Dog       4.333333
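The chain does not have to stop at aggregation. As a sketch, the export step (step 7) can be tacked on at the end; the file name here is illustrative:

(processed
    .reset_index()                       # turn the 'type' index back into a column
    .to_csv('pets_summary.csv', index=False))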