EDS 217 Cheatsheet

Grouping Data

Grouping data allows you to split your DataFrame into groups based on one or more columns.

import pandas as pd
import numpy as np

# Sample DataFrame
df = pd.DataFrame({
    'category': ['A', 'B', 'A', 'B', 'A'],
    'value': [1, 2, 3, 4, 5]
})
print(df)

  category  value
0        A      1
1        B      2
2        A      3
3        B      4
4        A      5

Creating a groupby object:

# Group by 'category'
grouped = df.groupby('category')
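
A groupby object is lazy: nothing is computed until you ask for a result. The sketch below shows a few ways to inspect one, and how to group by more than one column by passing a list. The 'site' column is a hypothetical addition for illustration and is not part of the sample data above.

```python
import pandas as pd

# Same sample data as above, plus a hypothetical 'site' column for illustration
df = pd.DataFrame({
    'category': ['A', 'B', 'A', 'B', 'A'],
    'site': ['x', 'x', 'y', 'y', 'x'],
    'value': [1, 2, 3, 4, 5]
})

grouped = df.groupby('category')

# Number of groups, and the rows belonging to a single group
n_groups = grouped.ngroups
group_a = grouped.get_group('A')

# Number of rows in each group
sizes = grouped.size()

# Group by more than one column by passing a list of column names
multi_sizes = df.groupby(['category', 'site']).size()

print(n_groups)
print(group_a)
print(sizes)
print(multi_sizes)
```

Grouping by a list of columns produces one group per combination of values that actually occurs in the data, and the result is indexed by a MultiIndex.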

Aggregating Data

After grouping, you can apply various aggregation functions to summarize the data within each group.

Basic aggregation

# Basic aggregations
print(grouped['value'].mean())
print(grouped['value'].sum())

category
A    3.0
B    3.0
Name: value, dtype: float64
category
A    9
B    6
Name: value, dtype: int64

Doing multiple aggregations at the same time using `agg()`

# Multiple aggregations
print(grouped['value'].agg(['mean', 'sum', 'count']))

          mean  sum  count
category                  
A          3.0    9      3
B          3.0    6      2

Aggregation using a custom function

# Custom aggregation function
def range_func(x):
    return x.max() - x.min()

print(grouped['value'].agg(range_func))

category
A    4
B    2
Name: value, dtype: int64

Common Aggregation Functions

mean(): Average
sum(): Sum of values
count(): Count of non-null values
min(), max(): Minimum and maximum values
median(): Median value
std(), var(): Standard deviation and variance
first(), last(): First and last non-null values in the group
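
Several of the functions listed above can be combined in one call using named aggregation, where each keyword becomes an output column. This is a minimal sketch on the same sample data; the output column names (value_median and so on) are arbitrary choices made here for illustration.

```python
import pandas as pd

df = pd.DataFrame({
    'category': ['A', 'B', 'A', 'B', 'A'],
    'value': [1, 2, 3, 4, 5]
})

# Named aggregation: output_column=(input_column, function_name)
summary = df.groupby('category').agg(
    value_median=('value', 'median'),
    value_std=('value', 'std'),
    value_first=('value', 'first'),
    value_last=('value', 'last'),
)
print(summary)
```

Note that std() and var() use the sample formula (ddof=1) by default, so a group with a single row yields NaN.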

Grouped Operations

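You can apply operations to each group separately using transform() or apply(). transform() returns a result aligned with the original rows, while apply() can return a result of any shape.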

Using `transform()` to alter each group in a groupby object

# Transform: apply a function to each group and return a result with the same length and index as the input
def normalize(x):
    return (x - x.mean()) / x.std()

df['value_normalized'] = grouped['value'].transform(normalize)
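
Because transform() returns one value per original row, its result can be assigned straight back to the DataFrame as a new column. The sketch below contrasts this with agg(), which returns one value per group. The column names group_mean and deviation are illustrative choices, not part of the original example.

```python
import pandas as pd

df = pd.DataFrame({
    'category': ['A', 'B', 'A', 'B', 'A'],
    'value': [1, 2, 3, 4, 5]
})
grouped = df.groupby('category')

# transform() broadcasts each group's mean back to every row of that group
df['group_mean'] = grouped['value'].transform('mean')
df['deviation'] = df['value'] - df['group_mean']

# agg() collapses each group to a single value instead
group_means = grouped['value'].agg('mean')

print(df)
print(group_means)
```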

Using `apply()` to alter each group in a groupby object

# Apply: apply function to each group, return a DataFrame or Series
def group_range(x):
    return x['value'].max() - x['value'].min()

# Select the non-grouping column(s) first so apply() does not operate on
# 'category'; applying to the grouping columns is deprecated in recent pandas
result = grouped[['value']].apply(group_range)

Pivot Tables

Pivot tables are a powerful tool for reorganizing and summarizing data. They allow you to transform your data from a long format to a wide format, making it easier to analyze and visualize patterns.

Working with Pivot Tables

# Sample DataFrame
df = pd.DataFrame({
    'date': ['2023-01-01', '2023-01-01', '2023-01-02', '2023-01-02'],
    'product': ['A', 'B', 'A', 'B'],
    'sales': [100, 150, 120, 180]
})
print(df)

         date product  sales
0  2023-01-01       A    100
1  2023-01-01       B    150
2  2023-01-02       A    120
3  2023-01-02       B    180

Pivot tables with a single aggregation function

# Create a pivot table
pivot_table = pd.pivot_table(df, values='sales', index='date', columns='product', aggfunc='sum')
print(pivot_table)

product       A    B
date                
2023-01-01  100  150
2023-01-02  120  180
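
One way to understand a pivot table is as a groupby followed by a reshape. The sketch below, using the same sample data, groups by both keys, sums, and then moves 'product' from the row index into the columns with unstack(). It is meant as an illustration of the relationship, not a description of how pandas implements pivot_table internally.

```python
import pandas as pd

df = pd.DataFrame({
    'date': ['2023-01-01', '2023-01-01', '2023-01-02', '2023-01-02'],
    'product': ['A', 'B', 'A', 'B'],
    'sales': [100, 150, 120, 180]
})

pivot_table = pd.pivot_table(df, values='sales', index='date', columns='product', aggfunc='sum')

# Group by both keys, aggregate, then move 'product' from the row index to the columns
via_groupby = df.groupby(['date', 'product'])['sales'].sum().unstack('product')

# Compare the cell values of the two results
same_values = (pivot_table.to_numpy() == via_groupby.to_numpy()).all()

print(via_groupby)
print(same_values)
```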

Pivot tables with multiple aggregation functions

# Pivot table with multiple aggregation functions
# Pass function names as strings; passing np.sum / np.mean raises a FutureWarning in recent pandas
pivot_multi = pd.pivot_table(df, values='sales', index='date', columns='product',
                             aggfunc=['sum', 'mean'])
print(pivot_multi)

            sum        mean       
product       A    B      A      B
date                              
2023-01-01  100  150  100.0  150.0
2023-01-02  120  180  120.0  180.0

Key Pivot Table Parameters

values: Column(s) to aggregate
index: Column(s) to use as row labels
columns: Column(s) to use as column labels
aggfunc: Function(s) to use for aggregation (default is mean)
fill_value: Value to use for missing data
margins: Add an "All" row and column with totals (default is False)
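
The fill_value and margins parameters are easiest to see on data with a gap. This is a minimal sketch using a hypothetical variant of the sales data in which product B has no row for 2023-01-02; the data is made up for illustration.

```python
import pandas as pd

# Hypothetical data: no row for product B on 2023-01-02
df = pd.DataFrame({
    'date': ['2023-01-01', '2023-01-01', '2023-01-02'],
    'product': ['A', 'B', 'A'],
    'sales': [100, 150, 120]
})

pivot_filled = pd.pivot_table(
    df, values='sales', index='date', columns='product',
    aggfunc='sum',
    fill_value=0,   # show 0 instead of NaN for the missing combination
    margins=True    # add an 'All' row and column with totals
)
print(pivot_filled)
```

Without fill_value, the missing date/product combination would appear as NaN. The label of the totals row and column can be changed with the margins_name parameter.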

For more detailed information on grouping, aggregating, and pivot tables in pandas, refer to the official pandas documentation.
