Defining a Comprehensive 9-Step Data Science Workflow
In your final class activity next week, you will work in small groups (2-3) to develop an example data science workflow. The workflow will follow these 9 steps (although not all steps are used in every workflow):
1. Importing data
2. Exploration
3. Cleaning
4. Filtering
5. Sorting
6. Transformation
7. Grouping
8. Aggregation
9. Visualization
Almost all pandas functions and DataFrame methods can be classified into one or more of these 9 categories. For example, here is a cheatsheet that maps many of the most common pandas functions onto our nine-step workflow.
In this interactive session, we will use a sample dataset to go through the 9 steps and see how each one moves us forward in the data analysis pipeline.
For each step, we'll focus on the most common and essential commands used in data science, providing detailed explanations and hints at more advanced topics we'll cover in future sessions or in later courses.
Let's begin by setting up our environment and creating our dataset.
First, let's import the necessary libraries:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
Now, let's create a sample dataset about ocean temperatures:
# Create sample data
# Create a random number generator object
rng = np.random.default_rng(42) # 42 is the seed for reproducibility
# Generate date range (3 years of data)
dates = pd.date_range(start='2020-01-01', end='2022-12-31', freq='D')
print(f"Number of days: {len(dates)}")
# Define locations (5 oceans)
locations = ['Pacific', 'Atlantic', 'Indian', 'Southern', 'Arctic']
print(f"Number of locations: {len(locations)}")
# Calculate total number of rows (one observation for each day, for each location)
total_rows = len(dates) * len(locations)
print(f"Total number of rows: {total_rows}")
# Generate 'date' column
# np.tile repeats the entire dates array for each location
date_column = np.tile(dates, len(locations))
print("Date column shape:", date_column.shape)
# Generate 'location' column
# np.repeat repeats each location for all dates before moving to the next location
location_column = np.repeat(locations, len(dates))
print("Location column shape:", location_column.shape)
# Generate 'temperature' column
# Using normal distribution: mean=20, std_dev=5
temperature_column = rng.normal(20, 5, total_rows)
print("Temperature column shape:", temperature_column.shape)
# Generate 'salinity' column
# Using normal distribution: mean=35, std_dev=1
salinity_column = rng.normal(35, 1, total_rows)
print("Salinity column shape:", salinity_column.shape)
# Generate 'depth' column
# Using choice to randomly select from given depths
depth_options = [0, 50, 100, 200, 500, 1000]
depth_column = rng.choice(depth_options, total_rows)
print("Depth column shape:", depth_column.shape)
# Create DataFrame
df = pd.DataFrame({
'date': date_column,
'location': location_column,
'temperature': temperature_column,
'salinity': salinity_column,
'depth': depth_column
})
# Introduce missing values (NaN) to temperature and salinity columns
# We'll use 5% as the probability of a value being NaN
# For temperature
temp_mask = rng.choice([True, False], total_rows, p=[0.05, 0.95])
df.loc[temp_mask, 'temperature'] = np.nan
# For salinity
sal_mask = rng.choice([True, False], total_rows, p=[0.05, 0.95])
df.loc[sal_mask, 'salinity'] = np.nan
# Display info about the resulting DataFrame
print("\nDataFrame Info:")
df.info()
# Save as CSV
df.to_csv('ocean_temperatures.csv', index=False)
print("\nDataset created and saved as 'ocean_temperatures.csv'")
Number of days: 1096
Number of locations: 5
Total number of rows: 5480
Date column shape: (5480,)
Location column shape: (5480,)
Temperature column shape: (5480,)
Salinity column shape: (5480,)
Depth column shape: (5480,)
DataFrame Info:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5480 entries, 0 to 5479
Data columns (total 5 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 date 5480 non-null datetime64[ns]
1 location 5480 non-null object
2 temperature 5197 non-null float64
3 salinity 5202 non-null float64
4 depth 5480 non-null int64
dtypes: datetime64[ns](1), float64(2), int64(1), object(1)
memory usage: 214.2+ KB
Dataset created and saved as 'ocean_temperatures.csv'
The read_csv() function is one of the most commonly used methods for importing data in pandas. It reads a comma-separated values (CSV) file into a DataFrame.
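A minimal sketch of this step, assuming we read back the CSV file created above, might look like this:
# Read the saved CSV file into a new DataFrame
df = pd.read_csv('ocean_temperatures.csv')
print("Data imported successfully.")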
Data imported successfully.
pd.read_csv() is versatile and can handle various file formats and options.
Future topics: the chunksize parameter.
Exploration is crucial for understanding your dataset. We'll use several key functions for this purpose.
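A minimal sketch of the exploration commands, assuming the imported DataFrame is named df; the labels in the print calls match the output shown below:
# Peek at the first rows, column types, summary statistics, and missing-value counts
print("First few rows:")
print(df.head())

print("\nDataFrame info:")
df.info()

print("\nSummary statistics:")
print(df.describe())

print("\nMissing values:")
print(df.isna().sum())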
First few rows:
date location temperature salinity depth
0 2020-01-01 Pacific 21.523585 NaN 200
1 2020-01-02 Pacific 14.800079 34.467264 100
2 2020-01-03 Pacific 23.752256 35.016505 100
3 2020-01-04 Pacific 24.702824 36.416944 200
4 2020-01-05 Pacific 10.244824 35.807487 1000
DataFrame info:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5480 entries, 0 to 5479
Data columns (total 5 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 date 5480 non-null object
1 location 5480 non-null object
2 temperature 5197 non-null float64
3 salinity 5202 non-null float64
4 depth 5480 non-null int64
dtypes: float64(2), int64(1), object(2)
memory usage: 214.2+ KB
Summary statistics:
temperature salinity depth
count 5197.000000 5202.000000 5480.000000
mean 19.909088 34.993770 306.925182
std 4.979046 1.009512 349.828443
min 1.757936 30.610885 0.000000
25% 16.532469 34.340138 50.000000
50% 19.961369 34.974046 100.000000
75% 23.156441 35.668759 500.000000
max 37.270232 39.025824 1000.000000
Missing values:
date 0
location 0
temperature 283
salinity 278
depth 0
dtype: int64
head(): Shows the first few rows (default is 5) of the DataFrame.
info(): Provides a concise summary of the DataFrame, including column names, non-null counts, and data types.
describe(): Generates descriptive statistics for numerical columns.
isna().sum(): Counts missing values in each column.
Future topics:
Data cleaning is often one of the most time-consuming parts of data analysis. Here, we'll focus on handling missing values.
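A minimal sketch of this step, assuming we drop every row containing a missing value and store the result in a new DataFrame called df_cleaned (the name used in later steps):
# Remove rows with any NaN values and report how many were dropped
df_cleaned = df.dropna()
print(f"Rows with missing values removed: {len(df) - len(df_cleaned)}")
print(f"Remaining rows: {len(df_cleaned)}")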
Rows with missing values removed: 541
Remaining rows: 4939
dropna(): Removes rows containing any null values.
Future topics:
Filtering allows us to focus on specific subsets of our data based on conditions.
# Filter for Pacific Ocean data in summer months (June, July, August) of 2021
pacific_summer = df_cleaned[(df_cleaned['location'] == 'Pacific') &
(df_cleaned['date'].between('2021-06-01', '2021-08-31'))]
print("Pacific Ocean data for Summer 2021:")
print(pacific_summer.head())
print(f"\nNumber of records: {len(pacific_summer)}")
Pacific Ocean data for Summer 2021:
date location temperature salinity depth
517 2021-06-01 Pacific 16.558139 36.613441 500
519 2021-06-03 Pacific 28.144685 33.617238 1000
520 2021-06-04 Pacific 15.149252 35.225117 200
521 2021-06-05 Pacific 15.561522 34.946657 200
522 2021-06-06 Pacific 26.678922 35.864044 100
Number of records: 83
The & operator combines multiple conditions (logical AND).
between() is a convenient method for range comparisons.
Future topics: the query() method for string expressions.
Sorting helps in understanding the distribution and extremes of our data.
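A minimal sketch of the sorting step, assuming we sort the cleaned data by temperature in descending order and display the top 10 readings:
# Sort by temperature, warmest first, and show the 10 highest readings
top_temps = df_cleaned.sort_values('temperature', ascending=False)
print("Top 10 warmest temperature readings:")
print(top_temps[['date', 'location', 'temperature']].head(10))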
Top 10 warmest temperature readings:
date location temperature
4141 2022-05-03 Southern 37.270232
4222 2022-07-23 Southern 36.355131
846 2022-04-26 Pacific 35.894268
3304 2020-01-17 Southern 35.442561
2863 2021-11-02 Indian 35.297656
3141 2022-08-07 Indian 35.158812
4319 2022-10-28 Southern 35.118907
5183 2022-03-10 Arctic 34.981195
1616 2021-06-04 Atlantic 34.571223
139 2020-05-19 Pacific 34.569312
sort_values(): Sorts the DataFrame by one or more columns.
The ascending parameter controls the sort order for each column.
Future topics:
Data transformation involves creating new features or modifying existing ones. We'll look at different types of transformations.
# Single column transformation
df_cleaned['temperature_f'] = (df_cleaned['temperature'] * 9/5) + 32
# Multi-column transformation
df_cleaned['temp_sal_ratio'] = df_cleaned['temperature'] / df_cleaned['salinity']
# Using apply() for more complex transformations
def depth_category(depth):
if depth <= 50:
return 'Shallow'
elif depth <= 200:
return 'Medium'
else:
return 'Deep'
df_cleaned['depth_category'] = df_cleaned['depth'].apply(depth_category)
print("DataFrame with new columns:")
print(df_cleaned[['temperature', 'temperature_f', 'temp_sal_ratio', 'depth', 'depth_category']].head())
DataFrame with new columns:
temperature temperature_f temp_sal_ratio depth depth_category
1 14.800079 58.640143 0.429395 100 Medium
2 23.752256 74.754061 0.678316 100 Medium
3 24.702824 76.465082 0.678333 200 Medium
4 10.244824 50.440683 0.286108 1000 Deep
5 13.489102 56.280384 0.376834 0 Shallow
apply(): Applies custom functions to columns.
Grouping allows us to split the data into subsets based on some criteria.
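A minimal sketch of the grouping step, assuming we group the cleaned data by location and by the depth_category column created above:
# Group by location and depth category; size() counts the rows in each group
grouped = df_cleaned.groupby(['location', 'depth_category'])
print("Group sizes:")
print(grouped.size())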
Group sizes:
location depth_category
Arctic Deep 316
Medium 338
Shallow 334
Atlantic Deep 327
Medium 349
Shallow 316
Indian Deep 318
Medium 336
Shallow 340
Pacific Deep 344
Medium 313
Shallow 333
Southern Deep 318
Medium 319
Shallow 338
dtype: int64
groupby(): Creates a GroupBy object, which doesn't compute anything until an aggregation method is called.
Aggregation computes summary statistics for each group.
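A minimal sketch of the aggregation step, assuming we compute the mean and standard deviation of temperature and salinity for each location/depth-category group:
# Aggregate temperature and salinity with multiple statistics per group
agg_stats = (df_cleaned
             .groupby(['location', 'depth_category'])[['temperature', 'salinity']]
             .agg(['mean', 'std'])
             )
print("Aggregated data:")
print(agg_stats)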
Aggregated data:
temperature salinity
mean std mean std
location depth_category
Arctic Deep 19.899673 4.919967 35.024201 1.044496
Medium 20.278376 5.027461 34.996241 0.953111
Shallow 19.473178 5.022971 34.950225 1.017149
Atlantic Deep 19.080289 4.784324 34.989573 0.945400
Medium 19.707030 5.222012 34.977164 1.033639
Shallow 20.109729 4.984291 34.977224 0.973582
Indian Deep 19.764921 4.999262 34.974590 0.939823
Medium 20.522776 4.980354 35.031657 1.025202
Shallow 20.233734 5.051566 35.038030 0.969988
Pacific Deep 19.761942 4.611211 35.082996 1.035482
Medium 20.203291 4.773300 35.021385 1.059246
Shallow 19.968260 5.225368 35.068652 1.016507
Southern Deep 20.761787 4.922963 34.903479 1.084392
Medium 19.348108 4.878151 34.984482 0.980553
Shallow 20.004547 4.853461 34.934175 1.063821
agg(): Applies multiple aggregation functions at once.
Visualization is key for understanding patterns and communicating results.
# Plot average temperatures by location, sorted from highest to lowest. Use parentheses to allow for multi-line commands.
avg_temps = (df_cleaned
.groupby('location')['temperature']
.mean()
.sort_values(ascending=False)
)
plt.figure(figsize=(10, 6))
avg_temps.plot(kind='bar')
plt.title('Average Ocean Temperatures by Location')
plt.xlabel('Location')
plt.ylabel('Average Temperature (°C)')
plt.xticks(rotation=45)
plt.tight_layout()
plt.show()
This comprehensive walkthrough of a 9-step data science workflow with pandas DataFrames has introduced you to the most common and essential commands used in each step of the process. We've covered importing data, exploration, cleaning, filtering, sorting, transformation, grouping, aggregation, and visualization.
Remember, this is just the beginning. Each of these steps has much more depth that we'll explore in future sessions. The pandas library is incredibly powerful and flexible, offering numerous ways to manipulate and analyze data efficiently.
As you continue your journey in data science, you'll find yourself repeatedly using these core concepts, building on them to tackle more complex problems and datasets. When you encounter new pandas commands or other data science packages, it will be helpful to map their functions onto one or more of the nine workflow steps where they are most useful.
End interactive session 4C