Defining a Comprehensive 9-Step Data Science Workflow
9-Step Data Science Workflow
Every data science project follows the same systematic workflow. Whether you're analyzing Netflix recommendations, climate research, social media trends, or working on your final project, you'll use these 9 steps:
flowchart LR
A["1. Import<br/>๐"] --> B["2. Explore<br/>๐"] --> C["3. Clean<br/>๐งน"]
C --> D["4. Filter<br/>๐ฏ"] --> E["5. Sort<br/>๐"]
E --> F["6. Transform<br/>๐"] --> G["7. Group<br/>๐ฅ"]
G --> H["8. Aggregate<br/>๐"] --> I["9. Visualize<br/>๐"]
style A fill:#e1f5fe
style B fill:#e8f5e8
style C fill:#fff3e0
style D fill:#f3e5f5
style E fill:#e0f2f1
style F fill:#fce4ec
style G fill:#e8eaf6
style H fill:#f1f8e9
style I fill:#fff8e1
Why This Workflow Matters
Today: See all 9 steps in action with ocean temperature analysis
Days 4-7: Master each step individually with detailed sessions
Your final project: Apply this exact workflow to answer your research question!
Course Integration
Almost all pandas functions and DataFrame methods fit into one of these 9 categories. For reference, here is a cheatsheet that maps common pandas functions to our workflow steps; a condensed sketch of that mapping appears below.
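If you don't have the cheatsheet in front of you, here is a minimal sketch of the kind of mapping it contains. This is an illustrative, condensed version for orientation only; the official course cheatsheet may differ.
Code
# Illustrative mapping of workflow steps to common pandas tools (a sketch,
# not the official cheatsheet).
workflow_cheatsheet = {
    "1. Import":    ["pd.read_csv()"],
    "2. Explore":   [".head()", ".info()", ".describe()", ".isna()"],
    "3. Clean":     [".dropna()", ".drop_duplicates()", ".astype()"],
    "4. Filter":    ["df[df['col'] == value]", ".str.contains()"],
    "5. Sort":      [".sort_values()"],
    "6. Transform": ["df['new'] = ...", ".apply()"],
    "7. Group":     [".groupby()"],
    "8. Aggregate": [".mean()", ".size()", ".agg()"],
    "9. Visualize": [".plot()", "plt.figure()"],
}
for step, tools in workflow_cheatsheet.items():
    print(f"{step}: {', '.join(tools)}")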
Ocean Temperature Analysis: Complete Workflow Demo
In this session, we'll systematically work through every step of the data science workflow using ocean temperature data. You'll see exactly how professional data scientists approach problems, and by the end, you'll have completed your first full data science project!
Research Question: Which ocean has the warmest average temperatures, and how do temperatures change between seasons?
Let's systematically work through our 9-step workflow!
Setting up our environment
First, let's import the libraries we know from previous sessions:
Code
import pandas as pd
import matplotlib.pyplot as plt
Libraries We're Using
pandas (pd): For working with DataFrames (from Sessions 4a & 4b)
matplotlib (plt): For creating charts and graphs (from Session 4c)
Workflow Progress Tracker
As we work through each step, we'll track our progress through the complete data science workflow:
Workflow Progress
Ocean Temperature Analysis - Workflow Steps
Step 1: Import - Load our ocean data
Step 2: Explore - Discover what we have
Step 3: Clean - Fix any problems
Step 4: Filter - Focus on specific data
Step 5: Sort - Find temperature patterns
Step 6: Transform - Create new insights
Step 7: Group - Organize by categories
Step 8: Aggregate - Calculate summaries
Step 9: Visualize - Present our results
Goal: Complete systematic data science analysis
Step 1: Import Data
Workflow Step 1: Getting our data into Python
The first step in every data science project is getting your data into Python. We'll use pd.read_csv() - the same function you learned in Session 4a!
Code
# Step 1: Import our ocean temperature data
df = pd.read_csv('ocean_temperatures_simple.csv')
print("Step 1 Complete: Data imported successfully!")
print(f"Loaded {len(df)} rows of ocean temperature data")
Step 1 Complete: Data imported successfully!
Loaded 30 rows of ocean temperature data
Real Data Science Connection
Professional data scientists start every project the same way - importing data! Whether it's:
- Climate data from NASA
- User behavior from websites
- Financial data from banks
- Your final project data

You always start with pd.read_csv() or similar import functions.
Coming Attractions: Later in the course, you'll learn to import Excel files, JSON data, and even data from databases! A small preview is sketched below.
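As a hedged preview, the calls below show what those imports typically look like in pandas. The filenames are hypothetical, so the calls are left commented out and this cell runs as-is:
Code
# Hypothetical filenames - commented out so this cell runs without the files.
# df_excel = pd.read_excel('ocean_temperatures.xlsx')  # Excel (needs openpyxl)
# df_json = pd.read_json('ocean_temperatures.json')    # JSON
# Databases typically use pd.read_sql() with a database connection object.
print("Preview only - later sessions cover these imports for real!")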
Step 2: Explore Data
Workflow Step 2: Discovering what we have
Before we can analyze data, we need to understand what we're working with. Let's use the exploration methods you learned in Session 4a:
Code
print("๐ EXPLORING OUR OCEAN DATA")print("="*40)print("\n๐ First few rows:")print(df.head())print(f"\n๐ DataFrame info:")df.info()print(f"\n๐ Summary statistics:")print(df.describe())print("\nโ Missing values check:")print(df.isna().sum())print("\nโ Step 2 Complete: We now understand our data!")
EXPLORING OUR OCEAN DATA
========================================
First few rows:
date location temperature salinity depth
0 2021-01-15 Pacific 18.5 34.2 50
1 2021-01-15 Atlantic 22.1 35.1 0
2 2021-01-15 Indian 20.0 34.8 100
3 2021-01-15 Southern 15.2 34.0 200
4 2021-01-15 Arctic 12.1 33.5 50
DataFrame info:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 30 entries, 0 to 29
Data columns (total 5 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 date 30 non-null object
1 location 30 non-null object
2 temperature 30 non-null float64
3 salinity 30 non-null float64
4 depth 30 non-null int64
dtypes: float64(2), int64(1), object(2)
memory usage: 1.3+ KB
Summary statistics:
temperature salinity depth
count 30.000000 30.000000 30.000000
mean 19.283333 34.433333 80.000000
std 4.621843 0.616068 68.982756
min 11.500000 33.300000 0.000000
25% 15.400000 34.025000 50.000000
50% 19.200000 34.350000 50.000000
75% 22.925000 35.000000 100.000000
max 27.100000 35.400000 200.000000
Missing values check:
date 0
location 0
temperature 0
salinity 0
depth 0
dtype: int64
Step 2 Complete: We now understand our data!
What We Discovered
Our ocean dataset contains:
- 5 oceans: Pacific, Atlantic, Indian, Southern, Arctic
- Temperature measurements in degrees Celsius
- Salinity measurements (salt content)
- Depth measurements where samples were taken
- 30 total measurements across different dates

This is exactly what real data scientists do first!
Coming Attractions: In Day 5, you'll learn advanced exploration techniques like correlation analysis and custom statistics! A tiny preview is sketched below.
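As a small, hedged taste of that correlation analysis, pandas can already compute pairwise correlations between our numeric columns with .corr():
Code
# Preview sketch: pairwise correlations between the numeric columns.
# Values near +1 or -1 suggest strong linear relationships; near 0, weak ones.
print(df[['temperature', 'salinity', 'depth']].corr())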
Step 3: Clean Data
Workflow Step 3: Fixing problems in our data
Good news! Our ocean data is already clean - no missing values to worry about. But let's see what cleaning looks like:
Code
print("๐งน CLEANING OUR DATA")print("="*30)# Check for missing values (we already did this, but let's confirm)missing_data = df.isna().sum()print("Missing values per column:")print(missing_data)if missing_data.sum() ==0:print("\n๐ Great news! Our data is already clean!") df_cleaned = df.copy() # Make a copy for consistencyelse:print(f"\n๐ง Cleaning needed...") df_cleaned = df.dropna().copy() # Remove rows with missing valuesprint(f"Removed {len(df) -len(df_cleaned)} rows with missing data")print(f"\nโ Step 3 Complete: Clean dataset with {len(df_cleaned)} rows ready for analysis!")
CLEANING OUR DATA
==============================
Missing values per column:
date 0
location 0
temperature 0
salinity 0
depth 0
dtype: int64
Great news! Our data is already clean!
Step 3 Complete: Clean dataset with 30 rows ready for analysis!
Why Cleaning Matters
In real data science projects, you'll spend 50-80% of your time cleaning data! Common problems include:
- Missing values (what we just checked for)
- Duplicate entries
- Incorrect data types
- Outliers and errors

The .dropna() method you just learned will be one of your most-used tools!
Coming Attractions: In Day 5, you'll learn advanced cleaning techniques like handling duplicates and fixing data types! A brief sketch follows.
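Here is a brief, hedged sketch of those two techniques, using methods that exist in pandas today (Day 5 covers them properly):
Code
# 1) Remove exact duplicate rows (our dataset has none, so the count is unchanged)
deduped = df_cleaned.drop_duplicates()
print(f"Rows after dropping duplicates: {len(deduped)}")

# 2) Fix a data type: convert the 'date' strings into real datetime values
dates = pd.to_datetime(df_cleaned['date'])
print(dates.dtype)  # datetime64[ns]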
Step 4: Filter Data
Workflow Step 4: Focusing on what matters for our question
Let's focus on specific data to answer our research question. We'll filter for just the Pacific Ocean to start:
Code
print("๐ฏ FILTERING OUR DATA")print("="*30)# Filter for just Pacific Ocean data (using boolean indexing from Session 4b)pacific_data = df_cleaned[df_cleaned['location'] =='Pacific']print("Pacific Ocean measurements:")print(pacific_data)print(f"\n๐ Found {len(pacific_data)} Pacific Ocean measurements")# Let's also look at summer data (June measurements)summer_data = df_cleaned[df_cleaned['date'].str.contains('06-15')]print(f"\n๐ Summer measurements (June): {len(summer_data)} rows")print("\nโ Step 4 Complete: Focused on specific data for our analysis!")
FILTERING OUR DATA
==============================
Pacific Ocean measurements:
date location temperature salinity depth
0 2021-01-15 Pacific 18.5 34.2 50
5 2021-06-15 Pacific 24.3 34.5 50
10 2021-12-15 Pacific 19.1 34.3 50
15 2022-01-15 Pacific 18.2 34.1 50
20 2022-06-15 Pacific 24.8 34.6 50
25 2022-12-15 Pacific 19.3 34.4 50
Found 6 Pacific Ocean measurements
Summer measurements (June): 10 rows
Step 4 Complete: Focused on specific data for our analysis!
Filtering in Real Data Science
Professional data scientists constantly filter data to focus on specific questions:
- Netflix: "Show me viewing data for comedy movies"
- Climate research: "Focus on temperature data from Arctic regions"
- Your final project: "Filter for data relevant to your specific question"

The boolean indexing you just used (df[df['column'] == value]) is a fundamental skill!
Coming Attractions: In Day 5, you'll learn complex filtering with multiple conditions using & and | operators! A quick taste is sketched below.
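Here is that quick, hedged taste of combined conditions; the 20°C threshold below is just an illustrative choice:
Code
# Combine conditions with & (and) / | (or); each condition needs parentheses.
warm_pacific = df_cleaned[(df_cleaned['location'] == 'Pacific') &
                          (df_cleaned['temperature'] > 20)]
print(f"Warm Pacific measurements: {len(warm_pacific)}")

polar = df_cleaned[(df_cleaned['location'] == 'Arctic') |
                   (df_cleaned['location'] == 'Southern')]
print(f"Polar-ocean measurements: {len(polar)}")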
Step 5: Sort Data
Workflow Step 5: Organizing data to find patterns
Sorting helps us find the highest and lowest values. Let's find the warmest and coldest ocean measurements:
Code
print("๐ SORTING OUR DATA")print("="*30)# Sort by temperature (warmest first) using .sort_values() from Session 4bsorted_by_temp = df_cleaned.sort_values('temperature', ascending=False)print("๐ฅ TOP 5 WARMEST measurements:")print(sorted_by_temp[['location', 'temperature', 'date']].head())print("\n๐ง TOP 5 COLDEST measurements:")print(sorted_by_temp[['location', 'temperature', 'date']].tail())print("\nโ Step 5 Complete: Found temperature patterns by sorting!")
SORTING OUR DATA
==============================
TOP 5 WARMEST measurements:
location temperature date
21 Atlantic 27.1 2022-06-15
6 Atlantic 26.8 2021-06-15
22 Indian 25.5 2022-06-15
7 Indian 25.2 2021-06-15
20 Pacific 24.8 2022-06-15
TOP 5 COLDEST measurements:
location temperature date
9 Arctic 14.8 2021-06-15
4 Arctic 12.1 2021-01-15
19 Arctic 11.9 2022-01-15
29 Arctic 11.8 2022-12-15
14 Arctic 11.5 2021-12-15
Step 5 Complete: Found temperature patterns by sorting!
Insights from Sorting
What we discovered:
- Warmest: Atlantic Ocean (27.1°C in summer)
- Coldest: Arctic Ocean (11.5°C in winter)
- Pattern: Atlantic and Indian measurements top the warmest list, while the Arctic dominates the coldest

This is how data scientists find patterns - sorting reveals extremes and trends!
Coming Attractions: In Day 6, you'll learn to sort by multiple columns and create hierarchical sorting! A minimal sketch follows.
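Here is that minimal sketch of multi-column sorting, using the same .sort_values() you already know:
Code
# Sort oceans alphabetically, then warmest-first within each ocean.
multi_sorted = df_cleaned.sort_values(['location', 'temperature'],
                                      ascending=[True, False])
print(multi_sorted[['location', 'temperature', 'date']].head(8))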
Step 6: Transform Data
Workflow Step 6: Creating new insights from existing data
Let's create new information that will help answer our research question:
Code
print("๐ TRANSFORMING OUR DATA")print("="*35)# Create a new column: temperature in Fahrenheit (simple math from Session 4b)df_cleaned['temperature_f'] = (df_cleaned['temperature'] *9/5) +32# Create a season category based on the datedef get_season(date_str):if'01-15'in date_str or'12-15'in date_str:return'Winter'elif'06-15'in date_str:return'Summer'else:return'Other'df_cleaned['season'] = df_cleaned['date'].apply(get_season)# Show our new columnsprint("New columns added:")print(df_cleaned[['location', 'temperature', 'temperature_f', 'season']].head())print(f"\n๐ Original columns: 5")print(f"๐ After transformation: {len(df_cleaned.columns)} columns")print("\nโ Step 6 Complete: Created new insights from our data!")
TRANSFORMING OUR DATA
===================================
New columns added:
location temperature temperature_f season
0 Pacific 18.5 65.30 Winter
1 Atlantic 22.1 71.78 Winter
2 Indian 20.0 68.00 Winter
3 Southern 15.2 59.36 Winter
4 Arctic 12.1 53.78 Winter
Original columns: 5
After transformation: 7 columns
Step 6 Complete: Created new insights from our data!
Why Transform Data?
Transformation creates new insights:
- Temperature in Fahrenheit: Makes data accessible to different audiences
- Season categories: Helps us compare winter vs summer patterns
- New calculations: Ratios, categories, derived metrics

Real data scientists spend lots of time creating these "feature engineering" transformations!
Coming Attractions: In Day 6, you'll learn advanced transformations and custom functions! A hedged preview is sketched below.
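As that hedged preview, custom transformations often use small anonymous (lambda) functions with .apply(). The 'temp_band' column and the 20°C cutoff below are illustrative choices, not part of our analysis, so we work on a copy:
Code
# Work on a copy so our main DataFrame stays unchanged.
preview = df_cleaned.copy()

# Illustrative feature: label each measurement 'warm' or 'cool' vs. 20 degrees C.
preview['temp_band'] = preview['temperature'].apply(
    lambda t: 'warm' if t >= 20 else 'cool')
print(preview[['location', 'temperature', 'temp_band']].head())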
Step 7: Group Data
Workflow Step 7: Organizing by categories to find patterns
Now we'll group our data by categories to compare different oceans and seasons:
Code
print("๐ฅ GROUPING OUR DATA")print("="*30)# Group by ocean location (using .groupby() from Session 4b)by_ocean = df_cleaned.groupby('location')print("๐ Number of measurements per ocean:")print(by_ocean.size())# Group by season to compare winter vs summerby_season = df_cleaned.groupby('season')print("\n๐ Number of measurements per season:")print(by_season.size())print("\nโ Step 7 Complete: Data organized by meaningful categories!")
GROUPING OUR DATA
==============================
Number of measurements per ocean:
location
Arctic 6
Atlantic 6
Indian 6
Pacific 6
Southern 6
dtype: int64
Number of measurements per season:
season
Summer 10
Winter 20
dtype: int64
Step 7 Complete: Data organized by meaningful categories!
Why Group Data?
Grouping reveals patterns:
- By ocean: Compare Pacific vs Atlantic vs Arctic temperatures
- By season: See how temperatures change winter to summer
- By categories: Any categorical variable can create groups

This sets up the next step - calculating summary statistics for each group!
Coming Attractions: In Day 6, you'll learn to group by multiple columns simultaneously and create complex hierarchical groups! A tiny preview follows.
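Here is that tiny, hedged preview of grouping by two columns at once, using the 'season' column we created in Step 6:
Code
# Average temperature for every (ocean, season) combination.
by_ocean_season = df_cleaned.groupby(['location', 'season'])['temperature'].mean()
print(by_ocean_season.round(1))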
Step 8: Aggregate Data
Workflow Step 8: Calculating summaries to answer our question
Now for the exciting part - let's calculate averages to answer "Which ocean is warmest?"
Code
print("๐ AGGREGATING OUR DATA")print("="*35)# Calculate average temperature by ocean (using .mean() from Session 4b)avg_temp_by_ocean = df_cleaned.groupby('location')['temperature'].mean()print("๐ AVERAGE TEMPERATURE BY OCEAN:")print(avg_temp_by_ocean.sort_values(ascending=False))# Calculate average temperature by seasonavg_temp_by_season = df_cleaned.groupby('season')['temperature'].mean()print("\n๐๏ธ AVERAGE TEMPERATURE BY SEASON:")print(avg_temp_by_season.sort_values(ascending=False))# Answer our research question!warmest_ocean = avg_temp_by_ocean.max()warmest_ocean_name = avg_temp_by_ocean.idxmax()print(f"\n๐ RESEARCH QUESTION ANSWERED!")print(f"๐ Warmest ocean: {warmest_ocean_name} ({warmest_ocean:.1f}ยฐC)")print("\nโ Step 8 Complete: Found the answer through aggregation!")
AGGREGATING OUR DATA
===================================
AVERAGE TEMPERATURE BY OCEAN:
location
Atlantic 24.083333
Indian 22.133333
Pacific 20.700000
Southern 16.616667
Arctic 12.883333
Name: temperature, dtype: float64
AVERAGE TEMPERATURE BY SEASON:
season
Summer 22.100
Winter 17.875
Name: temperature, dtype: float64
RESEARCH QUESTION ANSWERED!
Warmest ocean: Atlantic (24.1°C)
Step 8 Complete: Found the answer through aggregation!
Key Discovery!
Our Research Results:
- Warmest Ocean: Atlantic (24.1°C average)
- Second Warmest: Indian (22.1°C average)
- Coldest: Arctic (12.9°C average)
- Summer is warmer than winter (as expected!)

This is exactly how real data science works - use aggregation to answer research questions!
Coming Attractions: In Day 6, you'll learn advanced aggregation functions like .agg() to calculate multiple statistics at once! A hedged preview follows.
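As that hedged preview, .agg() already works in pandas today; here it computes three statistics per ocean in a single call:
Code
# Several summary statistics per ocean in one call.
summary = df_cleaned.groupby('location')['temperature'].agg(['mean', 'min', 'max'])
print(summary.round(1))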
Step 9: Visualize Data
Workflow Step 9: Telling our story with charts
The final step is creating a chart to communicate our findings clearly:
Code
print("๐ VISUALIZING OUR RESULTS")print("="*35)# Create average temperature data for plottingavg_temps = df_cleaned.groupby('location')['temperature'].mean().sort_values(ascending=False)# Create a bar chart (using matplotlib from Session 4c)plt.figure(figsize=(10, 6))avg_temps.plot(kind='bar', color=['red', 'orange', 'blue', 'green', 'purple'])plt.title('๐ Average Ocean Temperatures: Research Results', fontsize=16, fontweight='bold')plt.xlabel('Ocean Location', fontsize=12)plt.ylabel('Average Temperature (ยฐC)', fontsize=12)plt.xticks(rotation=45)plt.grid(True, alpha=0.3)plt.tight_layout()# Add our research conclusion to the plotplt.figtext(0.5, 0.02, '๐ Research Conclusion: Atlantic Ocean is the warmest on average!', ha='center', fontsize=12, fontweight='bold')plt.show()print("\nโ Step 9 Complete: Story told through visualization!")
Step 9 Complete: Story told through visualization!
CONGRATULATIONS!
You just completed your first full data science project!
Research Question: Which ocean has the warmest average temperatures?
Answer: Atlantic Ocean (24.1°C average)
Method: Complete 9-step data science workflow!

You've completed the full workflow - you're officially a data scientist!
What You Accomplished Today
Complete Workflow Mastery
You just used the exact same process that professional data scientists use every day:
- Imported real ocean temperature data
- Explored to understand what you had
- Cleaned (lucky us - the data was already clean!)
- Filtered to focus on specific questions
- Sorted to find temperature patterns
- Transformed data to create new insights
- Grouped by meaningful categories
- Aggregated to calculate summary statistics
- Visualized results with a professional chart
Your Data Science Journey Continues
Next Week - Individual Step Mastery:
- Day 5: Advanced filtering and transformation techniques
- Day 6: Complex grouping and aggregation methods
- Day 7: Professional data visualization with seaborn (a tiny preview follows this list)
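Here is that tiny seaborn preview - a minimal sketch assuming seaborn is installed; Day 7 covers it properly:
Code
import seaborn as sns
import matplotlib.pyplot as plt

# One line gives a styled bar chart; seaborn aggregates automatically,
# so bar heights show the mean temperature per ocean.
sns.barplot(data=df_cleaned, x='location', y='temperature')
plt.ylabel('Temperature (°C)')
plt.show()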
Your Final Project: Use this exact 9-step workflow to answer your own research question!
The Workflow You Can Always Apply
Whenever you encounter a new dataset or research question, systematically work through these 9 steps:
1. Import → 2. Explore → 3. Clean → 4. Filter → 5. Sort → 6. Transform → 7. Group → 8. Aggregate → 9. Visualize

This is your systematic approach to data science success!
End of interactive session 4C - You're now a data scientist!