Creating a Comprehensive 9-Step Data Science Workflow
DIY Python Data Science Workflow
In this final class activity, you will work in small groups (2-3 people) to develop an example data science workflow that covers the following nine steps:
1. Import Data
2. Explore Data
3. Clean Data
4. Filter Data
5. Sort Data
6. Transform Data
7. Group Data
8. Aggregate Data
9. Visualize Data
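The nine steps above can be sketched end to end on a tiny table. This is only a minimal illustration, not a template you must follow: the data, the column names (`site`, `year`, `temp_c`, `temp_f`), and the filter threshold are all made up for the example, and your own notebook will likely do much more in each step.

```python
import io

import pandas as pd
import seaborn as sns

# Hypothetical stand-in data so the sketch runs without downloading anything.
raw_csv = io.StringIO(
    "site,year,temp_c\n"
    "A,2020,14.2\n"
    "A,2021,15.1\n"
    "B,2020,\n"
    "B,2021,18.4\n"
    "C,2020,11.0\n"
    "C,2021,11.9\n"
)

# 1. Import: read the CSV into a DataFrame.
df = pd.read_csv(raw_csv)

# 2. Explore: look at the first rows and summary statistics.
print(df.head())
print(df.describe())

# 3. Clean: drop rows with a missing temperature.
clean = df.dropna(subset=["temp_c"])

# 4. Filter: keep rows above an (arbitrary) temperature threshold.
filtered = clean[clean["temp_c"] > 11.5]

# 5. Sort: warmest observations first.
sorted_df = filtered.sort_values("temp_c", ascending=False)

# 6. Transform: add a Fahrenheit column.
transformed = sorted_df.assign(temp_f=sorted_df["temp_c"] * 9 / 5 + 32)

# 7. Group: one group per site.
grouped = transformed.groupby("site")

# 8. Aggregate: mean Fahrenheit temperature per site.
summary = grouped["temp_f"].mean()
print(summary)

# 9. Visualize: bar chart of the per-site means.
ax = sns.barplot(data=summary.reset_index(), x="site", y="temp_f")
ax.set_ylabel("Mean temperature (°F)")
```

Keeping each step in its own notebook cell, with a markdown cell above it, makes the workflow easier to walk through in a presentation.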
What to do
To conduct this exercise, you should find a suitable dataset; it doesn't need to be environmental data per se, so be creative in your search! You should also focus on making a number of exploratory and analysis visualizations using seaborn. Avoid planning any analyses that absolutely require mapping, and use only the pandas, numpy, matplotlib, and seaborn libraries.
Your final product will be a self-contained notebook, well documented with markdown and code comments, that you will walk through as a presentation to the rest of the class on the final day.
Your notebook should include each of the nine steps, even if you don't need to do much in each of them.
Note
You can include visualizations as part of your data exploration (step 2), or anywhere else it is helpful.
Additional figures and graphics are also welcome - you are encouraged to make your notebooks as engaging and visually interesting as possible.
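As one possible starting point for exploratory figures, here is a small sketch that draws two seaborn plots side by side. The DataFrame and its columns (`site`, `rainfall_mm`, `growth_cm`) are invented for illustration; swap in your own data and choose whichever plot types suit it.

```python
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns

# Hypothetical example data; replace with your own DataFrame.
df = pd.DataFrame({
    "site": ["A", "A", "A", "B", "B", "B"],
    "rainfall_mm": [10.0, 25.0, 40.0, 5.0, 15.0, 30.0],
    "growth_cm": [1.2, 2.9, 4.1, 0.8, 1.9, 3.5],
})

# One figure with two panels, so related plots sit next to each other.
fig, axes = plt.subplots(1, 2, figsize=(10, 4))

# Left panel: relationship between two numeric columns, colored by site.
sns.scatterplot(data=df, x="rainfall_mm", y="growth_cm", hue="site", ax=axes[0])
axes[0].set_title("Growth vs. rainfall")

# Right panel: distribution of one numeric column within each site.
sns.boxplot(data=df, x="site", y="growth_cm", ax=axes[1])
axes[1].set_title("Growth by site")

fig.tight_layout()
```

Passing `ax=` to each seaborn call is what places the plot in a specific panel; without it, seaborn draws on whichever axes is currently active.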
Syncing your data to GitHub
Here are some directions for syncing your classwork with GitHub
General places to find fun data
Here are some links to potential data resources that you can use to develop your analyses:
Once you've found a .csv file that you want to use, you should:
Save your file to a Google Drive folder in your UCSB account.
Change the sharing settings to allow anyone with a link to view your file.
Open the sharing dialog and copy the sharing link to your clipboard.
Use the code below to download your file (add this code to the top of your notebook, in the Import Data section).
Warning
For this code to work on the workbench server, you will need to switch your kernel from 3.10.0 to 3.7.13. You can switch kernels by clicking on the kernel name in the upper right of your notebook.
Code
import pandas as pd
import requests


def extract_file_id(url):
    """Extract file id from Google Drive Sharing URL."""
    return url.split("/")[-2]


def df_from_gdrive_csv(url):
    """Get the CSV file from a Google Drive Sharing URL."""
    file_id = extract_file_id(url)
    URL = "https://docs.google.com/uc?export=download"
    session = requests.Session()
    response = session.get(URL, params={"id": file_id}, stream=True)
    return pd.read_csv(response.raw)


# Example of how to use:
# Note: your sharing link will be different, but should look like this:
sharing_url = "https://drive.google.com/file/d/1RlilHNG7BtvXT2Pm4OpgNvEjVJJZNaps/view?usp=share_link"
df = df_from_gdrive_csv(sharing_url)
df.head()
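The download helper relies on the file id being the second-to-last piece of the sharing link, which holds for links shaped like the example. If your link looks different, a slightly more defensive lookup may help. The sketch below is an optional alternative, not part of the course code: the function name `extract_file_id_robust` is hypothetical, and the `?id=` link shape it also handles is an assumption about other Drive link formats you might run into.

```python
import re
from urllib.parse import parse_qs, urlparse


def extract_file_id_robust(url):
    """Hypothetical helper: pull the file id out of a Google Drive sharing URL."""
    # Shape 1: .../file/d/<id>/view?... (the shape of the example link).
    match = re.search(r"/d/([A-Za-z0-9_-]+)", url)
    if match:
        return match.group(1)
    # Shape 2 (assumed): ...?id=<id>
    query = parse_qs(urlparse(url).query)
    if "id" in query:
        return query["id"][0]
    # Fail loudly instead of silently returning the wrong piece of the URL.
    raise ValueError(f"Could not find a file id in: {url}")


example_url = "https://drive.google.com/file/d/1RlilHNG7BtvXT2Pm4OpgNvEjVJJZNaps/view?usp=share_link"
file_id = extract_file_id_robust(example_url)
print(file_id)
```

Printing the extracted id before downloading is a quick way to confirm that the link was copied correctly.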