Final Activity

DIY Python Data Science Workflow

In this final class activity, you will work in small groups of 2-3 to develop an example data science workflow that covers the following nine steps:

Import Data
Explore Data
Clean Data
Filter Data
Sort Data
Transform Data
Group Data
Aggregate Data
Visualize Data
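As a rough illustration of how the nine steps listed above can fit together, here is a minimal sketch. The bird-count data, the column names (`site`, `year`, `species`, `count`), and the specific choices made at each step are all invented for this example; your own dataset will call for different decisions.

```python
import io

import pandas as pd
import seaborn as sns

# Hypothetical CSV contents, invented purely for illustration.
csv_text = """site,year,species,count
A,2021,heron,12
A,2022,heron,15
B,2021,egret,
B,2022,egret,7
A,2021,egret,3
B,2022,heron,9
"""

# 1. Import: read the CSV text into a DataFrame.
df = pd.read_csv(io.StringIO(csv_text))

# 2. Explore: look at the first rows and the summary statistics.
print(df.head())
print(df.describe())

# 3. Clean: drop rows where the count is missing.
clean = df.dropna(subset=["count"])

# 4. Filter: keep only the 2022 observations.
recent = clean[clean["year"] == 2022]

# 5. Sort: largest counts first.
ranked = recent.sort_values("count", ascending=False)

# 6. Transform: add each row's share of the overall total.
clean = clean.assign(share=clean["count"] / clean["count"].sum())

# 7-8. Group and aggregate: total count per species.
totals = clean.groupby("species", as_index=False)["count"].sum()

# 9. Visualize: bar chart of the aggregated totals.
ax = sns.barplot(data=totals, x="species", y="count")
ax.set_title("Total count by species")
```

In a real notebook, each numbered comment would typically become its own markdown heading, with a short explanation of why you made each choice.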

What to do

To complete this exercise, find a suitable dataset. It doesn’t need to be environmental data per se, so be creative in your search! Focus on making several exploratory and analytical visualizations using seaborn. Avoid planning any analysis that absolutely requires mapping, and use only the pandas, numpy, matplotlib, and seaborn libraries.

Your final product will be a self-contained notebook, well documented with markdown and code comments, that you will walk through as a presentation to the rest of the class on the final day.

Your notebook should include each of the nine steps, even if you don’t need to do much in each of them.

Note

You can include visualizations as part of your data exploration (step 2), or anywhere else it is helpful.

Additional figures and graphics are also welcome - you are encouraged to make your notebooks as engaging and visually interesting as possible.

Syncing your data to GitHub

Here are some directions for syncing your classwork with GitHub

General places to find fun data

Here are some links to potential data resources that you can use to develop your analyses:

Oddly specific datasets

Using Google Drive to store your .csv file

Once you’ve found a .csv file that you want to use, you should:

Save your file to a Google Drive folder in your UCSB account.
Change the sharing settings to allow anyone with a link to view your file.
Open the sharing dialog and copy the sharing link to your clipboard.
Use the code below to download your file. You will need to add this code to the top of your notebook, in the Import Data section.

Warning

For this code to work on the workbench server, you will need to switch your kernel from 3.11.0 to 3.7.13. You can switch kernels by clicking on the kernel name in the upper right of your notebook.

Code

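import io

import pandas as pd
import requests

def extract_file_id(url):
    """Extract the file id from a Google Drive sharing URL."""
    # Sharing links look like .../file/d/<file id>/view?usp=..., so the id
    # is the second-to-last piece when the URL is split on "/".
    return url.split("/")[-2]

def df_from_gdrive_csv(url):
    """Download the CSV behind a Google Drive sharing URL into a DataFrame."""
    file_id = extract_file_id(url)
    download_url = "https://docs.google.com/uc?export=download"
    response = requests.get(download_url, params={"id": file_id})
    # Stop with a clear error if the download failed (e.g. the file is not shared).
    response.raise_for_status()
    # Parse the downloaded bytes; response.raw is not decompressed automatically.
    return pd.read_csv(io.BytesIO(response.content))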

# Example of how to use:
# Note: your sharing link will be different, but should look like this:
sharing_url = "https://drive.google.com/file/d/1RlilHNG7BtvXT2Pm4OpgNvEjVJJZNaps/view?usp=share_link"
df = df_from_gdrive_csv(sharing_url)
df.head()

	date	location	temperature	salinity	depth
0	2020-01-01	Pacific	21.523585	NaN	200
1	2020-01-02	Pacific	14.800079	34.467264	100
2	2020-01-03	Pacific	23.752256	35.016505	100
3	2020-01-04	Pacific	24.702824	36.416944	200
4	2020-01-05	Pacific	10.244824	35.807487	1000
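Once the data is loaded, the Explore, Clean, Group, and Aggregate steps might start out like the sketch below. To keep it runnable without a download, the five rows shown in the output above are re-typed by hand. The choice to drop rows with missing salinity and to average temperature by depth is only an illustration, not a recommendation for your dataset.

```python
import numpy as np
import pandas as pd

# The five rows from the df.head() output above, re-typed by hand
# so this sketch runs without contacting Google Drive.
df = pd.DataFrame({
    "date": pd.to_datetime([
        "2020-01-01", "2020-01-02", "2020-01-03", "2020-01-04", "2020-01-05",
    ]),
    "location": ["Pacific"] * 5,
    "temperature": [21.523585, 14.800079, 23.752256, 24.702824, 10.244824],
    "salinity": [np.nan, 34.467264, 35.016505, 36.416944, 35.807487],
    "depth": [200, 100, 100, 200, 1000],
})

# Explore: column types, then the number of missing values per column.
df.info()
missing = df.isna().sum()
print(missing)

# Clean: one option is to drop rows that have no salinity reading.
clean = df.dropna(subset=["salinity"])

# Group and aggregate: mean temperature at each depth.
mean_temp = df.groupby("depth")["temperature"].mean()
print(mean_temp)
```

Whether to drop, fill, or keep missing values depends on the question you are asking, so it is worth explaining your choice in a markdown cell.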