An Example Python Data Science Workflow
Toolik from the boardwalk (source: https://media.arcus.org/album/polartrec-2019-alejandra-martinez/30679)
In this exercise, you will work with climate data using the Python data science workflow. You’ll load the data into a pandas DataFrame, perform basic exploration and cleaning, and create visualizations. This hands-on practice will help you understand how Python can be used for data analysis, with comparisons to similar tasks in R.
Our data comes from the Arctic Long Term Ecological Research station. The Arctic Long Term Ecological Research (ARC LTER) site is part of a network of sites established by the National Science Foundation to support long-term ecological research in the United States. The research site is located in the foothills region of the Brooks Range, North Slope of Alaska (68° 38’N, 149° 36.4’W, elevation 720 m). The Arctic LTER project’s goal is to understand and predict the effects of environmental change, both natural and anthropogenic, on arctic landscapes. Researchers at the site use long-term monitoring and surveys of natural variation of ecosystem characteristics, experimental manipulation of ecosystems (years to decades), and modeling at ecosystem and watershed scales to gain an understanding of the controls of ecosystem structure and function. The data and insights gained are provided to federal, Alaska state, and North Slope Borough officials who regulate the lands on the North Slope, and are shared through the ARC LTER web site.
Looking South of Toolik Field Station
We will be using some basic weather data downloaded from Toolik Station. I have already downloaded this data and placed it in our course repository, where we can access it easily using its GitHub raw URL.
Let’s dive into the exercise!
Import Libraries
Import the libraries needed to work with data (pandas) and create plots (matplotlib.pyplot). Use the standard Python conventions: import pandas as pd and import matplotlib.pyplot as plt.
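A minimal sketch of what the import cell could look like, using the conventional aliases named above. The version printout is only an optional sanity check and is not part of the exercise.

```python
# Conventional aliases used throughout the pandas and matplotlib documentation
import pandas as pd
import matplotlib.pyplot as plt

# Optional check that pandas loaded (the version number varies by installation)
print(pd.__version__)
```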
Load the Data
Our data is located at:
https://raw.githubusercontent.com/environmental-data-science/eds217-day0-comp/main/data/raw_data/toolik_weather.csv
Create a variable called url that stores the URL provided above as a string.
Use the read_csv() function from pandas to load the data from the URL into a new DataFrame called df.
Because we imported pandas as pd, every pandas function is called through pd using dot notation: pd.read_csv().
The read_csv() function can do a ton of different things, but today all you need to know is that it can take a URL to a CSV file as its only input.
The read_csv() function in pandas is similar to read.csv() in R. In Python, the function is part of the pandas library, which we imported as pd, so we call it using dot notation: pd.read_csv().
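One way the loading step might look is sketched below. The real call is left as a comment because it needs a network connection. The runnable part reads a few invented rows from an in-memory string instead, assuming the file has Year, Month, and Daily_AirTemp_Mean_C columns (the real file may contain more).

```python
import io
import pandas as pd

# URL given in the exercise
url = "https://raw.githubusercontent.com/environmental-data-science/eds217-day0-comp/main/data/raw_data/toolik_weather.csv"

# With a network connection, the whole file loads in one call:
# df = pd.read_csv(url)

# Offline stand-in: read_csv() also accepts file-like objects, so a few
# invented rows (not real Toolik measurements) show the same call pattern.
sample_csv = io.StringIO(
    "Year,Month,Daily_AirTemp_Mean_C\n"
    "2008,1,-20.5\n"
    "2008,2,-24.0\n"
    "2009,1,-19.5\n"
    "2009,2,-23.0\n"
)
df_sample = pd.read_csv(sample_csv)
print(df_sample)
```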
Preview the Data
Use the head() method to display the first few rows of the DataFrame df.
Because head() is a method of a DataFrame, you call it using dot notation on the DataFrame you just created: df.head().
Syntax Similarities: In R, you would use head(df) to view the first few rows.
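As a rough illustration of head(), the sketch below builds a small DataFrame of invented values (a stand-in for the Toolik data, not the real measurements) and previews it.

```python
import pandas as pd

# Invented stand-in for the Toolik data: eight rows, one per month
df = pd.DataFrame({
    "Month": [1, 2, 3, 4, 5, 6, 7, 8],
    "Daily_AirTemp_Mean_C": [-20.5, -24.0, -17.8, -15.3, -0.9, 8.8, 11.2, 8.1],
})

first_rows = df.head()     # with no argument, head() returns the first 5 rows
first_three = df.head(3)   # pass a number to change how many rows you get
print(first_rows)
```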
Check for Missing Values
Use the isnull() method combined with sum() to count missing values in each column.
In R, you might use sum(is.na(df$column)) to check for missing values.
You should see that the Daily_AirTemp_Mean_C column doesn’t have any missing values. This means we can skip the usual step of dealing with missing data. We’ll learn the Python and pandas tools for handling missing data later in the course.
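The counting pattern can be sketched on a toy DataFrame. The values are invented, and Example_Gappy_Column is a hypothetical column added only so that there is something missing to count.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "Month": [1, 2, 3, 4],
    "Daily_AirTemp_Mean_C": [-20.5, -24.0, -17.8, -15.3],
    "Example_Gappy_Column": [1.0, np.nan, 3.0, np.nan],  # hypothetical column with gaps
})

# isnull() marks each cell True/False; sum() then counts the True values per column
missing_counts = df.isnull().sum()
print(missing_counts)
```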
Data Description
Use the describe() method to generate summary statistics for numerical columns.
Use the info() method to get an overview of the DataFrame, including data types and non-null counts.
Just like head(), these are methods associated with your df object, so you call them with dot notation.
The commands summary(df) and str(df) are the R equivalents for summarizing and checking structure. Notice a pattern forming… Other than differences in function names (e.g. “trunk” vs. “boot” in American/British English), a major “grammar” difference between R and Python is Python’s frequent use of dot notation for calling methods of objects!
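A short sketch of both methods on invented sample rows (not the real Toolik values). Note that describe() returns a new DataFrame of statistics, while info() prints its report directly.

```python
import pandas as pd

# Invented sample rows standing in for the weather data
df = pd.DataFrame({
    "Year": [2008, 2008, 2009, 2009],
    "Month": [1, 2, 1, 2],
    "Daily_AirTemp_Mean_C": [-20.0, -24.0, -18.0, -22.0],
})

summary = df.describe()   # count, mean, std, min, quartiles, max for numeric columns
print(summary)

df.info()                 # prints column names, dtypes, and non-null counts
```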
Calculate Monthly Average Temperature
Use the groupby() method to group the data by the ‘Month’ column and save this as a new variable called monthly.
Calculate the mean of the Daily_AirTemp_Mean_C column for each group in monthly using the mean() function. Save this result to a new variable called monthly_means.
You can do analysis on a specific column in a DataFrame using ["column_name"] notation: my_df["column A"].mean() would give the average value of “column A” (if there was a column with that name in the DataFrame). In the coming days, we will spend a lot of time learning how to select and subset data in DataFrames!
This analysis is similar to using group_by() and summarize() in dplyr.
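The two steps above might be sketched as follows, again on invented rows. Selecting a single column before calling mean() is one reasonable reading of the exercise, and it returns a pandas Series indexed by month.

```python
import pandas as pd

# Invented sample rows: two years of January and February values
df = pd.DataFrame({
    "Year": [2008, 2008, 2009, 2009],
    "Month": [1, 2, 1, 2],
    "Daily_AirTemp_Mean_C": [-20.0, -24.0, -18.0, -22.0],
})

# Step 1: group the rows by month (nothing is computed yet)
monthly = df.groupby("Month")

# Step 2: pick one column and average it within each group
monthly_means = monthly["Daily_AirTemp_Mean_C"].mean()
print(monthly_means)
```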
Plot Monthly Average Temperature
Use the bar() method to create a bar plot of the monthly average temperature.
The bar() function is part of the plt library you imported at the start of your code, so you call it as plt.bar().
Use plt.plot() or plt.bar() to create plots. In R, you would use ggplot().
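A minimal bar-plot sketch, assuming monthly_means is a Series indexed by month (the values here are invented). The axis labels and title are illustrative choices, and the backend line is only there so the sketch also runs outside a notebook.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend for running as a script; not needed in JupyterLab
import matplotlib.pyplot as plt
import pandas as pd

# Invented monthly means standing in for the real result
monthly_means = pd.Series(
    [-19.0, -23.0, -17.5],
    index=pd.Index([1, 2, 3], name="Month"),
    name="Daily_AirTemp_Mean_C",
)

# x positions come from the index (months), bar heights from the values
plt.bar(monthly_means.index, monthly_means.values)
plt.xlabel("Month")
plt.ylabel("Mean air temperature (°C)")
plt.title("Monthly average temperature (invented sample)")
```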
Analyze Temperature Trends Over Years
Use groupby() to explore how temperature changes over the years.
Use the plot() command to create a plot of the yearly average temperature trend.
Similar to calculating monthly averages, group by the ‘Year’ column.
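The yearly version follows the same pattern as the monthly one. The sketch below uses invented values and assumes the data has a Year column, as the hint above suggests.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend for running as a script; not needed in JupyterLab
import matplotlib.pyplot as plt
import pandas as pd

# Invented sample rows: two values per year
df = pd.DataFrame({
    "Year": [2008, 2008, 2009, 2009, 2010, 2010],
    "Daily_AirTemp_Mean_C": [-9.0, -7.0, -8.0, -6.0, -7.5, -5.5],
})

# Group by year, then average the temperature column within each year
yearly_means = df.groupby("Year")["Daily_AirTemp_Mean_C"].mean()

# A line plot suits a trend over time
plt.plot(yearly_means.index, yearly_means.values)
plt.xlabel("Year")
plt.ylabel("Mean air temperature (°C)")
```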
Of course, we can always just re-run our notebook code to re-generate our analyses and figures. However, for complicated analyses and long-running processes, it is helpful to save intermediate or final outputs into files that can be re-loaded or used elsewhere. Let’s look at some ways to export our work.
To write a pandas.Series to a CSV file, you can use the .to_csv() method (just like you would with a pandas.DataFrame). Here’s an example of how to do it:
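A hedged sketch of the export step, using an invented two-row monthly_means Series. It writes into a temporary folder so that nothing is left behind. In the exercise you would more likely pass a plain filename such as "monthly_means.csv".

```python
import os
import tempfile
import pandas as pd

# Invented stand-in for the monthly_means Series
monthly_means = pd.Series(
    [-19.0, -23.0],
    index=pd.Index([1, 2], name="Month"),
    name="Daily_AirTemp_Mean_C",
)

# Temporary location used only for this sketch
out_path = os.path.join(tempfile.mkdtemp(), "monthly_means.csv")
monthly_means.to_csv(out_path, header=True)

# Read the file back as plain text to see what was written
with open(out_path) as f:
    csv_text = f.read()
print(csv_text)
```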
The index of the Series will be written as the first column in the CSV file.
With header=True, the name of the Series will be written as the header in the CSV file.
To create a CSV from a DataFrame, you use the DataFrame’s built-in method .to_csv(). In R, you would use the write.csv() function.
A major difference between Python and R is the extensive use of object methods in Python and the extensive use of global functions in R.
to_csv() Output: If you inspect the monthly_means.csv file using the file browser in JupyterLab, it will look something like this:
Month,Daily_AirTemp_Mean_C
1,-20.561290322580643
2,-23.94107142857143
3,-17.806451612903224
4,-15.25294117647059
5,-0.8758190327613105
6,8.76624
Use the plt.savefig() command to save your figure to a file. This function takes a set of keyword options that determine the output image format, the resolution (DPI, or dots per inch), and the size of the image. Here’s an example of a command that produces a jpeg file at 300 dots per inch, with the output image cropped closely around the figure:
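One way such a command might look is sketched below. The filename monthly_avg_temp.jpg and the plotted values are hypothetical. The dpi and bbox_inches="tight" options are standard savefig() keywords, and the image format is inferred from the file extension.

```python
import os
import tempfile
import matplotlib
matplotlib.use("Agg")  # non-interactive backend for running as a script; not needed in JupyterLab
import matplotlib.pyplot as plt

# A small invented figure to save
plt.bar([1, 2, 3], [-19.0, -23.0, -17.5])
plt.xlabel("Month")
plt.ylabel("Mean air temperature (°C)")

# Temporary folder and hypothetical filename, used only for this sketch
out_path = os.path.join(tempfile.mkdtemp(), "monthly_avg_temp.jpg")

# dpi sets the resolution; bbox_inches="tight" crops whitespace around the figure
plt.savefig(out_path, dpi=300, bbox_inches="tight")
```

Call savefig() before plt.show() when working in a script, since some backends clear the figure once it has been shown.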
We will spend the rest of the course learning more about each of the steps we just went through. And of course, we have a lot more to learn about the essentials of the Python programming language over the next 8 days of class.
Take some time now to reflect on what you’ve learned today, and to add some additional comments and notes in your code to follow up on in the coming days.
By the end of the course you will be writing your own Python data science workflows just like this one… hopefully many of the “code strangers” you’ve just met will have become good friends!
🎉🎉 Congratulations! You made it to the end of a Python data science workflow…🎉🎉
🎉🎉..and the end of the first day of EDS 217!! 🎉🎉
End Activity Session (Day 1)