Day 7: 🙌 Coding Colab

Introduction

In this collaborative coding exercise, you’ll work with a partner to explore a dataset using the seaborn library. You’ll focus on a workflow that includes:

Exploring distributions with histograms
Examining correlations among variables
Investigating relationships more closely with regression plots and joint distribution plots

We’ll be using the Palmer Penguins dataset, which contains information about different penguin species, their physical characteristics, and the islands they inhabit.

Setup

First, let’s import the necessary libraries and load our dataset.

import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
import numpy as np

# Set the style for better-looking plots
sns.set_style("whitegrid")

# Load the Palmer Penguins dataset
penguins = sns.load_dataset("penguins")

# Display the first few rows and basic information about the dataset
print(penguins.head())
penguins.info()  # info() prints its summary directly and returns None, so no print() is needed

  species     island  bill_length_mm  bill_depth_mm  flipper_length_mm  \
0  Adelie  Torgersen            39.1           18.7              181.0   
1  Adelie  Torgersen            39.5           17.4              186.0   
2  Adelie  Torgersen            40.3           18.0              195.0   
3  Adelie  Torgersen             NaN            NaN                NaN   
4  Adelie  Torgersen            36.7           19.3              193.0   

   body_mass_g     sex  
0       3750.0    Male  
1       3800.0  Female  
2       3250.0  Female  
3          NaN     NaN  
4       3450.0  Female  
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 344 entries, 0 to 343
Data columns (total 7 columns):
 #   Column             Non-Null Count  Dtype  
---  ------             --------------  -----  
 0   species            344 non-null    object 
 1   island             344 non-null    object 
 2   bill_length_mm     342 non-null    float64
 3   bill_depth_mm      342 non-null    float64
 4   flipper_length_mm  342 non-null    float64
 5   body_mass_g        342 non-null    float64
 6   sex                333 non-null    object 
dtypes: float64(4), object(3)
memory usage: 18.9+ KB

Task 1: Exploring Distributions with Histograms

Let’s start by exploring the distributions of various numerical variables in our dataset using histograms.

Create histograms for ‘bill_length_mm’, ‘bill_depth_mm’, ‘flipper_length_mm’, and ‘body_mass_g’.
Experiment with different numbers of bins to see how the bin count affects the visualization.
Try using sns.histplot() with the ‘kde’ parameter set to True to overlay a kernel density estimate.

Task 2: Examining Correlations

Now, let’s look at the correlations between the numerical variables in our dataset using a pair plot, seaborn’s grid of pairwise scatter plots.

Use sns.pairplot() to create a grid of scatter plots for all numeric variables.
Modify the pairplot to show the species information using different colors.
Interpret the pairplot: which variables seem to be most strongly correlated? Do you notice any patterns related to species?

Task 3: Investigating Relationships with Regression Plots

Let’s dig deeper into the relationships between variables using regression plots.

Create a regression plot (sns.regplot) showing the relationship between ‘flipper_length_mm’ and ‘body_mass_g’.
Create another regplot showing the relationship between ‘bill_length_mm’ and ‘bill_depth_mm’.
Try adding the ‘species’ information to one of these plots using different colors. Hint: sns.regplot has no hue parameter, so you might want to use sns.lmplot for this.
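
The three steps above could be sketched as follows. The function names plot_regressions and plot_regression_by_species are hypothetical; the seaborn calls themselves (sns.regplot and sns.lmplot) are the ones named in the task.

```python
import matplotlib.pyplot as plt
import seaborn as sns


def plot_regressions(df):
    """Draw the two regression plots from the task side by side and return the figure.

    plot_regressions is a hypothetical helper written for this sketch.
    """
    fig, axes = plt.subplots(1, 2, figsize=(12, 5))
    # regplot is axes-level: it draws a scatter plot plus a fitted line on the given Axes
    sns.regplot(data=df, x="flipper_length_mm", y="body_mass_g", ax=axes[0])
    sns.regplot(data=df, x="bill_length_mm", y="bill_depth_mm", ax=axes[1])
    fig.tight_layout()
    return fig


def plot_regression_by_species(df):
    """Fit one regression line per species and return the resulting FacetGrid."""
    # lmplot is figure-level: it creates its own figure and accepts hue
    return sns.lmplot(data=df, x="bill_length_mm", y="bill_depth_mm", hue="species")
```

Comparing the second panel of plot_regressions(penguins) with plot_regression_by_species(penguins) is a useful check on whether the overall trend still holds within each species.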

Task 4: Joint Distribution Plots

Finally, let’s use joint distribution plots to examine both the relationship between two variables and their individual distributions.

Create a joint plot for ‘flipper_length_mm’ and ‘body_mass_g’.
Experiment with different values of the kind parameter in the joint plot (e.g., ‘scatter’, ‘kde’, ‘hex’).
Create another joint plot, this time for ‘bill_length_mm’ and ‘bill_depth_mm’, colored by species.

Bonus Challenge

If you finish early, try this bonus challenge:

Create a correlation matrix heatmap with sns.heatmap(). It gives a different view of the correlations between variables than the pairplot does.

Create a correlation matrix using the numerical columns in the dataset.

Creating correlation matrices in pandas

Pandas dataframes include two built-in methods that can be combined to quickly create a correlation matrix between all the numerical data in a dataframe.

.select_dtypes() is a method that selects only the columns of a dataframe that match a given data type. Calling .select_dtypes(include=np.number) on a dataframe returns a new dataframe that contains only the columns with a numeric datatype.
.corr() is a method that computes the correlation between every pair of columns in a dataframe and returns the results as a matrix. It needs numeric data only, so chaining it after .select_dtypes() gets you a complete correlation matrix in a single line of code!
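
The chaining described above can be sketched as a single expression. The function name correlation_matrix is hypothetical; .select_dtypes() and .corr() are the pandas methods named in the text, and .corr() uses Pearson correlation by default.

```python
import numpy as np
import pandas as pd


def correlation_matrix(df: pd.DataFrame) -> pd.DataFrame:
    """Return the pairwise Pearson correlations between the numeric columns of df.

    correlation_matrix is a hypothetical helper written for this sketch.
    """
    # Step 1: keep only numeric columns; step 2: correlate every pair of them
    return df.select_dtypes(include=np.number).corr()
```

With the penguins dataframe, correlation_matrix(penguins) returns a 4x4 table, one row and one column for each of the four measurement columns.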

Visualize this correlation matrix using sns.heatmap().
Customize the heatmap by adding annotations and adjusting the colormap.
Compare the insights from this heatmap with those from the pairplot. What additional information does each visualization provide?
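
One way the heatmap steps might look is sketched below. The function name plot_correlation_heatmap, the "coolwarm" colormap, and the two-decimal annotation format are illustrative choices; annot, fmt, cmap, vmin, and vmax are parameters of sns.heatmap().

```python
import matplotlib.pyplot as plt
import numpy as np
import seaborn as sns


def plot_correlation_heatmap(df):
    """Draw an annotated heatmap of the numeric-column correlations and return the Axes.

    plot_correlation_heatmap is a hypothetical helper written for this sketch.
    """
    corr = df.select_dtypes(include=np.number).corr()
    fig, ax = plt.subplots(figsize=(6, 5))
    # annot=True writes each coefficient in its cell; fixing vmin/vmax at -1 and 1
    # centers the diverging colormap on zero correlation
    sns.heatmap(corr, annot=True, fmt=".2f", cmap="coolwarm", vmin=-1, vmax=1, ax=ax)
    ax.set_title("Correlation matrix")
    return ax
```

Swapping cmap for another matplotlib colormap name, or setting annot=False, is a quick way to experiment with the customization step.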

Conclusion

You’ve practiced using seaborn to explore a dataset with histograms, pair plots, regression plots, and joint distribution plots. These visualizations are most helpful at the start of an analysis, where they reveal distributions, correlations, and differences between groups, and they will remain valuable as you continue to work with more complex datasets.

End Coding Colab Session (Day 7)