Day 3: Tasks & activities

A cartoon panda looking over a year’s worth of monthly class exams. The panda is doing great; A+! (Midjourney5: https://www.midjourney.com/jobs/6b63c3ca-c64d-41b8-a791-7e4b2594c781?index=0)

Introduction

In this end-of-day activity, we’ll practice using Pandas Series for data analysis and learn how to use NumPy’s random number generator. We’ll create a series of test scores using random numbers and explore how to make our random number generation reproducible.

Setup

First, let’s import the necessary libraries and set up our environment.

Code

import pandas as pd
import numpy as np

Understanding NumPy’s Random Number Generator

NumPy generates random numbers through its Generator class (np.random.Generator). Let’s explore how to use it and why controlling its starting state matters in data science.

Creating a Random Number Generator

We can create a random number generator object like this:

Code

rng = np.random.default_rng()

Because no seed is given, NumPy seeds the generator with fresh, unpredictable entropy from the operating system. Each time you run your code, you’ll get different random numbers.

Using a Seed for Reproducibility

In data science, it’s often crucial to be able to reproduce our results. We can do this by setting a seed for our random number generator. Here’s how:

Code

rng = np.random.default_rng(seed=42)

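Now, every time we run this code, rng starts from the same state, so the same sequence of calls produces the same “random” numbers. Repeated calls on a single rng object still advance through that sequence and return new values; it is re-creating the generator with the same seed that restarts it. This is extremely useful for debugging, sharing results, and ensuring consistency in our analyses.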
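
As a small illustration of the point above, the sketch below draws twice from one seeded generator and then once from a second generator created with the same seed. The names rng_a, rng_b, first_draw, second_draw, and repeat_draw are made up for this example, and the range 70 to 100 is borrowed from the activity that follows.

```python
import numpy as np

# Hypothetical names for illustration only.
rng_a = np.random.default_rng(seed=42)
first_draw = rng_a.integers(70, 101, size=5)   # first five values in the sequence
second_draw = rng_a.integers(70, 101, size=5)  # next five values; the generator has advanced

# A fresh generator with the same seed starts the sequence over.
rng_b = np.random.default_rng(seed=42)
repeat_draw = rng_b.integers(70, 101, size=5)

print(first_draw)
print(second_draw)
print(repeat_draw)
print(np.array_equal(first_draw, repeat_draw))  # compares the first and repeated draws
```

The last line compares the first draw with the draw from the re-created generator, which is the comparison that matters for reproducibility.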

Creating the Test Scores Series

Create a Series called scores that contains 10 elements representing monthly test scores. Use random integers from 70 to 100 (inclusive) for the scores, and set the index to the month names from September to June. Note that rng.integers excludes its upper bound by default, so pass 101 as the upper limit:

months = ['Sep', 'Oct', 'Nov', 'Dec', 'Jan', 'Feb', 'Mar', 'Apr', 'May', 'Jun']
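
If you get stuck, one possible way to build the series is sketched below. It assumes a seed of 42 (any seed, or none, would also work) and uses the same rng.integers(70, 101, ...) call that appears later in this activity; treat it as one option rather than the required solution.

```python
import numpy as np
import pandas as pd

months = ['Sep', 'Oct', 'Nov', 'Dec', 'Jan', 'Feb', 'Mar', 'Apr', 'May', 'Jun']

# Assumption: seed=42 is chosen only so the result is repeatable.
rng = np.random.default_rng(seed=42)

# integers(70, 101) draws from 70 up to and including 100,
# because the upper bound is exclusive by default.
scores = pd.Series(rng.integers(70, 101, size=len(months)), index=months)

print(scores)
```

An equivalent call is rng.integers(70, 100, size=len(months), endpoint=True), which makes the upper bound inclusive instead.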

Analyzing the Test Scores

Now that we have our test scores series, let’s analyze the data by answering the following questions:

1. What is the student’s average test score for the entire year?

Calculate the mean of all scores in the series.

2. What is the student’s average test score during the first half of the year?

Calculate the mean of the first five months’ scores.

3. What is the student’s average test score during the second half of the year?

Calculate the mean of the last five months’ scores.

4. Did the student improve their performance in the second half? If so, by how much?

Compare the average scores from the first and second half of the year.
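
The four questions above could be approached along the following lines. This is a sketch, not the only valid answer: it rebuilds scores with an assumed seed of 42 so that it runs on its own, and the variable names (year_avg, first_half_avg, second_half_avg, improvement) are illustrative.

```python
import numpy as np
import pandas as pd

months = ['Sep', 'Oct', 'Nov', 'Dec', 'Jan', 'Feb', 'Mar', 'Apr', 'May', 'Jun']
rng = np.random.default_rng(seed=42)  # assumed seed, for repeatability only
scores = pd.Series(rng.integers(70, 101, size=len(months)), index=months)

# 1. Average over the whole year
year_avg = scores.mean()

# 2. Average over the first five months (Sep to Jan), selected by position
first_half_avg = scores.iloc[:5].mean()

# 3. Average over the last five months (Feb to Jun)
second_half_avg = scores.iloc[5:].mean()

# 4. Positive means the second half was better; negative means it was worse
improvement = second_half_avg - first_half_avg

print(f"Year average:        {year_avg:.1f}")
print(f"First-half average:  {first_half_avg:.1f}")
print(f"Second-half average: {second_half_avg:.1f}")
print(f"Change:              {improvement:+.1f}")
```

Label-based slicing such as scores.loc['Sep':'Jan'] selects the same first five months; unlike iloc, it includes the end label.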

Exploring Reproducibility

To demonstrate the importance of seeding, try creating two series from two separate random number generators that share the same seed:

Code

rng1 = np.random.default_rng(seed=42)
rng2 = np.random.default_rng(seed=42)

series1 = pd.Series(rng1.integers(70, 101, size=10), index=months)
series2 = pd.Series(rng2.integers(70, 101, size=10), index=months)

print(series1.equals(series2))  # Should print True: same seed, same values

True

Now try creating two series with random number generators that have different seeds:

Code

rng3 = np.random.default_rng(seed=42)
rng4 = np.random.default_rng(seed=123)

series3 = pd.Series(rng3.integers(70, 101, size=10), index=months)
series4 = pd.Series(rng4.integers(70, 101, size=10), index=months)

print(series3.equals(series4))  # Should print False: different seeds, different values

False

Conclusion

In this activity, you practiced creating and analyzing a Pandas Series representing test scores. You also learned about NumPy’s random number generator and the importance of seeding for reproducibility in data science. These skills are fundamental in data analysis and will be useful in more complex data science workflows.

Additional Resources

Remember to document your code and results clearly in your Jupyter Notebook. Good luck!

End Activity Session (Day 3)