Day 6: Tasks & Activities

In this exercise, you’ll analyze Eurovision Song Contest data using pandas. You’ll practice various data manipulation techniques and explore trends in the contest’s history.

Setup

First, import the necessary libraries and load the dataset:

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

# Load the dataset
url = "https://github.com/Spijkervet/eurovision-dataset/releases/download/2020.0/contestants.csv"
eurovision_df = pd.read_csv(url)

Task 1: Data Exploration and Cleaning

Display the first few rows of the dataset.
Check the data types of each column.
Identify and handle any missing values.
Convert the ‘year’ column to datetime type.
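
The steps above can be sketched on a small stand-in table. The values below are made up, and the column names (to_country, points_final, points_sf) are assumptions about the dataset, so check them against eurovision_df.columns before reusing anything.

```python
import numpy as np
import pandas as pd

# Toy stand-in for eurovision_df; values are made up for illustration.
toy_df = pd.DataFrame({
    "year": [1989, 1995, 2004, 2004],
    "to_country": ["Sweden", "Norway", "Ukraine", "Sweden"],
    "points_final": [110.0, 148.0, 280.0, np.nan],
    "points_sf": [np.nan, np.nan, 256.0, 120.0],
})

print(toy_df.head())        # first few rows
print(toy_df.dtypes)        # data type of each column
print(toy_df.isna().sum())  # number of missing values per column

# Parse the integer years as dates (each becomes 1 January of that year).
toy_df["year"] = pd.to_datetime(toy_df["year"].astype(str), format="%Y")
```

A missing value is not necessarily an error. Semi-finals were only introduced in 2004, so earlier entries would not be expected to have a semi-final score. Whether to drop, fill, or keep such rows depends on the question you are asking.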

Task 2: Filtering and Transformation

Create a new dataframe containing only data from 1990 onwards.

Important

Use .copy() to make sure you create a new dataframe and not just a view.

Calculate the difference between final points and semi-final points for each entry, then plot a histogram of these values using the dataframe's built-in .hist() method.
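
A minimal sketch of the filter-and-copy pattern on toy data, assuming the year column has already been converted to datetime and that the points columns are named points_final and points_sf (both names are assumptions).

```python
import numpy as np
import pandas as pd

# Toy data; values are made up for illustration.
toy_df = pd.DataFrame({
    "year": pd.to_datetime(["1985", "1995", "2004", "2010"], format="%Y"),
    "points_final": [100.0, 148.0, 280.0, 246.0],
    "points_sf": [np.nan, np.nan, 256.0, 201.0],
})

# Boolean filter, then .copy() so later assignments do not touch toy_df.
recent_df = toy_df[toy_df["year"].dt.year >= 1990].copy()

# Rows without a semi-final score produce NaN here.
recent_df["points_diff"] = recent_df["points_final"] - recent_df["points_sf"]

# In a notebook, the histogram would then be: recent_df["points_diff"].hist()
```

Because recent_df is an independent copy, adding points_diff leaves the original dataframe unchanged and avoids pandas' SettingWithCopyWarning.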

Task 3: Sorting and Aggregation

Find the top 10 countries with the most Eurovision appearances (use the entire dataset for this calculation)
Calculate the average final points for each country across all years. Make a simple bar plot of these data.

Note

Use value_counts() for counting appearances and groupby() for calculating averages.
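
As a rough illustration of the two methods named in the note, here is a sketch on toy data. The column names to_country and points_final are assumptions, and the values are made up.

```python
import pandas as pd

# Toy data; values are made up for illustration.
toy_df = pd.DataFrame({
    "to_country": ["Sweden", "Norway", "Sweden", "Ukraine", "Sweden", "Norway"],
    "points_final": [100.0, 50.0, 200.0, 280.0, 300.0, 150.0],
})

# value_counts() sorts by count, descending, so head(n) gives the top n.
appearances = toy_df["to_country"].value_counts().head(2)

# Mean final points per country; mean() skips NaN by default.
avg_points = (
    toy_df.groupby("to_country")["points_final"]
    .mean()
    .sort_values(ascending=False)
)

# In a notebook, a bar plot would be: avg_points.plot(kind="bar")
```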

Task 4: Grouping and Analysis

Determine the country with the highest average final points for each decade.

Hint: Grouping Years in Pandas

When working with time series data, it’s often useful to group years into larger intervals like decades, 5-year periods, etc. Here’s a general approach using pandas:

For decades (10-year intervals):

df['decade'] = df['year'].dt.year // 10 * 10

For any N-year interval:

N = 5  # Change this to your desired interval (e.g., 2, 5, 10, 20)
df['year_group'] = df['year'].dt.year // N * N

For more specific date ranges:

df['custom_group'] = pd.cut(df['year'].dt.year,
                            bins=[1990, 1995, 2000, 2005, 2010],
                            right=False,  # each bin includes its start year, e.g. [1990, 1995)
                            labels=['1990-1994', '1995-1999', '2000-2004', '2005-2009'])

Remember:

// is integer division (it rounds down).
Multiplying by the interval after dividing gives the start year of each group.

These methods create a new column that you can use with groupby() for aggregations across your chosen time intervals.
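
To see how a grouping column feeds into groupby(), here is a small sketch on toy data (made-up values, assumed column name points_final). It passes right=False to pd.cut so that a boundary year such as 1995 falls into the bin that starts with it.

```python
import pandas as pd

# Toy data; values are made up for illustration.
toy_df = pd.DataFrame({
    "year": pd.to_datetime(["1991", "1995", "1999", "2003", "2007"], format="%Y"),
    "points_final": [100.0, 60.0, 140.0, 150.0, 250.0],
})

# Integer-divide by 10, then multiply back to get the decade's start year.
toy_df["decade"] = toy_df["year"].dt.year // 10 * 10

# Explicit bins; right=False makes each bin closed on the left, e.g. [1995, 2000).
toy_df["custom_group"] = pd.cut(
    toy_df["year"].dt.year,
    bins=[1990, 1995, 2000, 2005, 2010],
    right=False,
    labels=["1990-1994", "1995-1999", "2000-2004", "2005-2009"],
)

# Aggregate over the new column.
mean_by_decade = toy_df.groupby("decade")["points_final"].mean()
print(mean_by_decade)
```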

Task 5: Joining Data

Read in a new dataframe that contains population data stored at this URL:

population_url = 'https://bit.ly/euro_pop'

Join this data with the Eurovision dataframe.

Warning

Ensure that country names match exactly between the two dataframes before joining.

Calculate total entries per capita by country.

Substeps:

3a. Create a new dataframe containing the counts of entries for each country (use value_counts())

3b. Merge the dataframe of counts of entries for each country with the population dataframe.

3c. Calculate entries per million population (using entries per million to make the numbers easier to work with)

3d. Sort the results by entries per capita

3e. Print the top 10 values
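
The substeps above can be sketched as follows. Everything here is toy data: the population table's column names (country, population) and its figures are hypothetical, since the layout of the real file is not shown, and to_country is an assumed column name in the Eurovision dataframe.

```python
import pandas as pd

# Toy entries table; one row per contest entry.
toy_df = pd.DataFrame(
    {"to_country": ["Sweden", "Sweden", "Norway", "Malta", "Malta", "Malta"]}
)

# Hypothetical population table; column names and figures are made up.
pop_df = pd.DataFrame({
    "country": ["Sweden", "Norway", "Malta"],
    "population": [10_000_000, 4_000_000, 500_000],
})

# Names found in only one table would be silently dropped by an inner merge.
unmatched = set(toy_df["to_country"]) ^ set(pop_df["country"])
print("Unmatched names:", unmatched)

# 3a. Count entries per country and turn the Series into a dataframe.
counts_df = (
    toy_df["to_country"].value_counts()
    .rename_axis("country")
    .reset_index(name="entries")
)

# 3b. Merge the counts with the population table on the shared key.
merged = counts_df.merge(pop_df, on="country", how="inner")

# 3c. Entries per million inhabitants.
merged["entries_per_million"] = merged["entries"] / (merged["population"] / 1_000_000)

# 3d and 3e. Sort in descending order and show the top 10.
top = merged.sort_values("entries_per_million", ascending=False).head(10)
print(top)
```

If unmatched is not empty, rename the offending entries in one of the tables before merging.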

Task 6: Time Series Analysis

Plot the trend of maximum final points awarded over the years.
Identify any significant changes in the scoring system based on this trend.

(This step only requires visual interpretation of the plot, but you could also search online to see whether actual rule changes underlie the patterns you observe.)
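
One possible way to build the series behind such a plot, sketched on made-up numbers with an assumed points_final column: take the maximum per year, then look at the year-to-year change to spot abrupt jumps.

```python
import pandas as pd

# Toy data; years and values are made up for illustration.
toy_df = pd.DataFrame({
    "year": pd.to_datetime(
        ["2001", "2001", "2002", "2002", "2003", "2003"], format="%Y"
    ),
    "points_final": [200.0, 150.0, 220.0, 180.0, 500.0, 450.0],
})

# Highest final score in each contest year.
max_by_year = toy_df.groupby(toy_df["year"].dt.year)["points_final"].max()

# Year-to-year change; an unusually large value hints at a scoring change.
jump = max_by_year.diff()
print(jump)

# In a notebook, the trend plot would be: max_by_year.plot()
```

A large jump only flags a year worth investigating. It does not by itself show that the rules changed.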

Task 7: Choose your own analysis!

Come up with your own analysis of the Eurovision data that reveals some pattern across the data or through time. Feel free to discuss your ideas with others; often this leads to new ideas or refinement of ones you are already working on.

Reflection

Now that you’ve completed the Eurovision data analysis exercise, it’s time to reflect on your experience. Add a new markdown cell to your notebook and answer the following questions:

Which tasks did you feel most comfortable with? Why do you think these were easier for you?
Which tasks did you find most challenging? What made these tasks difficult?
Are there any pandas commands or concepts that you’d like to explore further? List a few and briefly explain why you’re interested in them.
How do you think the skills you practiced in this exercise could be applied to other datasets or real-world problems?
What was the most interesting insight you gained about the Eurovision contest from this analysis?

Note

Remember, reflection is a crucial part of the learning process. It helps you identify areas for improvement and reinforces what you’ve learned.

Remember to document your code, explain your reasoning, and interpret the results of your analysis throughout the exercise.