Interactive Session

A cartoon panda in a frame shop. MidJourney 5

Getting Started

Before we begin our interactive session, please follow these steps to set up your Jupyter Notebook:

1. Open JupyterLab and create a new notebook:
   - Click on the + button in the top left corner
   - Select Python 3.11.0 from the Notebook options
2. Rename your notebook:
   - Right-click on the Untitled.ipynb tab
   - Select “Rename”
   - Name your notebook with the format: Session_XY_Topic.ipynb (Replace X with the day number and Y with the session number)
3. Add a title cell:
   - In the first cell of your notebook, change the cell type to “Markdown”
   - Add the following content (replace the placeholders with the actual information):

   # Day X: Session Y - [Session Topic]

   [Link to session webpage]

   Date: [Current Date]

4. Add a code cell:
   - Below the title cell, add a new cell
   - Ensure it’s set as a “Code” cell
   - This will be where you start writing your Python code for the session
5. Throughout the session:
   - Take notes in Markdown cells
   - Copy or write code in Code cells
   - Run cells to test your code
   - Ask questions if you need clarification

Caution

Remember to save your work frequently by clicking the save icon or using the keyboard shortcut (Ctrl+S or Cmd+S).

Let’s begin our interactive session!

Introduction

In this interactive session, we’ll explore the basics of working with pandas DataFrames using a dataset of world cities. We’ll cover importing data, basic DataFrame operations, and essential methods for data exploration and manipulation. This session will prepare you for more advanced data analysis tasks and upcoming collaborative coding exercises.

Learning Objectives

By the end of this session, you will be able to:

- Import data into a pandas DataFrame
- Explore basic DataFrame properties and methods
- Perform simple data filtering and selection operations
- Use basic aggregation and grouping functions

Setting Up

Let’s start by importing the pandas library and loading our dataset.

Code

import pandas as pd
import numpy as np

1. Basic Data Importing

Code

url = "https://raw.githubusercontent.com/datasets/world-cities/master/data/world-cities.csv"
cities_df = pd.read_csv(url)
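
pd.read_csv accepts a local path, a URL, or any file-like object, so the same call can be tried without a network connection. The sketch below is only an illustration: the three rows, the geonameid values, and the names sample_csv and sample_df are made up for this example and are not part of the session's dataset.

```python
import io

import pandas as pd

# Hand-written rows that mimic the column layout of world-cities.csv.
# The geonameid values here are placeholders, not real identifiers.
sample_csv = io.StringIO(
    "name,country,subcountry,geonameid\n"
    "Toronto,Canada,Ontario,1\n"
    "Fillmore,United States,California,2\n"
    "Monaco,Monaco,,3\n"
)

# read_csv treats the file-like object exactly like a file on disk
sample_df = pd.read_csv(sample_csv)

print(sample_df.shape)
print(sample_df)
```

Note that the empty subcountry field in the last row is read as a missing value (NaN), which is how gaps in the real file end up in the DataFrame as well.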

2. Basic DataFrame Exploration

Viewing the Data

Let’s take a look at the first few rows of our DataFrame:

Code

print(cities_df.head())

                 name               country          subcountry  geonameid
0        les Escaldes               Andorra  Escaldes-Engordany    3040051
1    Andorra la Vella               Andorra    Andorra la Vella    3041563
2             Warīsān  United Arab Emirates               Dubai     290503
3          Umm Suqaym  United Arab Emirates               Dubai     290581
4  Umm Al Quwain City  United Arab Emirates  ImaratUmmalQaywayn     290594

To see the last few rows, we can use:

Code

print(cities_df.tail())

                         name   country                   subcountry  \
31407                 Bindura  Zimbabwe          Mashonaland Central   
31408              Beitbridge  Zimbabwe  Matabeleland South Province   
31409                 Epworth  Zimbabwe                       Harare   
31410             Chitungwiza  Zimbabwe                       Harare   
31411  Harare Western Suburbs  Zimbabwe             Mashonaland West   

       geonameid  
31407     895061  
31408     895269  
31409    1085510  
31410    1106542  
31411   13132735

DataFrame Properties

Now, let’s explore some basic properties of our DataFrame:

Code

# Number of rows and columns
print("Shape:", cities_df.shape)

# Column names
print("\nColumns:", cities_df.columns)

# Data types of each column
print("\nData types:\n", cities_df.dtypes)

# Summary statistics of numeric columns (if any)
print("\nSummary statistics:\n", cities_df.describe())

Shape: (31412, 4)

Columns: Index(['name', 'country', 'subcountry', 'geonameid'], dtype='object')

Data types:
 name          object
country       object
subcountry    object
geonameid      int64
dtype: object

Summary statistics:
           geonameid
count  3.141200e+04
mean   3.249518e+06
std    2.843979e+06
min    1.057000e+04
25%    1.276620e+06
50%    2.634576e+06
75%    3.689249e+06
max    1.335370e+07
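
By default describe() only summarizes numeric columns, which is why the output above covers geonameid alone (and since geonameid is an identifier, its mean and standard deviation carry little meaning). A possible way to summarize the text columns too is include="all". The sketch below uses a small made-up DataFrame (toy_df, with placeholder geonameid values) rather than the real dataset.

```python
import pandas as pd

# Made-up stand-in for cities_df; values are for illustration only
toy_df = pd.DataFrame({
    "name": ["Toronto", "Terrace", "Fillmore", "Adelanto"],
    "country": ["Canada", "Canada", "United States", "United States"],
    "geonameid": [1, 2, 3, 4],
})

# include="all" adds count/unique/top/freq rows for non-numeric columns
summary = toy_df.describe(include="all")
print(summary)

# Number of distinct values in a text column
print(summary.loc["unique", "country"])
```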

Checking for Missing Values

It’s important to identify any missing data in your DataFrame:

Code

print(cities_df.isnull().sum())

name            0
country         0
subcountry    117
geonameid       0
dtype: int64

3. Basic Cleaning

Remove rows with missing data in subcountry using dropna() and the subset argument.

Code

cities_df = cities_df.dropna(subset=['subcountry'])
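
To see what the subset argument does, here is a minimal sketch on a made-up three-row DataFrame (toy_df is a hypothetical stand-in for cities_df). It also shows fillna() as one alternative when you would rather keep the rows; whether dropping or filling is appropriate depends on your analysis.

```python
import numpy as np
import pandas as pd

# Made-up data with one missing subcountry
toy_df = pd.DataFrame({
    "name": ["Toronto", "Monaco", "Fillmore"],
    "country": ["Canada", "Monaco", "United States"],
    "subcountry": ["Ontario", np.nan, "California"],
})

# Drop only the rows where 'subcountry' is missing
dropped = toy_df.dropna(subset=["subcountry"])

# Alternative: keep every row and fill the gap with a placeholder label
filled = toy_df.fillna({"subcountry": "Unknown"})

print(len(toy_df), len(dropped), len(filled))
print(dropped.index.tolist())  # original index labels are kept
```

Neither call modifies toy_df in place; each returns a new DataFrame, which is why the session code assigns the result back to cities_df.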

4. Basic Data Selection and Filtering

Selecting Columns

To select specific columns:

Code

# Select a single column
print(cities_df['name'].head())

# Select multiple columns
print(cities_df[['name', 'country', 'subcountry']].head())

0          les Escaldes
1      Andorra la Vella
2               Warīsān
3            Umm Suqaym
4    Umm Al Quwain City
Name: name, dtype: object
                 name               country          subcountry
0        les Escaldes               Andorra  Escaldes-Engordany
1    Andorra la Vella               Andorra    Andorra la Vella
2             Warīsān  United Arab Emirates               Dubai
3          Umm Suqaym  United Arab Emirates               Dubai
4  Umm Al Quwain City  United Arab Emirates  ImaratUmmalQaywayn

Filtering Rows

We can filter rows based on conditions:

Code

# Cities in the United States
us_cities = cities_df[cities_df['country'] == 'United States']
print(us_cities[['name', 'country']].head())

# Cities in California
california_cities = cities_df[(cities_df['country'] == 'United States') & (cities_df['subcountry'] == 'California')]
print(california_cities[['name', 'country', 'subcountry']].head())

             name        country
27058   Fort Hunt  United States
27059    Bessemer  United States
27060     Paducah  United States
27061  Birmingham  United States
27062     Cordova  United States
                name        country  subcountry
29470       Fillmore  United States  California
29519       Adelanto  United States  California
29520         Agoura  United States  California
29521   Agoura Hills  United States  California
29522  Agua Caliente  United States  California
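
The boolean-mask filter above can also be written in other ways. The sketch below, on a made-up DataFrame (toy_df is hypothetical), compares the mask with DataFrame.query() and shows Series.isin() for matching against a list of values. These are alternatives to try, not the approach the session itself uses.

```python
import pandas as pd

# Made-up stand-in for cities_df
toy_df = pd.DataFrame({
    "name": ["Toronto", "Fillmore", "Adelanto", "Terrace"],
    "country": ["Canada", "United States", "United States", "Canada"],
    "subcountry": ["Ontario", "California", "California", "British Columbia"],
})

# 1. Boolean mask, as in the session code
mask_result = toy_df[
    (toy_df["country"] == "United States") & (toy_df["subcountry"] == "California")
]

# 2. The same filter written as a query string
query_result = toy_df.query(
    "country == 'United States' and subcountry == 'California'"
)

# 3. isin() keeps rows whose value appears in a list
isin_result = toy_df[toy_df["subcountry"].isin(["Ontario", "California"])]

print(mask_result)
print(isin_result["name"].tolist())
```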

Combining Conditions

We can use logical operators to combine multiple conditions:

Code

# Cities in Canada that start with the letter 'T'
canadian_t_cities = cities_df[(cities_df['country'] == 'Canada') & (cities_df['name'].str.startswith('T'))]
print(canadian_t_cities[['name', 'country', 'subcountry']])

                        name country        subcountry
3895  Tam O'Shanter-Sullivan  Canada           Ontario
3896                Tecumseh  Canada           Ontario
3897           Templeton-Est  Canada            Quebec
3898                 Terrace  Canada  British Columbia
3899              Terrebonne  Canada            Quebec
3900             The Beaches  Canada           Ontario
3901                 Thorold  Canada           Ontario
3902             Thunder Bay  Canada           Ontario
3903                 Timmins  Canada           Ontario
3904                 Toronto  Canada           Ontario
3905          Trois-Rivières  Canada            Quebec
3906              Tsawwassen  Canada  British Columbia
3944          Thetford-Mines  Canada            Quebec
3957       Trinity-Bellwoods  Canada           Ontario
3986           Taylor-Massey  Canada           Ontario
4000        Thorncliffe Park  Canada           Ontario
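
Besides & (and), pandas masks can be combined with | (or) and negated with ~ (not). Each comparison needs its own parentheses because these operators bind more tightly than ==. The sketch below stores the masks in named variables to keep the filters readable; toy_df and its rows are made up for illustration.

```python
import pandas as pd

# Made-up stand-in for cities_df
toy_df = pd.DataFrame({
    "name": ["Toronto", "Terrace", "Ottawa", "Tucson"],
    "country": ["Canada", "Canada", "Canada", "United States"],
})

# Store each condition as a boolean Series
is_canada = toy_df["country"] == "Canada"
starts_with_t = toy_df["name"].str.startswith("T")

both = toy_df[is_canada & starts_with_t]           # Canadian AND starts with T
either = toy_df[is_canada | starts_with_t]         # Canadian OR starts with T
canada_not_t = toy_df[is_canada & ~starts_with_t]  # Canadian, NOT starting with T

print(both["name"].tolist())
print(either["name"].tolist())
print(canada_not_t["name"].tolist())
```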

5. Basic Sorting and Ranking

To sort the DataFrame based on one or more columns:

Code

# Sort cities alphabetically
sorted_cities = cities_df.sort_values('name')
print(sorted_cities[['name', 'country']].head())

# Sort cities by country, then by name
sorted_cities_by_country = cities_df.sort_values(['country', 'name'])
print(sorted_cities_by_country[['name', 'country']].head())

                      name      country
21486       's-Gravenzande  Netherlands
21485     's-Hertogenbosch  Netherlands
24959            'Ārdamatā        Sudan
8935   6th of October City        Egypt
9566              A Coruña        Spain
         name      country
112   Andkhōy  Afghanistan
111  Asadābād  Afghanistan
72      Aībak  Afghanistan
108   Baghlān  Afghanistan
107     Balkh  Afghanistan
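
The examples above sort in ascending order only. As a sketch of two related options, the code below sorts with a different direction per column and uses Series.rank() to assign rank positions. The toy_df rows and the visits column are invented for this illustration; the cities dataset has no such column.

```python
import pandas as pd

# Made-up data; 'visits' is a hypothetical numeric column
toy_df = pd.DataFrame({
    "name": ["Toronto", "Ottawa", "Tucson", "Austin"],
    "country": ["Canada", "Canada", "United States", "United States"],
    "visits": [30, 10, 30, 20],
})

# Country A to Z, then name Z to A within each country
mixed = toy_df.sort_values(["country", "name"], ascending=[True, False])
print(mixed[["country", "name"]])

# Rank by visits, highest first; tied rows share the smallest rank
toy_df["visits_rank"] = toy_df["visits"].rank(method="min", ascending=False)
print(toy_df[["name", "visits", "visits_rank"]])
```

Unlike sort_values(), rank() leaves the row order unchanged and returns the position each row would take, which is useful when you want to keep the ranking as a column.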

6. Basic Transformations

Creating New Columns

We can create new columns based on existing data:

Code

# Create a column for city name length
cities_df['name_length'] = cities_df['name'].str.len()

# Display the top 5 cities with the longest names
long_named_cities = cities_df.nlargest(5, 'name_length')
print(long_named_cities[['name', 'country', 'name_length']])

                                                name        country  \
30266  Diamond Head / Kapahulu / Saint Louis Heights  United States   
7995         Universitäts- und Hansestadt Greifswald        Germany   
30418         Aliamanu / Salt Lakes / Foster Village  United States   
6472           Sandaoling Lutiankuang Wuqi Nongchang          China   
9559           Sant Pere, Santa Caterina i La Ribera          Spain   

       name_length  
30266           45  
7995            39  
30418           38  
6472            37  
9559            37
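
Two details are worth trying on a small example: assign() creates a new column without modifying the original DataFrame, and nlargest() has a keep argument that controls what happens with ties (in the output above, two names share length 37). The sketch below uses made-up city names in a hypothetical toy_df.

```python
import pandas as pd

# Made-up names; three of them have the same length
toy_df = pd.DataFrame({"name": ["Rome", "Oslo", "Lima", "Toronto"]})

# assign() returns a copy with the extra column; toy_df is left untouched
with_length = toy_df.assign(name_length=toy_df["name"].str.len())

# Default keep="first": exactly 2 rows, ties resolved by original order
top2_first = with_length.nlargest(2, "name_length")

# keep="all": every row tied for the last place is included
top2_all = with_length.nlargest(2, "name_length", keep="all")

print(top2_first)
print(top2_all)
```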

7. Basic Grouping and Aggregation

Grouping allows us to perform operations on subsets of the data:

Code

# Number of cities by country
cities_per_country = cities_df.groupby('country')['name'].count().sort_values(ascending=False)
print(cities_per_country.head())

# Number of subcountries (e.g., states, provinces) by country
subcountries_per_country = cities_df.groupby('country')['subcountry'].nunique().sort_values(ascending=False)
print(subcountries_per_country.head())

country
United States    3367
India            3311
Brazil           2042
China            1997
Japan            1293
Name: name, dtype: int64
country
Russian Federation    83
Türkiye               81
Thailand              75
Viet Nam              63
Algeria               53
Name: subcountry, dtype: int64
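
The two groupby calls above can be combined into a single table. The sketch below shows one way to do that with named aggregation in agg(), plus value_counts() as a shortcut for counting rows per group. The DataFrame toy_df and the output column names n_cities and n_subcountries are made up for this example.

```python
import pandas as pd

# Made-up stand-in for cities_df
toy_df = pd.DataFrame({
    "name": ["Toronto", "Terrace", "Fillmore", "Adelanto", "Tucson"],
    "country": ["Canada", "Canada", "United States", "United States", "United States"],
    "subcountry": ["Ontario", "British Columbia", "California", "California", "Arizona"],
})

# Named aggregation: new_column=(source_column, function)
summary = (
    toy_df.groupby("country")
    .agg(n_cities=("name", "count"), n_subcountries=("subcountry", "nunique"))
    .sort_values("n_cities", ascending=False)
)
print(summary)

# value_counts() counts rows per country, sorted from most to fewest
counts = toy_df["country"].value_counts()
print(counts)
```

Keep in mind that count() skips missing values in the selected column, so it only equals the number of rows per group when that column has no gaps.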

Conclusion

In this session, we’ve covered the basics of working with pandas DataFrames using a world cities dataset, including:

- Importing data
- Exploring DataFrame properties
- Removing rows with missing values
- Selecting and filtering data
- Sorting and ranking
- Creating new columns
- Grouping and aggregation

These skills form the foundation of data analysis with pandas and will be essential for upcoming exercises and projects. Remember, pandas has many more functions and methods that we haven’t covered here. Don’t hesitate to explore the pandas documentation for more advanced features!
