import pandas as pd
import numpy as np
Introduction to Pandas DataFrames with World Cities Data
A cartoon panda in a frame shop. MidJourney 5
Before we begin our interactive session, please follow these steps to set up your Jupyter Notebook:
1. Click the + button in the top left corner.
2. Select Python 3.10.0 from the Notebook options.
3. Rename the Untitled.ipynb tab to Session_XY_Topic.ipynb (replace X with the day number and Y with the session number).
Remember to save your work frequently by clicking the save icon or using the keyboard shortcut (Ctrl+S or Cmd+S).
Let’s begin our interactive session!
In this interactive session, we’ll explore the basics of working with pandas DataFrames using a dataset of world cities. We’ll cover importing data, basic DataFrame operations, and essential methods for data exploration and manipulation. This session will prepare you for more advanced data analysis tasks and upcoming collaborative coding exercises.
By the end of this session, you will be able to:
Let’s start by importing the pandas library and loading our dataset.
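As a minimal sketch of the loading step, the example below reads CSV text into a DataFrame named cities_df with pd.read_csv. The inline CSV is an assumption made so the example runs anywhere; it contains only the five rows shown at the top of the dataset. In the session you would pass the dataset's file path or URL instead (the name "world-cities.csv" in the comment is hypothetical).

```python
import io

import pandas as pd

# Inline stand-in for the dataset file. In practice you would call
# pd.read_csv("world-cities.csv") with the real path or URL (hypothetical name).
csv_text = """name,country,subcountry,geonameid
les Escaldes,Andorra,Escaldes-Engordany,3040051
Andorra la Vella,Andorra,Andorra la Vella,3041563
Umm Al Quwain City,United Arab Emirates,Imārat Umm al Qaywayn,290594
Ras Al Khaimah City,United Arab Emirates,Raʼs al Khaymah,291074
Zayed City,United Arab Emirates,Abu Dhabi,291580
"""

# read_csv accepts a path, a URL, or any file-like object such as StringIO
cities_df = pd.read_csv(io.StringIO(csv_text))
print(type(cities_df))
```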
Let’s take a look at the first few rows of our DataFrame:
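A short sketch of head(), assuming a small hand-built sample of rows that appear in this session's outputs rather than the full dataset. With no argument, head() returns the first five rows; passing a number changes how many rows come back.

```python
import pandas as pd

# Hand-built sample (an assumption for illustration, not the full dataset)
cities_df = pd.DataFrame({
    "name": ["les Escaldes", "Andorra la Vella", "Umm Al Quwain City",
             "Ras Al Khaimah City", "Zayed City", "Fillmore", "Adelanto"],
    "country": ["Andorra", "Andorra", "United Arab Emirates",
                "United Arab Emirates", "United Arab Emirates",
                "United States", "United States"],
    "subcountry": ["Escaldes-Engordany", "Andorra la Vella",
                   "Imārat Umm al Qaywayn", "Raʼs al Khaymah", "Abu Dhabi",
                   "California", "California"],
})

first_rows = cities_df.head()    # first 5 rows by default
first_two = cities_df.head(2)    # first 2 rows
print(first_rows)
print(first_two)
```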
   name                 country               subcountry             geonameid
0  les Escaldes         Andorra               Escaldes-Engordany     3040051
1  Andorra la Vella     Andorra               Andorra la Vella       3041563
2  Umm Al Quwain City   United Arab Emirates  Imārat Umm al Qaywayn  290594
3  Ras Al Khaimah City  United Arab Emirates  Raʼs al Khaymah        291074
4  Zayed City           United Arab Emirates  Abu Dhabi              291580
To see the last few rows, we can use:
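A minimal sketch of tail(), again assuming a small hand-built sample of rows from this session's outputs. It mirrors head(): five rows by default, or as many as you request.

```python
import pandas as pd

# Hand-built sample (an assumption for illustration, not the full dataset)
cities_df = pd.DataFrame({
    "name": ["les Escaldes", "Andorra la Vella", "Zayed City", "Fillmore",
             "Adelanto", "Agoura", "Alameda"],
    "country": ["Andorra", "Andorra", "United Arab Emirates", "United States",
                "United States", "United States", "United States"],
})

last_rows = cities_df.tail()     # last 5 rows by default
last_three = cities_df.tail(3)   # last 3 rows
print(last_rows)
print(last_three)
```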
Now, let’s explore some basic properties of our DataFrame:
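The sketch below shows the attributes and methods that typically produce this kind of overview: shape, columns, dtypes, and describe(). It assumes a five-row sample taken from the first rows of the dataset, so its numbers will differ from those of the full DataFrame.

```python
import pandas as pd

# Five-row sample (an assumption for illustration, not the full dataset)
cities_df = pd.DataFrame({
    "name": ["les Escaldes", "Andorra la Vella", "Umm Al Quwain City",
             "Ras Al Khaimah City", "Zayed City"],
    "country": ["Andorra", "Andorra", "United Arab Emirates",
                "United Arab Emirates", "United Arab Emirates"],
    "subcountry": ["Escaldes-Engordany", "Andorra la Vella",
                   "Imārat Umm al Qaywayn", "Raʼs al Khaymah", "Abu Dhabi"],
    "geonameid": [3040051, 3041563, 290594, 291074, 291580],
})

print("Shape:", cities_df.shape)        # (rows, columns) tuple
print("Columns:", cities_df.columns)    # column labels
print("Data types:")
print(cities_df.dtypes)                 # one dtype per column

# describe() summarizes numeric columns only by default
summary = cities_df.describe()
print("Summary statistics:")
print(summary)
```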
Shape: (26467, 4)
Columns: Index(['name', 'country', 'subcountry', 'geonameid'], dtype='object')
Data types:
name object
country object
subcountry object
geonameid int64
dtype: object
Summary statistics:
geonameid
count 2.646700e+04
mean 2.858410e+06
std 2.167506e+06
min 1.057000e+04
25% 1.274182e+06
50% 2.524907e+06
75% 3.589464e+06
max 1.254173e+07
It’s important to identify any missing data in your DataFrame:
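One common way to count missing values per column is to chain isnull() with sum(). The sketch below assumes a tiny sample; the "Example City" row and its missing subcountry are invented purely for illustration and do not come from the dataset.

```python
import pandas as pd

# Tiny sample; the last row is hypothetical and exists only to show a gap
cities_df = pd.DataFrame({
    "name": ["les Escaldes", "Zayed City", "Fillmore", "Example City"],
    "country": ["Andorra", "United Arab Emirates", "United States", "Exampleland"],
    "subcountry": ["Escaldes-Engordany", "Abu Dhabi", "California", None],
})

# isnull() marks each cell True/False; sum() counts the True values per column
missing_counts = cities_df.isnull().sum()
print(missing_counts)
```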
Remove rows with missing data in subcountry using dropna() and the subset argument.
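A minimal sketch of that step, assuming a tiny sample in which the "Example City" row (hypothetical, not from the dataset) has no subcountry. Passing subset=['subcountry'] tells dropna() to look for missing values in that column only.

```python
import pandas as pd

# Tiny sample; the last row is hypothetical and exists only to show a gap
cities_df = pd.DataFrame({
    "name": ["les Escaldes", "Zayed City", "Fillmore", "Example City"],
    "country": ["Andorra", "United Arab Emirates", "United States", "Exampleland"],
    "subcountry": ["Escaldes-Engordany", "Abu Dhabi", "California", None],
})

# dropna returns a new DataFrame; cities_df itself is left unchanged
cleaned_df = cities_df.dropna(subset=["subcountry"])
print(cleaned_df)
```

Because dropna() returns a new DataFrame by default, assign the result back to a variable if you want to keep working with the cleaned data.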
To select specific columns:
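A short sketch of the two selection forms, assuming a small sample of rows from this session. Single brackets with one column name return a Series; double brackets with a list of names return a DataFrame.

```python
import pandas as pd

# Five-row sample (an assumption for illustration, not the full dataset)
cities_df = pd.DataFrame({
    "name": ["les Escaldes", "Andorra la Vella", "Umm Al Quwain City",
             "Ras Al Khaimah City", "Zayed City"],
    "country": ["Andorra", "Andorra", "United Arab Emirates",
                "United Arab Emirates", "United Arab Emirates"],
    "subcountry": ["Escaldes-Engordany", "Andorra la Vella",
                   "Imārat Umm al Qaywayn", "Raʼs al Khaymah", "Abu Dhabi"],
    "geonameid": [3040051, 3041563, 290594, 291074, 291580],
})

# One column name in single brackets gives a Series
names = cities_df["name"]
print(names.head())

# A list of column names gives a DataFrame with just those columns
subset = cities_df[["name", "country", "subcountry"]]
print(subset.head())
```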
0 les Escaldes
1 Andorra la Vella
2 Umm Al Quwain City
3 Ras Al Khaimah City
4 Zayed City
Name: name, dtype: object
   name                 country               subcountry
0  les Escaldes         Andorra               Escaldes-Engordany
1  Andorra la Vella     Andorra               Andorra la Vella
2  Umm Al Quwain City   United Arab Emirates  Imārat Umm al Qaywayn
3  Ras Al Khaimah City  United Arab Emirates  Raʼs al Khaymah
4  Zayed City           United Arab Emirates  Abu Dhabi
We can filter rows based on conditions:
# Cities in the United States
us_cities = cities_df[cities_df['country'] == 'United States']
print(us_cities[['name', 'country']].head())
# Cities in California
california_cities = cities_df[(cities_df['country'] == 'United States') & (cities_df['subcountry'] == 'California')]
print(california_cities[['name', 'country', 'subcountry']].head())
       name        country
22451  Fort Hunt   United States
22452  Bessemer    United States
22453  Paducah     United States
22454  Birmingham  United States
22455  Cordova     United States
       name          country        subcountry
24818  Fillmore      United States  California
24867  Adelanto      United States  California
24868  Agoura        United States  California
24869  Agoura Hills  United States  California
24870  Alameda       United States  California
We can use logical operators to combine multiple conditions:
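As a sketch of combining conditions, the example below uses & (and) and | (or) on boolean masks. The particular filter, Canadian cities whose names start with "T", is an assumption chosen for illustration, and the sample rows are a hand-built subset of names from this session. Each condition needs its own parentheses because & and | bind more tightly than ==.

```python
import pandas as pd

# Hand-built sample (an assumption for illustration, not the full dataset)
cities_df = pd.DataFrame({
    "name": ["Toronto", "Terrace", "Terrebonne",
             "Dovercourt-Wallace Emerson-Junction", "Fillmore", "Andorra la Vella"],
    "country": ["Canada", "Canada", "Canada", "Canada", "United States", "Andorra"],
})

# AND: both conditions must hold for a row to be kept
canada_t = cities_df[
    (cities_df["country"] == "Canada") & (cities_df["name"].str.startswith("T"))
]
print(canada_t)

# OR: a row is kept if either condition holds
canada_or_andorra = cities_df[
    (cities_df["country"] == "Canada") | (cities_df["country"] == "Andorra")
]
print(canada_or_andorra)
```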
      name                    country  subcountry
2813  Tam O'Shanter-Sullivan  Canada   Ontario
2814  Tecumseh                Canada   Ontario
2815  Terrace                 Canada   British Columbia
2816  Terrebonne              Canada   Quebec
2817  The Beaches             Canada   Ontario
2818  Thorold                 Canada   Ontario
2819  Thunder Bay             Canada   Ontario
2820  Timmins                 Canada   Ontario
2821  Toronto                 Canada   Ontario
2822  Trois-Rivières          Canada   Quebec
2823  Tsawwassen              Canada   British Columbia
2860  Thetford-Mines          Canada   Quebec
2872  Trinity-Bellwoods       Canada   Ontario
2901  Taylor-Massey           Canada   Ontario
2915  Thorncliffe Park        Canada   Ontario
To sort the DataFrame based on one or more columns:
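A minimal sketch of sort_values(), assuming a small hand-built sample of names from this session. Passing one column name sorts by that column; passing a list sorts by the first column and breaks ties with the next.

```python
import pandas as pd

# Hand-built sample (an assumption for illustration, not the full dataset)
cities_df = pd.DataFrame({
    "name": ["Toronto", "Alameda", "Terrace", "Fillmore", "Andorra la Vella"],
    "country": ["Canada", "United States", "Canada", "United States", "Andorra"],
})

# Sort by a single column (ascending by default)
by_name = cities_df.sort_values("name")
print(by_name[["name", "country"]].head())

# Sort by country first, then by name within each country
by_country_name = cities_df.sort_values(["country", "name"])
print(by_country_name[["name", "country"]].head())

# ascending=False reverses the order
by_name_desc = cities_df.sort_values("name", ascending=False)
```

Note that the original row labels travel with the rows, which is why the index column in sorted output is no longer in numeric order.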
       name              country
7018   'Ali Sabieh       Djibouti
17467  's-Gravenzande    Netherlands
17466  's-Hertogenbosch  Netherlands
8037   A Coruña          Spain
8036   A Estrada         Spain
    name      country
67  Andkhōy   Afghanistan
66  Asadābād  Afghanistan
29  Aībak     Afghanistan
63  Baghlān   Afghanistan
62  Balkh     Afghanistan
We can create new columns based on existing data:
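One way to derive a column like name_length is to apply the string method str.len() to the name column and assign the result to a new column label. The sketch below assumes a three-row sample of names from this session and then sorts by the new column, longest first.

```python
import pandas as pd

# Three-row sample (an assumption for illustration, not the full dataset)
cities_df = pd.DataFrame({
    "name": ["Sandaoling Lutiankuang Wuqi Nongchang", "Toronto", "Zayed City"],
    "country": ["China", "Canada", "United Arab Emirates"],
})

# Assigning to a new label adds a column; str.len() counts characters per name
cities_df["name_length"] = cities_df["name"].str.len()

# Show the longest names first
longest = cities_df.sort_values("name_length", ascending=False)
print(longest[["name", "country", "name_length"]].head())
```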
       name                                   country     name_length
5134   Sandaoling Lutiankuang Wuqi Nongchang  China       37
8032   Sant Pere, Santa Caterina i La Ribera  Spain       37
8392   Palikir - National Government Center   Micronesia  36
16421  Nanchital de Lázaro Cárdenas del Río   Mexico      36
2896   Dovercourt-Wallace Emerson-Junction    Canada      35
Grouping allows us to perform operations on subsets of the data:
# Number of cities by country
cities_per_country = cities_df.groupby('country')['name'].count().sort_values(ascending=False)
print(cities_per_country.head())
# Number of subcountries (e.g., states, provinces) by country
subcountries_per_country = cities_df.groupby('country')['subcountry'].nunique().sort_values(ascending=False)
print(subcountries_per_country.head())
country
United States 3273
India 2480
China 1955
Brazil 1217
Germany 1117
Name: name, dtype: int64
country
Russia 83
Turkey 81
Thailand 75
Vietnam 62
Algeria 53
Name: subcountry, dtype: int64
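In this session, we’ve covered the basics of working with pandas DataFrames using a world cities dataset, including loading data, viewing the first and last rows, inspecting basic properties, handling missing data, selecting columns, filtering rows, sorting, creating new columns, and grouping.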
These skills form the foundation of data analysis with pandas and will be essential for upcoming exercises and projects. Remember, pandas has many more functions and methods that we haven’t covered here. Don’t hesitate to explore the pandas documentation for more advanced features!