import pandas as pd
import numpy as np
Introduction to Pandas DataFrames with World Cities Data
A cartoon panda in a frame shop. MidJourney 5
Before we begin our interactive session, please follow these steps to set up your Jupyter Notebook:
1. Click the + button in the top left corner.
2. Select Python 3.11.0 from the Notebook options.
3. Rename the Untitled.ipynb tab to Session_XY_Topic.ipynb (replace X with the day number and Y with the session number).

Remember to save your work frequently by clicking the save icon or using the keyboard shortcut (Ctrl+S or Cmd+S).
Let’s begin our interactive session!
In this interactive session, we’ll explore the basics of working with pandas DataFrames using a dataset of world cities. We’ll cover importing data, basic DataFrame operations, and essential methods for data exploration and manipulation. This session will prepare you for more advanced data analysis tasks and upcoming collaborative coding exercises.
By the end of this session, you will be able to:
Let’s start by importing the pandas library and loading our dataset.
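The loading step itself is not shown here, so the following is a minimal sketch of how it might look with `pd.read_csv`. The location of the data file is not given in this session, so the example reads a small in-memory CSV (three rows copied from the preview in this session) as a stand-in. The variable name `cities_df` matches the code used later on.

```python
import io

import pandas as pd

# Stand-in for the real file (assumption): three rows copied from the
# preview in this session. With the actual dataset, pass its path or URL
# to pd.read_csv instead of a StringIO buffer.
csv_text = """name,country,subcountry,geonameid
les Escaldes,Andorra,Escaldes-Engordany,3040051
Andorra la Vella,Andorra,Andorra la Vella,3041563
Warīsān,United Arab Emirates,Dubai,290503
"""

# read_csv parses the header row into column names and infers column types
cities_df = pd.read_csv(io.StringIO(csv_text))
print(type(cities_df))
```

With the real dataset, only the argument to `pd.read_csv` changes; the result is still a DataFrame with one row per city.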
Let’s take a look at the first few rows of our DataFrame:
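A preview like this is typically produced with `head()`, which returns the first five rows unless told otherwise. Here is a minimal sketch on a small sample. The sample is an assumption built from rows shown in this session, not the full dataset.

```python
import pandas as pd

# Small illustrative sample (assumption): rows copied from this session's outputs
cities_df = pd.DataFrame({
    "name": ["les Escaldes", "Andorra la Vella", "Warīsān",
             "Umm Suqaym", "Umm Al Quwain City", "Bindura"],
    "country": ["Andorra", "Andorra", "United Arab Emirates",
                "United Arab Emirates", "United Arab Emirates", "Zimbabwe"],
    "subcountry": ["Escaldes-Engordany", "Andorra la Vella", "Dubai",
                   "Dubai", "UmmalQaywayn", "Mashonaland Central"],
    "geonameid": [3040051, 3041563, 290503, 290581, 290594, 895061],
})

first_rows = cities_df.head()    # first 5 rows by default
first_two = cities_df.head(2)    # pass n to change the number of rows
print(first_rows)
```

On the full dataset, the preview looks like this: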
                 name               country          subcountry  geonameid
0        les Escaldes               Andorra  Escaldes-Engordany    3040051
1    Andorra la Vella               Andorra    Andorra la Vella    3041563
2             Warīsān  United Arab Emirates               Dubai     290503
3          Umm Suqaym  United Arab Emirates               Dubai     290581
4  Umm Al Quwain City  United Arab Emirates        UmmalQaywayn     290594
To see the last few rows, we can use:
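The counterpart to `head()` is `tail()`, which returns the last five rows by default. Here is a minimal sketch on a small sample. The sample is an assumption built from rows shown in this session, not the full dataset.

```python
import pandas as pd

# Small illustrative sample (assumption): rows copied from this session's outputs
cities_df = pd.DataFrame({
    "name": ["les Escaldes", "Bindura", "Beitbridge", "Epworth",
             "Chitungwiza", "Harare Western Suburbs"],
    "country": ["Andorra", "Zimbabwe", "Zimbabwe", "Zimbabwe",
                "Zimbabwe", "Zimbabwe"],
    "subcountry": ["Escaldes-Engordany", "Mashonaland Central",
                   "Matabeleland South Province", "Harare", "Harare",
                   "Mashonaland West"],
    "geonameid": [3040051, 895061, 895269, 1085510, 1106542, 13132735],
})

last_rows = cities_df.tail()    # last 5 rows by default
last_two = cities_df.tail(2)    # pass n to change the number of rows
print(last_rows)
```

On the full dataset, the output looks like this (pandas wraps wide tables, so `geonameid` is printed in a second block):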
                         name   country                   subcountry  \
31694                 Bindura  Zimbabwe          Mashonaland Central
31695              Beitbridge  Zimbabwe  Matabeleland South Province
31696                 Epworth  Zimbabwe                       Harare
31697             Chitungwiza  Zimbabwe                       Harare
31698  Harare Western Suburbs  Zimbabwe             Mashonaland West

       geonameid
31694     895061
31695     895269
31696    1085510
31697    1106542
31698   13132735
Now, let’s explore some basic properties of our DataFrame:
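The labels in the output below ("Shape", "Columns", "Data types", "Summary statistics") suggest the `shape`, `columns` and `dtypes` attributes and the `describe()` method. Here is a minimal sketch under that assumption, using a three-row sample in place of the full dataset.

```python
import pandas as pd

# Small illustrative sample (assumption): rows copied from this session's outputs
cities_df = pd.DataFrame({
    "name": ["les Escaldes", "Andorra la Vella", "Warīsān"],
    "country": ["Andorra", "Andorra", "United Arab Emirates"],
    "subcountry": ["Escaldes-Engordany", "Andorra la Vella", "Dubai"],
    "geonameid": [3040051, 3041563, 290503],
})

print("Shape:", cities_df.shape)        # (number of rows, number of columns)
print("Columns:", cities_df.columns)    # column labels
print("Data types:")
print(cities_df.dtypes)                 # one dtype per column
print("Summary statistics:")
summary = cities_df.describe()          # numeric columns only, by default
print(summary)
```

Note that `describe()` summarizes `geonameid` only because it is the sole numeric column. Since it is an identifier, its mean and standard deviation carry no real meaning. On the full dataset, the output is: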
Shape: (31699, 4)
Columns: Index(['name', 'country', 'subcountry', 'geonameid'], dtype='object')
Data types:
name          object
country       object
subcountry    object
geonameid      int64
dtype: object
Summary statistics:
          geonameid
count  3.169900e+04
mean   3.266489e+06
std    2.863419e+06
min    4.900000e+02
25%    1.277083e+06
50%    2.636503e+06
75%    3.693436e+06
max    1.349420e+07
It’s important to identify any missing data in your DataFrame:
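One common way to do this is to chain `isnull()` with `sum()`, which counts the missing values in each column. Here is a minimal sketch. The row with the missing subcountry is hypothetical and only there to give the check something to find.

```python
import pandas as pd

# Illustrative sample (assumption); "Example City" is a hypothetical row
# whose subcountry is deliberately missing
cities_df = pd.DataFrame({
    "name": ["les Escaldes", "Warīsān", "Example City"],
    "country": ["Andorra", "United Arab Emirates", "Andorra"],
    "subcountry": ["Escaldes-Engordany", "Dubai", None],
})

# isnull() marks each cell True/False; sum() counts the True values per column
missing_counts = cities_df.isnull().sum()
print(missing_counts)
```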
Remove rows with missing data in subcountry using dropna() and the subset argument.
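As a minimal sketch of that step: `dropna(subset=[...])` drops only the rows where the listed columns are missing and ignores gaps elsewhere. The sample data and the name `cities_clean` are illustrative assumptions.

```python
import pandas as pd

# Illustrative sample (assumption); "Example City" is a hypothetical row
# whose subcountry is deliberately missing
cities_df = pd.DataFrame({
    "name": ["les Escaldes", "Warīsān", "Example City"],
    "country": ["Andorra", "United Arab Emirates", "Andorra"],
    "subcountry": ["Escaldes-Engordany", "Dubai", None],
})

# Keep only rows that have a subcountry; dropna returns a new DataFrame
# and leaves cities_df unchanged
cities_clean = cities_df.dropna(subset=["subcountry"])
print(len(cities_df), "rows before,", len(cities_clean), "rows after")
```

Assigning the result back to the original variable is a common alternative when the incomplete rows are no longer needed.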
To select specific columns:
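The two outputs below are consistent with selecting a single column (which gives a Series) and then a list of columns (which gives a DataFrame). Here is a minimal sketch under that assumption, on a small sample built from rows shown in this session.

```python
import pandas as pd

# Small illustrative sample (assumption): rows copied from this session's outputs
cities_df = pd.DataFrame({
    "name": ["les Escaldes", "Andorra la Vella", "Warīsān"],
    "country": ["Andorra", "Andorra", "United Arab Emirates"],
    "subcountry": ["Escaldes-Engordany", "Andorra la Vella", "Dubai"],
    "geonameid": [3040051, 3041563, 290503],
})

# Single brackets with one label return a Series
names = cities_df["name"]
print(names.head())

# Double brackets (a list of labels) return a DataFrame
name_and_location = cities_df[["name", "country", "subcountry"]]
print(name_and_location.head())
```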
0          les Escaldes
1      Andorra la Vella
2               Warīsān
3            Umm Suqaym
4    Umm Al Quwain City
Name: name, dtype: object
                 name               country          subcountry
0        les Escaldes               Andorra  Escaldes-Engordany
1    Andorra la Vella               Andorra    Andorra la Vella
2             Warīsān  United Arab Emirates               Dubai
3          Umm Suqaym  United Arab Emirates               Dubai
4  Umm Al Quwain City  United Arab Emirates        UmmalQaywayn
We can filter rows based on conditions:
# Cities in the United States
us_cities = cities_df[cities_df['country'] == 'United States']
print(us_cities[['name', 'country']].head())
# Cities in California
california_cities = cities_df[(cities_df['country'] == 'United States') & (cities_df['subcountry'] == 'California')]
print(california_cities[['name', 'country', 'subcountry']].head())
             name        country
27316   Fort Hunt  United States
27317    Bessemer  United States
27318     Paducah  United States
27319  Birmingham  United States
27320     Cordova  United States
                name        country  subcountry
29728       Fillmore  United States  California
29777       Adelanto  United States  California
29778         Agoura  United States  California
29779   Agoura Hills  United States  California
29780  Agua Caliente  United States  California
We can use logical operators to combine multiple conditions:
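The output below lists Canadian cities whose names begin with "T", which is consistent with joining two conditions with `&`. The exact filter is not shown, so the following is a sketch under that assumption, using a small illustrative sample.

```python
import pandas as pd

# Small illustrative sample (assumption), not the full dataset
cities_df = pd.DataFrame({
    "name": ["Toronto", "Vancouver", "Thunder Bay", "Tucson", "Fillmore"],
    "country": ["Canada", "Canada", "Canada", "United States", "United States"],
    "subcountry": ["Ontario", "British Columbia", "Ontario", "Arizona", "California"],
})

# & means "and": both conditions must hold. Each condition needs its own
# parentheses because & binds more tightly than ==.
canada_t = cities_df[
    (cities_df["country"] == "Canada") & (cities_df["name"].str.startswith("T"))
]
print(canada_t)

# | means "or": either condition may hold
ontario_or_california = cities_df[
    (cities_df["subcountry"] == "Ontario") | (cities_df["subcountry"] == "California")
]
print(ontario_or_california)
```

Python's `and` and `or` keywords do not work element by element on a Series, which is why pandas filters use `&`, `|` and `~` (not) instead.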
                        name  country        subcountry
3986  Tam O'Shanter-Sullivan   Canada           Ontario
3987                Tecumseh   Canada           Ontario
3988           Templeton-Est   Canada            Quebec
3989                 Terrace   Canada  British Columbia
3990              Terrebonne   Canada            Quebec
3991             The Beaches   Canada           Ontario
3992                 Thorold   Canada           Ontario
3993             Thunder Bay   Canada           Ontario
3994             Tillsonburg   Canada           Ontario
3995                 Timmins   Canada           Ontario
3996                 Toronto   Canada           Ontario
3997          Trois-Rivières   Canada            Quebec
3998              Tsawwassen   Canada  British Columbia
4038          Thetford-Mines   Canada            Quebec
4051       Trinity-Bellwoods   Canada           Ontario
4080           Taylor-Massey   Canada           Ontario
4094        Thorncliffe Park   Canada           Ontario
To sort the DataFrame based on one or more columns:
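The two outputs below appear to be sorted by `name`, and then by `country` followed by `name`. Here is a minimal sketch of both with `sort_values`, under that assumption, on a small sample built from rows shown in this session.

```python
import pandas as pd

# Small illustrative sample (assumption): rows copied from this session's outputs
cities_df = pd.DataFrame({
    "name": ["Balkh", "A Coruña", "Andkhōy", "'s-Hertogenbosch"],
    "country": ["Afghanistan", "Spain", "Afghanistan", "Netherlands"],
})

# Sort by one column (ascending by default)
by_name = cities_df.sort_values("name")
print(by_name[["name", "country"]].head())

# Sort by several columns: country first, then name within each country
by_country_then_name = cities_df.sort_values(["country", "name"])
print(by_country_then_name[["name", "country"]].head())
```

String sorting compares characters by code point, so names that start with an apostrophe or a digit come before names that start with a letter. This explains the first rows of the output below.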
                      name      country
21637       's-Gravenzande  Netherlands
21636     's-Hertogenbosch  Netherlands
25121            'Ārdamatā        Sudan
9057   6th of October City        Egypt
9688              A Coruña        Spain
         name      country
112   Andkhōy  Afghanistan
111  Asadābād  Afghanistan
72      Aībak  Afghanistan
108   Baghlān  Afghanistan
107     Balkh  Afghanistan
We can create new columns based on existing data:
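The `name_length` values in the output below match the character count of each name, which suggests `str.len()` followed by a descending sort. Here is a minimal sketch under that assumption, on a small sample built from names shown in this session.

```python
import pandas as pd

# Small illustrative sample (assumption): names copied from this session's outputs
cities_df = pd.DataFrame({
    "name": ["Toronto",
             "Universitäts- und Hansestadt Greifswald",
             "Sandaoling Lutiankuang Wuqi Nongchang",
             "Aliamanu / Salt Lakes / Foster Village"],
    "country": ["Canada", "Germany", "China", "United States"],
})

# Assigning to a new column label adds the column;
# str.len() counts the characters in each name
cities_df["name_length"] = cities_df["name"].str.len()

# Longest names first
longest = cities_df.sort_values("name_length", ascending=False)
print(longest[["name", "country", "name_length"]].head())
```

On the full dataset, the five longest names are: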
                                                    name        country  \
22968  Karachi University Employees Co-operative Hous...       Pakistan
30524      Diamond Head / Kapahulu / Saint Louis Heights  United States
8114             Universitäts- und Hansestadt Greifswald        Germany
30676             Aliamanu / Salt Lakes / Foster Village  United States
6569               Sandaoling Lutiankuang Wuqi Nongchang          China

       name_length
22968           57
30524           45
8114            39
30676           38
6569            37
Grouping allows us to perform operations on subsets of the data:
# Number of cities by country
cities_per_country = cities_df.groupby('country')['name'].count().sort_values(ascending=False)
print(cities_per_country.head())
# Number of subcountries (e.g., states, provinces) by country
subcountries_per_country = cities_df.groupby('country')['subcountry'].nunique().sort_values(ascending=False)
print(subcountries_per_country.head())
country
United States    3367
India            3312
Brazil           2111
China            1999
Japan            1293
Name: name, dtype: int64
country
Russian Federation    83
Türkiye               81
Thailand              75
Algeria               53
United States         51
Name: subcountry, dtype: int64
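In this session, we’ve covered the basics of working with pandas DataFrames using a world cities dataset, including loading and inspecting data, checking for missing values, selecting columns, filtering rows, sorting, creating new columns, and grouping.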
These skills form the foundation of data analysis with pandas and will be essential for upcoming exercises and projects. Remember, pandas has many more functions and methods that we haven’t covered here. Don’t hesitate to explore the pandas documentation for more advanced features!