import pandas as pd
import numpy as np
Introduction to Pandas DataFrames with World Cities Data
A cartoon panda in a frame shop. MidJourney 5
Before we begin our interactive session, please follow these steps to set up your Jupyter Notebook:
1. Click the + button in the top left corner.
2. Select Python 3.11.0 from the Notebook options.
3. Rename the Untitled.ipynb tab to Session_XY_Topic.ipynb (replace X with the day number and Y with the session number).
Remember to save your work frequently by clicking the save icon or using the keyboard shortcut (Ctrl+S or Cmd+S).
Let’s begin our interactive session!
In this interactive session, we’ll explore the basics of working with pandas DataFrames using a dataset of world cities. We’ll cover importing data, basic DataFrame operations, and essential methods for data exploration and manipulation. This session will prepare you for more advanced data analysis tasks and upcoming collaborative coding exercises.
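By the end of this session, you will be able to load a dataset into a DataFrame, inspect its structure, find and remove missing values, select columns, filter and sort rows, create new columns, and group data for aggregation.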
Let’s start by importing the pandas library and loading our dataset.
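The loading step could look like the minimal sketch below. The dataset's file name and location are not shown in this session, so the sketch reads a few rows from an in-memory string (`sample_csv`, a hypothetical stand-in); in the live session you would pass the actual path or URL of the world cities CSV to `pd.read_csv` instead.

```python
from io import StringIO

import pandas as pd

# Hypothetical stand-in for the world cities CSV file (the real path is not given here).
sample_csv = StringIO(
    "name,country,subcountry,geonameid\n"
    "les Escaldes,Andorra,Escaldes-Engordany,3040051\n"
    "Andorra la Vella,Andorra,Andorra la Vella,3041563\n"
    "Warīsān,United Arab Emirates,Dubai,290503\n"
)

# read_csv accepts a file path, a URL, or a file-like object such as StringIO
cities_df = pd.read_csv(sample_csv)
print(cities_df)
```

The first line of the CSV becomes the column names, and pandas infers a data type for each column.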
Let’s take a look at the first few rows of our DataFrame:
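A call to `head()` does this. The sketch below uses a small stand-in for `cities_df`, built from rows that appear in this session's output (with the subcountry column left out for brevity), so it runs on its own.

```python
import pandas as pd

# Small stand-in for cities_df, using rows shown in this session's output
cities_df = pd.DataFrame({
    "name": ["les Escaldes", "Andorra la Vella", "Warīsān", "Umm Suqaym",
             "Umm Al Quwain City", "Epworth", "Chitungwiza"],
    "country": ["Andorra", "Andorra", "United Arab Emirates", "United Arab Emirates",
                "United Arab Emirates", "Zimbabwe", "Zimbabwe"],
    "geonameid": [3040051, 3041563, 290503, 290581, 290594, 1085510, 1106542],
})

first_rows = cities_df.head()   # first 5 rows by default
print(first_rows)

first_two = cities_df.head(2)   # pass a number to change how many rows you get
print(first_two)
```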
                 name               country          subcountry  geonameid
0        les Escaldes               Andorra  Escaldes-Engordany    3040051
1    Andorra la Vella               Andorra    Andorra la Vella    3041563
2             Warīsān  United Arab Emirates               Dubai     290503
3          Umm Suqaym  United Arab Emirates               Dubai     290581
4  Umm Al Quwain City  United Arab Emirates  ImaratUmmalQaywayn     290594
To see the last few rows, we can use:
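The counterpart of `head()` is `tail()`. Here is a minimal sketch on a small stand-in DataFrame (rows taken from this session's output, with fewer columns so it stays short):

```python
import pandas as pd

# Small stand-in for cities_df, using rows shown in this session's output
cities_df = pd.DataFrame({
    "name": ["Warīsān", "Bindura", "Beitbridge", "Epworth",
             "Chitungwiza", "Harare Western Suburbs"],
    "country": ["United Arab Emirates", "Zimbabwe", "Zimbabwe", "Zimbabwe",
                "Zimbabwe", "Zimbabwe"],
})

last_rows = cities_df.tail()    # last 5 rows by default
print(last_rows)

last_two = cities_df.tail(2)    # or ask for a specific number of rows
print(last_two)
```

Notice that the row labels on the left are kept from the original DataFrame, so the last rows keep their original index values.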
                         name   country                   subcountry  geonameid
31407                 Bindura  Zimbabwe          Mashonaland Central     895061
31408              Beitbridge  Zimbabwe  Matabeleland South Province     895269
31409                 Epworth  Zimbabwe                       Harare    1085510
31410             Chitungwiza  Zimbabwe                       Harare    1106542
31411  Harare Western Suburbs  Zimbabwe             Mashonaland West   13132735
Now, let’s explore some basic properties of our DataFrame:
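The properties in question are `shape`, `columns`, and `dtypes`, plus the `describe()` method for summary statistics. The sketch below prints them for a four-row stand-in DataFrame (rows from this session's output), so the numbers will differ from the full dataset.

```python
import pandas as pd

# Small stand-in for cities_df, using rows shown in this session's output
cities_df = pd.DataFrame({
    "name": ["les Escaldes", "Warīsān", "Epworth", "Chitungwiza"],
    "country": ["Andorra", "United Arab Emirates", "Zimbabwe", "Zimbabwe"],
    "subcountry": ["Escaldes-Engordany", "Dubai", "Harare", "Harare"],
    "geonameid": [3040051, 290503, 1085510, 1106542],
})

print("Shape:", cities_df.shape)        # (number of rows, number of columns)
print("Columns:", cities_df.columns)    # the column labels

print("Data types:")
print(cities_df.dtypes)                 # one data type per column

print("Summary statistics:")
summary = cities_df.describe()          # numeric columns only, by default
print(summary)
```

Note that `shape`, `columns`, and `dtypes` are attributes (no parentheses), while `describe()` is a method. Since `geonameid` is just an identifier, its mean and standard deviation are not meaningful here; `describe()` is more useful on real measurements.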
Shape: (31412, 4)
Columns: Index(['name', 'country', 'subcountry', 'geonameid'], dtype='object')
Data types:
name          object
country       object
subcountry    object
geonameid      int64
dtype: object
Summary statistics:
          geonameid
count  3.141200e+04
mean   3.249518e+06
std    2.843979e+06
min    1.057000e+04
25%    1.276620e+06
50%    2.634576e+06
75%    3.689249e+06
max    1.335370e+07
It’s important to identify any missing data in your DataFrame:
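One common way to do this is to chain `isnull()` and `sum()`, which counts the missing values in each column. The original cell is not shown here, so treat this as a sketch; the third row is a made-up placeholder ("Exampleville") added only so there is a missing value to find.

```python
import pandas as pd

# Stand-in for cities_df; the last row is a hypothetical placeholder with no subcountry
cities_df = pd.DataFrame({
    "name": ["Epworth", "Warīsān", "Exampleville"],
    "country": ["Zimbabwe", "United Arab Emirates", "Exampleland"],
    "subcountry": ["Harare", "Dubai", None],
})

# isnull() marks each cell True/False; sum() then counts the True values per column
missing_counts = cities_df.isnull().sum()
print(missing_counts)
```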
Remove rows with missing data in subcountry using dropna() and the subset argument.
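Here is a minimal sketch of that call. The DataFrame is a stand-in, and its last row is a made-up placeholder ("Exampleville") with no subcountry; the name `cities_clean` is also just an illustrative choice.

```python
import pandas as pd

# Stand-in for cities_df; the last row is a hypothetical placeholder with no subcountry
cities_df = pd.DataFrame({
    "name": ["Epworth", "Warīsān", "Exampleville"],
    "country": ["Zimbabwe", "United Arab Emirates", "Exampleland"],
    "subcountry": ["Harare", "Dubai", None],
})

# subset limits the check to the listed columns; other columns are ignored
cities_clean = cities_df.dropna(subset=["subcountry"])
print(cities_clean)
```

`dropna()` returns a new DataFrame and leaves `cities_df` unchanged, so assign the result to a variable if you want to keep it.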
To select specific columns:
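Column selection uses square brackets. The sketch below runs on a small stand-in DataFrame built from rows in this session's output; the variable names `names` and `subset` are illustrative.

```python
import pandas as pd

# Small stand-in for cities_df, using rows shown in this session's output
cities_df = pd.DataFrame({
    "name": ["les Escaldes", "Andorra la Vella", "Warīsān"],
    "country": ["Andorra", "Andorra", "United Arab Emirates"],
    "subcountry": ["Escaldes-Engordany", "Andorra la Vella", "Dubai"],
    "geonameid": [3040051, 3041563, 290503],
})

# One column name in single brackets gives a Series
names = cities_df["name"]
print(names.head())

# A list of column names (double brackets) gives a DataFrame
subset = cities_df[["name", "country", "subcountry"]]
print(subset.head())
```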
0          les Escaldes
1      Andorra la Vella
2               Warīsān
3            Umm Suqaym
4    Umm Al Quwain City
Name: name, dtype: object
                 name               country          subcountry
0        les Escaldes               Andorra  Escaldes-Engordany
1    Andorra la Vella               Andorra    Andorra la Vella
2             Warīsān  United Arab Emirates               Dubai
3          Umm Suqaym  United Arab Emirates               Dubai
4  Umm Al Quwain City  United Arab Emirates  ImaratUmmalQaywayn
We can filter rows based on conditions:
# Cities in the United States
us_cities = cities_df[cities_df['country'] == 'United States']
print(us_cities[['name', 'country']].head())
# Cities in California
california_cities = cities_df[(cities_df['country'] == 'United States') & (cities_df['subcountry'] == 'California')]
print(california_cities[['name', 'country', 'subcountry']].head())
             name        country
27058   Fort Hunt  United States
27059    Bessemer  United States
27060     Paducah  United States
27061  Birmingham  United States
27062     Cordova  United States
                name        country  subcountry
29470       Fillmore  United States  California
29519       Adelanto  United States  California
29520         Agoura  United States  California
29521   Agoura Hills  United States  California
29522  Agua Caliente  United States  California
We can use logical operators to combine multiple conditions:
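The original cell for this step is not shown, so the sketch below is only one plausible version: it assumes the filter keeps Canadian cities whose names start with "T". The Vancouver and Tokyo rows are extra illustrative rows added so that each condition has something to exclude.

```python
import pandas as pd

# Stand-in for cities_df; Vancouver and Tokyo are added for illustration
cities_df = pd.DataFrame({
    "name": ["Toronto", "Thunder Bay", "Terrebonne", "Vancouver", "Tokyo"],
    "country": ["Canada", "Canada", "Canada", "Canada", "Japan"],
    "subcountry": ["Ontario", "Ontario", "Quebec", "British Columbia", "Tokyo"],
})

is_canada = cities_df["country"] == "Canada"
starts_with_t = cities_df["name"].str.startswith("T")

# & means AND: a row is kept only if both conditions are True
canada_t = cities_df[is_canada & starts_with_t]
print(canada_t)

# | means OR: a row is kept if at least one condition is True
quebec_or_bc = cities_df[(cities_df["subcountry"] == "Quebec")
                         | (cities_df["subcountry"] == "British Columbia")]
print(quebec_or_bc)

# ~ means NOT: it flips True and False
not_canada = cities_df[~is_canada]
print(not_canada)
```

When conditions are written inline, wrap each one in parentheses. In Python, `&` and `|` are evaluated before `==`, so leaving the parentheses out usually raises an error.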
                        name  country        subcountry
3895  Tam O'Shanter-Sullivan   Canada           Ontario
3896                Tecumseh   Canada           Ontario
3897           Templeton-Est   Canada            Quebec
3898                 Terrace   Canada  British Columbia
3899              Terrebonne   Canada            Quebec
3900             The Beaches   Canada           Ontario
3901                 Thorold   Canada           Ontario
3902             Thunder Bay   Canada           Ontario
3903                 Timmins   Canada           Ontario
3904                 Toronto   Canada           Ontario
3905          Trois-Rivières   Canada            Quebec
3906              Tsawwassen   Canada  British Columbia
3944          Thetford-Mines   Canada            Quebec
3957       Trinity-Bellwoods   Canada           Ontario
3986           Taylor-Massey   Canada           Ontario
4000        Thorncliffe Park   Canada           Ontario
To sort the DataFrame based on one or more columns:
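Sorting is done with `sort_values()`. The sketch below sorts a small stand-in DataFrame (rows from this session's output) first by one column and then by two; the variable names are illustrative.

```python
import pandas as pd

# Small stand-in for cities_df, using rows shown in this session's output
cities_df = pd.DataFrame({
    "name": ["A Coruña", "6th of October City", "Balkh", "Andkhōy", "'s-Hertogenbosch"],
    "country": ["Spain", "Egypt", "Afghanistan", "Afghanistan", "Netherlands"],
})

# Sort by a single column (ascending by default)
by_name = cities_df.sort_values("name")
print(by_name)

# Sort by country first, then by name within each country
by_country_then_name = cities_df.sort_values(["country", "name"])
print(by_country_then_name)

# ascending=False reverses the order
by_name_desc = cities_df.sort_values("name", ascending=False)
print(by_name_desc.head(1))
```

Text is sorted character by character, so names that begin with an apostrophe or a digit come before names that begin with a letter. The original row labels travel with their rows, which is why the index looks shuffled after sorting.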
                      name      country
21486       's-Gravenzande  Netherlands
21485     's-Hertogenbosch  Netherlands
24959            'Ārdamatā        Sudan
8935   6th of October City        Egypt
9566              A Coruña        Spain
         name      country
112   Andkhōy  Afghanistan
111  Asadābād  Afghanistan
72      Aībak  Afghanistan
108   Baghlān  Afghanistan
107     Balkh  Afghanistan
We can create new columns based on existing data:
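Assigning to a column name that does not exist yet creates the column. The sketch below assumes the new column is the number of characters in each city name, computed with `.str.len()`, which matches the `name_length` column in this session's output. The DataFrame is a small stand-in.

```python
import pandas as pd

# Small stand-in for cities_df, using rows shown in this session's output
cities_df = pd.DataFrame({
    "name": ["Harare", "Diamond Head / Kapahulu / Saint Louis Heights", "Toronto"],
    "country": ["Zimbabwe", "United States", "Canada"],
})

# .str.len() counts the characters in each name; assigning it adds a new column
cities_df["name_length"] = cities_df["name"].str.len()

# Show the longest names first
longest = cities_df.sort_values("name_length", ascending=False)
print(longest[["name", "country", "name_length"]].head())
```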
                                                name        country  name_length
30266  Diamond Head / Kapahulu / Saint Louis Heights  United States           45
7995         Universitäts- und Hansestadt Greifswald        Germany           39
30418         Aliamanu / Salt Lakes / Foster Village  United States           38
6472           Sandaoling Lutiankuang Wuqi Nongchang          China           37
9559           Sant Pere, Santa Caterina i La Ribera          Spain           37
Grouping allows us to perform operations on subsets of the data:
# Number of cities by country
cities_per_country = cities_df.groupby('country')['name'].count().sort_values(ascending=False)
print(cities_per_country.head())
# Number of subcountries (e.g., states, provinces) by country
subcountries_per_country = cities_df.groupby('country')['subcountry'].nunique().sort_values(ascending=False)
print(subcountries_per_country.head())
country
United States    3367
India            3311
Brazil           2042
China            1997
Japan            1293
Name: name, dtype: int64
country
Russian Federation    83
Türkiye               81
Thailand              75
Viet Nam              63
Algeria               53
Name: subcountry, dtype: int64
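In this session, we’ve covered the basics of working with pandas DataFrames using a world cities dataset, including loading data, inspecting a DataFrame, finding and removing missing values, selecting columns, filtering and sorting rows, creating new columns, and grouping.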
These skills form the foundation of data analysis with pandas and will be essential for upcoming exercises and projects. Remember, pandas has many more functions and methods that we haven’t covered here. Don’t hesitate to explore the pandas documentation for more advanced features!