Just like NumPy, pandas
has procedures to concatenate
several data frames together.
Yet pandas can do more:
in a similar fashion to SQL databases,
one can perform relational algebra joins on data frames.
Let's import both libraries.
import numpy as np
import pandas as pd
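As a quick reminder of concatenation, a minimal sketch; the two tiny data frames below are made up purely for illustration.
# a sketch: pd.concat stacks data frames vertically,
# much like np.concatenate stacks arrays
pd.concat([pd.DataFrame({'a': [1, 2]}), pd.DataFrame({'a': [3, 4]})])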
And let us again build the geographic data we have been using so far.
city = ['Liverpool', 'Manchester', 'Cardiff',
        'Swansea', 'Douglas', 'Belfast',
        'Glasgow', 'Edinburgh', 'Dublin']
country = ['England', 'England', 'Wales',
           'Wales', 'Isle of Man',
           'Northern Ireland', 'Scotland',
           'Scotland', 'Ireland']
county = ['Merseyside', 'Greater Manchester', 'South Glamorgan',
          'West Glamorgan', None, 'County Antrim and County Down',
          None, None, None]
population2001 = [435500, 405300, 305353,
                  270506, np.nan, 276459,
                  577869, 448624, np.nan]
population2011 = [466400, 503127, 335145,
                  239023, 27938, 333871,
                  593200, 476600, 554550]
df_city = pd.DataFrame({'county': county,
                        'country': country,
                        'population 2001': population2001,
                        'population 2011': population2011,
                        },
                       index=city)
df_city
country = ['Northern Ireland', 'Scotland', 'Wales',
           'England', 'Isle of Man', 'Ireland']
capital = ['Belfast', 'Edinburgh', 'Cardiff',
           'London', 'Douglas', 'Dublin']
state = ['United Kingdom', 'United Kingdom', 'United Kingdom',
         'United Kingdom', 'Isle of Man', 'Republic of Ireland']
monarch = ['Elizabeth II', 'Elizabeth II', 'Elizabeth II',
           'Elizabeth II', 'Elizabeth II', None]
area = np.array([14130, 77933, 20779, 130279, 572, 70273])
population2001 = [1.686e6, 5.064e6, np.nan, 48.65e6, 77.703e3, np.nan]
population2011 = [1.811e6, 5.281e6, 3.057e6, 53.01e6, 84.886e3, 4.571e6]
df_country = pd.DataFrame({'capital': capital,
                           'state': state,
                           'monarch': monarch,
                           'area': area,
                           'population 2001': population2001,
                           'population 2011': population2011,
                           },
                          index=country)
df_country
We know that the two data frames are related to each other: the cities reside within the countries. Moreover, the city data contains the country in which each city resides.
The most common way of joining the two data frames,
in a similar fashion to an SQL JOIN statement, is merge.
The procedure has a plethora of arguments,
but let's first use it and see what result we get.
df = pd.merge(df_country, df_city, left_index=True,
              right_on='country', suffixes=('_country', '_city'))
df
There is a lot happening here!
We know that we are trying to join the data on the countries,
but only one data frame has a country column.
First we give the two data frames;
the order is important since the first data frame will be referred to
as left in the arguments and the second will be referred to as right.
The df_country data frame has the countries as its index,
hence we can use left_index=True to say that we are joining
on the index of that data frame.
The df_city data frame has a country column,
and we use that column for the join with right_on='country'.
To make things a bit more complicated, both data frames
have population data in columns that use the same names.
To solve this we give suffixes=:
the population columns from df_country will be suffixed with _country
and the population columns from df_city with _city.
By default pandas performs an inner join,
but the how= argument allows for the usual join
suspects: left, right, outer, and inner.
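To make the difference visible, here is a hedged sketch; dropping Ireland from df_country is done purely for illustration. With how='outer' the Dublin row survives with missing country data, whereas the default inner join would discard it.
# a sketch: with Ireland removed, Dublin has no match;
# how='outer' keeps the unmatched row, filling NaNs
pd.merge(df_country.drop('Ireland'), df_city, left_index=True,
         right_on='country', suffixes=('_country', '_city'),
         how='outer')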
One may also encounter the use of the join procedure instead of merge.
join is slightly more limited because one needs to be careful
from which data frame to join.
In order to achieve the same result as above we need to join from df_city.
df_city.join(df_country, on='country', lsuffix='_city', rsuffix='_country')
As with joins in database queries, we can now query one data frame based on the joined contents of the other.
For example, we can ask which of the cities are capital cities.
df[df.index == df.capital]
There is much, much more to the use of pandas as a querying tool.
Merging and joining are some of the most commonly used procedures,
but many others exist.
In recent versions pandas added a compare procedure,
which outputs a join containing only the differences
between two data frames.
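A minimal sketch of compare, which expects two identically labelled data frames; the altered capital below is invented purely to produce a difference.
# a sketch: compare shows only the cells that differ between
# the two frames; the changed value is made up for illustration
df_changed = df_country.copy()
df_changed.loc['Scotland', 'capital'] = 'Glasgow'
df_country.compare(df_changed)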
Another pandas utility that takes its name from databases
is aggregation with groupby.
We will explore groupby from the ground up when we look
at time series, but the aggregation is not limited to time series processing.
The grouping can happen by one or more columns, and then an aggregation function is run on all other columns. The result is one aggregation per distinct value in the grouped-by columns. Below we take our country data and group by the state the countries belong to, then sum all other columns as the aggregation function. Since summing only makes sense for numeric columns, we drop the non-numeric ones with numeric_only=True (older pandas versions dropped them silently).
df_country.groupby('state').sum(numeric_only=True)
Many aggregations do not drop non-numeric columns, but numeric aggregations are by far the most common.
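As a hedged sketch of one such aggregation: min is defined for strings as well as for numbers, so nothing needs to be dropped; the per-column choice of function here is our own.
# a sketch: agg lets us pick an aggregation per column;
# 'min' works lexicographically on the string column 'capital'
df_country.groupby('state').agg({'capital': 'min', 'area': 'sum'})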
Grouping in pandas also respects missing data:
rows with a missing value in the grouped column are left out of the groups.
We will take our joined data frame and ask how many city dwellers
within our dataset Her Majesty Queen Elizabeth the Second rules over.
df.groupby('monarch')['population 2011_city'].sum()
And we can group over several columns.
For example, we can find which cities these are.
The reset_index operation turns the index into a regular column.
df.reset_index().groupby(['index', 'monarch'])['population 2011_city'].sum()
The city population totals are limited to the amount of data we have in our toy dataset. Also, what we see here is a multi-level index, an index with two levels: index and monarch. We will see how these indexes work next.