04.02 Database Operations

Just as NumPy, pandas has procedures to concat several data frames together. Yet pandas can do more. In a similar fashion to SQL databases, one can do relational algebra joins on data frames.

Let's import both libraries.

In [1]:
import numpy as np
import pandas as pd

And let us build again the geographic data we have been using until now.

In [2]:
city = ['Liverpool', 'Manchester', 'Cardiff',
        'Swansea', 'Douglas', 'Belfast',
        'Glasgow', 'Edinburgh', 'Dublin']
country = ['England', 'England', 'Wales',
           'Wales', 'Isle of Man',
           'Northern Ireland', 'Scotland',
           'Scotland', 'Ireland']
county = ['Merseyside', 'Greater Manchester', 'South Glamorgan',
          'West Glamorgan', None,  'County Antrim and County Down',
          None, None, None]
population2001 = [435500, 405300, 305353,
                  270506, np.nan, 276459,
                  577869, 448624, np.nan]
population2011 = [466400, 503127, 335145,
                  239023,  27938, 333871,
                  593200, 476600, 554550]
df_city = pd.DataFrame({'county': county,
                        'country': country,
                        'population 2001': population2001,
                        'population 2011': population2011,
                       },
                       index=city)
df_city
Out[2]:
county country population 2001 population 2011
Liverpool Merseyside England 435500.0 466400
Manchester Greater Manchester England 405300.0 503127
Cardiff South Glamorgan Wales 305353.0 335145
Swansea West Glamorgan Wales 270506.0 239023
Douglas None Isle of Man NaN 27938
Belfast County Antrim and County Down Northern Ireland 276459.0 333871
Glasgow None Scotland 577869.0 593200
Edinburgh None Scotland 448624.0 476600
Dublin None Ireland NaN 554550

Cardiff Pier Head

pd-cardiff.svg
The United Kingdom Python Conference (PyConUK) moved across the country, it changed its location every two years. That was before establishing itself in the city of Cardiff, where the conference is held for several years since. The conference is known by the citizens of Cardiff, many of which believe that an academic conference on Herpetology (study of amphibians and reptiles, including snakes) visit their city every year. The love of snakes presented by Python programmers can be seen even by non-programmers.
In [3]:
country = ['Northern Ireland', 'Scotland', 'Wales', 'England', 'Isle of Man', 'Ireland']
capital = ['Belfast', 'Edinburgh', 'Cardiff', 'London', 'Douglas', 'Dublin']
state = ['United Kingdom', 'United Kingdom', 'United Kingdom',
         'United Kingdom', 'Isle of Man', 'Republic of Ireland']
monarch = ['Elizabeth II', 'Elizabeth II', 'Elizabeth II',
           'Elizabeth II', 'Elizabeth II', None]
area = np.array([14130, 77933, 20779, 130279, 572, 70273])
population2001 = [1.686e6, 5.064e6,  np.nan, 48.65e6, 77.703e3,  np.nan]
population2011 = [1.811e6, 5.281e6, 3.057e6, 53.01e6, 84.886e3, 4.571e6]
df_country = pd.DataFrame({'capital': capital,
                           'state': state,
                           'monarch': monarch,
                           'area': area,
                           'population 2001': population2001,
                           'population 2011': population2011,
                          },
                          index=country)
df_country
Out[3]:
capital state monarch area population 2001 population 2011
Northern Ireland Belfast United Kingdom Elizabeth II 14130 1686000.0 1811000.0
Scotland Edinburgh United Kingdom Elizabeth II 77933 5064000.0 5281000.0
Wales Cardiff United Kingdom Elizabeth II 20779 NaN 3057000.0
England London United Kingdom Elizabeth II 130279 48650000.0 53010000.0
Isle of Man Douglas Isle of Man Elizabeth II 572 77703.0 84886.0
Ireland Dublin Republic of Ireland None 70273 NaN 4571000.0

Joins

We know that both data frames are related to each other: The cities reside within the countries. Moreover, the data on cities contains the country in which the city resides.

The most common way of joining both data frames, in a similar fashion to a SQL JOIN statement, is merge. The procedure has a plethora of arguments but let's first use it and see what result we get.

In [4]:
df = pd.merge(df_country, df_city, left_index=True, right_on='country', suffixes=('_country', '_city'))
df
Out[4]:
capital state monarch area population 2001_country population 2011_country county country population 2001_city population 2011_city
Belfast Belfast United Kingdom Elizabeth II 14130 1686000.0 1811000.0 County Antrim and County Down Northern Ireland 276459.0 333871
Glasgow Edinburgh United Kingdom Elizabeth II 77933 5064000.0 5281000.0 None Scotland 577869.0 593200
Edinburgh Edinburgh United Kingdom Elizabeth II 77933 5064000.0 5281000.0 None Scotland 448624.0 476600
Cardiff Cardiff United Kingdom Elizabeth II 20779 NaN 3057000.0 South Glamorgan Wales 305353.0 335145
Swansea Cardiff United Kingdom Elizabeth II 20779 NaN 3057000.0 West Glamorgan Wales 270506.0 239023
Liverpool London United Kingdom Elizabeth II 130279 48650000.0 53010000.0 Merseyside England 435500.0 466400
Manchester London United Kingdom Elizabeth II 130279 48650000.0 53010000.0 Greater Manchester England 405300.0 503127
Douglas Douglas Isle of Man Elizabeth II 572 77703.0 84886.0 None Isle of Man NaN 27938
Dublin Dublin Republic of Ireland None 70273 NaN 4571000.0 None Ireland NaN 554550

There is a lot happening here! We know that we are trying to join the data on the countries but only one data frame has a country column. First we give the two data frames, the order is important since the first data frame will be referred to as left in the arguments and the second will be referred as right.

The df_country data frame has the countries as the index, hence we can use left_index=True to say that we are joining on the index of that data frame. The df_city data frame has a country column, and we use that column for the join with right_on='country'. To make things a bit more complicated both data frames have population data with columns that use the same names. To solve this we give suffixes=, the population columns from df_country will be appended with _country and the population columns from df_city with _city.

By default pandas performs an inner join but the argument how= allows for the common join suspects: left, right, outer, inner.

One may also encounter the use of the join procedure instead of the merge. join is slightly more limited because one needs to be careful from which data frame to join. In order to achieve the same result as above we need to join from df_city.

In [5]:
df_city.join(df_country, on='country', lsuffix='_city', rsuffix='_country')
Out[5]:
county country population 2001_city population 2011_city capital state monarch area population 2001_country population 2011_country
Liverpool Merseyside England 435500.0 466400 London United Kingdom Elizabeth II 130279 48650000.0 53010000.0
Manchester Greater Manchester England 405300.0 503127 London United Kingdom Elizabeth II 130279 48650000.0 53010000.0
Cardiff South Glamorgan Wales 305353.0 335145 Cardiff United Kingdom Elizabeth II 20779 NaN 3057000.0
Swansea West Glamorgan Wales 270506.0 239023 Cardiff United Kingdom Elizabeth II 20779 NaN 3057000.0
Douglas None Isle of Man NaN 27938 Douglas Isle of Man Elizabeth II 572 77703.0 84886.0
Belfast County Antrim and County Down Northern Ireland 276459.0 333871 Belfast United Kingdom Elizabeth II 14130 1686000.0 1811000.0
Glasgow None Scotland 577869.0 593200 Edinburgh United Kingdom Elizabeth II 77933 5064000.0 5281000.0
Edinburgh None Scotland 448624.0 476600 Edinburgh United Kingdom Elizabeth II 77933 5064000.0 5281000.0
Dublin None Ireland NaN 554550 Dublin Republic of Ireland None 70273 NaN 4571000.0

As with database queries on joins we can now query one data frame based on the joined contents of the other.

For example, we can ask which of the cities are capital cities.

In [6]:
df[df.index == df.capital]
Out[6]:
capital state monarch area population 2001_country population 2011_country county country population 2001_city population 2011_city
Belfast Belfast United Kingdom Elizabeth II 14130 1686000.0 1811000.0 County Antrim and County Down Northern Ireland 276459.0 333871
Edinburgh Edinburgh United Kingdom Elizabeth II 77933 5064000.0 5281000.0 None Scotland 448624.0 476600
Cardiff Cardiff United Kingdom Elizabeth II 20779 NaN 3057000.0 South Glamorgan Wales 305353.0 335145
Douglas Douglas Isle of Man Elizabeth II 572 77703.0 84886.0 None Isle of Man NaN 27938
Dublin Dublin Republic of Ireland None 70273 NaN 4571000.0 None Ireland NaN 554550

There is much, much more about the use of pandas as a querying tool. Merging or joining are some of the most commonly used procedures but many other exist. In recent versions pandas added a compare procedure, which out a join containing only the differences between the data frames.

Aggregation

Another pandas utility that acquires its name from databases is aggregation with groupby. We will explore groupby from the ground up when we will look at time series but the aggregation is not limited to time series processing.

The grouping can happen by one or more column, and then an aggregation function is run on all other columns. The result is one aggregation per distinct value in the grouped by columns. Below we take our country data and group by the state the countries belong to, we then sum all other columns as the aggregation function. Since summing only makes sense for numeric columns, non-numeric columns are dropped.

In [7]:
df_country.groupby('state').sum()
Out[7]:
area population 2001 population 2011
state
Isle of Man 572 77703.0 84886.0
Republic of Ireland 70273 0.0 4571000.0
United Kingdom 243121 55400000.0 63159000.0

There exist many aggregations that do not drop non-numeric columns but numeric aggregations are by far the most common.

Grouping by in pandas also respects missing data. We will take our joined data frame and ask how many city dwellers within our dataset Her Majesty Queen Elizabeth the Second rules over.

In [8]:
df.groupby('monarch').sum()['population 2011_city']
Out[8]:
monarch
Elizabeth II    2975304
Name: population 2011_city, dtype: int64

And we can group over several columns. For example, we can find what these cities are. The reset_index operation forces the index to be a column.

In [9]:
df.reset_index().groupby(['index', 'monarch']).sum()['population 2011_city']
Out[9]:
index       monarch     
Belfast     Elizabeth II    333871
Cardiff     Elizabeth II    335145
Douglas     Elizabeth II     27938
Edinburgh   Elizabeth II    476600
Glasgow     Elizabeth II    593200
Liverpool   Elizabeth II    466400
Manchester  Elizabeth II    503127
Swansea     Elizabeth II    239023
Name: population 2011_city, dtype: int64

The city population totals are limited to the amount of data we have in our toy dataset. Also, what we see here is a multi-level index, and index with two levels: index and monarch. We will see how these indexes work next.