Just like NumPy, pandas
has procedures to concatenate
several data frames together.
Yet pandas can do more:
in a similar fashion to SQL databases,
one can perform relational algebra joins on data frames.
Let's import both libraries.
import numpy as np
import pandas as pd
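As a quick reminder of concatenation, a minimal sketch; the two tiny data frames below are made up purely for illustration.
# a sketch: pd.concat stacks data frames vertically,
# much like np.concatenate stacks arrays
pd.concat([pd.DataFrame({'a': [1, 2]}), pd.DataFrame({'a': [3, 4]})])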
And let us again build the geographic data we have been using so far.
city = ['Liverpool', 'Manchester', 'Cardiff',
        'Swansea', 'Douglas', 'Belfast',
        'Glasgow', 'Edinburgh', 'Dublin']
country = ['England', 'England', 'Wales',
           'Wales', 'Isle of Man',
           'Northern Ireland', 'Scotland',
           'Scotland', 'Ireland']
county = ['Merseyside', 'Greater Manchester', 'South Glamorgan',
          'West Glamorgan', None, 'County Antrim and County Down',
          None, None, None]
population2001 = [435500, 405300, 305353,
                  270506, np.nan, 276459,
                  577869, 448624, np.nan]
population2011 = [466400, 503127, 335145,
                  239023, 27938, 333871,
                  593200, 476600, 554550]
df_city = pd.DataFrame({'county': county,
                        'country': country,
                        'population 2001': population2001,
                        'population 2011': population2011,
                        },
                       index=city)
df_city
country = ['Northern Ireland', 'Scotland', 'Wales',
           'England', 'Isle of Man', 'Ireland']
capital = ['Belfast', 'Edinburgh', 'Cardiff',
           'London', 'Douglas', 'Dublin']
state = ['United Kingdom', 'United Kingdom', 'United Kingdom',
         'United Kingdom', 'Isle of Man', 'Republic of Ireland']
monarch = ['Elizabeth II', 'Elizabeth II', 'Elizabeth II',
           'Elizabeth II', 'Elizabeth II', None]
area = np.array([14130, 77933, 20779, 130279, 572, 70273])
population2001 = [1.686e6, 5.064e6, np.nan, 48.65e6, 77.703e3, np.nan]
population2011 = [1.811e6, 5.281e6, 3.057e6, 53.01e6, 84.886e3, 4.571e6]
df_country = pd.DataFrame({'capital': capital,
                           'state': state,
                           'monarch': monarch,
                           'area': area,
                           'population 2001': population2001,
                           'population 2011': population2011,
                           },
                          index=country)
df_country
We know that the two data frames are related to each other: the cities reside within the countries. Moreover, the city data contains the country in which each city resides.
The most common way of joining the two data frames,
in a similar fashion to an SQL JOIN statement, is merge.
The procedure has a plethora of arguments,
but let's first use it and see what result we get.
df = pd.merge(df_country, df_city, left_index=True,
              right_on='country', suffixes=('_country', '_city'))
df
There is a lot happening here!
We know that we are trying to join the data on the countries,
but only one data frame has a country column.
First we give the two data frames;
the order is important since the first data frame will be referred to
as left in the arguments and the second will be referred to as right.
The df_country data frame has the countries as its index,
hence we can use left_index=True to say that we are joining
on the index of that data frame.
The df_city data frame has a country column,
and we use that column for the join with right_on='country'.
To make things a bit more complicated, both data frames
have population data in columns that use the same names.
To solve this we give suffixes=:
the population columns from df_country will be suffixed with _country
and the population columns from df_city with _city.
By default pandas performs an inner join,
but the how= argument allows for the usual join
suspects: left, right, outer, and inner.
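To make the difference visible, here is a hedged sketch; dropping Ireland from df_country is done purely for illustration. With how='outer' the Dublin row survives with missing country data, whereas the default inner join would discard it.
# a sketch: with Ireland removed, Dublin has no match;
# how='outer' keeps the unmatched row, filling NaNs
pd.merge(df_country.drop('Ireland'), df_city, left_index=True,
         right_on='country', suffixes=('_country', '_city'),
         how='outer')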
One may also encounter the use of the join procedure instead of merge.
join is slightly more limited because one needs to be careful
from which data frame to join.
In order to achieve the same result as above we need to join from df_city.
df_city.join(df_country, on='country', lsuffix='_city', rsuffix='_country')
As with joins in database queries, we can now query one data frame based on the joined contents of the other.
For example, we can ask which of the cities are capital cities.
df[df.index == df.capital]
There is much, much more to the use of pandas as a querying tool.
Merging and joining are some of the most commonly used procedures,
but many others exist.
In recent versions pandas added a compare procedure,
which outputs a join containing only the differences
between two data frames.
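A minimal sketch of compare, which expects two identically labelled data frames; the altered capital below is invented purely to produce a difference.
# a sketch: compare shows only the cells that differ between
# the two frames; the changed value is made up for illustration
df_changed = df_country.copy()
df_changed.loc['Scotland', 'capital'] = 'Glasgow'
df_country.compare(df_changed)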
Another pandas utility that takes its name from databases
is aggregation with groupby.
We will explore groupby from the ground up when we look
at time series, but the aggregation is not limited to time series processing.
The grouping can happen by one or more columns, and then an aggregation function is run on all other columns. The result is one aggregation per distinct value in the grouped-by columns. Below we take our country data and group by the state the countries belong to, then sum all other columns as the aggregation function. Since summing only makes sense for numeric columns, we drop the non-numeric ones with numeric_only=True (older pandas versions dropped them silently).
df_country.groupby('state').sum(numeric_only=True)
Many aggregations do not drop non-numeric columns, but numeric aggregations are by far the most common.
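As a hedged sketch of one such aggregation: min is defined for strings as well as for numbers, so nothing needs to be dropped; the per-column choice of function here is our own.
# a sketch: agg lets us pick an aggregation per column;
# 'min' works lexicographically on the string column 'capital'
df_country.groupby('state').agg({'capital': 'min', 'area': 'sum'})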
Grouping in pandas also respects missing data:
rows with a missing value in the grouped column are left out of the groups.
We will take our joined data frame and ask how many city dwellers
within our dataset Her Majesty Queen Elizabeth the Second rules over.
df.groupby('monarch')['population 2011_city'].sum()
And we can group over several columns.
For example, we can find which cities these are.
The reset_index operation turns the index into a regular column.
df.reset_index().groupby(['index', 'monarch'])['population 2011_city'].sum()
The city population totals are limited to the amount of data we have in our toy dataset. Also, what we see here is a multi-level index, an index with two levels: index and monarch. We will see how these indexes work next.