04.03 Pandas Indexes as Dimensions

Indexes in pandas do much more than enumerating the rows of a series or data frame. An index can hold a list of values as the index of a certain element, in other words we can have a combination of values as the index, a multi-index. Column names are also an index and can be multi-valued as well.

In [1]:
import numpy as np
import pandas as pd

Let's pick some of the numeric columns from our British Isles data frame and stack them together into a multiple index.

In [2]:
city = ['Liverpool', 'Manchester', 'Cardiff',
        'Swansea', 'Douglas', 'Belfast',
        'Glasgow', 'Edinburgh', 'Dublin']
country = ['England', 'England', 'Wales',
           'Wales', 'Isle of Man',
           'Northern Ireland', 'Scotland',
           'Scotland', 'Ireland']
county = ['Merseyside', 'Greater Manchester', 'South Glamorgan',
          'West Glamorgan', None,  'County Antrim and County Down',
          None, None, None]
population2001 = [435500, 405300, 305353,
                  270506, np.nan, 276459,
                  577869, 448624, np.nan]
population2011 = [466400, 503127, 335145,
                  239023,  27938, 333871,
                  593200, 476600, 554550]
df = pd.DataFrame({'county': county,
                   'country': country,
                   'population 2001': population2001,
                   'population 2011': population2011,
                  },
                  index=city)
df
Out[2]:
county country population 2001 population 2011
Liverpool Merseyside England 435500.0 466400
Manchester Greater Manchester England 405300.0 503127
Cardiff South Glamorgan Wales 305353.0 335145
Swansea West Glamorgan Wales 270506.0 239023
Douglas None Isle of Man NaN 27938
Belfast County Antrim and County Down Northern Ireland 276459.0 333871
Glasgow None Scotland 577869.0 593200
Edinburgh None Scotland 448624.0 476600
Dublin None Ireland NaN 554550

We choose the population columns because these are more interesting. For one things they contain missing data, and we can see how pandas will deal with that extra difficulty.

In [3]:
pop = df[['population 2001', 'population 2011']]
pop.columns = [2001, 2011]
pop
Out[3]:
2001 2011
Liverpool 435500.0 466400
Manchester 405300.0 503127
Cardiff 305353.0 335145
Swansea 270506.0 239023
Douglas NaN 27938
Belfast 276459.0 333871
Glasgow 577869.0 593200
Edinburgh 448624.0 476600
Dublin NaN 554550

We also renamed the columns to $2001$ and $2011$ to make examples shorter.

Now we stack, and we get a Series with a double valued index.

In [4]:
pop_year = pop.stack()
pop_year
Out[4]:
Liverpool   2001    435500.0
            2011    466400.0
Manchester  2001    405300.0
            2011    503127.0
Cardiff     2001    305353.0
            2011    335145.0
Swansea     2001    270506.0
            2011    239023.0
Douglas     2011     27938.0
Belfast     2001    276459.0
            2011    333871.0
Glasgow     2001    577869.0
            2011    593200.0
Edinburgh   2001    448624.0
            2011    476600.0
Dublin      2011    554550.0
dtype: float64

Selecting only one part of the index provides a single indexed data frame, which may contain more than one value.

In [5]:
pop_year['Cardiff']
Out[5]:
2001    305353.0
2011    335145.0
dtype: float64

To get a single value we select a combined index value. Note that the tuple syntax is not necessary for Series but may be needed for data frames.

In [6]:
pop_year[('Cardiff', 2011)]
Out[6]:
335145.0

The slicing operators allow us to select parts of the index. For example, all places that do have data for $2001$.

In [7]:
pop_year[:, 2001]
Out[7]:
Liverpool     435500.0
Manchester    405300.0
Cardiff       305353.0
Swansea       270506.0
Belfast       276459.0
Glasgow       577869.0
Edinburgh     448624.0
dtype: float64

By unstacking we get back the data frame.

In [8]:
pop_year.unstack()
Out[8]:
2001 2011
Liverpool 435500.0 466400.0
Manchester 405300.0 503127.0
Cardiff 305353.0 335145.0
Swansea 270506.0 239023.0
Douglas NaN 27938.0
Belfast 276459.0 333871.0
Glasgow 577869.0 593200.0
Edinburgh 448624.0 476600.0
Dublin NaN 554550.0

British Isles

pd-british-isles.svg

Indexes on Data Frames

If rows and columns are indexed in the same way we can exchange the row (index) labels with column labels and reposition data appropriately. Moreover, if several values are used to index a row or column we can exchange only some of the values between rows and columns, or vice-versa. That sounds horribly complicated but it is actually a common task within databases, notably data warehouses. The operation of changing labels between rows and columns whilst reordering the data accordingly is called pivoting or crosstabbing in database jargon. Database software extensions often provide pivot or crosstab operations.

Before we attempt pivoting let's try to move the index into the data frame itself.

In [9]:
pop.index.name = 'city'
pop_full = pop.reset_index()
pop_full
Out[9]:
city 2001 2011
0 Liverpool 435500.0 466400
1 Manchester 405300.0 503127
2 Cardiff 305353.0 335145
3 Swansea 270506.0 239023
4 Douglas NaN 27938
5 Belfast 276459.0 333871
6 Glasgow 577869.0 593200
7 Edinburgh 448624.0 476600
8 Dublin NaN 554550

We moved the index into a column, good. But since the columns are an index too we can move them into the data frame too.

In [10]:
pop_melt = pop_full.melt(id_vars=['city'], var_name='year')
pop_melt
Out[10]:
city year value
0 Liverpool 2001 435500.0
1 Manchester 2001 405300.0
2 Cardiff 2001 305353.0
3 Swansea 2001 270506.0
4 Douglas 2001 NaN
5 Belfast 2001 276459.0
6 Glasgow 2001 577869.0
7 Edinburgh 2001 448624.0
8 Dublin 2001 NaN
9 Liverpool 2011 466400.0
10 Manchester 2011 503127.0
11 Cardiff 2011 335145.0
12 Swansea 2011 239023.0
13 Douglas 2011 27938.0
14 Belfast 2011 333871.0
15 Glasgow 2011 593200.0
16 Edinburgh 2011 476600.0
17 Dublin 2011 554550.0

Melting a data frame produces spread data, i.e. we see the same data as before but instead of looking through a row and column we look at a combination of columns in a row to understand what the "value" means.

The pivot operation is the opposite of melting. We build meaningful columns from the data in the rows.

In [11]:
pop_full = pop_melt.pivot(index='city', columns='year', values='value')
pop_full
Out[11]:
year 2001 2011
city
Belfast 276459.0 333871.0
Cardiff 305353.0 335145.0
Douglas NaN 27938.0
Dublin NaN 554550.0
Edinburgh 448624.0 476600.0
Glasgow 577869.0 593200.0
Liverpool 435500.0 466400.0
Manchester 405300.0 503127.0
Swansea 270506.0 239023.0

Pivoting can be powerful, not only it can build new columns but it can aggregate the resulting values. In pandas the pivot_table method accept aggregations. We can output a mean and keep the number of values from which the mean was taken from.

In [12]:
pop_agg = pop_melt[['city', 'value']].pivot_table(
    index='city', aggfunc=[np.mean, lambda x: np.sum(~np.isnan(x)), np.max, np.min])
pop_agg
Out[12]:
mean <lambda> amax amin
value value value value
city
Belfast 305165.0 2.0 333871.0 276459.0
Cardiff 320249.0 2.0 335145.0 305353.0
Douglas 27938.0 1.0 27938.0 27938.0
Dublin 554550.0 1.0 554550.0 554550.0
Edinburgh 462612.0 2.0 476600.0 448624.0
Glasgow 585534.5 2.0 593200.0 577869.0
Liverpool 450950.0 2.0 466400.0 435500.0
Manchester 454213.5 2.0 503127.0 405300.0
Swansea 254764.5 2.0 270506.0 239023.0

There is a side effect here. Since we may aggregate on more than a single column at once we get a multi-index on the columns. Since we do not need it for this case we name the columns ourselves.

In [13]:
pop_agg.columns = ['mean', 'not null', 'max', 'min']
pop_agg
Out[13]:
mean not null max min
city
Belfast 305165.0 2.0 333871.0 276459.0
Cardiff 320249.0 2.0 335145.0 305353.0
Douglas 27938.0 1.0 27938.0 27938.0
Dublin 554550.0 1.0 554550.0 554550.0
Edinburgh 462612.0 2.0 476600.0 448624.0
Glasgow 585534.5 2.0 593200.0 577869.0
Liverpool 450950.0 2.0 466400.0 435500.0
Manchester 454213.5 2.0 503127.0 405300.0
Swansea 254764.5 2.0 270506.0 239023.0

Dimensions

pandas provide several ways to pivot columns and rows, for example, stacking and unstacking can be performed on data frames. Yet, the most important point of the pivot operation is the fact that we can represent several dimensions in a smaller number of dimensions by labeling data with combinations of values.

Just like we can represent a function of the type $f(x, y) = z$ by either storing a 2-dimensional grid of $x$ and $y$ points mapping to values of $z$; or by building a long 1-dimensional list of points of the form $(x, y)$ and mapping it to the $z$ values.