04.03 Pandas Indexes as Dimensions¶

Indexes in pandas do much more than enumerating the rows of a series or data frame. An index can hold a list of values as the index of a certain element, in other words we can have a combination of values as the index, a multi-index. Column names are also an index and can be multi-valued as well.

import numpy as np
import pandas as pd

Let's pick some of the numeric columns from our British Isles data frame and stack them together into a multiple index.

city = ['Liverpool', 'Manchester', 'Cardiff',
        'Swansea', 'Douglas', 'Belfast',
        'Glasgow', 'Edinburgh', 'Dublin']
country = ['England', 'England', 'Wales',
           'Wales', 'Isle of Man',
           'Northern Ireland', 'Scotland',
           'Scotland', 'Ireland']
county = ['Merseyside', 'Greater Manchester', 'South Glamorgan',
          'West Glamorgan', None,  'County Antrim and County Down',
          None, None, None]
population2001 = [435500, 405300, 305353,
                  270506, np.nan, 276459,
                  577869, 448624, np.nan]
population2011 = [466400, 503127, 335145,
                  239023,  27938, 333871,
                  593200, 476600, 554550]
df = pd.DataFrame({'county': county,
                   'country': country,
                   'population 2001': population2001,
                   'population 2011': population2011,
                  },
                  index=city)
df

We choose the population columns because these are more interesting. For one things they contain missing data, and we can see how pandas will deal with that extra difficulty.

pop = df[['population 2001', 'population 2011']]
pop.columns = [2001, 2011]
pop

We also renamed the columns to $2001$ and $2011$ to make examples shorter.

Now we stack, and we get a Series with a double valued index.

pop_year = pop.stack()
pop_year

Liverpool   2001    435500.0
            2011    466400.0
Manchester  2001    405300.0
            2011    503127.0
Cardiff     2001    305353.0
            2011    335145.0
Swansea     2001    270506.0
            2011    239023.0
Douglas     2011     27938.0
Belfast     2001    276459.0
            2011    333871.0
Glasgow     2001    577869.0
            2011    593200.0
Edinburgh   2001    448624.0
            2011    476600.0
Dublin      2011    554550.0
dtype: float64

Selecting only one part of the index provides a single indexed data frame, which may contain more than one value.

pop_year['Cardiff']

2001    305353.0
2011    335145.0
dtype: float64

To get a single value we select a combined index value. Note that the tuple syntax is not necessary for Series but may be needed for data frames.

pop_year[('Cardiff', 2011)]

335145.0

The slicing operators allow us to select parts of the index. For example, all places that do have data for $2001$.

pop_year[:, 2001]

Liverpool     435500.0
Manchester    405300.0
Cardiff       305353.0
Swansea       270506.0
Belfast       276459.0
Glasgow       577869.0
Edinburgh     448624.0
dtype: float64

By unstacking we get back the data frame.

pop_year.unstack()

British Isles

^{pd-british-isles.svg}

Indexes on Data Frames¶

If rows and columns are indexed in the same way we can exchange the row (index) labels with column labels and reposition data appropriately. Moreover, if several values are used to index a row or column we can exchange only some of the values between rows and columns, or vice-versa. That sounds horribly complicated but it is actually a common task within databases, notably data warehouses. The operation of changing labels between rows and columns whilst reordering the data accordingly is called pivoting or crosstabbing in database jargon. Database software extensions often provide pivot or crosstab operations.

Before we attempt pivoting let's try to move the index into the data frame itself.

pop.index.name = 'city'
pop_full = pop.reset_index()
pop_full

We moved the index into a column, good. But since the columns are an index too we can move them into the data frame too.

pop_melt = pop_full.melt(id_vars=['city'], var_name='year')
pop_melt

Melting a data frame produces spread data, i.e. we see the same data as before but instead of looking through a row and column we look at a combination of columns in a row to understand what the "value" means.

The pivot operation is the opposite of melting. We build meaningful columns from the data in the rows.

pop_full = pop_melt.pivot(index='city', columns='year', values='value')
pop_full

Pivoting can be powerful, not only it can build new columns but it can aggregate the resulting values. In pandas the pivot_table method accept aggregations. We can output a mean and keep the number of values from which the mean was taken from.

pop_agg = pop_melt[['city', 'value']].pivot_table(
    index='city', aggfunc=[np.mean, lambda x: np.sum(~np.isnan(x)), np.max, np.min])
pop_agg

There is a side effect here. Since we may aggregate on more than a single column at once we get a multi-index on the columns. Since we do not need it for this case we name the columns ourselves.

pop_agg.columns = ['mean', 'not null', 'max', 'min']
pop_agg

Dimensions¶

pandas provide several ways to pivot columns and rows, for example, stacking and unstacking can be performed on data frames. Yet, the most important point of the pivot operation is the fact that we can represent several dimensions in a smaller number of dimensions by labeling data with combinations of values.

Just like we can represent a function of the type $f(x, y) = z$ by either storing a 2-dimensional grid of $x$ and $y$ points mapping to values of $z$; or by building a long 1-dimensional list of points of the form $(x, y)$ and mapping it to the $z$ values.

	county	country	population 2001	population 2011
Liverpool	Merseyside	England	435500.0	466400
Manchester	Greater Manchester	England	405300.0	503127
Cardiff	South Glamorgan	Wales	305353.0	335145
Swansea	West Glamorgan	Wales	270506.0	239023
Douglas	None	Isle of Man	NaN	27938
Belfast	County Antrim and County Down	Northern Ireland	276459.0	333871
Glasgow	None	Scotland	577869.0	593200
Edinburgh	None	Scotland	448624.0	476600
Dublin	None	Ireland	NaN	554550

	2001	2011
Liverpool	435500.0	466400.0
Manchester	405300.0	503127.0
Cardiff	305353.0	335145.0
Swansea	270506.0	239023.0
Douglas	NaN	27938.0
Belfast	276459.0	333871.0
Glasgow	577869.0	593200.0
Edinburgh	448624.0	476600.0
Dublin	NaN	554550.0

	city	year	value
0	Liverpool	2001	435500.0
1	Manchester	2001	405300.0
2	Cardiff	2001	305353.0
3	Swansea	2001	270506.0
4	Douglas	2001	NaN
5	Belfast	2001	276459.0
6	Glasgow	2001	577869.0
7	Edinburgh	2001	448624.0
8	Dublin	2001	NaN
9	Liverpool	2011	466400.0
10	Manchester	2011	503127.0
11	Cardiff	2011	335145.0
12	Swansea	2011	239023.0
13	Douglas	2011	27938.0
14	Belfast	2011	333871.0
15	Glasgow	2011	593200.0
16	Edinburgh	2011	476600.0
17	Dublin	2011	554550.0

year	2001	2011
city
Belfast	276459.0	333871.0
Cardiff	305353.0	335145.0
Douglas	NaN	27938.0
Dublin	NaN	554550.0
Edinburgh	448624.0	476600.0
Glasgow	577869.0	593200.0
Liverpool	435500.0	466400.0
Manchester	405300.0	503127.0
Swansea	270506.0	239023.0

	mean	<lambda>	amax	amin
	value	value	value	value
city
Belfast	305165.0	2.0	333871.0	276459.0
Cardiff	320249.0	2.0	335145.0	305353.0
Douglas	27938.0	1.0	27938.0	27938.0
Dublin	554550.0	1.0	554550.0	554550.0
Edinburgh	462612.0	2.0	476600.0	448624.0
Glasgow	585534.5	2.0	593200.0	577869.0
Liverpool	450950.0	2.0	466400.0	435500.0
Manchester	454213.5	2.0	503127.0	405300.0
Swansea	254764.5	2.0	270506.0	239023.0