Indexes in pandas
do much more than enumerating the rows of a series or data frame.
An index can hold a list of values as the index of a certain element, in other words
we can have a combination of values as the index, a multi-index. Column names are also
an index and can be multi-valued as well.
import numpy as np
import pandas as pd
Let's pick some of the numeric columns from our British Isles data frame and stack them together into a multiple index.
city = ['Liverpool', 'Manchester', 'Cardiff',
'Swansea', 'Douglas', 'Belfast',
'Glasgow', 'Edinburgh', 'Dublin']
country = ['England', 'England', 'Wales',
'Wales', 'Isle of Man',
'Northern Ireland', 'Scotland',
'Scotland', 'Ireland']
county = ['Merseyside', 'Greater Manchester', 'South Glamorgan',
'West Glamorgan', None, 'County Antrim and County Down',
None, None, None]
population2001 = [435500, 405300, 305353,
270506, np.nan, 276459,
577869, 448624, np.nan]
population2011 = [466400, 503127, 335145,
239023, 27938, 333871,
593200, 476600, 554550]
df = pd.DataFrame({'county': county,
'country': country,
'population 2001': population2001,
'population 2011': population2011,
},
index=city)
df
We choose the population columns because these are more interesting.
For one things they contain missing data, and we can see how pandas
will deal with that extra difficulty.
pop = df[['population 2001', 'population 2011']]
pop.columns = [2001, 2011]
pop
We also renamed the columns to $2001$ and $2011$ to make examples shorter.
Now we stack, and we get a Series
with a double valued index.
pop_year = pop.stack()
pop_year
Selecting only one part of the index provides a single indexed data frame, which may contain more than one value.
pop_year['Cardiff']
To get a single value we select a combined index value.
Note that the tuple syntax is not necessary for Series
but may be needed for data frames.
pop_year[('Cardiff', 2011)]
The slicing operators allow us to select parts of the index. For example, all places that do have data for $2001$.
pop_year[:, 2001]
By unstacking we get back the data frame.
pop_year.unstack()
If rows and columns are indexed in the same way we can exchange the row (index) labels with column labels and reposition data appropriately. Moreover, if several values are used to index a row or column we can exchange only some of the values between rows and columns, or vice-versa. That sounds horribly complicated but it is actually a common task within databases, notably data warehouses. The operation of changing labels between rows and columns whilst reordering the data accordingly is called pivoting or crosstabbing in database jargon. Database software extensions often provide pivot or crosstab operations.
Before we attempt pivoting let's try to move the index into the data frame itself.
pop.index.name = 'city'
pop_full = pop.reset_index()
pop_full
We moved the index into a column, good. But since the columns are an index too we can move them into the data frame too.
pop_melt = pop_full.melt(id_vars=['city'], var_name='year')
pop_melt
Melting a data frame produces spread data, i.e. we see the same data as before but instead of looking through a row and column we look at a combination of columns in a row to understand what the "value" means.
The pivot operation is the opposite of melting. We build meaningful columns from the data in the rows.
pop_full = pop_melt.pivot(index='city', columns='year', values='value')
pop_full
Pivoting can be powerful,
not only it can build new columns but it can aggregate the resulting values.
In pandas
the pivot_table
method accept aggregations.
We can output a mean and keep the number of values from which the mean was taken from.
pop_agg = pop_melt[['city', 'value']].pivot_table(
index='city', aggfunc=[np.mean, lambda x: np.sum(~np.isnan(x)), np.max, np.min])
pop_agg
There is a side effect here. Since we may aggregate on more than a single column at once we get a multi-index on the columns. Since we do not need it for this case we name the columns ourselves.
pop_agg.columns = ['mean', 'not null', 'max', 'min']
pop_agg
pandas
provide several ways to pivot columns and rows,
for example, stacking and unstacking can be performed on data frames.
Yet, the most important point of the pivot operation is the fact that we can represent several
dimensions in a smaller number of dimensions by labeling data with combinations of values.
Just like we can represent a function of the type $f(x, y) = z$ by either storing a 2-dimensional grid of $x$ and $y$ points mapping to values of $z$; or by building a long 1-dimensional list of points of the form $(x, y)$ and mapping it to the $z$ values.