04.00 Pandas Introduction

Pandas is a wrapper on top of NumPy (and several other libraries, including Matplotlib) that makes up for the shortcomings of vectorized computing when working with real-world data. Instead of working towards efficient numerical computing, it attempts to make working with messy data less annoying. The name pandas comes from the term panel data, which originates in econometrics.

Let's import it, and also import NumPy to see how the two libraries work with each other. The conventional name given to a pandas import is pd.

In [1]:
import numpy as np
import pandas as pd

General ideas behind pandas

Originally built as an enhanced version of R's data.frame, pandas incorporates several well-known APIs into a single structure. The DataFrame includes APIs that make it easy to use from different perspectives. The DataFrame:

  • Is an R data.frame like structure, extended by multi-indexes
  • Has SQL-like joins, without need for external libraries (e.g. sqldf in R)
  • Looks like a spreadsheet (yes, that is intentional)
  • Can move between two and multidimensional representations (stack, unstack)
  • Has aggregations across dimensions with groupby (similar to SQL); joins and groupby are sketched right after this list
  • Can define enhanced data types, with many operations beyond pure numerical computation when compared with NumPy
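
These features are covered in detail in later sections; here is a minimal, self-contained sketch of the SQL-like joins and the groupby aggregations mentioned above (the toy data frames are made up purely for illustration):

left = pd.DataFrame({'country': ['Wales', 'Scotland'],
                     'capital': ['Cardiff', 'Edinburgh']})
right = pd.DataFrame({'country': ['Scotland', 'Wales', 'Wales'],
                      'city': ['Glasgow', 'Swansea', 'Newport']})
merged = pd.merge(left, right, on='country')  # SQL-like inner join, no external library needed
merged.groupby('country')['city'].count()     # aggregation per group, similar to SQL GROUP BY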

You will use pandas (rather than NumPy) for tasks around messy data. pandas is built atop NumPy and uses the contiguous memory and broadcast operations of NumPy arrays to boost its performance. pandas excels at (a short sketch follows the list):

  • Importing data (very resilient compared to numpy.loadtxt)
  • Cleaning up messy data (dropna or fillna)
  • Quickly building insight into data (describe)
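
A short, self-contained sketch of these three points (the CSV text below is invented purely for illustration):

import io

csv_text = "country,area\nWales,20779\nScotland,\nEngland,130279\n"
raw = pd.read_csv(io.StringIO(csv_text))  # resilient import: the empty field simply becomes NaN
clean = raw.dropna()                      # drop the rows that contain missing values
raw.describe()                            # quick numerical summary (count, mean, quartiles, ...)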

Let's use some data about the British Isles and the United Kingdom to demonstrate some of the features:

In [2]:
country = ['Northern Ireland', 'Scotland', 'Wales', 'England', 'Isle of Man', 'Ireland']
area = np.array([14130, 77933, 20779, 130279, 572, 70273])
capital = ['Belfast', 'Edinburgh', 'Cardiff', 'London', 'Douglas', 'Dublin']
population2001 = [1.686e6, 5.064e6,  np.nan, 48.65e6, 77.703e3,  np.nan]
population2011 = [1.811e6, 5.281e6, 3.057e6, 53.01e6, 84.886e3, 4.571e6]
df = pd.DataFrame({'capital': capital,
                   'area': area,
                   'population 2001': population2001,
                   'population 2011': population2011,
                  },
                  index=country)
df
Out[2]:
                    capital    area  population 2001  population 2011
Northern Ireland    Belfast   14130        1686000.0        1811000.0
Scotland          Edinburgh   77933        5064000.0        5281000.0
Wales               Cardiff   20779              NaN        3057000.0
England              London  130279       48650000.0       53010000.0
Isle of Man         Douglas     572          77703.0          84886.0
Ireland              Dublin   70273              NaN        4571000.0

Domesday Book

After the defeat of Harold and the conquest of England in 1066, William (the Conqueror) commissioned a survey of the lands under his control. The purpose of the survey was mostly to assess what taxes could be levied on the conquered English. The survey was completed in 1086 under the name "Liber de Wintonia", but a more fitting name for the book among the English was the Domesday Book, by analogy with the Last Judgement.

Series

The main feature of pandas is its DataFrame, but that is just a collection of Series data structures. A Series is pretty similar to a NumPy array: it is a sequence of values of the same data type. The difference is that the Series adds labels (an index) to the data.
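
For instance, selecting a single column of the df built above gives back a Series (a quick sketch, not one of the numbered cells):

capital_series = df['capital']  # a single DataFrame column is a Series
print(type(capital_series))     # <class 'pandas.core.series.Series'>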

In [3]:
series_area = pd.Series(area)
series_area
Out[3]:
0     14130
1     77933
2     20779
3    130279
4       572
5     70273
dtype: int64

Above, the index is just the offset from the beginning of the series, as in a NumPy array. But with pandas we can give names to the index.

In [4]:
series_area = pd.Series(area, index=country)
series_area
Out[4]:
Northern Ireland     14130
Scotland             77933
Wales                20779
England             130279
Isle of Man            572
Ireland              70273
dtype: int64

Selection from the index

Selecting from a Series works both like a list and like a dictionary. You can say that Series.index maps keys onto Series.values.

In [5]:
series_area.values, series_area.values.dtype, series_area.index
Out[5]:
(array([ 14130,  77933,  20779, 130279,    572,  70273]),
 dtype('int64'),
 Index(['Northern Ireland', 'Scotland', 'Wales', 'England', 'Isle of Man',
        'Ireland'],
       dtype='object'))

All of the following three forms of indexing produce the same record.

In [6]:
series_area['Wales'], series_area[2], series_area.values[2]
Out[6]:
(20779, 20779, 20779)

Slicing works too.

In [7]:
series_area[0:3]
Out[7]:
Northern Ireland    14130
Scotland            77933
Wales               20779
dtype: int64

And so does fancy indexing.

In [8]:
series_area[['Wales', 'Scotland']]
Out[8]:
Wales       20779
Scotland    77933
dtype: int64

Sorted and unsorted indexes

Slicing works on indexes (the labels of the Series) but it is only likely to produce meaningful results if the index is sorted.

Note: In older versions of pandas, slicing over an unsorted index produced an error; this still happens over a multi-index (outlined in a later section). Since we did not care about order when constructing the data frame, our index is unsorted, and slicing it will produce strange results.

In [9]:
series_area['England':'Scotland']
Out[9]:
Series([], dtype: int64)

If we sort the index, the alphabetical order (or actually ASCIIbetical order) of the labels can be used for slicing.

In [10]:
sorted_area = series_area.sort_index()
sorted_area['England':'Scotland']
Out[10]:
England             130279
Ireland              70273
Isle of Man            572
Northern Ireland     14130
Scotland             77933
dtype: int64

Implicit indexes

If you do not define an index you can still select and slice series items. This is because, apart from the explicit index, an implicit, positional index is created. In other words, every pandas series has two indexes: the implicit and the explicit index.

In [11]:
series_area = pd.Series(area)
series_area[0:3]
Out[11]:
0    14130
1    77933
2    20779
dtype: int64

Moreover, when the explicit index is non-numeric, plain integer access falls back to the implicit index. Here is a series with a sorted index.

In [12]:
series_area = pd.Series(area, index=country).sort_index()
series_area
Out[12]:
England             130279
Ireland              70273
Isle of Man            572
Northern Ireland     14130
Scotland             77933
Wales                20779
dtype: int64

Most of the time both indexes work in the same fashion, but slicing is inconsistent between them: the explicit index includes the last slice element (unlike Python list slicing).

In [13]:
series_area['England':'Northern Ireland']
Out[13]:
England             130279
Ireland              70273
Isle of Man            572
Northern Ireland     14130
dtype: int64

But the implicit index works in the same way as Python slicing: it excludes the last slice element.

In [14]:
series_area[0:3]
Out[14]:
England        130279
Ireland         70273
Isle of Man       572
dtype: int64

This can give us a headache with numerical indexes; therefore pandas allows us to choose which index to select from:

  • loc always refers to the explicit index
  • iloc always refers to the implicit index

To allow for $1$-based indexing instead of $0$-based indexing, one may be tempted to set the index to $1$-based numerical labels. This can become very confusing very fast because such a numerical index is explicit and follows the explicit index rules for slicing, whilst the implicit index remains $0$-based.

Do not do this unless you have very good reasons.

In [15]:
series_area = pd.Series(area)
series_area.index = range(1, len(area)+1)
series_area
Out[15]:
1     14130
2     77933
3     20779
4    130279
5       572
6     70273
dtype: int64

Note that one can set the index by simply assigning to it.

Nevertheless, with a $1$-based index the differences between explicit and implicit selection are apparent, if not puzzling.

In [16]:
series_area[1], series_area.loc[1], series_area.iloc[1]
Out[16]:
(14130, 14130, 77933)

Selection through the implicit index did still use $0$-based indexing. But selection without specifying the index used the explicit one.

Yet, when slicing the situation is different.

In [17]:
list(series_area[1:3]), list(series_area.loc[1:3]), list(series_area.iloc[1:3])
Out[17]:
([77933, 20779], [14130, 77933, 20779], [77933, 20779])

By default, numeric slices use the implicit index and implicit index rules.

But there's more! If one does not define an index at all, slicing with .loc accesses the implicit index but it uses the explicit index rules of slicing.

In [18]:
series_area = pd.Series(area)
series_area
Out[18]:
0     14130
1     77933
2     20779
3    130279
4       572
5     70273
dtype: int64

Since there is just a single index, selection is consistent.

In [19]:
series_area[1], series_area.loc[1], series_area.iloc[1]
Out[19]:
(77933, 77933, 77933)

But slicing can be quite confusing. Here .loc uses the explicit index rules (it includes the last slice element) whilst it accesses the implicit index, because there is no explicit index.

In [20]:
list(series_area[1:3]), list(series_area.loc[1:3]), list(series_area.iloc[1:3])
Out[20]:
([77933, 20779], [77933, 20779, 130279], [77933, 20779])

Always cross-check slicing operations and use .loc or .iloc explicitly. The same rules apply to data frames (seen in a moment).
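
Data frames are covered in detail later; as a brief sketch with the df built earlier, the same distinction applies to selecting rows:

df.loc['Wales']   # row selected through the explicit (label) index
df.iloc[2]        # the same row selected through the implicit (positional) index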

A Series works like a NumPy array

The NumPy vectorized operations, selection and broadcasting work as if we were working on an array.

In [21]:
series_area = pd.Series(area, index=country)
series_area[series_area > 20000]
Out[21]:
Scotland     77933
Wales        20779
England     130279
Ireland      70273
dtype: int64

Let's compute the area in square miles instead of square kilometers.

$$ 0.386 \approx \frac{1}{1.61^2} $$
In [22]:
series_area * 0.386
Out[22]:
Northern Ireland     5454.180
Scotland            30082.138
Wales                8020.694
England             50287.694
Isle of Man           220.792
Ireland             27125.378
dtype: float64

And the total of the British Isles area in square miles.

In [23]:
(series_area * 0.386).sum()
Out[23]:
121190.876

A Series is more than a NumPy array

The Series aligns the indexes when performing operations.

In order to see that, let's have a look at a series with missing values. The UK had a census in 2001, but the parts of the British Isles outside of the UK have no data since they did not participate.

In [24]:
p2001 = pd.Series(population2001, index=country)
p2001
Out[24]:
Northern Ireland     1686000.0
Scotland             5064000.0
Wales                      NaN
England             48650000.0
Isle of Man            77703.0
Ireland                    NaN
dtype: float64

The series records the fact that there is missing data. Missing data may be represented in several ways. Here we use NaN (not a number), which is a standard value for unknowns among floating point values. Another option is to use the Python None value.

Since missing values can be represented by different values, instead of comparing against them pandas provides us with an isnull procedure that will catch the common ways of representing missing data. Another name for isnull is isna; you may see either of the two procedures used to check for nulls.

In [25]:
p2001.isnull()
Out[25]:
Northern Ireland    False
Scotland            False
Wales                True
England             False
Isle of Man         False
Ireland              True
dtype: bool
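
Instead of dropping missing data one can also fill it in; here is a minimal sketch using fillna (the fill value of zero is an arbitrary choice), together with a check that None is recognised as missing too:

p2001.fillna(0)                  # replace the missing values with an explicit value
pd.Series([1.0, None]).isnull()  # None is converted to NaN and detected as missing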

For the year 2011 we have all population data.

In [26]:
p2011 = pd.Series(population2011, index=country)
p2011
Out[26]:
Northern Ireland     1811000.0
Scotland             5281000.0
Wales                3057000.0
England             53010000.0
Isle of Man            84886.0
Ireland              4571000.0
dtype: float64

What if we would like to know the population growth between 2001 and 2011?

We could manually filter for the values we have in both years and compute the growth using those values alone. Yet, if we use pandas, it will by default perform the computation between the correct values.
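
For contrast, a manual version of that filtering could look like the sketch below, keeping only the countries that have data in both years:

common = p2001.dropna().index.intersection(p2011.dropna().index)
p2011[common] - p2001[common]   # growth computed only over the shared, non-missing labels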

In [27]:
p2011 - p2001
Out[27]:
Northern Ireland     125000.0
Scotland             217000.0
Wales                     NaN
England             4360000.0
Isle of Man            7183.0
Ireland                   NaN
dtype: float64

But what if we did not have the NaN values in the correct places?

We can drop the missing data using the dropna procedure and see.

In [28]:
p2001clean = p2001.dropna()
p2001clean
Out[28]:
Northern Ireland     1686000.0
Scotland             5064000.0
England             48650000.0
Isle of Man            77703.0
dtype: float64

The new series for the year $2001$ has only $4$ values, whilst the series for $2011$ has $6$ values for population.

Still, pandas aligns the indexes and allows us to operate between the two series.

In [29]:
p2011 - p2001clean
Out[29]:
England             4360000.0
Ireland                   NaN
Isle of Man            7183.0
Northern Ireland     125000.0
Scotland             217000.0
Wales                     NaN
dtype: float64

When we perform the operation the indexes are matched; wherever a value cannot be found on one side (i.e. the operation involves a missing value), pandas automatically inserts NaN as the result.
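
If we would rather treat the missing $2001$ values as zero instead of propagating NaN, the arithmetic methods accept a fill_value argument; a minimal sketch:

p2011.sub(p2001clean, fill_value=0)  # missing 2001 entries are treated as 0 before subtracting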