Pandas is a wrapper on top of NumPy (and several other libraries, including Matplotlib) that makes up for the shortcomings of vectorized computing when working with real-world data.
Instead of working towards efficient numerical computing, it attempts to make working with messy data less annoying.
The name pandas comes from the term "panel data", which originates in econometrics.
Let's import it, and also let's import NumPy to see how both libraries work with each other.
The common alias given to a pandas import is pd.
import numpy as np
import pandas as pd
Originally built as an enhanced version of R's data.frame, pandas incorporates several known APIs into a single structure.
The DataFrame includes APIs that make it easy to use from different perspectives.
The DataFrame offers:
- a data.frame-like structure, extended by multi-indexes
- SQL-style queries (similar to sqldf in R)
- reshaping (stack, unstack)
- groupby (similar to SQL)
- array-style operations, as in numpy

You will use pandas (rather than NumPy) for tasks around messy data.
pandas is built atop NumPy, and uses the contiguous memory and broadcast operations of NumPy arrays to boost its performance. pandas excels at:
- reading data (compare numpy.loadtxt)
- handling missing data (dropna or fillna)
- summarizing data (describe)

Let's use some data about the British Isles and the United Kingdom to demonstrate some of the features:
country = ['Northern Ireland', 'Scotland', 'Wales', 'England', 'Isle of Man', 'Ireland']
area = np.array([14130, 77933, 20779, 130279, 572, 70273])
capital = ['Belfast', 'Edinburgh', 'Cardiff', 'London', 'Douglas', 'Dublin']
population2001 = [1.686e6, 5.064e6, np.nan, 48.65e6, 77.703e3, np.nan]
population2011 = [1.811e6, 5.281e6, 3.057e6, 53.01e6, 84.886e3, 4.571e6]
df = pd.DataFrame({'capital': capital,
'area': area,
'population 2001': population2001,
'population 2011': population2011,
},
index=country)
df
The main feature of pandas is its DataFrame, but that is just a collection of Series data structures.
A Series is similar to a NumPy array: it is a one-dimensional sequence of values that share a data type.
The difference is that the Series adds labels (an index) to the data.
series_area = pd.Series(area)
series_area
Above, the index is just the offset from the beginning of the series, as in a NumPy array.
But with pandas we can give names to the index.
series_area = pd.Series(area, index=country)
series_area
Selecting from a Series works both like a list and like a dictionary.
You can say that Series.index maps keys onto Series.values.
series_area.values, series_area.values.dtype, series_area.index
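The dictionary analogy can be pushed a little further. The following is a minimal sketch, using three of the area figures from above; the variable names are illustrative only.

```python
import pandas as pd

# Three of the area figures (km^2) from the text, keyed by country name.
area_by_country = {
    'Northern Ireland': 14130,
    'Scotland': 77933,
    'Wales': 20779,
}

# Building a Series from a dict uses the dict keys as the index.
s = pd.Series(area_by_country)

# Membership tests look at the index (the "keys"), not at the values.
has_wales = 'Wales' in s
has_value = 20779 in s

# keys() returns the index; items() yields (label, value) pairs.
labels = list(s.keys())
pairs = dict(s.items())

print(has_wales, has_value)
print(labels)
```

As with a dictionary, `in` checks the labels, so looking for a value requires going through `s.values` instead.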
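All three of the following forms of indexing produce the same record.
(Recent pandas releases deprecate integer keys as positions in plain [] on a labelled Series; series_area.iloc[2] is the unambiguous form.)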
series_area['Wales'], series_area[2], series_area.values[2]
Slicing works too.
series_area[0:3]
And so does fancy indexing.
series_area[['Wales', 'Scotland']]
Slicing works on indexes (the labels of the Series), but it is only likely to produce meaningful results if the index is sorted.
Note: in older versions of pandas, slicing over an unsorted index produced an error; this still happens over a multi-index (outlined in a later section).
Since we did not care about the order when constructing the data frame, our index is unsorted, and slicing it will produce strange results.
series_area['England':'Scotland']
If we sort the index, the alphabetical order (or actually ASCIIbetical order) of the labels can be used for slicing.
sorted_area = series_area.sort_index()
sorted_area['England':'Scotland']
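One way to guard against slicing an unsorted index is to check the index before slicing. The sketch below assumes the same country and area data as above and uses the index attribute is_monotonic_increasing; treat it as an illustration rather than a required step.

```python
import pandas as pd

country = ['Northern Ireland', 'Scotland', 'Wales', 'England', 'Isle of Man', 'Ireland']
area = [14130, 77933, 20779, 130279, 572, 70273]
s = pd.Series(area, index=country)

# is_monotonic_increasing tells us whether the labels are already in sorted order.
unsorted_ok = s.index.is_monotonic_increasing

sorted_s = s.sort_index()
sorted_ok = sorted_s.index.is_monotonic_increasing

# On the sorted index a label slice is well defined and includes both endpoints.
picked = list(sorted_s.loc['England':'Scotland'].index)

print(unsorted_ok, sorted_ok)
print(picked)
```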
If you do not define an index you can still select and slice series items.
This is because, apart from the normal index, an implicit, positional index is created.
In other words, every pandas series has two indexes: the implicit and the explicit index.
series_area = pd.Series(area)
series_area[0:3]
Moreover, when the explicit index is non-numeric, integer keys fall through to the implicit index. Here is a series with a sorted index.
series_area = pd.Series(area, index=country).sort_index()
series_area
Most of the time both indexes work in the same fashion, but slicing is inconsistent between them: the explicit index includes the last slice element (unlike Python list slicing).
series_area['England':'Northern Ireland']
But the implicit index works in the same way as Python slicing: it excludes the last slice element.
series_area[0:3]
This can give us a headache with numerical indexes, therefore pandas allows us to choose which index to select from:
- loc always refers to the explicit index
- iloc always refers to the implicit index

To allow for $1$-based indexing instead of $0$-based indexing one may be tempted to set the index as $1$-based numerical indexes. This can become very confusing very fast because the numerical index is explicit and follows the explicit index rules for slicing. Also, the implicit index remains $0$-based.
Do not do this unless you have very good reasons.
series_area = pd.Series(area)
series_area.index = range(1, len(area)+1)
series_area
Note that one can set the index by simply assigning to it.
With a $1$-based index, the selection differences between explicit and implicit indexes are apparent, if not puzzling.
series_area[1], series_area.loc[1], series_area.iloc[1]
Selection through the implicit index still used $0$-based indexing, but selection without specifying the index used the explicit one.
When slicing, the situation is different.
list(series_area[1:3]), list(series_area.loc[1:3]), list(series_area.iloc[1:3])
By default, numeric slices use the implicit index and implicit index rules.
But there's more!
If one does not define an index at all, slicing with .loc accesses the implicit index but uses the explicit index rules of slicing.
series_area = pd.Series(area)
series_area
Since there is just a single index, selection is consistent.
series_area[1], series_area.loc[1], series_area.iloc[1]
But slicing can be quite confusing.
Here .loc uses the explicit index rules (include the last slice element) whilst it accesses the implicit index, because there is no explicit index.
list(series_area[1:3]), list(series_area.loc[1:3]), list(series_area.iloc[1:3])
Always cross-check slicing operations and use .loc or .iloc explicitly.
The same rules apply to data frames (seen in a moment).
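To summarise the difference in one place, here is a small sketch using the $1$-based index from the discouraged example. The variable names by_label and by_position are illustrative only.

```python
import pandas as pd

area = [14130, 77933, 20779, 130279, 572, 70273]

# Explicit 1-based integer index, as in the discouraged example.
s = pd.Series(area, index=range(1, len(area) + 1))

# .loc slices by label and includes the end label: labels 1, 2 and 3.
by_label = list(s.loc[1:3])

# .iloc slices by position and excludes the end: positions 1 and 2.
by_position = list(s.iloc[1:3])

print(by_label)
print(by_position)
```

Spelling out .loc or .iloc makes the intent visible to the reader, whichever kind of index the series happens to carry.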
Series works like a NumPy array

The NumPy vectorized operations, selection and broadcasting work as if we were working on an array.
series_area = pd.Series(area, index=country)
series_area[series_area > 20000]
Let's compute the area in square miles instead of square kilometers.
$$ 0.386 \approx \frac{1}{1.61^2} $$
series_area * 0.386
And the total of the British Isles area in square miles.
(series_area * 0.386).sum()
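NumPy universal functions can also be applied to a Series directly, and the labels are kept. The sketch below uses three of the areas from above; the constant KM2_TO_MI2 is a name introduced here for illustration, built from the definition of the international mile (1.609344 km).

```python
import numpy as np
import pandas as pd

s = pd.Series([14130, 77933, 20779],
              index=['Northern Ireland', 'Scotland', 'Wales'])

# One international mile is exactly 1.609344 km, so this is roughly 0.386.
KM2_TO_MI2 = 1 / 1.609344 ** 2

# Broadcasting a scalar over the Series, just as with a NumPy array.
sq_miles = s * KM2_TO_MI2

# A NumPy ufunc returns a Series with the same index, not a bare array.
log_area = np.log10(s)

print(sq_miles)
print(log_area)
```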
Series is more than a NumPy array

The Series aligns the indexes when performing operations.
To see that, let's have a look at an array with missing values. The UK had a census in 2001, but parts of the British Isles outside the UK have no data since they did not participate.
p2001 = pd.Series(population2001, index=country)
p2001
The index records the fact that there is missing data.
Missing data may be represented in several ways.
Here we use NaN (not a number), which is a standard value for unknowns in floating point values.
Another option is to use the Python None value.
Since missing values can be represented by different values, instead of comparing against them pandas provides us with an isnull method that catches the common ways of representing missing data.
Another name for isnull is isna; you may see either of the two used to check for nulls.
p2001.isnull()
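A short sketch of why isnull is preferable to a direct comparison. The series mixed is made up for illustration and uses object dtype so that None is stored as None rather than converted to NaN.

```python
import numpy as np
import pandas as pd

# object dtype keeps None as None instead of converting it to NaN.
mixed = pd.Series([1.5, np.nan, None, 'text'], dtype=object)

# isnull flags both NaN and None as missing.
mask = mixed.isnull()

# isna is the same check under another name.
same_as_isna = mixed.isna().equals(mask)
n_missing = int(mask.sum())

# A direct comparison cannot find NaN, because NaN is not equal to itself.
nan_equals_itself = np.nan == np.nan

print(list(mask), n_missing, nan_equals_itself)
```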
For the year 2011 we have all population data.
p2011 = pd.Series(population2011, index=country)
p2011
What if we would like to know the population growth between 2001 and 2011?
We could manually filter for the values we have in both years and compute the growth using those values alone.
Yet, if we use pandas, it will by default perform the computation between the correct values.
p2011 - p2001
But what if we did not have the NaN values in the correct places?
We can drop the missing data using the dropna method and see.
p2001clean = p2001.dropna()
p2001clean
The new series for the year $2001$ has only $4$ values, whilst the series for $2011$ has $6$ values for population.
Still, pandas aligns the indexes and allows us to operate between the two series.
p2011 - p2001clean
When we perform the operation the indexes are matched.
Where a number cannot be found (i.e. one side of the operation is missing or is a NaN), pandas automatically inserts a NaN as the result.
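The alignment can be seen in isolation with two small series whose labels only partly overlap. This is a sketch using a few of the population figures from above; the names a, b, diff and matched are illustrative.

```python
import pandas as pd

# 2011 figures for three countries, 2001 figures for two.
a = pd.Series({'Scotland': 5.281e6, 'Wales': 3.057e6, 'Ireland': 4.571e6})
b = pd.Series({'Scotland': 5.064e6, 'England': 48.65e6})

# The result carries every label found in either series;
# labels present on only one side get NaN.
diff = a - b

# dropna keeps only the labels that were matched on both sides.
matched = diff.dropna()

print(diff)
print(matched)
```

Only Scotland appears in both series, so it is the only label with a numeric result.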