Pandas is a wrapper on top of NumPy (and several other libraries, including Matplotlib)
that makes up for the shortcomings of vectorized computing when working with real-world data.
Rather than aiming purely at efficient
numerical computing, it attempts to make working with messy data less annoying.
The name pandas comes from the term panel data, which originates in econometrics.
Let's import it,
and also let's import NumPy to see how both libraries work with each other.
The common name given to a pandas import is pd.
import numpy as np
import pandas as pd
Originally built as an enhanced version of R's data.frame,
pandas incorporates several known APIs into a single structure.
The DataFrame includes APIs that make it easy to use from different perspectives.
The DataFrame offers:
- a data.frame-like structure, extended by multi-indexes
- SQL-style operations (similar to sqldf in R)
- reshaping (stack, unstack)
- groupby (similar to SQL)
- array operations in the style of numpy
You will use pandas (rather than NumPy) for tasks around messy data.
pandas is built atop NumPy, and uses the contiguous memory and broadcast operations
of NumPy arrays to boost its performance. pandas excels at:
- loading data from files (compare numpy.loadtxt)
- handling missing values (dropna or fillna)
- summarizing data (describe)
Let's use some data about the British Isles and the United Kingdom to demonstrate some of the features:
country = ['Northern Ireland', 'Scotland', 'Wales', 'England', 'Isle of Man', 'Ireland']
area = np.array([14130, 77933, 20779, 130279, 572, 70273])
capital = ['Belfast', 'Edinburgh', 'Cardiff', 'London', 'Douglas', 'Dublin']
population2001 = [1.686e6, 5.064e6, np.nan, 48.65e6, 77.703e3, np.nan]
population2011 = [1.811e6, 5.281e6, 3.057e6, 53.01e6, 84.886e3, 4.571e6]
df = pd.DataFrame({'capital': capital,
                   'area': area,
                   'population 2001': population2001,
                   'population 2011': population2011,
                  },
                  index=country)
df
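The three strengths listed earlier (summaries, dropping and filling missing values) can be tried on a tiny frame. This is a minimal sketch on hypothetical toy data, not the British Isles table; the column names and the fill value of 0.0 are assumptions chosen only for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame with one missing value.
toy = pd.DataFrame(
    {"area": [10.0, 20.0, 30.0], "population": [1.0, np.nan, 3.0]},
    index=["a", "b", "c"],
)

summary = toy.describe()   # count, mean, std, min, quartiles, max per column
dropped = toy.dropna()     # removes every row that holds a missing value
filled = toy.fillna(0.0)   # replaces missing values with 0.0 (illustrative choice)
```

Note that describe counts only the non-missing entries, so the count for the population column is 2 rather than 3.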
The main feature of pandas is its DataFrame but that is just a collection of Series data structures.
A Series is pretty similar to a NumPy array: it holds a sequence of values that share a single data type.
The difference is that the Series adds labels (an index) to the data.
series_area = pd.Series(area)
series_area
Above, the index is just the offset from the beginning of the series,
as in a NumPy array.
But with pandas we can give names to the index.
series_area = pd.Series(area, index=country)
series_area
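The relation between a DataFrame, its Series and the underlying NumPy array can be checked directly. The sketch below uses hypothetical values and labels; it only assumes that selecting one column of a frame yields a Series that shares the frame's index.

```python
import numpy as np
import pandas as pd

values = np.array([10, 20, 30])
labels = ["a", "b", "c"]                     # hypothetical labels

s = pd.Series(values, index=labels)          # NumPy data plus an index
frame = pd.DataFrame({"x": s, "y": s * 2})   # a frame built from two Series
column = frame["x"]                          # selecting a column gives a Series back
raw = s.to_numpy()                           # the data as a plain NumPy array
```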
Selecting from a Series works both like a list and like a dictionary.
You can say that Series.index maps keys onto Series.values.
series_area.values, series_area.values.dtype, series_area.index
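The dictionary analogy can be made concrete. In this small sketch (toy labels and values, chosen for illustration) membership tests and keys() look at the index, not at the values.

```python
import pandas as pd

s = pd.Series([10, 20, 30], index=["a", "b", "c"])

has_label = "b" in s       # membership tests the index, like dictionary keys
has_value = 20 in s        # values are not searched, so this is False
keys = list(s.keys())      # keys() returns the index
as_dict = s.to_dict()      # an actual dictionary of label -> value
pairs = list(s.items())    # (label, value) pairs, in index order
```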
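All of the following three forms of indexing produce the same record.
The positional form is written with .iloc because recent pandas versions deprecate passing a single integer position to plain [] on a Series with a non-integer index.
series_area['Wales'], series_area.iloc[2], series_area.values[2]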
Slicing works too.
series_area[0:3]
And so does fancy indexing.
series_area[['Wales', 'Scotland']]
Slicing works on indexes (the labels of the Series), but it is only likely to produce meaningful results if the index is sorted.
Note: In older versions of pandas, slicing over an unsorted index produced an error;
this still happens over a multi-index (outlined in a later section).
Since we did not care about the order when constructing the data frame, our index is unsorted,
and slicing it will therefore produce strange results.
series_area['England':'Scotland']
If we sort the index, the alphabetical order (or actually ASCIIbetical order) of the labels can be used for slicing.
sorted_area = series_area.sort_index()
sorted_area['England':'Scotland']
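Whether an index is sorted can be checked before slicing by label. The sketch below uses a toy series with hypothetical single-letter labels and the is_monotonic_increasing property of the index.

```python
import pandas as pd

s = pd.Series([1, 2, 3, 4], index=["d", "b", "a", "c"])

was_sorted = s.index.is_monotonic_increasing   # False: labels are out of order
ordered = s.sort_index()                       # labels now run a, b, c, d
window = ordered.loc["b":"c"]                  # label slice, both ends included
```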
If you do not define an index you can still select and slice series items.
This is because, apart from the normal index, an implicit positional index is created.
In other words, every pandas series has two indexes: the implicit index and the explicit index.
series_area = pd.Series(area)
series_area[0:3]
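Moreover, when the explicit index is non-numeric, integer positions passed to plain [] are resolved against the implicit index (for a single integer this fallback is deprecated in recent pandas versions). Here is a series with a sorted index.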
series_area = pd.Series(area, index=country).sort_index()
series_area
Most of the time both indexes work in the same fashion, but slicing is inconsistent between them: slicing on the explicit index includes the last slice element (unlike Python list slicing).
series_area['England':'Northern Ireland']
But the implicit index works in the same way as Python slicing: it excludes the last slice element.
series_area[0:3]
This can give us a headache with numerical indexes,
therefore pandas allows us to choose which index to select from:
- loc always refers to the explicit index
- iloc always refers to the implicit index
To allow for $1$-based indexing instead of $0$-based indexing one may be tempted to set the index as $1$-based numerical indexes. This can become very confusing very fast because the numerical index is explicit and follows the explicit index rules for slicing. Also, the implicit index remains $0$-based.
Do not do this unless you have very good reasons.
series_area = pd.Series(area)
series_area.index = range(1, len(area)+1)
series_area
Note that one can set the index by simply assigning to it.
Nevertheless, with a $1$-based index, the selection differences between the explicit and implicit indexes become apparent, if not puzzling.
series_area[1], series_area.loc[1], series_area.iloc[1]
Selection through the implicit index still used $0$-based indexing, but selection without specifying the index used the explicit one.
Yet, when slicing, the situation is different.
list(series_area[1:3]), list(series_area.loc[1:3]), list(series_area.iloc[1:3])
By default, numeric slices use the implicit index and implicit index rules.
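The difference between the two accessors on a $1$-based index can be pinned down with a toy series. This is a minimal sketch with made-up values; it uses only .loc and .iloc so that the outcome does not depend on how plain [] resolves integers.

```python
import pandas as pd

# Hypothetical values with an explicit 1-based index: labels 1, 2, 3, 4.
s = pd.Series([10, 20, 30, 40], index=range(1, 5))

label_item = s.loc[1]             # label 1 is the first element
pos_item = s.iloc[1]              # position 1 is the second element
label_slice = list(s.loc[1:3])    # labels 1, 2 and 3: end label included
pos_slice = list(s.iloc[1:3])     # positions 1 and 2: end position excluded
```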
But there's more!
If one does not define an index at all,
slicing with .loc accesses the implicit index
but it uses the explicit index rules of slicing.
series_area = pd.Series(area)
series_area
Since there is just a single index, selection is consistent.
series_area[1], series_area.loc[1], series_area.iloc[1]
But slicing can be quite confusing.
Here .loc uses the explicit index rules - include the last slice element - whilst
it accesses the implicit index because there is no explicit index.
list(series_area[1:3]), list(series_area.loc[1:3]), list(series_area.iloc[1:3])
Always cross-check slicing operations and use .loc or .iloc explicitly.
The same rules apply to data frames (seen in a moment).
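The advice above can be illustrated on a series with the default index, where labels and positions coincide. In this sketch (toy values, chosen for illustration) single-item selection agrees between .loc and .iloc, while the slices differ in length.

```python
import pandas as pd

s = pd.Series([10, 20, 30, 40])       # default index: 0, 1, 2, 3

same_item = s.loc[1] == s.iloc[1]     # labels and positions coincide
label_slice = list(s.loc[1:3])        # labels 1, 2 and 3: end included
pos_slice = list(s.iloc[1:3])         # positions 1 and 2: end excluded
```

Writing .loc or .iloc explicitly makes the intended slicing rule visible to the reader of the code.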
Series works like a NumPy array
The NumPy vectorized operations, selection and broadcasting work as if we were working on an array.
series_area = pd.Series(area, index=country)
series_area[series_area > 20000]
Let's compute the area in square miles instead of square kilometers.
$$ 0.386 \approx \frac{1}{1.61^2} $$
series_area * 0.386
And here is the total area of the British Isles in square miles.
(series_area * 0.386).sum()
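The array-like behaviour can be compared against plain NumPy. The sketch below uses hypothetical values; it shows that broadcasting, boolean masks and NumPy functions work on a Series, and that the labels are carried along in the results.

```python
import numpy as np
import pandas as pd

values = np.array([100.0, 250.0, 400.0])
s = pd.Series(values, index=["a", "b", "c"])   # hypothetical labels

scaled = s * 0.5        # the scalar is broadcast over every element
large = s[s > 200]      # boolean mask selection, labels preserved
total = scaled.sum()    # reduction, as with a NumPy array
logs = np.log10(s)      # a NumPy function applied to a Series returns a Series
```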
Series is more than a NumPy array
The Series aligns the indexes when performing operations.
To see this, let's have a look at a series with missing values. The UK held a census in 2001, but parts of the British Isles outside of the UK have no data since they did not participate.
p2001 = pd.Series(population2001, index=country)
p2001
The index still lists every country, so the entries with missing data remain visible.
Missing data may be represented in several ways.
Here we use NaN (not a number), which is a standard value for
unknowns in floating point values.
Another option is to use the Python None value.
Since missing values can be represented by different values,
instead of comparing against them pandas provides us with
an isnull method that will catch common ways of representing missing data.
Another name for isnull is isna; you may see either of the two methods
used to check for nulls.
p2001.isnull()
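A short sketch shows why a dedicated check is needed. It uses a toy series (values chosen for illustration) that mixes NaN and None, and relies on the fact that NaN never compares equal to itself.

```python
import numpy as np
import pandas as pd

s = pd.Series([1.5, np.nan, None, 4.0])   # None is stored as NaN in a float series

nan_equals_nan = np.nan == np.nan         # False: comparison cannot detect NaN
mask = s.isnull()                         # True where the value is missing
same_mask = s.isna()                      # isna is another name for isnull
n_missing = int(mask.sum())               # True counts as 1 when summed
```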
For the year 2011 we have all population data.
p2011 = pd.Series(population2011, index=country)
p2011
What if we would like to know the population growth between 2001 and 2011?
We could manually filter for the values we have in both years and compute
the growth using those values alone.
Yet, if we use pandas, it will perform the computation by default between the correct values.
p2011 - p2001
But what if we did not have the NaN values in the correct places?
We can drop the missing data using the dropna method and see.
p2001clean = p2001.dropna()
p2001clean
The new series for the year $2001$ has only $4$ values, whilst the series for $2011$ has $6$ values for population.
Still, pandas aligns the indexes and allows us to operate between the two series.
p2011 - p2001clean
When we perform the operation the indexes are matched by label.
Where a value cannot be found (i.e. a label is missing from one of the series, or the value is NaN),
pandas automatically inserts a NaN as the result.
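Alignment by label can be checked on two small series whose labels differ in order and in number. This is a minimal sketch with hypothetical labels and values. The fill_value argument of Series.sub is shown only as one possible way to avoid the NaN; treating a missing value as 0.0 is an assumption that rarely fits real population data.

```python
import pandas as pd

new = pd.Series([5.0, 7.0, 9.0], index=["a", "b", "c"])
old = pd.Series([6.0, 4.0], index=["b", "a"])   # different order, no "c"

diff = new - old                                # matched by label, not by position
diff_filled = new.sub(old, fill_value=0.0)      # a missing side is treated as 0.0
```

Here diff holds 1.0 for both "a" and "b" and NaN for "c", whereas diff_filled holds 9.0 for "c".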