Pandas is a wrapper on top of `NumPy`

(and several other libraries, including `Matplotlib`

)
to make up for the shortcomings of the vectorial computing when working with real-world data.
Instead of working towards efficient
numerical computing it attempts to make working with messy data less annoying.
The name Pandas comes from the term *Panel data* which is derived from econometrics.

Let's import it,
and also let's import `NumPy`

to see how both libraries work with each other.
The common name given to a `pandas`

import is `pd`

.

In [1]:

```
import numpy as np
import pandas as pd
```

Originally built as an enhanced version of R's `data.frame`

,
`pandas`

incorporates several known APIs into a single structure.
The `DataFrame`

includes APIs that make it easy for use from different perspectives.
The `DataFrame`

:

- Is an R
`data.frame`

like structure, extended by multi-indexes - Has SQL-like joins, without need for external libraries (e.g.
`sqldf`

in R) - Looks like a spreadsheet (yes, that is intentional)
- Can move between two and multidimensional representations (
`stack`

,`unstack`

) - Has aggregations across dimensions with
`groupby`

(similar to SQL) - Can define enhanced data types with many operations outside pure computation when compared with
`numpy`

.

You will use `pandas`

(rather than `NumPy`

) for tasks around messy data.
`pandas`

is built atop `NumPy`

, and uses the continuous memory and broadcast operations
of `NumPy`

arrays to boost its performance. `pandas`

excels at:

- Importing data (very resilient compared to
`numpy.loadtxt`

) - Clean up messy data (
`dropna`

or`fillna`

) - Quickly build insight into data (
`describe`

)

Let's use some data about the British Isles and the United Kingdom to demonstrate some of the features:

In [2]:

```
country = ['Northern Ireland', 'Scotland', 'Wales', 'England', 'Isle of Man', 'Ireland']
area = np.array([14130, 77933, 20779, 130279, 572, 70273])
capital = ['Belfast', 'Edinburgh', 'Cardiff', 'London', 'Douglas', 'Dublin']
population2001 = [1.686e6, 5.064e6, np.nan, 48.65e6, 77.703e3, np.nan]
population2011 = [1.811e6, 5.281e6, 3.057e6, 53.01e6, 84.886e3, 4.571e6]
df = pd.DataFrame({'capital': capital,
'area': area,
'population 2001': population2001,
'population 2011': population2011,
},
index=country)
df
```

Out[2]:

After the defeat of Harold and conquest of England in 1066,
William (the Conqueror) commissioned a survey of the lands under his control.
The purpose of the survey was mostly to assert what taxes can be
made on the conquered English.
In 1086, the survey was complete, under the name "Liber de Wintonia".
But a more fitting name for the book among the English was
Domesday Book, as an analogy to the Last Judgement.

The main feature of `pandas`

is its `DataFrame`

but that is just a collection of `Series`

data structures.
A `Series`

is pretty similar to a `NumPy`

array: it is a list of several data of the same data type.
The difference is that the `Series`

adds labels (an index) to the data.

In [3]:

```
series_area = pd.Series(area)
series_area
```

Out[3]:

Above the index is just the offset from the beginning of the series,
as in a NumPy array.
But with `pandas`

we can give names to the index.

In [4]:

```
series_area = pd.Series(area, index=country)
series_area
```

Out[4]:

Selecting from a `Series`

works both as a list or as a dictionary.
You can say that a `Series.index`

maps keys over `Series.values`

.

In [5]:

```
series_area.values, series_area.values.dtype, series_area.index
```

Out[5]:

All of the following three forms of indexing produce the same record.

In [6]:

```
series_area['Wales'], series_area[2], series_area.values[2]
```

Out[6]:

Slicing works too.

In [7]:

```
series_area[0:3]
```

Out[7]:

And so does fancy indexing.

In [8]:

```
series_area[['Wales', 'Scotland']]
```

Out[8]:

Slicing works on indexes (the labels of the Series) but it is only likely to produce meaningful results if the index is sorted.

Note: In older versions of `pandas`

slicing over an unsorted index produced an error,
this still happens over a multi-index (outlined in a later section).
Since we did not care about the order when constructing the data frame our index is unsorted,
therefore slicing it will produce strange results.

In [9]:

```
series_area['England':'Scotland']
```

Out[9]:

If we sort the index, the alphabetical order (or actually ASCIIbetical order) of the labels can be used for slicing.

In [10]:

```
sorted_area = series_area.sort_index()
sorted_area['England':'Scotland']
```

Out[10]:

If you do not define an index you can still select and slice series items.
This is because apart from the normal index an implicit, positional, index is created.
In other words, every `pandas`

series has two indexes: the implicit and the explicit index.

In [11]:

```
series_area = pd.Series(area)
series_area[0:3]
```

Out[11]:

Moreover, when the explicit index is non-numeric, the implicit index is used for access. Here is a series with a sorted index.

In [12]:

```
series_area = pd.Series(area, index=country).sort_index()
series_area
```

Out[12]:

Most of the time both indexes work in the same fashion but slicing is inconsistent between them: The explicit index includes the last slice element (unlike Python list slicing).

In [13]:

```
series_area['England':'Northern Ireland']
```

Out[13]:

But the implicit index works in the same way as Python slicing, it excludes the last slice element.

In [14]:

```
series_area[0:3]
```

Out[14]:

This can give us a headache with numerical indexes,
therefore `pandas`

allows us to choose which index to select from:

`loc`

always refers to the explicit index`iloc`

always refers to the implicit index

To allow for $1$-based indexing instead of $0$-based indexing one may be tempted to set the index as $1$-based numerical indexes. This can become very confusing very fast because the numerical index is explicit and follows the explicit index rules for slicing. Also, the implicit index remains $0$-based.

Do not do this unless you have very good reasons.

In [15]:

```
series_area = pd.Series(area)
series_area.index = range(1, len(area)+1)
series_area
```

Out[15]:

Note that one can set the index by simply assigning to it.

Nevertheless, with a $1$-based index selection differences between explicit and implicit indexes are apparent, if not puzzling.

In [16]:

```
series_area[1], series_area.loc[1], series_area.iloc[1]
```

Out[16]:

Selection through the implicit index did still use $0$-based indexing. But selection without specifying the index used the explicit one.

Yet, when slicing the situation is different.

In [17]:

```
list(series_area[1:3]), list(series_area.loc[1:3]), list(series_area.iloc[1:3])
```

Out[17]:

By default, *numeric slices use the implicit index* and implicit index rules.

But there's more!
If one does not define an index at all,
slicing with `.loc`

accesses the implicit index
but it uses the explicit index rules of slicing.

In [18]:

```
series_area = pd.Series(area)
series_area
```

Out[18]:

Since there is just a single index selection is consistent.

In [19]:

```
series_area[1], series_area.loc[1], series_area.iloc[1]
```

Out[19]:

But slicing can be quite confusing.
Here `.loc`

uses the explicit index rules - include the last slice element - whilst
it accesses the implicit index because there is no explicit index.

In [20]:

```
list(series_area[1:3]), list(series_area.loc[1:3]), list(series_area.iloc[1:3])
```

Out[20]:

Always cross-check slicing operations and use `.loc`

or `.iloc`

explicitly.
The same rules apply to data frames (seen in a moment).

`Series`

works like a NumPy array¶The `NumPy`

vectorized operations, selection and broadcasting work as if we were working on an array.

In [21]:

```
series_area = pd.Series(area, index=country)
series_area[series_area > 20000]
```

Out[21]:

Let's compute the area in square miles instead of square kilometers.

$$ 0.386 \approx \frac{1}{1.61^2} $$In [22]:

```
series_area * 0.386
```

Out[22]:

And the total of the British Isles area in square miles.

In [23]:

```
(series_area * 0.386).sum()
```

Out[23]:

`Series`

is more than a NumPy array¶The `Series`

aligns the indexes when performing operations.

In order to see that let's have a look at an array with missing values. The UK had a census in 2001 but part of the British Isles outside of the UK have no data since they did not participate.

In [24]:

```
p2001 = pd.Series(population2001, index=country)
p2001
```

Out[24]:

The index holds the fact that there is missing data.
Missing data may be represented in several ways.
Here we use `NaN`

(not a number), which is a standard value for
unknowns in floating point values.
Another option is to us the Python `None`

value.

Since missing values can be represented by different values,
instead of comparing against them `pandas`

provides us with
an `isnull`

procedure that will catch common ways of representing missing data.
Another name for `isnull`

is `isna`

, you may see any of the two procedures
used to check for nulls.

In [25]:

```
p2001.isnull()
```

Out[25]:

For the year 2011 we have all population data.

In [26]:

```
p2011 = pd.Series(population2011, index=country)
p2011
```

Out[26]:

What if we would like to know the population growth between 2001 and 2011?

We could manually filter for the values we have in both years and compute
the growth using those values alone.
Yet, if we use `pandas`

, it will perform the computation by default between the correct values.

In [27]:

```
p2011 - p2001
```

Out[27]:

But what if we did not have the `NaN`

values in the correct places?

We can drop the missing data using the `dropna`

procedure and see.

In [28]:

```
p2001clean = p2001.dropna()
p2001clean
```

Out[28]:

The new series for the year $2001$ has only $4$ values, whilst the series for $2011$ has $6$ values for population.

Still, `pandas`

aligns the indexes and allows us to operate between the two series.

In [29]:

```
p2011 - p2001clean
```

Out[29]:

When we perform the operation the indexes are matched,
where a number cannot be found (i.e. the operation contains a `NaN`

),
pandas automatically inserts a `NaN`

as the result.