The `DataFrame` is just a collection of `Series` with a common index.
It can be understood as a two-dimensional representation of data,
similar to a spreadsheet.
Contrary to a two-dimensional `NumPy` array,
**indexing a data frame with a single value produces the column**, not the row.
Yet, indexing it with two values produces the row and the column,
just like in a `NumPy` array.

Let's import the libraries we have used until now.

In [1]:

```
import numpy as np
import pandas as pd
```

Constructing the data frame can be performed in several ways;
below is the most common way: using a dictionary of arrays.
Each dictionary key-value pair becomes a column (a `Series`).

We add more data to our example on the British Isles.
It is common to name a data frame `df`.

In [2]:

```
country = ['Northern Ireland', 'Scotland', 'Wales', 'England', 'Isle of Man', 'Ireland']
capital = ['Belfast', 'Edinburgh', 'Cardiff', 'London', 'Douglas', 'Dublin']
state = ['United Kingdom', 'United Kingdom', 'United Kingdom',
         'United Kingdom', 'Isle of Man', 'Republic of Ireland']
monarch = ['Elizabeth II', 'Elizabeth II', 'Elizabeth II',
           'Elizabeth II', 'Elizabeth II', None]
area = np.array([14130, 77933, 20779, 130279, 572, 70273])
population2001 = [1.686e6, 5.064e6, np.nan, 48.65e6, 77.703e3, np.nan]
population2011 = [1.811e6, 5.281e6, 3.057e6, 53.01e6, 84.886e3, 4.571e6]
df = pd.DataFrame({'capital': capital,
                   'state': state,
                   'monarch': monarch,
                   'area': area,
                   'population 2001': population2001,
                   'population 2011': population2011,
                   },
                  index=country)
df
```

Out[2]:

The underlying `NumPy` arrays can be concatenated
and returned using `.values`.
Note however that since we have different data types,
the NumPy array will have `dtype=object`,
meaning an array of Python objects,
a memory-inefficient representation.

In [3]:

```
array = df.values
array
```

Out[3]:

Since a `NumPy` array can only hold one data type,
every value was cast to a Python object.
In reality `NumPy` arrays do support compound data types,
but these are considerably more complicated to use than data frames.
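As a sketch of such a compound data type, a structured `NumPy` array can hold named, typed fields, though working with it is far clunkier than with a data frame (the field widths here are illustrative):

```python
import numpy as np

# A structured array: one compound dtype with named, typed fields
arr = np.array([('Belfast', 14130), ('Cardiff', 20779)],
               dtype=[('capital', 'U16'), ('area', 'i8')])
print(arr['area'])  # access a field by name, similar to a column
```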

We have the same data in `NumPy` and `pandas`, and we can index it.
In `NumPy` a plain index produces a *row*.

In [4]:

```
array[0]
```

Out[4]:

Whereas in `pandas` a single index produces a *column*.

In [5]:

```
df['area']
```

Out[5]:

Yet, there is a twist.
Using the implicit index (`.iloc`) produces the same behavior as `NumPy`.

In [6]:

```
df.iloc[0]
```

Out[6]:

Columns with simple names can be accessed as attributes.

In [7]:

```
df.area
```

Out[7]:

Finally, multi-indexing works in the same way as in `NumPy`:
one provides first the *row* and then the *column*.
And slicing works too.

In [8]:

```
df.loc['England', 'capital':'area']
```

Out[8]:

Since fancy indexing works, we can select columns and then take the underlying NumPy array with a single data type.

In [9]:

```
array = df[['area', 'population 2001', 'population 2011']].values
array, array.dtype
```

Out[9]:

Data frames have several useful methods to give a feel for the data. With a reasonable amount of data you'd rather not have thousands of rows printed. What you want are methods that give you the data you are after, quickly.

For example, looking at the beginning or end of sorted values will show outliers. And in order to have sorted values we sort the index. In this case we do not sort it in place, although the final result is the same once we assign the result back to the same variable.

In [10]:

```
df = df.sort_index()
df
```

Out[10]:

The index is sorted, therefore we get the countries in alphabetical order.

In [11]:

```
df.head(3)
```

Out[11]:

Sorting by area should give us the biggest countries.

In [12]:

```
df.sort_values('area').tail(3)
```

Out[12]:

The length of a data frame is the number of rows it has.

In [13]:

```
len(df)
```

Out[13]:

The `describe` and `info` methods print two distinct types of statistics about the data frame.
One gives a statistical view of each column.

In [14]:

```
df.describe()
```

Out[14]:

And the other gives you a memory layout.

In [15]:

```
df.info()
```

The data frame can also display plots (using `Matplotlib`) directly.
That said, if we want to display the plots within the notebook or style them,
we need to perform the `matplotlib` setup ourselves.

In [16]:

```
%matplotlib inline
import matplotlib.pyplot as plt
plt.style.use('seaborn-talk')
```

We can see the population growth in a graph.

Here `pandas` gives us a handle to the plot,
which we then use to get better ticks on the vertical axis
with `matplotlib` calls.
One can get the tick values from the axis itself
and then transform them into labels of one's liking.

In [17]:

```
fig, ax = plt.subplots(figsize=(16, 7))
plot = df[['population 2001', 'population 2011']].plot(kind='bar', ax=ax)
ticks = ['%.0f M' % (x / 1e6) for x in plot.yaxis.get_majorticklocs()]
plot.set_yticklabels(ticks);
```

And, on a logarithmic scale, we can see the relation between area and population.

Here we also use annotations, a `matplotlib` feature.
It annotates the string (first argument) over a point on the graph
(two coordinates, as a tuple, list or series).

In [18]:

```
fig, ax = plt.subplots(figsize=(16, 9))
plot = df.plot(kind='scatter', x='population 2001', y='area', loglog=True, ax=ax)
for k, v in df[['population 2001', 'area']].iterrows():
    plot.axes.annotate(k, xy=v, xytext=(v[0], v[1]*1.2), ha='center', size=12)
```

Do not be fooled into thinking that the population density is the same in all countries.
Population density is population divided by area,
and since we have a straight line on the graph we may be inclined to think
that the population and the area grow in the same fashion.
That would be a lie: both graph axes are on logarithmic scales due to `loglog=True`,
hence masking the real ratio between population and area.
Lying with graphs and statistics is easy; don't fall prey to it.
Plots in `pandas` give the convenience of doing the logarithms on the fly;
there are also the options `logx=` and `logy=` for finer control.
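To see through the logarithmic scales one can compute the density directly; a quick check with two of the 2011 values from the data frame above shows that England is several times denser than Scotland:

```python
import pandas as pd

# people per square kilometre, numbers taken from the data frame above
area = pd.Series([130279, 77933], index=['England', 'Scotland'])
pop2011 = pd.Series([53.01e6, 5.281e6], index=['England', 'Scotland'])
print(pop2011 / area)
```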

Plotting with `pandas` is useful when one needs something quick and dirty to see the data.
Yet in order to make a presentable graph,
more often than not one will need to rely on `matplotlib`,
for example by taking the data from the underlying NumPy array.

More often than not, real world data is incomplete in some way.
Missing data is represented in several ways:
numeric values often use NaN (Not a Number),
which are actual IEEE 754 floating-point standard NaNs.
String values will sometimes use the Python `None` value,
or sometimes an empty string.
One must be careful with empty strings because `isnull`
does not consider empty strings to be nulls.
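A minimal check of that caveat, on a made-up series: `isnull` flags `None` and NaN but not the empty string.

```python
import numpy as np
import pandas as pd

s = pd.Series(['Belfast', '', None, np.nan])
print(s.isnull())  # the empty string is considered a valid value
```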

This also means that the data type of the array underlying
a `Series` with missing data must be either a float or a Python object.
For non-floating-point data types NaNs become rather complicated.
Several solutions have been attempted over the decades on how to handle NaN
values, with varying degrees of success.
`pandas` attempts its own NaN solution, namely the value `pd.NA`.
At the time of writing you are likely to encounter pandas
versions using that as the default NaN treatment for some data types,
whilst using the IEEE 754 standard for other data types.
It is too early to argue which types will settle on which values
for missing data, so be wary.
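For example, the nullable integer dtype `Int64` (with a capital I) uses `pd.NA` as its missing value, keeping the column an integer instead of forcing a float; a small sketch:

```python
import pandas as pd

# 'Int64' is pandas' nullable integer dtype; missing values become pd.NA
s = pd.Series([1, 2, None], dtype='Int64')
print(s)
print(s[2] is pd.NA)
```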

Let's look at a dataset where missing data is a natural occurrence. Different countries in the British Isles have different governing structures, for example counties do not exist in either Scotland or the Republic of Ireland. Moreover, the $2001$ census of the United Kingdom was not performed in the countries outside of the United Kingdom.

In [19]:

```
city = ['Liverpool', 'Manchester', 'Cardiff',
        'Swansea', 'Douglas', 'Belfast',
        'Glasgow', 'Edinburgh', 'Dublin']
country = ['England', 'England', 'Wales',
           'Wales', 'Isle of Man',
           'Northern Ireland', 'Scotland',
           'Scotland', 'Ireland']
county = ['Merseyside', 'Greater Manchester', 'South Glamorgan',
          'West Glamorgan', None, 'County Antrim and County Down',
          None, None, None]
population2001 = [435500, 405300, 305353,
                  270506, np.nan, 276459,
                  577869, 448624, np.nan]
population2011 = [466400, 503127, 335145,
                  239023, 27938, 333871,
                  593200, 476600, 554550]
df = pd.DataFrame({'county': county,
                   'country': country,
                   'population 2001': population2001,
                   'population 2011': population2011,
                   },
                  index=city)
df
```

Out[19]:

`pandas` data frames have the `dropna` and `fillna` methods that
(unsurprisingly) drop or fill in values for NaNs.
Dropping can be done by row or column.

In [20]:

```
df.dropna(axis='rows')
```

Out[20]:

We lost the data for the Isle of Man, despite the fact that it has data for 2011. Instead we can drop the incomplete columns.

In [21]:

```
df.dropna(axis='columns')
```

Out[21]:

That's better. But we are still losing a lot of data.

Also note that instead of `NumPy`'s `axis=0` and `axis=1`,
in `pandas` one can use `axis='index'` and `axis='columns'`.
That holds most of the time,
though some `pandas` functions do accept `axis='row'` and `axis='col'`, beware.
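A quick sanity check that the string and numeric axis arguments agree, on a throwaway data frame:

```python
import numpy as np
import pandas as pd

df_tiny = pd.DataFrame({'a': [1.0, np.nan], 'b': [3.0, 4.0]})
# dropping by rows: axis='index' is the same as axis=0
print(df_tiny.dropna(axis='index').equals(df_tiny.dropna(axis=0)))
# dropping by columns: axis='columns' is the same as axis=1
print(df_tiny.dropna(axis='columns').equals(df_tiny.dropna(axis=1)))
```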

Filling NaNs instead of losing entire rows and columns can be performed in two different ways:
we can provide a value to `fillna` to substitute for the NaNs (e.g. `.fillna(0)`);
or we can use the `method=` argument to use a predefined way of filling the NaNs from the data itself.
The `method=` can be either `pad`/`ffill`,
which will fill each NaN with the previous (non-NaN) value seen;
or it can be `backfill`/`bfill`,
which will fill a NaN from the next value.
Filling can be performed column or row wise,
but column wise filling is almost never used since columns can have different data types.

In [22]:

```
df_fill = df.fillna(method='ffill', axis='index')
df_fill
```

Out[22]:

That seems to have worked but not quite. Perhaps we should leave the missing data in the county column. One can assign columns directly by name.

In [23]:

```
df_fill['county'] = df['county']
df_fill
```

Out[23]:

Once again better, but a handful of things still look off.
The populations for 2011 are integers whilst the populations for 2001 are floats.
This is because the population for 2001 did contain `np.nan`
just a moment ago,
and that can only be represented as an IEEE 754 float number.

We can ask `pandas` to convert all values to a more common representation.

In [24]:

```
df_fill = df_fill.convert_dtypes()
df_fill
```

Out[24]:

The numbers look alright.

But now the county column seems off.
The representation `<NA>` is the value `pd.NA`,
a common missing value representation inside pandas.

We said that we can index a data frame as a two-dimensional NumPy array.
Knowing that, we can use the explicit index and verify
that the `<NA>` in the county for Dublin is indeed `pd.NA`.

In [25]:

```
df_fill.loc['Dublin', 'county'] is pd.NA
```

Out[25]:

Another extra feature that does not exist in `NumPy` arrays is the set of methods that work
on string content, just like Python string methods.
The `str` attribute of a `Series` (or of a column of a data frame)
is used to call string methods on each element, efficiently.
The result is either a boolean `Series`
that can then be used to retrieve rows from the data frame,
or a new string `Series` modified by the operation.

In [26]:

```
df_fill['country'].str.contains('land')
```

Out[26]:

Here we see the cities (the index) of countries which contain the string
"land" in their names, Scot*land*, Eng*land* and both cases of Ire*land*.

Several regular expression methods are supported as well. And just as with NumPy masking we can use a boolean `Series` to index the original data frame. When masking a data frame the booleans select rows.

In [27]:

```
df_fill[df_fill.country.str.contains('[re]')]
```

Out[27]:

We got all countries that have the *lowercase* letter "r"
or the *lowercase* letter "e".
Notably England is excluded because the search is case sensitive.
Moreover, the search is using a regular expression, or regex for short.

The purpose of regular expressions is to either match or not match strings or parts of strings. The string is matched character by character with some regex special characters forming expressions to match several possibilities at once. A simple character matches itself, whilst a regex expression can match a plethora of things. Some common regex expressions follow.

expression | match |
---|---|
`abc` | exact match "abc" |
`.` | any single character |
`[abc]` | one of the characters cited |
`[^abc]` | one character NOT cited |
`a*` | zero or more times "a" |
`a+` | one or more times "a" |
`a?` | zero or one times "a" |
`\w` | any word character |
`\W` | any non-word character |
`\s` | any space character |
`\S` | any non-space character |
`(ab)` | grouping |
`(ab)+` | one or more times the entire group |
`\1` or `$1` | backreference to the grouped match |

Regular expressions can be complex, entire books have been written on them. But just a couple of minutes can get you started. For example:

`[yzs][ao].*`

Matches "yay", "zoink" and "kazoom" but does not match "bang", "flush" or "slap".
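The same pattern can be checked with Python's `re` module, which is what `pandas` uses under the hood for these string methods:

```python
import re

pattern = r'[yzs][ao].*'
words = ['yay', 'zoink', 'kazoom', 'bang', 'flush', 'slap']
# re.search looks for the pattern anywhere in the string
print([w for w in words if re.search(pattern, w)])
```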

Several non-regex Python string functions are present too. And we can use the index as a search base as well.

In [28]:

```
df[df.index.str.startswith('Liv')]
```

Out[28]:

Most Python string methods are available,
including procedures that change the string instead
of just returning boolean values.
A changed `Series` is returned.

In [29]:

```
df['county'].str.upper()
```

Out[29]:

Note that this respects the missing data, although it may require extra work with data types.
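A small sketch of that behavior, on a made-up series: the missing value passes through `str.upper` untouched.

```python
import pandas as pd

s = pd.Series(['Merseyside', None])
print(s.str.upper())  # the None stays missing instead of raising an error
```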

Often one does not want to modify the data in place,
one wants to keep the data in its original form.
Since the operation generates a new `Series`,
one can just assign the modified column as a new column.
For example, the first letter of the country.

In [30]:

```
df['initial'] = df['country'].str[0].str.upper()
df
```

Out[30]:

Note above that the `str` attribute has been used two times:
the first use generated a new `Series`,
and the second use of `str` has been on that new series.

The indexing on the `str` attribute worked the same way as
indexing within Python strings.
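Slices through `str` behave the same way; for instance, a sketch taking the first three characters of each element:

```python
import pandas as pd

s = pd.Series(['Liverpool', 'Manchester'])
print(s.str[:3])  # element-wise slice, as with Python string slicing
```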