The DataFrame is just a collection of Series with a common index.
It can be understood as a two-dimensional representation of data,
similar to a spreadsheet.
Unlike a two-dimensional NumPy array,
indexing a data frame with a single value produces a column, not a row.
Yet, indexing it with two values produces the row and the column,
just like in a NumPy array.
Let's import the libraries we have used so far.
import numpy as np
import pandas as pd
A data frame can be constructed in several ways;
the most common way, shown below, uses a dictionary of arrays.
Each dictionary key-value pair becomes a column (a Series).
We add more data to our example on the British Isles.
It is common to name a data frame df.
country = ['Northern Ireland', 'Scotland', 'Wales', 'England', 'Isle of Man', 'Ireland']
capital = ['Belfast', 'Edinburgh', 'Cardiff', 'London', 'Douglas', 'Dublin']
state = ['United Kingdom', 'United Kingdom', 'United Kingdom',
'United Kingdom', 'Isle of Man', 'Republic of Ireland']
monarch = ['Elizabeth II', 'Elizabeth II', 'Elizabeth II',
'Elizabeth II', 'Elizabeth II', None]
area = np.array([14130, 77933, 20779, 130279, 572, 70273])
population2001 = [1.686e6, 5.064e6, np.nan, 48.65e6, 77.703e3, np.nan]
population2011 = [1.811e6, 5.281e6, 3.057e6, 53.01e6, 84.886e3, 4.571e6]
df = pd.DataFrame({'capital': capital,
'state': state,
'monarch': monarch,
'area': area,
'population 2001': population2001,
'population 2011': population2011,
},
index=country)
df
The underlying NumPy arrays can be concatenated and returned using .values.
Note however that since we have different data types,
the NumPy array will have dtype=object,
meaning an array of Python objects,
which is a memory-inefficient representation.
array = df.values
array
Since a NumPy array can only hold one data type,
every value was cast to a Python object.
NumPy arrays do support compound data types,
but these are considerably more complicated to use than data frames.
We have the same data in NumPy
and pandas
, and we can index it.
In NumPy
a plain index produces a row.
array[0]
Whereas in pandas
a single index produces a column.
df['area']
Yet, there is a twist.
Using the implicit index (.iloc
) produces the same behavior as NumPy
.
df.iloc[0]
Columns with simple names can be accessed as attributes.
df.area
Finally, multi-indexing works in the same way as NumPy
:
One provides first the row and then the column.
And slicing works too.
df.loc['England', 'capital':'area']
Since fancy indexing works, we can select columns and then take the underlying NumPy array with a single data type.
array = df[['area', 'population 2001', 'population 2011']].values
array, array.dtype
Data frames have several useful methods to give a feel for the data. With a reasonable amount of data you'd rather not have thousands of rows printed; you want methods that quickly surface the data you are after.
For example, looking at the beginning or end of sorted values will reveal outliers. In order to have sorted values we sort the index. In this case we do not sort it in place, although the final result is the same once we assign the result back to the same variable.
df = df.sort_index()
df
The index is sorted, therefore we get the countries in alphabetical order.
df.head(3)
Sorting by area should give us the biggest countries.
df.sort_values('area').tail(3)
The length of a data frame is the number of rows it has.
len(df)
The describe
and info
methods print two distinct types of statistics about the data frame.
One gives the statistical view of each column.
df.describe()
And the other gives you a memory layout.
df.info()
The data frame can also display plots (using Matplotlib
) directly.
That said, if we want to display the plots within the notebook or style them,
we need to perform the matplotlib
setup ourselves.
%matplotlib inline
import matplotlib.pyplot as plt
plt.style.use('seaborn-v0_8-talk')  # named 'seaborn-talk' in matplotlib before 3.6
We can see the population growth in a graph.
Here pandas
gives us a handle to the plot,
which we then use to get better ticks on the vertical axis
with matplotlib
calls.
One can get the tick values from the axis itself
and then transform them into the labels of their liking.
fig, ax = plt.subplots(figsize=(16, 7))
plot = df[['population 2001', 'population 2011']].plot(kind='bar', ax=ax)
ticks = ['%.0f M' % (x / 1e6) for x in plot.yaxis.get_majorticklocs()]
plot.set_yticklabels(ticks);
And, on a logarithmic scale, we can see the relation between area and population.
Here we also use annotations, this is a matplotlib
feature.
It annotates the string (first argument) over a point on the graph
(two coordinates - as a tuple, list or series).
fig, ax = plt.subplots(figsize=(16, 9))
plot = df.plot(kind='scatter', x='population 2001', y='area', loglog=True, ax=ax)
for k, v in df[['population 2001', 'area']].iterrows():
    plot.axes.annotate(k, xy=v, xytext=(v.iloc[0], v.iloc[1]*1.2), ha='center', size=12)
Do not be fooled into thinking that the population density is the same in all the countries.
Population density is population divided by area,
and since we have a straight line on the graph we may be inclined to think
that population and area grow in the same fashion.
That would be a lie: both graph axes are on logarithmic scales due to loglog=True,
hence masking the real ratio between population and area.
Lying with graphs and statistics is easy, don't fall prey to it.
Plots in pandas
give the convenience of doing the logarithms on the fly;
there are also the logx= and logy= options for finer control.
Plotting with pandas
is useful when one needs something quick and dirty to see the data.
Yet in order to make a presentable graph,
more often than not one will need to rely on matplotlib,
either by taking the data from the underlying NumPy array
or by tweaking the axes object that pandas returns.
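As a sketch of the first route, one can pull the raw arrays out with .values and hand them straight to matplotlib. The frame below reuses the area and 2011 population figures from the example above; the Agg backend line is only there so the sketch runs in a non-interactive session.

```python
import numpy as np
import pandas as pd
import matplotlib
matplotlib.use('Agg')  # non-interactive backend, so the sketch runs anywhere
import matplotlib.pyplot as plt

# a small frame in the spirit of the example above
df = pd.DataFrame({'area': [14130, 77933, 20779],
                   'population 2011': [1.811e6, 5.281e6, 3.057e6]},
                  index=['Northern Ireland', 'Scotland', 'Wales'])

# take the underlying NumPy arrays and plot them with plain matplotlib
x = df['population 2011'].values
y = df['area'].values
fig, ax = plt.subplots()
ax.scatter(x, y)
ax.set_xlabel('population 2011')
ax.set_ylabel('area')
```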
More often than not real world data is incomplete in some way.
Missing data is represented in several ways:
numeric values often use NaN (Not a Number),
which are actual IEEE 754 float standard NaNs.
String values will sometimes use the Python None value,
or sometimes an empty string.
One must be careful with empty strings because isnull
does not consider empty strings to be nulls.
This also means that the data type of the array underlying
a Series with missing data must be either a float or a Python object.
For non-floating point data types NaNs become rather complicated.
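A minimal check on a toy series shows the empty-string pitfall: isnull flags NaN and None, but not the empty string.

```python
import numpy as np
import pandas as pd

# three flavors of "missing" plus one real value
s = pd.Series([np.nan, None, '', 'Belfast'])
s.isnull()
# → True, True, False, False — the empty string is NOT considered null
```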
Several solutions have been attempted over the decades on how to handle NaN
values, with varying degrees of success.
pandas attempts its own NaN solution, namely the value pd.NA.
At the time of writing you are likely to encounter pandas
versions using that as the default missing value for some data types,
whilst using the IEEE 754 standard for other data types.
It is too early to argue which types will settle for which values
for missing data, be wary.
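As a sketch of the pd.NA side of things: the nullable extension dtypes (here Int64, with a capital I) keep integers as integers and represent the missing value as pd.NA rather than casting the column to float.

```python
import pandas as pd

# the nullable integer dtype does not force a cast to float
s = pd.Series([1, 2, None], dtype='Int64')
missing = s[2]
missing is pd.NA   # → True
str(s.dtype)       # → 'Int64'
```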
Let's look at a dataset where missing data is a natural occurrence. Different countries in the British Isles have different governing structures, for example counties do not exist in either Scotland or the Republic of Ireland. Moreover, the $2001$ census of the United Kingdom was not performed in the countries outside of the United Kingdom.
city = ['Liverpool', 'Manchester', 'Cardiff',
'Swansea', 'Douglas', 'Belfast',
'Glasgow', 'Edinburgh', 'Dublin']
country = ['England', 'England', 'Wales',
'Wales', 'Isle of Man',
'Northern Ireland', 'Scotland',
'Scotland', 'Ireland']
county = ['Merseyside', 'Greater Manchester', 'South Glamorgan',
'West Glamorgan', None, 'County Antrim and County Down',
None, None, None]
population2001 = [435500, 405300, 305353,
270506, np.nan, 276459,
577869, 448624, np.nan]
population2011 = [466400, 503127, 335145,
239023, 27938, 333871,
593200, 476600, 554550]
df = pd.DataFrame({'county': county,
'country': country,
'population 2001': population2001,
'population 2011': population2011,
},
index=city)
df
pandas data frames have the dropna and fillna methods that
(unsurprisingly) drop or fill in values for NaNs.
Dropping can be done by row or column.
df.dropna(axis='rows')
We lost the data for the Isle of Man, despite the fact that it has data for 2011. Instead we can drop the incomplete columns.
df.dropna(axis='columns')
That's better. But we are still losing a lot of data.
Also note that instead of NumPy's axis=0 and axis=1,
in pandas one can use axis='index' and axis='columns'.
That holds most of the time;
a few pandas functions instead accept axis='row' and axis='col', beware.
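A minimal sketch on a toy frame confirms that the string and numeric spellings of the axis are equivalent for dropna:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'a': [1.0, np.nan], 'b': [3.0, 4.0]})
# axis='columns' is the readable spelling of axis=1
same = df.dropna(axis='columns').equals(df.dropna(axis=1))
same   # → True
```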
Filling NaNs instead of losing entire rows and columns can be performed in different ways:
we can provide a value to fillna to substitute for the NaNs (e.g. .fillna(0)); or we can use
the method= argument to use a predefined way of filling the NaNs from the data itself.
method= can be either pad/ffill, which fills each NaN with the previous (non-NaN) value seen; or
backfill/bfill, which fills a NaN from the next value.
Filling can be performed along the index or along the columns.
But filling along the columns (axis='columns', i.e. across a row) is almost never used,
since adjacent columns can have different data types.
df_fill = df.fillna(method='ffill', axis='index')
df_fill
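The other filling strategies can be sketched on a toy series. Note that the .ffill() and .bfill() method shorthands are equivalent to method='ffill' and method='bfill', and are the preferred spelling in recent pandas versions.

```python
import numpy as np
import pandas as pd

s = pd.Series([1.0, np.nan, np.nan, 4.0])
filled_zero = s.fillna(0)  # constant value: 1, 0, 0, 4
filled_back = s.bfill()    # fill from the next value: 1, 4, 4, 4
```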
That seems to have worked but not quite. Perhaps we should leave the missing data in the county column. One can assign columns directly by name.
df_fill['county'] = df['county']
df_fill
Once again better, but something still looks off.
The populations for 2011 are integers whilst the populations for 2001 are floats.
This is because the 2001 population column did contain np.nan
just a moment ago,
and np.nan can only be represented as an IEEE 754 float.
We can ask pandas
to convert all values to a more common representation.
The numbers look alright.
But now the county column seems off.
The representation of <NA>
is the value pd.NA
,
a common missing value representation inside pandas.
We said that we can index a data frame like a two-dimensional NumPy array.
Knowing that, we can use the explicit index and verify
that the <NA> in the county for Dublin is indeed pd.NA.
df_fill.loc['Dublin', 'county'] is pd.NA
Another feature that does not exist in NumPy arrays is methods that work
on string content, just like Python string methods. The str
attribute of a Series
(or of a column of a data frame) is used to call string methods on each element, efficiently.
The result is either a boolean Series
that can then be used to retrieve rows from the data frame,
or a new string Series
modified by the operation.
df_fill['country'].str.contains('land')
Here we see the cities (the index) of countries which contain the string "land" in their names, Scotland, England and both cases of Ireland.
Several regular expression methods are supported as well. And just as with NumPy masking we can use the boolean data frames to index the original data frame. When masking a data frame the booleans are considered as rows.
df_fill[df_fill.country.str.contains('[re]')]
We got all countries that have the lowercase letter "r" or the lowercase letter "e". Notably England is excluded because the search is case sensitive. Moreover, the search is using a regular expression, or regex for short.
The purpose of regular expressions is to either match or not match strings or parts of strings. The string is matched character by character with some regex special characters forming expressions to match several possibilities at once. A simple character matches itself, whilst a regex expression can match a plethora of things. Some common regex expressions follow.
expression | match
---|---
abc | exact match "abc"
. | any single character
[abc] | one of the characters cited
[^abc] | one character NOT cited
a* | zero or more times "a"
a+ | one or more times "a"
a? | zero or one times "a"
\w | any word character
\W | any non-word character
\s | any space character
\S | any non-space character
(ab) | grouping
(ab)+ | one or more times the entire group
\1 or $1 | repeat grouped match
Regular expressions can be complex, entire books have been written on them. But just a couple of minutes can get you started. For example:
[yzs][ao].*
Matches "yay", "zoink" and "kazoom" but does not match "bang", "flush" or "slap".
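A sketch with Python's own re module confirms the claim; re.search scans anywhere in the string, which is also what .str.contains does.

```python
import re

pattern = r'[yzs][ao].*'
words = ['yay', 'zoink', 'kazoom', 'bang', 'flush', 'slap']
# True where the word contains [yzs] followed by [ao]
matches = [bool(re.search(pattern, w)) for w in words]
matches   # → [True, True, True, False, False, False]
```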
Several non-regex Python string functions are present too. And we can use the index as a search base as well.
df[df.index.str.startswith('Liv')]
Most string Python methods are available,
including procedures that change the string instead
of just returning boolean values.
A changed Series
is returned.
df['county'].str.upper()
Note that this respects the missing data, although it may require extra work with data types.
Often one does not want to modify the data in place;
one wants to keep the data in its original form.
Since a new Series
is generated, one can just assign the
modified column as a new column.
For example, the first letter of the country.
df['initial'] = df['country'].str[0].str.upper()
df
Note above that the str
attribute has been used twice:
the first use generated a Series
and the second use
of str
operated on that new series.
The indexing on the str
works the same way as
indexing within Python strings.
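A quick sketch on a toy series shows the slicing analogy:

```python
import pandas as pd

s = pd.Series(['England', 'Wales', 'Scotland'])
s.str[0]    # first character of each string: E, W, S
s.str[:3]   # first three characters, like python_string[:3]: Eng, Wal, Sco
```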