Let's look at an example where we analyze a time series. We will pick up some new techniques for plotting and aggregating along the way. We start with the usual suspects.
import numpy as np
import pandas as pd
The Metro Transit in Minneapolis has a branch that runs between the two main cities in Minnesota: Minneapolis and Saint Paul. Together they form the main metropolitan area of Minnesota. The branch crosses the Mississippi River on its way. We will look at traffic data from the ATR $301$ station, more commonly known as Victoria Street Station.
The dataset has been donated to, and can be downloaded from, the UCI (University of California, Irvine) Machine Learning Repository. It has some duplicates, which I have culled before building the comma separated value (CSV) file we import below.
A full transit dataset can always be downloaded from the Minnesota Department of Transportation, but our dataset has more information that has been added to the plain traffic data.
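The exact culling procedure is not shown here. As a minimal sketch, assuming a duplicate means a row that repeats an earlier timestamp, the cleanup could look like the following. The toy values and the names raw and deduped are made up for illustration.

```python
import pandas as pd

# Made-up toy data: the 10:00 reading appears twice.
raw = pd.DataFrame(
    {
        "date_time": [
            "2020-01-01 09:00:00",
            "2020-01-01 10:00:00",
            "2020-01-01 10:00:00",
        ],
        "traffic_volume": [100, 200, 200],
    }
)

# Keep the first row for every timestamp and drop the later repeats.
deduped = raw.drop_duplicates(subset="date_time", keep="first")
print(len(raw), "rows before,", len(deduped), "rows after")
```

Whether to keep the first or the last repeat is a judgment call; here the repeats are identical, so either choice gives the same result.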
The read_csv procedure in pandas is the de facto standard for data imports in PyData.
NumPy provides the loadtxt procedure, but read_csv can process missing data and many more flavors of data formats.
Notably, CSV is a badly standardized format, and some clever heuristics are needed to parse some files.
Moreover, pandas can parse the dates in the file automatically.
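To see the two features mentioned above in isolation, here is a small sketch that reads a made-up CSV from a string instead of a file. The column names mimic the ones used in this example, but the values and the name toy are invented for illustration.

```python
import io

import pandas as pd

# Made-up CSV text; the second row has an empty temperature field.
csv_text = """date_time,temp,traffic_volume
2020-01-01 09:00:00,288.3,5500
2020-01-01 10:00:00,,4500
2020-01-01 11:00:00,289.6,4800
"""

# parse_dates=True asks pandas to parse the index column as dates.
toy = pd.read_csv(io.StringIO(csv_text), index_col="date_time", parse_dates=True)

print(type(toy.index).__name__)   # the index is now a date-aware index
print(toy["temp"].isna().sum())   # the empty field was read as a missing value
```

The empty field does not break the import: it becomes NaN, and the column is still numeric.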
df = pd.read_csv('pd-metro-traffic.csv', index_col='date_time', parse_dates=True)
df.head()
We have hourly data on the passenger traffic on the westbound trains: from Saint Paul to Minneapolis. We also have a considerable amount of weather data.
To start, let's see what we have.
df.describe()
The traffic looks alright: three thousand people per hour is a reasonable number. The rain and snow data would take quite a lot of work to deal with, so we will ignore those. The temperature seems alright but a tad off in value: it is in Kelvin. We will convert it to Celsius because it is easier to think about temperature on that scale. We also rename the index and the traffic column to shorter names.
df_traffic = df[['temp', 'traffic_volume']].copy()
df_traffic.index.name = 'date'
df_traffic.columns = ['temp', 'traffic']
df_traffic['temp'] = df_traffic['temp'] - 273.15  # Kelvin to Celsius
df_traffic
The data is aggregated by the hour, something that will be important to keep in mind.
We should plot it to get a better understanding.
We will need some matplotlib configuration first.
import matplotlib.pyplot as plt
%matplotlib inline
# on matplotlib 3.6 and later this style is named 'seaborn-v0_8-talk'
plt.style.use('seaborn-talk')
The pandas library wraps matplotlib with its plot procedure.
Most arguments are passed directly to matplotlib, although there are several exceptions to that behavior.
Whether to use pandas, matplotlib or a combination of both for plotting is a personal preference.
Here we will use a combination of both to get some understanding of how they work together.
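The interplay can be sketched on a tiny made-up series: pandas draws onto an axes object created by matplotlib, forwards a styling argument to it, and hands back an axes object that plain matplotlib calls keep working on. The series s and its values are invented for illustration.

```python
import matplotlib.pyplot as plt
import pandas as pd

# Made-up series, for illustration only.
s = pd.Series(
    [1.0, 3.0, 2.0],
    index=pd.date_range("2020-01-01", periods=3, freq="D"),
)

# matplotlib creates the figure and the axes.
fig, ax = plt.subplots()

# pandas draws on the given axes; alpha is forwarded to matplotlib.
returned = s.plot(ax=ax, alpha=0.6, color="limegreen")

# The object pandas returns is a matplotlib axes, so matplotlib methods apply.
returned.set_ylabel("value")
print(type(returned).__name__, returned.get_ylabel())
```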
We have two very distinct pieces of data to plot: traffic volume and temperature.
One can build two separate vertical axes, one on the left and another on the right, with matplotlib's twinx (there is also twiny, but it is far less common).
We then use one of the axes to plot the traffic and the other to plot the temperature.
We pass the axis we want to use in the ax= argument.
fig, axl = plt.subplots(figsize=(20, 9))
axr = axl.twinx()
df_traffic['traffic'].plot(alpha=0.6, ax=axl, style='.', color='limegreen')
df_traffic['temp'].plot(alpha=0.6, ax=axr, style='.', color='deeppink')
axl.set_ylabel('total traffic')
axr.set_ylim(-50, 50)
axr.set_ylabel('temperature');
This is a good representation of a real dataset: a good deal of missing data can be seen. The temperature changes regularly with the year, but we can tell little about the traffic volume.
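The gaps can also be counted rather than eyeballed. As a minimal sketch on a made-up hourly series (toy, full and missing are illustrative names, not part of the dataset), we can compare the index against a complete hourly range:

```python
import pandas as pd

# Made-up hourly readings with the 02:00 and 03:00 rows absent.
idx = pd.to_datetime(
    [
        "2020-01-01 00:00",
        "2020-01-01 01:00",
        "2020-01-01 04:00",
        "2020-01-01 05:00",
    ]
)
toy = pd.Series([10, 12, 30, 41], index=idx)

# Every hour between the first and the last reading.
full = pd.date_range(toy.index.min(), toy.index.max(), freq="h")

# Hours that should exist but have no row.
missing = full.difference(toy.index)
print(len(missing), "missing hours out of", len(full))
```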
The data is also too granular.
If we aggregate by week we should see more.
The resample procedure allows us to aggregate over subsets of the time; here the W means week (one of the frequency aliases in pandas).
Before, we used a scatter plot because we had $40$ thousand points; now we should be able to use lines.
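To get a feel for what resample does before applying it to the real data, here is a sketch on a synthetic hourly series. The series and the choice of the mean as the aggregate are assumptions made for illustration only.

```python
import numpy as np
import pandas as pd

# Synthetic hourly series covering four full weeks; 2018-01-01 is a Monday.
idx = pd.date_range("2018-01-01", periods=4 * 7 * 24, freq="h")
toy = pd.Series(np.arange(len(idx), dtype=float), index=idx)

# 'W' groups the rows into weekly bins (weeks end on Sunday by default),
# and mean() reduces every bin to a single value.
weekly = toy.resample("W").mean()
print(len(toy), "hourly points ->", len(weekly), "weekly points")
```

With a couple of hundred times fewer points, a line plot becomes readable where the hourly data needed a scatter.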