04.06 Metro Traffic

Let's look at an example where we analyze a time series. Along the way we will check out some new techniques for plotting and aggregating. We start with the usual suspects.

In [1]:
import numpy as np
import pandas as pd

Metro Minneapolis


The dataset

The Metro Transit in Minneapolis has a branch that runs between the two main cities in Minnesota: Minneapolis and Saint Paul. Together, the two cities form the main metropolitan area of Minnesota. The branch crosses the Mississippi River on its way. We will look at traffic data from the ATR $301$ station, more commonly known as Victoria Street Station.

The dataset has been donated to, and can be downloaded from, the UCI (University of California, Irvine) Machine Learning Repository. It contains some duplicates, which I have culled before building the comma-separated values (CSV) file we import below.

A full transit dataset can always be downloaded from the Minnesota Department of Transportation, but our dataset adds more information to the plain traffic data.

The read_csv procedure in pandas is the de facto standard for data imports in PyData. NumPy provides the loadtxt procedure, but read_csv can process missing data and many more flavors of data formats. Notably, CSV is a poorly standardized format, and some clever heuristics are needed to parse some files. Moreover, pandas can parse the dates in the file automatically: below, index_col='date_time' makes the date column the index, and parse_dates=True asks pandas to parse that index as dates.

In [2]:
df = pd.read_csv('pd-metro-traffic.csv', index_col='date_time', parse_dates=True)
df.head()
                       temp  rain_1h  snow_1h  clouds_all  traffic_volume
2012-10-02 09:00:00  288.28      0.0      0.0        40.0            5545
2012-10-02 10:00:00  289.36      0.0      0.0        75.0            4516
2012-10-02 11:00:00  289.58      0.0      0.0        90.0            4767
2012-10-02 12:00:00  290.13      0.0      0.0        90.0            5026
2012-10-02 13:00:00  291.14      0.0      0.0        75.0            4918
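
To see the missing-data handling and date parsing of read_csv in isolation, here is a minimal sketch on a made-up three-line CSV held in memory. The CSV text and the name `toy` are hypothetical; only the column names are borrowed from our file.

```python
import io
import pandas as pd

# Hypothetical miniature CSV (not the real dataset); the second
# row has an empty temperature field.
csv_text = """date_time,temp,traffic_volume
2012-10-02 09:00:00,288.28,100
2012-10-02 10:00:00,,200
2012-10-02 11:00:00,289.58,300
"""

toy = pd.read_csv(io.StringIO(csv_text), index_col='date_time', parse_dates=True)

# The index is a DatetimeIndex, not a column of plain strings
print(type(toy.index))
# The empty field was read as NaN, so it counts as missing
print(toy['temp'].isna().sum())
```

Since the index holds real timestamps, time-based operations such as slicing by date become available on the frame.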

We have hourly data on the passenger traffic on the westbound trains: from Saint Paul to Minneapolis. We also have a considerable amount of weather data.

For a start let's see what we have.

In [3]:
df.describe()
               temp       rain_1h       snow_1h    clouds_all  traffic_volume
count  40575.000000  40575.000000  40575.000000  40575.000000    40575.000000
mean     281.315882      0.318629      0.000117     44.201215     3290.650474
std       13.817217     48.812640      0.005676     38.681283     1984.772909
min        0.000000      0.000000      0.000000      0.000000        0.000000
25%      271.840000      0.000000      0.000000      1.000000     1248.500000
50%      282.860000      0.000000      0.000000     40.000000     3427.000000
75%      292.280000      0.000000      0.000000     90.000000     4952.000000
max      310.070000   9831.300000      0.510000    100.000000     7280.000000

The traffic looks all right: three thousand people per hour is a reasonable number. The rain and snow data would take quite a lot of work to deal with, so we will ignore them. The temperature seems all right too, if a tad off in value, because it is in Kelvin. We will convert it to Celsius because it is easier to think about temperature in that scale. We also give the index and the traffic column shorter names.

In [4]:
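# Keep only the two columns we need, as an independent copy
df_traffic = df[['temp', 'traffic_volume']].copy()
df_traffic.index.name = 'date'
df_traffic.columns = ['temp', 'traffic']
# Kelvin to Celsius; 273 approximates the exact offset of 273.15
df_traffic['temp'] = df_traffic['temp'] - 273
df_traffic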
                      temp  traffic
2012-10-02 09:00:00  15.28     5545
2012-10-02 10:00:00  16.36     4516
2012-10-02 11:00:00  16.58     4767
2012-10-02 12:00:00  17.13     5026
2012-10-02 13:00:00  18.14     4918
...                    ...      ...
2018-09-30 19:00:00  10.45     3543
2018-09-30 20:00:00   9.76     2781
2018-09-30 21:00:00   9.73     2159
2018-09-30 22:00:00   9.09     1450
2018-09-30 23:00:00   9.12      954

40575 rows × 2 columns

The data is aggregated by the hour, something that will be important to keep in mind.
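
One way to check the hourly spacing, and to spot hours that are missing from the record, is to look at the time elapsed between consecutive index entries. The sketch below does this on a tiny made-up series; the values and the names `toy`, `steps` and `gaps` are hypothetical.

```python
import pandas as pd

# Hypothetical hourly index where the 11:00 and 12:00 records are missing
idx = pd.to_datetime([
    '2012-10-02 09:00', '2012-10-02 10:00',
    '2012-10-02 13:00', '2012-10-02 14:00',
])
toy = pd.Series([100, 200, 300, 400], index=idx)

# Time elapsed between each record and the previous one
steps = toy.index.to_series().diff().dropna()

# Steps longer than one hour mark gaps in the record
gaps = steps[steps > pd.Timedelta(hours=1)]
print(gaps)
```

Each entry in `gaps` is labeled with the timestamp that follows a hole, and its value is the length of that hole plus one hour.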

We should plot it to get a better understanding. For that we first need to import and configure matplotlib.

In [5]:
import matplotlib.pyplot as plt
%matplotlib inline

The pandas library wraps matplotlib with its plot procedure. Most arguments are passed directly into matplotlib, although there are several exceptions to that behavior. Whether to use pandas, matplotlib or a combination of both for plotting is a personal preference. Here we will use a combination of both to get some understanding of how they work together.

We have two very distinct pieces of data to plot: traffic volume and temperature. One can build two separate vertical axes, one on the left and another on the right, with matplotlib's twinx (there is also twiny, but it is far less common). We then use one of the axes to plot the traffic and the other to plot the temperature. We pass the axis we want to use in the ax= argument.

In [6]:
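fig, axl = plt.subplots(figsize=(20, 9))
# Second y-axis on the right, sharing the x-axis with the left one
axr = axl.twinx()
df_traffic['traffic'].plot(alpha=0.6, ax=axl, style='.', color='limegreen')
df_traffic['temp'].plot(alpha=0.6, ax=axr, style='.', color='deeppink')
axl.set_ylabel('total traffic')
# Fix the temperature axis to the range from -50 to 50 degrees Celsius
axr.set_ylim(-50, 50)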
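
For comparison, the same twin-axis layout can be built with matplotlib alone, without going through the pandas plot wrapper. The sketch below uses synthetic sine and cosine curves as stand-ins for traffic and temperature; the data and the names `ax_left` and `ax_right` are made up for illustration.

```python
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

# Synthetic stand-in for the real data: two days of hourly values
index = pd.date_range('2012-10-02', periods=48, freq=pd.Timedelta(hours=1))
hours = np.arange(48)
traffic = 3000 + 2000 * np.sin(2 * np.pi * hours / 24)
temp = 10 + 5 * np.cos(2 * np.pi * hours / 24)

fig, ax_left = plt.subplots(figsize=(10, 4))
# twinx creates a second y-axis that shares the x-axis of ax_left
ax_right = ax_left.twinx()

# Each quantity goes on its own vertical scale
ax_left.plot(index, traffic, '.', color='limegreen', alpha=0.6)
ax_right.plot(index, temp, '.', color='deeppink', alpha=0.6)

ax_left.set_ylabel('total traffic')
ax_right.set_ylabel('temperature (Celsius)')
```

The two calls to plot are independent, so each axis picks its own range unless one is fixed with set_ylim.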

This is a good representation of a real dataset: a good deal of missing data can be seen. The temperature changes regularly over the year, but we can tell little about the traffic volume.

The data is also too granular. If we aggregate by week we should see more. The resample procedure allows us to aggregate over subsets of time; here W means week (a frequency alias in pandas). Before, we used a scatter plot because we had $40$ thousand points; now we should be able to use lines.
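
As a minimal sketch of weekly aggregation with resample, consider a synthetic hourly series covering two full weeks. The data is made up, and taking the mean is an assumption made for illustration; a sum or another aggregate may suit the real data better.

```python
import numpy as np
import pandas as pd

# Synthetic hourly series covering two full weeks, Monday to Sunday
index = pd.date_range('2018-01-01', periods=14 * 24, freq=pd.Timedelta(hours=1))
values = np.repeat([100.0, 200.0], 7 * 24)  # week one is 100, week two is 200
toy = pd.Series(values, index=index)

# 'W' groups the hourly records into weeks ending on Sunday;
# mean() then collapses each week into a single number
weekly = toy.resample('W').mean()
print(weekly)
```

The 336 hourly points collapse into one value per week, which is few enough to draw as a line instead of a scatter.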