Let's look at an example where we analyze a time series. Along the way we will check out some new techniques for plotting and aggregating. We start with the usual suspects.
import numpy as np
import pandas as pd
Metro Transit in Minneapolis has a branch that runs between the two main cities in Minnesota, Minneapolis and Saint Paul, which together form the state's main metropolitan area. The branch crosses the Mississippi River on its way. We will look at traffic data from the ATR $301$ station, more commonly known as Victoria Street Station.
The dataset has been donated to, and can be downloaded from, the University of California, Irvine (UCI) Machine Learning Repository. It contains some duplicates, which I have culled before building the comma-separated values (CSV) file we import below.
A full transit dataset can always be downloaded from the Minnesota Department of Transportation, but our dataset has extra information that has been added on top of the plain traffic data.
The read_csv procedure in pandas is the de facto standard for data imports in PyData. NumPy provides the loadtxt procedure, but read_csv can process missing data and many more flavors of data formats. Notably, CSV is a badly standardized format, and some clever heuristics are needed to parse some files. pandas can also parse the dates in the file automatically.
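As a tiny illustration of that flexibility (the column names below are made up, not from our dataset), read_csv turns empty fields into missing values and parses date columns in one go:

```python
import io

import pandas as pd

# A made-up CSV snippet with an empty field in the second row
csv = "date,count\n2020-01-01,7\n2020-01-02,\n"
df = pd.read_csv(io.StringIO(csv), parse_dates=['date'])

print(df['count'].isna().sum())  # the empty field became NaN, prints 1
```

The same parse_dates machinery is what we rely on in the next cell when reading the real file.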
df = pd.read_csv('pd-metro-traffic.csv', index_col='date_time', parse_dates=True)
df.head()
We have hourly data on the passenger traffic on the westbound trains: from Saint Paul to Minneapolis. We also have a considerable amount of weather data.
For a start, let's see what we have.
The traffic looks alright: three thousand people per hour is a reasonable number. The rain and snow data would be quite a lot of work to deal with, so we will ignore those. The temperature seems alright too, but a tad off in value: it is in Kelvin. We will convert it to Celsius because it is easier to think about temperature in that scale. We also rename the index to a shorter name.
df_traffic = df[['temp', 'traffic_volume']].copy()
df_traffic.index.name = 'date'
df_traffic.columns = ['temp', 'traffic']
df_traffic['temp'] = df_traffic['temp'] - 273
df_traffic
40575 rows × 2 columns
The data is aggregated by the hour, something that will be important to keep in mind.
We should plot it to get a better understanding.
We will need
import matplotlib.pyplot as plt
%matplotlib inline
plt.style.use('seaborn-talk')
The pandas library wraps over matplotlib with its plot procedure. Most arguments are passed directly into matplotlib, although there are several exceptions to that behavior. Whether to use pandas, matplotlib, or a combination of both for plotting is a personal preference. Here we will use a combination of both to get some understanding of how they work together.
We have two very distinct pieces of data to plot: traffic volume and temperature. One can build two separate vertical axes, one on the left and another on the right, with twinx (there is also twiny, but it is far less common). We then use one of the axes to plot the traffic and the other to plot the temperature, passing the axis we want to use in the ax= argument.
fig, axl = plt.subplots(figsize=(20, 9))
axr = axl.twinx()
df_traffic['traffic'].plot(alpha=0.6, ax=axl, style='.', color='limegreen')
df_traffic['temp'].plot(alpha=0.6, ax=axr, style='.', color='deeppink')
axl.set_ylabel('total traffic')
axr.set_ylim(-50, 50)
axr.set_ylabel('temperature');
This is a good representation of a real dataset: a good deal of missing data can be seen. The temperature follows a regular yearly cycle, but we can tell little about the traffic volume.
The data is also too granular.
If we aggregate by week we should see more.
The resample procedure will allow us to aggregate over subsets of the time index, where W means week (a period definition in pandas). Before, we used a scatter plot because we had $40$ thousand points; now we should be able to use lines.
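A minimal sketch of how such weekly resampling works, using a synthetic hourly series in place of the real file (the index dates and traffic values here are made up):

```python
import numpy as np
import pandas as pd

# Synthetic hourly data standing in for the traffic dataset
idx = pd.date_range('2020-01-01', periods=24 * 28, freq='h')
rng = np.random.default_rng(0)
hourly = pd.DataFrame({'traffic': rng.integers(0, 3000, size=len(idx))},
                      index=idx)

# 'W' groups the hourly rows into weekly bins; mean() aggregates each bin
weekly = hourly.resample('W').mean()
print(len(weekly))  # 28 days of hourly rows collapse into a handful of weeks
```

Any aggregation can stand in for mean() here, e.g. sum() for total weekly traffic.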