Let's look at an example where we perform analysis of a time series. We will check out some new techniques for plotting and aggregating along the way. We start with the common suspects.
import numpy as np
import pandas as pd
The Metro Transit in Minneapolis has a branch that runs between the two main cities in Minnesota: Minneapolis and Saint Paul, which together form the core of Minnesota's main metropolitan area. The branch crosses the Mississippi river on its way. We will look at traffic data from the ATR $301$ station, more commonly known as Victoria Street Station.
The dataset has been donated to, and can be downloaded from, the UC Irvine Machine Learning Repository. But it has some duplicates, which I have culled before building the comma separated values (CSV) file we import below.
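For reference, here is a minimal sketch of how such duplicates can be culled, assuming the raw download lives in a file called raw-metro-traffic.csv (a hypothetical name) with a date_time column:
raw = pd.read_csv('raw-metro-traffic.csv', parse_dates=['date_time'])
# keep only the first record for each duplicated timestamp
raw = raw.drop_duplicates(subset='date_time', keep='first')
raw.to_csv('pd-metro-traffic.csv', index=False)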
A full transit dataset can always be downloaded from the Minnesota Department of Transportation, but our dataset has extra information added on top of the plain traffic data.
The read_csv procedure in pandas is the de facto standard for data imports in PyData. NumPy provides the loadtxt procedure, but read_csv can process missing data and many more flavors of data formats. Notably, CSV is a badly standardized format, and some clever heuristics are needed to parse some files. Moreover, pandas can parse the dates in the file automatically.
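As an illustration (not something our well-behaved file needs), a hypothetical semicolon-separated file that marks missing values with a dash could be parsed with a couple of extra arguments:
# hypothetical messy CSV: semicolon separated, '-' marks missing values
messy = pd.read_csv('messy-example.csv', sep=';', na_values=['-'],
                    index_col='date_time', parse_dates=True)
For our file the defaults plus the date parsing are enough.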
df = pd.read_csv('pd-metro-traffic.csv', index_col='date_time', parse_dates=True)
df.head()
We have hourly data on the passenger traffic on the westbound trains: from Saint Paul to Minneapolis. We also have a considerable amount of weather data.
For a start let's see what we have.
df.describe()
The traffic looks alright: three thousand people per hour is a reasonable number. The rain and snow data would be quite a lot of work to deal with, so we will ignore those. The temperature seems alright too but a tad off in value: it is in Kelvin. We will convert it to Celsius because it is easier to think about temperature in that scale. We also rename the index to a shorter name.
df_traffic = df[['temp', 'traffic_volume']].copy()
df_traffic.index.name = 'date'
df_traffic.columns = ['temp', 'traffic']
df_traffic['temp'] = df_traffic['temp'] - 273  # Kelvin to Celsius (273.15 to be exact)
df_traffic
The data is aggregated by the hour, something that will be important to keep in mind.
We should plot it to get a better understanding.
We will need some matplotlib configuration.
import matplotlib.pyplot as plt
%matplotlib inline
plt.style.use('seaborn-talk')  # on matplotlib >= 3.6 this style is named 'seaborn-v0_8-talk'
The pandas library wraps over matplotlib with its plot procedure. Most arguments are passed directly into matplotlib, although there are several exceptions to that behavior. Whether to use pandas, matplotlib, or a combination of both for plotting is a personal preference. Here we will use a combination of both to get some understanding of how they work together.
We have two very distinct pieces of data to plot: traffic volume and temperature. One can build two separate vertical axes, one on the left and another on the right, with matplotlib's twinx (there is also twiny but it is far less common). We then use one of the axes to plot the traffic and the other to plot the temperature. We pass the axis we want to use in the ax= argument.
fig, axl = plt.subplots(figsize=(20, 9))
axr = axl.twinx()
df_traffic['traffic'].plot(alpha=0.6, ax=axl, style='.', color='limegreen')
df_traffic['temp'].plot(alpha=0.6, ax=axr, style='.', color='deeppink')
axl.set_ylabel('total traffic')
axr.set_ylim(-50, 50)
axr.set_ylabel('temperature');
This is a good representation of a real dataset: a good deal of missing data can be seen. The temperature changes regularly with the year but we can tell little about the traffic volume.
The data is also too granular. If we aggregate by week we should see more. The resample procedure will allow us to aggregate over subsets of time; here the W means week (a period definition in pandas). Before we used a scatter because we had $40$ thousand points; now we should be able to use lines.
fig, axl = plt.subplots(figsize=(20, 9))
axr = axl.twinx()
weekly = df_traffic.resample('W').mean()
weekly['traffic'].plot(ax=axl, color='limegreen')
weekly['temp'].plot(ax=axr, color='deeppink')
axl.set_ylabel('total traffic')
axr.set_ylim(-50, 50)
axr.set_ylabel('temperature');
We can start to see a pattern here but it is still very jagged.
Another aggregation technique we can use is a rolling window. A rolling window will aggregate each period in consideration together with a number of periods close to it. Here we will take days (using resample) and make a rolling window over $30$ days. Each day will have the mean traffic of itself and the $29$ days closest to it. The center= argument asks for days both in the past and in the future when aggregating.
fig, axl = plt.subplots(figsize=(20, 9))
axr = axl.twinx()
daily = df_traffic.resample('D').mean()
daily = daily.rolling(30, center=True).mean()
daily['traffic'].plot(ax=axl, color='limegreen')
daily['temp'].plot(ax=axr, color='deeppink')
axl.set_ylabel('total traffic')
axr.set_ylim(-50, 50)
axr.set_ylabel('temperature');
That's good enough. The rolling window flattens out the noise. The only thing we could still improve towards a presentation of the graph is the number of points.
Above, every day is still a single point on the graph; we can smooth it by making a window over $5$ weeks instead of $30$ days. For an even smoother curve we could do a rolling window over $10$ weeks.
fig, axl = plt.subplots(figsize=(20, 9))
axr = axl.twinx()
weekly = df_traffic.resample('W').mean()
weekly = weekly.rolling(5, center=True).mean()
weekly['traffic'].plot(ax=axl, color='limegreen')
weekly['temp'].plot(ax=axr, color='deeppink')
axl.set_ylabel('mean weekly count')
axr.set_ylim(-50, 50)
axr.set_ylabel('temperature');
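The even smoother $10$-week window mentioned above is a one-argument change, sketched here without the plotting boilerplate:
# same pipeline with a wider, ten-week centered window
smoother = df_traffic.resample('W').mean().rolling(10, center=True).mean()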
In the initial graph we could only see a mesh of points. Here we can see some patterns: the usage of the metro decreases during the holiday period, and during a short period in the summer.
What we have been doing with the resampling was a form of grouping and then aggregating. We grouped times close to each other, but grouping things away from each other is also possible and useful.
On a time series, grouping by can produce completely different time frames, an operation known as slicing and dicing the frame. On dates and times the division is quite evident: each day is formed of hours, each hour of minutes, and so on. Let's first get a series of time stamps from our data and see how we can divide it into divisions such as weeks, days or hours.
series = df_traffic.index.to_series()
series
We know a year has $365$ or $366$ days, which makes for $8760$ or $8784$ hours in a year.
Knowing this we can check how much missing data we have in each year.
series.groupby(series.dt.year).count()
Not a single complete year. Let us look at $2015$ and see how many days we actually have.
s2015 = series['2015']
s2015.groupby(s2015.dt.year).count() / 24
Less than $150$ days is not much to go on. Thankfully the other years are better.
Now, one common pitfall when evaluating missing data in a time series is to rely on the info procedure or the isnull procedure.
df_traffic.info()
Both info and isnull will tell us that there are no nulls. That is correct, there are no nulls. But a time series defines an initial time and a final time, and any point missing in the middle, or not appearing at all, is missing data. This also means that dropping missing data in a time series is a rather meaningless operation.
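An honest way to count the missing data is to reindex against the full hourly range between the first and the last timestamps. A minimal sketch, assuming the index holds no duplicate timestamps (we culled those when building the CSV):
# every hour between the first and the last timestamp in the data
full_range = pd.date_range(df_traffic.index.min(), df_traffic.index.max(), freq='H')
# reindexing inserts NaN rows for every hour that was never measured
reindexed = df_traffic.reindex(full_range)
# now isnull does reveal the missing hours, counted per year
reindexed['traffic'].isnull().groupby(reindexed.index.year).sum()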
Knowing about grouping by, we can now slice the data in more ways. Remember that several of the properties of the dt object are directly available from the index.
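A quick sanity check of that equivalence:
# the hour of day seen through the index and through the dt namespace agree
(df_traffic.index.hour == df_traffic.index.to_series().dt.hour).all()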
For example, grouping by time (we only have hours) we can get the mean hourly traffic. Note the hourly ticks we set up in order to make a readable graph.
fig, ax = plt.subplots(figsize=(20, 9))
by_time = df_traffic['traffic'].groupby(df_traffic.index.time).mean()
hourly_ticks = 4 * 60 * 60 * np.arange(6)
by_time.plot(xticks=hourly_ticks, ax=ax, color='forestgreen');
Well, yeah, those are people riding to work and riding back home.
Grouping by the day of the week we get the amount of traffic across weekdays. Again, we set ticks to get a nice graph.
fig, ax = plt.subplots(figsize=(20, 9))
by_weekday = df_traffic['traffic'].groupby(df_traffic.index.dayofweek).mean()
weekdays = ['Mon', 'Tue', 'Wed', 'Thu', 'Fri', 'Sat', 'Sun']
by_weekday.plot(xticks=np.arange(len(weekdays)), ax=ax, color='forestgreen')
ax.set_xticklabels(weekdays);
The weekend is very different, as we'd expect. We can also check what the hourly distribution looks like on the weekend. We can build a multi-index by grouping over two elements.
Remember that the where procedure from NumPy works like a vectorized if-else statement. Also, weekday is an alias for dayofweek (and day_name retrieves the weekday name as a string); pandas often has aliases for the same attribute.
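A tiny sketch of that if-else behavior on a plain array:
# each element takes the first value where the condition holds, the second otherwise
np.where(np.arange(4) < 2, 'small', 'big')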
weekend = np.where(df_traffic.index.weekday < 5, 'Weekday', 'Weekend')
by_time = df_traffic['traffic'].groupby([weekend, df_traffic.index.time]).mean()
by_time
This resulted in a series with a two-level index: weekday or weekend, and the hour. Now one can select from the first index level and plot each piece of data side by side.
fig, ax = plt.subplots(figsize=(20, 9))
hourly_ticks = 4 * 60 * 60 * np.arange(6)
by_time.loc['Weekday'].plot(ax=ax, label='Weekdays', style='-', color='forestgreen')
by_time.loc['Weekend'].plot(ax=ax, label='Weekends', style=':', color='darkorange')
ax.legend()
ax.set_xticks(hourly_ticks);
The weekday behavior of riding to work and riding home is even more pronounced. But during the weekend we see a very different pattern.