04.05 Time Series

Working with data, we find ourselves defining dimensions over which we want to analyze (aggregate) it. Dimension is a well-known term in data warehousing and in analytic queries over such warehouses. One dimension that almost always appears in data analysis is time. Windowing, changing granularity, or aggregating over specific intervals of the time dimension is called time series analysis.

Both NumPy and pandas have facilities to work with time series.

In [1]:
import numpy as np
import pandas as pd

Computers have almost always encoded dates as seconds since January 1st 1970, at least since that date - earlier computers obviously used different encodings. That date is called the epoch of computer time.

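NumPy's date and time representation does exactly that: a datetime64 value stores a count of time units since the epoch. The unit is inferred from the input, so a string given down to the second is counted in seconds.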

In [2]:
np.datetime64('2020-01-03T10:00:00')
Out[2]:
numpy.datetime64('2020-01-03T10:00:00')
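
To make the "count since the epoch" idea concrete, here is a minimal sketch that looks at the integer behind a datetime64 value. The variable names are only illustrative.

```python
import numpy as np

ts = np.datetime64('2020-01-03T10:00:00')

# The unit is part of the dtype; a string given down to the second yields seconds.
print(ts.dtype)

# Casting to a plain integer exposes the count of units since the epoch.
seconds_since_epoch = int(ts.astype('int64'))
print(seconds_since_epoch)

# Going the other way: zero units since the epoch is the epoch itself.
epoch = np.datetime64(0, 's')
print(epoch)
```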

There is one more complication with dates: timezones. The epoch is defined in GMT (Greenwich Mean Time), or UTC (Coordinated Universal Time), which for practical purposes is the same thing. How NumPy should implement timezones was debated for a long time. In the end NumPy abandoned timezones altogether and stores all dates without any timezone information; such a time is called naive.

Moreover, the NumPy date parser accepts only a few formats, which made the extensions provided by pandas welcome and much needed. pandas can parse a wide range of date formats out of the box.

In [3]:
pd.to_datetime('January, 2017'), pd.to_datetime('3rd of February 2016'), pd.to_datetime('6:31PM, Nov 11th, 2017')
Out[3]:
(Timestamp('2017-01-01 00:00:00'),
 Timestamp('2016-02-03 00:00:00'),
 Timestamp('2017-11-11 18:31:00'))
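
Flexible parsing has a price: some strings are ambiguous. The sketch below, with an illustrative date string, shows how an explicit format argument settles whether the day or the month comes first.

```python
import pandas as pd

ambiguous = '03/01/2020'

# Without hints, pandas reads a slash-separated date as month first.
month_first = pd.to_datetime(ambiguous)

# An explicit format removes the ambiguity: here the day comes first.
day_first = pd.to_datetime(ambiguous, format='%d/%m/%Y')

print(month_first, day_first)
```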

Python has the datetime object built into the standard library, but it is quite limited. pandas wraps dateutil for comprehensive date parsing and pytz for localizing dates and times within timezones. It uses these modules to build its Timestamp, Period and Timedelta objects and the corresponding data frame indexes.

For example, the three dates above are parsed with the dateutil module behind the scenes. And the localization below is done with pytz under the hood.

In [4]:
date = pd.to_datetime('3rd of January 2020').tz_localize('Europe/London')
date.strftime('%Z, %B %-d, %Y')
Out[4]:
'GMT, January 3, 2020'

Date operations, e.g. the common cross-language strftime, work on pandas dates just like on Python dates.
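
Localizing and converting are two different steps, which a small sketch can make visible: tz_localize attaches a timezone to a naive time, while tz_convert moves an already localized time to another zone. The timezone names are standard identifiers; the variable names are illustrative.

```python
import pandas as pd

naive = pd.Timestamp('2020-01-03 10:00')

# Attach a timezone: the wall-clock time stays the same.
london = naive.tz_localize('Europe/London')

# Convert to another timezone: same instant, different wall-clock time.
tokyo = london.tz_convert('Asia/Tokyo')

print(naive.tz)   # a naive timestamp carries no timezone
print(london)
print(tokyo)
```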

Another difference between NumPy and pandas is the internal data type. By default pandas uses nanosecond resolution, which has finer granularity but cannot reach very far from the epoch: a 64-bit count of nanoseconds only covers roughly the years 1677 to 2262. When building dates and times pandas will attempt to choose a good granularity (nanoseconds, days, years), and one can change the dtype manually to force a granularity.
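
The trade-off between resolution and range can be checked directly. This is a small sketch; the historical date is an arbitrary example.

```python
import numpy as np
import pandas as pd

# Bounds of a nanosecond-resolution pandas Timestamp.
print(pd.Timestamp.min)
print(pd.Timestamp.max)

# NumPy with a coarser unit (days here) reaches much further back in time.
medieval = np.datetime64('1066-10-14')
print(medieval.dtype, medieval)
```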

Time deltas work just like date operations, and broadcasting works on deltas too (the 'D' below means "day"; the letter codes are listed further on).

In [5]:
date = pd.to_datetime('3rd of January 2020')
date + pd.to_timedelta(np.arange(7), 'D')
Out[5]:
DatetimeIndex(['2020-01-03', '2020-01-04', '2020-01-05', '2020-01-06',
               '2020-01-07', '2020-01-08', '2020-01-09'],
              dtype='datetime64[ns]', freq=None)

Time series analysis requires us to be able to change the time dimension quickly, and tailor it to our current needs with little computation overhead. pandas provides the tools for this through its time indexes: time stamps, time periods and time deltas. Let's see how we build these.

Indexes on dates

We will use an airport with planes landing on it to understand different ways of thinking about time. We can distinguish between three time definitions: the timestamp, e.g. at what time a plane landed; the time period, e.g. how many planes landed this Wednesday; and the time delta (or duration), e.g. how long ago the last plane landed. Each of these has a pandas object and index type:

  • The DatetimeIndex is composed of Timestamp objects and is the most basic date index type.
  • PeriodIndex uses Period objects which contain start_time and end_time, and attributes to check whether a timestamp falls within the period.
  • The TimedeltaIndex is composed of Timedelta objects, which represent a duration of time.
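
The three notions from the airport example map onto scalar objects like this. The landing times are made up for illustration.

```python
import pandas as pd

# Timestamp: the instant a (hypothetical) plane landed.
landing = pd.Timestamp('2020-01-08 14:35')

# Period: the whole of that Wednesday.
wednesday = pd.Period('2020-01-08', freq='D')
print(wednesday.start_time, wednesday.end_time)

# A timestamp falls within a period if it lies between its start and end.
landed_on_wednesday = wednesday.start_time <= landing <= wednesday.end_time
print(landed_on_wednesday)

# Timedelta: how long ago the plane landed, seen from 15:00.
since_landing = pd.Timestamp('2020-01-08 15:00') - landing
print(since_landing)
```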

We also have DateOffset for calendar arithmetic, but this is less prevalent in data manipulation and does not have its own index type. Calendar arithmetic is arithmetic which accounts for special dates, e.g. weekends and holidays. It is very specific to the country one lives in or the country in which a dataset has been collected.
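
A short sketch of calendar arithmetic, using only weekends and month lengths (holiday calendars are country specific and left out here):

```python
import pandas as pd

friday = pd.Timestamp('2020-01-03')

# One business day after a Friday skips the weekend.
next_business_day = friday + pd.offsets.BDay(1)
print(next_business_day)

# Adding a calendar month respects month lengths (2020 is a leap year).
end_of_january = pd.Timestamp('2020-01-31')
one_month_later = end_of_january + pd.DateOffset(months=1)
print(one_month_later)
```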

We can understand periods as aggregates of timestamps. Internally a period is defined by a single timestamp (the start of the period) and a frequency (the duration of the period). All periods within a PeriodIndex must have the same frequency. The frequency (or duration, or offset) in pandas can be defined in many ways, with letter codes. The most important ones are:

  • D - day
  • B - day, business days only
  • W - week
  • M - month
  • A/Y - year
  • H - hour
  • T/min - minute
  • S - second

These can be combined in several ways (e.g. BAS-APR means a year starting on the first business day of April). It is nearly impossible to remember all combinations, so do keep a link to the offset documentation handy. Let's see how to create time based indexes:

In [6]:
dates = pd.to_datetime(['3rd of January 2010', '1960-Jan-3', '20200708'])
dates
Out[6]:
DatetimeIndex(['2010-01-03', '1960-01-03', '2020-07-08'], dtype='datetime64[ns]', freq=None)

Those are Timestamps; we can convert them to Periods.

In [7]:
dates.to_period('D')
Out[7]:
PeriodIndex(['2010-01-03', '1960-01-03', '2020-07-08'], dtype='period[D]', freq='D')
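
The frequency passed to to_period decides how coarse the resulting periods are. As a sketch, the same three dates can be turned into monthly periods and then back into timestamps at the start of each month:

```python
import pandas as pd

dates = pd.to_datetime(['2010-01-03', '1960-01-03', '2020-07-08'])

# Monthly periods: the day of the month is dropped.
monthly = dates.to_period('M')
print(monthly)

# Converting back yields the first instant of each period.
month_starts = monthly.to_timestamp()
print(month_starts)
```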

And operations between dates result in Timedeltas.

In [8]:
dates - dates[0]
Out[8]:
TimedeltaIndex(['0 days', '-18263 days', '3839 days'], dtype='timedelta64[ns]', freq=None)

Instead of writing out every date we want in an index we can (and should) use the _range functions: date_range, period_range and timedelta_range. The default frequency is daily.

In [9]:
pd.date_range('2019-12-03', periods=3)
Out[9]:
DatetimeIndex(['2019-12-03', '2019-12-04', '2019-12-05'], dtype='datetime64[ns]', freq='D')

But the frequency is quite customizable; for example, we can have timestamps every $6$ hours.

In [10]:
pd.date_range('2019-12-03', freq='6H', periods=6)
Out[10]:
DatetimeIndex(['2019-12-03 00:00:00', '2019-12-03 06:00:00',
               '2019-12-03 12:00:00', '2019-12-03 18:00:00',
               '2019-12-04 00:00:00', '2019-12-04 06:00:00'],
              dtype='datetime64[ns]', freq='6H')
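
The other letter codes plug into the same freq argument. A sketch with business days and weeks; note that the plain 'W' code anchors on Sundays.

```python
import pandas as pd

# Business days only: the weekend after Friday the 3rd is skipped.
business = pd.date_range('2020-01-03', periods=3, freq='B')
print(business)

# Weekly timestamps fall on Sundays, starting with the first Sunday on or after the start.
weekly = pd.date_range('2020-01-03', periods=2, freq='W')
print(weekly)
```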

Periods can be generated by themselves. Periods are printed in a short version if possible; for example, yearly periods will print as just the year.

In [11]:
pd.period_range('2019', freq='Y', periods=3)
Out[11]:
PeriodIndex(['2019', '2020', '2021'], dtype='period[A-DEC]', freq='A-DEC')
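
Quarterly periods follow the same pattern and also print in a short form. A minimal sketch:

```python
import pandas as pd

quarters = pd.period_range('2019Q1', periods=4, freq='Q')
print(quarters)

# Each period knows where it starts and ends.
first = quarters[0]
print(first.start_time, first.end_time)
```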

Time deltas can be generated with the same flexibility in the choice of frequency.

In [12]:
idx = pd.timedelta_range('2min', freq='7min', periods=3)
idx
Out[12]:
TimedeltaIndex(['00:02:00', '00:09:00', '00:16:00'], dtype='timedelta64[ns]', freq='7T')

And we can force that index to use minutes as the unit.

In [13]:
idx.astype('timedelta64[m]')
Out[13]:
Int64Index([2, 9, 16], dtype='int64')
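
The astype conversion above depends on the pandas version; recent releases may refuse to cast to a minute unit. As an alternative sketch that does not rely on that cast, a duration can be expressed in minutes by dividing by a one minute Timedelta:

```python
import pandas as pd

idx = pd.timedelta_range('2min', freq='7min', periods=3)

# Dividing by a one minute duration gives the number of minutes as floats.
minutes = idx / pd.Timedelta(minutes=1)
print(minutes)

# The same numbers via total seconds.
minutes_alt = idx.total_seconds() / 60
print(minutes_alt)
```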

We now have the tools to look at some data that is indexed by time, a time series. Note that any analysis of a time series is imperfect: one cannot predict the future, after all. All one does when analyzing a time series is attempt to predict the future by assuming that it will resemble the past.

Analyzing a time series is similar to driving a car with the windscreen and side windows painted black, and the rear mirror as your only source of information. And the reverse gear does not work, since time travel is not viable either.

[Figure: Optimism (pd-road.svg)]

To build a toy dataset we will use the dt attribute of a pandas series containing timestamps. The dt attribute works in a similar way to the str attribute for strings: we can operate on all dates within the series at once.

Here we take the hour out of the time and the day of the week from the $3$ days ($72$ hours) in our set. The days of the week are numbered. Day-of-week numbering is not consistent across programming languages and libraries. In pandas Monday is numbered $0$ and Sunday is numbered $6$.

In [14]:
dates = pd.Series(pd.date_range('2019-12-02', periods=72, freq='H'))
df = pd.DataFrame({'hour': list(dates.dt.hour),
                   'dayofweek': list(dates.dt.dayofweek),
                  },
                 index=dates)
df
Out[14]:
hour dayofweek
2019-12-02 00:00:00 0 0
2019-12-02 01:00:00 1 0
2019-12-02 02:00:00 2 0
2019-12-02 03:00:00 3 0
2019-12-02 04:00:00 4 0
... ... ...
2019-12-04 19:00:00 19 2
2019-12-04 20:00:00 20 2
2019-12-04 21:00:00 21 2
2019-12-04 22:00:00 22 2
2019-12-04 23:00:00 23 2

72 rows × 2 columns
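
The dt attribute exposes more than the hour and the day of the week. A small sketch on three consecutive days, with illustrative variable names:

```python
import pandas as pd

days = pd.Series(pd.date_range('2019-12-02', periods=3, freq='D'))

# Numbered day of the week (Monday is 0) next to its name.
day_numbers = days.dt.dayofweek
day_names = days.dt.day_name()
print(pd.DataFrame({'number': day_numbers, 'name': day_names}))

# Other components are available in the same way.
print(days.dt.day)
print(days.dt.month)
```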

Selecting from the index is quite intuitive. One can select using strings that will be converted to times. This works well for continuous selection.

In [15]:
# .loc selects rows by date string; plain [] row selection is not supported on recent pandas
df.loc['2019-12-03']
Out[15]:
hour dayofweek
2019-12-03 00:00:00 0 1
2019-12-03 01:00:00 1 1
2019-12-03 02:00:00 2 1
2019-12-03 03:00:00 3 1
2019-12-03 04:00:00 4 1
2019-12-03 05:00:00 5 1
2019-12-03 06:00:00 6 1
2019-12-03 07:00:00 7 1
2019-12-03 08:00:00 8 1
2019-12-03 09:00:00 9 1
2019-12-03 10:00:00 10 1
2019-12-03 11:00:00 11 1
2019-12-03 12:00:00 12 1
2019-12-03 13:00:00 13 1
2019-12-03 14:00:00 14 1
2019-12-03 15:00:00 15 1
2019-12-03 16:00:00 16 1
2019-12-03 17:00:00 17 1
2019-12-03 18:00:00 18 1
2019-12-03 19:00:00 19 1
2019-12-03 20:00:00 20 1
2019-12-03 21:00:00 21 1
2019-12-03 22:00:00 22 1
2019-12-03 23:00:00 23 1
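
String selection also works with slices, which is where it shines for continuous ranges. The sketch below rebuilds a frame of the same shape as the toy dataset (named toy here to keep it separate) and writes the hourly frequency as '60min', an equivalent spelling.

```python
import pandas as pd

index = pd.date_range('2019-12-02', periods=72, freq='60min')
toy = pd.DataFrame({'hour': list(index.hour),
                    'dayofweek': list(index.dayofweek)},
                   index=index)

# Both ends of a label slice are included.
window = toy.loc['2019-12-03 22:00':'2019-12-04 01:00']
print(window)

# A whole-day string used as a slice bound covers the full day.
two_days = toy.loc['2019-12-03':'2019-12-04']
print(len(two_days))
```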

For more complex selections we will need masking and fancy indexing, built on top of the functions from the dt attribute. Yet, since we are selecting from an index, the values normally found on the dt attribute can be retrieved directly from the index. For example, we have $24$ hours in each day of our $72$ hour dataset.

In [16]:
df.index.dayofweek
Out[16]:
Int64Index([0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
            0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
            1, 1, 1, 1, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2,
            2, 2, 2, 2, 2, 2],
           dtype='int64')

And within each of those $24$ hours there is one point at which the clock hits $6$AM. We can select the data at $6$AM on each day by masking the index.

In [17]:
df[df.index.hour == 6]
Out[17]:
hour dayofweek
2019-12-02 06:00:00 6 0
2019-12-03 06:00:00 6 1
2019-12-04 06:00:00 6 2
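
Masks on the index can be combined like any other boolean masks. As a sketch on a frame shaped like the toy dataset (named toy here), we can pick the early hours of Tuesday only, or use between_time for a clock-time window on every day.

```python
import pandas as pd

index = pd.date_range('2019-12-02', periods=72, freq='60min')
toy = pd.DataFrame({'hour': list(index.hour)}, index=index)

# Combine masks with & (and); each condition needs its own parentheses.
tuesday_early = toy[(toy.index.dayofweek == 1) & (toy.index.hour < 6)]
print(tuesday_early)

# between_time selects by time of day across all days, both ends included.
breakfast = toy.between_time('06:00', '08:00')
print(breakfast)
```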

We have the tools to work on a real time series. The tools themselves are simple: dates, times, and some selection tools. Yet, combining these tools with the tools we have seen previously makes for very complex behavior.