When working with data we find ourselves defining dimensions over which to analyze (aggregate) it. Dimension is a well known term in data warehousing and in analytic queries over such warehouses. One dimension that always appears in data analysis is the time dimension. Windowing, changing granularity, or aggregating over specific times in the time dimension is called time series analysis.

Both NumPy and `pandas` have facilities to work with time series.

In [1]:

```
import numpy as np
import pandas as pd
```

Computers have long encoded dates as seconds since January 1st 1970 (Unix time),
at least since that date - earlier computers used different encodings.
That date is called the **epoch** of computer time.
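As a quick sketch of the epoch itself: integer zero at second resolution is exactly the start of Unix time.

```python
import numpy as np

# Zero seconds since the epoch is midnight, January 1st 1970.
print(np.datetime64(0, 's'))  # 1970-01-01T00:00:00
```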

NumPy's date and time representation does exactly that: a `datetime64` counts time units since the epoch, at a resolution inferred from the input (seconds, in the example below).

In [2]:

```
np.datetime64('2020-01-03T10:00:00')
```

Out[2]:

There is one more complication about dates: timezones.
The *epoch* is in GMT (Greenwich Mean Time),
or UTC (Coordinated Universal Time), which is for most purposes a different name for the same thing.
The implementation of timezones in NumPy was a point of contention for a long time.
In the end NumPy abandoned any use of timezones and made all its dates carry
no timezone information at all - such times are called naive.
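A small sketch of naive versus timezone-aware times in `pandas` (the timestamp value is an arbitrary example):

```python
import pandas as pd

# A Timestamp is naive unless explicitly localized.
ts = pd.Timestamp('2020-01-03 10:00:00')
print(ts.tz)  # None

# Localizing attaches a timezone without changing the wall-clock time.
aware = ts.tz_localize('UTC')
print(aware.tz)  # UTC
```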

Moreover, the NumPy date parser does not accept many time formats,
which made the extensions provided by `pandas`
welcome and much needed.
`pandas` can parse a wide range of date formats out of the box.

In [3]:

```
pd.to_datetime('January, 2017'), pd.to_datetime('3rd of February 2016'), pd.to_datetime('6:31PM, Nov 11th, 2017')
```

Out[3]:

Python has the `datetime` object built into the standard library but it is quite limited.
The `pandas` library wraps around `dateutil` for a comprehensive date parser,
and `pytz` for localizing dates and times within timezones.
`pandas` makes use of these modules to build its `Timestamp`, `Period`
and `Timedelta` objects, and the data frame indexes based on them.

For example, the three dates above are parsed with the `dateutil` module behind the scenes.
And the localization below is done with `pytz` under the hood.

In [4]:

```
date = pd.to_datetime('3rd of January 2020').tz_localize('Europe/London')
date.strftime('%Z, %B %-d, %Y')
```

Out[4]:

Date operations, e.g. the common cross-language `strftime`,
work on `pandas` dates just like on Python dates.

Another difference between NumPy and `pandas` is the internal data type.
By default `pandas` uses a nanosecond resolution,
which has better granularity but can only represent about $292$ years
to either side of the *epoch* (the range of a 64-bit nanosecond counter).
When building dates and times NumPy will attempt to choose
a good granularity: nanoseconds, days, years;
one can change the `dtype` manually to force a granularity.
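A minimal sketch of forcing a granularity through the `dtype` in NumPy (the date value is an arbitrary example):

```python
import numpy as np

# Day resolution: the time of day is truncated away.
days = np.array(['2020-01-03T10:00:00'], dtype='datetime64[D]')
print(days)  # ['2020-01-03']

# Casting to second resolution changes the unit,
# but the truncated hours are gone for good.
print(days.astype('datetime64[s]'))  # ['2020-01-03T00:00:00']
```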

Just like date operations, time deltas work too.
And broadcasting works on deltas as well (the `'D'` means "day", more on these codes below).

In [5]:

```
date = pd.to_datetime('3rd of January 2020')
date + pd.to_timedelta(np.arange(7), 'D')
```

Out[5]:

Time series analysis requires us to be able to change the time dimension quickly, and tailor
it to our current needs with little computation overhead. `pandas` provides the tools for this
through its time indexes: time stamps, time periods and time deltas. Let's see how we build these.

We will use an airport, with planes landing at it, to understand
different ways of thinking about time.
We can distinguish between three time definitions:
the **timestamp**, e.g. at which time the plane did land;
the **time period**, e.g. how many planes did land this Wednesday;
and **time deltas** (or durations), e.g. how long ago did the last plane land.
Each of these has a `pandas` object and index type:

- The `DatetimeIndex` is composed of `Timestamp` objects and is the most basic date index type.
- The `PeriodIndex` uses `Period` objects, which contain `start_time` and `end_time` attributes, and methods to check whether a timestamp falls within the period.
- The `TimedeltaIndex` is composed of `Timedelta` objects, which represent a duration of time.

We also have `DateOffset` for calendar arithmetic, but it is less prevalent
in data manipulation and does not have its own index type.
Calendar arithmetic is arithmetic that accounts for special dates, e.g. holidays.
Calendar arithmetic is very specific to the country one lives in,
or the country a dataset has been collected in.
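As a sketch of a `DateOffset` in action - here `BDay`, the business-day offset, which steps over weekends:

```python
import pandas as pd

# The 3rd of January 2020 was a Friday;
# one business day later is Monday the 6th, not Saturday the 4th.
friday = pd.Timestamp('2020-01-03')
monday = friday + pd.offsets.BDay(1)
print(monday.day_name())  # Monday
```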

We can understand periods as aggregates of timestamps; they are internally defined as a single
timestamp (start of period) and a frequency (duration of the period). All periods within a
`PeriodIndex` must have the same frequency. The frequency (or duration, or offset) in `pandas`
can be defined in many ways, with letter codes. The most important ones are:

- `D` - day
- `B` - business day
- `W` - week
- `M` - month
- `A`/`Y` - year
- `H` - hour
- `T`/`min` - minute
- `S` - second

These codes can be combined in several ways
(e.g. `BAS-APR` means an annual frequency starting on the first business day of April).
It is nearly impossible to remember all combinations,
so keep a link to the offset documentation handy.
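As a small sketch of an anchored frequency (the dates are arbitrary): a weekly frequency anchored on Mondays rolls the range forward to the first Monday on or after the start date.

```python
import pandas as pd

# 'W-MON' is weekly, anchored on Mondays.
# 2020-01-01 was a Wednesday, so the range starts on Monday the 6th.
idx = pd.date_range('2020-01-01', freq='W-MON', periods=3)
print(idx)
```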
Let's see how to create time based indexes:

In [6]:

```
dates = pd.to_datetime(['3rd of January 2010', '1960-Jan-3', '20200708'])
dates
```

Out[6]:

Those are `Timestamp`s; we can convert them to `Period`s.

In [7]:

```
dates.to_period('D')
```

Out[7]:

And operations between dates result in `Timedelta`s.

In [8]:

```
dates - dates[0]
```

Out[8]:

Instead of writing out all the dates we want in an index, we can (and should) use the `_range` methods.
The default period is days.

In [9]:

```
pd.date_range('2019-12-03', periods=3)
```

Out[9]:

But the frequency can be customized extensively; for example, we can place timestamps every $6$ hours.

In [10]:

```
pd.date_range('2019-12-03', freq='6H', periods=6)
```

Out[10]:

Periods can be generated by themselves. Periods are printed in a short form when possible; for example, yearly periods print as just the year.

In [11]:

```
pd.period_range('2019', freq='Y', periods=3)
```

Out[11]:

And time deltas can be generated with the same flexibility of frequency specification.

In [12]:

```
idx = pd.timedelta_range('2min', freq='7min', periods=3)
idx
```

Out[12]:

And we can force that index to use minutes as the unit.

In [13]:

```
idx.astype('timedelta64[m]')
```

Out[13]:

We now have the tools to look at some data that is indexed by time,
a *time series*.
Note that analysis of a time series is always imperfect,
one cannot predict the future after all.
All one does in analyzing a time series is attempt to predict
the future by accepting that the future will be similar to what
happened in the past.

Analyzing a time series is like driving a car with the windscreen and side windows painted black, with the rear-view mirror as your only source of information. And the reverse gear does not work, since time travel is not viable either.

To build a toy dataset we will use the `dt` attribute
of a `pandas` series containing timestamps.
The `dt` attribute works in a similar way to the `str` attribute for strings:
we can operate on all dates within the series at once.

Here we take the hour out of the time and the day of the week
from the $3$ days ($72$ hours) in our set.
The days of the week are numbered.
Numbering of days of the week has never been standardized,
different programming languages and libraries do it in
different ways.
In `pandas` Monday is numbered $0$ and
Sunday is numbered $6$.

In [14]:

```
dates = pd.Series(pd.date_range('2019-12-02', periods=72, freq='H'))
df = pd.DataFrame({'hour': list(dates.dt.hour),
                   'dayofweek': list(dates.dt.dayofweek)},
                  index=dates)
df
```

Out[14]:

Selecting from the index is quite intuitive. One can select using strings that will be converted to times. This works well for contiguous selections.

In [15]:

```
df.loc['2019-12-03']
```

Out[15]:

For more complex selections we will need masking and fancy indexing,
all on top of the functions from the `dt` attribute.
Yet, since we are selecting from an index,
the values normally found on the `dt` attribute can be
retrieved directly from the index.
For example, each day of the week appears for $24$ hours in
our $72$ hour dataset.

In [16]:

```
df.index.dayofweek
```

Out[16]:

And within each of those $24$ hour days there is one point at which the clock hits $6$AM. We can select the data at $6$AM on each day by masking the index.

In [17]:

```
df[df.index.hour == 6]
```

Out[17]:

We now have the tools to work on a real time series. The tools themselves are simple: dates, times, and some selection tools. Yet, combining these tools with the tools we have seen previously makes for very complex behavior.
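As a parting sketch of such a combination, we can rebuild the $72$ hour dataset from above and mix index masking with a `groupby`:

```python
import pandas as pd

# Rebuild the 72-hour dataset: 3 full days of hourly timestamps.
dates = pd.Series(pd.date_range('2019-12-02', periods=72, freq='H'))
df = pd.DataFrame({'hour': list(dates.dt.hour),
                   'dayofweek': list(dates.dt.dayofweek)},
                  index=dates)

# Keep only the morning hours (before noon) by masking the index,
# then count them per day of the week: 12 for each of the 3 days.
morning = df[df.index.hour < 12]
print(morning.groupby('dayofweek').size())
```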