When working with data we find ourselves defining dimensions over which to analyze (aggregate) it. Dimension is a well known term in data warehousing and in analytic queries over such warehouses. One dimension that always appears in data analysis is the time dimension. Windowing, changing granularity, or aggregating over specific times in the time dimension is called time series analysis.

Both NumPy and `pandas` have facilities to work with time series.

In [1]:

```
import numpy as np
import pandas as pd
```

Computers have long encoded dates as seconds since January 1st 1970 (Unix time),
at least since that date - earlier computers used different encodings.
That date is called the **epoch** of computer time.
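As a quick sketch of the epoch itself: integer zero at second resolution is exactly the start of Unix time.

```python
import numpy as np

# Zero seconds since the epoch is midnight, January 1st 1970.
print(np.datetime64(0, 's'))  # 1970-01-01T00:00:00
```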

NumPy's date and time representation does exactly that: a `datetime64` counts time units since the epoch, at a resolution inferred from the input (seconds, in the example below).

In [2]:

```
np.datetime64('2020-01-03T10:00:00')
```

Out[2]:

There is one more complication about dates: timezones.
The *epoch* is in GMT (Greenwich Mean Time),
or UTC (Coordinated Universal Time), which is for most purposes a different name for the same thing.
The implementation of timezones in NumPy was a point of contention for a long time.
In the end NumPy abandoned any use of timezones and made all its dates carry
no timezone information at all - such times are called naive.
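A small sketch of naive versus timezone-aware times in `pandas` (the timestamp value is an arbitrary example):

```python
import pandas as pd

# A Timestamp is naive unless explicitly localized.
ts = pd.Timestamp('2020-01-03 10:00:00')
print(ts.tz)  # None

# Localizing attaches a timezone without changing the wall-clock time.
aware = ts.tz_localize('UTC')
print(aware.tz)  # UTC
```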

Moreover, the NumPy date parser does not accept many time formats,
which made the extensions provided by `pandas`
welcome and much needed.
`pandas` can parse a wide range of date formats out of the box.

In [3]:

```
pd.to_datetime('January, 2017'), pd.to_datetime('3rd of February 2016'), pd.to_datetime('6:31PM, Nov 11th, 2017')
```

Out[3]:

Python has the `datetime` object built into the standard library but it is quite limited.
The `pandas` library wraps around `dateutil` for a comprehensive date parser,
and `pytz` for localizing dates and times within timezones.
`pandas` makes use of these modules to build its `Timestamp`, `Period`
and `Timedelta` objects, and the data frame indexes based on them.

For example, the three dates above are parsed with the `dateutil` module behind the scenes.
And the localization below is done with `pytz` under the hood.

In [4]:

```
date = pd.to_datetime('3rd of January 2020').tz_localize('Europe/London')
date.strftime('%Z, %B %-d, %Y')
```

Out[4]:

Date operations, e.g. the common cross-language `strftime`,
work on `pandas` dates just like on Python dates.

Another difference between NumPy and `pandas` is the internal data type.
By default `pandas` uses a nanosecond resolution,
which has better granularity but can only represent about $292$ years
to either side of the *epoch* (the range of a 64-bit nanosecond counter).
When building dates and times NumPy will attempt to choose
a good granularity: nanoseconds, days, years;
one can change the `dtype` manually to force a granularity.
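A minimal sketch of forcing a granularity through the `dtype` in NumPy (the date value is an arbitrary example):

```python
import numpy as np

# Day resolution: the time of day is truncated away.
days = np.array(['2020-01-03T10:00:00'], dtype='datetime64[D]')
print(days)  # ['2020-01-03']

# Casting to second resolution changes the unit,
# but the truncated hours are gone for good.
print(days.astype('datetime64[s]'))  # ['2020-01-03T00:00:00']
```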

Just like date operations, time deltas work too.
And broadcasting works on deltas as well (the `'D'` means "day", more on these codes below).

In [5]:

```
date = pd.to_datetime('3rd of January 2020')
date + pd.to_timedelta(np.arange(7), 'D')
```

Out[5]:

Time series analysis requires us to be able to change the time dimension quickly, and tailor
it to our current needs with little computation overhead. `pandas` provides the tools for this
through its time indexes: time stamps, time periods and time deltas. Let's see how we build these.

We will use an airport, with planes landing at it, to understand
different ways of thinking about time.
We can distinguish between three time definitions:
the **timestamp**, e.g. at which time the plane did land;
the **time period**, e.g. how many planes did land this Wednesday;
and **time deltas** (or durations), e.g. how long ago did the last plane land.
Each of these has a `pandas` object and index type:

- The `DatetimeIndex` is composed of `Timestamp` objects and is the most basic date index type.
- The `PeriodIndex` uses `Period` objects, which contain `start_time` and `end_time` attributes, and methods to check whether a timestamp falls within the period.
- The `TimedeltaIndex` is composed of `Timedelta` objects, which represent a duration of time.

We also have `DateOffset` for calendar arithmetic, but it is less prevalent
in data manipulation and does not have its own index type.
Calendar arithmetic is arithmetic that accounts for special dates, e.g. holidays.
Calendar arithmetic is very specific to the country one lives in,
or the country a dataset has been collected in.
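As a sketch of a `DateOffset` in action - here `BDay`, the business-day offset, which steps over weekends:

```python
import pandas as pd

# The 3rd of January 2020 was a Friday;
# one business day later is Monday the 6th, not Saturday the 4th.
friday = pd.Timestamp('2020-01-03')
monday = friday + pd.offsets.BDay(1)
print(monday.day_name())  # Monday
```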

We can understand periods as aggregates of timestamps; they are internally defined as a single
timestamp (start of period) and a frequency (duration of the period). All periods within a
`PeriodIndex` must have the same frequency. The frequency (or duration, or offset) in `pandas`
can be defined in many ways, with letter codes. The most important ones are:

- `D` - day
- `B` - business day
- `W` - week
- `M` - month
- `A`/`Y` - year
- `H` - hour
- `T`/`min` - minute
- `S` - second

These codes can be combined in several ways
(e.g. `BAS-APR` means an annual frequency starting on the first business day of April).
It is nearly impossible to remember all combinations,
so keep a link to the offset documentation handy.
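As a small sketch of an anchored frequency (the dates are arbitrary): a weekly frequency anchored on Mondays rolls the range forward to the first Monday on or after the start date.

```python
import pandas as pd

# 'W-MON' is weekly, anchored on Mondays.
# 2020-01-01 was a Wednesday, so the range starts on Monday the 6th.
idx = pd.date_range('2020-01-01', freq='W-MON', periods=3)
print(idx)
```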
Let's see how to create time based indexes:

In [6]:

```
dates = pd.to_datetime(['3rd of January 2010', '1960-Jan-3', '20200708'])
dates
```

Out[6]:

Those are `Timestamp`s; we can convert them to `Period`s.

In [7]:

```
dates.to_period('D')
```

Out[7]:

And operations between dates result in `Timedelta`s.

In [8]:

```
dates - dates[0]
```

Out[8]:

Instead of writing out all the dates we want in an index, we can (and should) use the `_range` methods.
The default period is days.

In [9]:

```
pd.date_range('2019-12-03', periods=3)
```

Out[9]:

But the frequency can be customized extensively; for example, we can place timestamps every $6$ hours.

In [10]:

```
pd.date_range('2019-12-03', freq='6H', periods=6)
```

Out[10]:

Periods can be generated by themselves. Periods are printed in a short form when possible; for example, yearly periods print as just the year.

In [11]:

```
pd.period_range('2019', freq='Y', periods=3)
```

Out[11]:

And time deltas can be generated with the same flexibility of frequency specification.

In [12]:

```
idx = pd.timedelta_range('2min', freq='7min', periods=3)
idx
```

Out[12]:

And we can force that index to use minutes as the unit.

In [13]:

```
idx.astype('timedelta64[m]')
```

Out[13]:

We now have the tools to look at some data that is indexed by time,
a *time series*.
Note that analysis of a time series is always imperfect,
one cannot predict the future after all.
All one does in analyzing a time series is attempt to predict
the future by accepting that the future will be similar to what
happened in the past.

Analyzing a time series is like driving a car with the windscreen and side windows painted black, with the rear-view mirror as your only source of information. And the reverse gear does not work, since time travel is not viable either.

To build a toy dataset we will use the `dt` attribute
of a `pandas` series containing timestamps.
The `dt` attribute works in a similar way to the `str` attribute for strings:
we can operate on all dates within the series at once.

Here we take the hour out of the time and the day of the week
from the $3$ days ($72$ hours) in our set.
The days of the week are numbered.
Numbering of days of the week has never been standardized,
different programming languages and libraries do it in
different ways.
In `pandas` Monday is numbered $0$ and
Sunday is numbered $6$.

In [14]:

```
dates = pd.Series(pd.date_range('2019-12-02', periods=72, freq='H'))
df = pd.DataFrame({'hour': list(dates.dt.hour),
                   'dayofweek': list(dates.dt.dayofweek)},
                  index=dates)
df
```

Out[14]:

Selecting from the index is quite intuitive. One can select using strings that will be converted to times. This works well for contiguous selections.

In [15]:

```
df.loc['2019-12-03']
```

Out[15]:

For more complex selections we will need masking and fancy indexing,
all on top of the functions from the `dt` attribute.
Yet, since we are selecting from an index,
the values normally found on the `dt` attribute can be
retrieved directly from the index.
For example, each day of the week appears for $24$ hours in
our $72$ hour dataset.

In [16]:

```
df.index.dayofweek
```

Out[16]:

And within each of those $24$ hour days there is one point at which the clock hits $6$AM. We can select the data at $6$AM on each day by masking the index.

In [17]:

```
df[df.index.hour == 6]
```

Out[17]:

We now have the tools to work on a real time series. The tools themselves are simple: dates, times, and some selection tools. Yet, combining these tools with the tools we have seen previously makes for very complex behavior.
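As a parting sketch of such a combination, we can rebuild the $72$ hour dataset from above and mix index masking with a `groupby`:

```python
import pandas as pd

# Rebuild the 72-hour dataset: 3 full days of hourly timestamps.
dates = pd.Series(pd.date_range('2019-12-02', periods=72, freq='H'))
df = pd.DataFrame({'hour': list(dates.dt.hour),
                   'dayofweek': list(dates.dt.dayofweek)},
                  index=dates)

# Keep only the morning hours (before noon) by masking the index,
# then count them per day of the week: 12 for each of the 3 days.
morning = df[df.index.hour < 12]
print(morning.groupby('dayofweek').size())
```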