Just so we have this in one place, let's group together all the higher-level issues that concern machine learning models. These issues arise almost every time we use a machine learning model to solve a real-world problem (i.e. not a toy problem). We have already discussed several of them, but a list works better as a reference. We will also go a bit deeper into the mathematics of some of the topics; these terms will prove necessary when we approach online learning.

The *bias vs variance* trade-off argues that a model that is not complex enough
will *underfit* the data, and a model that is too complex will *overfit* the data.
We control model complexity through model hyperparameters,
and we can estimate a good complexity by trying several hyperparameters and
cross-validating their performance.
The difficult part of the art of machine learning is not making a model work,
it is proving that it works and that it will keep working on new data.
Moreover, depending on the problem we may want to validate a model for different things,
e.g. in a fraud detection model the recall on fraudulent data points
is likely the most important validation metric.

If your model is to work with new data, cross-validating it alone is not enough. Cross-validation allows us to select the best hyperparameters, and gives us a good estimate of how well a model performs; but it does not tell us how badly our model can perform on new data, i.e. we do not have a generalization baseline.

To estimate how our model performs on new data, we need to separate our data
into training and test sets, and only then perform cross-validation on the training set alone.
The resulting model's generalization can then be evaluated on the test set.
In other words, we now have a test set, plus several folds
which together form the training and validation sets.
This ensures that the model sees only the training set during the tuning of its parameters,
and sees only the training and validation sets during the tuning of its hyperparameters.
In `sklearn` the test set splitting and cross-validation are done with similar procedures.
In other libraries, notably neural network libraries, the training (fitting) of the model
will *always perform (cross) validation* as part of the training.
As the parentheses suggest, it may not be a full cross-validation, depending on the library;
but do not be fooled by examples using such libraries in which
no one appears to worry about the validation.
The (cross) validation happens behind the scenes in many ML libraries.
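The split-then-cross-validate workflow can be sketched in `sklearn` as follows; the dataset, model, and hyperparameter grid here are illustrative assumptions, not a recipe:

```python
# Sketch: hold out a test set first, then cross-validate hyperparameters
# on the training set alone, and only at the end score the test set.
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.svm import SVC

X, y = load_iris(return_X_y=True)

# The test set is set aside before any tuning happens.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=42)

# Cross-validation sees only the training set; it internally splits it
# into training and validation folds to tune the hyperparameter C.
grid = GridSearchCV(SVC(), {'C': [0.1, 1, 10]}, cv=5)
grid.fit(X_train, y_train)

# Only now do we estimate generalization, on data the model never saw.
test_score = grid.score(X_test, y_test)
```

The test score is the generalization estimate; the cross-validation scores inside `grid` are only for hyperparameter selection.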

Machine Learning algorithms work on numbers, and implicitly assume that
a bigger number means a more important number.
Yet, that is often not what we actually want.
Most models are sensitive to the magnitude of the features,
and scaling the features to have a similar magnitude will,
more often than not, give better results.
Borrowing from `sklearn`, three common ways of scaling features are:

`StandardScaler` subtracts the mean and then divides by the standard deviation, to achieve mean zero and variance one for all features.

`MinMaxScaler` forces all features to have values between zero and one.

`Normalizer` considers each sample to be a vector in as many dimensions as there are features, and normalizes each sample vector to unit length.

We did indeed use several of these or wrote equivalent code.
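A tiny illustration of the three scalers, on a made-up matrix with two features of very different magnitude:

```python
# Sketch: the three sklearn scalers side by side on toy data.
import numpy as np
from sklearn.preprocessing import StandardScaler, MinMaxScaler, Normalizer

X = np.array([[1.0, 100.0],
              [2.0, 300.0],
              [3.0, 500.0]])

X_std = StandardScaler().fit_transform(X)    # each feature: mean 0, variance 1
X_mm = MinMaxScaler().fit_transform(X)       # each feature: range [0, 1]
X_norm = Normalizer().fit_transform(X)       # each sample: unit length vector
```

Note the difference in direction: the first two operate column-wise (per feature), while `Normalizer` operates row-wise (per sample).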

Ensemble methods are powerful. Depending on how you set up the ensemble, it can boost the performance of models or work around the limitations of certain models.

A *voting* technique (akin to, but not limited to, random forests or AdaBoost)
takes several models, trains each, and predicts by majority vote across all models.
The internal models are often trained on subsets of the data, or with different
randomization and hyperparameters.
This way one can increase the performance and generalization of the ensemble,
and make up for models which suffer from in-built overfitting.
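A minimal voting sketch with `sklearn`'s `VotingClassifier`; the choice of base models here is arbitrary:

```python
# Sketch: a hard-voting ensemble of three rather different models.
from sklearn.datasets import load_wine
from sklearn.ensemble import VotingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.naive_bayes import GaussianNB

X, y = load_wine(return_X_y=True)

vote = VotingClassifier([
    ('lr', LogisticRegression(max_iter=5000)),
    ('dt', DecisionTreeClassifier(random_state=0)),
    ('nb', GaussianNB()),
], voting='hard')  # 'hard' = plain majority vote across the three models
vote.fit(X, y)
score = vote.score(X, y)
```

With `voting='soft'` the ensemble would instead average the predicted probabilities, which requires every base model to provide `predict_proba`.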

*One vs One* (OVO) and *One vs Rest* (OVR) techniques are used to allow binary classifiers
to perform multiclass classification.
OVO trains a classifier for every pair of classes,
which means that a large number of binary classifiers will be trained,
each on a subset of the data (only the samples for the two classes in the pair).
To predict, OVO runs all competing classifiers and decides on the class with the most wins.
In OVR (also called OVA, One vs All) the number of trained classifiers equals the number of classes:
each classifier is trained with the samples of one class as the positive class
and all other samples as the negative class.
OVR then selects the answer by picking the class with the highest probability.
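Both strategies are available in `sklearn` as wrappers; a sketch with a linear SVM (a binary classifier) on the three-class iris data:

```python
# Sketch: wrapping a binary classifier for multiclass with OVO and OVR.
from sklearn.datasets import load_iris
from sklearn.multiclass import OneVsOneClassifier, OneVsRestClassifier
from sklearn.svm import LinearSVC

X, y = load_iris(return_X_y=True)  # three classes

ovo = OneVsOneClassifier(LinearSVC(max_iter=10000)).fit(X, y)
ovr = OneVsRestClassifier(LinearSVC(max_iter=10000)).fit(X, y)

# OVO trains one classifier per pair of classes: 3*2/2 = 3 here.
n_ovo = len(ovo.estimators_)
# OVR trains one classifier per class: 3 here.
n_ovr = len(ovr.estimators_)
```

For k classes OVO trains k(k-1)/2 classifiers against OVR's k, but each OVO classifier sees only the samples of its two classes, so each individual fit is cheaper.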

Being able to explain why your model classifies things the way it does may or may not be important for the problem you are solving. Classification (and often regression) can be performed in two ways: by constructing a crisp decision function and deciding upon classes/values based on distance to this function; or by assigning probabilities to each class/value and deciding based on the highest probability.

These two methods are not necessarily opposed to each other. Probabilities are still distances to a decision function, but they can be weighted by some density, i.e. distances across a region of higher density weigh more. The opposite is also true: one can estimate probabilities based on the distance from a crisp decision function, and that is exactly what many algorithms do. Note that this means such probabilities are just estimates. The quality of probability estimation varies across models, and there are techniques to better calibrate such probability estimates.
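One such calibration technique in `sklearn` is `CalibratedClassifierCV`, which fits a probability model on top of a classifier's decision function; the dataset and base model below are only an example:

```python
# Sketch: adding calibrated probabilities to a model that has none.
from sklearn.datasets import load_breast_cancer
from sklearn.calibration import CalibratedClassifierCV
from sklearn.svm import LinearSVC

X, y = load_breast_cancer(return_X_y=True)

# LinearSVC has a crisp decision function but no predict_proba;
# the calibration wrapper estimates probabilities from the distances
# to that decision function, cross-validating internally.
calibrated = CalibratedClassifierCV(LinearSVC(max_iter=10000), cv=3)
calibrated.fit(X, y)
proba = calibrated.predict_proba(X)
```

The wrapper produces a probability per class for each sample, and those probabilities sum to one.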

One thing we have not touched yet is the concept of *online learning*,
as distinguished from *offline* learning.
In *offline learning* we can work with all the data we will ever
feed to model training at once,
i.e. we can load a dataset in memory, use cross-validation
to tune a model over this dataset and test against a test set.
Offline learning is incredibly common in data analysis.

The problem starts when we plan to build a model (say, classifier) on top of data that
is continuously entering or flowing through the system.
We never have the full dataset in such a case;
we may have all data up to today, but even that will not necessarily be complete.
What we need is a model that can learn and **re-learn** from new data:
that is an *online learning* model.

To perform online learning we need to be capable of tuning model
parameters to new data without looking back at the previous data.
In other words,
the model parameters must *represent the data seen until now*, and a slight change
to such a parameter should not affect the overall model too much.
Decision functions, when cleverly parametrized, are much easier to re-tune to new data.
A small change to a parameter in the decision function can bring the model closer
to new data with very little computational effort.

But how small a change? That answer depends on the problem.
This small (or not so small) change is called the **learning rate** of an online learning model.
A big learning rate will make the model forget about old data quickly,
a very small learning rate will make the model have a lot of inertia when adapting to new data.
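A minimal online-learning sketch with `sklearn`'s `SGDClassifier`: `partial_fit` updates the model parameters from each new batch alone, never revisiting earlier data. The data stream and batch size here are made up for illustration:

```python
# Sketch: online learning with partial_fit and an explicit learning rate.
import numpy as np
from sklearn.linear_model import SGDClassifier

rng = np.random.default_rng(0)
# eta0 is the learning rate; 'constant' keeps it fixed across batches.
model = SGDClassifier(learning_rate='constant', eta0=0.01)

classes = np.array([0, 1])  # all classes must be declared up front
for _ in range(20):  # pretend each iteration is a new batch arriving
    X_batch = rng.normal(size=(32, 2))
    y_batch = (X_batch[:, 0] + X_batch[:, 1] > 0).astype(int)
    model.partial_fit(X_batch, y_batch, classes=classes)

acc = model.score(X_batch, y_batch)
```

A larger `eta0` would adapt faster to new batches but drift away from what was learned from the old ones, which is exactly the trade-off described above.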

There is a considerable cost in achieving actual online learning: one must define and test a learning rate, and several models can only use a subset of their capabilities as online learning models (e.g. SVMs can only use the linear kernel, because it has a tunable decision function). And in most real-world problems a model does not receive data continuously enough to justify online learning.

Most machine learning problems will receive data at known intervals, e.g. a daily snapshot of a database or a time series of the last 30 days of trades. This means that you can retrain your model with the new data every time you receive it. You may need to slightly re-tune the hyperparameters, but the grid search should stay close to the current hyperparameters, since the new data is unlikely to be very different from the old.

Retraining a new model at certain intervals does not need to create downtime: you can automate the training and, only once the training and validation finish, point a load balancer to the new model. Training a new model on every batch of data also allows you to test it for obvious inconsistencies. Since most machine learning is performed as a service (a trained model sits in memory, waits for input, and sends back predictions), a newly trained model can easily be compared to the currently deployed model. Training and testing a new model at server startup is a completely valid and often used technique.

In summary, you will only need an online learning model if either:

The entire dataset cannot fit in memory, in this case you need online learning to train the model using parts of the dataset at a time.

You need very quick adaptation to new data, e.g. slow algorithmic trading (fast algorithmic trading is performed by network cards).

An ML model is, as the name suggests, a mathematical technique which estimates the behavior of the real world. Unfortunately (fortunately?) the real world changes, and if our model does not change in response it will soon perform worse and worse.

Of course, this is not relevant to self-contained datasets, such as the ones seen in competitions or in toy problems. But most models are expected to work on real data. Recent data may have new trends, and a performance estimate without that trend will be overoptimistic. In other words, the performance of your model will decrease over time if you do not update it.

Since the performance will decrease over time, you need to know when it decreases to the point where your model is no longer good enough to perform its job. Even if you do not retrain your model in, say, daily batches, you still need to check the model's performance against new data; i.e. you need to cross-check whether a model trained on new data would classify in the same manner as the currently running model.

Another reason to monitor your model is that you cannot test for every behavior during model validation (if you could, you would not be using ML to solve the problem!). A new model trained on new data, or an online model that has been battered with bad input for a while, may perform abysmally once deployed. In such a case you need a way of reverting to a previous model. For offline learning this is often easy, as long as you stored the previous data. For online learning you will need to store snapshots of your model at certain intervals.

First of all, ask whether it makes sense to save the model at all. Often retraining on the most recent data, or on a previous data snapshot, at server startup is more convenient and even faster for some models (e.g. KNN).

Since we are working with Python we can simply use Python's default way of storing
serializable memory objects: `pickle`.
The default `dumps` and `loads` work on pretty much all `sklearn` models,
and, since NumPy arrays are serializable, `pickle`
works as well on most other ML libraries.
That said, `pickle` may result in quite bloated objects;
this was one of the reasons `joblib` was developed.
The `pickle` bloat comes from how it serializes the NumPy arrays inside a model,
which `joblib` handles much more efficiently by storing each array as a binary lump.
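Both round trips can be sketched side by side; the model and file name below are arbitrary:

```python
# Sketch: saving and restoring a trained sklearn model with pickle and joblib.
import os
import pickle
import tempfile

import joblib
from sklearn.datasets import load_iris
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)
model = KNeighborsClassifier().fit(X, y)

# pickle: an in-memory round trip through a byte string.
blob = pickle.dumps(model)
restored = pickle.loads(blob)

# joblib: an on-disk round trip, efficient for the NumPy arrays
# stored inside the fitted model.
path = os.path.join(tempfile.mkdtemp(), 'model.joblib')
joblib.dump(model, path)
restored2 = joblib.load(path)
```

Either way, the restored model predicts exactly as the original; note that a pickled/joblib'd model is only safe to load with the same library versions it was saved with.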