# 10.00 Real World Problems

Just so we have this in one place, let's group together all the higher level issues that concern machine learning models. These issues come up almost every time we want to use a machine learning model to solve a real world problem (i.e. not a toy problem). We have already discussed several of them, but a list works better as a reference. We will also go a bit deeper into the mathematical terms of some topics; these terms will be needed when we approach online learning.

The bias vs variance trade-off argues that a model that is not complex enough will underfit the data, and a model that is too complex will overfit the data. We control model complexity through model hyperparameters, and we can estimate a good complexity by trying several hyperparameters and cross-validating their performance. The difficult part of the art of machine learning is not making a model work, it is proving that it works and that it will work for new data. Moreover, depending on the problem we may want to validate a model for different things, e.g. in a fraud detection model we want the recall of fraud data points to be the most important validation metric.
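As a rough illustration of validating for a specific metric, the sketch below cross-validates the same model once for accuracy and once for recall. The data is synthetic and imbalanced (an assumption standing in for fraud data), and `LogisticRegression` is just a convenient stand-in model.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

# Synthetic, imbalanced data: roughly 5% of the samples belong to the
# positive class, which plays the role of "fraud" here (an assumption).
X, y = make_classification(
    n_samples=2000, n_features=10, weights=[0.95], random_state=0
)

model = LogisticRegression(max_iter=1000)

# Same model, same folds, two different validation metrics.
accuracy = cross_val_score(model, X, y, cv=5, scoring="accuracy")
recall = cross_val_score(model, X, y, cv=5, scoring="recall")

print("mean accuracy:", round(accuracy.mean(), 3))
print("mean recall:  ", round(recall.mean(), 3))
```

On imbalanced data the accuracy can look good even when many positive samples are missed, which is why the metric should follow the problem.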

If your model will work with new data, just cross-validating it is not enough. Cross-validation allows us to select the best hyperparameters, and gives us a good estimate of how well a model performs; but it does not give us an estimate of how badly our model can perform on new data, i.e. we do not have a generalization baseline.

To estimate how our model performs against new data, we need to separate our data into training and test sets, and only then perform cross-validation on the training set alone. The resulting model's generalization can then be evaluated on the test set. In other words, we now have a test set, and several folds which form the training and validation sets. This ensures that the model sees only the training set during the tuning of its parameters, and sees only the training and validation sets during the tuning of its hyperparameters. In sklearn the test set splitting and cross-validation are done with similar procedures. In other libraries, notably neural network libraries, the training (fitting) of the model will typically perform (cross) validation as part of the training. As the brackets suggest, it may not be a full cross-validation, depending on the library. Do not get fooled by examples using such libraries in which one does not worry about the validation: the (cross) validation happens behind the scenes in many ML libraries.
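The procedure described above can be sketched in sklearn as follows. The dataset, the model and the hyperparameter grid are illustrative choices, not the only reasonable ones.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

X, y = load_breast_cancer(return_X_y=True)

# Hold out a test set first; it is never touched during tuning.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)

# Cross-validate on the training set only to pick the hyperparameters.
model = make_pipeline(StandardScaler(), SVC())
grid = GridSearchCV(model, {"svc__C": [0.1, 1, 10]}, cv=5)
grid.fit(X_train, y_train)

cv_score = grid.best_score_              # mean validation accuracy over the folds
test_score = grid.score(X_test, y_test)  # generalization estimate on unseen data

print(grid.best_params_, round(cv_score, 3), round(test_score, 3))
```

The scaler lives inside the pipeline so that it is refitted on each training fold and never sees validation or test samples.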

## Scaling Data

Machine learning algorithms work on numbers, and assume that if a number is bigger it means that this number is more important. Yet, that is often not what we actually want. Most models are sensitive to the magnitude of the features, and scaling the features to have a similar magnitude will, more often than not, give better results. Borrowing from sklearn, three common ways of scaling features are:

• StandardScaler subtracts the mean and then divides by the standard deviation to achieve mean zero and variance one for all features.

• MinMaxScaler forces all features to have values between zero and one.

• Normalizer considers a sample to be a vector in as many dimensions as there are features and then normalizes each sample vector to unit length.

We have indeed used several of these, or written equivalent code.
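A small sketch of the three scalers on a toy array (the numbers are made up for illustration). Note that the first two work per feature (column) while `Normalizer` works per sample (row).

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler, Normalizer, StandardScaler

# Two features with very different magnitudes.
X = np.array([[1.0, 100.0], [2.0, 300.0], [3.0, 500.0]])

standard = StandardScaler().fit_transform(X)
minmax = MinMaxScaler().fit_transform(X)
unit = Normalizer().fit_transform(X)

# Equivalent hand-written standardization: (x - mean) / standard deviation.
by_hand = (X - X.mean(axis=0)) / X.std(axis=0)

print(standard.mean(axis=0), standard.std(axis=0))  # per feature: mean 0, std 1
print(minmax.min(axis=0), minmax.max(axis=0))       # per feature: min 0, max 1
print(np.linalg.norm(unit, axis=1))                 # per sample: length 1
```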

## Ensembles, Voting and OVO vs OVR

Ensemble methods are powerful. Depending on how you set up the ensemble, it can boost the performance of models or work around limitations of certain models.

A voting technique (akin to, but not limited to, random forests or AdaBoost) takes several models, trains each and predicts by majority vote across all models. The internal models are often trained on subsets of the data, or with different randomization and hyperparameters. This way one can increase the performance and generalization of the models, and make up for models which suffer from in-built overfitting.
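A minimal sketch of majority voting using sklearn's `VotingClassifier`. The three member models and the iris dataset are arbitrary choices made for illustration.

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import VotingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)

# Three different models; "hard" voting predicts the majority class.
voter = VotingClassifier(
    estimators=[
        ("logreg", LogisticRegression(max_iter=1000)),
        ("knn", KNeighborsClassifier(n_neighbors=5)),
        ("tree", DecisionTreeClassifier(max_depth=3, random_state=0)),
    ],
    voting="hard",
)

scores = cross_val_score(voter, X, y, cv=5)
print("mean accuracy:", round(scores.mean(), 3))
```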

One vs One (OVO) and One vs Rest (OVR) techniques are used to allow binary classifiers to perform multiclass classification. OVO trains a classifier for every pair of classes; this means that a large number of binary classifiers (k(k-1)/2 for k classes) will be trained on subsets of data (only the samples for the two classes). OVO then runs all competing classifiers and decides on the class with the most wins. In OVR (also called OVA, One vs All) the number of trained classifiers is the number of classes; each classifier is trained on the samples of one class as the positive class and all other samples as the negative class. OVR then selects the answer by picking the class with the highest probability.
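To make the difference in the number of trained classifiers concrete, here is a sketch using sklearn's `OneVsOneClassifier` and `OneVsRestClassifier` wrappers around a linear SVM. The five-class dataset is synthetic.

```python
from sklearn.datasets import make_classification
from sklearn.multiclass import OneVsOneClassifier, OneVsRestClassifier
from sklearn.svm import SVC

# Synthetic dataset with 5 classes.
X, y = make_classification(
    n_samples=500,
    n_features=6,
    n_informative=4,
    n_redundant=2,
    n_classes=5,
    n_clusters_per_class=1,
    random_state=0,
)

ovo = OneVsOneClassifier(SVC(kernel="linear")).fit(X, y)
ovr = OneVsRestClassifier(SVC(kernel="linear")).fit(X, y)

print(len(ovo.estimators_))  # one classifier per pair of classes: 5 * 4 / 2 = 10
print(len(ovr.estimators_))  # one classifier per class: 5
```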

## Probabilities and Decision Functions

Being able to explain why your model classifies things the way it does may or may not be important for the problem you are solving. Classification (and often regression) can be performed in two ways: by constructing a crisp decision function and deciding upon classes/values based on distance to this function; or by assigning probabilities to each class/value and deciding based on the highest probability.

These two methods are not necessarily opposed to each other. Probabilities are still distances to a decision function, but they can be weighted by some density, i.e. distances across a higher density weigh more. The opposite is also true: one can estimate probabilities based on the distance from a crisp decision function, and that is what many algorithms do. Note that this may mean that the probabilities are just estimates. The quality of probability estimation varies across models, and there are techniques to better calibrate such probability estimates.
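As one possible illustration, a linear SVM in sklearn exposes only a decision function, and probabilities can be estimated on top of it with `CalibratedClassifierCV`. The sigmoid method used here is one calibration technique among several, and the data is synthetic.

```python
from sklearn.calibration import CalibratedClassifierCV
from sklearn.datasets import make_classification
from sklearn.svm import LinearSVC

X, y = make_classification(n_samples=500, n_features=8, random_state=0)

# A crisp decision function: the sign of the score decides the class.
svm = LinearSVC().fit(X, y)
distances = svm.decision_function(X[:5])
labels = svm.predict(X[:5])

# Probabilities estimated from the decision function scores by fitting
# a sigmoid on held-out folds.
calibrated = CalibratedClassifierCV(LinearSVC(), method="sigmoid", cv=5).fit(X, y)
probabilities = calibrated.predict_proba(X[:5])

print(distances)
print(probabilities)
```

The probabilities here are estimates derived from distances, so they are only as good as the calibration step.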

## Online Learning

One thing we have not touched yet is the concept of online learning, so called to distinguish it from offline learning. In offline learning we can work at once with all the data we will ever feed into model training, i.e. we can load a dataset in memory, use cross-validation to tune a model over this dataset and test against a test set. Offline learning is incredibly common in data analysis.

The problem starts when we plan to build a model (say, a classifier) on top of data that is continuously entering or flowing through the system. We never have the full dataset in such a case; we may have all data until today, but even that will not necessarily be complete. What we need is a model that can learn and re-learn from new data: an online learning model.

To perform online learning we need to be capable of tuning model parameters to new data without looking back at the previous data. In other words, the model parameters must represent the data seen until now, and changing such a parameter slightly must not affect the overall model too much. Decision functions, when cleverly parametrized, are much easier to re-tune to new data. A small change to a parameter in the decision function can bring the model closer to new data with very little computational effort.

But how small a change? That answer depends on the problem. The size of this small (or not so small) change is called the learning rate of an online learning model. A big learning rate will make the model forget about old data quickly; a very small learning rate will give the model a lot of inertia when adapting to new data.
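A minimal sketch of this idea with sklearn's `SGDClassifier`, whose `partial_fit` method updates a linear decision function one batch at a time. The batch size and the constant learning rate `eta0=0.01` are arbitrary assumptions, and the stream of data is simulated by slicing a synthetic dataset.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import SGDClassifier

X, y = make_classification(n_samples=3000, n_features=10, random_state=0)
classes = np.unique(y)  # all classes must be declared up front for partial_fit

# A constant learning rate: every update moves the parameters by the same scale.
model = SGDClassifier(learning_rate="constant", eta0=0.01, random_state=0)

# Simulate data arriving in batches; each batch is seen once and then dropped.
for start in range(0, 2500, 500):
    batch = slice(start, start + 500)
    model.partial_fit(X[batch], y[batch], classes=classes)

# Evaluate on samples that the model has not seen.
holdout_accuracy = model.score(X[2500:], y[2500:])
print("holdout accuracy:", round(holdout_accuracy, 3))
```

Trying a few values of `eta0` on a real stream is the practical way to find out how quickly the model should forget old data.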

### Do you need Online Learning?

There is a big cost in achieving actual online learning: one must define and test a learning rate, and several models can only use a subset of their capabilities as online learning models (e.g. SVMs can only use the linear kernel because it has a tunable decision function). And in most real world problems a model does not receive data continuously enough to justify online learning.

Most machine learning problems will receive data at known intervals, e.g. a daily snapshot of a database or a time series of the last 30 days of trades. This means that you can retrain your model with the new data every time you receive it. You may need to slightly re-tune the hyperparameters, but the grid search can stay close to the current hyperparameters, since the new data is unlikely to be very different from the old data.

Retraining a new model at certain intervals does not need to create downtime: you can automate the training and, only when the training and validation finish, point a load balancer to the new model. Training a new model at every batch of data also allows you to test it for obvious inconsistencies. Since most machine learning is performed as a service (a trained model sits in memory, waits for input and sends back predictions), a newly trained model can be easily compared to the currently deployed model. Training and testing a new model at server startup is a completely valid and often used technique.
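The comparison between a newly trained model and the deployed one could look something like the sketch below. The `should_promote` helper and its `tolerance` parameter are hypothetical names invented for this illustration; a real deployment would likely check more than a single score.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split


def should_promote(current, candidate, X_val, y_val, tolerance=0.01):
    """Hypothetical check: promote unless the candidate is clearly worse."""
    current_score = current.score(X_val, y_val)
    candidate_score = candidate.score(X_val, y_val)
    return bool(candidate_score >= current_score - tolerance)


X, y = make_classification(n_samples=1500, n_features=10, random_state=0)
X_fit, X_val, y_fit, y_val = train_test_split(X, y, test_size=0.2, random_state=0)

# The "deployed" model saw only the data available at the previous interval.
current = LogisticRegression(max_iter=1000).fit(X_fit[:600], y_fit[:600])

# The candidate is retrained on everything received so far.
candidate = LogisticRegression(max_iter=1000).fit(X_fit, y_fit)

promote = should_promote(current, candidate, X_val, y_val)
print("point the load balancer to the new model:", promote)
```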

In summary, you will only need an online learning model if either:

• The entire dataset cannot fit in memory; in this case you need online learning to train the model using parts of the dataset at a time.

• You need very quick adaptation to new data, e.g. slow algorithmic trading (fast algorithmic trading is performed by network cards).

## Models Rot

An ML model is, as the name suggests, a mathematical technique which estimates the behavior of the real world. Unfortunately (fortunately?) the real world changes, and if our model does not change in response it will soon perform worse and worse.

Of course, this is not relevant to self-contained datasets, such as the ones seen in competitions or in toy problems. But most models are expected to work on real data. Recent data may have new trends, and a performance estimate made without those trends will be overoptimistic. In other words, the performance of your model will decrease over time if you do not update it.

Since we are working with Python we can simply use Python's default way of storing serializable in-memory objects: pickle. The default dumps and loads work on pretty much all sklearn models and, since NumPy arrays are serializable, pickle works as well on most other ML libraries. That said, pickle may result in quite bloated objects; this was one of the reasons joblib was developed. The bloat comes from the way pickle handles large NumPy arrays, which joblib stores much more efficiently as a binary lump.
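A short sketch of both approaches, assuming `joblib` is installed (it is a dependency of sklearn). The random forest and the file name `model.joblib` are arbitrary choices for the example.

```python
import os
import pickle
import tempfile

import joblib
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier

X, y = load_iris(return_X_y=True)
model = RandomForestClassifier(n_estimators=20, random_state=0).fit(X, y)

# In-memory round trip with pickle's dumps and loads.
restored = pickle.loads(pickle.dumps(model))

# On-disk round trip with joblib, written to a temporary directory.
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "model.joblib")
    joblib.dump(model, path)
    from_disk = joblib.load(path)

# Both restored models should predict exactly like the original.
expected = model.predict(X)
same = bool(
    (restored.predict(X) == expected).all()
    and (from_disk.predict(X) == expected).all()
)
print("restored models agree with the original:", same)
```

Both formats can execute arbitrary code when loaded, so only load model files that come from a source you trust.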