07.00 A Step Back

Let's take a step back and review the two main applications of supervised learning, and see why they are so useful and important to today's intelligent systems. We have already used ML, but let us repeat a couple of the most important concepts so that they stay firmly in our minds.

Figure: Vincent van Gogh (fe-van-gogh.svg)

Classification

Classification is the most common form of ML because many problems can be framed as classification, notably problems where we want to know whether something is or is not an instance of some category. Some examples:

Whether a transaction is or is not fraudulent
Whether an image contains a face or not
Whether an email is spam or not
The probability of an iceberg floating in each of three sea currents
Whether a patient, given the symptoms, may or may not have a certain illness
As a bigger example: in a recommendation engine, you may not want to calculate a score for every product against every customer. Instead, you can first classify products as interesting or not interesting at all, and then calculate a score only for the interesting ones
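
The recommendation engine idea in the last example can be sketched in a few lines. This is only an illustration: `is_interesting` and `score` are hypothetical stand-ins for a trained classifier and a trained scoring model, and the product and customer data are made up.

```python
def is_interesting(product, customer):
    # Hypothetical cheap classifier: a product is "interesting"
    # if the customer likes its category at all.
    return product["category"] in customer["liked_categories"]


def score(product, customer):
    # Hypothetical scorer, standing in for a more expensive model.
    return product["rating"] * customer["liked_categories"][product["category"]]


def recommend(products, customer):
    # Stage 1: classify, discarding products that are not interesting.
    candidates = [p for p in products if is_interesting(p, customer)]
    # Stage 2: score only the survivors and order them.
    ranked = sorted(candidates, key=lambda p: score(p, customer), reverse=True)
    return [p["name"] for p in ranked]


# Made-up data for illustration only.
customer = {"liked_categories": {"books": 0.9, "music": 0.5}}
products = [
    {"name": "novel", "category": "books", "rating": 4.0},
    {"name": "album", "category": "music", "rating": 5.0},
    {"name": "drill", "category": "tools", "rating": 4.8},
    {"name": "atlas", "category": "books", "rating": 3.0},
]

recommendations = recommend(products, customer)
print(recommendations)
```

The drill is never scored, which is the point: the classifier keeps the expensive step away from products that have no chance of being recommended.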

Regression

The final result of a regression is a model that outputs continuous numbers, and comparing these numbers is meaningful. This is the power of regression techniques: given a set of unseen samples, we can order them by their predicted values. Some examples of regression use:

Estimate the effect of a physics law (possibly proving or disproving a theory)
Estimate speed limits for roads
The amount of fish depending on season and weather
Risk estimation between different courses of action
In a recommendation engine, the score of recommended items (so we can order them)
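
To make the ordering idea concrete, here is a minimal sketch that fits a straight line by ordinary least squares and then ranks unseen items by their predicted value. The training points and the item names are made up, and a single input feature is assumed purely to keep the arithmetic visible.

```python
def fit_line(xs, ys):
    # Ordinary least squares for one feature: y = slope * x + intercept.
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


# Made-up training data (feature value, observed score).
train_x = [1.0, 2.0, 3.0, 4.0]
train_y = [3.0, 5.0, 7.0, 9.0]
slope, intercept = fit_line(train_x, train_y)

# Unseen items, described only by their feature value.
unseen = {"a": 2.5, "b": 0.5, "c": 4.0}
predicted = {name: slope * x + intercept for name, x in unseen.items()}

# Because the predictions are continuous, sorting them is meaningful.
ranked = sorted(predicted, key=predicted.get, reverse=True)
print(ranked)
```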

Evaluation

There is a plethora of ways to evaluate a supervised model, and it is not possible to create a single evaluation method for all models, because models trained on different data are inherently different. How you evaluate a model often depends more on the data you are working with than on the type of model being used.

In classification, the F1 score is a good general-purpose metric because it balances precision and recall, so a model cannot score well while missing most instances of a class. But it is not a panacea for every classification problem: fraud detection and medical diagnosis, for example, need a score in which false negatives weigh much more than false positives.
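
One common way to weigh false negatives more heavily, not named in the text above, is the F-beta score, which generalizes F1: with beta greater than 1, recall counts more than precision. The sketch below computes both from raw counts. The counts are made up for illustration.

```python
def f_beta(tp, fp, fn, beta=1.0):
    # Precision: of everything flagged positive, how much was right.
    # Recall: of everything truly positive, how much was found.
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    b2 = beta ** 2
    return (1 + b2) * precision * recall / (b2 * precision + recall)


# Made-up confusion counts: the model misses half of the true positives.
tp, fp, fn = 8, 2, 8

f1 = f_beta(tp, fp, fn, beta=1.0)  # precision and recall weighted equally
f2 = f_beta(tp, fp, fn, beta=2.0)  # recall (few false negatives) weighted more

print(f1, f2)
```

Here precision is 0.8 and recall is 0.5, so the recall-heavy F2 comes out lower than F1: the model is penalized more for the cases it missed.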

Also, most statistical fallacies apply to machine learning models. The fallacy that most commonly makes machine learning models fail is the base rate fallacy. It is very well illustrated by Alex Reinhart in his book Statistics Done Wrong (see link to full text below):

Suppose 0.8% of women who get mammograms have breast cancer. In 90% of women with breast cancer, the mammogram will correctly detect it. (That’s the statistical power of the test. This is an estimate, since it’s hard to tell how many cancers are missed if we don’t know they’re there.) However, among women with no breast cancer at all, about 7% will get a positive reading on the mammogram, leading to further tests and biopsies and so on. If you get a positive mammogram result, what are the chances you have breast cancer?

Ignoring the chance that you, the reader, are male, the answer is 9%.
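
The 9% figure follows from Bayes' theorem applied to the three numbers in the quote. A minimal sketch of the arithmetic (the variable names are ours, not Reinhart's):

```python
# Numbers taken from the quoted example.
prevalence = 0.008           # P(cancer)
sensitivity = 0.90           # P(positive | cancer)
false_positive_rate = 0.07   # P(positive | no cancer)

# Probability of a positive mammogram, with or without cancer.
true_positives = prevalence * sensitivity
false_positives = (1 - prevalence) * false_positive_rate
p_positive = true_positives + false_positives

# Bayes' theorem: P(cancer | positive).
posterior = true_positives / p_positive
print(f"{posterior:.1%}")
```

The healthy group is so much larger that its 7% of false alarms swamps the 90% of correct detections in the small group with cancer.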

This could not be more true for machine learning. More often than not, one is faced with datasets in which one class (normally the negative one, e.g. non-fraud, healthy) is far more prevalent than the others. It is often better to reduce the dataset and train on a sample where class populations do not differ by orders of magnitude. In other words, a dataset is often only as good (and as big) as the number of samples of its smallest class. You cannot train a classifier to recognize Van Gogh paintings by giving it 1 Van Gogh painting and 10 million non-Van Gogh paintings.
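
Reducing the dataset in this way is usually called undersampling the majority class. The following is a minimal sketch for a binary problem, assuming labels of 0 and 1. The function name, the `max_ratio` parameter, and the toy data are ours, chosen only for illustration.

```python
import random


def undersample_majority(rows, labels, max_ratio=1.0, seed=0):
    # Keep every minority sample and at most max_ratio times as many
    # majority samples, chosen at random.
    rng = random.Random(seed)
    ones = [i for i, y in enumerate(labels) if y == 1]
    zeros = [i for i, y in enumerate(labels) if y == 0]
    minority, majority = sorted((ones, zeros), key=len)
    keep = min(len(majority), int(len(minority) * max_ratio))
    chosen = minority + rng.sample(majority, keep)
    rng.shuffle(chosen)
    return [rows[i] for i in chosen], [labels[i] for i in chosen]


# Toy data: 10 positives among 1000 negatives.
labels = [1] * 10 + [0] * 1000
rows = list(range(len(labels)))

small_rows, small_labels = undersample_majority(rows, labels, max_ratio=3)
print(len(small_labels), sum(small_labels))
```

Note that the resulting sample no longer reflects the real base rate, so any evaluation should still be done on data with the original class proportions.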

SciKit Learn

We have used SciKit Learn for some machine learning tasks, but we have yet to outline the main objectives of the library. SciKit Learn is really a framework for building machine learning solutions, not just a collection of machine learning models. It started out as a SciKit (short for SciPy Toolkit), an extension package built on top of SciPy and developed separately from it.

Data preprocessing and model selection are the strong points of the SciKit Learn framework. You will spend most of your time (the usual rule of thumb is $90\%$) preprocessing data and evaluating models rather than hacking model-specific code, and sklearn attempts to automate as much of that work as possible.
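
As a small illustration of that automation, the sketch below chains a preprocessing step and a model into one pipeline and evaluates it with cross-validation. The synthetic dataset, the choice of scaler and classifier, and the F1 scoring are assumptions made for the example, not recommendations.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic binary classification data, only for demonstration.
X, y = make_classification(n_samples=200, n_features=5, random_state=0)

# Preprocessing and model are bundled, so the scaler is refit
# on the training part of every cross-validation fold.
model = make_pipeline(StandardScaler(), LogisticRegression())

# Five-fold cross-validation, scored with F1.
scores = cross_val_score(model, X, y, cv=5, scoring="f1")
print(scores.mean())
```

Bundling the scaler into the pipeline keeps information from the held-out fold from leaking into preprocessing, which is easy to get wrong when the steps are written by hand.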

References

Base Rate Fallacy - Statistics Done Wrong