06.04 - Labels

We went through a classification example but left several loose ends. Machine Learning algorithms work only on numbers, and that is also true for the classes they predict. We did not predict colors with our $k$ nearest neighbors algorithm, we predicted the index of the color - in that case it was either $0$ or $1$. In order to predict anything with an ML algorithm we need to encode the classes we want to predict as indexes. sklearn has tools to help us in this endeavor.

In [1]:
import pandas as pd
from sklearn.preprocessing import LabelEncoder
from sklearn import metrics

The LabelEncoder works in a similar fashion to an sklearn model and performs the encoding of classes as indexes. To see label encoding in action we will work on a classification of vehicles. Based on its length (in meters) and its weight (in kilograms), a vehicle in the following dataset can be classified as a car or a ship.

In [2]:
df = pd.DataFrame(dict(
    length=[3.0, 5.0, 320.0, 3.2, 250.0, 4.0],
    weight=[1200, 2500, 25000000, 1500, 1000000, 1000],
    vehicle=['car', 'car', 'ship', 'car', 'ship', 'car']))
df
Out[2]:
length weight vehicle
0 3.0 1200 car
1 5.0 2500 car
2 320.0 25000000 ship
3 3.2 1500 car
4 250.0 1000000 ship
5 4.0 1000 car

The actual classification task is trivial: ships are considerably longer and heavier than cars. Our interest is not in the classification itself but in the way of encoding these words ("car" and "ship") in order to feed them to a machine learning algorithm, and moreover, in order to evaluate the quality of the ML algorithm used.

The LabelEncoder stores state and has fit and transform methods, just like an sklearn model. This is a common pattern in sklearn: tools for data preprocessing will fit to the data and then transform it.

In [3]:
enc = LabelEncoder()
enc.fit(df.vehicle)
enc.transform(df.vehicle)
Out[3]:
array([0, 0, 1, 0, 1, 0])

Label Buckets

skl-labels.svg
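
The fitted encoder remembers the mapping between labels and indexes in its classes_ attribute: the position of each class in that array is the index it is encoded to. A quick way to inspect the mapping, as a small sketch reusing the enc fitted above:

# the classes in the order they were encoded: index 0 is 'car', index 1 is 'ship'
enc.classes_
# expected: array(['car', 'ship'], dtype=object)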

The reason for splitting the fitting and transforming stages, in a task as simple as enumerating a list of two values, is that the transformation can be reused.

For example, if new data becomes available on top of the set we have defined, we can simply apply the transformation directly to the new data.

In [4]:
enc.transform(['car', 'ship', 'ship'])
Out[4]:
array([0, 1, 1])
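
The mapping also works in the opposite direction. A fitted LabelEncoder has an inverse_transform method that turns indexes back into the original labels, which is useful when presenting predictions to humans. A minimal sketch, reusing the encoder fitted above:

# recover the original string labels from the encoded indexes
enc.inverse_transform([0, 1, 1])
# expected: array(['car', 'ship', 'ship'], dtype=object)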

Let us now see why working only with numbers can make things easier. We copy the set of correct labels and transform them into indexes. We call these correct labels y_true.

Then we imagine that a prediction has been made: we transform the labels of the prediction we made up and call it y_hat. The name y_hat comes from the convention of calling predictions $\hat{y}$ in statistical modeling. The contents of y_hat are in the same format as an ML model would have spat them out.

In [5]:
y_true = enc.transform(['car', 'car', 'ship', 'car', 'ship', 'car'])
y_hat = enc.transform(['car', 'ship', 'ship', 'car', 'ship', 'car'])
pd.DataFrame(dict(y_true=y_true, y_hat=y_hat))
Out[5]:
y_true y_hat
0 0 0
1 0 1
2 1 1
3 0 0
4 1 1
5 0 0

We used a handful of scores when attempting to find the best hyperparameter value for our model on the blue and yellow points. Some of the scores only work for two classes, $0$ and $1$, whilst others work for many classes. This division is mirrored in the classifiers themselves: there are binary classifiers that only distinguish between two classes, and multiclass classifiers that can deal with many classes at once. We will come back to these concepts eventually; for now let us just argue that the vast majority of real life problems are binary. Is this river above the pollution threshold? Is this transaction a fraud?

There are techniques to make binary classifiers work for several classes and, eventually, we will look at these techniques. For now let us use the binary case of cars and ships. In order to understand the next couple of scores we first need to rephrase our classification problem: we are not "identifying cars and ships", we are "identifying ships among cars". The change in phrasing means that only the ship identification is important, the value of $1$ after encoding. This is often called the positive class.

With the positive class defined we can now define a prediction that is:

  • True Positive (TP), as a prediction that is $1$ and the true label is $1$.
  • False Positive (FP), as a prediction that is $1$ but the true label is $0$.
  • True Negative (TN), as a prediction that is $0$ and the true label is $0$.
  • False Negative (FN), as a prediction that is $0$ but the true label is $1$.

In our y_hat we have $TP = 2$, $FP = 1$, $TN = 3$ and $FN = 0$. With that in mind we can look at the scores.
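
These four counts can also be read off sklearn's confusion matrix. For a binary problem the matrix is a two by two table with the true labels on the rows and the predictions on the columns, and flattening it with ravel gives the counts in the order TN, FP, FN, TP. A small sketch, assuming the y_true and y_hat defined above:

# confusion_matrix returns [[TN, FP], [FN, TP]] for binary labels
tn, fp, fn, tp = metrics.confusion_matrix(y_true, y_hat).ravel()
tn, fp, fn, tp
# expected: (3, 1, 0, 2)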

precision

$$P = \frac{true\:positives}{true\:positives + false\:positives}$$

Precision asks whether, once we identify something as a positive, we can be confident that it really is a positive; in our case, whether once we identify a ship we can be confident that it is a ship. Note that for a perfect precision score all that is needed is to correctly identify one ship and say that everything else is a car. If we identify one ship correctly and no other ships, then all ships we identified are indeed ships; any ships we may have missed do not matter.

In [6]:
metrics.precision_score(y_true, y_hat)
Out[6]:
0.6666666666666666

We can replicate the same calculation by hand.

In [7]:
tp = 2
fp = 1
tp / (tp+fp)
Out[7]:
0.6666666666666666

recall

$$R = \frac{true\:positives}{true\:positives + false\:negatives}$$

Recall is to some extent the opposite of precision. It asks whether we missed any ships, i.e. if we fail to classify any ship as a ship then the recall score deteriorates. Note that a perfect recall score can be achieved by classifying everything as a ship. We certainly do not miss any ships that way but it is not a useful model.

In [8]:
metrics.recall_score(y_true, y_hat)
Out[8]:
1.0

And again we can replicate the calculation by hand.

In [9]:
tp = 2
fn = 0
tp / (tp+fn)
Out[9]:
1.0

F1 score

$$F_1 = \frac{2}{\frac{1}{P} + \frac{1}{R}}=2 \frac{P \cdot R}{P + R}$$

The F1 score is one of the scores we have used until now. The score is the harmonic mean between precision ($P$) and recall ($R$). The F1 score does not allow either precision or recall to go too low, so the final result is a reasonable balance between recall and precision.

In [10]:
metrics.f1_score(y_true, y_hat)
Out[10]:
0.8

And once again we replicate by ourselves.

In [11]:
tp = 2
fp = 1
fn = 0
p = tp / (tp+fp)
r = tp / (tp+fn)
2/(1/p + 1/r)
Out[11]:
0.8

Accuracy Score

$$ \frac{1}{N} \sum_{i=1}^{N} 1 (y_i = \hat{y}_i) $$

The accuracy score is the other score we have used, and it is also the default score used by sklearn for classifiers. The indicator $1(y_i = \hat{y}_i)$ evaluates to $1$ if the prediction and the true label are the same and to $0$ otherwise.

In [12]:
metrics.accuracy_score(y_true, y_hat)
Out[12]:
0.8333333333333334

And as we have been doing, we evaluate the same value by hand, in order to prove to ourselves that sklearn is not doing anything out of the ordinary. We got $5$ out of the $6$ labels right, hence.

In [13]:
(y_true == y_hat).sum()/len(y_hat)
Out[13]:
0.8333333333333334

The accuracy score can deal with more than two classes. If we attempt to classify cars, ships and planes in the same dataset, the accuracy score will be a good score to evaluate the model. Here we do not need any clever phrasing or a positive class, we only need the classes and whether the labels match the predictions.

That said, the accuracy score is often quite overoptimistic about model performance. Next we quickly build an encoder for $3$ classes and evaluate a case in which we get $6$ out of $7$ predictions right.

In [14]:
enc = LabelEncoder()
enc.fit(['car', 'ship', 'plane'])
y_true = enc.transform(['car', 'car', 'ship', 'car', 'ship', 'car', 'plane'])
y_hat = enc.transform(['car', 'ship', 'ship', 'car', 'ship', 'car', 'plane'])
metrics.accuracy_score(y_true, y_hat), (y_true == y_hat).sum()/len(y_hat)
Out[14]:
(0.8571428571428571, 0.8571428571428571)
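
The confusion matrix also generalizes to more than two classes and shows what the single accuracy number hides, namely which classes get mistaken for which. A quick sketch for the three class example above, where rows are true classes and columns are predictions, in the order given by enc.classes_ (car, plane, ship):

# one car was predicted as a ship, everything else is on the diagonal
metrics.confusion_matrix(y_true, y_hat)
# expected:
# array([[3, 0, 1],
#        [0, 1, 0],
#        [0, 0, 2]])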

Other common scoring functions include log_loss for evaluating probabilities and the roc_auc_score for the area under the ROC curve. There are often reasons to use one scoring function over another. For example, in a medical test or in a lawsuit classifier one would want to prioritize recall. This is because in medicine we want a test which almost certainly will identify a sick patient, even if there are some non-negligible false positives. And in law one cannot miss relevant cases, even if some false positives creep in.
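
Both of these scores operate on predicted probabilities rather than on hard class labels. A minimal sketch of how they are called follows; the probability values and the y_prob and y_true_bin names are made up purely for illustration:

# made-up probabilities of the positive class for six binary samples
y_true_bin = [0, 0, 1, 0, 1, 0]
y_prob = [0.1, 0.6, 0.9, 0.2, 0.8, 0.3]

# log_loss heavily penalizes confident but wrong probabilities
metrics.log_loss(y_true_bin, y_prob)
# roc_auc_score is the area under the ROC curve traced by thresholding y_prob
metrics.roc_auc_score(y_true_bin, y_prob)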

Another thing to note is that a metric named a score has its best value as a high value - e.g. a good classifier scores close to $1$ - and a bad value as a low one. A loss or an error, on the other hand, has its best value at $0$.
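
One way to see the difference is to compare the accuracy score with sklearn's zero_one_loss, which is the fraction of misclassified samples; for the same predictions the two add up to $1$. A small sketch with the three class labels from above:

# a perfect classifier has accuracy 1 and zero-one loss 0
metrics.accuracy_score(y_true, y_hat), metrics.zero_one_loss(y_true, y_hat)
# expected: roughly (0.857, 0.143)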

In the model we have been using there is only a single free parameter, and it is integer valued, which makes it easy to select by hand. Some models have dozens of real valued parameters, making the search for optimal parameters much harder. We will see that when we look at such models.