06.02 - sklearn Introduction

The sklearn module in Python tries to standardize the way machine learning techniques are performed. Instead of technique specific setup and manual application of the resulting model, sklearn proposes a standard class-based API. Every ML technique in sklean uses the following steps to produce a single model.

  1. Every ML technique (model type) is a class that you can import
  2. You feed the hyperparameters to the constructor, a new instance of a model is bound to the hyperparameters passed in
  3. You call the fit method with the training data, which has a standardized form, see below
  4. For models that predict data you use the predict method to apply the model
  5. For models that transform the data you use transform method, some models have a fit_transform method which will fit and then transform the same data


Dover Castle is has stood in one form or another since before any records began. The have been defensive structures on Dover hills before the times of Romans in Britain. The castle is a good example of sometimes tense relations between neighbors across a small body of water, the English Channel. The Roman Lighthouse stands next to the church on the back hill. William the Conqueror took and re-fortified the castle. Henry II built the current keep, and the castle stood battle against the French. The castle has been again rebuilt by Henry VIII as part of the sea defenses of Britain. During Napoleonic Wars the outer areas of the castle defenses took the shape we see today. And even during World War II Dover Castle was used as an Admiralty base, due to the protection and good line of sight.

Input and Output

Injecting data and reading the outputs of a model can often be troublesome, figuring out in which format a model wants its parameters is no easy task. Developers of sklearn tried to sort the misery of different libraries requiring parameters in different formats, and created a standard way in which all sklearn models receive the input.

Model Call


At first one may argue that this will only work for two dimensions, but if you remember the ideas of group by and pivot we can see that we can represent several dimensions in a two dimensional table.


What the heck are those hyperparameters? These are simply free parameters that interfere in the inner workings of the algorithm. By free we mean that the algorithm works independently of what we set these parameters to. Yet, the values we give influence how well the resulting model performs against a specific dataset.

Therefore yes, the entire practice of machine learning is about finding the right model and the right model hyperparameters. This could not be harder though, the combination of several free parameters goes into millions or thousands of millions for some models.

As a simple example let's implement $k$ nearest neighbors, a model that has a single hyperparameter ($k$). The training (fit) of a $k$ nearest neighbors algorithm is trivial: the training just saves the training data so we can refer to it during prediction. This is a wasteful use of memory, and more advanced algorithms will be able to store specific measures about the data - which are fewer - instead of the entire training data. But for our purposes we argue that our trained model is a $k$ nearest neighbors object which know its value of $k$, from the constructor, and its training data, from the fit method.

We pass the hyperparameter as an argument to the constructor, then we pass the training data to a fit method, and we pass the new data to a predict method. We also leave the hyperparameter and parameters (the training data) in attributes that end with an underscore: k_, X_ and y_. There are all patterns of how a class representing an ML model behaves in sklearn.

In [2]:
import numpy as np

class MyKNN(object):
    def __init__(self, k=1):
        self.k_ = k

    def fit(self, X, y):
        self.X_ = X
        self.y_ = y

    def predict(self, new_X):
        sq_dist_dim = (self.X_[:, np.newaxis, :] - new_X[np.newaxis, :, :])**2
        sq_dist = sq_dist_dim.sum(axis=-1)
        self.nearest_ = np.argpartition(sq_dist, self.k_, axis=0)
        new_y = self.y_[self.nearest_[:self.k_, :]]
        new_y = np.apply_along_axis(lambda x: np.bincount(x, minlength=2), axis=0, arr=new_y)
        return new_y.argmax(axis=0)

kNN Distances


The prediction phase is where the work happens. We perform a series of broadcast operations: first we make the stored training data and the new data to be predicted into three dimensional cubes. But we perform this in a careful way so that when they broadcast over their difference each point of the training data has its difference taken against every point in the new provided data. This difference happens according to each coordinate of all points, and the end result is a cube with the difference between each coordinate as a full matrix of points. We then square the differences and sum across the coordinates to achieve the euclidean distance between all points. Note that we use only two dimensions in our implementation, we could use more dimensions but would require some more complex code to control the projections.

The heavy lifting in our model happen during argpartition, which sorts the points according to their distances from each point stored in the class. Here we finally need the value of $k$. Sorting is an expensive operation, if one can sort less values the code performs faster. If you remember that is what the partition procedure does, it sorts up to a number of elements (e.g. sorts the smaller first $3$ elements in a list, once we have the $3$ smallest values in the list and these are in the correct positions at the front the sorting algorithm ceases and the rest of the list is not sorted). Here we use partition to find the $k$ closest points in the training data from each point in the new data. Moreover, we use argpartition to get the index of these points instead of the points themselves.

With the index of the closest $k$ points we index the solutions - classes, labels - of the training data. Then we just bundle together the classes of all close points with bincount. The mincount=2 in bincount build an array of $2$ classes: class $0$ and class $1$, sand sums the occurrence of each in the class values of the $k$ neighbors. Finally, since the number of neighbors of class $0$ is at index $0$ and the number of neighbors of class $1$ is at index $1$, argmax returns us the most common class among the neighbors of the point.

Our implementation is imperfect, in case of a tie the algorithm prefers the class with the lowest index - class $0$. This is irrelevant if we use our algorithm as a binary classifier and with an odd $k$. But there are more complications: we can only deal with points in $2$ dimensions and can only classify $2$ classes. A full implementation will need to deal with the ties. It is an interesting challenge to improve our $k$ nearest neighbors implementation to deal with at least these limitations.