The sklearn
module in Python tries to standardize the way machine learning
techniques are performed. Instead of technique-specific setup and manual application
of the resulting model, sklearn
proposes a standard class-based API.
Every ML technique in sklearn uses the following methods to build and apply a model:

- a fit method, which receives the training data in a standardized form (see below);
- a predict method, which applies the resulting model to new data;
- a transform method; some models also have a fit_transform method, which will fit and then transform the same data.

Injecting data and reading the outputs of a model can often be troublesome: figuring out in which format a model wants its parameters is no easy task.
The developers of sklearn tried to sort out the misery of different libraries
requiring parameters in different formats, and created a standard way in which
all sklearn models receive their input.
At first one may argue that this will only work for two dimensions, but if we remember the ideas of group by and pivot, we can see that several dimensions can be represented in a two dimensional table.
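The standard form is a two dimensional table: X has one row per sample and one column per feature, and y has one value per sample. A minimal sketch follows; the numbers are made up and LogisticRegression is used only as an example, any sklearn model would do.

import numpy as np
from sklearn.linear_model import LogisticRegression

# X: one row per sample, one column per feature; y: one label per sample
X = np.array([[1.0, 2.0], [2.0, 1.0], [8.0, 9.0], [9.0, 8.0]])
y = np.array([0, 0, 1, 1])

model = LogisticRegression()   # hyperparameters would go here
model.fit(X, y)                # learn the model parameters from X and y
print(model.predict(np.array([[1.5, 1.5], [8.5, 8.5]])))  # apply the model to new samples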
What the heck are those hyperparameters? They are simply free parameters that tune the inner workings of the algorithm. By free we mean that the algorithm works no matter what we set these parameters to; yet the values we give do influence how well the resulting model performs against a specific dataset.

Therefore yes, the entire practice of machine learning is about finding the right model and the right model hyperparameters. That is no easy task though: the combinations of several free parameters run into the millions, or even thousands of millions, for some models.
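As a small illustration of what a search over a single hyperparameter might look like (a sketch only, using sklearn's bundled iris dataset and KNeighborsClassifier, whose n_neighbors hyperparameter is the $k$ we implement below):

from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)
# evaluate the same model with a handful of hyperparameter values
for k in [1, 3, 5, 7, 9]:
    scores = cross_val_score(KNeighborsClassifier(n_neighbors=k), X, y, cv=5)
    print(k, scores.mean())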
As a simple example let's implement $k$ nearest neighbors,
a model that has a single hyperparameter ($k$).
The training (fit
) of a $k$ nearest neighbors algorithm is trivial:
it simply stores the training data so we can refer to it during prediction.
This is a wasteful use of memory,
and more advanced algorithms manage to store specific measures about
the data - which are far fewer - instead of the entire training data.
But for our purposes we argue that our trained model
is a $k$ nearest neighbors object which knows its value of $k$,
from the constructor, and its training data, from the fit
method.
We pass the hyperparameter as an argument to the constructor,
then we pass the training data to a fit
method,
and we pass the new data to a predict
method.
We also leave the hyperparameter and parameters (the training data)
in attributes that end with an underscore: k_
, X_
and y_
.
All of these are patterns of how a class representing an ML model
behaves in sklearn.
import numpy as np

class MyKNN(object):
    def __init__(self, k=1):
        # the hyperparameter: the number of neighbors to consider
        self.k_ = k

    def fit(self, X, y):
        # training only stores the training data
        self.X_ = X
        self.y_ = y

    def predict(self, new_X):
        # squared difference between every training point and every new point,
        # coordinate by coordinate, via broadcasting
        sq_dist_dim = (self.X_[:, np.newaxis, :] - new_X[np.newaxis, :, :])**2
        # sum over the coordinates: squared euclidean distances
        sq_dist = sq_dist_dim.sum(axis=-1)
        # partition the training points so that the k closest come first
        self.nearest_ = np.argpartition(sq_dist, self.k_, axis=0)
        # labels of the k closest training points for each new point
        new_y = self.y_[self.nearest_[:self.k_, :]]
        # count the occurrences of class 0 and class 1 among the neighbors
        new_y = np.apply_along_axis(lambda x: np.bincount(x, minlength=2), axis=0, arr=new_y)
        # the most common class wins
        return new_y.argmax(axis=0)
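Before dissecting predict, a quick sanity check on a handful of made-up points (the values and labels below are arbitrary):

# two well separated groups of training points, labeled 0 and 1
X = np.array([[0.0, 0.0], [0.0, 1.0], [1.0, 0.0],
              [5.0, 5.0], [5.0, 6.0], [6.0, 5.0]])
y = np.array([0, 0, 0, 1, 1, 1])

model = MyKNN(k=3)
model.fit(X, y)
print(model.predict(np.array([[0.5, 0.5], [5.5, 5.5]])))  # expected: [0 1]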
The prediction phase is where the work happens. We perform a series of broadcast operations: first we reshape the stored training data and the new data to be predicted into three dimensional cubes, and we do so carefully so that, when the difference between them broadcasts, every point of the training data has its difference taken against every point of the new data. The difference is taken coordinate by coordinate, and the result is a cube holding the per-coordinate differences between every pair of points. We then square these differences and sum across the coordinates to obtain the squared euclidean distance between all pairs of points (the square root is unnecessary, since it does not change which points are closest). Note that we use only two dimensions in our implementation; we could use more dimensions but that would require some more complex code to control the projections.
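The shapes involved in the broadcast are perhaps easier to see in isolation (the sizes below are arbitrary):

X = np.random.rand(4, 2)      # 4 training points with 2 coordinates each
new_X = np.random.rand(3, 2)  # 3 new points to classify
diff = X[:, np.newaxis, :] - new_X[np.newaxis, :, :]
print(diff.shape)                    # (4, 3, 2): every training point against every new point
print((diff**2).sum(axis=-1).shape)  # (4, 3): squared distances between all pairs of points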
The heavy lifting in our model happens during argpartition,
which partitions the training points stored in the class
according to their distance to each new point.
Here we finally need the value of $k$.
Sorting is an expensive operation:
if one can sort fewer values, the code runs faster.
If you remember, that is what the partition
procedure does:
it sorts only up to a number of elements
(e.g. it moves the $3$ smallest elements of a list to the front;
once the $3$ smallest values sit at the front
the algorithm stops and the rest of the list is left unsorted).
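A tiny demonstration of partition and argpartition on their own (the array is arbitrary):

a = np.array([7, 2, 9, 1, 5, 3])
print(np.partition(a, 3))         # the 3 smallest values end up at the front, the rest is left unsorted
print(np.argpartition(a, 3)[:3])  # indices of those 3 smallest values in the original array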
In our model we use partition to find the $k$ closest training
points to each point in the new data.
Moreover, we use argpartition to get the indices of these
points instead of the points themselves.
With the indices of the $k$ closest points we index
the solutions - the classes, or labels - of the training data.
Then we just bundle together the classes of all the close points with bincount.
The minlength=2
in bincount
builds an array of $2$ classes:
class $0$ and class $1$, and sums the occurrences of each
among the class values of the $k$ neighbors.
Finally, since the number of neighbors of class $0$ is at index $0$
and the number of neighbors of class $1$ is at index $1$,
argmax
returns us the most common class among the neighbors of the point.
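In isolation, the counting step looks like this (the neighbor labels below are made up):

neighbor_labels = np.array([1, 0, 1])               # labels of the k = 3 nearest neighbors of one point
counts = np.bincount(neighbor_labels, minlength=2)  # array([1, 2]): one neighbor of class 0, two of class 1
print(counts.argmax())                              # 1, the most common class among the neighbors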
Our implementation is imperfect: in case of a tie the algorithm prefers the class with the lowest index, class $0$. This is irrelevant as long as we use our algorithm as a binary classifier with an odd $k$. But there are more complications: we can only deal with points in $2$ dimensions and can only classify $2$ classes. A full implementation would need to deal with the ties. It is an interesting challenge to improve our $k$ nearest neighbors implementation to deal with at least these limitations.