sklearn module in Python tries to standardize the way machine learning
techniques are performed. Instead of technique specific setup and manual application
of the resulting model,
sklearn proposes a standard class-based API.
Every ML technique in
sklean uses the following steps to produce a single model.
fitmethod with the training data, which has a standardized form, see below
predictmethod to apply the model
transformmethod, some models have a
fit_transformmethod which will fit and then transform the same data
Injecting data and reading the outputs of a model can often be troublesome,
figuring out in which format a model wants its parameters is no easy task.
sklearn tried to sort the misery of different libraries
requiring parameters in different formats, and created a standard way in which
sklearn models receive the input.
At first one may argue that this will only work for two dimensions, but if you remember the ideas of group by and pivot we can see that we can represent several dimensions in a two dimensional table.
What the heck are those hyperparameters? These are simply free parameters that interfere in the inner workings of the algorithm. By free we mean that the algorithm works independently of what we set these parameters to. Yet, the values we give influence how well the resulting model performs against a specific dataset.
Therefore yes, the entire practice of machine learning is about finding the right model and the right model hyperparameters. This could not be harder though, the combination of several free parameters goes into millions or thousands of millions for some models.
As a simple example let's implement $k$ nearest neighbors,
a model that has a single hyperparameter ($k$).
The training (
fit) of a $k$ nearest neighbors algorithm is trivial:
the training just saves the training data so we can refer to it during prediction.
This is a wasteful use of memory,
and more advanced algorithms will be able to store specific measures about
the data - which are fewer - instead of the entire training data.
But for our purposes we argue that our trained model
is a $k$ nearest neighbors object which know its value of $k$,
from the constructor, and its training data, from the
We pass the hyperparameter as an argument to the constructor,
then we pass the training data to a
and we pass the new data to a
We also leave the hyperparameter and parameters (the training data)
in attributes that end with an underscore:
There are all patterns of how a class representing an ML model
import numpy as np class MyKNN(object): def __init__(self, k=1): self.k_ = k def fit(self, X, y): self.X_ = X self.y_ = y def predict(self, new_X): sq_dist_dim = (self.X_[:, np.newaxis, :] - new_X[np.newaxis, :, :])**2 sq_dist = sq_dist_dim.sum(axis=-1) self.nearest_ = np.argpartition(sq_dist, self.k_, axis=0) new_y = self.y_[self.nearest_[:self.k_, :]] new_y = np.apply_along_axis(lambda x: np.bincount(x, minlength=2), axis=0, arr=new_y) return new_y.argmax(axis=0)
The prediction phase is where the work happens. We perform a series of broadcast operations: first we make the stored training data and the new data to be predicted into three dimensional cubes. But we perform this in a careful way so that when they broadcast over their difference each point of the training data has its difference taken against every point in the new provided data. This difference happens according to each coordinate of all points, and the end result is a cube with the difference between each coordinate as a full matrix of points. We then square the differences and sum across the coordinates to achieve the euclidean distance between all points. Note that we use only two dimensions in our implementation, we could use more dimensions but would require some more complex code to control the projections.
The heavy lifting in our model happen during
which sorts the points according to their distances from each point stored in the class.
Here we finally need the value of $k$.
Sorting is an expensive operation,
if one can sort less values the code performs faster.
If you remember that is what the
partition procedure does,
it sorts up to a number of elements
(e.g. sorts the smaller first $3$ elements in a list,
once we have the $3$ smallest values in the list and
these are in the correct positions at the front
the sorting algorithm ceases and the rest of the list is not sorted).
Here we use
partition to find the $k$ closest points in the
training data from each point in the new data.
Moreover, we use
argpartition to get the index of these
points instead of the points themselves.
With the index of the closest $k$ points we index
the solutions - classes, labels - of the training data.
Then we just bundle together the classes of all close points with
bincount build an array of $2$ classes:
class $0$ and class $1$, sand sums the occurrence of each
in the class values of the $k$ neighbors.
Finally, since the number of neighbors of class $0$ is at index $0$
and the number of neighbors of class $1$ is at index $1$,
argmax returns us the most common class among the neighbors of the point.
Our implementation is imperfect, in case of a tie the algorithm prefers the class with the lowest index - class $0$. This is irrelevant if we use our algorithm as a binary classifier and with an odd $k$. But there are more complications: we can only deal with points in $2$ dimensions and can only classify $2$ classes. A full implementation will need to deal with the ties. It is an interesting challenge to improve our $k$ nearest neighbors implementation to deal with at least these limitations.