The `sklearn`

module in Python tries to *standardize the way machine learning
techniques are performed*. Instead of technique specific setup and manual application
of the resulting model, `sklearn`

proposes a standard class-based API.
Every ML technique in `sklean`

uses the following steps to produce a single model.

- Every ML technique (model type) is a class that you can import
- You feed the
*hyperparameters*to the constructor, a new instance of a model is bound to the hyperparameters passed in - You call the
`fit`

method with the training data, which has a standardized form, see below - For models that predict data you use the
`predict`

method to apply the model - For models that transform the data you use
`transform`

method, some models have a`fit_transform`

method which will fit and then transform the same data

Dover Castle is has stood in one form or another since before any records began.
The have been defensive structures on Dover hills before the times of Romans in Britain.
The castle is a good example of sometimes tense relations between neighbors
across a small body of water, the English Channel.
The Roman Lighthouse stands next to the church on the back hill.
William the Conqueror took and re-fortified the castle.
Henry II built the current keep, and the castle stood battle against the French.
The castle has been again rebuilt by Henry VIII as part
of the sea defenses of Britain.
During Napoleonic Wars the outer areas of the castle defenses took
the shape we see today.
And even during World War II Dover Castle was used as an Admiralty base,
due to the protection and good line of sight.

Injecting data and reading the outputs of a model can often be troublesome,
figuring out in which format a model wants its parameters is no easy task.
Developers of `sklearn`

tried to sort the misery of different libraries
requiring parameters in different formats, and created a standard way in which
all `sklearn`

models receive the input.

At first one may argue that this will only work for two dimensions,
but if you remember the ideas of *group by* and *pivot* we can see that
we can represent several dimensions in a two dimensional table.

What the heck are those *hyperparameters*?
These are simply free parameters that interfere in the inner workings of the algorithm.
By *free* we mean that the algorithm works independently of what we set these parameters to.
Yet, the values we give influence how well the resulting model performs against a specific dataset.

Therefore yes, the entire practice of machine learning is about finding the right model and the right model hyperparameters. This could not be harder though, the combination of several free parameters goes into millions or thousands of millions for some models.

As a simple example let's implement $k$ nearest neighbors,
a model that has a single hyperparameter ($k$).
The training (`fit`

) of a $k$ nearest neighbors algorithm is trivial:
the training just saves the training data so we can refer to it during prediction.
This is a wasteful use of memory,
and more advanced algorithms will be able to store specific measures about
the data - which are fewer - instead of the entire training data.
But for our purposes we argue that our trained model
is a $k$ nearest neighbors object which know its value of $k$,
from the constructor, and its training data, from the `fit`

method.

We pass the hyperparameter as an argument to the constructor,
then we pass the training data to a `fit`

method,
and we pass the new data to a `predict`

method.
We also leave the hyperparameter and parameters (the training data)
in attributes that end with an underscore: `k_`

, `X_`

and `y_`

.
There are all patterns of how a class representing an ML model
behaves in `sklearn`

.

In [2]:

```
import numpy as np
class MyKNN(object):
def __init__(self, k=1):
self.k_ = k
def fit(self, X, y):
self.X_ = X
self.y_ = y
def predict(self, new_X):
sq_dist_dim = (self.X_[:, np.newaxis, :] - new_X[np.newaxis, :, :])**2
sq_dist = sq_dist_dim.sum(axis=-1)
self.nearest_ = np.argpartition(sq_dist, self.k_, axis=0)
new_y = self.y_[self.nearest_[:self.k_, :]]
new_y = np.apply_along_axis(lambda x: np.bincount(x, minlength=2), axis=0, arr=new_y)
return new_y.argmax(axis=0)
```

The prediction phase is where the work happens. We perform a series of broadcast operations: first we make the stored training data and the new data to be predicted into three dimensional cubes. But we perform this in a careful way so that when they broadcast over their difference each point of the training data has its difference taken against every point in the new provided data. This difference happens according to each coordinate of all points, and the end result is a cube with the difference between each coordinate as a full matrix of points. We then square the differences and sum across the coordinates to achieve the euclidean distance between all points. Note that we use only two dimensions in our implementation, we could use more dimensions but would require some more complex code to control the projections.

The heavy lifting in our model happen during `argpartition`

,
which sorts the points according to their distances from each point stored in the class.
Here we finally need the value of $k$.
Sorting is an expensive operation,
if one can sort less values the code performs faster.
If you remember that is what the `partition`

procedure does,
it sorts up to a number of elements
(e.g. sorts the smaller first $3$ elements in a list,
once we have the $3$ smallest values in the list and
these are in the correct positions at the front
the sorting algorithm ceases and the rest of the list is not sorted).
Here we use `partition`

to find the $k$ closest points in the
training data from each point in the new data.
Moreover, we use `argpartition`

to get the index of these
points instead of the points themselves.

With the index of the closest $k$ points we index
the solutions - classes, labels - of the training data.
Then we just bundle together the classes of all close points with `bincount`

.
The `mincount=2`

in `bincount`

build an array of $2$ classes:
class $0$ and class $1$, sand sums the occurrence of each
in the class values of the $k$ neighbors.
Finally, since the number of neighbors of class $0$ is at index $0$
and the number of neighbors of class $1$ is at index $1$,
`argmax`

returns us the most common class among the neighbors of the point.

Our implementation is imperfect, in case of a tie the algorithm prefers the class with the lowest index - class $0$. This is irrelevant if we use our algorithm as a binary classifier and with an odd $k$. But there are more complications: we can only deal with points in $2$ dimensions and can only classify $2$ classes. A full implementation will need to deal with the ties. It is an interesting challenge to improve our $k$ nearest neighbors implementation to deal with at least these limitations.