10.01 The Perceptron

The perceptron is a rather old idea. It was conceived as an alternative implementation for electronic gates, although computers built from perceptron gates were never made. Instead, the perceptron turned out to be a very good model for online learning.

Neurons

ol-neurons.svg
Neurons come in many shapes, forms and sizes. We are often taught that every neuron has dendrites which receive a signal, a signal which is then passed through the axon and onward to other neurons. That is the classical multipolar neuron, and such a multipolar neuron is indeed the basis for the perceptron. But nature has created many more types of neurons, connected in different ways. To cite a few, there are bipolar and even unipolar neurons, which have just a single dendrite. Our current machine learning models based on neurons are limited to very few of these types; we still have a lot to learn from nature.

The perceptron receives an arbitrary number of inputs, multiplies each by a weight (a model parameter), sums them, and then applies an activation function to the result of this sum. Given the inputs as the vector $x$, the weights as the vector $w$, and the activation function as $f$, a single prediction of the perceptron is simply:

$$\hat{y} = f \left( w^T x \right) = f \left( \sum_i w_i x_i \right)$$

Note that the vector $x$ actually carries one extra input that is always equal to $1$. This is called the bias term, it is similar to an amplifier in electronics, and it has its related weight $w_0$. The bias term ensures that the outputs of the perceptron are not bound by possibly small input values.
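
As a quick illustrative sketch (plain NumPy, with made-up numbers), a single prediction with the bias input prepended to $x$ looks like:

import numpy as np

x = np.array([1., 0.3, 0.7])    # the first element is the bias input, always 1
w = np.array([-0.5, 1.2, 0.8])  # w_0 is the weight attached to the bias input
y_hat = np.sign(w @ x)          # weighted sum followed by the activation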

To train a perceptron we need to find the correct vector $w$ of weights that gives the best classification or regression. There are two ways of doing it:

  • We can use an optimizer to find this vector.
  • We can perturb each weight by the size of its input and the error of the classification.

The second option is much cheaper computationally; therefore the most common perceptron learning rule is:

$$w_i \leftarrow w_i + \eta \left( y - \hat{y} \right) x_i$$

Where $\eta$ is the learning rate. The $y$ and $\hat{y}$ are the two classes, encoded either as $-1$ and $1$ or as $0$ and $1$. The training rule is simple enough that both ways of encoding the classes work well. For regression one can place the regression values directly into $y$ and $\hat{y}$ as well.
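
The rule above is easy enough to write by hand. Below is a minimal sketch of that second training option, assuming classes encoded as $0$ and $1$ and a step activation; it is an illustration of the rule, not the sklearn implementation we use later.

import numpy as np

def train_perceptron(X, y, eta=0.1, epochs=10):
    X = np.c_[np.ones(len(X)), X]        # prepend the bias input of 1 to every sample
    w = np.zeros(X.shape[1])             # start with all weights at zero
    for _ in range(epochs):
        for xi, yi in zip(X, y):
            y_hat = 1 if w @ xi > 0 else 0   # step activation with classes 0 and 1
            w += eta * (yi - y_hat) * xi     # the perceptron learning rule
    return w

X_gate = np.array([[0, 0], [0, 1], [1, 0], [1, 1]])
w_or = train_perceptron(X_gate, np.array([0, 1, 1, 1]))   # weights that implement an OR gate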

Perceptron

ol-perceptron-activation.svg

Here $x_1$ and $x_2$ are the features of the data, and the $w_n$ are the components of the perceptron's weight vector. $\hat{y}$ is the output of the perceptron, and its values make up the decision function of the model. The most common activation function for a perceptron is the $sign$ function, which makes the perceptron a binary classifier. The outputs of the sign function are $-1$ and $1$, which we often re-encode as $0$ and $1$.

Electronic Gates

We will attempt to learn boolean logic with our perceptron. This was the original idea for the use of perceptrons, although they were meant to be implemented in hardware, not software. The interesting part about this classification is that we can generate the full world of data (not a sample) that will pass through our model, so a perfect F1 score is achievable.

In [1]:
import numpy as np
import pandas as pd

data = np.array([[0, 0],
                 [0, 1],
                 [1, 0],
                 [1, 1]])
y_or = data.any(axis=1).astype(int)
y_and = data.all(axis=1).astype(int)
y_nand = (~data.all(axis=1)).astype(int)
y_xor = (data[:, 0] != data[:, 1]).astype(int)
df_true = pd.DataFrame({'left': data[:, 0], 'right': data[:, 1],
                        'OR': y_or, 'AND': y_and, 'XOR': y_xor, 'NAND': y_nand},
                       columns=['left', 'right', 'OR', 'XOR', 'AND', 'NAND'])
df_true
Out[1]:
left right OR XOR AND NAND
0 0 0 0 0 0 1
1 0 1 1 1 0 1
2 1 0 1 1 0 1
3 1 1 1 0 1 0

OK, we have a full truth table with the correct outputs for each logic function.

Just like with any other model in sklearn we can follow the initialize-and-fit pattern to train our perceptron. We then feed it the inputs and expected outputs for each logic function. The max_iter=10 hyperparameter will prevent the training from going through too many iterations.

In [2]:
from sklearn.linear_model import Perceptron
from sklearn.metrics import f1_score

gates = {'or': y_or, 'and': y_and, 'xor': y_xor, 'nand': y_nand}
test = np.array([
    [0, 0], [0, 1], [1, 0], [1, 1],
    [1, 0], [0, 1], [1, 1], [0, 0],
    [1, 1], [1, 0], [0, 1], [0, 0],
])
test_y = {
    'or': test.any(axis=1).astype(int),
    'and': test.all(axis=1).astype(int),
    'xor': (test[:, 0] != test[:, 1]).astype(int),
    'nand': (~test.all(axis=1)).astype(int)
}
all_preds = test.copy()
for g in ['or', 'and', 'nand', 'xor']:
    pcpt = Perceptron(penalty='l2', max_iter=10)
    pcpt.fit(data, gates[g])
    y_pred = pcpt.predict(test)
    all_preds = np.c_[all_preds, y_pred[:, np.newaxis]]
    f1 = f1_score(test_y[g], y_pred)
    print('###### Gate:', g.upper(), 'F1:', f1)
pd.DataFrame(all_preds, columns=['left', 'right', 'OR', 'AND', 'NAND', 'XOR'])
###### Gate: OR F1: 1.0
###### Gate: AND F1: 1.0
###### Gate: NAND F1: 1.0
###### Gate: XOR F1: 0.0
/home/grochmal/anaconda3/lib/python3.7/site-packages/sklearn/linear_model/_stochastic_gradient.py:557: ConvergenceWarning: Maximum number of iteration reached before convergence. Consider increasing max_iter to improve the fit.
  ConvergenceWarning)
Out[2]:
left right OR AND NAND XOR
0 0 0 0 0 1 0
1 0 1 1 0 1 0
2 1 0 1 0 1 0
3 1 1 1 1 0 0
4 1 0 1 0 1 0
5 0 1 1 0 1 0
6 1 1 1 1 0 0
7 0 0 0 0 1 0
8 1 1 1 1 0 0
9 1 0 1 0 1 0
10 0 1 1 0 1 0
11 0 0 0 0 1 0

We cannot learn XOR with a single perceptron. Why is that? The perceptron is a linear model and XOR is not a linearly separable function. The activation function of the perceptron we are using is the sign function.

$$ \text{sign}(x) = \begin{cases} -1 \text{ if } x \leq 0 \\ 1 \text{ if } x > 0 \\ \end{cases} $$

This is often mangled a bit by ML libraries, and instead we get:

$$ \text{sign}(x) = \begin{cases} 0 \text{ if } x \leq 0 \\ 1 \text{ if } x > 0 \\ \end{cases} $$

There are cases where perceptrons are used with $-1$ vs $1$ and other cases where they are used with $0$ and $1$; we need to get used to this inconsistency, as both are valid ways of using a perceptron. From here on we will use classes encoded as $0$ and $1$.
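
As a tiny sketch, the re-encoded activation we will use from here on can be written as follows (the name sign01 is just an illustration, not a library function):

import numpy as np

def sign01(z):
    # the "mangled" sign: 0 for non-positive values, 1 for positive ones
    return (np.asarray(z) > 0).astype(int)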

Each of our electronic gate perceptrons had two inputs. We can draw the full world of these boolean operations in two dimensions: one input on the horizontal axis (left input) and the other on the vertical axis (right input). The perceptron performs a sum and then a clip (sign) operation. This is a linear operation, and in this world the decision function of the perceptron will be a line.

In a world with points $(0, 0)$, $(0, 1)$, $(1, 0)$ and $(1, 1)$ we can imagine a single line that will perform the operation of $AND$, $OR$ and $NAND$. But in order to perform $XOR$ we would need at least two lines, one to separate $(0, 0)$ from the rest of the points and one to separate $(1, 1)$ from the rest of the points. Since $XOR$ cannot be linearly separated we cannot train a perceptron to execute this operation.
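
We can look at such a line directly. A fitted sklearn Perceptron exposes the learned weights in coef_ and the bias in intercept_, so the decision line $w_1 x_1 + w_2 x_2 + w_0 = 0$ can be written out. The snippet below is a small sketch for the OR gate, reusing data and y_or from above and assuming the weight on the second input is nonzero.

pcpt = Perceptron(max_iter=1000)
pcpt.fit(data, y_or)
(w1, w2), (w0,) = pcpt.coef_[0], pcpt.intercept_
# decision line: w1*left + w2*right + w0 = 0, i.e. right = -(w1*left + w0)/w2
print('right =', -w1/w2, '* left +', -w0/w2)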

Perceptron Decision

ol-perceptron-decision.svg

Yet we have more tricks up our sleeves: we can use more than one perceptron. We know that:

$$\text{XOR}(a, b) = \text{AND}(\text{NAND}(a, b), \text{OR}(a, b))$$

Therefore we can build a XOR model with three perceptrons. We train one $OR$ perceptron, one $NAND$ perceptron and one $AND$ perceptron, and link them together in the same way as in the equation.

In [3]:
pcpt_or = Perceptron(penalty='l2', max_iter=1_000_000)
pcpt_or.fit(data, y_or)
pcpt_nand = Perceptron(penalty='l2', max_iter=1_000_000)
pcpt_nand.fit(data, y_nand)
pcpt_and = Perceptron(penalty='l2', max_iter=1_000_000)
pcpt_and.fit(data, y_and)


def predict_xor(x):
    left = pcpt_or.predict(x)
    right = pcpt_nand.predict(x)
    return pcpt_and.predict(np.array(list(zip(left, right))))


y_pred = predict_xor(test)
f1_score(test_y['xor'], y_pred)
Out[3]:
1.0

Perceptron XOR

ol-perceptron-xor.svg

The three perceptrons together can be thought of as a simple Neural Network. The difference is that here we knew exactly how to train each perceptron separately, which is not as easy in (possibly) complex neural networks. In order to train perceptrons for which we do not know the exact function they should represent, we need better training techniques.