The perceptron is a rather old idea. It was conceived as an alternative to electronic logic gates, but computers built from perceptron gates were never produced. Instead, the perceptron turned out to be a very good model for online learning.
The perceptron receives an arbitrary number of inputs, multiplies each by a weight (a model parameter), sums the products, and then applies an activation function to the result. Given the inputs as the vector $x$, the weights as the vector $w$, and the activation function as $f$, a single prediction of the perceptron is:
$$\hat{y} = f \left( w^T x \right) = f \left( \sum_i w_i x_i \right)$$

Note that the vector $x$ has one extra input that is always equal to $1$. This is called the bias term, it is somewhat analogous to an amplifier in electronics, and it has its own weight $w_0$. The bias term ensures that the output of the perceptron is not bounded by possibly small input values.
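To make the formula concrete, here is a minimal sketch of a single prediction in plain NumPy; the helper name perceptron_predict and the weights below are made up purely for illustration.

import numpy as np

def perceptron_predict(x, w, f=np.sign):
    # prepend the always-on bias input so that w[0] acts as the bias weight
    x = np.r_[1.0, x]
    # f(w^T x): weighted sum followed by the activation
    return f(w @ x)

w = np.array([-0.5, 1.0, 1.0])            # w_0 (bias), w_1, w_2
perceptron_predict(np.array([0, 1]), w)   # sign(-0.5 + 0 + 1) = 1.0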
To train a perceptron we need to find the vector $w$ of weights that gives the best classification or regression. There are two ways of doing it: compute the weights from the entire dataset at once, or update them iteratively, one sample at a time.
The second option is much cheaper computationally; therefore the most common perceptron learning rule is:
$$w_{t+1} = w_t + \eta\,(y - \hat{y})\,x$$

Where $\eta$ is the learning rate and $w_t$ is the weight vector after $t$ updates. The $y$ and $\hat{y}$ are either two classes encoded as $-1$ and $1$, or as $0$ and $1$. Note that the training is so simple that both ways of encoding the classes work well. For regression one can place the regression values directly into $y$ and $\hat{y}$ as well.
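As a minimal sketch of this rule (the function train_perceptron below is illustrative, not the implementation used by any library), an online training loop updates the weights after every sample:

import numpy as np

def step(z):
    # 0-vs-1 encoding of the activation
    return float(z > 0)

def train_perceptron(X, y, eta=0.1, epochs=10):
    X = np.c_[np.ones(len(X)), X]    # prepend the always-on bias input
    w = np.zeros(X.shape[1])
    for _ in range(epochs):
        for x_i, y_i in zip(X, y):
            y_hat = step(w @ x_i)
            w = w + eta * (y_i - y_hat) * x_i   # the perceptron learning rule
    return w

On a tiny problem such as the OR truth table this loop reaches a working set of weights within a handful of epochs.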
In this setup $x_1$ and $x_2$ are the features of the data and $w_n$ are the components of the weight vector; $\hat{y}$ is the output of the perceptron, and its values make up the decision function of the model. The most common activation function for a perceptron is the $sign$ function, which turns the perceptron into a binary classifier. The outputs of the sign function are $-1$ and $1$, which we often re-encode as $0$ and $1$.
We will attempt to learn boolean logic with our perceptron. This was the original idea for the use of perceptrons, although they were meant to be implemented in hardware, not software. The interesting part about this classification problem is that we can generate the full world of data (not just a sample) that will ever pass through our model, therefore we can hope to achieve a perfect F1 score.
import numpy as np
import pandas as pd

# all four combinations of two boolean inputs
data = np.array([[0, 0],
                 [0, 1],
                 [1, 0],
                 [1, 1]])
# labels for each boolean gate
y_or = data.any(axis=1).astype(int)
y_and = data.all(axis=1).astype(int)
y_nand = (~data.all(axis=1)).astype(int)
y_xor = (data[:, 0] != data[:, 1]).astype(int)
df_true = pd.DataFrame({'left': data[:, 0], 'right': data[:, 1],
                        'OR': y_or, 'AND': y_and, 'XOR': y_xor, 'NAND': y_nand},
                       columns=['left', 'right', 'OR', 'XOR', 'AND', 'NAND'])
df_true
OK, we have a full truth table with the correct outputs for each logic function.
Just like any other model in sklearn we can follow the initialize and fit procedure to train our perceptron. We then feed in the inputs and expected outputs for each logic function. The max_iter=10 hyperparameter prevents the training from going through too many iterations.
from sklearn.linear_model import Perceptron
from sklearn.metrics import f1_score

gates = {'or': y_or, 'and': y_and, 'xor': y_xor, 'nand': y_nand}
# a test set that repeats the full truth table in shuffled order
test = np.array([
    [0, 0], [0, 1], [1, 0], [1, 1],
    [1, 0], [0, 1], [1, 1], [0, 0],
    [1, 1], [1, 0], [0, 1], [0, 0],
])
test_y = {
    'or': test.any(axis=1).astype(int),
    'and': test.all(axis=1).astype(int),
    'xor': (test[:, 0] != test[:, 1]).astype(int),
    'nand': (~test.all(axis=1)).astype(int),
}
all_preds = test.copy()
for g in ['or', 'and', 'nand', 'xor']:
    pcpt = Perceptron(penalty='l2', max_iter=10)
    pcpt.fit(data, gates[g])
    y_pred = pcpt.predict(test)
    all_preds = np.c_[all_preds, y_pred[:, np.newaxis]]
    f1 = f1_score(test_y[g], y_pred)
    print('###### Gate:', g.upper(), 'F1:', f1)
pd.DataFrame(all_preds, columns=['left', 'right', 'OR', 'AND', 'NAND', 'XOR'])
We cannot learn XOR with a single perceptron. Why is that? The perceptron is a linear model and XOR is not a linearly separable function. The activation function of the perceptron that we are using is the sign function:
$$ \text{sign}(x) = \begin{cases} -1 & \text{if } x \leq 0 \\ 1 & \text{if } x > 0 \end{cases} $$

This is often mangled a bit by ML libraries, where instead we get:

$$ \text{sign}(x) = \begin{cases} 0 & \text{if } x \leq 0 \\ 1 & \text{if } x > 0 \end{cases} $$

There are cases where perceptrons are used with $-1$ vs $1$ classes and other cases where they are used with $0$ and $1$; we need to get used to this inconsistency, as both are valid ways of using a perceptron. From here on we will use classes encoded as $0$ and $1$.
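Both encodings are easy to mimic in NumPy; note that np.sign itself returns $0$ at exactly $0$, so np.where matches the definitions above more faithfully.

import numpy as np

z = np.array([-2.0, 0.0, 3.0])
np.where(z > 0, 1, -1)   # the -1 vs 1 encoding: array([-1, -1, 1])
np.where(z > 0, 1, 0)    # the  0 vs 1 encoding: array([0, 0, 1])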
Each of our electronic gate perceptrons had two inputs. We can draw the full world of these boolean operations in two dimensions: one input on the horizontal axis (left input) and the other on the vertical axis (right input). The perceptron performs a sum and then a clip (sign) operation; this is a linear operation, and in this world the decision function of the perceptron will be a line.
In a world with points $(0, 0)$, $(0, 1)$, $(1, 0)$ and $(1, 1)$ we can imagine a single line that performs the operation of $AND$, $OR$ or $NAND$. But in order to perform $XOR$ we would need at least two lines: one to separate $(0, 0)$ from the rest of the points and one to separate $(1, 1)$ from the rest. Since $XOR$ cannot be linearly separated we cannot train a single perceptron to execute this operation.
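We can make this precise with the $0$ and $1$ encoding. A single perceptron outputs $1$ exactly when $w_1 x_1 + w_2 x_2 + w_0 > 0$, so computing $XOR$ would require:

$$\begin{aligned}
(0, 0) &\mapsto 0: & w_0 &\leq 0 \\
(0, 1) &\mapsto 1: & w_2 + w_0 &> 0 \\
(1, 0) &\mapsto 1: & w_1 + w_0 &> 0 \\
(1, 1) &\mapsto 0: & w_1 + w_2 + w_0 &\leq 0
\end{aligned}$$

Adding the two middle inequalities gives $w_1 + w_2 + 2w_0 > 0$, while adding the first and the last gives $w_1 + w_2 + 2w_0 \leq 0$, a contradiction. No choice of weights, i.e. no single line, can satisfy all four constraints.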
Yet we have more tricks up our sleeves: we can use more than one perceptron. We know that:
$$\text{XOR}(a, b) = \text{AND}(\text{NAND}(a, b), \text{OR}(a, b))$$

Therefore we can build an XOR model out of three perceptrons. We train one $OR$ perceptron, one $NAND$ perceptron and one $AND$ perceptron, and link them together in the same way as in the equation.
pcpt_or = Perceptron(penalty='l2', max_iter=1_000_000)
pcpt_or.fit(data, y_or)
pcpt_nand = Perceptron(penalty='l2', max_iter=1_000_000)
pcpt_nand.fit(data, y_nand)
pcpt_and = Perceptron(penalty='l2', max_iter=1_000_000)
pcpt_and.fit(data, y_and)

def predict_xor(x):
    # feed the inputs to the OR and NAND perceptrons,
    # then feed their outputs into the AND perceptron
    left = pcpt_or.predict(x)
    right = pcpt_nand.predict(x)
    return pcpt_and.predict(np.array(list(zip(left, right))))
y_pred = predict_xor(test)
f1_score(test_y['xor'], y_pred)
The three perceptrons together can be thought of as a simple neural network. The difference is that here we knew exactly how to train every perceptron separately, which is not as easy in larger, possibly complex, neural networks. In order to train perceptrons for which we do not know the exact function they represent we need better training techniques.
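As a sketch of what such joint training looks like, we could hand the XOR problem (the data and y_xor arrays from above) to a small multi-layer network trained with backpropagation. The snippet below uses sklearn's MLPClassifier; the hidden layer size, activation and random seed are illustrative choices, and convergence on only four samples is sensitive to them.

from sklearn.neural_network import MLPClassifier

# a tiny two-layer network: all perceptron-like units are trained together
mlp = MLPClassifier(hidden_layer_sizes=(4,), activation='tanh',
                    max_iter=10000, random_state=0)
mlp.fit(data, y_xor)
mlp.predict(data)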