OK, so how does all that work?
We managed to train perceptrons because we knew what the output of each perceptron should be.
And we saw that a perceptron can only output two values with the sign
activation function.
In neural networks the perceptrons/neurons are organized in layers.
Each neuron will have its own weights, and these will be trained simultaneously
across the entire network with a GD variant.
This process is often referred to as backpropagation.
But before we dive into backpropagation let's discuss activation functions.
The sign
function is only one viable activation function;
we can make our NN work with several others.
Although this is not possible in sklearn
, many NN libraries allow
one to set a different activation function for each layer of the NN.
All functions but the linear one introduce a touch of non-linearity
into the processing.
Note also how many of the functions clip the output to lie either
between $0$ and $1$ or between $-1$ and $1$.
This clipping keeps the outputs of a single layer similar in magnitude
to those of the other layers,
and it keeps the derivatives from growing too large as well.
Unfortunately, when we add more layers this clipping alone is not enough
to prevent derivatives from becoming overly large or vanishingly small,
and further tricks are needed; but these are out of scope here.
The equations of the functions shown are below.
$$ \text{linear}(x) = x \\ \text{sign}(x) = \begin{cases} -1 \text{ if } x \leq 0 \\ 1 \text{ if } x > 0 \\ \end{cases} \\ \text{relu}(x) = \begin{cases} 0 \text{ if } x \leq 0 \\ x \text{ if } x > 0 \\ \end{cases} \\ \text{tanh}(x) = \frac{e^{x} - e^{-x}}{e^{x} + e^{-x}}\\ \text{sigmoid}(x) = \frac{1}{1 + e^{-x}} $$The exponential functions have very easy derivatives but may be
expensive to compute when there are many neurons and many samples.
The relu
stands for Rectified Linear Unit and tanh
for Hyperbolic Tangent.
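As a minimal sketch, the activation functions above can be written in NumPy directly from their equations (the function names simply mirror the equations and are not tied to any particular library's API):

```python
import numpy as np

def linear(x):
    # identity: passes the weighted sum through unchanged
    return x

def sign(x):
    # -1 for x <= 0, +1 for x > 0, as defined above
    return np.where(x > 0, 1.0, -1.0)

def relu(x):
    # zero for negative inputs, identity for positive inputs
    return np.maximum(0.0, x)

def tanh(x):
    # hyperbolic tangent, output in (-1, 1)
    return np.tanh(x)

def sigmoid(x):
    # logistic function, output in (0, 1)
    return 1.0 / (1.0 + np.exp(-x))
```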
And that's not all: for the last layer of the network other functions are often used.
The most common of these is the softmax
function,
which exponentiates and rescales all values so that they sum to $1$ and the highest value stands out from the lower ones.
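A minimal sketch of softmax in NumPy (subtracting the maximum is a common numerical-stability trick, not part of the mathematical definition):

```python
import numpy as np

def softmax(x):
    # exponentiate and normalize so the outputs sum to 1;
    # subtracting the maximum avoids overflow in the exponential
    e = np.exp(x - np.max(x))
    return e / e.sum()

softmax(np.array([1.0, 2.0, 5.0]))
# array([0.01714783, 0.04661262, 0.93623955]) - the largest input dominates
```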
A typical NN is organized in layers and there are connections between all neurons in adjacent layers. As in a perceptron, the weights (model parameters) exist for every single connection. And for every neuron there exists a bias connection, which always inputs a value of $1$ multiplied by its weight into the neuron.
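As a sketch of how one such layer computes its output (the layer sizes and the random weight values here are arbitrary, chosen only for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)

# one dense layer: 3 inputs, 4 neurons
W = rng.normal(size=(3, 4))   # one weight per connection
b = np.ones(4) * 0.1          # one bias weight per neuron (an input of 1 times this weight)

def layer(x, W, b, activation):
    # weighted sum of the inputs plus the bias, then the activation function
    return activation(x @ W + b)

x = np.array([0.5, -1.2, 3.0])
hidden = layer(x, W, b, np.tanh)   # output of the hidden layer, values in (-1, 1)
```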
How can we train the perceptrons in the hidden layers? We do not really know what their output should be. Enter backpropagation, a technique to train a NN without the need to know what the output of every neuron should be.
Backpropagation argues that we can divide the error of the NN according to the contribution of every weight. To find the contribution of each weight we use the partial derivative.
$$ \nabla E = \frac{\partial E}{\partial w_1}\hat{\imath} + \frac{\partial E}{\partial w_2}\hat{\jmath} + \ldots $$But we have seen this before! That is the same equation as in Gradient Descent. Hence, to train a NN, all we need to do is apply a variant of GD to it.
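As a minimal sketch of what such a GD step over the network's weights looks like (the helper name `gd_step` and the learning rate value are illustrative assumptions):

```python
learning_rate = 0.01

def gd_step(weights, gradients):
    # move every weight against its contribution to the error
    return [w - learning_rate * g for w, g in zip(weights, gradients)]
```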
Inside a NN library the NN is composed of several matrices. Calculation of derivatives on matrices is efficient thanks to some tricks from Vector Calculus. But for us the important part is that we can define the entire NN as just a set of operations: a single function performing multiplications and additions and applying the activation functions.
Since the NN itself is just a function,
and subtracting the expected values (the labels) gives us the error function,
we argue that all we need to train it is GD or a variant of it.
In most cases we square the error, take its absolute value, or apply an even more
complex operation to make it positive, but it is still just a function.
We also often add an L2
or L1
regularization term to this function.
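For example, a squared error with an L2 penalty on the weights could be sketched as follows (the function name `error` and the regularization strength `alpha` are illustrative assumptions, not any particular library's API):

```python
import numpy as np

def error(y_pred, y_true, weights, alpha=0.01):
    # squared difference to the labels makes the error positive,
    # the L2 term penalizes large weights
    squared = np.sum((y_pred - y_true) ** 2)
    l2 = alpha * sum(np.sum(w ** 2) for w in weights)
    return squared + l2
```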
To perform GD all we need are derivatives. And it turns out that by a clever use of the chain rule of differentiation a computer can automate the differentiation of even quite complicated functions.
A complete explanation of the chain rule is beyond our scope. But we can work through an example of the chain rule on a reasonably complex function to see how a computer can perform the differentiation.
$$ f(x, y) = (xy + 3)^2 + 1 $$Working by hand we can find that the partial derivative with respect to $x$ is:
$$ \frac{\partial f}{\partial x} = 2xy^2 + 6y $$But that requires a human to do the thinking on how to differentiate. For a computer to do it we divide $f$ into several simple functions, each performing only a single operation.
$$ g(h) = h + 1 \\ h(k) = k^2 \\ k(m) = m + 3 \\ m(x, y) = xy $$This way we can write $f$ as.
$$ f(x, y) = g(h(k(m(x, y)))) $$By the chain rule we know that.
$$ \frac{\partial f}{\partial x} = \frac{\partial g}{\partial h} \cdot \frac{\partial h}{\partial k} \cdot \frac{\partial k}{\partial m} \cdot \frac{\partial m}{\partial x} $$So we calculate each of them.
$$ \frac{\partial g}{\partial h} = 1 \\ \frac{\partial h}{\partial k} = 2k = 2(m + 3) = 2(xy + 3) = 2xy + 6 \\ \frac{\partial k}{\partial m} = 1 \\ \frac{\partial m}{\partial x} = y $$Substitute in the equation above and we get.
$$ \frac{\partial f}{\partial x} = 1 \cdot (2xy + 6) \cdot 1 \cdot y = 2xy^2 + 6y $$The same result.
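We can check this result numerically with a finite difference; the point $(x, y) = (2, 1)$ is chosen arbitrarily for illustration:

```python
def f(x, y):
    return (x * y + 3) ** 2 + 1

def df_dx(x, y):
    # the partial derivative found with the chain rule
    return 2 * x * y ** 2 + 6 * y

x, y, eps = 2.0, 1.0, 1e-6
numeric = (f(x + eps, y) - f(x - eps, y)) / (2 * eps)
print(numeric, df_dx(x, y))   # both approximately 10.0
```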
Hence GD can be performed easily on complex functions. Computationally there are techniques to calculate $\frac{\partial f}{\partial x}$ and $\frac{\partial f}{\partial y}$ at once but the operations are exactly the ones described above in the chain rule example.
It turns out that computationally it is more efficient to accumulate the derivatives backward, i.e. starting from the output side of the chain rule: $\frac{\partial g}{\partial h}$ first, then $\frac{\partial h}{\partial k}$, $\frac{\partial k}{\partial m}$ and finally $\frac{\partial m}{\partial x}$. The term backpropagation comes from this backward order through the chain rule, not from going backward through the NN as many texts put it. OK, fair enough, perhaps the analogy of backpropagation going through the network itself is not completely incorrect: there is some truth in it if you think about the internal computational graph that performs the differentiation. But be wary of that analogy, the mindset it presents can hinder one's attempts at understanding the mathematics behind the use of the chain rule for NN training.
Accumulating the derivatives in the opposite order, i.e. $\frac{\partial m}{\partial x}$ first, then $\frac{\partial k}{\partial m}$, $\frac{\partial h}{\partial k}$ and $\frac{\partial g}{\partial h}$, is called forward propagation and is, in principle, also a viable way of training a NN. But forward propagation needs to carry a derivative with respect to every single weight alongside every intermediate value (or, equivalently, to repeat the whole computation once per weight), which costs far more time and memory than backpropagation when the NN has many weights. Hence backpropagation is preferred in almost all cases.
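To make the backward accumulation concrete, here is a small sketch (the function name `f_and_grads` is made up for illustration) that evaluates the simple functions forward, keeps the intermediate values, and then multiplies the local derivatives from the output toward the inputs:

```python
def f_and_grads(x, y):
    # forward pass: evaluate the simple functions and keep intermediate values
    m = x * y
    k = m + 3
    h = k ** 2
    g = h + 1

    # backward pass: multiply local derivatives from the output toward the inputs
    dg_dh = 1.0
    dh_dk = 2 * k
    dk_dm = 1.0
    df_dm = dg_dh * dh_dk * dk_dm
    df_dx = df_dm * y        # dm/dx = y
    df_dy = df_dm * x        # dm/dy = x
    return g, df_dx, df_dy

f_and_grads(2.0, 1.0)   # (26.0, 10.0, 20.0)
```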