OK, so how does all that work?
We managed to train perceptrons because we knew what the output of each perceptron should be.
And we saw that a perceptron can only output two values with the `sign` activation function.

In neural networks the perceptrons/neurons are organized in layers.

- One *input layer*, which has one perceptron per feature
- At least one *hidden layer* with fully connected perceptrons
- One *output layer* with one perceptron per output class (or per output for regression)
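As a sketch of this layered layout, `sklearn`'s `MLPClassifier` builds exactly such a stack. The data below is random and the hidden layer sizes are chosen arbitrarily for illustration; the input and output layer sizes are inferred from the data:

```python
import numpy as np
from sklearn.neural_network import MLPClassifier

rng = np.random.default_rng(0)
X = rng.normal(size=(30, 4))      # 30 samples, 4 features -> input layer of 4
y = rng.integers(0, 3, size=30)   # 3 classes -> output layer of 3

# Two hidden layers (sizes chosen arbitrarily for this sketch)
net = MLPClassifier(hidden_layer_sizes=(10, 5), max_iter=30)
net.fit(X, y)

# One weight matrix per pair of adjacent layers: 4->10, 10->5, 5->3
print([w.shape for w in net.coefs_])  # [(4, 10), (10, 5), (5, 3)]
```

The shapes of the weight matrices make the layer structure visible: each matrix connects every neuron in one layer to every neuron in the next.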

Each neuron will have its own weights, and these will be trained simultaneously
across the entire network with a GD variant.
This process is often referred to as *backpropagation*.
But before we dive into backpropagation let's discuss activation functions.
The `sign` function is only one viable activation function;
we can make our NN work with several others.

Although this is not possible in `sklearn`, many NN libraries allow
one to set a different activation function for each layer of the NN.
All functions but the linear one allow for a touch of non-linearity
during the processing.
Note also how many of the functions clip the output to be either
between $0$ and $1$ or between $-1$ and $1$.
The clipping keeps the outputs of a single layer similar in magnitude
to those of other layers,
and keeps the derivatives from growing too high.
Unfortunately, when we add more layers this clipping is not enough
to prevent overly high or overly low derivatives, and further tricks are needed;
but these are out of scope here.

The equations of the functions shown are below.

$$ \text{linear}(x) = x \\ \text{sign}(x) = \begin{cases} -1 \text{ if } x \leq 0 \\ 1 \text{ if } x > 0 \end{cases} \\ \text{relu}(x) = \begin{cases} 0 \text{ if } x \leq 0 \\ x \text{ if } x > 0 \end{cases} \\ \text{tanh}(x) = \frac{e^{x} - e^{-x}}{e^{x} + e^{-x}} \\ \text{sigmoid}(x) = \frac{1}{1 + e^{-x}} $$

The exponential functions have very easy derivatives but may be
expensive to compute when there are many neurons and many samples.
`relu` stands for Rectified Linear Unit and `tanh` for Hyperbolic Tangent.
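These functions are simple to write down in code. A minimal NumPy sketch, with the function names mirroring the equations above (`tanh` is already provided by NumPy):

```python
import numpy as np

def sign(x):
    # -1 for x <= 0, 1 for x > 0, matching the definition above
    return np.where(x > 0, 1, -1)

def relu(x):
    # zero for negative inputs, identity for positive ones
    return np.maximum(0, x)

def sigmoid(x):
    return 1 / (1 + np.exp(-x))

x = np.array([-2.0, 0.0, 2.0])
print(sign(x))     # [-1 -1  1]
print(relu(x))     # [0. 0. 2.]
print(np.tanh(x))  # clipped between -1 and 1
print(sigmoid(x))  # clipped between 0 and 1
```

Note how `tanh` and `sigmoid` squash any input into their clipped ranges, while `relu` is unbounded above.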

And that's not all: for the last layer of the network other functions are often used.
The most common of these is the `softmax` function,
which rescales all values into a distribution that sums to $1$,
pushing the highest value apart from the lower ones.
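A minimal sketch of `softmax`; subtracting the maximum before exponentiating is a standard numerical-stability trick and does not change the result:

```python
import numpy as np

def softmax(z):
    # shift so the largest exponent is 0, avoiding overflow
    e = np.exp(z - z.max())
    return e / e.sum()

scores = np.array([1.0, 2.0, 5.0])
probs = softmax(scores)
print(probs)        # the largest score dominates the distribution
print(probs.sum())  # the outputs sum to 1
```

Because of the exponential, even a modest gap in the raw scores becomes a large gap in the resulting probabilities.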

A typical NN is organized in layers, and there are connections between all neurons in adjacent layers. As in a perceptron, the weights (model parameters) exist for every single connection. And for every neuron there exists a bias connection, which always inputs a value of $1$ multiplied by its weight into the neuron.

How can we train the perceptrons in the hidden layers?
We do not really know what their output should be.
Enter **backpropagation**,
a technique to train a NN without the need to know
what the outputs of every neuron should be.

Backpropagation argues that we can divide the error of the NN according to the contribution of every weight. To find the contribution of every weight we use the partial derivative.

$$ \nabla E = \frac{\partial E}{\partial w_1}\hat{\imath} + \frac{\partial E}{\partial w_2}\hat{\jmath} + \ldots $$

But we did see this before! That is the same equation as for Gradient Descent. Hence, to train a NN, all we need to do is to apply a variant of GD to it.

Inside a NN library the NN is composed of several matrices. Calculation of derivatives on matrices is efficient due to some tricks in Vector Calculus. But for us the important part is that we can define the entire NN as just a set of operations, as a single function performing multiplications, additions and applying the activation functions.
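To illustrate this, a forward pass through a small, hypothetical two-layer network is just a handful of matrix operations. The weights here are random, not trained; the shapes are chosen for the sketch:

```python
import numpy as np

rng = np.random.default_rng(0)
# Hypothetical shapes: 3 input features -> 4 hidden neurons -> 2 outputs
W1, b1 = rng.normal(size=(3, 4)), np.zeros(4)
W2, b2 = rng.normal(size=(4, 2)), np.zeros(2)

def network(X):
    hidden = np.tanh(X @ W1 + b1)  # multiply, add the bias, activate
    return hidden @ W2 + b2        # linear output layer

X = rng.normal(size=(5, 3))        # 5 samples, 3 features
print(network(X).shape)            # (5, 2)
```

The whole network is literally one function: a chain of multiplications, additions and activation functions.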

We see that the NN itself is just a function;
and if we subtract the expected values (the labels) we get the error function.
Hence we argue that all we need to train it is GD or a variant of it.
In most cases we square the error, take its absolute value, or perform an even
more complex operation on it to make it positive, but it is still just a function.
We also often add an `L2` or `L1` regularization term to this function.
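As a sketch, such an error function may look as follows; the squared error plus an `L2` term, with the function and parameter names being our own invention:

```python
import numpy as np

def error(predictions, labels, weights, alpha=0.01):
    # squaring makes the error positive; the second term is the L2 penalty,
    # scaled by the (hypothetical) regularization strength alpha
    return np.mean((predictions - labels) ** 2) + alpha * np.sum(weights ** 2)

# A perfect prediction with zero weights gives zero error
print(error(np.array([1.0, 2.0]), np.array([1.0, 2.0]), np.zeros(3)))  # 0.0
```

However complex this expression gets, it is still just a function of the weights, so GD applies.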

To perform GD all we need are derivatives. And it turns out that, by a clever use of the chain rule of differentiation, a computer can automate the differentiation of even quite complicated functions.

A complete explanation of the chain rule is beyond our scope. But we can work through an example of the chain rule on a reasonably complex function to see how a computer can perform the differentiation.

$$ f(x, y) = (xy + 3)^2 + 1 $$

Working by hand we can find that the partial derivative with respect to $x$ is:

$$ \frac{\partial f}{\partial x} = 2xy^2 + 6y $$

But that requires a human to do the thinking on how to differentiate. For a computer to do it we divide $f$ into several simple functions, each performing only a single operation.

$$ g(h) = h + 1 \\ h(k) = k^2 \\ k(m) = m + 3 \\ m(x, y) = xy $$

This way we can write $f$ as:

$$ f(x, y) = g(h(k(m(x, y)))) $$

By the chain rule we know that:

$$ \frac{\partial f}{\partial x} = \frac{\partial g}{\partial h} \cdot \frac{\partial h}{\partial k} \cdot \frac{\partial k}{\partial m} \cdot \frac{\partial m}{\partial x} $$

So we calculate each of them:

$$ \frac{\partial g}{\partial h} = 1 \\ \frac{\partial h}{\partial k} = 2k = 2(m + 3) = 2(xy + 3) = 2xy + 6 \\ \frac{\partial k}{\partial m} = 1 \\ \frac{\partial m}{\partial x} = y $$

Substituting into the equation above we get:

$$ \frac{\partial f}{\partial x} = 1 \cdot (2xy + 6) \cdot 1 \cdot y = 2xy^2 + 6y $$

The same result.
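We can also convince ourselves of this result numerically, by comparing the hand-derived formula against a finite-difference approximation of $f$ at an arbitrarily chosen point:

```python
def f(x, y):
    return (x * y + 3) ** 2 + 1

def df_dx(x, y):
    # the derivative obtained through the chain rule above
    return 2 * x * y**2 + 6 * y

# central finite difference: (f(x+eps) - f(x-eps)) / (2*eps)
x, y, eps = 1.5, -0.7, 1e-6
numeric = (f(x + eps, y) - f(x - eps, y)) / (2 * eps)
print(numeric, df_dx(x, y))  # both approximately equal
```

The two values agree to many decimal places, which is a quick sanity check one can apply to any hand-derived gradient.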

Hence GD can be performed easily on complex functions. Computationally there are techniques to calculate $\frac{\partial f}{\partial x}$ and $\frac{\partial f}{\partial y}$ at once but the operations are exactly the ones described above in the chain rule example.

It turns out that computationally it is more efficient to calculate the derivatives
backward, i.e. in the order: $\frac{\partial m}{\partial x}$,
$\frac{\partial k}{\partial m}$, $\frac{\partial h}{\partial k}$
and $\frac{\partial g}{\partial h}$.
The term **backpropagation** comes from this backward order through the chain rule,
*not* from going backward through the NN, as is *incorrectly* stated in many texts.
OK, fair enough, perhaps the analogy of backpropagation going through the network
itself is not completely incorrect - there is some truth in it if you think about
the internal computational graph that performs the differentiation.
But be wary of that analogy: the mindset it presents can hinder one's attempts at
understanding the mathematics behind the use of the chain rule for NN training.

Calculating the derivatives in the opposite order, i.e. $\frac{\partial g}{\partial h}$, $\frac{\partial h}{\partial k}$, $\frac{\partial k}{\partial m}$ and $\frac{\partial m}{\partial x}$, is called *forward propagation*, and it is also a viable way of training the NN. Forward propagation, however, requires a lot more memory than backpropagation in order to store intermediate values. Hence backpropagation is preferred in almost all cases.