When fitting a neural network, we compare the output layer with the labeled data. The goal is to find the setting of all weights and biases that minimizes a chosen loss function.
A cost function can be designed in various ways. A popular choice is the sum of the squared differences between the activations $a_i^{(L)}$ of the output layer $L$ and the labels $y_i$, where $i$ runs over the nodes of the output layer.
$ \text{Loss}= \sum_i \big(a_i^{(L)}-y_i \big)^2 $
Since we want to find the minimum of the cost function, we need the derivatives $\frac{\partial \text{Loss}}{\partial w}, \frac{\partial \text{Loss}}{\partial b}$, which give the direction of steepest descent towards the (hopefully) global minimum.
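As a minimal sketch of what this looks like in code (assuming NumPy and a hypothetical dictionary layout for the parameters and their gradients), the loss above and a single gradient-descent step can be written as:

```python
import numpy as np

def squared_error(a_L, y):
    """Sum of squared differences between output activations a^(L) and labels y."""
    return np.sum((a_L - y) ** 2)

def gradient_step(params, grads, lr=0.1):
    """One gradient-descent step: move each parameter against its gradient.
    `params` and `grads` are hypothetical dicts of arrays; the gradients
    themselves are derived layer by layer in the sections below."""
    return {k: params[k] - lr * grads[k] for k in params}
```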
## No Hidden Layer
In a network with no hidden layers, we have the following chain of functions between the $\text{Loss}$ and the weights $w$ and biases $b$ (the factor $\tfrac{1}{2}$ is included so that it cancels against the exponent when differentiating):
$ \begin{aligned}
\text{Loss} &=\tfrac{1}{2}\big(a^{(1)}-y\big)^2 \\[6pt] a^{(1)}&= \sigma (z^{(1)}) \\[6pt] z^{(1)}&=W^{(1)} \cdot a^{(0)}+b^{(1)}
\end{aligned} $
Applying the multivariate chain rule to the weights and biases:
$
\begin{align}
\frac{\partial \text{Loss}}{\partial W^{(1)}}&= \frac{\partial \text{Loss}}{\partial a^{(1)}} * \frac{\partial a^{(1)}}{\partial z^{(1)}} * \frac{\partial z^{(1)}}{\partial W^{(1)}} \\[10pt]
\frac{\partial \text{Loss}}{\partial b^{(1)}}&= \frac{\partial \text{Loss}}{\partial a^{(1)}} * \frac{\partial a^{(1)}}{\partial z^{(1)}} * \frac{\partial z^{(1)}}{\partial b^{(1)}}
\end{align}
$
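The following is a small numerical sketch of these two products, assuming $\sigma$ is the logistic sigmoid and using hypothetical sizes (3 inputs, 2 output nodes) and random values:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Hypothetical sizes and values: 3 inputs, 2 output nodes
rng = np.random.default_rng(0)
a0 = rng.normal(size=3)              # input activations a^(0)
W1 = rng.normal(size=(2, 3))         # weights W^(1)
b1 = rng.normal(size=2)              # biases  b^(1)
y  = np.array([0.0, 1.0])            # labels

# Forward pass
z1 = W1 @ a0 + b1
a1 = sigmoid(z1)
loss = 0.5 * np.sum((a1 - y) ** 2)

# Backward pass: the chain-rule factors from the equations above
dLoss_da1 = a1 - y                   # ∂Loss/∂a^(1)
da1_dz1   = a1 * (1 - a1)            # ∂a^(1)/∂z^(1), derivative of the sigmoid
delta1    = dLoss_da1 * da1_dz1      # one value per output node

dLoss_dW1 = np.outer(delta1, a0)     # ∂z^(1)/∂W^(1) contributes a^(0)
dLoss_db1 = delta1                   # ∂z^(1)/∂b^(1) = 1
```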
## Hidden Layers
In this example, we have multiple layers (hidden layers in the middle). To keep it simple, each layer consists of just one node. Again, we want to find all parameters $w^{(i)}$ that minimize $\text{Loss}$.
![[hidden-layer.png|center|600]]
For a network with $l$ layers, we can use the chain rule to break the derivative down into separate factors.
$ \frac{\partial \text{Loss}}{\partial w^{(1)}}= \frac{\partial \text{Loss}}{\partial a^{(l)}} * \prod_{i=2}^l\frac{\partial a^{(i)}}{\partial a^{(i-1)}}*\frac{\partial a^{(1)}}{\partial w^{(1)}} $
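As a sketch, this product can be accumulated in a simple loop; the example below assumes a chain of scalar $\tanh$ layers (as used in the derivation that follows) with made-up weights and inputs:

```python
import numpy as np

def forward(x, w):
    """Chain of scalar tanh layers: a^(0) = x, a^(i) = tanh(w^(i) * a^(i-1))."""
    a = [x]
    for wi in w:
        a.append(np.tanh(wi * a[-1]))
    return a

# Hypothetical chain with l = 3 layers, one node per layer
x, y = 0.7, 0.5
w = [0.9, -1.2, 0.4]                 # w^(1), w^(2), w^(3)
a = forward(x, w)                    # a[0] = x, a[i] = a^(i)
l = len(w)

grad = -(y - a[l])                   # ∂Loss/∂a^(l) for Loss = 1/2 (y - a^(l))^2
for i in range(l, 1, -1):            # product of ∂a^(i)/∂a^(i-1) = (1 - a^(i)^2) * w^(i)
    grad *= (1 - a[i] ** 2) * w[i - 1]
grad *= (1 - a[1] ** 2) * x          # ∂a^(1)/∂w^(1)

print(grad)                          # ∂Loss/∂w^(1)
```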
Output layer:
$ \begin{aligned}
\text{Loss}\big(y;a(x,w)\big) &=\frac{1}{2}(y-a^{(2)})^2 \\[8pt] \frac{\partial \text{Loss}}{\partial a^{(2)}} &= -(y-a^{(2)})
\end{aligned} $
Hidden layer:
$
\begin{align}
a^{(2)} &= \tanh(a^{(1)}\cdot w^{(2)})\\[4pt]
\frac{\partial a^{(2)}}{\partial a^{(1)}} &= \big(1- \tanh^2(a^{(1)}\cdot w^{(2)})\big)\cdot w^{(2)} \\[2pt] &=\big(1-(a^{(2)})^2\big)\cdot w^{(2)}
\end{align}
$
Input layer:
$
\begin{align}
a^{(1)} &= \tanh(x \cdot w^{(1)}) \\[4pt]
\frac{\partial a^{(1)}}{\partial w^{(1)}} &= \big(1- \tanh^2(x \cdot w^{(1)})\big)\cdot x\\[2pt]
&=\big(1-(a^{(1)})^2\big)\cdot x
\end{align}
$
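Putting the three factors together for this one-hidden-layer chain, and checking the result against a finite difference (all numbers below are arbitrary):

```python
import numpy as np

def forward(x, w1, w2):
    a1 = np.tanh(x * w1)             # input layer:  a^(1) = tanh(x * w^(1))
    a2 = np.tanh(a1 * w2)            # hidden layer: a^(2) = tanh(a^(1) * w^(2))
    return a1, a2

# Arbitrary scalar input, label, and weights
x, y = 0.8, 0.3
w1, w2 = 0.5, -1.5
a1, a2 = forward(x, w1, w2)

# The three factors derived above
dLoss_da2 = -(y - a2)                # ∂Loss/∂a^(2)
da2_da1   = (1 - a2 ** 2) * w2       # ∂a^(2)/∂a^(1)
da1_dw1   = (1 - a1 ** 2) * x        # ∂a^(1)/∂w^(1)
dLoss_dw1 = dLoss_da2 * da2_da1 * da1_dw1

# Finite-difference check of ∂Loss/∂w^(1)
eps = 1e-6
loss = lambda w: 0.5 * (y - forward(x, w, w2)[1]) ** 2
print(dLoss_dw1, (loss(w1 + eps) - loss(w1)) / eps)   # the two values should agree closely
```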