A neural network is a collection of interconnected nodes. Each node takes scalar inputs from other nodes and returns a scalar output.
## Node with Single Input
In this simple case, node $a^{(1)}$ receives input only from $a^{(0)}$, where the superscript indicates the layer within the network.
![[feed-forward-nn-1.png|center|400]]
$ a^{(1)} = \sigma (w a^{(0)}+b) $
|Symbol|Name|Comment|
|---|---|---|
|$a^{(i)}$|activity at layer $i$|value of the node|
|$w$|weight|scalar to be optimized|
|$b$|bias|scalar to be optimized|
|$\sigma$|activation function|e.g. tanh|
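As a minimal sketch, the single-input node can be evaluated directly from the formula above. The snippet assumes NumPy and $\tanh$ as the activation; the function and variable names are illustrative, not part of any particular library.

```python
import numpy as np

def single_input_node(a_prev, w, b):
    """Single-input node: a1 = sigma(w * a0 + b), here with sigma = tanh."""
    return np.tanh(w * a_prev + b)

# Example: previous activity 0.5, arbitrary weight and bias
print(single_input_node(0.5, w=1.3, b=-0.2))
```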
## Node with Multiple Inputs
There can also be multiple input nodes to $a^{(1)}$, each weighted by its respective parameter $w_i$.
![[feed-forward-nn-2.png|center|400]]
$ a^{(1)} = \sigma (w_0 a_0^{(0)}+ \cdots + w_n a_n^{(0)}+b) $
**Summation notation:**
$ a^{(1)}= \sigma\Big(\big(\sum_{i=0}^n w_i a_i^{(0)}\big)+b\Big) $
**Vector notation:**
$
\mathbf w= \begin{bmatrix} w_0 \\ \vdots \\ w_n \end{bmatrix}, \quad
\mathbf a^{(0)}=\begin{bmatrix} a_0^{(0)} \\ \vdots \\ a_n^{(0)} \end{bmatrix}, \quad
a^{(1)} = \sigma(\mathbf w \cdot \mathbf a^{(0)} +b)
$
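A sketch of the multi-input case, assuming NumPy; `np.dot` computes the dot product $\mathbf w \cdot \mathbf a^{(0)}$ from the vector notation (names are illustrative):

```python
import numpy as np

def node_forward(a_prev, w, b):
    """Node with multiple inputs: a1 = sigma(w . a0 + b), here with sigma = tanh."""
    return np.tanh(np.dot(w, a_prev) + b)

a0 = np.array([0.5, -1.0, 0.25])   # activities of the previous layer
w  = np.array([0.8,  0.1, -0.4])   # one weight per input node
b  = 0.1
print(node_forward(a0, w, b))      # a single scalar activity
```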
## Layer with Multiple Nodes
Sticking with the vector notation, each node within a layer can be expressed as above, with its own weight vector $\mathbf w_i$ and bias $b_i$.
$ a_0^{(1)} = \sigma(\mathbf w_0 \cdot \mathbf a^{(0)} +b_0) \\ \vdots \\ a_m^{(1)} = \sigma(\mathbf w_m \cdot \mathbf a^{(0)} +b_m) $
![[feed-forward-nn-3.png|center|400]]
When we combine all nodes of a layer (e.g. layer 1, here with 2 outputs) into a vector-valued function, we can write the whole layer as a single equation, with $\sigma$ applied element-wise.
$ a^{(1)} = \sigma \big(\mathbf W^{(1)} \cdot \mathbf a^{(0)}+\mathbf b^{(1)} \big) $
For each node of layer 1 we build a row vector containing all weights feeding into it, one entry per input node, and stack these row vectors on top of each other. $\mathbf W$ indicates that we deal with a matrix instead of a vector.
$
\mathbf W^{(1)}=
\begin{bmatrix}
w_{0,0}^{(1)} & \cdots & w_{0,n}^{(1)} \\
\vdots & \ddots & \vdots \\
w_{m,0}^{(1)} & \cdots & w_{m,n}^{(1)}
\end{bmatrix} \quad
\mathbf a^{(0)}=
\begin{bmatrix} a_0^{(0)}\\ \vdots \\a_n^{(0)}
\end{bmatrix}
\quad
\mathbf b^{(1)}=
\begin{bmatrix} b_0^{(1)}\\ \vdots \\b_m^{(1)}
\end{bmatrix}
$
I.e. row $i$ of $\mathbf W$ holds the weights feeding node $a_i^{(1)}$, while column $j$ holds the weights coming from input node $a_j^{(0)}$.
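As a sketch of the full layer equation (again assuming NumPy and tanh; names illustrative), each row of `W1` holds the weights feeding one output node:

```python
import numpy as np

def layer_forward(a_prev, W, b):
    """Whole layer: a1 = sigma(W @ a0 + b); sigma is applied element-wise."""
    return np.tanh(W @ a_prev + b)

a0 = np.array([0.5, -1.0, 0.25])      # 3 input activities
W1 = np.array([[0.8,  0.1, -0.4],     # weights into output node 0
               [0.2, -0.7,  0.5]])    # weights into output node 1
b1 = np.array([0.1, -0.3])
print(layer_forward(a0, W1, b1))      # 2 output activities
```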
## Activation Functions
**ReLU:**
$ f(z)= \max\{0,z\} $
**Hyperbolic tangent:**
$ f(z) =\tanh(z) = \frac{e^z-e^{-z}}{e^z+e^{-z}} = 1-\frac{2}{e^{2z}+1} $
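Both activations are straightforward to implement element-wise; a small sketch assuming NumPy (names illustrative):

```python
import numpy as np

def relu(z):
    """ReLU: max(0, z), applied element-wise."""
    return np.maximum(0, z)

def tanh(z):
    """Hyperbolic tangent via the identity 1 - 2 / (e^{2z} + 1)."""
    return 1 - 2 / (np.exp(2 * z) + 1)

z = np.array([-2.0, 0.0, 2.0])
print(relu(z))                            # [0. 0. 2.]
print(np.allclose(tanh(z), np.tanh(z)))   # True: matches NumPy's built-in tanh
```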