RNNs are designed for *sequential data*, where the order of occurrences matters. They process inputs one at a time and maintain a hidden state, enabling them to model temporal dependencies effectively.
**Comparison of approaches:**
- *Traditional ML:* We could still try to predict sequential data in the usual supervised way, by considering lagged features. However, such a feature mapping treats each input independently and does not take into account the order of the inputs.
- *RNNs:* They are a better choice when working with sequential data because they can capture temporal dependencies and make predictions based on the previous inputs in the sequence.
**Components of RNNs:**
- *Encoding:* Transforms the observed sequence into an encoded vector.
- *Decoding:* Makes predictions on the encoded vector.
## Encoding
In encoding, we are always in a current state $s_{t-1}$; we add new information $x_t$ to it and thereby create a new state $s_t$. The weight matrices decide which part of the current state is kept and which part of the new information is added.
$
s_t = \tanh(\underbrace{W^{s,s}*s_{t-1}}_{\text{what to keep}}+\underbrace{W^{s,x}*x_{t}}_{\text{what to add}})
$
![[rnn-encoding-1.png|center|400]]
**Variables:**
| Notation | Dimension | Description |
| --------- | ------------ | ------------------------------------------------------------- |
| $s_t$ | $m \times 1$ | Vector capturing the new state. |
| $s_{t-1}$ | $m \times 1$ | Vector capturing the current state. |
| $x_t$ | $d \times 1$ | Raw feature vector with new information. |
| $W^{s,s}$ | $m \times m$ | Matrix deciding what part of the previous state to keep. |
| $W^{s,x}$ | $m \times d$ | Matrix deciding what part of the new information to consider. |
The full encoding process, e.g. of a sentence, starts with a null vector and looks as follows.
![[rnn-encoding-2.png|center|500]]
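To make the shapes concrete, here is a minimal NumPy sketch of this encoding loop; the names (`encode`, `W_ss`, `W_sx`) are illustrative choices, not part of any library.

```python
import numpy as np

def encode(xs, W_ss, W_sx):
    """Encode a sequence of raw feature vectors xs into a single state vector."""
    m = W_ss.shape[0]
    s = np.zeros(m)                       # s_0: the null vector the process starts with
    for x in xs:                          # one update per element of the sequence
        s = np.tanh(W_ss @ s + W_sx @ x)  # s_t = tanh(W^{s,s} s_{t-1} + W^{s,x} x_t)
    return s                              # encoded vector summarizing the whole sequence

# Usage with random parameters: d = 3 raw features, m = 4 hidden units
rng = np.random.default_rng(0)
W_ss, W_sx = rng.normal(size=(4, 4)), rng.normal(size=(4, 3))
sentence = [rng.normal(size=3) for _ in range(5)]   # e.g. 5 word embeddings
print(encode(sentence, W_ss, W_sx))                 # 4-dimensional encoding
```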
**Comparing with Feed Forward Networks:**
|                 | [[Feed Forward Neural Network]]                                | Recurrent<br>Neural Network                                                                                                                  |
| --------------- | -------------------------------------------------------------- | -------------------------------------------------------------------------------------------------------------------------------------------- |
| **Input:**      | Receives input only at the beginning through the input layer.  | Receives input at each layer of the network.                                                                                                   |
| **Layers:**     | Fixed architecture.                                             | The number of layers is not pre-defined; it depends on the length of the sequence we want to encode, e.g. the number of words in a sentence.  |
| **Parameters:** | Individual parameter values $w, b$ for each perceptron.        | The same parameters $\theta$ are applied at all layers.                                                                                        |
## Decoding
In the decoding phase, the final hidden state of the encoder is used as the initial state of the decoder. The decoder predicts output step by step using the updated hidden states.
![[rnn-decoding.png|center|400]]
**Output prediction:** At each step $t$ we decode the current hidden state $s_t$ (e.g. with [[Softmax]] and argmax) to obtain the output, which is fed back as the input $x_{t+1}$ at the next step.
$ p_t = \text{softmax}(W^0 \, s_t) $
**Hidden state update:** At the next step the hidden state is updated from the previous state and the previously decoded value.
$ s_t = \tanh(W^{s,s}* s_{t-1}+W^{s,x} *x_t) $
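A minimal sketch of this greedy decoding loop, assuming an output matrix `W_o` playing the role of $W^0$ and a hypothetical `embed` function that maps a predicted token id back to its feature vector:

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())               # shift for numerical stability
    return e / e.sum()

def decode(s, W_o, W_ss, W_sx, embed, steps):
    """Greedy decoding starting from the encoder's final hidden state s."""
    outputs = []
    for _ in range(steps):
        p = softmax(W_o @ s)              # p_t = softmax(W^0 s_t)
        token = int(np.argmax(p))         # greedy choice via argmax
        outputs.append(token)
        x = embed(token)                  # hypothetical: token id -> feature vector
        s = np.tanh(W_ss @ s + W_sx @ x)  # hidden state update with the decoded value
    return outputs                        # a real decoder would stop at an end token
```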
## Gated RNN
This variation additionally has the parameter matrices $W^{g,s}$ and $W^{g,x}$, which decide what piece of information is retained or overwritten when updating $s_t$. Thus $g_t$ can be described as a “forget gate”.
$
\overbrace{g_t}^{m \times 1} = \text{sigmoid}( \overbrace{W^{g,s}}^{m \times m}* \overbrace{s_{t-1}\vphantom{h}}^{m \times 1}+ \overbrace{W^{g,x}}^{m \times d}* \overbrace{x_{t}\vphantom{h}}^{d \times 1})
$
Since $g_t$ comes from a sigmoid function, its values lie in $[0,1]$. And since it is an $(m \times 1)$ vector, it can be used for elementwise weighting between each element of $s_{t-1}$ and the corresponding element of the regular $\tanh(\cdot)$ update.
$ s_t=(1-g_t) \odot s_{t-1}+g_t\odot \tanh(W^{s,s}*s_{t-1}+W^{s,x}*x_t) $
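A minimal sketch of one gated update step, with parameter names mirroring $W^{s,s}, W^{s,x}, W^{g,s}, W^{g,x}$ (illustrative, not a library API):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def gated_step(s_prev, x, W_ss, W_sx, W_gs, W_gx):
    """One gated-RNN update of the hidden state."""
    g = sigmoid(W_gs @ s_prev + W_gx @ x)          # gate g_t, elementwise in [0, 1]
    candidate = np.tanh(W_ss @ s_prev + W_sx @ x)  # regular tanh update
    return (1 - g) * s_prev + g * candidate        # elementwise blend of old state and update
```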
## LSTM
LSTMs improve on Gated RNNs by introducing additional gates and a memory cell $c_t$. These gates manage long-term and short-term dependencies.
**Components:**
| Gate type | Equation |
| ----------- | ---------------------------------------------------- |
| Forget gate | $f_t=\text{sigmoid}(W^{f,h}\,h_{t-1}+W^{f,x} \,x_t)$ |
| Input gate | $i_t=\text{sigmoid}(W^{i,h}\,h_{t-1}+W^{i,x} \,x_t)$ |
| Output gate | $o_t=\text{sigmoid}(W^{o,h}\,h_{t-1}+W^{o,x} \,x_t)$ |
**Visible state update:** Takes $\tanh(c_t)$ of the memory cell $c_t$, weighted elementwise by the output gate $o_t$.
$ h_t=o_t \odot \tanh(c_t) $
**Memory cell update:** Similar to the gated RNN: the previous cell state $c_{t-1}$ is weighted by the forget gate $f_t$, and the regular $\tanh$ update by the input gate $i_t$.
$ c_t=f_t \odot c_{t-1}+ i_t \odot \tanh(W^{c,h} \, h_{t-1}+W^{c,x} \, x_t) $
![[lstm.png|center|400]]
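A minimal sketch of one LSTM step, assuming the eight weight matrices are passed in a dict keyed after the superscripts in the equations above (an illustrative layout, not a library API):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(h_prev, c_prev, x, W):
    """One LSTM update of the visible state h and the memory cell c."""
    f = sigmoid(W["f,h"] @ h_prev + W["f,x"] @ x)                   # forget gate
    i = sigmoid(W["i,h"] @ h_prev + W["i,x"] @ x)                   # input gate
    o = sigmoid(W["o,h"] @ h_prev + W["o,x"] @ x)                   # output gate
    c = f * c_prev + i * np.tanh(W["c,h"] @ h_prev + W["c,x"] @ x)  # memory cell update
    h = o * np.tanh(c)                                              # visible state update
    return h, c
```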