A Markov language model is a probabilistic model that generates sequences (e.g. of words) based on the [[Markov Chain|Markov assumption]] of memorylessness: it assumes that the probability of each element in the sequence depends only on a fixed number of preceding elements.
**Bigram model:**
It is the simplest variant: it looks only at the previous word $w_{i-1}$ to estimate the probability of the next word $w_i$, i.e. the conditional probability $\mathbf P(w_i \vert w_{i-1})$. The goal is to obtain a table of all possible word pairs with their corresponding conditional probabilities.
![[markov-lanuage-model.png|center|450]]
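A minimal sketch of how such a table is used to sample the next word. The table below is a made-up toy example, not estimated from any real corpus:
```python
# Hypothetical toy table of conditional probabilities P(w_i | w_{i-1});
# in practice these are estimated from a corpus (see MLE below).
import random

bigram_probs = {
    "the": {"cat": 0.5, "dog": 0.5},
    "cat": {"sat": 0.7, "ran": 0.3},
    "dog": {"sat": 0.4, "ran": 0.6},
}

def next_word(prev_word: str) -> str:
    """Sample w_i given w_{i-1} according to the table."""
    dist = bigram_probs[prev_word]
    words, probs = zip(*dist.items())
    return random.choices(words, weights=probs, k=1)[0]

print(next_word("the"))  # e.g. "cat"
```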
## Maximum Likelihood Estimation
Assume we have a corpus $S$ consisting of sentences $s$, where each sentence can have a different length $\lvert s \rvert$.
$ s \in S, \quad s= \left ( w_1^{(s)},w_2^{(s)}, \dots ,w_{\lvert s \rvert}^{(s)} \right ) $
We want to set the probabilities such that recreating all sentences $s$ is most likely ([[Maximum Likelihood Estimation|MLE]]). Therefore we iterate over all words within a sentence $s$ and then over all sentences $s \in S$ of the corpus. We take the $\log$ because it turns the products into sums, which are much easier to differentiate.
$ \max \left [ \log \prod_{s \in S} \prod_{i=1}^{\lvert s \rvert} \mathbf P \big(w_i^{(s)} \,\big\vert\, w_{i-1}^{(s)}\big) \right ] $
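Writing out the log identity makes explicit what is actually maximized, a double sum of log probabilities:
$ \log \prod_{s \in S} \prod_{i=1}^{\lvert s \rvert} \mathbf P \big(w_i^{(s)} \,\big\vert\, w_{i-1}^{(s)}\big) = \sum_{s \in S} \sum_{i=1}^{\lvert s \rvert} \log \mathbf P \big(w_i^{(s)} \,\big\vert\, w_{i-1}^{(s)}\big) $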
**Process:**
- We collect a corpus of sentences.
- We count the frequency of each bigram (i.e. how often $w_{i-1}$ is directly followed by $w_i$).
- We normalize the frequencies to obtain probabilities.
$ \hat {\mathbf P}(w^\prime \,\vert\, w) = \frac{\text{count}(w, w^\prime)}{\sum_{v}\text{count}(w, v)} $
Here $v$ ranges over all possible words (the vocabulary).
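A minimal sketch of this counting-and-normalizing process, assuming an already tokenized toy corpus (the sentences below are placeholders):
```python
# Count bigrams and normalize each row over the vocabulary (MLE estimate).
from collections import defaultdict

corpus = [
    ["the", "cat", "sat"],
    ["the", "dog", "sat"],
    ["the", "cat", "ran"],
]

counts = defaultdict(lambda: defaultdict(int))
for sentence in corpus:
    for prev, curr in zip(sentence, sentence[1:]):
        counts[prev][curr] += 1                     # count(w, w')

probs = {
    prev: {curr: c / sum(following.values())        # divide by sum_v count(w, v)
           for curr, c in following.items()}
    for prev, following in counts.items()
}

print(probs["the"])  # {'cat': 0.666..., 'dog': 0.333...}
```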
## Feedforward Network Equivalent
We can build the same bigram model with a [[Feed Forward Neural Network]], where the input layer is a one-hot [[Vector Operations|Vector]]: a zero vector of length $\lvert v \rvert$ (the vocabulary size) with a single $1$ at the index of $w_{i-1}$.
![[markov-language-model-2.png|center|450]]
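A rough NumPy sketch of this setup, assuming a toy vocabulary and untrained random weights (biases are omitted here):
```python
# One-hot input of size |v| times a |v| x |v| weight matrix, followed by a
# softmax, gives a distribution over the next word.
import numpy as np

vocab = ["the", "cat", "dog", "sat", "ran"]
V = len(vocab)

rng = np.random.default_rng(0)
W = rng.normal(size=(V, V))          # |v|^2 parameters, one logit per word pair

def next_word_distribution(prev_word: str) -> np.ndarray:
    x = np.zeros(V)
    x[vocab.index(prev_word)] = 1.0  # one-hot encoding of w_{i-1}
    logits = x @ W                   # selects the row of W belonging to w_{i-1}
    exp = np.exp(logits - logits.max())
    return exp / exp.sum()           # softmax -> P(w_i | w_{i-1})

print(next_word_distribution("the"))
```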
In both cases the number of parameters is $\lvert v \rvert^2$. However, when we extend to a trigram model and therefore take the last 2 words as input, we get the following parameter counts:
- *Markov model:* The table of probabilities has $\lvert v \rvert^2$ rows and $\lvert v \rvert$ columns. This leads to $\lvert v \rvert^3$ parameters.
$ O\big(\lvert v \rvert^2 \times \lvert v \rvert \big) $
- *FFN:* The input layer is the concatenation of 2 one-hot encoded vectors (length $2\lvert v \rvert$), while the output layer is of size $\lvert v \rvert$. Since there is no hidden layer, we simply connect every input node with every output node, which gives $2\lvert v \rvert^2$ weights. Additionally, there is one bias per output node, i.e. $\lvert v \rvert$ biases. A quick numeric check of both counts follows the figure below.
$ O(2\lvert v \rvert ^2 + \lvert v \rvert) $
![[markov-language-model-3.png|center|350]]
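A quick sanity check of these counts with an arbitrary toy vocabulary size:
```python
# Trigram parameter counts; V is an illustrative vocabulary size, not from the note.
V = 5                                   # toy |v|

markov_params = V**2 * V                # |v|^2 context rows, |v| columns
ffn_weights   = (2 * V) * V             # two concatenated one-hot inputs -> |v| outputs
ffn_biases    = V                       # one bias per output node

print(markov_params)                    # 125 = |v|^3
print(ffn_weights + ffn_biases)         # 55  = 2|v|^2 + |v|
```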