A Markov language model is a probabilistic model that generates sequences (e.g. of words) based on the [[Markov Chain|Markov assumption]] of memorylessness: it assumes that the probability of each element in the sequence depends only on a fixed number of preceding elements.
**Bigram model:**
It is the simplest variant: it looks only at the previous word $w_{i-1}$ to estimate the probability of the next word $w_i$, i.e. the conditional probability $\mathbf P(w_i \vert w_{i-1})$. The goal is to obtain a table of all possible word pairs with their corresponding conditional probabilities.
![[markov-lanuage-model.png|center|450]]
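A minimal sketch of how such a table is used to sample the next word. The table below is a made-up toy example, not estimated from any real corpus:
```python
# Hypothetical toy table of conditional probabilities P(w_i | w_{i-1});
# in practice these are estimated from a corpus (see MLE below).
import random

bigram_probs = {
    "the": {"cat": 0.5, "dog": 0.5},
    "cat": {"sat": 0.7, "ran": 0.3},
    "dog": {"sat": 0.4, "ran": 0.6},
}

def next_word(prev_word: str) -> str:
    """Sample w_i given w_{i-1} according to the table."""
    dist = bigram_probs[prev_word]
    words, probs = zip(*dist.items())
    return random.choices(words, weights=probs, k=1)[0]

print(next_word("the"))  # e.g. "cat"
```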
## Maximum Likelihood Estimation
Assume we have a corpus $S$ consisting of sentences $s$, where each sentence can have a different length $\lvert s \rvert$.
$ s \in S, \quad s= \left ( w_1^{(s)},w_2^{(s)}, \dots ,w_{\lvert s \rvert}^{(s)} \right ) $
We want to set the probabilities such that recreating all sentences $s$ is most likely ([[Maximum Likelihood Estimation|MLE]]). Therefore we iterate over all words within a sentence $s$ and then over all sentences $s \in S$ of the corpus. We take the $\log$ because it turns the products into sums, which are much easier to differentiate.
$ \max \left [ \log \prod_{s \in S} \prod_{i=1}^{\lvert s \rvert} \mathbf P \big(w_i^{(s)} \,\big\vert\, w_{i-1}^{(s)}\big) \right ] $
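Writing out the log identity makes explicit what is actually maximized, a double sum of log probabilities:
$ \log \prod_{s \in S} \prod_{i=1}^{\lvert s \rvert} \mathbf P \big(w_i^{(s)} \,\big\vert\, w_{i-1}^{(s)}\big) = \sum_{s \in S} \sum_{i=1}^{\lvert s \rvert} \log \mathbf P \big(w_i^{(s)} \,\big\vert\, w_{i-1}^{(s)}\big) $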
**Process:**
- We collect a corpus of sentences.
- We count the frequency of each bigram (i.e. how often $w_{i-1}$ is directly followed by $w_i$).
- We normalize the frequencies to obtain probabilities.
$ \hat {\mathbf P}(w^\prime \,\vert\, w) = \frac{\text{count}(w, w^\prime)}{\sum_{v}\text{count}(w, v)} $
Here $v$ ranges over all possible words (the vocabulary).
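A minimal sketch of this counting-and-normalizing process, assuming an already tokenized toy corpus (the sentences below are placeholders):
```python
# Count bigrams and normalize each row over the vocabulary (MLE estimate).
from collections import defaultdict

corpus = [
    ["the", "cat", "sat"],
    ["the", "dog", "sat"],
    ["the", "cat", "ran"],
]

counts = defaultdict(lambda: defaultdict(int))
for sentence in corpus:
    for prev, curr in zip(sentence, sentence[1:]):
        counts[prev][curr] += 1                     # count(w, w')

probs = {
    prev: {curr: c / sum(following.values())        # divide by sum_v count(w, v)
           for curr, c in following.items()}
    for prev, following in counts.items()
}

print(probs["the"])  # {'cat': 0.666..., 'dog': 0.333...}
```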
## Feedforward Network Equivalent
We can build the same bigram model with a [[Feed Forward Neural Network]], where the input layer is a one-hot [[Vector Operations|Vector]]: a zero vector of length $\lvert v \rvert$ (the vocabulary size) with a single $1$ at the index of $w_{i-1}$.
![[markov-language-model-2.png|center|450]]
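A rough NumPy sketch of this setup, assuming a toy vocabulary and untrained random weights (biases are omitted here):
```python
# One-hot input of size |v| times a |v| x |v| weight matrix, followed by a
# softmax, gives a distribution over the next word.
import numpy as np

vocab = ["the", "cat", "dog", "sat", "ran"]
V = len(vocab)

rng = np.random.default_rng(0)
W = rng.normal(size=(V, V))          # |v|^2 parameters, one logit per word pair

def next_word_distribution(prev_word: str) -> np.ndarray:
    x = np.zeros(V)
    x[vocab.index(prev_word)] = 1.0  # one-hot encoding of w_{i-1}
    logits = x @ W                   # selects the row of W belonging to w_{i-1}
    exp = np.exp(logits - logits.max())
    return exp / exp.sum()           # softmax -> P(w_i | w_{i-1})

print(next_word_distribution("the"))
```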
In both cases the number of parameters is $\lvert v \rvert^2$. However, when we extend to a trigram model and therefore take the last 2 words as input, we get the following parameter counts:
- *Markov model:* The table of probabilities has $\lvert v \rvert^2$ rows and $\lvert v \rvert$ columns. This leads to $\lvert v \rvert^3$ parameters.
$ O\big(\lvert v \rvert^2 \times \lvert v \rvert \big) $
- *FFN:* The input layer is the concatenation of 2 one-hot encoded vectors (length $2\lvert v \rvert$), while the output layer is of size $\lvert v \rvert$. Since there is no hidden layer, we simply connect every input node with every output node, which gives $2\lvert v \rvert^2$ weights. Additionally, there is one bias per output node, i.e. $\lvert v \rvert$ biases. A quick numeric check of both counts follows the figure below.
$ O(2\lvert v \rvert ^2 + \lvert v \rvert) $
![[markov-language-model-3.png|center|350]]
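A quick sanity check of these counts with an arbitrary toy vocabulary size:
```python
# Trigram parameter counts; V is an illustrative vocabulary size, not from the note.
V = 5                                   # toy |v|

markov_params = V**2 * V                # |v|^2 context rows, |v| columns
ffn_weights   = (2 * V) * V             # two concatenated one-hot inputs -> |v| outputs
ffn_biases    = V                       # one bias per output node

print(markov_params)                    # 125 = |v|^3
print(ffn_weights + ffn_biases)         # 55  = 2|v|^2 + |v|
```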