**Discriminative Models:**
These models focus on distinguishing between different classes, as in classification. They directly model the conditional probability $\mathbf{P}(Y \vert X)$ or a decision boundary, without modeling how the data is generated.
**Generative Models:**
These models aim to capture the probabilistic process that generates the data. They learn the parameters $\theta$ of a distribution $\mathbf{P}_\theta$ that describes how the observed data is sampled.
Once we have identified $\mathbf{P}_\theta$, we can:
- Assign [[Likelihood Functions|likelihoods]] to new observations, i.e. quantify how plausibly they were drawn from one distribution or another.
- Generate artificial new observations.
**Maximum Likelihood Estimation:**
[[Maximum Likelihood Estimation|MLE]] is a method for finding the parameters $\theta$ that make the observed data most probable. In binary classification, this means estimating one parameter set per class, $\{\theta^+, \theta^-\}$.
MLE involves two main steps:
1. *Estimation:* Find the optimal parameters $\{\theta^+, \theta^-\}$ that maximize the likelihood of the observed data.
2. *Prediction:* Use the estimated parameters to decide which class a new observation is more likely to belong to.
## Example: Multinomial Distribution
Assume we have a [[Multinomial Distribution]] over multiple discrete, mutually exclusive categories (outcomes). Consider classifying documents $D$ as having a positive or negative sentiment.
1. Each document $D$ consists of a sequence of word tokens $w_1, \dots, w_n$, which are assumed to be independent of each other.
2. All tokens are drawn from a fixed vocabulary set $W$ (the categories of the multinomial).
3. We want to decide whether a document $D$ was generated from $\mathbf P_{\theta^+}$ (positive-sentiment documents) or from $\mathbf P_{\theta^-}$ (negative-sentiment documents). Both distributions are parameterized by a vector of length $|W|$; a toy sampling sketch follows below.
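To make this generative story concrete, here is a minimal sketch, assuming a made-up five-word vocabulary and made-up parameter values, of sampling a toy document from $\mathbf P_{\theta^+}$ with NumPy:
```python
import numpy as np

# Toy vocabulary W and made-up positive-class parameters theta^+ (must sum to 1).
vocab = ["good", "bad", "great", "boring", "fun"]
theta_pos = np.array([0.30, 0.05, 0.30, 0.05, 0.30])

rng = np.random.default_rng(seed=0)

# A document of 8 tokens: each token is an independent draw from the
# categorical distribution over the vocabulary (the multinomial word model).
document = rng.choice(vocab, size=8, p=theta_pos)
print(list(document))
```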
**Estimation:**
The log-likelihood of observing a document $D$ given a positive sentiment is:
$
\begin{align}
\log \mathbf P(D|+) &= \log \prod_{i=1}^n \theta_{w_i}^+ \\[4pt]
&= \log \prod_{w\in W} (\theta_w^+)^{\text{count}(w)} \\[2pt]
&=\sum_{w\in W} \text{count}(w) \log \theta_w^+
\end{align}
$
To estimate each $\theta_w^+$ with MLE, we maximize the log-likelihood subject to the constraint $\sum_{w \in W}\theta_w^+ = 1$ (e.g. via a Lagrange multiplier). The resulting estimate is:
$ \hat \theta_w^+= \frac{\text{count}(w)}{\displaystyle \sum_{w^\prime \in W}\text{count}(w^\prime)} $
where the counts are taken over the positive-class training documents (and analogously for $\hat\theta_w^-$).
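As a minimal sketch (function and variable names are my own, and the input is assumed to be all tokens from one class's training documents), the estimate is just a normalized count table:
```python
from collections import Counter

def estimate_theta(tokens, vocab):
    """MLE for the multinomial word model: theta_w = count(w) / total count."""
    counts = Counter(tokens)
    total = sum(counts[w] for w in vocab)
    # Plain MLE: a word never seen in this class gets probability 0,
    # which makes its log-probability undefined at prediction time.
    return {w: counts[w] / total for w in vocab}

# Toy usage with a made-up positive-class corpus.
vocab = ["good", "bad", "great", "boring", "fun"]
positive_tokens = ["good", "great", "fun", "good", "great", "great"]
theta_pos = estimate_theta(positive_tokens, vocab)
print(theta_pos)  # e.g. {'good': 0.333..., 'bad': 0.0, 'great': 0.5, ...}
```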
**Prediction:**
To classify a new document, we compute the posterior probability for each class and select the class with the higher posterior. Using [[Bayes Rule]], the decision criterion becomes:
$
\log\frac{\overbrace{\mathbf P(D|+)\cdot\mathbf P(+)}^{\propto\ \text{Posterior}}}{\mathbf P(D|-)\cdot\mathbf P(-)}
\begin{cases} \ge 0,& +\\ < 0,& - \end{cases}
$
Since $\log\frac{a\,b}{c\,d} = \log\frac{a}{c} + \log\frac{b}{d}$, this splits into a likelihood term and a prior term:
$
\underbrace{\log\frac{\mathbf P(D|+)}{\mathbf P(D|-)}}_{\text{Likelihoods}} + \underbrace{\log\frac{\mathbf P(+)}{\mathbf P(-)}}_{\text{Priors}} \begin{cases} \ge 0,& +\\ < 0,& - \end{cases}
$
The log-likelihood ratio boils down to the following:
$
\begin{align}
\log\frac{\mathbf P(D|+)}{\mathbf P(D|-)} &= \sum_{w\in W} \text{count}(w) \log \theta_w^+ -\sum_{w\in W} \text{count}(w) \log \theta_w^-\\[14pt]
&=\sum_{w\in W} \text{count}(w)\cdot \underbrace{\Big(\log \frac{\theta_w^+}{\theta_w^-}\Big)}_{\hat\theta_w}
\end{align}
$
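The same decision rule as a minimal Python sketch (names are my own; `theta_pos` and `theta_neg` are per-class estimates as above, and equal priors are assumed by default):
```python
import math
from collections import Counter

def classify(doc_tokens, theta_pos, theta_neg, prior_pos=0.5, prior_neg=0.5):
    """Return '+' if sum_w count(w)*log(theta_w+/theta_w-) + log(P(+)/P(-)) >= 0."""
    counts = Counter(doc_tokens)
    score = math.log(prior_pos / prior_neg)
    for w, c in counts.items():
        score += c * math.log(theta_pos[w] / theta_neg[w])
    return "+" if score >= 0 else "-"

# Toy usage with made-up, strictly positive parameter estimates for both classes.
theta_pos = {"good": 0.30, "bad": 0.05, "great": 0.30, "boring": 0.05, "fun": 0.30}
theta_neg = {"good": 0.05, "bad": 0.30, "great": 0.05, "boring": 0.30, "fun": 0.30}
print(classify(["good", "great", "boring"], theta_pos, theta_neg))  # '+'
```
Note that the log-ratios $\log\frac{\theta_w^+}{\theta_w^-}$ act as fixed per-word weights, so the classifier is linear in the word counts.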