**Discriminative Models:**
These models focus on distinguishing between different classes, as in classification. They directly model the conditional probability $\mathbf{P}(Y \vert X)$ or a decision boundary, without modeling how the data is generated.
**Generative Models:**
These models aim to capture the probabilistic process that generates the data. They learn the parameters $\theta$ of a distribution $\mathbf{P}_\theta$ that describes how the observed data is sampled.
Once we have identified $\mathbf{P}_\theta$, we can:
- Assign [[Likelihood Functions|likelihoods]] to new observations, i.e. quantify how plausibly they were drawn from one distribution or another.
- Generate artificial new observations.
**Maximum Likelihood Estimation:**
[[Maximum Likelihood Estimation|MLE]] is a method for finding the parameters $\theta$ that make the observed data most probable. In binary classification, this means estimating one parameter set per class, $\{\theta^+, \theta^-\}$.
MLE involves two main steps:
1. *Estimation:* Find the optimal parameters $\{\theta^+, \theta^-\}$ that maximize the likelihood of the observed data.
2. *Prediction:* Use the estimated parameters to decide which class a new observation is more likely to belong to.
## Example: Multinomial Distribution
Assume we have a [[Multinomial Distribution]] over multiple discrete, mutually exclusive categories (outcomes). Consider classifying documents $D$ as having a positive or negative sentiment.
1. Each document $D$ consists of a sequence of word tokens $w_1, \dots, w_n$, which are assumed to be independent of each other.
2. All tokens are drawn from a fixed vocabulary set $W$ (the categories of the multinomial).
3. We want to decide whether a document $D$ was generated from $\mathbf P_{\theta^+}$ (positive-sentiment documents) or from $\mathbf P_{\theta^-}$ (negative-sentiment documents). Both distributions are parameterized by a vector of length $|W|$; a toy sampling sketch follows below.
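To make this generative story concrete, here is a minimal sketch, assuming a made-up five-word vocabulary and made-up parameter values, of sampling a toy document from $\mathbf P_{\theta^+}$ with NumPy:
```python
import numpy as np

# Toy vocabulary W and made-up positive-class parameters theta^+ (must sum to 1).
vocab = ["good", "bad", "great", "boring", "fun"]
theta_pos = np.array([0.30, 0.05, 0.30, 0.05, 0.30])

rng = np.random.default_rng(seed=0)

# A document of 8 tokens: each token is an independent draw from the
# categorical distribution over the vocabulary (the multinomial word model).
document = rng.choice(vocab, size=8, p=theta_pos)
print(list(document))
```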
**Estimation:**
The log-likelihood of observing a document $D$ given a positive sentiment is:
$
\begin{align}
\log \mathbf P(D|+) &= \log \prod_{i=1}^n \theta_{w_i}^+ \\[4pt]
&= \log \prod_{w\in W} (\theta_w^+)^{\text{count}(w)} \\[2pt]
&=\sum_{w\in W} \text{count}(w) \log \theta_w^+
\end{align}
$
To estimate each $\theta_w^+$ with MLE, we maximize the log-likelihood subject to the constraint $\sum_{w \in W}\theta_w^+ = 1$ (e.g. via a Lagrange multiplier). The resulting estimate is:
$ \hat \theta_w^+= \frac{\text{count}(w)}{\displaystyle \sum_{w^\prime \in W}\text{count}(w^\prime)} $
where the counts are taken over the positive-class training documents (and analogously for $\hat\theta_w^-$).
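As a minimal sketch (function and variable names are my own, and the input is assumed to be all tokens from one class's training documents), the estimate is just a normalized count table:
```python
from collections import Counter

def estimate_theta(tokens, vocab):
    """MLE for the multinomial word model: theta_w = count(w) / total count."""
    counts = Counter(tokens)
    total = sum(counts[w] for w in vocab)
    # Plain MLE: a word never seen in this class gets probability 0,
    # which makes its log-probability undefined at prediction time.
    return {w: counts[w] / total for w in vocab}

# Toy usage with a made-up positive-class corpus.
vocab = ["good", "bad", "great", "boring", "fun"]
positive_tokens = ["good", "great", "fun", "good", "great", "great"]
theta_pos = estimate_theta(positive_tokens, vocab)
print(theta_pos)  # e.g. {'good': 0.333..., 'bad': 0.0, 'great': 0.5, ...}
```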
**Prediction:**
To classify a new document, we compute the posterior probability for each class and select the class with the higher posterior. Using [[Bayes Rule]], the decision criterion becomes:
$
\log\frac{\overbrace{\mathbf P(D|+)\cdot\mathbf P(+)}^{\propto\ \text{Posterior}}}{\mathbf P(D|-)\cdot\mathbf P(-)}
\begin{cases} \ge 0,& +\\ < 0,& - \end{cases}
$
Since $\log\frac{a\,b}{c\,d} = \log\frac{a}{c} + \log\frac{b}{d}$, this splits into a likelihood term and a prior term:
$
\underbrace{\log\frac{\mathbf P(D|+)}{\mathbf P(D|-)}}_{\text{Likelihoods}} + \underbrace{\log\frac{\mathbf P(+)}{\mathbf P(-)}}_{\text{Priors}} \begin{cases} \ge 0,& +\\ < 0,& - \end{cases}
$
The log-likelihood ratio boils down to the following:
$
\begin{align}
\log\frac{\mathbf P(D|+)}{\mathbf P(D|-)} &= \sum_{w\in W} \text{count}(w) \log \theta_w^+ -\sum_{w\in W} \text{count}(w) \log \theta_w^-\\[14pt]
&=\sum_{w\in W} \text{count}(w)\cdot \underbrace{\Big(\log \frac{\theta_w^+}{\theta_w^-}\Big)}_{\hat\theta_w}
\end{align}
$
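The same decision rule as a minimal Python sketch (names are my own; `theta_pos` and `theta_neg` are per-class estimates as above, and equal priors are assumed by default):
```python
import math
from collections import Counter

def classify(doc_tokens, theta_pos, theta_neg, prior_pos=0.5, prior_neg=0.5):
    """Return '+' if sum_w count(w)*log(theta_w+/theta_w-) + log(P(+)/P(-)) >= 0."""
    counts = Counter(doc_tokens)
    score = math.log(prior_pos / prior_neg)
    for w, c in counts.items():
        score += c * math.log(theta_pos[w] / theta_neg[w])
    return "+" if score >= 0 else "-"

# Toy usage with made-up, strictly positive parameter estimates for both classes.
theta_pos = {"good": 0.30, "bad": 0.05, "great": 0.30, "boring": 0.05, "fun": 0.30}
theta_neg = {"good": 0.05, "bad": 0.30, "great": 0.05, "boring": 0.30, "fun": 0.30}
print(classify(["good", "great", "boring"], theta_pos, theta_neg))  # '+'
```
Note that the log-ratios $\log\frac{\theta_w^+}{\theta_w^-}$ act as fixed per-word weights, so the classifier is linear in the word counts.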