The Kullback-Leibler ("KL") divergence is also known as "relative entropy". It measures how one probability distribution $\mathbf P_{\theta^\prime}$ diverges from another $\mathbf P_\theta$. Concretely, it compares $\mathbf P_{\theta^\prime}$ with $\mathbf P_\theta$ at every point where $\mathbf P_{\theta^\prime}$ assigns positive probability. Unlike the [[Total Variation Distance]], the KL divergence is asymmetric: it is computed from the perspective of $\mathbf P_{\theta^\prime}$.
>[!note:]
>For $\text{KL}(\mathbf P_{\theta^\prime}, \mathbf P_\theta)$ to be well-defined, $\mathbf P_{\theta^\prime}$ needs to be "absolutely continuous" w.r.t. $\mathbf P_\theta$. This means that $\mathbf P_{\theta^\prime}$ must not assign positive probability to any point that $\mathbf P_\theta$ considers impossible. A violation would mean dividing by $0$ inside the logarithm.
**Discrete:**
$ \mathrm{KL}(\mathbf P_{\theta^\prime}, \mathbf P_\theta)=\sum_{x \in E}p_{\theta^\prime}(x) \log\Big(\frac{p_{\theta^\prime}(x)}{p_\theta(x)}\Big) $
**Continuous:**
$ \mathrm{KL}(\mathbf P_{\theta^\prime}, \mathbf P_\theta)=\int_{E}f_{\theta^\prime}(x) \log\Big(\frac{f_{\theta^\prime}(x)}{f_\theta(x)}\Big) \, dx $
where $E$ is the support of the distribution $\mathbf P_{\theta^\prime}$, not of $\mathbf P_\theta$.
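As a sanity check, the discrete formula is straightforward to evaluate directly. Below is a minimal Python sketch (the helper name `kl_divergence` and the example PMFs are made up for illustration) that computes $\mathrm{KL}(\mathbf P_{\theta^\prime}, \mathbf P_\theta)$ over a shared finite support:
```python
import numpy as np

def kl_divergence(p, q):
    """Discrete KL(p, q) = sum over x of p(x) * log(p(x) / q(x)).

    Sums only over points with p(x) > 0; requires q(x) > 0 wherever
    p(x) > 0 (absolute continuity), otherwise the divergence is infinite.
    """
    p, q = np.asarray(p, dtype=float), np.asarray(q, dtype=float)
    mask = p > 0                       # uses the convention 0 * log(0) = 0
    if np.any(q[mask] == 0):
        return np.inf                  # absolute continuity violated
    return float(np.sum(p[mask] * np.log(p[mask] / q[mask])))

# Two example PMFs over the support E = {0, 1, 2}
p_prime = [0.5, 0.3, 0.2]   # plays the role of P_theta'
p_theta = [0.4, 0.4, 0.2]   # plays the role of P_theta
print(kl_divergence(p_prime, p_theta))  # ~0.0253
```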
**Properties:**
| | Formulation | Comment |
| ------------ | -------------------------------------------------------------------------------------------------------------------------------------- | ------------------------------------------------------------------------------------------------------------------------------------------------------- |
| Asymmetric | $\mathrm{KL}(\mathbf P_{\theta^\prime}, \mathbf P_\theta) \not=\mathrm{KL}(\mathbf P_\theta,\mathbf P_{\theta^\prime})$ | In general the order matters here. |
| Non-negative | $\mathrm{KL}(\mathbf P_{\theta^\prime}, \mathbf P_\theta) \ge 0$ | Individual terms of the sum can be negative, but the total never is; this follows from Jensen's inequality applied to the concave logarithm. |
| Definite | $\mathrm{KL}(\mathbf P_{\theta^\prime}, \mathbf P_\theta)=0 \iff \mathbf P_{\theta^\prime} \equiv \mathbf P_\theta$ | This property is important as it makes the true parameter $\theta^\star$ the unique minimizer of $\theta \mapsto \mathrm{KL}(\mathbf P_{\theta^\star}, \mathbf P_\theta)$ (assuming the model is identifiable). |
Due to this asymmetry, KL is not considered a distance, but a divergence.
>[!note:]
>Some people write $\text{KL}(\mathbf P_{\theta^\prime} \parallel \mathbf P_\theta)$ to highlight that the divergence is asymmetric, i.e. that the order of the distributions matters.
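To make the asymmetry concrete, here is a small check reusing the hypothetical `kl_divergence` helper from above on two Bernoulli PMFs; swapping the arguments changes the value, while both directions stay non-negative:
```python
# Bernoulli PMFs over {0, 1}: P(X=0) first, then P(X=1)
bern_a = [0.1, 0.9]   # Bernoulli(0.9)
bern_b = [0.5, 0.5]   # Bernoulli(0.5)

print(kl_divergence(bern_a, bern_b))  # ~0.368
print(kl_divergence(bern_b, bern_a))  # ~0.511  (different order, different value)
```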
## Estimation
The KL-divergence can be written as an [[Expectation]]. This makes it usable for estimation, because we can then replace the expectation with a sample average by the [[Law of Large Numbers|LLN]].
The KL-divergence formula sums over all possible $x$ and weights each $\log(\cdot)$ term by the PMF $p_{\theta^\prime}(x)$. This is exactly the expectation of the log-ratio $\log\big(p_{\theta^\prime}(X)/p_\theta(X)\big)$ under $\mathbf P_{\theta^\prime}$.
$
\begin{align}
\mathrm{KL}(\mathbf P_{\theta^\prime},
\mathbf P_\theta)&=\sum_{x \in E}p_{\theta^\prime}(x) \cdot\log\Big(\frac{p_{\theta^\prime}(x)}{p_\theta(x)}\Big) \tag{1}\\[14pt]
&=\mathbb E_{\theta^\prime}\left [\log\left(\frac{p_{\theta^\prime}(X)}{p_\theta(X)}\right)\right] \tag{2}\\[12pt]
&=\mathbb E_{\theta^\prime}\big[\log p_{\theta^\prime}(X) \big] - \mathbb E_{\theta^\prime}\big[\log p_{\theta}(X) \big]\tag{3}\\[10pt]
&=c-\mathbb E_{\theta^\prime}\big[\log p_{\theta}(X) \big] \tag{4}\\[10pt]
\end{align}
$
where:
- (2) The log of a fraction is the log of the numerator minus the log of the denominator. By [[Linearity of Expectations]], we can then split the expectation into two separate expectations.
- (3) Since $\theta^\prime$ is a fixed (although unknown) parameter, the first expectation $\mathbb E_{\theta^\prime}\big[\log p_{\theta^\prime}(X)\big]$ does not depend on $\theta$, so we abbreviate it by the constant $c$. The constant shifts the value of $\text{KL}$ itself, but not the location of its minimum over $\theta$.
>[!note:]
>While $X$ is a random variable, its expectation can be computed under different assumed distributions (each assigning different weights to the values $x$). The subscript in $\mathbb E_{\theta^\prime}[\cdot]$ indicates that the expectation is taken assuming $X \sim \mathbf P_{\theta^\prime}$.
If $X_1, \dots, X_n$ are i.i.d. samples from $\mathbf P_{\theta^\prime}$, the LLN lets us replace the expectation of a function $g(X)$ by the mean of the function over all $X_i$.
$ \mathbb E_{\theta^\prime}\big[g(X)\big] \approx \frac{1}{n}\sum_{i=1}^ng(X_i) $
By using this approximation, we switch from the exact quantity $\text{KL}$ to the estimate $\widehat{\text{KL}}$.
$
\widehat{\mathrm{KL}}(\mathbf P_{\theta^\prime}, \mathbf P_\theta)=c-\frac{1}{n}\sum_{i=1}^n \log p_\theta(X_i) $
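A quick simulation illustrates the plug-in step. In this sketch (the parameter values are made up for illustration) we draw i.i.d. Bernoulli samples from $\mathbf P_{\theta^\prime}$ and compute the $\theta$-dependent part $-\frac{1}{n}\sum_i \log p_\theta(X_i)$; by the LLN it converges to $-\mathbb E_{\theta^\prime}\big[\log p_\theta(X)\big]$ as $n$ grows:
```python
import numpy as np

rng = np.random.default_rng(0)

theta_prime = 0.7          # the unknown parameter generating the data
theta = 0.5                # the candidate parameter we evaluate
n = 100_000

x = rng.binomial(1, theta_prime, size=n)   # i.i.d. samples from P_theta'

def log_pmf(x, theta):
    """Log of the Bernoulli PMF: log(theta^x * (1 - theta)^(1 - x))."""
    return x * np.log(theta) + (1 - x) * np.log(1 - theta)

estimate = -np.mean(log_pmf(x, theta))     # -(1/n) * sum_i log p_theta(X_i)
exact = -(theta_prime * np.log(theta) + (1 - theta_prime) * np.log(1 - theta))
print(estimate, exact)  # agree up to Monte Carlo error (~0.693 here)
```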
## Maximum Likelihood
Now that we have found our estimator $\widehat{\text{KL}}(\mathbf P_{\theta^\star}, \mathbf P_{\theta})$ (writing $\theta^\star$ for the unknown true parameter, as in the table above), we want to find the $\theta$ that minimizes it. Note that below we are only interested in the $\arg \min$, not the $\min$ itself. Therefore we write equivalence statements, not equalities.
$
\begin{align}
\min_{\theta \in \Theta}\widehat{\mathrm{KL}}(\mathbf P_{\theta^\star}, \mathbf P_\theta) &\leftrightarrow\min_{\theta \in \Theta}\left(c-\frac{1}{n}\sum_{i=1}^n \log p_\theta(X_i)\right) \tag{1}\\[2pt]
&\leftrightarrow \max_{\theta \in \Theta}\left(\frac{1}{n}\sum_{i=1}^n \log p_\theta(X_i)\right) \tag{2}\\[2pt]
&\leftrightarrow\max_{\theta \in \Theta}\left(\log \prod_{i=1}^n p_\theta(X_i)\right) \tag{3}\\[2pt]
&\leftrightarrow\max_{\theta \in \Theta}\left( \prod_{i=1}^n p_\theta(X_i)\right) \tag{4}
\end{align}
$
where:
- (1) Minimizing a constant minus a term that depends on $\theta$ is the same as maximizing that term: the constant drops out and the sign flips.
- (2) Multiplying by the positive factor $1 \over n$ does not impact the argmax.
- (3) The sum of the logs is the same as the log of the product.
- (4) As the log function is strictly increasing, removing it does not affect the argmax either.
We conclude that minimizing $\widehat{\text{KL}}$ boils down to maximizing the product of the [[Probability Mass Function|PMF's]] evaluated at all $X_i$. We call this product the likelihood. Since we search for the $\theta$ that maximizes it, the resulting estimator is called the maximum likelihood estimator ("MLE").
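Continuing the Bernoulli sketch from above (the sample size and parameter values are again made up), we can verify numerically over a grid that objective (1), the KL estimate up to its constant $c$, and the likelihood from objective (4) pick out the same $\theta$:
```python
import numpy as np

rng = np.random.default_rng(1)
x = rng.binomial(1, 0.7, size=1_000)      # data from Bernoulli(theta* = 0.7)
thetas = np.linspace(0.01, 0.99, 99)      # grid of candidate parameters

def log_pmf(x, theta):
    return x * np.log(theta) + (1 - x) * np.log(1 - theta)

# Objective (1) without c:  -(1/n) * sum_i log p_theta(X_i)
kl_hat = np.array([-np.mean(log_pmf(x, t)) for t in thetas])
# Objective (3), the log of the likelihood (logs avoid numerical underflow)
log_lik = np.array([np.sum(log_pmf(x, t)) for t in thetas])

# Both objectives select the same theta, close to 0.7 and to the sample mean
print(thetas[np.argmin(kl_hat)], thetas[np.argmax(log_lik)], x.mean())
```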
- *Probability:* We fix the distribution $\mathbf P_\theta$ and calculate the probability for various observations $x_i$ to occur.
- *Likelihood:* We fix the observations $\{X_1=x_1, \dots, X_n=x_n\}$ and calculate which $\mathbf P_\theta$ is most likely to have generated this fixed data.
$
\begin{align}
L(x_1, \ldots , x_ n, \theta ) &= \mathbf P_\theta(X_1=x_1, \dots, X_n=x_n) \tag{1}\\[6pt] &= \prod_{i=1}^n \mathbf P_\theta(X_i=x_i) \tag{2}
\end{align}
$
(2) Since all $X_i$ are [[Independence and Identical Distribution|i.i.d.]], we can write the joint probability as the product of the individual probabilities.
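For a concrete feel of the likelihood as a function of $\theta$, the sketch below (with made-up Bernoulli observations) evaluates $L(x_1, \ldots, x_n, \theta)$ on a grid; the maximizer coincides with the sample mean, which is the closed-form Bernoulli MLE:
```python
import numpy as np

def likelihood(xs, theta):
    """L(x_1, ..., x_n, theta) = prod_i P_theta(X_i = x_i) for Bernoulli data."""
    xs = np.asarray(xs)
    return float(np.prod(theta ** xs * (1 - theta) ** (1 - xs)))

xs = [1, 0, 1, 1, 0, 1, 1, 1]             # fixed (made-up) observations
thetas = np.linspace(0.01, 0.99, 99)
values = [likelihood(xs, t) for t in thetas]

print(thetas[int(np.argmax(values))])     # 0.75, the grid argmax
print(np.mean(xs))                        # 0.75, the sample mean 6/8
```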