The Kullback-Leibler ("KL") divergence is also known as "relative entropy". It measures how one probability distribution $\mathbf P_{\theta^\prime}$ diverges from another $\mathbf P_\theta$. Concretely, it compares $\mathbf P_{\theta^\prime}$ with $\mathbf P_\theta$ at every point where $\mathbf P_{\theta^\prime}$ assigns positive probability. Unlike the [[Total Variation Distance]], the KL divergence is asymmetric: it is computed from the perspective of $\mathbf P_{\theta^\prime}$.
>[!note:]
>For $\text{KL}(\mathbf P_{\theta^\prime}, \mathbf P_\theta)$ to be well-defined, $\mathbf P_{\theta^\prime}$ needs to be "absolutely continuous" w.r.t. $\mathbf P_\theta$. This means that $\mathbf P_{\theta^\prime}$ must not assign positive probability to any point that $\mathbf P_\theta$ considers impossible. A violation would mean dividing by $0$ inside the logarithm.
**Discrete:**
$ \mathrm{KL}(\mathbf P_{\theta^\prime}, \mathbf P_\theta)=\sum_{x \in E}p_{\theta^\prime}(x) \log\Big(\frac{p_{\theta^\prime}(x)}{p_\theta(x)}\Big) $
**Continuous:**
$ \mathrm{KL}(\mathbf P_{\theta^\prime}, \mathbf P_\theta)=\int_{E}f_{\theta^\prime}(x) \log\Big(\frac{f_{\theta^\prime}(x)}{f_\theta(x)}\Big) \, dx $
where $E$ is the support of the distribution $\mathbf P_{\theta^\prime}$, not of $\mathbf P_\theta$.
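As a sanity check, the discrete formula is straightforward to evaluate directly. Below is a minimal Python sketch (the helper name `kl_divergence` and the example PMFs are made up for illustration) that computes $\mathrm{KL}(\mathbf P_{\theta^\prime}, \mathbf P_\theta)$ over a shared finite support:
```python
import numpy as np

def kl_divergence(p, q):
    """Discrete KL(p, q) = sum over x of p(x) * log(p(x) / q(x)).

    Sums only over points with p(x) > 0; requires q(x) > 0 wherever
    p(x) > 0 (absolute continuity), otherwise the divergence is infinite.
    """
    p, q = np.asarray(p, dtype=float), np.asarray(q, dtype=float)
    mask = p > 0                       # uses the convention 0 * log(0) = 0
    if np.any(q[mask] == 0):
        return np.inf                  # absolute continuity violated
    return float(np.sum(p[mask] * np.log(p[mask] / q[mask])))

# Two example PMFs over the support E = {0, 1, 2}
p_prime = [0.5, 0.3, 0.2]   # plays the role of P_theta'
p_theta = [0.4, 0.4, 0.2]   # plays the role of P_theta
print(kl_divergence(p_prime, p_theta))  # ~0.0253
```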
**Properties:**
| | Formulation | Comment |
| ------------ | -------------------------------------------------------------------------------------------------------------------------------------- | ------------------------------------------------------------------------------------------------------------------------------------------------------- |
| Asymmetric | $\mathrm{KL}(\mathbf P_{\theta^\prime}, \mathbf P_\theta) \not=\mathrm{KL}(\mathbf P_\theta,\mathbf P_{\theta^\prime})$ | In general the order matters here. |
| Non-negative | $\mathrm{KL}(\mathbf P_{\theta^\prime}, \mathbf P_\theta) \ge 0$ | Individual terms of the sum can be negative, but the total never is; this follows from Jensen's inequality applied to the concave logarithm. |
| Definite | $\mathrm{KL}(\mathbf P_{\theta^\prime}, \mathbf P_\theta)=0 \iff \mathbf P_{\theta^\prime} \equiv \mathbf P_\theta$ | This property is important as it makes the true parameter $\theta^\star$ the unique minimizer of $\theta \mapsto \mathrm{KL}(\mathbf P_{\theta^\star}, \mathbf P_\theta)$ (assuming the model is identifiable). |
Due to this asymmetry, KL is not considered a distance, but a divergence.
>[!note:]
>Some people write $\text{KL}(\mathbf P_{\theta^\prime} \parallel \mathbf P_\theta)$ to highlight that the divergence is asymmetric, i.e. that the order of the distributions matters.
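To make the asymmetry concrete, here is a small check reusing the hypothetical `kl_divergence` helper from above on two Bernoulli PMFs; swapping the arguments changes the value, while both directions stay non-negative:
```python
# Bernoulli PMFs over {0, 1}: P(X=0) first, then P(X=1)
bern_a = [0.1, 0.9]   # Bernoulli(0.9)
bern_b = [0.5, 0.5]   # Bernoulli(0.5)

print(kl_divergence(bern_a, bern_b))  # ~0.368
print(kl_divergence(bern_b, bern_a))  # ~0.511  (different order, different value)
```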
## Estimation
The KL-divergence can be written as an [[Expectation]]. This makes it usable for estimation, because we can then replace the expectation with a sample average by the [[Law of Large Numbers|LLN]].
The KL-divergence formula sums over all possible $x$ and weights each $\log(\cdot)$ term by the PMF $p_{\theta^\prime}(x)$. This is exactly the expectation of the log-ratio $\log\big(p_{\theta^\prime}(X)/p_\theta(X)\big)$ under $\mathbf P_{\theta^\prime}$.
$
\begin{align}
\mathrm{KL}(\mathbf P_{\theta^\prime},
\mathbf P_\theta)&=\sum_{x \in E}p_{\theta^\prime}(x) \cdot\log\Big(\frac{p_{\theta^\prime}(x)}{p_\theta(x)}\Big) \tag{1}\\[14pt]
&=\mathbb E_{\theta^\prime}\left [\log\left(\frac{p_{\theta^\prime}(X)}{p_\theta(X)}\right)\right] \tag{2}\\[12pt]
&=\mathbb E_{\theta^\prime}\big[\log p_{\theta^\prime}(X) \big] - \mathbb E_{\theta^\prime}\big[\log p_{\theta}(X) \big]\tag{3}\\[10pt]
&=c-\mathbb E_{\theta^\prime}\big[\log p_{\theta}(X) \big] \tag{4}\\[10pt]
\end{align}
$
where:
- (2) The log of a fraction is the log of the numerator minus the log of the denominator. By [[Linearity of Expectations]], we can then split the expectation into two separate expectations.
- (3) Since $\theta^\prime$ is a fixed (although unknown) parameter, the first expectation $\mathbb E_{\theta^\prime}\big[\log p_{\theta^\prime}(X)\big]$ does not depend on $\theta$, so we abbreviate it by the constant $c$. The constant shifts the value of $\text{KL}$ itself, but not the location of its minimum over $\theta$.
>[!note:]
>While $X$ is a random variable, its expectation can be computed under different assumed distributions (each assigning different weights to the values $x$). The subscript in $\mathbb E_{\theta^\prime}[\cdot]$ indicates that the expectation is taken assuming $X \sim \mathbf P_{\theta^\prime}$.
If $X_1, \dots, X_n$ are i.i.d. samples from $\mathbf P_{\theta^\prime}$, the LLN lets us replace the expectation of a function $g(X)$ by the mean of the function over all $X_i$.
$ \mathbb E_{\theta^\prime}\big[g(X)\big] \approx \frac{1}{n}\sum_{i=1}^ng(X_i) $
By using this approximation, we switch from the exact quantity $\text{KL}$ to the estimate $\widehat{\text{KL}}$.
$
\widehat{\mathrm{KL}}(\mathbf P_{\theta^\prime}, \mathbf P_\theta)=c-\frac{1}{n}\sum_{i=1}^n \log p_\theta(X_i) $
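A quick simulation illustrates the plug-in step. In this sketch (the parameter values are made up for illustration) we draw i.i.d. Bernoulli samples from $\mathbf P_{\theta^\prime}$ and compute the $\theta$-dependent part $-\frac{1}{n}\sum_i \log p_\theta(X_i)$; by the LLN it converges to $-\mathbb E_{\theta^\prime}\big[\log p_\theta(X)\big]$ as $n$ grows:
```python
import numpy as np

rng = np.random.default_rng(0)

theta_prime = 0.7          # the unknown parameter generating the data
theta = 0.5                # the candidate parameter we evaluate
n = 100_000

x = rng.binomial(1, theta_prime, size=n)   # i.i.d. samples from P_theta'

def log_pmf(x, theta):
    """Log of the Bernoulli PMF: log(theta^x * (1 - theta)^(1 - x))."""
    return x * np.log(theta) + (1 - x) * np.log(1 - theta)

estimate = -np.mean(log_pmf(x, theta))     # -(1/n) * sum_i log p_theta(X_i)
exact = -(theta_prime * np.log(theta) + (1 - theta_prime) * np.log(1 - theta))
print(estimate, exact)  # agree up to Monte Carlo error (~0.693 here)
```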
## Maximum Likelihood
Now that we have found our estimator $\widehat{\text{KL}}(\mathbf P_{\theta^\star}, \mathbf P_{\theta})$ (writing $\theta^\star$ for the unknown true parameter, as in the table above), we want to find the $\theta$ that minimizes it. Note that below we are only interested in the $\arg \min$, not the $\min$ itself. Therefore we write equivalence statements, not equalities.
$
\begin{align}
\min_{\theta \in \Theta}\widehat{\mathrm{KL}}(\mathbf P_{\theta^\star}, \mathbf P_\theta) &\leftrightarrow\min_{\theta \in \Theta}\left(c-\frac{1}{n}\sum_{i=1}^n \log p_\theta(X_i)\right) \tag{1}\\[2pt]
&\leftrightarrow \max_{\theta \in \Theta}\left(\frac{1}{n}\sum_{i=1}^n \log p_\theta(X_i)\right) \tag{2}\\[2pt]
&\leftrightarrow\max_{\theta \in \Theta}\left(\log \prod_{i=1}^n p_\theta(X_i)\right) \tag{3}\\[2pt]
&\leftrightarrow\max_{\theta \in \Theta}\left( \prod_{i=1}^n p_\theta(X_i)\right) \tag{4}
\end{align}
$
where:
- (1) Minimizing a constant minus a term that depends on $\theta$ is the same as maximizing that term: the constant drops out and the sign flips.
- (2) Multiplying by the positive factor $1 \over n$ does not impact the argmax.
- (3) The sum of the logs is the same as the log of the product.
- (4) As the log function is strictly increasing, removing it does not affect the argmax either.
We conclude that minimizing $\widehat{\text{KL}}$ boils down to maximizing the product of the [[Probability Mass Function|PMF's]] evaluated at all $X_i$. We call this product the likelihood. Since we search for the $\theta$ that maximizes it, the resulting estimator is called the maximum likelihood estimator ("MLE").
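Continuing the Bernoulli sketch from above (the sample size and parameter values are again made up), we can verify numerically over a grid that objective (1), the KL estimate up to its constant $c$, and the likelihood from objective (4) pick out the same $\theta$:
```python
import numpy as np

rng = np.random.default_rng(1)
x = rng.binomial(1, 0.7, size=1_000)      # data from Bernoulli(theta* = 0.7)
thetas = np.linspace(0.01, 0.99, 99)      # grid of candidate parameters

def log_pmf(x, theta):
    return x * np.log(theta) + (1 - x) * np.log(1 - theta)

# Objective (1) without c:  -(1/n) * sum_i log p_theta(X_i)
kl_hat = np.array([-np.mean(log_pmf(x, t)) for t in thetas])
# Objective (3), the log of the likelihood (logs avoid numerical underflow)
log_lik = np.array([np.sum(log_pmf(x, t)) for t in thetas])

# Both objectives select the same theta, close to 0.7 and to the sample mean
print(thetas[np.argmin(kl_hat)], thetas[np.argmax(log_lik)], x.mean())
```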
- *Probability:* We fix the distribution $\mathbf P_\theta$ and calculate the probability for various observations $x_i$ to occur.
- *Likelihood:* We fix the observations $\{X_1=x_1, \dots, X_n=x_n\}$ and calculate which $\mathbf P_\theta$ is most likely to have generated this fixed data.
$
\begin{align}
L(x_1, \ldots , x_ n, \theta ) &= \mathbf P_\theta(X_1=x_1, \dots, X_n=x_n) \tag{1}\\[6pt] &= \prod_{i=1}^n \mathbf P_\theta(X_i=x_i) \tag{2}
\end{align}
$
(2) Since all $X_i$ are [[Independence and Identical Distribution|i.i.d.]], we can write the joint probability as the product of the individual probabilities.
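For a concrete feel of the likelihood as a function of $\theta$, the sketch below (with made-up Bernoulli observations) evaluates $L(x_1, \ldots, x_n, \theta)$ on a grid; the maximizer coincides with the sample mean, which is the closed-form Bernoulli MLE:
```python
import numpy as np

def likelihood(xs, theta):
    """L(x_1, ..., x_n, theta) = prod_i P_theta(X_i = x_i) for Bernoulli data."""
    xs = np.asarray(xs)
    return float(np.prod(theta ** xs * (1 - theta) ** (1 - xs)))

xs = [1, 0, 1, 1, 0, 1, 1, 1]             # fixed (made-up) observations
thetas = np.linspace(0.01, 0.99, 99)
values = [likelihood(xs, t) for t in thetas]

print(thetas[int(np.argmax(values))])     # 0.75, the grid argmax
print(np.mean(xs))                        # 0.75, the sample mean 6/8
```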