## Comparing Distributions

Our goal in most statistical tasks is to estimate a probability distribution, which can be described by parameters $\theta$ (e.g. $\mu, \sigma$ for a [[Gaussian Distribution]]). While we could compare an estimated parameter $\hat \theta$ with the true parameter $\theta^\star$, it is better to directly compare the resulting distributions $\mathbf P_{\hat \theta}$ and $\mathbf P_{\theta^\star}$: the parametrization can distort distances, so parameters that are close can still yield differently shaped distributions.

Assume we have a [[Statistical Model]] with [[Sample Space]] $E$. Samples are [[Independence and Identical Distribution|i.i.d.]] [[Random Variable|r.v.'s]] $X_1,\dots,X_n$, drawn from the family of distributions $(\mathbf P_\theta)_{\theta \in \Theta}$ with true parameter $\theta^\star \in \Theta$.

$ \big(E, (\mathbf P_\theta)_{\theta \in \Theta}\big) $

We want to find an estimator $\hat \theta$, such that the estimated distribution $\mathbf P_{\hat \theta}$ is close to the true $\mathbf P_{\theta^\star}$.

## Definition

We want a metric that describes the difference between any two distributions $\mathbf P_\theta$ and $\mathbf P_{\theta^\prime}$. The Total Variation Distance ("TV") returns the *difference in probability for the same event* $A$ to occur under each of the two distributions, where $A$ is chosen as the event that maximizes this difference among all possible events from the sample space $E$.

$ \mathrm{TV}(\mathbf P_\theta, \mathbf P_{\theta^\prime}) = \max_{A \subset E} \Big \lvert \mathbf P_\theta(A)- \mathbf P_{\theta^\prime}(A)\Big \rvert $

This maximum is equivalent to a summation/integration of absolute density differences over all outcomes $x \in E$. However, we need to correct by a factor of $\frac{1}{2}$: the region where $p_\theta > p_{\theta^\prime}$ and the region where $p_\theta < p_{\theta^\prime}$ each contribute the same amount (both densities integrate to 1), so the sum counts the maximal difference twice.

**Discrete:**

$ \mathrm{TV}(\mathbf P_\theta, \mathbf P_{\theta^\prime}) = \frac{1}{2} \sum_{x \in E} \Big \lvert p_\theta(x)-p_{\theta^\prime}(x)\Big \rvert $

**Continuous:**

$ \mathrm{TV}(\mathbf P_\theta, \mathbf P_{\theta^\prime}) = \frac{1}{2} \int_{x \in E} \Big \lvert p_\theta(x)-p_{\theta^\prime}(x) \Big \rvert \, dx $

**Properties:** The following properties make TV a distance in the mathematical sense.

| | Formulation | Comment |
| ------------ | ------------------------------------------------------------------------------------------------------------------------------------------------------- | ------------------------------------------------------------------ |
| Symmetric | $\mathrm{TV}(\mathbf P_\theta, \mathbf P_{\theta^\prime})=\mathrm{TV}(\mathbf P_{\theta^\prime},\mathbf P_\theta)$ | The order does not matter. |
| Non-negative | $\begin{align}\mathrm{TV}(\mathbf P_\theta, \mathbf P_{\theta^\prime}) \ge 0 \\ \mathrm{TV}(\mathbf P_\theta, \mathbf P_{\theta^\prime}) \le 1 \end{align}$ | A difference of two probabilities always lies between 0 and 1. |
| Definite | $\mathrm{TV}(\mathbf P_\theta, \mathbf P_{\theta^\prime})=0 \implies \mathbf P_\theta \equiv \mathbf P_{\theta^\prime}$ | When TV is zero, the two distributions agree on every event. |
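As a quick numerical check of the discrete formula, here is a minimal Python sketch (the pmfs are made up for illustration). It computes TV both as half the $L_1$ distance between the pmfs and as the probability gap on the maximizing event $A = \{x : p_\theta(x) > p_{\theta^\prime}(x)\}$; the two agree.

```python
import numpy as np

# Two discrete distributions on the same finite sample space E = {0, 1, 2, 3}.
# The probabilities are made-up example values.
p = np.array([0.1, 0.4, 0.3, 0.2])  # pmf of P_theta
q = np.array([0.3, 0.3, 0.2, 0.2])  # pmf of P_theta'

# TV as half the L1 distance between the pmfs ...
tv_l1 = 0.5 * np.sum(np.abs(p - q))

# ... equals the probability gap on the maximizing event A = {x : p(x) > q(x)}.
A = p > q
tv_event = np.sum(p[A]) - np.sum(q[A])

print(tv_l1, tv_event)  # both print 0.2
```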
**Disjoint support:** When two distributions have disjoint support (their probability mass sits on regions that do not overlap), TV is always at its maximum of 1 and does not register how far apart the supports actually are. Below is an example of two almost identical but disjoint distributions:

$ X = \begin{cases} 0 & \text{w.p. } 0.5 \\ 1 & \text{w.p. } 0.5 \end{cases} \quad Y = \begin{cases} 0.01 & \text{w.p. } 0.5 \\ 1.01 & \text{w.p. } 0.5 \end{cases} \quad \implies \quad \text{TV}(X,Y)=1 $
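To make the pathology concrete, here is a small sketch evaluating the discrete formula on these two point-mass distributions (the helper `tv` is a hypothetical name, not a library function): shifting the support by just 0.01 already drives TV to its maximum of 1.

```python
# Point-mass distributions from the example above, as {outcome: probability} dicts.
X = {0.0: 0.5, 1.0: 0.5}
Y = {0.01: 0.5, 1.01: 0.5}

def tv(p, q):
    """Total variation distance between two discrete pmfs given as dicts."""
    support = set(p) | set(q)  # union of the two supports
    return 0.5 * sum(abs(p.get(x, 0.0) - q.get(x, 0.0)) for x in support)

print(tv(X, Y))  # 1.0 -- maximal, although the supports differ by only 0.01
```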