## Comparing Distributions
Our goal in most statistical tasks is to estimate a probability distribution, which can be described by parameters $\theta$ (e.g. $\mu, \sigma$ for [[Gaussian Distribution]]).
While we could compare the estimated parameters $\hat \theta$ with the true parameters $\theta^\star$, it is better to directly compare the resulting distributions $\mathbf P_{\hat \theta}$ and $\mathbf P_{\theta^\star}$: the mapping from parameters to distributions can distort distances, so parameters that are very close may still yield differently shaped distributions.
Assume we have a [[Statistical Model]] with [[Sample Space]] $E$. The samples are [[Independence and Identical Distribution|i.i.d.]] [[Random Variable|r.v.'s]] $X_1,\dots,X_n$ drawn from the family of distributions $\mathbf P_\theta$ with true parameter $\theta^\star \in \Theta$.
$ \big(E, (\mathbf P_\theta)_{\theta \in \Theta}\big) $
We want to find an estimator $\hat \theta$, such that the estimated distribution $\mathbf P_{\hat \theta}$ is close to the true $\mathbf P_{\theta^\star}$.
## Definition
We want a metric that describes the difference between any two distributions $\mathbf P_\theta$ and $\mathbf P_{\theta^\prime}$. The Total Variation Distance ("TV") is the *difference in probability of the same event* $A$ under the two distributions, where $A$ is chosen as the event that maximizes this difference among all events in the sample space $E$.
$ \mathrm{TV}(\mathbf P_\theta, \mathbf P_{\theta^\prime}) = \max_{A \subset E} \Big \lvert \mathbf P_\theta(A)- \mathbf P_{\theta^\prime}(A)\Big \rvert $
Mathematically, this maximum over events equals a summation/integration of the absolute pointwise differences over the whole sample space, corrected by a factor of $\frac{1}{2}$: since both densities sum/integrate to one, the region where $p_\theta$ exceeds $p_{\theta^\prime}$ contributes exactly as much as the region where it falls short, so the total counts the maximal difference twice.
**Discrete:**
$ \mathrm{TV}(\mathbf P_\theta, \mathbf P_{\theta^\prime}) = \frac{1}{2} \sum_{x \in E} \Big \lvert p_\theta(x)-p_{\theta^\prime}(x)\Big \rvert $
**Continuous:**
$ \mathrm{TV}(\mathbf P_\theta, \mathbf P_{\theta^\prime}) = \frac{1}{2} \int_{x \in E} \Big \lvert p_\theta(x)-p_{\theta^\prime}(x) \Big \rvert \, dx $
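To make the equivalence concrete, here is a minimal Python sketch (with hypothetical example values) that computes TV on a small discrete sample space both ways: by brute force over all events $A \subseteq E$ and via the half-sum formula. The two results agree.

```python
import itertools

# Hypothetical discrete distributions on the same sample space E
E = [0, 1, 2]
p = {0: 0.5, 1: 0.3, 2: 0.2}  # P_theta
q = {0: 0.2, 1: 0.3, 2: 0.5}  # P_theta'

# Definition: maximize |P(A) - Q(A)| over all events A ⊆ E
tv_max = max(
    abs(sum(p[x] for x in A) - sum(q[x] for x in A))
    for r in range(len(E) + 1)
    for A in itertools.combinations(E, r)
)

# Equivalent formula: half the sum of pointwise absolute differences
tv_sum = 0.5 * sum(abs(p[x] - q[x]) for x in E)

print(tv_max, tv_sum)  # 0.3 0.3 — the two formulations agree
```

The brute-force maximum is attained at $A=\{x : p_\theta(x) > p_{\theta^\prime}(x)\}$, which is exactly why the half-sum shortcut works.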
**Properties:** The following properties make TV a distance (metric) in the mathematical sense.

| | Formulation | Comment |
| --- | --- | --- |
| Symmetric | $\mathrm{TV}(\mathbf P_\theta, \mathbf P_{\theta^\prime})=\mathrm{TV}(\mathbf P_{\theta^\prime},\mathbf P_\theta)$ | The order does not matter. |
| Non-negative & bounded | $0 \le \mathrm{TV}(\mathbf P_\theta, \mathbf P_{\theta^\prime}) \le 1$ | The probability of any event lies in $[0,1]$, so the difference of two such probabilities can never exceed 1. |
| Definite | $\mathrm{TV}(\mathbf P_\theta, \mathbf P_{\theta^\prime})=0 \iff \mathbf P_\theta = \mathbf P_{\theta^\prime}$ | TV is zero exactly when the two distributions agree on every event. |
| Triangle inequality | $\mathrm{TV}(\mathbf P_\theta, \mathbf P_{\theta^\prime}) \le \mathrm{TV}(\mathbf P_\theta, \mathbf P_{\theta^{\prime\prime}}) + \mathrm{TV}(\mathbf P_{\theta^{\prime\prime}}, \mathbf P_{\theta^\prime})$ | Follows from the triangle inequality for absolute values inside the maximum. |
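As a continuous illustration, the sketch below (assuming SciPy is available; the parameter values are hypothetical) evaluates the integral formula numerically for two equal-variance Gaussians and compares it with the known closed form $2\,\Phi\big(\lvert \mu_1-\mu_2\rvert/(2\sigma)\big)-1$. The result also respects the $0 \le \mathrm{TV} \le 1$ bound from the table above.

```python
import numpy as np
from scipy.integrate import quad
from scipy.stats import norm

# Hypothetical parameters: two Gaussians with equal variance
mu1, mu2, sigma = 0.0, 1.0, 1.0

# Integrand: half the absolute density difference
f = lambda x: 0.5 * abs(norm.pdf(x, mu1, sigma) - norm.pdf(x, mu2, sigma))

# With equal variances the densities cross at the midpoint;
# splitting the integral there keeps the numerics clean.
m = (mu1 + mu2) / 2
tv_num = quad(f, -np.inf, m)[0] + quad(f, m, np.inf)[0]

# Known closed form for equal-variance Gaussians
tv_closed = 2 * norm.cdf(abs(mu1 - mu2) / (2 * sigma)) - 1

print(tv_num, tv_closed)  # both ≈ 0.3829, and 0 <= TV <= 1 holds
```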
**Disjoint support:** When two distributions have disjoint support (their sets of possible values do not overlap), $\text{TV}$ is always maximal and does not distinguish by how far apart the supports actually are: the event $A = \operatorname{supp}(X)$ has probability $1$ under one distribution and $0$ under the other. Below is an example of two almost identical but disjoint distributions.
$ X = \begin{cases} 0 & \text{w.p.}&0.5 \\ 1 & \text{w.p.}&0.5 \\ \end{cases} \quad Y = \begin{cases} 0.01 & \text{w.p.}&0.5 \\ 1.01 & \text{w.p.}&0.5 \\ \end{cases} \implies \quad \text{TV}(X,Y)=1 $
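A quick sketch of this example: computing the half-sum over the union of the two supports returns the maximal value $1$, even though $Y$ is just $X$ shifted by $0.01$.

```python
# The disjoint-support example from above: Y is X shifted by 0.01
p = {0.0: 0.5, 1.0: 0.5}    # distribution of X
q = {0.01: 0.5, 1.01: 0.5}  # distribution of Y

support = set(p) | set(q)   # union of the two supports
tv = 0.5 * sum(abs(p.get(x, 0.0) - q.get(x, 0.0)) for x in support)
print(tv)  # 1.0 — maximal, although the distributions are nearly identical
```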