## Problem

The [[Chi-Square Test]] enables [[Hypothesis Tests|hypothesis testing]] for discrete distributions. For continuous data, we could discretize observations into bins and perform the same test. However, this approach introduces ambiguity:

- How many bins should we use?
- Should bins be equally spaced by counts or by intervals?

To avoid these issues, we use the Kolmogorov-Smirnov ("KS") test, which works directly on continuous data.

## Hypotheses

The KS test evaluates whether the true [[Cumulative Density Function|CDF]] $F$ matches a hypothesized CDF $F^{(0)}$, which is fully specified. The hypotheses are:

$ \begin{cases} H_0:F=F^{(0)}\\ H_1: F \not = F^{(0)} \end{cases} $

Using the [[Empirical Cumulative Density Function|Empirical CDF]] $F_n(t)$, we approximate the true $F(t)$. The [[Empirical Cumulative Density Function#^d0ca51|Glivenko-Cantelli Theorem]] shows the uniform convergence of the two (given [[Independence and Identical Distribution|i.i.d.]] observations), allowing this step. Under $H_0$:

$ F(t) = F^{(0)}(t) \approx F_n(t) $

The test statistic $T_n$ measures the maximum difference between $F_n(t)$ and $F^{(0)}(t)$, scaled by $\sqrt n$. Under $H_0$, it converges in distribution to the supremum of a Brownian bridge $\mathbb B(x)$.

$ T_n = \sup_{t \in \mathbb R} \sqrt n \, \Big\lvert F_n(t)-F^{(0)}(t)\Big\rvert \xrightarrow[n \to \infty]{(d)}\sup_{0 \le x \le 1} \Big \lvert \mathbb B(x) \Big \rvert $

## Computational Considerations

Directly computing the supremum over all $t \in \mathbb{R}$ is infeasible. However, the supremum can only be attained at points $t$ where there is an actual data point $X_i$ (i.e. the empirical CDF only changes at these jump points).

![[kolmogorov-smirnov-test.png|center|400]]

**Steps:**

1. Order the observations $X_1, \dots , X_n$ into $X_{(1)} \le \cdots \le X_{(n)}$.
2. Compute $F_n(t)$ at each $t$ where we have an $X_{(i)}$.
3. Record the absolute difference between $F_n$ and the hypothesized $F^{(0)}$. Notice that we have to check the difference both before and after each jump, since $F_n$ equals $\tfrac{i}{n}$ at $X_{(i)}$ but $\tfrac{i-1}{n}$ just below it.

$ T_n = \sqrt n \max_{i=1,\dots,n} \Bigg \{\max\Big( \underbrace{\Big \lvert \tfrac{i}{n}-F^{(0)}(X_{(i)})\Big \rvert}_{\text{after jump}}, \underbrace{\Big \lvert \tfrac{i-1}{n}-F^{(0)}(X_{(i)})\Big \rvert}_{\text{before jump}}\Big) \Bigg\} $

Critical values for $T_n$ can be obtained from the asymptotic distribution of $\sup_{0 \le x \le 1} \lvert \mathbb{B}(x) \rvert$, a pivotal distribution.

## Non-Asymptotic Distribution

The KS test statistic $T_n$ has the same distribution under $H_0$, regardless of the shape of our hypothesized $F^{(0)}$. This invariance comes from the [[Probability Integral Transform]], which maps any continuous CDF into the [[Continuous Uniform Distribution]] $\mathcal{U}(0,1)$. Let $x = F^{(0)}(t)$, so that the inverse is $t = \big(F^{(0)}\big)^{-1}(x)$.

$ \begin{align} T_n &= \sqrt n \sup_{t \in \mathbb R} \Big\lvert F_n(t)-F^{(0)}(t)\Big\rvert \tag{1}\\[8pt] &=\sqrt n \sup_{x \in [0,1]} \, \Big \lvert F_n\Big(\big(F^{(0)}\big)^{-1}(x)\Big)- x \Big \rvert \tag{2}\\[4pt] &=\sqrt n \sup_{x \in [0,1]} \, \Big \lvert \frac{1}{n}\sum_{i=1}^n \mathbf 1 \big\{ X_i \le \big(F^{(0)}\big)^{-1}(x)\big\}- x \Big\rvert \tag{3}\\[4pt] &=\sqrt n \sup_{x \in [0,1]} \, \Big \lvert \frac{1}{n}\sum_{i=1}^n \mathbf 1 \big\{\underbrace{F^{(0)}(X_i)}_{\mathcal U(0,1)} \le x\big\}- x \Big\rvert \tag{4}\\[4pt] &=\sqrt n \sup_{x \in [0,1]} \, \Big \lvert G_n(x)- x \Big\rvert \tag{5} \end{align} $

where:

- (2) Substitute $t = \big(F^{(0)}\big)^{-1}(x)$, so that $F^{(0)}(t) = x$.
- (3) Write out the [[Empirical Cumulative Density Function#^5957fc|definition]] of the empirical CDF $F_n(t)$.
- (4) Acknowledge that $F^{(0)}(X_i)$ follows a standard uniform $\mathcal U(0,1)$ by the probability integral transform.
- (5) Denote by $G_n(x)$ the empirical CDF (the sample average of the indicators) of $n$ i.i.d. standard uniform variables.
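A minimal numerical sketch of the jump-point formula, assuming NumPy and SciPy are available; the function name `ks_statistic`, the standard normal null, and the sample size are illustrative choices, not part of the derivation above. It evaluates $T_n$ at the order statistics and cross-checks against `scipy.stats.kstest`, which reports the unscaled distance $D_n = T_n / \sqrt n$.

```python
import numpy as np
from scipy import stats

def ks_statistic(x, cdf):
    """T_n = sqrt(n) * sup_t |F_n(t) - F0(t)|, evaluated at the jump points."""
    x = np.sort(x)                                  # X_(1) <= ... <= X_(n)
    n = len(x)
    f0 = cdf(x)                                     # F0 at each jump point
    after = np.abs(np.arange(1, n + 1) / n - f0)    # |i/n     - F0(X_(i))|
    before = np.abs(np.arange(n) / n - f0)          # |(i-1)/n - F0(X_(i))|
    return np.sqrt(n) * max(after.max(), before.max())

# Illustrative data: n = 100 draws tested against a standard normal null.
rng = np.random.default_rng(0)
sample = rng.normal(size=100)

t_n = ks_statistic(sample, stats.norm.cdf)
d_n, p_value = stats.kstest(sample, stats.norm.cdf)  # scipy's D_n is unscaled
print(t_n, np.sqrt(len(sample)) * d_n)               # the two values agree
```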
**Conclusion:**

1. We can conclude that $T_n$ has the same null distribution no matter what we choose for $F^{(0)}$.
2. The distribution of $T_n$ depends only on the sample size $n$, not on $F^{(0)}$. Since this pivotal distribution has been tabulated for each $n$, we can use the same probability table to conduct non-asymptotic tests against any fully specified null.
3. The KS test statistic is, up to the $\sqrt n$ scaling, the sup-distance between two CDFs, and it satisfies all criteria for a distance.
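Conclusion 1 can be checked by simulation. The sketch below (distributions, sample size, and replication count are arbitrary illustrative choices) draws samples under $H_0$ from two very different nulls and shows that the simulated quantiles of $T_n$ coincide up to Monte Carlo noise.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)
n, reps = 50, 20_000  # illustrative sample size and replication count

def simulate_tn(dist):
    """Draw X_1..X_n from `dist`, test against its own CDF, return all T_n."""
    grid_after = np.arange(1, n + 1) / n    # F_n right after each jump
    grid_before = np.arange(n) / n          # F_n right before each jump
    tns = np.empty(reps)
    for r in range(reps):
        f0 = dist.cdf(np.sort(dist.rvs(size=n, random_state=rng)))
        tns[r] = np.sqrt(n) * max(np.abs(grid_after - f0).max(),
                                  np.abs(grid_before - f0).max())
    return tns

# Two very different nulls yield (up to noise) identical T_n quantiles.
for dist in (stats.norm(), stats.expon()):
    q = np.quantile(simulate_tn(dist), [0.90, 0.95, 0.99])
    print(dist.dist.name, q)
```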