## Problem
The [[Chi-Square Test]] enables [[Hypothesis Tests|hypothesis testing]] for discrete distributions. For continuous data, we could discretize observations into bins and perform the same test. However, this approach introduces ambiguity:
- How many bins should we use?
- Should bins be equal-width intervals or contain equal counts?
To avoid these issues, we use the Kolmogorov-Smirnov ("KS") test, which works directly on continuous data.
## Hypotheses
The KS test evaluates whether the true [[Cumulative Density Function|CDF]] $F$ matches a hypothesized CDF $F^{(0)}$, which is fully specified. The hypotheses are:
$ \begin{cases} H_0:F=F^{(0)}\\ H_1: F \not = F^{(0)} \end{cases} $
Using the [[Empirical Cumulative Density Function|Empirical CDF]] $F_n(t)$, we approximate the true $F(t)$. The [[Empirical Cumulative Density Function#^d0ca51|Glivenko-Cantelli Theorem]] shows that $F_n$ converges uniformly to $F$ (given [[Independence and Identical Distribution|i.i.d.]] observations), justifying this approximation.
$ F(t) = F^{(0)}(t) \approx F_n(t) $
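For concreteness, here is a minimal NumPy sketch of the empirical CDF (the helper name `ecdf` is our choice, not a standard API):
```python
import numpy as np

def ecdf(sample, t):
    """Empirical CDF: F_n(t) = (1/n) * #{i : X_i <= t}."""
    sample = np.asarray(sample)
    # Broadcast each observation against each evaluation point t
    return np.mean(sample[:, None] <= np.atleast_1d(t), axis=0)

x = np.array([0.3, 1.2, 0.7, 2.1])
print(ecdf(x, [0.5, 1.0, 3.0]))  # [0.25 0.5  1.  ]
```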
The test statistic $T_n$ measures the maximum difference between $F_n(t)$ and $F^{(0)}(t)$, scaled by $\sqrt n$. Under $H_0$, it converges in distribution to the supremum of the absolute value of a Brownian bridge $\mathbb B(x)$.
$
T_n = \sup_{t \in \mathbb R} \sqrt n \, \Big\lvert F_n(t)-F^{(0)}(t)\Big\rvert \xrightarrow[n \to \infty]{(d)}\sup_{0 \le x \le 1} \Big \lvert \mathbb B(x) \Big \rvert
$
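The limiting law on the right-hand side is known as the Kolmogorov distribution; its CDF admits an explicit series representation,
$
\mathbb P\Big(\sup_{0 \le x \le 1} \big\lvert \mathbb B(x)\big\rvert \le t\Big) = 1-2\sum_{k=1}^{\infty}(-1)^{k-1}e^{-2k^2t^2}, \qquad t>0,
$
which is how the critical values used below can be computed in practice.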
## Computational Considerations
Directly computing the supremum over all $t \in \mathbb{R}$ is infeasible. However, the supremum can only be attained at the observed data points $X_i$: the empirical CDF is constant between its jumps and $F^{(0)}$ is monotone, so on each interval the difference is largest at an endpoint (approached as a left or right limit).
![[kolmogorov-smirnov-test.png|center|400]]
**Steps:**
1. Order the observations $X_1, \dots , X_n$ into $X_{(1)} \le \cdots \le X_{(n)}$.
2. Compute $F_n$ at each order statistic: $F_n(X_{(i)}) = \tfrac{i}{n}$.
3. Record the absolute difference between $F_n$ and the hypothesized $F^{(0)}$ at each $X_{(i)}$. Since $F_n$ jumps there, we have to check the difference both just before and just after the jump.
$
T_n = \sqrt n \max_{i=1,\dots,n} \Bigg \{\max\Big(
\underbrace{\Big \lvert \tfrac{i}{n}-F^{(0)}(X_{(i)})\Big \rvert}_{\text{after jump}},
\underbrace{\Big \lvert \tfrac{i-1}{n}-F^{(0)}(X_{(i)})\Big \rvert}_{\text{before jump}}\Big) \Bigg\}
$
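A minimal NumPy sketch of this computation (the helper name `ks_statistic` and the standard-normal example are our choices), cross-checked against `scipy.stats.kstest`, which returns the unscaled distance $D_n = T_n/\sqrt n$:
```python
import numpy as np
from scipy.stats import norm, kstest

def ks_statistic(sample, cdf):
    """Scaled KS statistic: T_n = sqrt(n) * max_i max(|i/n - F0(X_(i))|, |(i-1)/n - F0(X_(i))|)."""
    x = np.sort(sample)
    n = len(x)
    f0 = cdf(x)                                   # F^(0) evaluated at the order statistics
    after = np.abs(np.arange(1, n + 1) / n - f0)  # F_n(X_(i)) = i/n, just after the jump
    before = np.abs(np.arange(n) / n - f0)        # left limit (i-1)/n, just before the jump
    return np.sqrt(n) * np.maximum(after, before).max()

rng = np.random.default_rng(0)
sample = rng.normal(size=200)
t_n = ks_statistic(sample, norm.cdf)

# Cross-check: scipy.stats.kstest returns the unscaled D_n = T_n / sqrt(n)
d_n, _ = kstest(sample, "norm")
assert np.isclose(t_n, np.sqrt(200) * d_n)
```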
Critical values for $T_n$ can be obtained from the asymptotic distribution of $\sup_{0 \le x \le 1} \lvert \mathbb{B}(x) \rvert$, which is pivotal: it does not depend on $F^{(0)}$ or any unknown parameters.
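For instance, a sketch using `scipy.stats.kstwobign`, SciPy's implementation of the distribution of $\sup \lvert \mathbb B(x) \rvert$:
```python
from scipy.stats import kstwobign

alpha = 0.05
c = kstwobign.ppf(1 - alpha)  # asymptotic critical value, ~1.358
# Reject H0 at level alpha if T_n > c.
```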
## Non-Asymptotic Distribution
Under $H_0$, the KS test statistic $T_n$ has the same distribution regardless of the shape of our hypothesized $F^{(0)}$ (as long as it is continuous). This invariance comes from the [[Probability Integral Transform]], which maps any continuous CDF into the [[Continuous Uniform Distribution]] $\mathcal{U}(0,1)$.
Let $x = F^{(0)}(t)$, so that $t = \big(F^{(0)}\big)^{-1}(x)$, where $\big(F^{(0)}\big)^{-1}$ denotes the inverse (quantile function) of $F^{(0)}$.
$
\begin{align}
T_n &= \sqrt n \sup_{t \in \mathbb R} \Big\lvert F_n(t)-F^{(0)}(t)\Big\rvert \tag{1}\\[8pt]
&=\sqrt n \sup_{x \in [0,1]} \, \Big \lvert F_n\Big(\big(F^{(0)}\big)^{-1}(x)\Big)- x \Big \rvert \tag{2}\\[4pt]
&=\sqrt n \sup_{x \in [0,1]} \, \Big \lvert \frac{1}{n}\sum_{i=1}^n \mathbf 1 \big\{ X_i \le \big(F^{(0)}\big)^{-1}(x)\big\}- x \Big\rvert \tag{3}\\[4pt]
&=\sqrt n \sup_{x \in [0,1]} \, \Big \lvert \frac{1}{n}\sum_{i=1}^n \mathbf 1 \big\{\underbrace{F^{(0)}(X_i)}_{\sim \, \mathcal U(0,1)} \le x\big\}- x \Big\rvert \tag{4}\\[4pt]
&=\sqrt n \sup_{x \in [0,1]} \, \Big \lvert G_n(x)- x \Big\rvert \tag{5}
\end{align}
$
where:
- (2) Substitute $t = \big(F^{(0)}\big)^{-1}(x)$ and $F^{(0)}(t) = x$.
- (3) Write out the [[Empirical Cumulative Density Function#^5957fc|definition]] of the empirical CDF $F_n(t)$.
- (4) Acknowledge that $F^{(0)}(X_i)$ follows a standard uniform $\mathcal U(0,1)$ by the [[Probability Integral Transform]].
- (5) Summarize: $G_n(x)$ is the empirical CDF of the $n$ i.i.d. standard uniform random variables $F^{(0)}(X_i)$.
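A quick Monte Carlo sketch of this invariance (the sample size, null distributions, and repetition count are our choices): simulating $T_n$ under $H_0$ for two very different null CDFs gives matching quantiles up to simulation error.
```python
import numpy as np
from scipy.stats import norm, expon, kstest

rng = np.random.default_rng(1)
n, reps = 50, 2000

def simulate_tn(dist):
    """Simulate T_n = sqrt(n) * D_n under H0: data drawn from the null itself."""
    return np.array([
        np.sqrt(n) * kstest(dist.rvs(size=n, random_state=rng), dist.cdf).statistic
        for _ in range(reps)
    ])

# The 95% quantiles agree although F^(0) differs, illustrating pivotality.
print(np.quantile(simulate_tn(norm), 0.95))
print(np.quantile(simulate_tn(expon), 0.95))
```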
**Conclusion:**
1. We can conclude that the distribution of $T_n$ under $H_0$ is the same no matter what we choose for $F^{(0)}$.
2. The distribution of $T_n$ depends only on the sample size $n$, not on $F^{(0)}$ or any unknown parameters. Its quantiles can therefore be tabulated once for each $n$, letting us conduct exact, non-asymptotic tests from a single probability table.
3. The KS test statistic is $\sqrt n$ times the sup-norm distance between $F_n$ and $F^{(0)}$, so it satisfies all the criteria of a distance.
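To illustrate point 2, a sketch assuming SciPy ≥ 1.5, where `scipy.stats.kstwo` provides the exact finite-sample distribution of the unscaled statistic $D_n = T_n/\sqrt n$: the exact critical values per $n$ approach the asymptotic one.
```python
import numpy as np
from scipy.stats import kstwo, kstwobign

alpha = 0.05
print("asymptotic:", kstwobign.ppf(1 - alpha))     # ~1.358
for n in (10, 50, 1000):
    exact = np.sqrt(n) * kstwo.ppf(1 - alpha, n)   # exact quantile of T_n for this n
    print(f"n={n}: exact critical value {exact:.4f}")
```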