In the [[Wald Test]], we have so far directly compared $\hat \theta$ with some $\theta_0$. In the Likelihood-Ratio test, we instead compare the [[Likelihood Functions|likelihood]] values attained at these $\theta$ values.
![[likelihood-ratio-test.png|center|400]]
Both the Wald test and the Likelihood-Ratio test are asymptotically equivalent, which means that for large sample sizes they give similar results.
## Univariate Form
The most basic setup is the assessment of the likelihood ratio between two simple [[Hypothesis Tests|hypotheses]].
$ \begin{cases} H_0: \theta = \theta_0 \\ H_1: \theta = \theta_1 \end{cases} $
We compute the likelihood of our data $(x_1, \dots, x_n)$ assuming $\theta_0$ or $\theta_1$, and then take the ratio of the two. The [[Statistical Test]] $\psi_C$ is an indicator function that returns $1$ (i.e. reject $H_0$) if the ratio exceeds a certain threshold value $C$.
$ \psi_C = \mathbf 1 \left( \frac{L_n(x_1, \dots, x_n; \theta_1)}{L_n(x_1, \dots, x_n; \theta_0)}>C \right) $
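A minimal sketch in Python of this simple-vs-simple test. The setup is an illustrative assumption (not from the note): Gaussian data with known $\sigma = 1$, made-up values for $\theta_0$, $\theta_1$, the sample size, and the threshold $C$, just to show how $\psi_C$ is evaluated.
```python
import numpy as np
from scipy.stats import norm

# Hypothetical example: Gaussian data with known sigma = 1,
# testing H0: theta = 0 against H1: theta = 0.5.
rng = np.random.default_rng(0)
x = rng.normal(loc=0.5, scale=1.0, size=50)  # data generated under H1

theta0, theta1 = 0.0, 0.5
# Work with log-likelihoods; summing logs is numerically safer than multiplying densities.
loglik0 = norm.logpdf(x, loc=theta0, scale=1.0).sum()
loglik1 = norm.logpdf(x, loc=theta1, scale=1.0).sum()

ratio = np.exp(loglik1 - loglik0)   # L_n(x; theta1) / L_n(x; theta0)
C = 1.0                             # placeholder threshold; in practice chosen to control alpha
psi = int(ratio > C)                # 1 = reject H0, 0 = keep H0
print(ratio, psi)
```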
When $H_1$ is a [[Hypothesis Tests#Types of Hypotheses|composite hypothesis]], the numerator of the test $\psi$ takes the supremum of the likelihood over all $\theta \in \Theta_1$. Since $H_0$ covers only the single point $\theta_0$, the set $\Theta_1$ covers (almost) all of $\Theta$, so the supremum equals the unconstrained maximum of the likelihood, attained at the maximum likelihood estimator.
$ \psi_C = \mathbf 1 \left( \frac{\sup_{\theta \in \Theta_1} L_n(x_1, \dots, x_n; \theta)}{L_n(x_1, \dots, x_n; \theta_0)}>C \right) = \mathbf 1 \left( \frac{ L_n(X; \hat\theta^{\mathrm{MLE}})}{L_n(X; \theta_0)}>C \right) $
When we take $2\log$ of this likelihood ratio, we get a test statistic $\Lambda$ that, under $H_0$, asymptotically follows a [[Chi-Square Distribution]] $\chi^2$ according to Wilks' theorem. This allows us to set the threshold $C$ at the $(1-\alpha)$ quantile of that distribution, so the test has level $\alpha$.
$ \Lambda = 2\log \left(\frac{ L_n(X; \hat\theta^{\mathrm{MLE}})}{L_n(X; \theta_0)}\right) $
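A hedged sketch of the same Gaussian setup (known variance, illustrative data and $\alpha$), now with a composite $H_1$: the MLE of the mean is the sample mean, and $\Lambda$ is compared against a $\chi^2_1$ quantile since one parameter is restricted under $H_0$.
```python
import numpy as np
from scipy.stats import norm, chi2

# Hypothetical continuation of the Gaussian example: sigma known (= 1), theta0 = 0.
rng = np.random.default_rng(1)
x = rng.normal(loc=0.3, scale=1.0, size=100)

theta0 = 0.0
theta_mle = x.mean()  # MLE of the mean for a Gaussian with known variance

# Lambda = 2 * log( L_n(x; theta_MLE) / L_n(x; theta0) )
Lambda = 2 * (norm.logpdf(x, loc=theta_mle).sum() - norm.logpdf(x, loc=theta0).sum())

alpha = 0.05
C = chi2.ppf(1 - alpha, df=1)  # one restricted parameter -> 1 degree of freedom
print(Lambda, C, Lambda > C)   # reject H0 when Lambda exceeds C
```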
## Multivariate Form
We have an unknown parameter [[Vector Operations|Vector]] $\theta \in \mathbb R^d$, and we are deciding between two hypotheses. However, $H_0$ only makes a statement about a subset of the components of $\theta$, namely $(\theta_{r+1}, \dots, \theta_d)$; the first $r$ components $(\theta_1, \dots, \theta_r)$ remain free under $H_0$.
$
\begin{cases}
H_0: (\theta_{r+1}, \dots,\theta_d)=(\theta_{r+1}^{(0)}, \dots,\theta_d^{(0)}) \\[4pt] H_1: (\theta_{r+1}, \dots,\theta_d) \not =(\theta_{r+1}^{(0)}, \dots,\theta_d^{(0)})
\end{cases}
$
Now we perform [[Maximum Likelihood Estimation]] two times:
| Symbol | Estimator | Description |
| ------------------------------ | ------------------------------------------------ | ----------------------------------------------------------------------------------------- |
| $\hat \theta^{\mathrm{MLE}}_n$ | $\arg \max_{\theta \in \Theta} \ell_n(\theta)$   | Unconstrained setup (regular MLE) |
| $\hat \theta^c_n$              | $\arg \max_{\theta \in \Theta_0} \ell_n(\theta)$ | Constrained setup, where $(\theta_{r+1}, \dots , \theta_d)$ are fixed according to $H_0$. |
The test statistic, where $\ell_n$ is the log-likelihood:
$ T_n=2\left(\ell_n(\hat \theta_n^{\mathrm{MLE}})- \ell_n(\hat \theta_n^c) \right) $
If $H_0$ is true, both estimators should attain nearly the same likelihood, and therefore $T_n$ should be small. Thus, we reject $H_0$ when $T_n$ is greater than some threshold value $C$.
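A sketch of the two-fit procedure, assuming an illustrative two-parameter Gaussian model ($\mu$ and $\sigma$) where $H_0$ fixes $\sigma = 1$; both fits use `scipy.optimize.minimize` on the negative log-likelihood, and the data are made up.
```python
import numpy as np
from scipy.optimize import minimize
from scipy.stats import norm

# Hypothetical 2-parameter example: theta = (mu, log_sigma) for Gaussian data,
# with H0 fixing sigma = 1 (so r = 1 free parameter under H0, d = 2 in total).
rng = np.random.default_rng(2)
x = rng.normal(loc=1.0, scale=1.3, size=200)

def neg_loglik(params):
    mu, log_sigma = params
    return -norm.logpdf(x, loc=mu, scale=np.exp(log_sigma)).sum()

# Unconstrained MLE over the full parameter space Theta.
fit_full = minimize(neg_loglik, x0=[0.0, 0.0])

# Constrained MLE over Theta_0: sigma fixed at 1 (log_sigma = 0), only mu varies.
fit_null = minimize(lambda mu: neg_loglik([mu[0], 0.0]), x0=[0.0])

# T_n = 2 * (ell(unconstrained MLE) - ell(constrained MLE)); minimize returns the
# minimized *negative* log-likelihood, hence the reversed order below.
T_n = 2 * (fit_null.fun - fit_full.fun)
print(T_n)
```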
## Wilks Theorem
When $H_0$ is true, the test statistic converges in distribution to a chi-square distribution with $(d-r)$ degrees of freedom.
$ T_n \xrightarrow[n \to \infty]{(d)}\chi^2_{d-r} $
- $d:$ Number of parameters that are free to vary under $H_1$ (i.e. total number of parameters)
- $r:$ Number of parameters that are free to vary under $H_0$
The difference $(d-r)$ is the number of restricted parameters (i.e. parameters explicitly fixed under $H_0$). These parameters have to be estimated under $H_1$ but not under $H_0$, so the $\chi^2$ distribution reflects the statistical variability associated with estimating these additional parameters.
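To turn $T_n$ into a decision, compare it against a $\chi^2_{d-r}$ quantile or compute a p-value; the numbers below are purely illustrative (e.g. a $T_n$ like the one from the sketch above).
```python
from scipy.stats import chi2

# Hypothetical numbers: d = 2 total parameters, r = 1 free under H0,
# and an observed test statistic T_n = 5.3.
d, r = 2, 1
T_n = 5.3

alpha = 0.05
C = chi2.ppf(1 - alpha, df=d - r)   # rejection threshold: (1 - alpha) quantile of chi2_{d-r}
p_value = chi2.sf(T_n, df=d - r)    # P(chi2_{d-r} > T_n)

print(C, p_value, T_n > C)          # reject H0 when T_n exceeds C
```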