## Estimator
In order to find the minimum [[Kullback-Leibler Divergence|KL Divergence]], we need to maximize the respective [[Likelihood Functions]]. We are interested in the set of parameters $\theta$ that leads to this maximum ("argmax").
**Maximum Likelihood Estimator:**
$ \hat \theta_n^{MLE} = \arg \max_{\theta \in \Theta} L(X_1, \dots, X_n; \theta) $
>[!note:]
>The expression $\arg \max_{\theta} f(\theta)$ looks for the maximum of a function, but returns the parameter value $\theta$ that yields that maximum. It does not return the maximum function value itself.
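A minimal numerical sketch of the idea, assuming a Bernoulli model and some made-up coin-flip data (neither comes from these notes): we evaluate the likelihood on a grid of candidate $\theta$ values, and $\arg \max$ returns the parameter, not the likelihood value.
```python
import numpy as np

# Hypothetical data: 10 Bernoulli observations (made up for illustration)
x = np.array([1, 0, 1, 1, 0, 1, 1, 1, 0, 1])

# Candidate parameters theta on a grid in (0, 1)
thetas = np.linspace(0.01, 0.99, 99)

# Likelihood L(X_1, ..., X_n; theta) = prod_i theta^X_i * (1 - theta)^(1 - X_i)
likelihood = np.array([np.prod(t ** x * (1 - t) ** (1 - x)) for t in thetas])

# arg max returns the parameter that maximizes the likelihood ...
theta_hat = thetas[np.argmax(likelihood)]
print(theta_hat)          # ~0.7, the sample mean

# ... not the maximum likelihood value itself
print(likelihood.max())
```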
**Log-Likelihood Estimator:**
$ \hat \theta_n^{MLE} = \arg \max_{\theta \in \Theta} \log \big( L(X_1, \dots, X_n; \theta)\big) $
**Log Equivalence:**
We can add the logarithmic function and the $\arg \max$ will not change. This is because the logarithm is a [[Monotonicity|monotonically increasing]] function, so it does not flip the order of which parameter leads to the maximum.
$ \arg \max f(\theta) \leftrightarrow \arg \max g(f(\theta)) $
- Maximizing a function is the same thing as maximizing an increasing function of the function, at least in terms of $\arg \max$.
- So we could have picked any other increasing function (e.g. exponentiating the likelihood instead of taking its log). However, the $\log$ turns products into sums and brings down exponents, which simplifies the differentiation steps when we maximize.
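A quick check of the log equivalence, continuing the hypothetical Bernoulli grid from above: the values change under the $\log$, but the maximizing parameter does not.
```python
import numpy as np

x = np.array([1, 0, 1, 1, 0, 1, 1, 1, 0, 1])   # same hypothetical Bernoulli data as above
thetas = np.linspace(0.01, 0.99, 99)

likelihood = np.array([np.prod(t ** x * (1 - t) ** (1 - x)) for t in thetas])
log_likelihood = np.array([np.sum(x * np.log(t) + (1 - x) * np.log(1 - t)) for t in thetas])

# The values differ, but the arg max (and hence the estimated theta) is identical
assert np.argmax(likelihood) == np.argmax(log_likelihood)
```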
## Maximization
Maximizing a function is the same as minimizing the negative of that function.
$ \min_{\theta \in \Theta}-h(\theta) \leftrightarrow \max_{\theta \in \Theta}h(\theta) $
We find the minimum (maximum) where the first derivative $h^\prime(\theta)$ is $0$, which corresponds to a valley (peak) of the function, i.e. a stationary point with zero slope.
For strictly [[Identify Convex and Concave Functions|convex or concave]] functions this stationary point is unique and gives the global minimum (maximum).
$ h^\prime(\theta)=0 $
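As a concrete univariate example (the standard Bernoulli case, used here purely for illustration), setting the derivative of the log-likelihood to zero recovers the sample mean:
$ \begin{aligned}
\ell_n(p) &= \sum_{i=1}^n \Big( X_i \log p + (1 - X_i)\log(1-p) \Big) \\[4pt]
\ell_n^\prime(p) &= \frac{\sum_{i=1}^n X_i}{p} - \frac{n-\sum_{i=1}^n X_i}{1-p} = 0 \quad \Rightarrow \quad \hat p_n^{MLE} = \frac{1}{n}\sum_{i=1}^n X_i
\end{aligned} $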
The first-order condition also holds in the multivariate case, where $\theta$ is a vector of $d$ parameters. The [[Gradient Descent#Gradient Vector|Gradient Vector]] $\nabla h(\theta)$ collects the partial derivatives with respect to each of these parameters. The maximum/minimum is found where $\nabla h(\theta)$ equals the zero vector in all parameter dimensions.
$ \nabla h(\theta) = 0 \in \mathbb R^d $
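A minimal multivariate sketch, assuming a Gaussian model with unknown mean and standard deviation and simulated data (the log-parametrization of $\sigma$ and the use of scipy's general-purpose optimizer are implementation choices, not something from these notes): minimizing the negative log-likelihood is the same as maximizing the log-likelihood, and the gradient is numerically zero at the solution.
```python
import numpy as np
from scipy.optimize import minimize
from scipy.stats import norm

rng = np.random.default_rng(0)
x = rng.normal(loc=2.0, scale=1.5, size=500)    # simulated data, true (mu, sigma) = (2, 1.5)

# Negative log-likelihood of a Gaussian model; theta = (mu, log_sigma),
# log-parametrizing sigma keeps the optimization unconstrained
def neg_log_likelihood(theta):
    mu, log_sigma = theta
    return -np.sum(norm.logpdf(x, loc=mu, scale=np.exp(log_sigma)))

# Minimizing -h(theta) is the same as maximizing h(theta)
result = minimize(neg_log_likelihood, x0=np.array([0.0, 0.0]))

print(result.x[0], np.exp(result.x[1]))   # estimates close to (2, 1.5)
print(result.jac)                         # gradient is approximately the zero vector
```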
## Consistency
Under mild regularity conditions (smoothness of the likelihood), the MLE is a [[Properties of an Estimator#Key Properties of an Estimator|consistent estimator]]: it converges in probability to the true parameter $\theta^\star$ as $n$ grows.
$
\hat \theta_n^{MLE} \xrightarrow[n\to \infty]{\mathbf P}\theta^\star
$
This is because we replaced the expectation in $\text{KL}$ with a sample average to obtain $\widehat{\text{KL}}$.
$ \widehat{\mathrm{KL}}(\mathbf P_{\theta^\star}, \mathbf P_\theta)=c-\frac{1}{n}\sum_{i=1}^n \log p_\theta(X_i) $
Whenever we replace an expectation with a sample average, we can rely on the [[Law of Large Numbers|LLN]], which gives us [[Modes of Convergence#Convergence in Probability|Convergence in Probability]]. So $\widehat{\text{KL}}$ converges to the true $\text{KL}$ for every $\theta \in \Theta$ as we collect more observations $X_i$.
$ \begin{aligned}
\widehat{\text{KL}}(\mathbf P_{\theta^\star}, \mathbf P_\theta) &\xrightarrow[n\to \infty]{\mathbf P} \, \text{KL}(\mathbf P_{\theta^\star}, \mathbf P_\theta) \\[8pt]
c- \frac{1}{n} \sum_{i=1}^n \log p_\theta(X_i) &\xrightarrow[n\to \infty]{\mathbf P} \, \text{KL}(\mathbf P_{\theta^\star}, \mathbf P_\theta) \\
\frac{1}{n} \sum_{i=1}^n \log p_\theta(X_i) &\xrightarrow[n\to \infty]{\mathbf P} \, c-\text{KL}(\mathbf P_{\theta^\star}, \mathbf P_\theta)
\end{aligned} $
The $\text{KL}$ divergence on the right-hand side is minimized (it equals zero) exactly when $\mathbf P_\theta = \mathbf P_{\theta^\star}$, so maximizing the average log-likelihood asymptotically recovers $\mathbf P_{\theta^\star}$. If the parameter $\theta$ is [[Identifiability|identifiable]], then $\theta^\star$ can be uniquely recovered from this $\mathbf P_{\theta^\star}$.
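A small simulation sketch of consistency, again for the Bernoulli model where the MLE is the sample mean (the parameter value and sample sizes are made up): the estimate concentrates around $\theta^\star$ as $n$ grows.
```python
import numpy as np

rng = np.random.default_rng(1)
theta_star = 0.3                                # true parameter, chosen for illustration

for n in [10, 100, 10_000, 1_000_000]:
    x = rng.binomial(1, theta_star, size=n)     # n Bernoulli(theta_star) observations
    theta_hat = x.mean()                        # MLE for the Bernoulli model
    print(n, abs(theta_hat - theta_star))       # deviation tends to shrink as n grows
```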
## Asymptotic Normality
Under some mild regularity conditions, the MLE estimator exhibits [[Modes of Convergence#Convergence in Distribution|Convergence in Distribution]] to a [[Gaussian Distribution]] $\mathcal N$.
- By subtracting $\theta^\star$ from the estimator we center the mean at zero.
- By multiplying with $\sqrt n$ we make sure that the [[Variance]] converges to a non-trivial constant (neither $0$ nor $\infty$). We call this constant the asymptotic variance.
$
\sqrt n\,\big(\hat\theta_n^{MLE}-\theta^\star\big) \xrightarrow[n \to \infty]{(d)}\mathcal N\left(0,\frac{1}{I(\theta^\star)}\right)
$
The asymptotic variance is given by $1$ over the Fisher information $I(\theta^\star)$. To compute this quantity, let $\ell(\theta)$ be the log-likelihood of a single observation $X_1$.
$ \ell(\theta)=\ell(X_1,\theta) = \log p_\theta(X_1) $
Given that $\ell(\theta)$ is twice differentiable, the Fisher information can be defined in two equivalent ways:
- Variance of the first derivative of $\ell(\theta)$
- Negative expectation of the second derivative of $\ell(\theta)$
$ I(\theta)= \mathrm{Var}[\ell^\prime(\theta)] = - \mathbb E[\ell^{\prime \prime}(\theta)] $
>[!note:]
>Since the Fisher information equals the negative expectation of the second derivative, $I(\theta^\star)$ can be interpreted as the average curvature of the log-likelihood around $\theta^\star$. The stronger this curvature around the optimum, the smaller the asymptotic variance.
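A Monte Carlo sketch of the two definitions, using the Bernoulli model where $\ell(p) = X_1 \log p + (1-X_1)\log(1-p)$ and $I(p) = \frac{1}{p(1-p)}$ (a standard result, assumed here rather than derived): the variance of the first derivative and the negative mean of the second derivative approximate the same number.
```python
import numpy as np

rng = np.random.default_rng(2)
p = 0.3                                         # parameter at which we evaluate I(p)
x = rng.binomial(1, p, size=1_000_000)          # Monte Carlo sample for the expectations

score = x / p - (1 - x) / (1 - p)               # l'(p):  first derivative of log p_theta(X_1)
hessian = -x / p**2 - (1 - x) / (1 - p)**2      # l''(p): second derivative of log p_theta(X_1)

print(score.var())       # Var[l'(p)]   ~ 1 / (p (1 - p)) ~ 4.76
print(-hessian.mean())   # -E[l''(p)]   ~ 1 / (p (1 - p)) ~ 4.76
```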
**Assumptions for Convergence of MLE estimator:**
1. Parameter $\theta$ is [[Identifiability|identifiable]].
2. For all $\theta \in \Theta$ the support of $\mathbf P_\theta$ does not depend on $\theta$. This e.g. rules out [[Continuous Uniform Distribution]] $\mathcal U([0, \theta])$.
3. $\theta^\star$ is not on the boundary of $\Theta$ (since we need to take derivatives).
4. The Fisher Information $I(\theta)\not = 0$ in the region around $\theta^\star$.
5. More technical conditions.
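To close, a simulation sketch of the asymptotic normality statement above, again for the Bernoulli model where $1/I(p^\star) = p^\star(1-p^\star)$ (the parameter and sample sizes are arbitrary): the rescaled estimation error $\sqrt n\,(\hat p_n - p^\star)$ has mean near $0$ and variance near $1/I(p^\star)$.
```python
import numpy as np

rng = np.random.default_rng(3)
p_star, n, reps = 0.3, 1_000, 5_000             # made-up true parameter and sample sizes

# For each repetition: draw n Bernoulli samples, compute the MLE (the sample mean), rescale
p_hat = rng.binomial(1, p_star, size=(reps, n)).mean(axis=1)
z = np.sqrt(n) * (p_hat - p_star)

print(z.mean())   # close to 0
print(z.var())    # close to 1 / I(p_star) = p_star * (1 - p_star) = 0.21
```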