## Estimator
In order to find the minimum [[Kullback-Leibler Divergence|KL Divergence]], we need to maximize the respective [[Likelihood Functions]]. We are interested in the set of parameters $\theta$ that leads to this maximum ("argmax").
**Maximum Likelihood Estimator:**
$ \hat \theta_n^{MLE} = \arg \max_{\theta \in \Theta} L(X_1, \dots, X_n; \theta) $
>[!note:]
>The expression $\arg \max_{\theta} f(\theta)$ looks for the maximum of a function, but returns the parameter value $\theta$ that yields that maximum. It does not return the maximum function value itself.
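A minimal numerical sketch of the idea, assuming a Bernoulli model and some made-up coin-flip data (neither comes from these notes): we evaluate the likelihood on a grid of candidate $\theta$ values, and $\arg \max$ returns the parameter, not the likelihood value.
```python
import numpy as np

# Hypothetical data: 10 Bernoulli observations (made up for illustration)
x = np.array([1, 0, 1, 1, 0, 1, 1, 1, 0, 1])

# Candidate parameters theta on a grid in (0, 1)
thetas = np.linspace(0.01, 0.99, 99)

# Likelihood L(X_1, ..., X_n; theta) = prod_i theta^X_i * (1 - theta)^(1 - X_i)
likelihood = np.array([np.prod(t ** x * (1 - t) ** (1 - x)) for t in thetas])

# arg max returns the parameter that maximizes the likelihood ...
theta_hat = thetas[np.argmax(likelihood)]
print(theta_hat)          # ~0.7, the sample mean

# ... not the maximum likelihood value itself
print(likelihood.max())
```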
**Log-Likelihood Estimator:**
$ \hat \theta_n^{MLE} = \arg \max_{\theta \in \Theta} \log \big( L(X_1, \dots, X_n; \theta)\big) $
**Log Equivalence:**
We can add the logarithmic function and the $\arg \max$ will not change. This is because the logarithm is a [[Monotonicity|monotonically increasing]] function, so it does not flip the order of which parameter leads to the maximum.
$ \arg \max f(\theta) \leftrightarrow \arg \max g(f(\theta)) $
- Maximizing a function is the same thing as maximizing an increasing function of the function, at least in terms of $\arg \max$.
- So we could have picked any other increasing function (e.g. exponentiating the likelihood instead of taking its log). However, the $\log$ turns products into sums and brings down exponents, which simplifies the differentiation steps when we maximize.
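A quick check of the log equivalence, continuing the hypothetical Bernoulli grid from above: the values change under the $\log$, but the maximizing parameter does not.
```python
import numpy as np

x = np.array([1, 0, 1, 1, 0, 1, 1, 1, 0, 1])   # same hypothetical Bernoulli data as above
thetas = np.linspace(0.01, 0.99, 99)

likelihood = np.array([np.prod(t ** x * (1 - t) ** (1 - x)) for t in thetas])
log_likelihood = np.array([np.sum(x * np.log(t) + (1 - x) * np.log(1 - t)) for t in thetas])

# The values differ, but the arg max (and hence the estimated theta) is identical
assert np.argmax(likelihood) == np.argmax(log_likelihood)
```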
## Maximization
Maximizing a function is the same as minimizing the negative of that function.
$ \min_{\theta \in \Theta}-h(\theta) \leftrightarrow \max_{\theta \in \Theta}h(\theta) $
We find the minimum (maximum) where the first derivative $h^\prime(\theta)$ is $0$, which corresponds to a valley (peak) of the function, i.e. a stationary point with zero slope.
For strictly [[Identify Convex and Concave Functions|convex or concave]] functions this stationary point is unique and gives the global minimum (maximum).
$ h^\prime(\theta)=0 $
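As a concrete univariate example (the standard Bernoulli case, used here purely for illustration), setting the derivative of the log-likelihood to zero recovers the sample mean:
$ \begin{aligned}
\ell_n(p) &= \sum_{i=1}^n \Big( X_i \log p + (1 - X_i)\log(1-p) \Big) \\[4pt]
\ell_n^\prime(p) &= \frac{\sum_{i=1}^n X_i}{p} - \frac{n-\sum_{i=1}^n X_i}{1-p} = 0 \quad \Rightarrow \quad \hat p_n^{MLE} = \frac{1}{n}\sum_{i=1}^n X_i
\end{aligned} $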
The first-order condition also holds in the multivariate case, where $\theta$ is a vector of $d$ parameters. The [[Gradient Descent#Gradient Vector|Gradient Vector]] $\nabla h(\theta)$ collects the partial derivatives with respect to each of these parameters. The maximum/minimum is found where $\nabla h(\theta)$ equals the zero vector in all parameter dimensions.
$ \nabla h(\theta) = 0 \in \mathbb R^d $
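A minimal multivariate sketch, assuming a Gaussian model with unknown mean and standard deviation and simulated data (the log-parametrization of $\sigma$ and the use of scipy's general-purpose optimizer are implementation choices, not something from these notes): minimizing the negative log-likelihood is the same as maximizing the log-likelihood, and the gradient is numerically zero at the solution.
```python
import numpy as np
from scipy.optimize import minimize
from scipy.stats import norm

rng = np.random.default_rng(0)
x = rng.normal(loc=2.0, scale=1.5, size=500)    # simulated data, true (mu, sigma) = (2, 1.5)

# Negative log-likelihood of a Gaussian model; theta = (mu, log_sigma),
# log-parametrizing sigma keeps the optimization unconstrained
def neg_log_likelihood(theta):
    mu, log_sigma = theta
    return -np.sum(norm.logpdf(x, loc=mu, scale=np.exp(log_sigma)))

# Minimizing -h(theta) is the same as maximizing h(theta)
result = minimize(neg_log_likelihood, x0=np.array([0.0, 0.0]))

print(result.x[0], np.exp(result.x[1]))   # estimates close to (2, 1.5)
print(result.jac)                         # gradient is approximately the zero vector
```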
## Consistency
Under mild regularity conditions (smoothness of the likelihood), the MLE is a [[Properties of an Estimator#Key Properties of an Estimator|consistent estimator]]: it converges in probability to the true parameter $\theta^\star$ as $n$ grows.
$
\hat \theta_n^{MLE} \xrightarrow[n\to \infty]{\mathbf P}\theta^\star
$
This is because we replaced the expectation in $\text{KL}$ with a sample average to obtain $\widehat{\text{KL}}$.
$ \widehat{\mathrm{KL}}(\mathbf P_{\theta^\star}, \mathbf P_\theta)=c-\frac{1}{n}\sum_{i=1}^n \log p_\theta(X_i) $
Whenever we replace an expectation with a sample average, we can rely on the [[Law of Large Numbers|LLN]], which gives us [[Modes of Convergence#Convergence in Probability|Convergence in Probability]]. So $\widehat{\text{KL}}$ converges to the true $\text{KL}$ for every $\theta \in \Theta$ as we collect more observations $X_i$.
$ \begin{aligned}
\widehat{\text{KL}}(\mathbf P_{\theta^\star}, \mathbf P_\theta) &\xrightarrow[n\to \infty]{\mathbf P} \, \text{KL}(\mathbf P_{\theta^\star}, \mathbf P_\theta) \\[8pt]
c- \frac{1}{n} \sum_{i=1}^n \log p_\theta(X_i) &\xrightarrow[n\to \infty]{\mathbf P} \, \text{KL}(\mathbf P_{\theta^\star}, \mathbf P_\theta) \\
\frac{1}{n} \sum_{i=1}^n \log p_\theta(X_i) &\xrightarrow[n\to \infty]{\mathbf P} \, c-\text{KL}(\mathbf P_{\theta^\star}, \mathbf P_\theta)
\end{aligned} $
The $\text{KL}$ divergence on the right-hand side is minimized (it equals zero) exactly when $\mathbf P_\theta = \mathbf P_{\theta^\star}$, so maximizing the average log-likelihood asymptotically recovers $\mathbf P_{\theta^\star}$. If the parameter $\theta$ is [[Identifiability|identifiable]], then $\theta^\star$ can be uniquely recovered from this $\mathbf P_{\theta^\star}$.
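A small simulation sketch of consistency, again for the Bernoulli model where the MLE is the sample mean (the parameter value and sample sizes are made up): the estimate concentrates around $\theta^\star$ as $n$ grows.
```python
import numpy as np

rng = np.random.default_rng(1)
theta_star = 0.3                                # true parameter, chosen for illustration

for n in [10, 100, 10_000, 1_000_000]:
    x = rng.binomial(1, theta_star, size=n)     # n Bernoulli(theta_star) observations
    theta_hat = x.mean()                        # MLE for the Bernoulli model
    print(n, abs(theta_hat - theta_star))       # deviation tends to shrink as n grows
```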
## Asymptotic Normality
Under some mild regularity conditions, the MLE estimator exhibits [[Modes of Convergence#Convergence in Distribution|Convergence in Distribution]] to a [[Gaussian Distribution]] $\mathcal N$.
- By subtracting $\theta^\star$ from the estimator we center the mean at zero.
- By multiplying with $\sqrt n$ we make sure that the [[Variance]] converges to a non-trivial constant (neither $0$ nor $\infty$). We call this constant the asymptotic variance.
$
\sqrt n\,\big(\hat\theta_n^{MLE}-\theta^\star\big) \xrightarrow[n \to \infty]{(d)}\mathcal N\left(0,\frac{1}{I(\theta^\star)}\right)
$
The asymptotic variance is given by $1$ over the Fisher information $I(\theta^\star)$. To compute this quantity, let $\ell(\theta)$ be the log-likelihood of a single observation $X_1$.
$ \ell(\theta)=\ell(X_1,\theta) = \log p_\theta(X_1) $
Given that $\ell(\theta)$ is twice differentiable, the Fisher information can be defined in two equivalent ways:
- Variance of the first derivative of $\ell(\theta)$
- Negative expectation of the second derivative of $\ell(\theta)$
$ I(\theta)= \mathrm{Var}[\ell^\prime(\theta)] = - \mathbb E[\ell^{\prime \prime}(\theta)] $
>[!note:]
>Since the Fisher information equals the negative expectation of the second derivative, $I(\theta^\star)$ can be interpreted as the average curvature of the log-likelihood around $\theta^\star$. The stronger this curvature around the optimum, the smaller the asymptotic variance.
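A Monte Carlo sketch of the two definitions, using the Bernoulli model where $\ell(p) = X_1 \log p + (1-X_1)\log(1-p)$ and $I(p) = \frac{1}{p(1-p)}$ (a standard result, assumed here rather than derived): the variance of the first derivative and the negative mean of the second derivative approximate the same number.
```python
import numpy as np

rng = np.random.default_rng(2)
p = 0.3                                         # parameter at which we evaluate I(p)
x = rng.binomial(1, p, size=1_000_000)          # Monte Carlo sample for the expectations

score = x / p - (1 - x) / (1 - p)               # l'(p):  first derivative of log p_theta(X_1)
hessian = -x / p**2 - (1 - x) / (1 - p)**2      # l''(p): second derivative of log p_theta(X_1)

print(score.var())       # Var[l'(p)]   ~ 1 / (p (1 - p)) ~ 4.76
print(-hessian.mean())   # -E[l''(p)]   ~ 1 / (p (1 - p)) ~ 4.76
```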
**Assumptions for Convergence of MLE estimator:**
1. Parameter $\theta$ is [[Identifiability|identifiable]].
2. For all $\theta \in \Theta$ the support of $\mathbf P_\theta$ does not depend on $\theta$. This e.g. rules out [[Continuous Uniform Distribution]] $\mathcal U([0, \theta])$.
3. $\theta^\star$ is not on the boundary of $\Theta$ (since we need to take derivatives).
4. The Fisher Information $I(\theta)\not = 0$ in the region around $\theta^\star$.
5. More technical conditions.
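To close, a simulation sketch of the asymptotic normality statement above, again for the Bernoulli model where $1/I(p^\star) = p^\star(1-p^\star)$ (the parameter and sample sizes are arbitrary): the rescaled estimation error $\sqrt n\,(\hat p_n - p^\star)$ has mean near $0$ and variance near $1/I(p^\star)$.
```python
import numpy as np

rng = np.random.default_rng(3)
p_star, n, reps = 0.3, 1_000, 5_000             # made-up true parameter and sample sizes

# For each repetition: draw n Bernoulli samples, compute the MLE (the sample mean), rescale
p_hat = rng.binomial(1, p_star, size=(reps, n)).mean(axis=1)
z = np.sqrt(n) * (p_hat - p_star)

print(z.mean())   # close to 0
print(z.var())    # close to 1 / I(p_star) = p_star * (1 - p_star) = 0.21
```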