[[Maximum Likelihood Estimation]] is a special case of M-estimation, where the expected loss is (up to an additive constant) the [[Kullback-Leibler Divergence|KL Divergence]]. More generally, we can use any other loss function $\rho$, as long as it has certain properties:
1. It should have an expectation.
$ \mathcal Q(\theta)= \mathbb E\big [\rho(X, \theta)\big] $
2. Its expectation should be minimized uniquely at $\theta^\star$ (the parameter is [[Identifiability|identifiable]]).
$ \arg \min_{\theta \in \Theta}\mathcal Q(\theta) = \theta^\star $
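For example, the squared loss $\rho(x, \theta)=(x-\theta)^2$ satisfies both properties whenever $X$ has finite variance:
$ \mathcal Q(\theta)= \mathbb E\big[(X-\theta)^2\big]= \mathrm{Var}(X)+\big(\mathbb E[X]-\theta\big)^2, $
which is uniquely minimized at $\theta^\star = \mathbb E[X]$.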
## Estimation Process
In both frameworks, we observe i.i.d. random variables $X_1, \dots, X_n$ generated from some $\mathbf P_{\theta^\star}$, where $\theta^\star$ is unknown.
| # | Steps | MLE-Estimation | M-Estimation |
| --- | --------------------------------- | --------------------------------------------------------------------------------------------------------------------------- | --------------------------------------------------------------------------------------------------------------- |
| 1 | Choose loss function | $\mathrm{KL}(\mathbf P_{\theta^\star}, \mathbf P_\theta)$ | $\rho(X, \theta)$ |
| 2 | Take expectation of loss function | $\mathrm{KL}(\mathbf P_{\theta^\star}, \mathbf P_\theta)= c- \mathbb E \big[\log (p_\theta(X))\big]$ | $\mathcal Q(\theta)=\mathbb E\big[\rho(X, \theta)\big]$ |
| 3 | Replace expectation with average | $\widehat{\mathrm{KL}}(\mathbf P_{\theta^\star}, \mathbf P_\theta) =-\frac{1}{n} \sum_{i=1}^n \log\big(p_\theta (X_i)\big)$ | $\widehat {\mathcal Q} (\theta)=\frac{1}{n} \sum_{i=1}^n \rho(X_i, \theta)$ |
| 4 | Optimize | Take the $\arg \min$ of $\widehat{\mathrm{KL}}$. | Take the $\arg \min$ of $\widehat {\mathcal Q}$. |
(2) The constant $c$ in step 2 has no effect on the location of the minimum and can thus be ignored in this setting.
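The four steps can be sketched in a few lines of Python (the helper name `m_estimate` and the use of `scipy.optimize.minimize_scalar` are illustrative choices, not prescribed by the framework):

```python
import numpy as np
from scipy.optimize import minimize_scalar

def m_estimate(x, rho):
    """Steps 3 & 4: minimize the empirical average of the loss over theta."""
    q_hat = lambda theta: np.mean(rho(x, theta))  # step 3: Q-hat(theta)
    return minimize_scalar(q_hat).x               # step 4: arg min over theta

# Steps 1 & 2: squared loss, whose expectation is minimized at the true mean
rng = np.random.default_rng(0)
x = rng.normal(loc=2.0, size=1_000)
theta_hat = m_estimate(x, rho=lambda x, t: (x - t) ** 2)  # close to x.mean()
```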
In M-estimation, no [[Statistical Model]] needs to be assumed. Therefore we do not estimate distribution-specific parameters (e.g. $\lambda$ for a [[Poisson Distribution|Poisson]] model). Instead, we choose which quantity we are interested in (mean, median, quantile, etc.), and the loss function $\rho$ has to be built accordingly.
## Choice of Loss Functions
| Quantity | Loss function | Intuition |
| -------- | ------------------------------------------- | ------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------ |
| Mean | $\rho(x,\theta)=(x-\theta)^2$ | Minimize squared difference to $\theta$. |
| Median | $\rho(x, \theta)= \lvert x - \theta \rvert$ | Minimize absolute differences to $\theta$. |
| Quantile | $\rho(x, \theta)=C_\alpha(x- \theta)$ | E.g. for the $95^{\text{th}}$ quantile ($\alpha=0.95$), every negative difference $(x-\theta)$ is weighted with $0.05$ and every positive one with $0.95$. The latter case should only occur $5\%$ of the time at the optimal $\theta$. |
| MLE | $\rho(x, \theta) = - \log p_\theta(x)$ | Minimize the negative log-likelihood (i.e. maximize the likelihood). |
The function $C_\alpha$ (the check function) is defined as:
$
C_\alpha(x)= \begin{cases} -(1-\alpha)x & \text{if } x <0\\ \alpha x & \text{if }x \ge 0 \end{cases}
$
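A minimal sketch showing that minimizing the empirical versions of these losses recovers the familiar sample quantities (again using `scipy.optimize.minimize_scalar`; the helper mirrors the earlier sketch, and results are only approximately equal for the non-smooth losses):

```python
import numpy as np
from scipy.optimize import minimize_scalar

def m_estimate(x, rho):
    """Minimize the empirical average loss over theta."""
    return minimize_scalar(lambda theta: np.mean(rho(x, theta))).x

def check_loss(alpha):
    """C_alpha applied to (x - theta): weight (1 - alpha) below zero, alpha above."""
    return lambda x, t: np.where(x - t < 0, -(1 - alpha) * (x - t), alpha * (x - t))

x = np.random.default_rng(1).exponential(scale=1.0, size=10_000)
print(m_estimate(x, lambda x, t: (x - t) ** 2), np.mean(x))      # mean
print(m_estimate(x, lambda x, t: np.abs(x - t)), np.median(x))   # median
print(m_estimate(x, check_loss(0.95)), np.quantile(x, 0.95))     # 95th quantile
```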
[[Properties of an Estimator#Key Properties of an Estimator|Consistency]] and [[Properties of an Estimator#Key Properties of an Estimator|asymptotic normality]] are still guaranteed for M-estimators (under the usual regularity conditions). However, the asymptotic variance is no longer the inverse Fisher information; it takes the sandwich form
$ \sqrt n (\hat \theta_n-\theta^\star) \xrightarrow[n \to \infty]{(d)}\mathcal N\Big (0,\, J(\theta^\star)^{-1} K(\theta^\star) J(\theta^\star)^{-1}\Big), $
where $J(\theta)=\mathbb E\big[\partial^2_\theta\, \rho(X, \theta)\big]$ is the expected Hessian of the loss and $K(\theta)=\mathrm{Cov}\big[\partial_\theta\, \rho(X, \theta)\big]$ is the covariance of its derivative. For MLE both equal the Fisher information, so the sandwich collapses to $I(\theta^\star)^{-1}$.
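For example, with the squared loss $\rho(x,\theta)=(x-\theta)^2$ we have $\partial_\theta \rho = -2(x-\theta)$ and $\partial^2_\theta \rho = 2$, so $J(\theta^\star)=2$ and $K(\theta^\star)=4\,\mathrm{Var}(X)$. The sandwich then equals $\tfrac{1}{2}\cdot 4\,\mathrm{Var}(X)\cdot \tfrac{1}{2}=\mathrm{Var}(X)$, recovering the usual CLT for the sample mean.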