[[Maximum Likelihood Estimation]] is a special case of M-estimation, where the expected loss is (up to an additive constant) the [[Kullback-Leibler Divergence|KL Divergence]]. More generally, we can use any other loss function $\rho$, as long as it has certain properties:
1. It should have an expectation.
$ \mathcal Q(\theta)= \mathbb E\big [\rho(X, \theta)\big] $
2. Its expectation should be minimized uniquely at $\theta^\star$ (the parameter is [[Identifiability|identifiable]]).
$ \arg \min_{\theta \in \Theta}\mathcal Q(\theta) = \theta^\star $
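For example, the squared loss $\rho(x, \theta)=(x-\theta)^2$ satisfies both properties whenever $X$ has finite variance:
$ \mathcal Q(\theta)= \mathbb E\big[(X-\theta)^2\big]= \mathrm{Var}(X)+\big(\mathbb E[X]-\theta\big)^2, $
which is uniquely minimized at $\theta^\star = \mathbb E[X]$.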
## Estimation Process
In both frameworks, we observe i.i.d. random variables $X_1, \dots, X_n$ generated from some $\mathbf P_{\theta^\star}$, where $\theta^\star$ is unknown.
| # | Steps | MLE-Estimation | M-Estimation |
| --- | --------------------------------- | --------------------------------------------------------------------------------------------------------------------------- | --------------------------------------------------------------------------------------------------------------- |
| 1 | Choose loss function | $\mathrm{KL}(\mathbf P_{\theta^\star}, \mathbf P_\theta)$ | $\rho(X, \theta)$ |
| 2 | Take expectation of loss function | $\mathrm{KL}(\mathbf P_{\theta^\star}, \mathbf P_\theta)= c- \mathbb E \big[\log (p_\theta(X))\big]$ | $\mathcal Q(\theta)=\mathbb E\big[\rho(X, \theta)\big]$ |
| 3 | Replace expectation with average | $\widehat{\mathrm{KL}}(\mathbf P_{\theta^\star}, \mathbf P_\theta) =-\frac{1}{n} \sum_{i=1}^n \log\big(p_\theta (X_i)\big)$ | $\widehat {\mathcal Q} (\theta)=\frac{1}{n} \sum_{i=1}^n \rho(X_i, \theta)$ |
| 4 | Optimize | Take the $\arg \min$ of $\widehat{\mathrm{KL}}$. | Take the $\arg \min$ of $\widehat {\mathcal Q}$. |
(2) The constant $c$ in step 2 has no effect on the location of the minimum and can thus be ignored in this setting.
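The four steps can be sketched in a few lines of Python (the helper name `m_estimate` and the use of `scipy.optimize.minimize_scalar` are illustrative choices, not prescribed by the framework):

```python
import numpy as np
from scipy.optimize import minimize_scalar

def m_estimate(x, rho):
    """Steps 3 & 4: minimize the empirical average of the loss over theta."""
    q_hat = lambda theta: np.mean(rho(x, theta))  # step 3: Q-hat(theta)
    return minimize_scalar(q_hat).x               # step 4: arg min over theta

# Steps 1 & 2: squared loss, whose expectation is minimized at the true mean
rng = np.random.default_rng(0)
x = rng.normal(loc=2.0, size=1_000)
theta_hat = m_estimate(x, rho=lambda x, t: (x - t) ** 2)  # close to x.mean()
```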
In M-estimation, no [[Statistical Model]] needs to be assumed. Therefore we do not estimate distribution-specific parameters (e.g. $\lambda$ for a [[Poisson Distribution|Poisson]] model). Instead, we choose which quantity we are interested in (mean, median, quantile, etc.), and the loss function $\rho$ has to be built accordingly.
## Choice of Loss Functions
| Quantity | Loss function | Intuition |
| -------- | ------------------------------------------- | ------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------ |
| Mean | $\rho(x,\theta)=(x-\theta)^2$ | Minimize squared difference to $\theta$. |
| Median | $\rho(x, \theta)= \lvert x - \theta \rvert$ | Minimize absolute differences to $\theta$. |
| Quantile | $\rho(x, \theta)=C_\alpha(x- \theta)$ | E.g. for the $95^{\text{th}}$ quantile ($\alpha=0.95$), every negative difference $(x-\theta)$ is weighted with $0.05$ and every positive one with $0.95$. The latter case should only occur $5\%$ of the time at the optimal $\theta$. |
| MLE | $\rho(x, \theta) = - \log p_\theta(x)$ | Minimize the negative log-likelihood (i.e. maximize the likelihood). |
The function $C_\alpha$ (the check function) is defined as:
$
C_\alpha(x)= \begin{cases} -(1-\alpha)x & \text{if } x <0\\ \alpha x & \text{if }x \ge 0 \end{cases}
$
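A minimal sketch showing that minimizing the empirical versions of these losses recovers the familiar sample quantities (again using `scipy.optimize.minimize_scalar`; the helper mirrors the earlier sketch, and results are only approximately equal for the non-smooth losses):

```python
import numpy as np
from scipy.optimize import minimize_scalar

def m_estimate(x, rho):
    """Minimize the empirical average loss over theta."""
    return minimize_scalar(lambda theta: np.mean(rho(x, theta))).x

def check_loss(alpha):
    """C_alpha applied to (x - theta): weight (1 - alpha) below zero, alpha above."""
    return lambda x, t: np.where(x - t < 0, -(1 - alpha) * (x - t), alpha * (x - t))

x = np.random.default_rng(1).exponential(scale=1.0, size=10_000)
print(m_estimate(x, lambda x, t: (x - t) ** 2), np.mean(x))      # mean
print(m_estimate(x, lambda x, t: np.abs(x - t)), np.median(x))   # median
print(m_estimate(x, check_loss(0.95)), np.quantile(x, 0.95))     # 95th quantile
```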
[[Properties of an Estimator#Key Properties of an Estimator|Consistency]] and [[Properties of an Estimator#Key Properties of an Estimator|asymptotic normality]] are still guaranteed for M-estimators (under the usual regularity conditions). However, the asymptotic variance is no longer the inverse Fisher information; it takes the sandwich form
$ \sqrt n (\hat \theta_n-\theta^\star) \xrightarrow[n \to \infty]{(d)}\mathcal N\Big (0,\, J(\theta^\star)^{-1} K(\theta^\star) J(\theta^\star)^{-1}\Big), $
where $J(\theta)=\mathbb E\big[\partial^2_\theta\, \rho(X, \theta)\big]$ is the expected Hessian of the loss and $K(\theta)=\mathrm{Cov}\big[\partial_\theta\, \rho(X, \theta)\big]$ is the covariance of its derivative. For MLE both equal the Fisher information, so the sandwich collapses to $I(\theta^\star)^{-1}$.
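For example, with the squared loss $\rho(x,\theta)=(x-\theta)^2$ we have $\partial_\theta \rho = -2(x-\theta)$ and $\partial^2_\theta \rho = 2$, so $J(\theta^\star)=2$ and $K(\theta^\star)=4\,\mathrm{Var}(X)$. The sandwich then equals $\tfrac{1}{2}\cdot 4\,\mathrm{Var}(X)\cdot \tfrac{1}{2}=\mathrm{Var}(X)$, recovering the usual CLT for the sample mean.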