When $Y$ comes from a [[Canonical Exponential Family]], the [[Probability Density Function|PDF]] for a single observation $y_i$ looks as follows:

$ f_{\theta_i}(y_i) = \exp \left\{\frac{y_i\theta_i-b(\theta_i)}{\phi}+c(y_i, \phi)\right\} $

However, since we want to estimate $\beta$ (via [[Maximum Likelihood Estimation|MLE]]), we need to form the [[Likelihood Functions|likelihood]] w.r.t. $(\beta, \mathbb X_i)$ instead of the canonical parameter $\theta_i$. The structure of [[Generalized Linear Models]] tells us how $(\beta, \mathbb X)$ relate to the regression function $\mu$:

$ g(\mu_i)=\mathbb X_i^T \beta $

Depending on which [[Canonical Link Functions|Link Function]] $g$ we use, we can replace $\theta$ by some form of $g(\mu)$.

- *Canonical link:* By design, the canonical link $g$ lets us use the linear predictor $\mathbb X_i^T \beta$ directly in place of $\theta$, i.e. the canonical $g$ ensures that we can work with the linear predictor inside the exponential PDF. $ \theta_i = g(\mu_i)= \mathbb X_i^T \beta $
- *Non-canonical link:* We have to express $\theta$ by some potentially complicated function $h(\cdot)$ of the linear predictor $\mathbb X_i^T \beta$.

$ \begin{align} \mu_i&=b^\prime(\theta_i)\\[8pt] \theta_i &= (b^\prime)^{-1}(\mu_i) \\[8pt] &=(b^\prime)^{-1}\big(g^{-1}(\mathbb X_i^T\beta)\big) \\[8pt] &=(g\circ b^\prime)^{-1}(\mathbb X_i^T \beta) \\[8pt] &\equiv h(\mathbb X_i^T \beta) \end{align} $

where:
- (1) We use that the [[Canonical Exponential Family|Mean of a Canonical Exponential]] is the first derivative of the log-partition function.
- (2) We invert that relation, going from $b^\prime: \theta \mapsto \mu$ to $(b^\prime)^{-1}: \mu \mapsto \theta$.
- (3) We substitute $\mu_i = g^{-1}(\mathbb X_i^T\beta)$ from the GLM structure above.
- (4) Composing the inverses, $(b^\prime)^{-1}\big(g^{-1}(\cdot)\big)$ can also be written as $(g\circ b^\prime)^{-1}(\cdot)$.
- (5) We abbreviate $(g\circ b^\prime)^{-1}(\cdot)$ as $h(\cdot)$.

>[!note]
>If we use the canonical link, then $h$ is the identity function $\mathbf I$.

Depending on our choice of $g$, the log-likelihood w.r.t. $(\beta, \mathbb X_i)$ looks as follows:

$ \begin{align} \ell(\mathbf Y, \mathbb X, \theta) &= \sum_{i=1}^n \frac{\mathbf Y_i\theta_i-b(\theta_i)}{\phi}+c(\mathbf Y_i, \phi) \\[12pt] \ell(\mathbf Y, \mathbb X, \beta) &= \sum_{i=1}^n \frac{\mathbf Y_i h(\mathbb X_i^T\beta)-b(h(\mathbb X_i^T\beta))}{\phi}+c(\mathbf Y_i, \phi) \\[12pt] \ell(\mathbf Y, \mathbb X, \beta) &= \sum_{i=1}^n \frac{\mathbf Y_i \mathbb X_i^T\beta-b(\mathbb X_i^T\beta)}{\phi}+c(\mathbf Y_i, \phi) \end{align} $

where the second line uses a general (possibly non-canonical) link via $h$, and the third line is the canonical-link case with $h = \mathbf I$.

We need to check the [[Identify Convex and Concave Functions|concavity]] of our likelihood function to ensure that our maximization has a unique solution.
- *Univariate case:* Concavity is given when the second derivative $\frac{\partial^2 \ell}{\partial \beta^2}$ is non-positive.
- *Multivariate case:* Concavity is given when the [[Hessian]] $\mathbf H$ is negative semi-definite.

However, when we use the canonical link, the log-likelihood is guaranteed to be strictly concave, so MLE leads to a unique maximum.
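As a quick sanity check of the canonical-link case and the concavity claim, one can plug in the Bernoulli family (logistic regression); the choice of family here is purely illustrative. There $b(\theta)=\log(1+e^{\theta})$, $\phi = 1$, and $g$ is the logit link, so

$ \begin{align} \ell(\mathbf Y, \mathbb X, \beta) &= \sum_{i=1}^n \mathbf Y_i \, \mathbb X_i^T\beta-\log\big(1+e^{\mathbb X_i^T\beta}\big) \\[8pt] \mathbf H &= -\sum_{i=1}^n b^{\prime\prime}(\mathbb X_i^T\beta)\, \mathbb X_i \mathbb X_i^T, \qquad b^{\prime\prime}(\theta)=\mu(1-\mu)>0 \end{align} $

Since $b^{\prime\prime}>0$, the Hessian is negative semi-definite (negative definite for a full-rank design), which is exactly the concavity we rely on.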
Setting the first derivative of the canonical-link log-likelihood to zero gives the estimating equation for $\hat\beta$:

$ \begin{align} \frac{\partial \ell}{\partial \beta}:\quad\frac{1}{\phi}\sum_{i=1}^n \Big(\mathbf Y_i -b^\prime(\mathbb X_i^T \beta)\Big)\, \mathbb X_i &\stackrel{!}{=}0 \\[6pt] \sum_{i=1}^n \mathbf Y_i \, \mathbb X_i &= \sum_{i=1}^n b^\prime(\mathbb X_i^T \hat \beta)\, \mathbb X_i \end{align} $

## Special Case of Linear Regression
When the canonical exponential family of $\mathbf Y$ is [[Gaussian Distribution|Gaussian]], we know that the link function is the identity $g= \mathbf I$ and $b^\prime(\theta)=\theta$, which simplifies everything to the [[Multivariate Linear Regression]] normal equations.

$ \begin{aligned} \sum_{i=1}^n \mathbf Y_i \, \mathbb X_i &= \sum_{i=1}^n b^\prime(\mathbb X_i^T \hat \beta)\, \mathbb X_i \\[8pt] \sum_{i=1}^n \mathbf Y_i \, \mathbb X_i &= \sum_{i=1}^n \mathbb X_i \mathbb X_i^T \hat \beta \\[8pt] \mathbb X^T \mathbf Y &= \mathbb X^T \mathbb X \, \hat \beta \\[14pt] \hat \beta &= (\mathbb X^T\mathbb X)^{-1} \mathbb X^T \mathbf Y \end{aligned} $

## Closed-Form Solutions
- *Linear regression:* It is just a special case of the GLM. It has a closed-form solution because minimizing least squares is a convex quadratic problem.
- *Generalized linear model:* Most GLMs do not have a closed-form solution and rely on iterative optimization algorithms (e.g. Newton-Raphson/IRLS or gradient descent) instead. This is because $b^\prime$ makes the estimating equation non-linear in $\beta$, so setting the derivative to zero cannot be solved analytically; see the sketch below.
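To make the contrast between the two bullets concrete, here is a minimal sketch (assuming NumPy; the synthetic data, `beta_true`, and the Poisson example are hypothetical illustrations, not part of the notes above). The Gaussian case is solved in one shot via the normal equations, while a Poisson GLM with the canonical log link ($b(\theta)=e^{\theta}$, so $b^\prime(\mathbb X_i^T\beta)=e^{\mathbb X_i^T\beta}$) iterates Newton-Raphson on the score equation from above.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical design matrix: n observations, p coefficients (incl. intercept).
n, p = 500, 3
X = np.column_stack([np.ones(n), rng.normal(size=(n, p - 1))])
beta_true = np.array([0.5, -0.3, 0.8])  # illustrative "true" coefficients

# --- Gaussian family: closed form, beta_hat = (X^T X)^{-1} X^T Y ---
y_gauss = X @ beta_true + rng.normal(scale=0.5, size=n)
beta_ols = np.linalg.solve(X.T @ X, X.T @ y_gauss)

# --- Poisson family (canonical log link): no closed form, Newton-Raphson ---
# Score:   sum_i (Y_i - exp(X_i^T beta)) X_i      (phi = 1)
# Hessian: -sum_i exp(X_i^T beta) X_i X_i^T
y_pois = rng.poisson(np.exp(X @ beta_true))
beta = np.zeros(p)
for _ in range(25):
    mu = np.exp(X @ beta)                  # b'(X_i^T beta)
    score = X.T @ (y_pois - mu)
    hessian = -(X * mu[:, None]).T @ X
    step = np.linalg.solve(hessian, score)
    beta = beta - step                     # Newton update on the concave log-likelihood
    if np.max(np.abs(step)) < 1e-10:       # stop once the update is negligible
        break

print("OLS (closed form):   ", np.round(beta_ols, 3))
print("Poisson MLE (Newton):", np.round(beta, 3))
```

Both estimates should land close to `beta_true`; the only structural difference is that the Poisson fit needs a handful of Newton iterations because its score equation cannot be inverted analytically.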