When $Y$ comes from a [[Canonical Exponential Family]], the [[Probability Density Function|PDF]] of a single observation $y_i$ looks as follows:
$ f_{\theta_i}(y_i) = \exp \left\{\frac{y_i\theta_i-b(\theta_i)}{\phi}+c(y_i, \phi)\right\} $
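For example, a Bernoulli observation fits this form with $\theta_i = \log\frac{\mu_i}{1-\mu_i}$, $b(\theta_i)=\log(1+e^{\theta_i})$, $\phi=1$ and $c(y_i,\phi)=0$:
$ f_{\theta_i}(y_i) = \mu_i^{y_i}(1-\mu_i)^{1-y_i} = \exp\left\{ y_i \log \tfrac{\mu_i}{1-\mu_i} + \log(1-\mu_i) \right\} = \exp\left\{ y_i\theta_i - b(\theta_i) \right\} $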
However, since we want to estimate $\beta$ (via [[Maximum Likelihood Estimation|MLE]]), we need to express the [[Likelihood Functions|likelihood]] in terms of $(\beta, \mathbb X_i)$ instead of the canonical parameter $\theta_i$. [[Generalized Linear Models]] give us the structure, i.e. how $(\beta, \mathbb X)$ relate to the regression function $\mu$:
$ g(\mu_i)=\mathbb X_i^T \beta $
Depending on which [[Canonical Link Functions|Link Function]] $g$ we use, we can replace $\theta$ by some form of $g(\mu)$.
- *Canonical link:* By design, the canonical link $g$ lets us plug the linear predictor $\mathbb X_i^T \beta$ directly in place of $\theta$, so we can work with the linear predictor inside the exponential-family PDF as-is.
$ \theta_i = g(\mu_i)= \mathbb X_i^T \beta $
- *Non-canonical link:* We have to express $\theta$ through a potentially complicated function $h(\cdot)$ of the linear predictor $\mathbb X_i^T \beta$ (a worked example follows after the note below).
$
\begin{align}
\mu_i&=b^\prime(\theta_i)\\[8pt]
\theta_i &= (b^\prime)^{-1}(\mu_i) \\[8pt]
&=(b^\prime)^{-1}(g^{-1}(\mathbb X_i^T\beta)) \\[8pt]
&=\big(g \circ b^\prime\big)^{-1}(\mathbb X_i^T \beta) \\[8pt]
&\equiv h(\mathbb X_i^T \beta)
\end{align}
$
where:
- (1) We use the fact that the [[Canonical Exponential Family|Mean of a Canonical Exponential]] is the first derivative of the log-partition function.
- (2) We take the inverse of that first derivative, going from $b^\prime: \theta \mapsto \mu$ to $(b^\prime)^{-1}: \mu \mapsto \theta$.
- (3) We substitute $\mu_i = g^{-1}(\mathbb X_i^T\beta)$ from the GLM structure above.
- (4) The composition of two inverses $a^{-1}(b^{-1}(\cdot))$ equals the inverse of the reversed composition, $(b \circ a)^{-1}(\cdot)$.
- (5) We abbreviate $(g \circ b^\prime)^{-1}(\cdot)$ as $h(\cdot)$.
> [!note]
>If we use the canonical link, then $h$ is the identity function $\mathbf I$.
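For example, in a Poisson model we have $b(\theta)=e^\theta$, hence $b^\prime(\theta)=e^\theta$ and $(b^\prime)^{-1}(\mu)=\log \mu$. With the canonical log link $g(\mu)=\log\mu$ this gives $h = \mathbf I$, whereas with the (non-canonical) identity link $g(\mu)=\mu$ we get
$ \theta_i = h(\mathbb X_i^T\beta) = (b^\prime)^{-1}\big(g^{-1}(\mathbb X_i^T\beta)\big) = \log(\mathbb X_i^T\beta) $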
Depending on our choice of $g$, the log-likelihood w.r.t. $(\beta, \mathbb X_i)$ looks as follows (the last line uses the canonical link, where $h = \mathbf I$):
$
\begin{align}
\ell(\mathbf Y, \mathbb X, \theta) &= \sum_{i=1}^n \frac{\mathbf Y_i\theta_i-b(\theta_i)}{\phi}+c(\mathbf Y_i, \phi) \\[12pt]
\ell(\mathbf Y, \mathbb X, \beta) &= \sum_{i=1}^n \frac{\mathbf Y_i h(\mathbb X_i^T\beta)-b(h(\mathbb X_i^T\beta))}{\phi}+c(\mathbf Y_i, \phi) \\[12pt]
\ell(\mathbf Y, \mathbb X, \beta) &= \sum_{i=1}^n \frac{\mathbf Y_i \mathbb X_i^T\beta-b(\mathbb X_i^T\beta)}{\phi}+c(\mathbf Y_i, \phi)
\end{align}
$
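To make the formula concrete, here is a minimal sketch of this log-likelihood in code, assuming a Bernoulli model with the canonical logit link (so $b(\theta)=\log(1+e^{\theta})$, $\phi=1$ and $c=0$); the function and variable names are purely illustrative.

```python
import numpy as np

def log_likelihood(beta, X, y):
    """Log-likelihood of a Bernoulli GLM with canonical (logit) link:
    sum_i [ y_i * x_i^T beta - b(x_i^T beta) ],  with b(t) = log(1 + e^t)."""
    eta = X @ beta                            # linear predictor X_i^T beta
    return np.sum(y * eta - np.log1p(np.exp(eta)))
```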
We need to check the [[Identify Convex and Concave Functions|concavity]] of our log-likelihood to ensure that the maximizer is unique.
- *Univariate case:* Concavity is given when the second derivative $\frac{\partial^2 \ell}{\partial \beta^2}$ is non-positive.
- *Multivariate case:* Concavity is given when the [[Hessian]] $\mathbf H$ is negative semi-definite.
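For the canonical link $\theta_i = \mathbb X_i^T\beta$, the [[Hessian]] can be written out explicitly (using the canonical-exponential-family fact that $\operatorname{Var}(Y_i) = \phi\, b^{\prime\prime}(\theta_i) \ge 0$):
$ \mathbf H = -\frac{1}{\phi}\sum_{i=1}^n b^{\prime\prime}(\mathbb X_i^T\beta)\, \mathbb X_i \mathbb X_i^T $
Each summand $b^{\prime\prime}(\mathbb X_i^T\beta)\,\mathbb X_i\mathbb X_i^T$ is positive semi-definite, so $\mathbf H$ is negative semi-definite.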
Thus, when we use the canonical link, the log-likelihood is guaranteed to be concave (strictly, as long as $b^{\prime\prime}>0$ and the design matrix has full rank), and MLE leads to a unique maximum. Setting the first derivative w.r.t. $\beta$ to zero gives the score equations:
$
\begin{align}
\frac{\partial \ell}{\partial \beta}:\quad \frac{1}{\phi}\sum_{i=1}^n \big(\mathbf Y_i - b^\prime(\mathbb X_i^T \beta)\big)\, \mathbb X_i &\stackrel{!}{=}0 \\[6pt]
\sum_{i=1}^n \mathbf Y_i \mathbb X_i &= \sum_{i=1}^n b^\prime(\mathbb X_i^T \hat \beta)\, \mathbb X_i
\end{align}
$
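Continuing the illustrative Bernoulli/logit sketch from above (simulated toy data, generic `scipy` optimizer; not part of the derivation): we can maximize the log-likelihood numerically and check that the fitted $\hat\beta$ satisfies $\sum_i \mathbf Y_i \mathbb X_i = \sum_i b^\prime(\mathbb X_i^T \hat\beta)\,\mathbb X_i$ up to numerical tolerance.

```python
import numpy as np
from scipy.optimize import minimize
from scipy.special import expit   # b'(t) = e^t / (1 + e^t) for the Bernoulli model

# simulate toy data (illustrative only)
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
y = rng.binomial(1, expit(X @ np.array([0.5, -1.0, 2.0])))

# maximize the concave log-likelihood by minimizing its negative
neg_ll = lambda b: -np.sum(y * (X @ b) - np.log1p(np.exp(X @ b)))
beta_hat = minimize(neg_ll, x0=np.zeros(3)).x

# score equations: sum_i Y_i X_i == sum_i b'(X_i^T beta_hat) X_i
print(np.allclose(X.T @ y, X.T @ expit(X @ beta_hat), atol=1e-2))  # True
```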
## Special Case of Linear Regression
When the canonical exponential family of $\mathbf Y$ is [[Gaussian Distribution|Gaussian]], the canonical link is the identity, $g = \mathbf I$ (and likewise $b^\prime = \mathbf I$), which simplifies everything to the [[Multivariate Linear Regression]] equation.
$
\begin{aligned}
\sum_{i=1}^n \mathbf Y_i \mathbb X_i &= \sum_{i=1}^n b^\prime(\mathbb X_i^T \hat \beta)\, \mathbb X_i \\[8pt]
\sum_{i=1}^n \mathbf Y_i \mathbb X_i &= \sum_{i=1}^n \mathbb X_i \mathbb X_i^T \hat \beta \\[8pt]
\mathbb X^T \mathbf Y &= \mathbb X^T \mathbb X \, \hat \beta \\[14pt]
\hat \beta &= (\mathbb X^T\mathbb X)^{-1} \mathbb X^T \mathbf Y
\end{aligned}
$
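As a quick numerical sanity check of this special case (toy data, names purely illustrative), the closed-form estimator matches what a generic least-squares solver returns:

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.normal(size=(100, 3))
Y = X @ np.array([1.0, -2.0, 0.5]) + rng.normal(size=100)

beta_closed_form = np.linalg.inv(X.T @ X) @ X.T @ Y    # (X^T X)^{-1} X^T Y
beta_lstsq, *_ = np.linalg.lstsq(X, Y, rcond=None)     # generic least-squares solver

print(np.allclose(beta_closed_form, beta_lstsq))       # True
```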
## Closed-Form Solutions
- *Linear regression:* It is just a special case of a GLM. It has a closed-form solution because least-squares minimization is a convex quadratic problem.
- *Generalized linear model:* Most GLMs do not have a closed-form solution and rely on iterative optimization algorithms (e.g. gradient descent) instead. This is because the score equations involve the non-linear function $b^\prime(\cdot)$, so setting the derivative to zero cannot be solved for $\hat\beta$ analytically (see the sketch below).
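A minimal sketch of such an iterative fit, assuming a Poisson model with canonical log link (so $b^\prime(\theta)=e^\theta$); the data, step size, and iteration count are arbitrary illustrations, not a recommended implementation:

```python
import numpy as np

# simulate toy Poisson data (illustrative only)
rng = np.random.default_rng(2)
X = rng.normal(size=(200, 2))
y = rng.poisson(np.exp(X @ np.array([0.3, -0.7])))

beta = np.zeros(2)
lr = 0.1                                          # illustrative step size
for _ in range(5000):
    # gradient of the averaged log-likelihood: mean_i (y_i - b'(x_i^T beta)) x_i, with b'(t) = e^t
    grad = X.T @ (y - np.exp(X @ beta)) / len(y)
    beta += lr * grad                             # ascent step (we maximize the likelihood)

print(beta)  # close to the true coefficients [0.3, -0.7]
```

In practice, GLMs are usually fit with Newton-type methods (e.g. iteratively reweighted least squares), which solve the same score equations but converge much faster than plain gradient steps.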