We have a design matrix $\mathbb X \in \mathbb R^{n \times d}$ with $n$ observations and $d$ features. Each observation belongs to one of the classes in $C$. We want to draw decision boundaries, so we can allocate new observations to the correct class based on these boundaries.
**Useful link:** [Machine Learning: What is Discriminant Analysis?](https://www.youtube.com/watch?v=eBm8Uo9yhwI)
## Quadratic Discriminant Analysis
We assume that the observations of a given class $(X \vert C=c)$ come from a [[Multivariate Gaussian]] $\mathcal N^d$, each with its own mean $\mu_c$. In QDA we additionally assume that the $\mathcal N^d$ of each class has an individual [[Covariance Matrix]] $\Sigma_c$.
This results in quadratic decision boundaries.
$ (X\vert C=c) \sim \mathcal N^d(\mu_c, \Sigma_c) $
![[discriminant-analysis.png|center|400]]
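A minimal numpy sketch of this generative assumption, with made-up means, covariances, and sample sizes for two classes:
```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical parameters: each class has its own mean and covariance matrix (QDA setting).
mu_1, Sigma_1 = np.array([0.0, 0.0]), np.array([[1.0, 0.5], [0.5, 1.0]])
mu_2, Sigma_2 = np.array([2.5, 1.0]), np.array([[0.5, -0.2], [-0.2, 1.5]])

# Draw observations per class: (X | C=c) ~ N(mu_c, Sigma_c).
X_1 = rng.multivariate_normal(mu_1, Sigma_1, size=100)
X_2 = rng.multivariate_normal(mu_2, Sigma_2, size=100)
```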
For a new observation $x$ we choose the class $c$ such that $\mathbf P(c\vert x)$ is maximized:
$ \arg \max_c \mathbf P(C=c \vert X=x) $
$
\begin{align}
\mathbf P(C=c \vert X=x) &\propto \overbrace{\mathbf P(C=c)}^{\text{Prior}}* \overbrace{\mathbf P(X=x\vert C=c)}^{\text{Likelihood}} \tag{1}\\[4pt]
\log \mathbf P(C=c \vert X=x) & \propto \log\Big(\mathbf P(C=c) * \mathbf P(X=x\vert C=c)\Big) \tag{2}\\[6pt]
& \propto \log\mathbf P(C=c)+ \log \mathbf P(X=x\vert C=c) \tag{3}
\end{align}
$
where:
- (1) We apply [[Bayes Rule]] to express $\mathbf P(c\vert x)$ as a multiplication of prior and likelihood.
- (2) Since we only care about the $\arg \max$ and not the actual maximum, we can take the $\log$, because the logarithm is a [[Monotonicity|monotonically]] increasing function.
- (3) The logarithm of a product is the sum of the logarithms.
$
\begin{align}
\mathbf P(X=x \vert C=c)&=\frac{1}{\sqrt{(2\pi)^d \det(\Sigma_c)}}*\exp \left(-\frac{1}{2}(x-\mu_c)^T \Sigma_c^{-1} (x-\mu_c)\right) \\[10pt]
\log \mathbf P(X=x\vert C=c)&=-\frac{d}{2}\log(2\pi)-\frac{1}{2}\log \det \Sigma_c-\frac{1}{2}(x-\mu_c)^T \Sigma_c^{-1}(x-\mu_c)
\end{align}
$
We write out the PDF of the likelihood, which is a multivariate Gaussian. The term $-\frac{d}{2}\log(2\pi)$ is the same for every class, so we drop it in the following.
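As a quick sanity check (a sketch with made-up parameters), the manual log-density above agrees with `scipy.stats.multivariate_normal.logpdf`:
```python
import numpy as np
from scipy.stats import multivariate_normal

# Made-up class parameters and query point, purely for illustration.
mu_c = np.array([0.0, 1.0])
Sigma_c = np.array([[2.0, 0.3], [0.3, 1.0]])
x = np.array([1.0, -0.5])
d = len(x)

# Manual evaluation of log P(X=x | C=c) for a multivariate Gaussian.
diff = x - mu_c
log_lik = (-0.5 * d * np.log(2 * np.pi)
           - 0.5 * np.log(np.linalg.det(Sigma_c))
           - 0.5 * diff @ np.linalg.solve(Sigma_c, diff))

# Should match the reference implementation.
assert np.isclose(log_lik, multivariate_normal(mu_c, Sigma_c).logpdf(x))
```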
$
\begin{align}
\log \mathbf P(c \vert x) &\propto \log\mathbf P(C=c)+ \log \mathbf P(X=x\vert C=c) \\
\log \mathbf P(c \vert x) &\propto \log\mathbf P(C=c) -\frac{1}{2}\log \det \Sigma_c-\frac{1}{2}(x-\mu_c)^T \Sigma_c^{-1}(x-\mu_c)
\end{align}
$
Finally, we plug in the $x$ of a new observation and check for which class $c$ the posterior is highest.
>[!note]
>The class allocation depends on the term $(x-\mu_c)^T \Sigma_c^{-1}(x-\mu_c)$, which contains $x^T \Sigma_c^{-1} x$ and is therefore quadratic w.r.t. $x$. Since $\Sigma_c$ differs between classes, these quadratic terms do not cancel, so the decision boundary is quadratic.
**Decision boundary:** Collection of all points $x$ where the probability of belonging to two classes is equal.
$ \mathbf P(C=c_1 \vert X=x)= \mathbf P(C=c_2\vert X=x) $
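A minimal numpy sketch of this decision rule; the helper names `qda_fit`/`qda_predict` and the plug-in estimation of $\mu_c$, $\Sigma_c$, and the priors from labeled training data are illustrative assumptions, not part of the derivation above:
```python
import numpy as np

def qda_fit(X, y):
    """Estimate the prior, mean, and covariance matrix of every class."""
    params = {}
    for c in np.unique(y):
        X_c = X[y == c]
        prior = len(X_c) / len(X)               # P(C=c)
        mu_c = X_c.mean(axis=0)                 # class mean
        Sigma_c = np.cov(X_c, rowvar=False)     # class covariance
        params[c] = (prior, mu_c, Sigma_c)
    return params

def qda_predict(x, params):
    """Choose the class c that maximizes the (log-)posterior derived above."""
    scores = {}
    for c, (prior, mu_c, Sigma_c) in params.items():
        diff = x - mu_c
        scores[c] = (np.log(prior)
                     - 0.5 * np.log(np.linalg.det(Sigma_c))
                     - 0.5 * diff @ np.linalg.solve(Sigma_c, diff))
    return max(scores, key=scores.get)
```
For example, with the simulated `X_1` and `X_2` from above, `qda_fit(np.vstack([X_1, X_2]), np.repeat([1, 2], 100))` estimates the parameters, and `qda_predict` then assigns a new point to the class with the highest score.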
## Linear Discriminant Analysis
There might not be enough samples to properly estimate a separate covariance matrix $\Sigma_c$ per class. We can simplify by assuming that the observations of each class are spread out around their class mean $\mu_c$ in the same way. Hence, there is just a single covariance matrix $\Sigma$ shared by the Gaussians $\mathcal N^d$ of all classes.
This results in linear decision boundaries.
$ (X\vert C=c) \sim \mathcal N^d(\mu_c, \Sigma) $
![[linear-discriminant-analysis.png|center|400]]
This also simplifies the posterior, since $\Sigma$ no longer depends on the class:
$ \log \mathbf P(c \vert x) \propto \log\mathbf P(C=c) -\frac{1}{2}\log \det \Sigma-\frac{1}{2}(x-\mu_c)^T \Sigma^{-1}(x-\mu_c) $
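Expanding the quadratic form shows why the boundary becomes linear: $-\frac{1}{2}\log \det \Sigma$ and $-\frac{1}{2}x^T\Sigma^{-1}x$ are the same for every class, so they cancel when comparing two classes, and only terms that are linear in $x$ remain.
$
\begin{align}
-\frac{1}{2}(x-\mu_c)^T \Sigma^{-1}(x-\mu_c) &= -\frac{1}{2}x^T\Sigma^{-1}x+\mu_c^T\Sigma^{-1}x-\frac{1}{2}\mu_c^T\Sigma^{-1}\mu_c \\[6pt]
\log \mathbf P(c \vert x) &\propto \log\mathbf P(C=c)+\mu_c^T\Sigma^{-1}x-\frac{1}{2}\mu_c^T\Sigma^{-1}\mu_c
\end{align}
$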
## Linear Discriminant Analysis (Identity Matrix)
The simplest setup is when the covariance matrix is just the identity times a global variance $\sigma^2$. This means that all $d$ features have the same variance and are independent of each other.
$ (X\vert C=c) \sim \mathcal N^d(\mu_c, \sigma^2I) $
In this case the decision boundary is also linear. Even more, a new observation $x$ will be assigned to the class whose mean $\mu_c$ is closest (assuming equal priors).
![[linear-discriminant-analysis-identity.png|center|400]]
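Plugging $\Sigma=\sigma^2 I$ into the posterior makes this explicit: with equal priors, maximizing the posterior is the same as minimizing the Euclidean distance to the class mean.
$ \log \mathbf P(c \vert x) \propto \log\mathbf P(C=c) -\frac{1}{2\sigma^2}\lVert x-\mu_c\rVert^2 $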
Since computing the distance to the class means is computationally much cheaper, we can transform our $\mathbb X$ so that the covariance becomes the identity.
We can achieve this via the [[Spectral Theorem]], by decomposing the covariance matrix into eigenvectors $V$ and eigenvalues $\Lambda$.
$
\begin{align}
\Sigma = X^TX &= V \Lambda V^T \\[6pt]
(XV\Lambda^{-1/2})^T (XV\Lambda^{-1/2}) &= \Lambda^{-1/2}V^T (V \Lambda V^T)V\Lambda^{-1/2} \\[6pt]
\Lambda^{-1/2}V^T (X^TX) V\Lambda^{-1/2} &= \Lambda^{-1/2}V^T(V \Lambda V^T)V\Lambda^{-1/2} \\[6pt]
\Lambda^{-1/2}V^T (X^TX) V\Lambda^{-1/2} &= \Lambda^{-1/2} \Lambda \Lambda^{-1/2} \\[6pt]
\Lambda^{-1/2}V^T X^T X \underbrace{V\Lambda^{-1/2}}_{\text{transformation}} &= \mathbf I
\end{align}
$
We have shown that when the data $X$ is multiplied by the whitening transformation $V\Lambda^{-1/2}$, the covariance of the transformed data becomes the identity matrix $\mathbf I$.
- The multiplication with $V$ rotates $X$ into the basis of the eigenvectors.
- The multiplication with $\Lambda^{-1/2}$ scales each eigendirection by the inverse square root of its eigenvalue, so that every direction has variance $1$.
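A minimal numpy sketch of this whitening step, on made-up correlated data (`np.linalg.eigh` is used because the covariance matrix is symmetric):
```python
import numpy as np

rng = np.random.default_rng(1)

# Made-up (roughly centered) data with correlated features.
X = rng.multivariate_normal([0.0, 0.0], [[3.0, 1.2], [1.2, 2.0]], size=1000)

# Eigendecomposition of the covariance matrix: Sigma = V Lambda V^T.
Sigma = X.T @ X / len(X)
eigvals, V = np.linalg.eigh(Sigma)

# Whitening transformation: rotate onto the eigenvectors, then rescale by Lambda^{-1/2}.
X_white = X @ V @ np.diag(eigvals ** -0.5)

# The covariance of the transformed data is the identity (up to rounding).
print(np.round(X_white.T @ X_white / len(X_white), 4))
```
Note that a new observation has to be whitened with the same $V\Lambda^{-1/2}$ before its distance to the (also whitened) class means is computed.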