We have a design matrix $\mathbb X \in \mathbb R^{n \times d}$ with $n$ observations and $d$ features. Each observation belongs to one of the classes $c \in C$. We want to draw decision boundaries, so that we can assign new observations to the correct class.

**Useful link:** [Machine Learning: What is Discriminant Analysis?](https://www.youtube.com/watch?v=eBm8Uo9yhwI)

## Quadratic Discriminant Analysis

We assume that the observations of a given class $(X \vert C=c)$ come from a [[Multivariate Gaussian]] $\mathcal N^d$ with a class-specific mean $\mu_c$. In QDA we additionally assume that the $\mathcal N^d$ of each class has its own [[Covariance Matrix]] $\Sigma_c$. This results in quadratic decision boundaries.

$$
(X\vert C=c) \sim \mathcal N^d(\mu_c, \Sigma_c)
$$

![[discriminant-analysis.png|center|400]]

For a new observation $x$ we choose the class $c$ such that $\mathbf P(c\vert x)$ is maximized:

$$
\arg \max_c \mathbf P(C=c \vert X=x)
$$

$$
\begin{align}
\mathbf P(C=c \vert X=x) &\propto \overbrace{\mathbf P(C=c)}^{\text{Prior}}\cdot \overbrace{\mathbf P(X=x\vert C=c)}^{\text{Likelihood}} \tag{1}\\[4pt]
\log \mathbf P(C=c \vert X=x) & \propto \log\Big(\mathbf P(C=c) \cdot \mathbf P(X=x\vert C=c)\Big) \tag{2}\\[6pt]
& \propto \log\mathbf P(C=c)+ \log \mathbf P(X=x\vert C=c) \tag{3}
\end{align}
$$

where:

- (1) We apply [[Bayes Rule]] to express $\mathbf P(c\vert x)$ as the product of prior and likelihood; the evidence $\mathbf P(X=x)$ does not depend on the class and can be dropped, hence the $\propto$.
- (2) Since we only care about the $\arg \max$ and not the actual maximum, we can take the $\log$, because the logarithm is a [[Monotonicity|monotonically]] increasing function.
- (3) The logarithm turns the product into a sum.

Next, we write out the PDF of the likelihood, which is a multivariate Gaussian. The constant $-\frac{d}{2}\log(2\pi)$ is the same for every class, so it is dropped in the second line:

$$
\begin{align}
\mathbf P(X=x \vert C=c)&=\frac{1}{\sqrt{(2\pi)^d \det(\Sigma_c)}}\exp \left(-\frac{1}{2}(x-\mu_c)^T \Sigma_c^{-1} (x-\mu_c)\right) \\[10pt]
\log \mathbf P(X=x\vert C=c)&\propto-\frac{1}{2}\log \det \Sigma_c-\frac{1}{2}(x-\mu_c)^T \Sigma_c^{-1}(x-\mu_c)
\end{align}
$$

Plugging this into the log-posterior gives

$$
\begin{align}
\log \mathbf P(c \vert x) &\propto \log\mathbf P(C=c)+ \log \mathbf P(X=x\vert C=c) \\
\log \mathbf P(c \vert x) &\propto \log\mathbf P(C=c) -\frac{1}{2}\log \det \Sigma_c-\frac{1}{2}(x-\mu_c)^T \Sigma_c^{-1}(x-\mu_c)
\end{align}
$$

Finally we plug in the $x$ of a new observation and check for which class $c$ the posterior is the highest.

>[!note]
>The term $(x-\mu_c)^T \Sigma_c^{-1}(x-\mu_c)$ contains $x^T \Sigma_c^{-1} x$, so the discriminant is quadratic w.r.t. $x$. Because $\Sigma_c$ differs between classes, this quadratic term does not cancel when classes are compared, and the decision boundary is therefore quadratic.

**Decision boundary:** Collection of all points $x$ where the posterior probabilities of two classes are equal.

$$
\mathbf P(C=c_1 \vert X=x) = \mathbf P(C=c_2\vert X=x)
$$

## Linear Discriminant Analysis

There might not be enough samples to properly estimate a class-wise covariance matrix $\Sigma_c$. We can simplify by assuming that the observations of every class are spread out around their class mean $\mu_c$ in the same way. Hence, there is a single covariance matrix $\Sigma$ shared by the Gaussians $\mathcal N^d$ of all classes. This results in linear decision boundaries.

$$
(X\vert C=c) \sim \mathcal N^d(\mu_c, \Sigma)
$$

![[linear-discriminant-analysis.png|center|400]]

This also simplifies the posterior:

$$
\log \mathbf P(c \vert x) \propto \log\mathbf P(C=c) -\frac{1}{2}\log \det \Sigma-\frac{1}{2}(x-\mu_c)^T \Sigma^{-1}(x-\mu_c)
$$

Since $-\frac{1}{2}\log \det \Sigma$ and the quadratic term $x^T\Sigma^{-1}x$ are now identical for every class, they cancel when two classes are compared, leaving a discriminant that is linear in $x$.
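To make the decision rule concrete, here is a minimal NumPy sketch that scores a new observation with the log-posterior derived above. The function names, class parameters, and toy numbers are illustrative, not from the source; passing class-specific covariances gives QDA, while passing the same shared covariance for every class gives LDA.

```python
import numpy as np

def qda_discriminant(x, prior, mu, Sigma):
    """Log-posterior of one class up to a class-independent constant:
    log P(C=c) - 1/2 log det(Sigma_c) - 1/2 (x - mu_c)^T Sigma_c^{-1} (x - mu_c)."""
    diff = x - mu
    _, logdet = np.linalg.slogdet(Sigma)         # numerically stable log-determinant
    maha = diff @ np.linalg.solve(Sigma, diff)   # (x - mu)^T Sigma^{-1} (x - mu)
    return np.log(prior) - 0.5 * logdet - 0.5 * maha

def classify(x, priors, mus, Sigmas):
    """Assign x to the class with the highest discriminant score."""
    scores = [qda_discriminant(x, p, m, S) for p, m, S in zip(priors, mus, Sigmas)]
    return int(np.argmax(scores))

# Toy example with two classes in d = 2 dimensions (made-up parameters).
priors = [0.5, 0.5]
mus    = [np.array([0.0, 0.0]), np.array([3.0, 3.0])]
Sigmas = [np.eye(2), np.array([[2.0, 0.5], [0.5, 1.0]])]
print(classify(np.array([2.5, 2.0]), priors, mus, Sigmas))  # -> 1
```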
## Linear Discriminant Analysis (Identity Matrix)

The simplest setup is when the covariance matrix is just the identity times a global variance $\sigma^2$. This means that the $d$ features all have the same variance and are independent of each other.

$$
(X\vert C=c) \sim \mathcal N^d(\mu_c, \sigma^2I)
$$

In this case the decision boundary is also linear. Moreover, a new observation $x$ is simply assigned to the class whose mean $\mu_c$ is closest (assuming equal priors).

![[linear-discriminant-analysis-identity.png|center|400]]

Since computing the distance to the class means is computationally much cheaper, we can transform $\mathbb X$ so that its covariance becomes the identity. We achieve this via the [[Spectral Theorem]], decomposing the covariance matrix into eigenvectors $V$ and eigenvalues $\Lambda$ (for centered $X$, and omitting the factor $1/n$):

$$
\begin{align}
\Sigma = X^TX &= V \Lambda V^T \\[6pt]
(XV\Lambda^{-1/2})^T (XV\Lambda^{-1/2}) &= \Lambda^{-1/2}V^T (X^TX) V\Lambda^{-1/2} \\[6pt]
&= \Lambda^{-1/2}V^T (V \Lambda V^T) V\Lambda^{-1/2} \\[6pt]
&= \Lambda^{-1/2} \Lambda \Lambda^{-1/2} \\[6pt]
&= \mathbf I
\end{align}
$$

We have shown that when the data is multiplied by the whitening transformation $V\Lambda^{-1/2}$, the covariance of the transformed data $XV\Lambda^{-1/2}$ is the identity matrix $\mathbf I$.

- The multiplication with $V$ rotates $X$ onto the eigenvector axes.
- The multiplication with $\Lambda^{-1/2}$ scales each axis so that its variance becomes $1$, according to the eigenvalues.
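A minimal NumPy sketch of this whitening step, assuming a centered data matrix; the synthetic data and variable names are only for illustration.

```python
import numpy as np

# Synthetic data: n = 500 observations, d = 2 correlated features (made-up numbers).
rng = np.random.default_rng(0)
X = rng.multivariate_normal(mean=[0.0, 0.0],
                            cov=[[3.0, 1.2], [1.2, 1.0]],
                            size=500)
X -= X.mean(axis=0)                      # center, so that Sigma = X^T X / n

# Spectral decomposition of the covariance: Sigma = V Lambda V^T.
Sigma = (X.T @ X) / len(X)
eigvals, V = np.linalg.eigh(Sigma)       # eigh, since Sigma is symmetric
W = V @ np.diag(eigvals ** -0.5)         # whitening transformation V Lambda^{-1/2}

# The transformed data has identity covariance (up to floating-point error).
Z = X @ W
print(np.round((Z.T @ Z) / len(Z), 2))   # ~ [[1. 0.] [0. 1.]]
```

In this whitened space, assigning a (whitened) new observation to the class with the closest whitened class mean is equivalent to the LDA rule with equal priors.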