**Motivation:** For functions that are not [[Identify Convex and Concave Functions|convex or concave]], we cannot rely on the first derivative to find a minimum or maximum. Instead, we need algorithms that search for the min/max iteratively. Some of these algorithms are guaranteed to converge, while others are not.
The EM algorithm is used to maximize the log-likelihood of a mixture of Gaussians and is only guaranteed to converge to a local maximum.
## Derivation
Assume a [[Gaussian Mixtures|Gaussian Mixture]] where we know $\sigma_1^2=\sigma_2^2=1$ and the mixing parameter $\tau=1/2$, so the only unknowns are $\mu_1, \mu_2$. Since the likelihood might have multiple local minima/maxima (unless $\mu_1=\mu_2$), setting the first derivative of the [[Likelihood Functions|Likelihood Function]] $L$ to zero will not help.
$
\begin{aligned}
f(x) &= \tau*\frac{1}{\sqrt{2\pi}}\left(e^{-\frac{(x-\mu_1)^2}{2}}\right)+
(1-\tau)*\frac{1}{\sqrt{2\pi}}\left(e^{-\frac{(x-\mu_2)^2}{2}} \right) \\[10pt]
f(x) &= \frac{1}{2\sqrt{2\pi}}\left(e^{-\frac{(x-\mu_1)^2}{2}}+e^{-\frac{(x-\mu_2)^2}{2}} \right)\\[10pt]
L(\mu_1,\mu_2) &= \prod_{i=1}^n \frac{1}{2\sqrt{2\pi}}\left(e^{-\frac{(x_i-\mu_1)^2}{2}}+e^{-\frac{(x_i-\mu_2)^2}{2}} \right)
\end{aligned}
$
However, we can also express the density by its sampling description, where the component is selected by a [[Bernoulli Distribution|Bernoulli]] variable $Z \sim\text{Ber}(p=\frac{1}{2})$. This way we can later bring $z$ into the exponent and turn the sum of the two components into a product (i.e. when $z=0$ then $(X^{(1)})^z=1$, and vice versa).
$ Z= \begin{cases} 0 & \text{w.p. } (1-p)\\ 1 & \text{w.p. } p \end{cases} $
$ \begin{align} X&=ZX^{(1)}+(1-Z)X^{(2)} \\ &=\left(X^{(1)}\right)^{Z}*\left(X^{(2)}\right)^{1-Z} \end{align} $
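As a small, purely illustrative sketch (not part of the original derivation), the sampling description can be simulated directly; the means, sample size, and the use of `numpy` are assumptions made for this example:

```python
import numpy as np

rng = np.random.default_rng(0)
mu1, mu2 = -2.0, 2.0  # hypothetical component means
n = 1000

# Z ~ Ber(1/2) selects the component, then X is drawn from the selected N(mu_k, 1)
z = rng.binomial(1, 0.5, size=n)
x = z * rng.normal(mu1, 1.0, size=n) + (1 - z) * rng.normal(mu2, 1.0, size=n)
```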
**Joint density:** We compute the [[Joint Probability Density Function|Joint PDF]] by multiplying the marginal of $z$ with the conditional density of $x$ given $z$.
$
\begin{align}
f(x,z)&= p(z)*f(x\vert z) \tag{1}\\[10pt]
&=p(z)*\Big(z*f(x\vert z=1)+(1-z)*f(x\vert z=0)\Big) \tag{2}\\[8pt]
&=\frac{1}{2}*\left(z*\frac{1}{\sqrt{2\pi}}*e^{-\frac{(x-\mu_1)^2}{2}}+(1-z)*\frac{1}{\sqrt{2\pi}}*e^{-\frac{(x-\mu_2)^2}{2}} \right) \tag{3}\\[10pt]
&=\frac{1}{2\sqrt{2\pi}}*\left(z*e^{-\frac{(x-\mu_1)^2}{2}}+(1-z)*e^{-\frac{(x-\mu_2)^2}{2}} \right) \tag{4}\\[10pt]
&=\frac{1}{2\sqrt{2\pi}}*\left( e^{-z\frac{(x-\mu_1)^2}{2}} *e^{-(1-z)\frac{(x-\mu_2)^2}{2}} \right) \tag{5}\\[10pt]
&=\frac{1}{2\sqrt{2\pi}}* \exp \left\{ -z*\frac{(x-\mu_1)^2}{2}-(1-z)*\frac{(x-\mu_2)^2}{2} \right\} \tag{6}
\end{align}
$
where:
- (5) Since the Bernoulli variable $z$ only takes the values $0$ and $1$, we can move it into the exponent and turn the addition into a multiplication (the inactive factor becomes $e^0=1$).
- (6) When two exponential terms with the same base are multiplied, their exponents can be added under that base. See [[Algebra of Exponents]].
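As a quick sanity check, substituting the two possible values of $z$ into (6) recovers the two components, each weighted by $p(z)=\frac{1}{2}$:
$ f(x,z=1)=\frac{1}{2\sqrt{2\pi}}*e^{-\frac{(x-\mu_1)^2}{2}}, \qquad f(x,z=0)=\frac{1}{2\sqrt{2\pi}}*e^{-\frac{(x-\mu_2)^2}{2}} $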
**Log-likelihood:** We build the complete log-likelihood on the joint density, acting as if we knew the latent $z_i$ of each observation.
$ L\big((x_1, z_1),\dots,(x_n,z_n); \mu_1, \mu_2\big)=\prod_{i=1}^nf(x_i,z_i) $
$
\begin{align}
L&=\left(\frac{1}{2\sqrt{2\pi}}\right)^n \prod_{i=1}^n\exp \left\{ -z_i*\frac{(x_i-\mu_1)^2}{2}-(1-z_i)*\frac{(x_i-\mu_2)^2}{2} \right\} \tag{1}\\[14pt]
L&=\left(\frac{1}{2\sqrt{2\pi}}\right)^n \exp \left \{\sum_{i=1}^n -z_i*\frac{(x_i-\mu_1)^2}{2}-(1-z_i)*\frac{(x_i-\mu_2)^2}{2} \right\} \tag{2}\\[14pt]
LL&=n\log\Big(\frac{1}{2\sqrt{2\pi}}\Big)+ \sum_{i=1}^n \left(-z_i*\frac{(x_i-\mu_1)^2}{2}-(1-z_i)*\frac{(x_i-\mu_2)^2}{2}\right) \tag{3}
\end{align}
$
where:
- (2) The product of exponentials becomes a single sum inside the exponent.
- (3) Taking the log turns the remaining product into a sum: the constant factor contributes $n\log\big(\frac{1}{2\sqrt{2\pi}}\big)$ and the exponent comes down unchanged.
We can see that this function is now concave in $\mu_1, \mu_2$, so we can set the first derivative to zero to find the maximum likelihood estimates.
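To make equation (3) concrete, below is a minimal sketch of the complete-data log-likelihood, assuming `numpy` arrays `x` and `z` of equal length (the function name is chosen for illustration only):

```python
import numpy as np

def complete_log_likelihood(x, z, mu1, mu2):
    """Complete-data log-likelihood of eq. (3), with sigma^2 = 1 and tau = 1/2."""
    n = len(x)
    const = n * np.log(1.0 / (2.0 * np.sqrt(2.0 * np.pi)))
    quad = -z * (x - mu1) ** 2 / 2.0 - (1.0 - z) * (x - mu2) ** 2 / 2.0
    return const + np.sum(quad)
```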
![[expectation-maximization.png|center|600]]
## Algorithm
We have input data $X_1, \dots, X_n$ that come from a mixture of two distributions (e.g. Gaussians).
1. Initialize $\hat \mu_1, \hat \mu_2$
2. Repeat until convergence:
1. Compute weights (E-step)
2. Update centers (M-step)
**Maximization Step:**
We differentiate the log-likelihood w.r.t. $\mu_1$ to derive the update for the first mean.
$
\begin{aligned}
\frac{\partial LL}{\partial \mu_1}&=\sum_{i=1}^n-z_i*\frac{2(x_i-\mu_1)}{2}*(-1)=\sum_{i=1}^n z_i(x_i-\mu_1) \\
0&=\sum_{i=1}^n z_i x_i-z_i\hat \mu_1 \\
\sum_{i=1}^n z_i\hat\mu_1&=\sum_{i=1}^n z_i x_i\\[6pt]
\hat \mu_1&=\frac{\sum_{i=1}^n z_i x_i}{\sum_{i=1}^n z_i}
\end{aligned}
$
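Differentiating w.r.t. $\mu_2$ works analogously (swapping $z_i$ for $1-z_i$), which gives the symmetric estimator:
$ \hat \mu_2=\frac{\sum_{i=1}^n (1-z_i)\, x_i}{\sum_{i=1}^n (1-z_i)} $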
Now that we have our [[Maximum Likelihood Estimation#Estimator|Maximum Likelihood Estimators]] $\hat \mu_1, \hat \mu_2$, we only need to plug in the observed $x_i$ together with their corresponding $z_i$ values in $\{0,1\}$.
**Expectation Step:**
However, since the $z_i$ are unknown to us, we have to estimate them conditional on the observed $x_i$. Thus, we are looking for the [[Conditional Expectation]] $\mathbb E[Z_i \vert X_i]$, which for a Bernoulli variable equals the [[Conditional Probability]] $\mathbf P(Z_i=1\vert X_i)$.
$ z_i \approx \mathbb E [Z_i \vert X_i] = \mathbf P(Z_i=1\vert X_i) $
To get the conditional probability, we use [[Bayes Rule]] together with the [[Total Probability Theorem]] (for the denominator).
$
\begin{aligned}
\mathbf P(Z_i=1 \vert X_i) &= \frac{f (X_i\vert Z_i=1)*\mathbf P(Z_i=1)}{f (X_i)} \\[14pt]
&= \frac{f (X_i\vert Z_i=1)*\mathbf P(Z_i=1)}{\sum_z f (X_i\vert Z_i=z)*\mathbf P(Z_i=z)}\\[14pt]
z_i&\approx \frac{e^{ -\frac{(x_i-\hat \mu_1)^2}{2}}} {e^{ -\frac{(x_i-\hat \mu_1)^2}{2}}+e^{ -\frac{(x_i-\hat \mu_2)^2}{2}}}
\end{aligned}
$
In the last step, the equal priors $\mathbf P(Z_i=1)=\mathbf P(Z_i=0)=\frac{1}{2}$ and the normalizing constants $\frac{1}{\sqrt{2\pi}}$ cancel. We can now use this conditional probability as a proxy for the $z_i$ values and plug it into the maximization step. Both the expectation and maximization steps are then repeated one after the other in a loop until convergence.
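Putting both steps together, here is a minimal sketch of the loop, assuming `numpy`, a 1-D array of observations `x`, and a fixed number of iterations in place of a proper convergence check (the function name and initialization are assumptions for this example):

```python
import numpy as np

def em_two_gaussians(x, n_iter=100):
    """EM for two Gaussians with sigma^2 = 1 and tau = 1/2; only mu1, mu2 are unknown."""
    mu1, mu2 = np.min(x), np.max(x)  # crude initialization of the two means
    for _ in range(n_iter):
        # E-step: w_i ~ P(Z_i = 1 | X_i) via Bayes rule; priors and constants cancel
        a = np.exp(-0.5 * (x - mu1) ** 2)
        b = np.exp(-0.5 * (x - mu2) ** 2)
        w = a / (a + b)
        # M-step: weighted means, using the weights as proxies for the z_i
        mu1 = np.sum(w * x) / np.sum(w)
        mu2 = np.sum((1 - w) * x) / np.sum(1 - w)
    return mu1, mu2
```

For example, running `em_two_gaussians(x)` on data simulated as in the sampling sketch above should return estimates close to the true means.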