**Motivation:** For functions that are not [[Identify Convex and Concave Functions|convex or concave]], we cannot rely on the first derivative to find a minimum or maximum. Instead, we need algorithms that search for the min/max iteratively. Some of these algorithms are guaranteed to converge, while others are not.
The EM algorithm is used to maximize the log-likelihood of a mixture of Gaussians and is only guaranteed to converge to a local maximum.
## Derivation
Assume a [[Gaussian Mixtures|Gaussian Mixture]] where we know $\sigma_1^2=\sigma_2^2=1$ and the mixing parameter $\tau=1/2$, so the only unknowns are $\mu_1, \mu_2$. Since the likelihood might have multiple local minima/maxima (unless $\mu_1=\mu_2$), setting the first derivative of the [[Likelihood Functions|Likelihood Function]] $L$ to zero will not help.
$
\begin{aligned}
f(x) &= \tau*\frac{1}{\sqrt{2\pi}}\left(e^{-\frac{(x-\mu_1)^2}{2}}\right)+
(1-\tau)*\frac{1}{\sqrt{2\pi}}\left(e^{-\frac{(x-\mu_2)^2}{2}} \right) \\[10pt]
f(x) &= \frac{1}{2\sqrt{2\pi}}\left(e^{-\frac{(x-\mu_1)^2}{2}}+e^{-\frac{(x-\mu_2)^2}{2}} \right)\\[10pt]
L(\mu_1,\mu_2) &= \prod_{i=1}^n \frac{1}{2\sqrt{2\pi}}\left(e^{-\frac{(x_i-\mu_1)^2}{2}}+e^{-\frac{(x_i-\mu_2)^2}{2}} \right)
\end{aligned}
$
However, we can also express the density by its sampling description, where the component is selected by a [[Bernoulli Distribution|Bernoulli]] variable $Z \sim\text{Ber}(p=\frac{1}{2})$. This way we can later bring $z$ into the exponent and turn the sum of the two components into a product (i.e. when $z=0$ then $(X^{(1)})^z=1$, and vice versa).
$ Z= \begin{cases} 0 & \text{w.p. } (1-p)\\ 1 & \text{w.p. } p \end{cases} $
$ \begin{align} X&=ZX^{(1)}+(1-Z)X^{(2)} \\ &=\left(X^{(1)}\right)^{Z}*\left(X^{(2)}\right)^{1-Z} \end{align} $
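As a small, purely illustrative sketch (not part of the original derivation), the sampling description can be simulated directly; the means, sample size, and the use of `numpy` are assumptions made for this example:

```python
import numpy as np

rng = np.random.default_rng(0)
mu1, mu2 = -2.0, 2.0  # hypothetical component means
n = 1000

# Z ~ Ber(1/2) selects the component, then X is drawn from the selected N(mu_k, 1)
z = rng.binomial(1, 0.5, size=n)
x = z * rng.normal(mu1, 1.0, size=n) + (1 - z) * rng.normal(mu2, 1.0, size=n)
```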
**Joint density:** We compute the [[Joint Probability Density Function|Joint PDF]] by multiplying the marginal of $z$ with the conditional density of $x$ given $z$.
$
\begin{align}
f(x,z)&= p(z)*f(x\vert z) \tag{1}\\[10pt]
&=p(z)*\Big(z*f(x\vert z=1)+(1-z)*f(x\vert z=0)\Big) \tag{2}\\[8pt]
&=\frac{1}{2}*\left(z*\frac{1}{\sqrt{2\pi}}*e^{-\frac{(x-\mu_1)^2}{2}}+(1-z)*\frac{1}{\sqrt{2\pi}}*e^{-\frac{(x-\mu_2)^2}{2}} \right) \tag{3}\\[10pt]
&=\frac{1}{2\sqrt{2\pi}}*\left(z*e^{-\frac{(x-\mu_1)^2}{2}}+(1-z)*e^{-\frac{(x-\mu_2)^2}{2}} \right) \tag{4}\\[10pt]
&=\frac{1}{2\sqrt{2\pi}}*\left( e^{-z\frac{(x-\mu_1)^2}{2}} *e^{-(1-z)\frac{(x-\mu_2)^2}{2}} \right) \tag{5}\\[10pt]
&=\frac{1}{2\sqrt{2\pi}}* \exp \left\{ -z*\frac{(x-\mu_1)^2}{2}-(1-z)*\frac{(x-\mu_2)^2}{2} \right\} \tag{6}
\end{align}
$
where:
- (5) Since the Bernoulli variable $z$ only takes the values $0$ and $1$, we can move it into the exponent and turn the addition into a multiplication (the inactive factor becomes $e^0=1$).
- (6) When two exponential terms with the same base are multiplied, their exponents can be added under that base. See [[Algebra of Exponents]].
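As a quick sanity check, substituting the two possible values of $z$ into (6) recovers the two components, each weighted by $p(z)=\frac{1}{2}$:
$ f(x,z=1)=\frac{1}{2\sqrt{2\pi}}*e^{-\frac{(x-\mu_1)^2}{2}}, \qquad f(x,z=0)=\frac{1}{2\sqrt{2\pi}}*e^{-\frac{(x-\mu_2)^2}{2}} $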
**Log-likelihood:** We build the complete log-likelihood on the joint density, acting as if we knew the latent $z_i$ of each observation.
$ L\big((x_1, z_1),\dots,(x_n,z_n); \mu_1, \mu_2\big)=\prod_{i=1}^nf(x_i,z_i) $
$
\begin{align}
L&=\left(\frac{1}{2\sqrt{2\pi}}\right)^n \prod_{i=1}^n\exp \left\{ -z_i*\frac{(x_i-\mu_1)^2}{2}-(1-z_i)*\frac{(x_i-\mu_2)^2}{2} \right\} \tag{1}\\[14pt]
L&=\left(\frac{1}{2\sqrt{2\pi}}\right)^n \exp \left \{\sum_{i=1}^n -z_i*\frac{(x_i-\mu_1)^2}{2}-(1-z_i)*\frac{(x_i-\mu_2)^2}{2} \right\} \tag{2}\\[14pt]
LL&=n\log\Big(\frac{1}{2\sqrt{2\pi}}\Big)+ \sum_{i=1}^n \left(-z_i*\frac{(x_i-\mu_1)^2}{2}-(1-z_i)*\frac{(x_i-\mu_2)^2}{2}\right) \tag{3}
\end{align}
$
where:
- (2) The product of exponentials becomes a single sum inside the exponent.
- (3) Taking the log turns the remaining product into a sum: the constant factor contributes $n\log\big(\frac{1}{2\sqrt{2\pi}}\big)$ and the exponent comes down unchanged.
We can see that this function is now concave in $\mu_1, \mu_2$, so we can set the first derivative to zero to find the maximum likelihood estimates.
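To make equation (3) concrete, below is a minimal sketch of the complete-data log-likelihood, assuming `numpy` arrays `x` and `z` of equal length (the function name is chosen for illustration only):

```python
import numpy as np

def complete_log_likelihood(x, z, mu1, mu2):
    """Complete-data log-likelihood of eq. (3), with sigma^2 = 1 and tau = 1/2."""
    n = len(x)
    const = n * np.log(1.0 / (2.0 * np.sqrt(2.0 * np.pi)))
    quad = -z * (x - mu1) ** 2 / 2.0 - (1.0 - z) * (x - mu2) ** 2 / 2.0
    return const + np.sum(quad)
```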
![[expectation-maximization.png|center|600]]
## Algorithm
We have input data $X_1, \dots, X_n$ that come from a mixture of two distributions (e.g. Gaussians).
1. Initialize $\hat \mu_1, \hat \mu_2$
2. Repeat until convergence:
1. Compute weights (E-step)
2. Update centers (M-step)
**Maximization Step:**
We differentiate the log-likelihood w.r.t. $\mu_1$ to derive the update for the first mean.
$
\begin{aligned}
\frac{\partial LL}{\partial \mu_1}&=\sum_{i=1}^n-z_i*\frac{2(x_i-\mu_1)}{2}*(-1)=\sum_{i=1}^n z_i(x_i-\mu_1) \\
0&=\sum_{i=1}^n z_i x_i-z_i\hat \mu_1 \\
\sum_{i=1}^n z_i\hat\mu_1&=\sum_{i=1}^n z_i x_i\\[6pt]
\hat \mu_1&=\frac{\sum_{i=1}^n z_i x_i}{\sum_{i=1}^n z_i}
\end{aligned}
$
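Differentiating w.r.t. $\mu_2$ works analogously (swapping $z_i$ for $1-z_i$), which gives the symmetric estimator:
$ \hat \mu_2=\frac{\sum_{i=1}^n (1-z_i)\, x_i}{\sum_{i=1}^n (1-z_i)} $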
Now that we have our [[Maximum Likelihood Estimation#Estimator|Maximum Likelihood Estimators]] $\hat \mu_1, \hat \mu_2$, we only need to plug in the observed $x_i$ together with their corresponding $z_i$ values in $\{0,1\}$.
**Expectation Step:**
However, since the $z_i$ are unknown to us, we have to estimate them conditional on the observed $x_i$. Thus, we are looking for the [[Conditional Expectation]] $\mathbb E[Z_i \vert X_i]$, which for a Bernoulli variable equals the [[Conditional Probability]] $\mathbf P(Z_i=1\vert X_i)$.
$ z_i \approx \mathbb E [Z_i \vert X_i] = \mathbf P(Z_i=1\vert X_i) $
To get the conditional probability, we use [[Bayes Rule]] together with the [[Total Probability Theorem]] (for the denominator).
$
\begin{aligned}
\mathbf P(Z_i=1 \vert X_i) &= \frac{f (X_i\vert Z_i=1)*\mathbf P(Z_i=1)}{f (X_i)} \\[14pt]
&= \frac{f (X_i\vert Z_i=1)*\mathbf P(Z_i=1)}{\sum_z f (X_i\vert Z_i=z)*\mathbf P(Z_i=z)}\\[14pt]
z_i&\approx \frac{e^{ -\frac{(x_i-\hat \mu_1)^2}{2}}} {e^{ -\frac{(x_i-\hat \mu_1)^2}{2}}+e^{ -\frac{(x_i-\hat \mu_2)^2}{2}}}
\end{aligned}
$
In the last step, the equal priors $\mathbf P(Z_i=1)=\mathbf P(Z_i=0)=\frac{1}{2}$ and the normalizing constants $\frac{1}{\sqrt{2\pi}}$ cancel. We can now use this conditional probability as a proxy for the $z_i$ values and plug it into the maximization step. Both the expectation and maximization steps are then repeated one after the other in a loop until convergence.
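Putting both steps together, here is a minimal sketch of the loop, assuming `numpy`, a 1-D array of observations `x`, and a fixed number of iterations in place of a proper convergence check (the function name and initialization are assumptions for this example):

```python
import numpy as np

def em_two_gaussians(x, n_iter=100):
    """EM for two Gaussians with sigma^2 = 1 and tau = 1/2; only mu1, mu2 are unknown."""
    mu1, mu2 = np.min(x), np.max(x)  # crude initialization of the two means
    for _ in range(n_iter):
        # E-step: w_i ~ P(Z_i = 1 | X_i) via Bayes rule; priors and constants cancel
        a = np.exp(-0.5 * (x - mu1) ** 2)
        b = np.exp(-0.5 * (x - mu2) ** 2)
        w = a / (a + b)
        # M-step: weighted means, using the weights as proxies for the z_i
        mu1 = np.sum(w * x) / np.sum(w)
        mu2 = np.sum((1 - w) * x) / np.sum(1 - w)
    return mu1, mu2
```

For example, running `em_two_gaussians(x)` on data simulated as in the sampling sketch above should return estimates close to the true means.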