We want to estimate a parameter $\Theta$, based on some prior belief and noisy observations $X$, where the observations are driven by $\Theta$ and some additive Gaussian noise $W$.

> [!note]
> The following examples are special cases of [[Linear Normal Models]].

## Single Observation: Estimating Single Parameter

Let $X=\Theta+W$, where:

- Prior: $\Theta \sim \mathcal N(0,1)$
- Noise: $W \sim \mathcal N(0,1)$
- Independence: $\Theta \perp W$

**Prior:** Our belief about $\theta$ before seeing any data follows a standard [[Gaussian Distribution]] $\mathcal N(0,1)$.

$ f_\Theta(\theta) = c* \exp\left(\frac{-\theta^2}{2}\right), \quad \text{where}\; c=\frac{1}{\sqrt{2 \pi}} $

**Likelihood:** When the [[Random Variable]] $\Theta$ is fixed at a specific $\theta$, the distribution of $X$ is just $W$ shifted by that constant $\theta$. Thus, the likelihood of $X$ is a shifted Gaussian with mean $\theta$.

$ \begin{align} \text{given } \Theta = \theta: X&= \theta + W \\ X&= \theta + \mathcal{N}(0, 1)\\ X &\sim \mathcal{N}(\theta, 1) \end{align} $

This likelihood is expressed as..

$ \overbrace{f_{X \vert \Theta}(x \vert \theta)}^{\mathcal{N}(\theta,1)} = c*\exp\left(-\frac{(x-\theta)^2}{2}\right), \quad \text{where}\; c=\frac{1}{\sqrt{2 \pi}} $

**Posterior:** Applying [[Bayes Rule]]..

$ \begin{align} f_{\Theta \vert X}(\theta \vert x)&= \frac{f_\Theta(\theta)*f_{X\vert \Theta}(x \vert \theta)}{f_X(x)} \\[6pt] &= \frac{1}{f_X(x)}*c* \exp\left(\frac{-\theta^2}{2}\right) * c*\exp\left(-\frac{(x- \theta)^2}{2}\right) \\[10pt] &=c(x)*\exp\left({-\frac{\theta^2}{2}-\frac{(x- \theta)^2}{2}}\right)\\[6pt] &=c(x)*\exp\left(-\Big(\frac{\theta^2+(x- \theta)^2}{2}\Big)\right)\\[8pt] &=c(x)*\exp\big({-\text{quadratic}(\theta)}\big) \end{align} $

Since the posterior is the exponential of a negative quadratic in $\theta$, it is a [[Gaussian Distribution]]. The Gaussian is symmetric and unimodal, which means that the [[MAP Estimator]] $\hat \Theta_{\text{MAP}}$ is equal to the [[LMS Estimator]] $\hat \Theta_{LMS}$.

**Finding MAP Estimate:** To find the maximum of the posterior, we minimize the quadratic exponent, since:

- $c(x)$ only contains constants, and can be neglected.
- So the remaining $e^{-\text{quadratic}(\theta)}$ needs to be maximized.
- So $\text{quadratic}(\theta)$ needs to be minimized.
- So we take the first derivative of the quadratic and set it to $0$.

$ \begin{align} \frac{d}{d\theta}\, \text{quadratic}(\theta)= \frac{d}{d\theta} \left[\frac{\theta^2+(x- \theta)^2}{2}\right] &\stackrel{!}{=}0 \\[6pt] \theta+(x- \theta)*(-1)&=0 \\[6pt] \theta+(\theta-x)&=0 \\[6pt] \hat\theta&=\frac{x}{2} \end{align} $
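The result $\hat\theta = x/2$ is easy to verify numerically. Below is a minimal sketch (not part of the original derivation; the observed value `x = 1.8` is an arbitrary choice) that evaluates the unnormalized log-posterior on a grid and locates its maximum.

```python
# Minimal numerical check: for a single observation x, the unnormalized
# posterior exp(-theta^2/2 - (x - theta)^2/2) should peak at theta = x/2.
import numpy as np

x = 1.8                                # hypothetical observed value
thetas = np.linspace(-5, 5, 100_001)   # dense grid over theta

# unnormalized log-posterior: log prior N(0,1) + log likelihood N(theta,1)
log_post = -thetas**2 / 2 - (x - thetas)**2 / 2
theta_map = thetas[np.argmax(log_post)]

print(theta_map)   # 0.9 = x/2
```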
## Multiple Observations: Estimating Single Parameter

Let $X$ be a vector of observations $X_1, \dots, X_n$, where each $X_i=\Theta+W_i$. Also assume:

- Prior: $\Theta \sim \mathcal{N}(x_0, \sigma_0^2)$
- Noise: $W_i \sim \mathcal{N}(0,\sigma_i^2)$
- Independence between $\Theta, W_1, \dots, W_n$

**Prior:**

$ f_\Theta(\theta)=c*\exp\left(\frac{-(\theta- x_0)^2}{2\sigma_0^2}\right), \quad \text{where}\; c=\frac{1}{\sigma_0 \sqrt{2 \pi}} $

**Likelihood:** Again, when $\Theta=\theta$ is fixed, $X_i$ is the Gaussian noise with an added constant.

$ \begin{align} \text{given } \Theta = \theta: X_i&= \theta + W_i \\ X_i&= \theta + \mathcal{N}(0, \sigma_i^2)\\ X_i &\sim \mathcal{N}(\theta, \sigma_i^2) \end{align} $

**Maintained independence:**

- Since all $W_i$ are independent of $\Theta$, this also holds conditioned on any specific $\Theta=\theta$: $\mathbf P(W_i \vert \theta)= \mathbf P(W_i)$.
- Since the likelihood of each $X_i$ is just a shifted $W_i$, the likelihood terms are also conditionally independent.

**Joint Likelihood:** Since $X$ is a vector of observations, the likelihood is a joint density over all observations $x_1, \dots, x_n$. Due to the conditional independence, this [[Joint Probability Density Function|Joint PDF]] can be expressed as a product of univariate likelihoods, one per observation.

$ \begin{align} f_{X \vert \Theta}(x \vert \theta) &= f_{X_1, \dots ,X_n \vert \Theta}(x_1, \dots,x_n \vert \theta) \\ &=\prod_{i=1}^n f_{X_i\vert \Theta}(x_i \vert \theta) \\ &=\prod_{i=1}^{n} \mathcal{N}(\theta, \sigma_i^2) \\ &=\prod_{i=1}^n c_i*\exp\left(\frac{-(x_i- \theta)^2}{2\sigma_i^2}\right) \end{align} $

**Posterior:**

$ \begin{align} f_{\Theta\vert X}(\theta \vert x) &= \frac{1}{f_X(x)}*c_0*\exp\Big(\frac{-(\theta- x_0)^2}{2\sigma_0^2}\Big) * \prod_{i=1}^n c_i*\exp\Big(\frac{-(x_i- \theta)^2}{2\sigma_i^2}\Big) \tag{1}\\[6pt] &=c(x)*\exp\Big(\frac{-(\theta- x_0)^2}{2\sigma_0^2}\Big)*\prod_{i=1}^n \exp\Big(\frac{-(x_i- \theta)^2}{2\sigma_i^2}\Big) \tag{2}\\[6pt] &=c(x)*\prod_{i=0}^n \exp\Big(\frac{-(\theta- x_i)^2}{2\sigma_i^2}\Big) \tag{3} \end{align} $

where:

- (2) Bundle all terms that do not depend on $\theta$ into a single term $c(x)$.
- (3) Absorbing the prior exponent into the product lets it start at $i=0$, where $x_0$ and $\sigma_0$ come from the prior.

**MAP estimate:** To obtain the MAP estimate $\hat \theta_{\text{MAP}}$, we need to maximize the posterior, which means minimizing the quadratic exponent (without the negative sign).

$ \begin{align} \frac{d}{d\theta} \sum_{i=0}^n \frac{(\theta- x_i)^2}{2\sigma_i^2}&\stackrel{!}{=}0 \\[6pt] \sum_{i=0}^n \frac{\theta - x_i}{\sigma_i^2}&=0 \\[6pt] \sum_{i=0}^n \Big(\frac{\theta}{\sigma_i^2} - \frac{x_i}{\sigma_i^2}\Big)&=0 \\[6pt] \theta \sum_{i=0}^n \frac{1}{\sigma_i^2}-\sum_{i=0}^n \frac{x_i}{\sigma_i^2}&=0 \\[6pt] \theta \sum_{i=0}^n \frac{1}{\sigma_i^2}&=\sum_{i=0}^n \frac{x_i}{\sigma_i^2} \\[6pt] \hat\theta_{\text{MAP}} &= \frac{\sum_{i=0}^n \frac{x_i}{\sigma_i^2}}{ \sum_{i=0}^n \frac{1}{\sigma_i^2}} \end{align} $

**Interpretation:** $\hat \theta$ is a weighted average of $x_0$ (the prior mean) and the observations $x_1, \dots ,x_n$, where the weights are the inverse variances $1/\sigma_i^2$. This makes sense, since a very noisy observation (high $\sigma_i$) should get less weight when estimating the mean. Note that the prior is treated just like one ordinary observation.

**Generalization:** We can generalize the estimate $\hat \theta$ by writing it as a random variable $\hat \Theta$: the specific observations $x_i$ turn into the r.v.'s $X_i$, and only the prior mean $x_0$ stays a constant, as it is set by us.

$ \begin{align} \hat \theta = \frac{\displaystyle \sum_{i=0}^n \frac{x_i}{\sigma_i^2}}{\displaystyle \sum_{i=0}^n \frac{1}{\sigma_i^2}}, \quad \hat \Theta = \frac{\displaystyle \frac{x_0}{\sigma_0^2} + \sum_{i=1}^n \frac{X_i}{\sigma_i^2}}{ \displaystyle \sum_{i=0}^n \frac{1}{\sigma_i^2}} \end{align} $

The estimator $\hat \Theta$ is a linear function of the observations $X$.
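The precision-weighted average can be computed directly. A minimal sketch (all prior and observation values below are made-up assumptions for illustration):

```python
# Minimal sketch: the MAP/LMS estimate is the precision-weighted average of
# the prior mean x0 and the observations x_i; the prior acts like one more
# observation with variance sigma0^2. All numbers are hypothetical.
import numpy as np

x0, sigma0 = 0.0, 1.0                  # prior mean and prior std (assumed)
x = np.array([2.1, 1.7, 2.5])          # hypothetical observations
sigma = np.array([0.5, 1.0, 2.0])      # per-observation noise stds

vals = np.concatenate(([x0], x))       # treat the prior mean as observation 0
var = np.concatenate(([sigma0**2], sigma**2))

theta_hat = np.sum(vals / var) / np.sum(1.0 / var)
print(theta_hat)                       # low-variance observations dominate
```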
## Multiple Observations: Estimating Multiple Parameters

The goal is to estimate a parameter vector $\Theta = (\Theta_0, \Theta_1, \Theta_2)$ from a vector of observations $X$. Each noisy observation is a linear combination of the $\Theta_j$ (with known constants $t_i$) plus additive Gaussian noise.

$ X_i=\Theta_0+\Theta_1 t_i+\Theta_2 t_i^2+W_i $

where:

- Prior: $\Theta_j \sim \mathcal{N}(0, \sigma_j^2)$
- Noise: $W_i \sim \mathcal{N}(0,\sigma_i^2)$
- Independence between $\Theta_0, \Theta_1, \Theta_2$
- Independence between $\Theta$ and the $W_i$

The overall procedure is equivalent to the case of [[Gaussian Random Variable with Additive Noise#Multiple Observations Estimating Single Parameter|Multiple Observations Estimating Single Parameter]]. The posterior is written as follows:

$ f_{\Theta \vert X}(\theta \vert x)= \underbrace{\frac{1}{f_X(x)}}_{\text{Normalization}} * \quad \underbrace{\prod_{j=0}^2 f_{\Theta_j}(\theta_j)}_{\text{Prior}} * \quad \underbrace{\prod_{i=1}^n f_{X_i \vert \Theta}(x_i \vert \theta)}_{\text{Likelihood}} $

To find the MAP estimates for $\Theta_0, \Theta_1, \Theta_2$, we take the derivative of the posterior's exponent with respect to each $\theta_j$ separately and set it to $0$. This yields 3 linear equations in 3 unknowns, which can then be solved, as in the sketch below.
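A minimal numerical sketch of that last step (all data values and variances are assumed for illustration): the three linear equations are assembled in matrix form and solved with `numpy`.

```python
# Minimal sketch: MAP estimation of (theta_0, theta_1, theta_2).
# Setting the derivative of the exponent wrt each theta_j to zero gives the
# linear system (A^T W A + D) theta = A^T W x, where A has rows [1, t_i, t_i^2],
# W = diag(1/sigma_i^2) holds the noise precisions and D = diag(1/sigma_j^2)
# holds the prior precisions. All numbers below are hypothetical.
import numpy as np

t = np.array([0.0, 1.0, 2.0, 3.0, 4.0])        # known constants t_i
x = np.array([1.1, 2.9, 7.2, 12.8, 21.0])      # hypothetical noisy observations
sigma_w = np.array([0.5, 0.5, 1.0, 1.0, 2.0])  # per-observation noise stds
sigma_p = np.array([10.0, 10.0, 10.0])         # prior stds for theta_0..theta_2

A = np.vstack([np.ones_like(t), t, t**2]).T    # row i: [1, t_i, t_i^2]
W = np.diag(1.0 / sigma_w**2)

lhs = A.T @ W @ A + np.diag(1.0 / sigma_p**2)  # 3x3 system matrix
rhs = A.T @ W @ x
theta_map = np.linalg.solve(lhs, rhs)
print(theta_map)                               # MAP estimates theta_0..theta_2
```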