We want to estimate a parameter $\Theta$, based on some prior belief and noisy observations $X$, where the observations are driven by $\Theta$ and some additive Gaussian noise $W$.
> [!note]
> The following examples are special cases of [[Linear Normal Models]].
## Single Observation: Estimating Single Parameter
Let $X=\Theta+W$, where:
- Prior: $\Theta \sim \mathcal N(0,1)$
- Noise: $W \sim \mathcal N(0,1)$
- Independence: $\Theta \perp W$
**Prior:** Our belief about $\theta$ before seeing any data follows a standard [[Gaussian Distribution]] $\mathcal N(0,1)$.
$ f_\Theta(\theta) = c* \exp\left(\frac{-\theta^2}{2}\right), \quad \text{where}\; c=\frac{1}{\sqrt{2 \pi}}
$
**Likelihood:** When the [[Random Variable]] $\Theta$ is fixed at a specific value $\theta$, the distribution of $X$ is just $W$ shifted by the constant $\theta$. Thus, the likelihood of $X$ is a shifted Gaussian with mean $\theta$.
$
\begin{align}
\text{given } \Theta = \theta:
X&= \theta + W \\
X&= \theta + \mathcal{N}(0, 1)\\
X &\sim \mathcal{N}(\theta, 1)
\end{align}
$
This likelihood is expressed as:
$ \overbrace{f_{X \vert \Theta}(x \vert \theta)}^{\mathcal{N}(\theta,1)} =
c*\exp\left(-\frac{(x-\theta)^2}{2}\right), \quad \text{where}\; c=\frac{1}{\sqrt{2 \pi}} $
**Posterior:** Applying [[Bayes Rule]]:
$
\begin{align}
f_{\Theta \vert X}(\theta \vert x)&= \frac{f_\Theta(\theta)*f_{X\vert \Theta}(x \vert \theta)}{f_X(x)} \\[6pt]
&= \frac{1}{f_X(x)}*c* \exp\left(\frac{-\theta^2}{2}\right) * c*\exp\left(-\frac{(x- \theta)^2}{2}\right) \\[10pt]
&=c(x)*\exp\left({-\frac{\theta^2}{2}-\frac{(x- \theta)^2}{2}}\right)\\[6pt]
&=c(x)*\exp\left({-\Big(\frac{\theta^2+(x- \theta)^2}{2}}\Big)\right)\\[8pt]
&=c(x)*\exp\big({-\text{quadratic}(\theta)}\big)
\end{align}
$
Since the posterior is a negative exponential of a quadratic term, it is a [[Gaussian Distribution]]. The Gaussian is symmetric and unimodal, which means that the [[MAP Estimator]] $\hat \Theta_{\text{MAP}}$ is equal to the [[LMS Estimator]] $\hat \Theta_{\text{LMS}}$.
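For completeness, we can identify the exact posterior by completing the square in the exponent (a small worked step, not strictly needed for finding the MAP estimate):
$
\begin{align}
\frac{\theta^2}{2}+\frac{(x- \theta)^2}{2} &= \theta^2 - x\theta + \frac{x^2}{2} \\[6pt]
&= \left(\theta-\frac{x}{2}\right)^2 + \frac{x^2}{4}
\end{align}
$
The term $\frac{x^2}{4}$ does not depend on $\theta$ and is absorbed into $c(x)$, so $f_{\Theta \vert X}(\theta \vert x) \propto \exp\left(-(\theta- \frac{x}{2})^2\right)$, i.e. the posterior is $\mathcal N\left(\frac{x}{2}, \frac{1}{2}\right)$.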
**Finding MAP Estimate:** To find the maximum of the posterior, we minimize the quadratic exponent, since:
- $c(x)$ does not depend on $\theta$, and can be neglected.
- So the remaining $e^{-\text{quadratic}(\theta)}$ needs to be maximized.
- So the $\text{quadratic}(\theta)$ needs to be minimized.
- So we take the first derivative of the quadratic and set it to $0$.
$
\begin{align}
\frac{d}{d\theta}\left[\frac{\theta^2+(x- \theta)^2}{2}\right] &\stackrel{!}{=}0 \\[6pt]
\theta+(x- \theta)*(-1)&=0 \\[6pt]
\theta+(\theta-x)&=0 \\[6pt]
2\theta&=x \\[6pt]
\hat\theta_{\text{MAP}}&=\frac{x}{2}
\end{align}
$
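As a sanity check, here is a minimal numerical sketch (assuming NumPy is available; the observed value `x` is made up) that evaluates the unnormalized posterior on a grid and confirms that its maximizer is $x/2$:

```python
import numpy as np

def unnormalized_posterior(theta, x):
    """Prior N(0,1) times likelihood N(theta,1), up to constant factors."""
    return np.exp(-theta**2 / 2 - (x - theta)**2 / 2)

x = 1.7                                    # an arbitrary observed value
thetas = np.linspace(-5, 5, 100_001)       # fine grid over theta
posterior = unnormalized_posterior(thetas, x)

theta_map = thetas[np.argmax(posterior)]   # grid argmax of the posterior
print(theta_map, x / 2)                    # both are (approximately) 0.85
```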
## Multiple Observations: Estimating Single Parameter
Let $X$ be a vector of observations $X_1, \dots, X_n$ where each $X_i=\Theta+W_i$. Also assume:
- Prior: $\Theta \sim \mathcal{N}(x_0, \sigma_0^2)$
- Noise: $W_i \sim \mathcal{N}(0,\sigma_i^2)$
- Independence between $\Theta, W_1, \dots, W_n$
**Prior:**
$ f_\Theta(\theta)=c*\exp\left(\frac{-(\theta- x_0)^2}{2\sigma_0^2}\right), \quad \text{where}\; c=\frac{1}{\sigma_0 \sqrt{2 \pi}} $
**Likelihood:** Again, when $\Theta=\theta$ is fixed, $X_i$ is the Gaussian noise with an added constant.
$
\begin{align}
\text{given } \Theta = \theta:
X_i&= \theta + W_i \\
X_i&= \theta + \mathcal{N}(0, \sigma_i^2)\\
X_i &\sim \mathcal{N}(\theta, \sigma_i^2)
\end{align}
$
**Maintained independence:**
- Since all $W_i$ are independent of $\Theta$, this is also true for any specific $\Theta=\theta$.
$ \mathbf P(W_i \vert \theta)= \mathbf P(W_i)$
- Since the likelihood of each $X_i$ is just a shifted $W_i$, the observations $X_1, \dots, X_n$ are also (conditionally) independent given $\Theta=\theta$.
**Joint Likelihood:** Since $X$ is a vector of observations, the likelihood is the joint density of all observations $x_1, \dots, x_n$. Due to the conditional independence, this [[Joint Probability Density Function|Joint PDF]] factors into a product of univariate likelihoods, one per observation.
$
\begin{align}
f_{X \vert \Theta}(x \vert \theta)
&= f_{X_1, \dots ,X_n \vert \Theta}(x_1, \dots,x_n \vert \theta) \\
&=\prod_{i=1}^n f_{X_i\vert \Theta}(x_i \vert \theta) \\
&=\prod_{i=1}^{n} \mathcal{N}(\theta, \sigma_i^2) \\
&=\prod_{i=1}^n c_i*\exp\left(\frac{-(x_i- \theta)^2}{2\sigma_i^2}\right), \quad \text{where}\; c_i=\frac{1}{\sigma_i \sqrt{2 \pi}}
\end{align}
$
**Posterior:**
$
\begin{align}
f_{\Theta\vert X}(\theta \vert x)
&= \frac{1}{f_X(x)}*c_0*\exp\Big(\frac{-(\theta- x_0)^2}{2\sigma_0^2}\Big) * \prod_{i=1}^n c_i*\exp\Big(\frac{-(x_i- \theta)^2}{2\sigma_i^2}\Big) \tag{1}\\[6pt]
&=c(x)*\exp\Big(\frac{-(\theta- x_0)^2}{2\sigma_0^2}\Big)*\prod_{i=1}^n \exp\Big(\frac{-(x_i- \theta)^2}{2\sigma_i^2}\Big) \tag{2}\\[6pt]
&=c(x)*\prod_{i=0}^n \exp\Big(\frac{-(\theta- x_i)^2}{2\sigma_i^2}\Big) \tag{3}
\end{align}
$
where:
- (2) Bundle all terms without $\theta$ into a single term $c(x)$.
- (3) The prior factor has the same form as the likelihood factors, so it is absorbed into the product by letting the index start at $i=0$.
**MAP estimate:** To obtain the MAP estimate $\hat \theta_{\text{MAP}}$, we need to maximize the posterior, which means to minimize the quadratic exponent (without the negative sign).
$
\begin{align}
\frac{d}{d\theta} \sum_{i=0}^n \frac{(\theta- x_i)^2}{2\sigma_i^2}&\stackrel{!}{=}0 \\[6pt]
\sum_{i=0}^n \frac{\theta - x_i}{\sigma_i^2}&=0 \\[6pt]
\sum_{i=0}^n \left(\frac{\theta}{\sigma_i^2} - \frac{x_i}{\sigma_i^2}\right)&=0 \\[6pt]
\theta \sum_{i=0}^n \frac{1}{\sigma_i^2}-\sum_{i=0}^n \frac{x_i}{\sigma_i^2}&=0 \\[6pt]
\theta \sum_{i=0}^n \frac{1}{\sigma_i^2}&=\sum_{i=0}^n \frac{x_i}{\sigma_i^2} \\[6pt]
\hat \theta_{\text{MAP}} &= \frac{\sum_{i=0}^n \frac{x_i}{\sigma_i^2}}{ \sum_{i=0}^n \frac{1}{\sigma_i^2}}
\end{align}
$
**Interpretation:** Basically $\hat \theta$ is a weighted average of $x_0$ (mean from prior) and all $x_1, \dots ,x_n$ (observations), where the weights are the inverse variances $1/\sigma_i^2$ of each term. This makes sense, since a very noisy observation (high $\sigma_i$) should get less weight in estimating the mean.
We see that the prior is treated just like one ordinary observation.
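For example, with prior mean $x_0=0$, $\sigma_0=1$ and a single observation $x_1$ with $\sigma_1=1$, the formula reduces to the result of the single-observation case above:
$ \hat \theta = \frac{\frac{0}{1}+\frac{x_1}{1}}{\frac{1}{1}+\frac{1}{1}}=\frac{x_1}{2} $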
**Generalization:** We can generalize the estimate $\hat \theta$ by writing it as a random variable $\hat \Theta$: the specific observations $x_i$ turn into random variables $X_i$. Only the prior mean $x_0$ stays a constant, as it is set by us.
$
\begin{align}
\hat \theta =
\frac{\displaystyle \sum_{i=0}^n \frac{x_i}{\sigma_i^2}}{\displaystyle \sum_{i=0}^n \frac{1}{\sigma_i^2}}, \quad
\hat \Theta =
\frac{\displaystyle \frac{x_0}{\sigma_0^2} + \sum_{i=1}^n \frac{X_i}{\sigma_i^2}}{ \displaystyle \sum_{i=0}^n \frac{1}{\sigma_i^2}}
\end{align}
$
The estimator $\hat \Theta$ is a linear function of the observations $X_1, \dots, X_n$.
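A minimal sketch of this estimator (assuming NumPy; the example values and variable names are mine):

```python
import numpy as np

def map_estimate(x0, var0, x, var):
    """MAP/LMS estimate: precision-weighted average of prior mean and observations."""
    x_all = np.concatenate(([x0], x))        # treat the prior mean like observation x_0
    var_all = np.concatenate(([var0], var))  # sigma_i^2 for i = 0..n
    weights = 1.0 / var_all                  # weights are the inverse variances
    return np.sum(weights * x_all) / np.sum(weights)

# Example: vague prior, one precise and one noisy observation
print(map_estimate(x0=0.0, var0=1.0, x=np.array([2.0, 4.0]), var=np.array([0.5, 10.0])))
```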
## Multiple Observations: Estimating Multiple Parameters
The goal is to estimate a parameter vector $\Theta$ from a vector of observations $X$. Each noisy observation is a linear combination of all $\Theta_j$ (with known coefficients, e.g. measurement times $t_i$) plus additive Gaussian noise.
$ X_i=\Theta_0+\Theta_1 t_i+\Theta_2 t_i^2+W_i$
where:
- Prior: $\Theta_j \sim \mathcal{N}(0, \sigma_j^2)$
- Noise: $W_i \sim \mathcal{N}(0,\sigma_i^2)$
- Independence between $\Theta_0, \Theta_1, \Theta_2$
- Independence between $\Theta, W_i$
The overall procedure is equivalent to the case of [[Gaussian Random Variable with Additive Noise#Multiple Observations Estimating Single Parameter|Multiple Observations Estimating Single Parameter]]. The posterior is written as follows:
$
f_{\Theta \vert X}(\theta \vert x)=
\underbrace{\frac{1}{f_X(x)}}_{\text{Normalization}} * \quad
\underbrace{\prod_{j=0}^2 f_{\Theta_j}(\theta_j)}_{\text{Prior}} * \quad \underbrace{\prod_{i=1}^n f_{X_i \vert \Theta}(x_i \vert \theta)}_{\text{Likelihood}}
$
To find the MAP estimate for $\Theta_0, \Theta_1, \Theta_2$, we take the derivative of the exponent with respect to each $\theta_j$ separately and set it to $0$. This yields 3 linear equations in 3 unknowns, which can then be solved!
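A minimal numerical sketch of this last step (assuming NumPy; the times `t`, noise variances, and prior variances are made-up values). Setting the partial derivatives of the quadratic exponent to zero gives the linear system $\big(\operatorname{diag}(1/\sigma_j^2) + A^\top \Sigma^{-1} A\big)\,\theta = A^\top \Sigma^{-1} x$, where $A$ has rows $(1, t_i, t_i^2)$ and $\Sigma = \operatorname{diag}(\sigma_i^2)$:

```python
import numpy as np

# Made-up example values (assumptions, not from the note)
t = np.array([1.0, 2.0, 3.0, 4.0])            # known inputs t_i
x = np.array([2.1, 4.3, 8.2, 13.9])           # observed values x_i
noise_var = np.array([1.0, 1.0, 0.5, 0.5])    # sigma_i^2 of each W_i
prior_var = np.array([1.0, 1.0, 1.0])         # sigma_j^2 of each Theta_j

A = np.column_stack([np.ones_like(t), t, t**2])   # rows (1, t_i, t_i^2)
Sigma_inv = np.diag(1.0 / noise_var)              # inverse noise covariance

# MAP normal equations: (diag(1/sigma_j^2) + A^T Sigma^-1 A) theta = A^T Sigma^-1 x
lhs = np.diag(1.0 / prior_var) + A.T @ Sigma_inv @ A
rhs = A.T @ Sigma_inv @ x
theta_map = np.linalg.solve(lhs, rhs)
print(theta_map)   # MAP estimates for (Theta_0, Theta_1, Theta_2)
```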