**Frequentist approach:**
- Define the possible [[Sample Space]] of $\Theta$, e.g. $\theta \in \mathbb R$.
- Choose a probability distribution for the likelihood $\mathbf P_\theta$.
- Choose an estimation method ([[Maximum Likelihood Estimation|MLE]], [[Method of Moments]], [[M-Estimation]]).

**Bayesian approach:**
- Define a prior distribution for $\Theta$, e.g. $\theta \sim \mathcal N$.
- Choose a probability distribution for the likelihood $\mathbf P_\theta$.
- Choose an estimation method ([[MAP Estimator]], [[LMS Estimator]]).

>[!note:]
>As frequentists, we believe that there is one true value $\theta^\star$. Bayesians, in contrast, treat $\theta$ as a [[Random Variable]] whose realizations are drawn from a (prior) distribution.

## Bayesian Update

**Components for Bayesian Update:**
- *Prior:* The unknown parameter $\theta$ comes from a prior distribution $\pi$: $\theta \sim \pi$
- *Likelihood:* The [[Likelihood Functions|likelihood]] of seeing the data, given some fixed $\theta$: $L_n(X_1,\dots,X_n \vert \, \theta)$
- *Posterior:* The probability distribution of the r.v. $\theta$ given the observed data: $\pi(\theta \vert X_1, \dots,X_n)$
- *Marginal likelihood:* The unconditional likelihood of seeing our observed data. By the [[Total Probability Theorem]], it is the integral of the conditional likelihood, weighted by the prior, over all possible values of $\theta$: $L_n(X_1, \dots,X_n) = \int_\Theta L_n(X_1, \dots, X_n\vert \theta) \, \pi(\theta) \, d\theta$

**Final Expression:**
We apply Bayes' formula to get from the prior to the posterior. The denominator acts as a normalization term that ensures the posterior integrates to $1$. However, since the denominator is the same for every $\theta$, the unnormalized values remain comparable relative to each other, which is sufficient for estimation purposes.

$ \pi(\theta \vert X_1, \dots,X_n)=\frac{L_n(X_1, \dots, X_n\vert \theta)\, \pi(\theta)}{\displaystyle \int_\Theta L_n(X_1, \dots, X_n\vert \theta)\, \pi(\theta) \, d\theta} $

>[!note:]
>When we drop the denominator, we write the proportionality sign $\propto$ instead of the equals sign.

## Non-Informative Prior
- For bounded $\Theta$, the non-informative prior is a [[Continuous Uniform Distribution]] over the sample space of $\Theta$.
- For unbounded $\Theta$, there is no valid [[Probability Density Function|PDF]] that forms a uniform prior. However, we can build an improper prior by simply assigning density $1$ everywhere, in which case the prior does not integrate to a finite value: $\int_{-\infty}^\infty \pi(\theta) \, d\theta= \infty$

## Jeffreys Prior
We want a non-informative prior that is invariant to reparameterization of $\theta$. E.g. we assume that our data comes from a [[Gaussian Distribution|Gaussian]] $\mathcal N(\theta, 1)$, but later decide to estimate $\mathcal N(\theta^2,1)$ instead. To achieve this invariance, the prior is based on the Fisher information $I(\theta)$, which inherently reflects the parameterization of $\theta$.

$ \pi(\theta) \propto \begin{cases} \sqrt{I(\theta)}, & \theta \in \mathbb R\\[4pt] \sqrt{\text{det } I (\theta)}, & \theta \in \mathbb R^d \end{cases} $

- *Univariate theta:* When $\theta$ is a scalar, $I(\theta)$ is also a scalar, and therefore $\text{det } I(\theta)=I(\theta)$.
- *Multivariate theta:* When $\theta$ is a $d$-dimensional vector, $I(\theta)$ is a $(d \times d)$ matrix. The [[Matrix Determinant]] collapses all information from the matrix into a single number.
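As a minimal numerical sketch of the update, assume i.i.d. Bernoulli($\theta$) observations with made-up data of 7 heads in 10 flips. The snippet below computes the posterior on a grid under both a flat prior and the Jeffreys prior, which for the Bernoulli parameter is proportional to $\theta^{-1/2}(1-\theta)^{-1/2}$ (a Beta(1/2, 1/2) density); the data and grid resolution are illustrative assumptions.

```python
import numpy as np

# Hypothetical data: 10 coin flips with 7 heads, assumed i.i.d. Bernoulli(theta).
n, k = 10, 7

# Discretize the parameter space Theta = (0, 1) on a fine grid.
theta = np.linspace(0.001, 0.999, 999)

# Likelihood L_n(X_1, ..., X_n | theta) of the observed data.
likelihood = theta**k * (1 - theta)**(n - k)

# Two non-informative priors: flat (uniform), and Jeffreys, which for the
# Bernoulli parameter is proportional to sqrt(I(theta)) = theta^(-1/2) * (1 - theta)^(-1/2).
prior_flat = np.ones_like(theta)
prior_jeffreys = theta**-0.5 * (1 - theta)**-0.5

def posterior(prior):
    """Bayes' formula on the grid: posterior is proportional to likelihood * prior."""
    unnormalized = likelihood * prior
    return unnormalized / np.trapz(unnormalized, theta)  # normalize to integrate to 1

for name, prior in [("flat", prior_flat), ("Jeffreys", prior_jeffreys)]:
    post = posterior(prior)
    mask = theta > 0.5
    # Posterior probability that theta exceeds 0.5 under each prior.
    print(f"P(theta > 0.5 | data), {name} prior: {np.trapz(post[mask], theta[mask]):.3f}")
```

Because both priors are flat or nearly flat, the two posteriors are close, and the difference shrinks further as more observations arrive.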
## Estimation
Once we have a posterior distribution, we can estimate $\theta$ in two main ways:
- *Posterior expectation (LMS):* $\hat \theta^{(\pi)}=\mathbb E[\theta \vert X_1, \dots , X_n]= \int_\Theta \theta \, \underbrace{\pi(\theta \vert X_1, \dots, X_n)}_{\text{posterior}} \, d\theta$
- *Posterior maximum (MAP):* $\hat\theta^{(\pi)}= \arg \max_{\theta \in \Theta}\, \pi(\theta \vert X_1, \dots, X_n)$

Since the Bayesian estimator depends on the prior $\pi$, the prior appears in the superscript.

>[!note:]
>With more observations, the likelihood function gets more “peaked” around the true parameter (due to [[Properties of an Estimator#Key Properties of an Estimator|consistency of the estimator]] as $n\to \infty$).
>
>The prior, in contrast, remains unchanged as observations accumulate. This makes the likelihood term increasingly influential relative to the prior when the two are multiplied to obtain the posterior.
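As a concrete illustration of both estimators (continuing the hypothetical coin-flip data above with a flat prior, so the posterior is a Beta(8, 4) density), they can be read directly off a grid approximation of the posterior:

```python
import numpy as np

# Hypothetical Bernoulli example: 7 heads in 10 flips with a flat prior,
# so the posterior is proportional to theta^7 * (1 - theta)^3, i.e. Beta(8, 4).
theta = np.linspace(0.001, 0.999, 999)
unnormalized = theta**7 * (1 - theta)**3
posterior = unnormalized / np.trapz(unnormalized, theta)

# LMS estimate: posterior expectation, the integral of theta times the posterior density.
theta_lms = np.trapz(theta * posterior, theta)

# MAP estimate: the grid value at which the posterior density is largest.
theta_map = theta[np.argmax(posterior)]

print(f"LMS (posterior mean): {theta_lms:.3f}")  # closed-form Beta(8, 4) mean: 8/12 ≈ 0.667
print(f"MAP (posterior mode): {theta_map:.3f}")  # closed-form Beta(8, 4) mode: 7/10 = 0.700
```

Working on a grid avoids relying on conjugacy; with a conjugate Beta prior, the same quantities are also available in closed form.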