*Frequentist perspective:* Data is generated from the model with some error term $\epsilon$.

$$ y = \alpha+\beta*x+ \epsilon $$

*Bayesian perspective:* Data is sampled from a probability distribution ([[Gaussian Distribution|Gaussian]]) whose mean is given by the same parameters.

$$ y \sim \mathcal N(\alpha+\beta*x, \sigma^2) $$

## First Iteration

**Priors:** Our initial beliefs about the probability distributions of the parameters $\alpha, \beta, \sigma$.

$$
\begin{align}
\mathbf P(\alpha)&=\mathcal N(0,1) \\
\mathbf P(\beta)&=\mathcal N(0,1) \\
\mathbf P(\sigma)&=\text{Half Normal}(0,10)
\end{align}
$$

We also set starting values (e.g. $\alpha=0, \beta=1.1, \sigma=5$) to kick off the algorithm. For these specific values, we can compute the density of the priors:

$$ \mathbf P(\alpha=0)=\overbrace{ \Big(\frac{1}{\sqrt{2\pi}} *e^{-0.5\alpha^2}\Big) }^{\text{Gaussian density}}=0.39894 $$

**Likelihood:** It is the probability of observing a certain $y$ value, given that it has been sampled from $\mathcal N(\alpha+ \beta*x, \sigma^2)$. Assume we got the following data point $\{x_i=50, y_i=60\}$. The likelihood is then:

$$
\begin{align}
\mathbf P(y \vert X, \alpha, \beta, \sigma) &= \mathbf P\Big(y_i=60 \, \vert \,\mathcal N(\alpha + \beta*x_i, \sigma^2)\Big) \\
&= \mathbf P\Big(y_i=60 \, \vert \, \mathcal N(0+1.1*50, 5^2) \Big) \\
&= \mathbf P\Big(y_i=60 \, \vert \, \mathcal N(55, 5^2) \Big) \\
&=0.04839
\end{align}
$$

Now we need to compute the joint [[Likelihood Functions|likelihood]] of all $n$ observations. Since we assume that they are i.i.d., the joint likelihood is the product of the individual likelihoods. We can also log-transform the likelihood to make it numerically easier to compute.

$$
L: \mathbb R^5 \mapsto \mathbb R \\[8pt]
L(y; X, \alpha, \beta, \sigma)= \prod_{i=1}^n \mathbf P(y_i \, \vert \, x_i, \alpha, \beta, \sigma)
$$

**Posterior:** We apply [[Bayes Rule]] and neglect the normalization term $\mathbf P(y \vert X)$, since it does not include any of the parameters we want to infer. Therefore we say that we calculate the unnormalized posterior.

$$ \mathbf P(\alpha, \beta, \sigma \vert y, X) \propto \mathbf P(\alpha)* \mathbf P (\beta)* \mathbf P(\sigma)* L(y; X, \alpha, \beta, \sigma) $$

In frequentist [[Univariate Linear Regression]], we are only interested in the parameter estimates that maximize the likelihood. In Bayesian linear regression, however, we are not only interested in the parameter values that maximize the posterior; we also want their full distributions.

## Metropolis-Hastings Algorithm

So far we have finished our first iteration by calculating the posterior, based on the given data and the chosen starting values for the parameters. The Metropolis-Hastings ("MH") algorithm defines two main steps during the following iterations:

1. Acceptance rule
2. Proposal of new parameter values on which to compute the posterior

**Acceptance rule:** We either accept or reject the proposed parameter values based on the posterior. Let us denote by $p_i, p_{i+1}$ the current and the new (unnormalized) posterior.

$$
\mathbf P(\text{accept}) =\begin{cases}
1 &\text{if} & p_{i+1} > p_i \\
\frac{p_{i+1}}{p_i} &\text{if} & p_{i+1} \le p_i
\end{cases}
$$

This implies that a worse $p_{i+1}$ is rarely accepted when it is significantly lower than $p_i$. Conversely, when $p_{i+1}$ is only incrementally smaller than $p_i$, its acceptance probability is close to $1$. This decision process ensures that the chain of accepted parameter values is representative of the true posterior distribution.
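As a minimal sketch (not part of the original note), the acceptance rule can be evaluated in log space, which avoids numerical underflow when the posterior values are tiny. The names `log_post_new` and `log_post_current` are hypothetical and stand for the unnormalized log posteriors of the proposed and the latest accepted parameter values:

```python
import numpy as np

def accept(log_post_new, log_post_current, rng):
    """Metropolis acceptance rule on unnormalized log posteriors.

    A proposal with a higher posterior is always accepted; a worse one
    is accepted with probability p_new / p_current, which in log space
    corresponds to log(p_new) - log(p_current).
    """
    log_ratio = log_post_new - log_post_current
    # np.log(u) < 0 for u in (0, 1), so a non-negative log ratio is
    # always accepted; a negative one is accepted with probability
    # exp(log_ratio) = p_new / p_current.
    return np.log(rng.uniform()) < log_ratio
```

For example, `accept(-120.3, -119.8, np.random.default_rng(0))` would accept the worse proposal roughly $e^{-0.5}\approx 61\%$ of the time.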
> [!note]
> Since we do not have a previous posterior to compare against during the first iteration, the first posterior is always accepted.

**Proposal Distribution:** In the first iteration we relied on starting values for the parameters $(\alpha=0, \beta=1.1, \sigma=5)$. In the subsequent iterations, the MH algorithm proposes new values for each parameter. It does this by sampling from a user-defined proposal distribution (usually Gaussian), whose location parameter is the latest accepted value of that parameter in the chain.

Now we repeat the process:

1. Sample new parameter values from the proposal distributions
2. Calculate the posterior
3. Apply the acceptance rule

After e.g. 1000 iterations, we collect all accepted parameter values and build a histogram for each parameter to see its distribution. For prediction we simply sample e.g. 1000 times from each parameter distribution to obtain a distribution of the predicted value $y_i$.
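To tie the steps together, here is a compact sketch of the whole loop described above, assuming `x` and `y` are NumPy arrays of observations. It uses `scipy.stats` for the prior and likelihood densities and the same log-space acceptance rule as the sketch above; the proposal step size and the Half-Normal scale interpretation are illustrative assumptions, not prescriptions from this note:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)

def log_posterior(alpha, beta, sigma, x, y):
    """Unnormalized log posterior: log priors + joint log likelihood."""
    if sigma <= 0:                      # outside the Half-Normal support
        return -np.inf
    log_prior = (stats.norm(0, 1).logpdf(alpha)
                 + stats.norm(0, 1).logpdf(beta)
                 + stats.halfnorm(scale=10).logpdf(sigma))
    log_lik = stats.norm(alpha + beta * x, sigma).logpdf(y).sum()
    return log_prior + log_lik

def metropolis_hastings(x, y, n_iter=1000, step=0.5):
    alpha, beta, sigma = 0.0, 1.1, 5.0          # starting values
    current = log_posterior(alpha, beta, sigma, x, y)
    chain = []
    for _ in range(n_iter):
        # Gaussian proposals centered at the latest accepted values
        a_new = rng.normal(alpha, step)
        b_new = rng.normal(beta, step)
        s_new = rng.normal(sigma, step)
        proposed = log_posterior(a_new, b_new, s_new, x, y)
        # Acceptance rule in log space
        if np.log(rng.uniform()) < proposed - current:
            alpha, beta, sigma, current = a_new, b_new, s_new, proposed
        chain.append((alpha, beta, sigma))      # rejected proposals keep the old values
    return np.array(chain)
```

For prediction, one could then draw e.g. 1000 rows from the returned chain and compute `alpha + beta * x_new` for each, which yields a distribution over the predicted value rather than a single point estimate.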