In linear regression, we want to understand the [[Conditional Probability Density Function|Conditional PDF]] of $Y$ given $X$. Let each pair $(X_i, Y_i)$ be [[Independence and Identical Distribution|i.i.d.]] from some unknown joint distribution $\mathbf P_{X,Y}(x,y)$. We can describe $\mathbf P$ entirely by its [[Joint Probability Density Function|Joint PDF]] $h(x,y)$, or by the marginal density $h(x)$ together with the conditional density $h(y \vert x)$.

$ \underbrace{h(y \vert x)}_{\text{conditional}} =\frac{\overbrace{h(x,y)}^{\text{joint}}}{\underbrace{h(x)}_{\text{marginal}}}, \quad \text{where}\; h(x)= \int_y h(x,y) \, dy $

## Regression Function

**Conditional Expectation:** Focusing on the conditional PDF $h(y \vert x)$, its expectation $\mathbb E[Y \vert X]$ is a key concept, often called the regression function. It summarizes the average value of $Y$ for every possible value of $X$.

$ \mathbb E[Y \vert X=x]=\int_y y * h(y \vert x) \, dy $

**Challenge:** However, when $X$ is continuous, the exact computation of $\mathbb E[Y \vert X=x]$ is challenging because there may be no observations at any specific value of $x$. We would essentially need infinitely many data points spread over all possible values of $x$.

**Linear Function:** Thus, we need to impose some structure on the form of the regression function. A [[Linear Functions of Random Variables|linear function]] is the simplest way to model how $Y$ depends on $X$.

$ \mathbb E[Y \vert X=x] \approx \hat y = a + bx $

where:
- $a$ is the intercept, representing the expectation of $Y$ when $X=0$.
- $b$ is the slope, capturing how changes in $X$ affect the expectation of $Y$.

## Least Squares Estimator

The least squares estimator (also called the [[LLMS Estimator]]) is a method used to determine the best-fit linear relationship between two [[Random Variable|Random Variables]], $X$ and $Y$, under the assumption that they follow a joint probability distribution with finite [[Expectation]] and [[Variance]]. The goal of least squares estimation is to minimize the [[MSE]], i.e. the expectation of the squared difference between the estimator $\hat Y$ and the observed value $Y$.

$ (a^\star, b^\star) =\arg \min_{(a,b)} \, \mathbb E[(Y-\underbrace{(a+b*X)}_{\hat Y})^2] $

Over all possible choices of $(a,b)$, we find the pair $(a^\star,b^\star)$ that minimizes this expression. To find $(a^\star, b^\star)$ we compute each [[Partial Derivative]] and set it to $0$.

**Derivative of Slope Coefficient:**

$ \begin{align} \frac{\partial}{\partial b} \, \mathbb E[(Y-a-bX)^2] &= -2* \mathbb E[X(Y-a-bX)] \tag{1}\\[6pt] -2* \mathbb E[X(Y-a^\star-b^\star X)]&=0 \tag{2}\\[8pt] \underbrace{\mathbb E[XY]-\mathbb E[X]\mathbb E[Y]}_{\text{Cov}(X,Y)}+ \underbrace{b^\star \mathbb E[X]^2-b^\star \mathbb E[X^2]}_{-b^\star\text{Var}(X)} &=0 \tag{3} \end{align} $

To get from $(2)$ to $(3)$, expand $(2)$ to $\mathbb E[XY]-a^\star\mathbb E[X]-b^\star\mathbb E[X^2]=0$ and substitute the intercept condition $a^\star=\mathbb E[Y]-b^\star\mathbb E[X]$ derived below. Resulting in:

$ b^\star = \frac{\mathrm{Cov}(X,Y)}{\mathrm{Var}(X)} $

Since $b^\star$ is the slope of the regression line, the further $\mathrm{Cov}(X,Y)$ lies above $0$, the more steeply the line slopes upward, and vice versa. The variance term normalizes the [[Covariance]].
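To make the slope formula concrete, here is a minimal sketch (the simulation setup and variable names are my own assumptions, not part of the note). It draws data from $Y = 2 + 3X + \varepsilon$ with noise independent of $X$, so $\mathrm{Cov}(X,Y) = 3\,\mathrm{Var}(X)$ and the ratio $\mathrm{Cov}(X,Y)/\mathrm{Var}(X)$ should recover a slope of roughly $3$.

```python
import numpy as np

# Hypothetical simulation: Y = 2 + 3X + noise, with the noise independent of X,
# so Cov(X, Y) = 3 * Var(X) and the least squares slope should come out near 3.
rng = np.random.default_rng(42)
n = 100_000
x = rng.normal(loc=1.0, scale=2.0, size=n)            # Var(X) = 4
y = 2.0 + 3.0 * x + rng.normal(scale=1.0, size=n)

cov_xy = np.cov(x, y, ddof=0)[0, 1]   # sample estimate of Cov(X, Y)
var_x = np.var(x)                     # sample estimate of Var(X)

b_star = cov_xy / var_x               # slope formula: Cov(X, Y) / Var(X)
print(b_star)                         # ~3.0
```

Flipping the sign of the coefficient on $X$ in the simulation flips the sign of the covariance, and hence of the slope, matching the interpretation above.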
**Derivative of Intercept:**

$ \begin{align} \frac{\partial}{\partial a} \,\mathbb E[(Y-a-bX)^2] &= -2*\mathbb E[Y-a-bX] \tag{4}\\[6pt] -2*\mathbb E[Y-a^\star-b^\star X]&=0 \tag{5}\\[8pt] a^\star&=\mathbb E[Y]-b^\star \mathbb E[X] \tag{6} \end{align} $

Substituting the slope coefficient $b^\star$:

$ a^\star = \mathbb E[Y]-\frac{\mathrm{Cov}(X,Y)}{\mathrm{Var}(X)}* \mathbb E[X] $

>[!note]
>The same derivation works in an empirical setup, where we only observe sample data. We simply replace each expectation with a sample average $\frac{1}{n}\sum_{i=1}^n$.
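As a sketch of that empirical setup (the data-generating choices and variable names below are my own, for illustration only), each expectation in the formulas for $a^\star$ and $b^\star$ is replaced by a sample average, and the result is cross-checked against `np.polyfit`, which fits the same least squares line.

```python
import numpy as np

# Assumed sample data, for illustration: true line is y = 1.5 - 0.8x plus noise.
rng = np.random.default_rng(7)
x = rng.uniform(0, 10, size=500)
y = 1.5 - 0.8 * x + rng.normal(scale=2.0, size=500)

# Replace each expectation with a sample average (1/n * sum):
x_bar, y_bar = x.mean(), y.mean()
cov_xy = np.mean((x - x_bar) * (y - y_bar))   # empirical Cov(X, Y)
var_x = np.mean((x - x_bar) ** 2)             # empirical Var(X)

b_hat = cov_xy / var_x                        # slope:     Cov(X, Y) / Var(X)
a_hat = y_bar - b_hat * x_bar                 # intercept: E[Y] - b * E[X]

# Cross-check against numpy's own degree-1 least squares fit
b_np, a_np = np.polyfit(x, y, deg=1)
print(a_hat, b_hat)   # roughly (1.5, -0.8)
print(a_np, b_np)     # should agree with the closed-form estimates
```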