In linear regression, we want to understand the [[Conditional Probability Density Function|Conditional PDF]] of $Y$ given $X$. Let each pair $(X_i, Y_i)$ be [[Independence and Identical Distribution|i.i.d.]] from some unknown joint distribution $\mathbf P_{X,Y}(x,y)$.
We can describe $\mathbf P$ entirely by its [[Joint Probability Density Function|Joint PDF]] $h(x,y)$, or by the marginal density $h(x)$ together with the conditional density $h(y \vert x)$.
$
\underbrace{h(y \vert x)}_{\text{conditional}} =\frac{\overbrace{h(x,y)}^{\text{joint}}}{\underbrace{h(x)}_{\text{marginal}}}, \quad \text{where}\; h(x)= \int_y h(x,y) \, dy
$
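As a quick numerical illustration of this identity, the sketch below evaluates a hypothetical joint density (a bivariate Gaussian with correlation $0.5$, purely an assumption for illustration) on a grid, integrates out $y$ to obtain the marginal, and divides to recover the conditional.
```python
import numpy as np
from scipy.stats import multivariate_normal

# Hypothetical joint h(x, y): bivariate Gaussian with correlation 0.5
# (an illustrative assumption, not implied by the note).
rho = 0.5
joint = multivariate_normal(mean=[0.0, 0.0], cov=[[1.0, rho], [rho, 1.0]])

# Evaluate h(x, y) on a grid.
xs = np.linspace(-4.0, 4.0, 401)
ys = np.linspace(-4.0, 4.0, 401)
dy = ys[1] - ys[0]
X, Y = np.meshgrid(xs, ys, indexing="ij")
h_xy = joint.pdf(np.dstack([X, Y]))        # shape (401, 401)

# Marginal h(x): integrate the joint over y (simple Riemann sum).
h_x = h_xy.sum(axis=1) * dy

# Conditional h(y | x): joint divided by marginal, for each fixed x.
h_y_given_x = h_xy / h_x[:, None]

# Sanity check: each conditional density integrates to ~1 over y.
print(h_y_given_x[200].sum() * dy)         # slice at x = 0, close to 1.0
```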
## Regression Function
**Conditional Expectation:**
Focusing on the conditional PDF $h(y\vert x)$, its expectation $\mathbb E[Y \vert X]$ is a key concept, often called the regression function. It summarizes the average value of $Y$ for every possible value of $X$.
$ \mathbb E[Y \vert X=x]=\int_y y * h(y \vert x) \, dy $
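A simple way to see this definition in action is to estimate $\mathbb E[Y \vert X=x]$ by averaging $Y$ over samples whose $X$ falls in a narrow window around $x$. The sketch below does this for a hypothetical model $Y = 2 + 0.5X + \varepsilon$ (an illustrative assumption), where the true regression function is $2 + 0.5x$.
```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical model (assumption): X ~ N(0, 1), Y = 2 + 0.5*X + noise,
# so the true regression function is E[Y | X = x] = 2 + 0.5*x.
n = 200_000
x = rng.normal(size=n)
y = 2.0 + 0.5 * x + rng.normal(size=n)

def cond_mean(x0, width=0.05):
    """Estimate E[Y | X = x0] by averaging Y over samples with X near x0."""
    mask = np.abs(x - x0) < width
    return y[mask].mean()

for x0 in (-1.0, 0.0, 1.0):
    print(x0, cond_mean(x0))   # each should be close to 2 + 0.5*x0
```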
**Challenge:**
However, when $X$ is continuous, computing $\mathbb E[Y \vert X=x]$ exactly from data is challenging because we may observe few or no samples at any specific value of $x$. We would essentially need infinitely many data points spread over all possible values of $x$.
**Linear Function:**
Thus, we impose some structure on the form of the regression function. A [[Linear Functions of Random Variables|linear function]] is the simplest way to model the dependence of $Y$ on $X$.
$ \mathbb E[Y \vert X=x] \approx \hat y = a + bx $
where:
- $a$ is the intercept, representing the expectation of $Y$ when $X=0$.
- $b$ is the slope, capturing how changes in $X$ affect the expectation of $Y$.
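For instance, under the purely hypothetical values $a = 2$ and $b = 0.5$, the linear approximation at $X = 3$ would be:
$ \hat y = 2 + 0.5 \cdot 3 = 3.5 $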
## Least Squares Estimator
The least squares estimator (also called [[LLMS Estimator]]) is a method used to determine the best-fit linear relationship between two [[Random Variable|Random Variables]], $X$ and $Y$, under the assumption that they follow a joint probability distribution with finite [[Expectation]] and [[Variance]].
The goal of least squares estimation is to minimize the [[MSE]], i.e. the expectation of the squared differences between the estimator $\hat Y$ and the observed values $Y$.
$
(a^\star, b^\star) =\arg \min_{(a,b)} \, \mathbb E[(Y-\underbrace{(a+bX)}_{\hat Y})^2]
$
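To make the objective concrete, the sketch below minimizes an empirical version of this MSE directly over $(a,b)$ with a generic numerical optimizer, using the same hypothetical linear-plus-noise data as above (true intercept $2$, true slope $0.5$); the minimizer should land near those values.
```python
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(1)

# Hypothetical data (assumption): Y = 2 + 0.5*X + noise.
n = 50_000
x = rng.normal(size=n)
y = 2.0 + 0.5 * x + rng.normal(size=n)

def mse(params):
    """Empirical MSE: average of (y - a - b*x)^2 over the sample."""
    a, b = params
    return np.mean((y - a - b * x) ** 2)

# Search over all (a, b) numerically instead of using the closed form.
res = minimize(mse, x0=[0.0, 0.0])
print(res.x)   # should be close to (2.0, 0.5)
```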
Over all possible choices of $(a,b)$, we look for the pair $(a^\star, b^\star)$ that minimizes this expression. To find it, we compute each [[Partial Derivative]] and set it to $0$.
**Derivative of Slope Coefficient:**
$
\begin{align}
\frac{\partial}{\partial b} \, \mathbb E[(Y-a-bX)^2] &= -2* \mathbb E[X(Y-a-bX)] \tag{1}\\[6pt]
-2* \mathbb E[X(Y-a^\star-b^\star X)]&=0 \tag{2}\\[8pt]
\underbrace{\mathbb E[XY]-\mathbb E[X]\mathbb E[Y]}_{\text{Cov}(X,Y)}+ \underbrace{b^\star \mathbb E[X]^2-b^\star \mathbb E[X^2]}_{-b^\star\text{Var}(X)} &=0 \tag{3}
\end{align}
$
Going from $(2)$ to $(3)$, we expand the expectation and substitute the intercept $a^\star=\mathbb E[Y]-b^\star \mathbb E[X]$ (derived below), so that the $-a^\star \mathbb E[X]$ term becomes $-\mathbb E[X]\mathbb E[Y]+b^\star \mathbb E[X]^2$. Solving for $b^\star$ results in:
$ b^\star = \frac{\mathrm{Cov}(X,Y)}{\mathrm {Var(X)}} $
Since $b^\star$ is the slope of the regression line, the further $\mathrm{Cov}(X,Y)$ lies above $0$, the more steeply the line slopes upward, and vice versa. The variance term normalizes the [[Covariance]] term.
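As a sanity check on this formula, the sketch below computes the sample covariance and variance on the same kind of hypothetical data and confirms that their ratio recovers the true slope.
```python
import numpy as np

rng = np.random.default_rng(2)

# Hypothetical data (assumption): Y = 2 + 0.5*X + noise, true slope 0.5.
x = rng.normal(size=100_000)
y = 2.0 + 0.5 * x + rng.normal(size=100_000)

cov_xy = np.cov(x, y)[0, 1]    # sample covariance Cov(X, Y)
var_x = np.var(x, ddof=1)      # sample variance Var(X)

b_star = cov_xy / var_x
print(b_star)                  # close to the true slope 0.5
```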
**Derivative of Intercept:**
$
\begin{align}
\frac{\partial}{\partial a} \,\mathbb E[(Y-a-bX)^2]&= -2*\mathbb E[Y-a-bX] \tag{4}\\[6pt]
-2*\mathbb E[Y-a^\star-b^\star X]&=0 \tag{5}\\[8pt]
a^\star&=\mathbb E[Y]-b^\star \mathbb E[X] \tag{6}
\end{align}
$
Substituting the slope coefficient $b^\star$:
$ a^\star = \mathbb E[Y]-\frac{\mathrm {Cov}(X,Y)}{\mathrm{Var}(X)}* \mathbb E[X]$
> [!note]
>The same derivation works in an empirical setting, where we only observe sample data: we simply replace expectations with sample averages $\frac{1}{n}\sum_{i=1}^{n}$.
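A minimal sketch of that empirical version, again on hypothetical linear-plus-noise data: every expectation in the formulas is replaced by a sample mean, and the result is compared against NumPy's built-in polynomial fit.
```python
import numpy as np

rng = np.random.default_rng(3)

# Hypothetical data (assumption): Y = 2 + 0.5*X + noise.
x = rng.normal(size=10_000)
y = 2.0 + 0.5 * x + rng.normal(size=10_000)

# Replace every expectation in the formulas with a sample average (1/n * sum).
b_hat = ((x * y).mean() - x.mean() * y.mean()) / ((x**2).mean() - x.mean() ** 2)
a_hat = y.mean() - b_hat * x.mean()

print(a_hat, b_hat)                # empirical least squares estimates
print(np.polyfit(x, y, deg=1))     # reference fit: returns (slope, intercept)
```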