**Real-World Phenomena:**
- There is a relationship between $X$ and $y$ that can be described by the unknown function $f$.
- In supervised learning we want to derive a function $f_{\text{est}}$ that approximates $f$.
$ f_{\text{est}} \approx f $
**Sample Data:**
- To do that, we observe sample data $X_{\text{data}}$ and $y_{\text{data}}$ drawn from the [[Random Variable|Random Variables]] $X$ and $y$.
- Their relationship is described by $f_{\text{data}}$.
- Our goal is to find an $f_{\text{est}}$ that approximates $f_{\text{data}}$ well.
$ f_{\text{est}}(\theta, \theta_0) \approx f_{\text{data}}(X_{\text{data}}, y_{\text{data}}) $
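As a concrete sketch, this setup can be simulated with a made-up true $f$ and noise model (both are assumptions for illustration, not given above):

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical "unknown" real-world relationship f (an assumption for this sketch).
def f(X):
    return 3.0 * X - 1.5

# Sample data observed from the Random Variables X and y, with observation noise;
# f_data is the (noisy) relationship between X_data and y_data.
X_data = rng.uniform(-1.0, 1.0, size=50)
y_data = f(X_data) + rng.normal(0.0, 0.3, size=50)
```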
**Minimization:**
The function $f_{\text{est}}$ has parameters $\theta, \theta_0$. We need to find the parameter values that minimize the loss function $L$, since $L$ measures the difference between $f_{\text{est}}$ and $f_{\text{data}}$.
$ \begin{aligned} \min_{\theta, \theta_0} J &= L(f_{\text{est}}, f_{\text{data}}) \\ &= L(\theta, \theta_0, X_{\text{data}}, y_{\text{data}}) \end{aligned} $
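A minimal sketch of this minimization, assuming a linear model $f_{\text{est}}(x) = \theta x + \theta_0$ and a squared-error loss $L$ (both are assumptions, the section does not fix them), optimized with plain gradient descent on the data generated above:

```python
# Model family f_est with parameters theta, theta_0 (assumed linear here).
def f_est(X, theta, theta_0):
    return theta * X + theta_0

# Loss L: mean squared difference between f_est's predictions and y_data.
def loss(theta, theta_0, X_data, y_data):
    return np.mean((f_est(X_data, theta, theta_0) - y_data) ** 2)

# Gradient descent on J = L(theta, theta_0, X_data, y_data).
theta, theta_0 = 0.0, 0.0
lr = 0.1
for _ in range(500):
    residual = f_est(X_data, theta, theta_0) - y_data
    theta   -= lr * np.mean(2.0 * residual * X_data)  # dJ/dtheta
    theta_0 -= lr * np.mean(2.0 * residual)           # dJ/dtheta_0
```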
**Regularization:**
However, we have to keep in mind that $f_{\text{data}} \neq f$: the sample only approximates the true relationship. We therefore add a regularization term $R$ so that the parameters of $f_{\text{est}}(\theta, \theta_0)$ do not overfit to the specific sampled observations.
$ \min_{\theta, \theta_0} J = L(\theta, \theta_0, X_{\text{data}}, y_{\text{data}}) + \alpha R(\theta) $
The impact of the regularization $R$ is controlled by a weight $\alpha$. It is called a hyperparameter, since it is not minimized itself but set before training.
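Continuing the sketch above with the common (but here assumed) choice of an L2 penalty $R(\theta) = \theta^2$, the regularized objective and its gradient step become:

```python
alpha = 0.1  # hyperparameter: fixed before training, not minimized

# J = L(theta, theta_0, X_data, y_data) + alpha * R(theta), with R(theta) = theta**2.
def objective(theta, theta_0, X_data, y_data, alpha):
    return loss(theta, theta_0, X_data, y_data) + alpha * theta ** 2

for _ in range(500):
    residual = f_est(X_data, theta, theta_0) - y_data
    # The penalty adds alpha * 2 * theta to dJ/dtheta; theta_0 is not regularized.
    theta   -= lr * (np.mean(2.0 * residual * X_data) + alpha * 2.0 * theta)
    theta_0 -= lr * np.mean(2.0 * residual)
```

Note that $\theta_0$ is left out of $R(\theta)$, matching the formula above: the intercept is typically not penalized.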