**Real-World Phenomena:**

- There is a relationship between $X$ and $y$ that can be described by an unknown function $f$.
- In supervised learning we want to derive a function $f_{\text{est}}$ that approximates $f$.

$$
f_{\text{est}} \sim f
$$

**Sample Data:**

- To do that, we observe sample data $X_{\text{data}}$ and $y_{\text{data}}$ drawn from the [[Random Variable|Random Variables]] $X$ and $y$.
- Their relationship is described by $f_{\text{data}}$.
- Our goal is to find an $f_{\text{est}}$ that approximates $f_{\text{data}}$ well.

$$
f_{\text{est}}(\theta, \theta_0) \sim f_{\text{data}}(X, y)
$$

**Minimization:**

The function $f_{\text{est}}$ has parameters $\theta, \theta_0$. We need to find the parameter values that minimize the loss function $L$, since $L$ measures the difference between $f_{\text{est}}$ and $f_{\text{data}}$.

$$
\begin{aligned}
\min_{\theta, \theta_0} J &= L(f_{\text{est}}, f_{\text{data}}) \\
&= L(\theta, \theta_0, X_{\text{data}}, y_{\text{data}})
\end{aligned}
$$

**Regularization:**

However, we have to keep in mind that $f_{\text{data}} \neq f$: the sample only approximates the real-world relationship. Therefore we add a regularization term $R$ so that the parameters of $f_{\text{est}}(\theta, \theta_0)$ do not fit the specific sampled observations too closely.

$$
\min_{\theta, \theta_0} J = L(\theta, \theta_0, X_{\text{data}}, y_{\text{data}}) + \alpha R(\theta)
$$

The impact of the regularization $R$ is controlled by a weight $\alpha$. It is called a hyperparameter, since it is not itself part of the minimization.
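As a minimal sketch of the whole pipeline, the code below minimizes $J$ with plain gradient descent. It assumes a linear $f_{\text{est}}$, a squared-error loss $L$, and an L2 regularizer $R(\theta) = \lVert\theta\rVert^2$; none of these choices are fixed by the derivation above, and all data and names here are illustrative.

```python
import numpy as np

# Hypothetical stand-ins for X_data and y_data (shapes and values are
# assumptions made purely for illustration).
rng = np.random.default_rng(0)
X_data = rng.normal(size=(100, 3))                 # 100 observations, 3 features
true_theta = np.array([2.0, -1.0, 0.5])
y_data = X_data @ true_theta + 3.0 + rng.normal(scale=0.1, size=100)

def f_est(theta, theta_0, X):
    """One possible choice of f_est: a linear model."""
    return X @ theta + theta_0

def J(theta, theta_0, X, y, alpha):
    """Objective: squared-error loss L plus alpha * R(theta)."""
    residual = f_est(theta, theta_0, X) - y
    L = np.mean(residual ** 2)                     # difference between f_est and f_data
    R = np.sum(theta ** 2)                         # regularizer acts on theta only
    return L + alpha * R

# Minimize J over (theta, theta_0) by gradient descent.
# alpha stays fixed throughout: it is a hyperparameter, not a parameter.
theta, theta_0 = np.zeros(3), 0.0
alpha, lr = 0.01, 0.1
for _ in range(500):
    residual = f_est(theta, theta_0, X_data) - y_data
    grad_theta = 2 * X_data.T @ residual / len(y_data) + 2 * alpha * theta
    grad_theta_0 = 2 * residual.mean()
    theta -= lr * grad_theta
    theta_0 -= lr * grad_theta_0

print(theta, theta_0)   # should land near true_theta and 3.0 (slightly shrunk by alpha)
```

Note that the loop never updates $\alpha$; that is exactly what makes it a hyperparameter. In practice $\alpha$ is typically chosen by evaluating candidate values on held-out data rather than by the minimization itself.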