In [[Univariate Linear Regression]], we estimate $y$ by multiplying the variable $x$ with a slope parameter $m$ and adding a constant $c$ on top:

$y(x_i,\mathbf a)=mx_i+c, \quad \mathbf a= \begin{bmatrix} m \\ c \end{bmatrix}$

In Non-Linear Least Squares, we generalize from fitting that [[Linear Functions of Random Variables|linear function]] to fitting any non-linear function of the observations $x_i$ and the parameter vector $\mathbf a$:

$y=f(x_i, \mathbf a), \quad \mathbf a= \begin{bmatrix} a_1\\ \vdots \\ a_k\end{bmatrix}$

**Loss function:** The goal is to optimize $\mathbf a$ such that $f(x_i, \mathbf a)$ fits the data $(x_i, y_i)$ as closely as possible. This is equivalent to minimizing the loss function $\chi^2$:

$ \chi^2 = \sum_{i=1}^n \frac{\big(y_i-f(x_i, \mathbf a)\big)^2}{\sigma_i^2} $

where:
- $y_i$: Observed value
- $f(x_i, \mathbf a)$: Model prediction
- $\sigma_i$: External weight representing the uncertainty of observation $y_i$. The larger $\sigma_i$, the less that data point contributes to the loss (making it less important for the fit).

> [!note]
> $\sigma_i$ is either provided externally or derived from variance estimates. If no such information is available, a constant $\sigma$ is assumed for all $x_i$.

**Gradient:** The [[Gradient Descent#Gradient Vector|gradient vector]] of $\chi^2$ contains the [[Partial Derivative|partial derivatives]] w.r.t. each parameter $a_j$:

$ \nabla \chi^2 = \begin{bmatrix} \frac{\partial \chi^2}{\partial a_1}\\ \vdots \\ \frac{\partial \chi^2}{\partial a_k} \end{bmatrix} $

For each parameter $a_j$ we apply the [[Differentiation Rules#Chain Rule|Chain Rule]] to compute:

$ \frac{\partial \chi^2}{\partial a_j} = -2\sum_{i=1}^n \frac{y_i-f(x_i, \mathbf a)}{\sigma_i^2} \, \frac{\partial f}{\partial a_j} $

**Optimization:** Using the gradient $\nabla \chi^2$, we iteratively update the parameter vector $\mathbf a$ to minimize $\chi^2$. The learning rate $\gamma$ defines the step size of each update:

$ \mathbf a_{\text{new}} =\mathbf a_{\text{current}} - \gamma \nabla \chi^2 $

> [!note]
> In some solvers, the [[Hessian]] is used to estimate a suitable step size at each iteration: the stronger the curvature, the smaller the step should be.
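
To make the steps concrete, here is a minimal NumPy sketch (not part of the original note) that fits an assumed exponential model $f(x,\mathbf a)=a_1 e^{a_2 x}$ by plain gradient descent on $\chi^2$. The model, the synthetic data, the constant $\sigma_i$, the learning rate, and the iteration count are all illustrative choices, not prescribed values.

```python
import numpy as np

def f(x, a):
    """Illustrative non-linear model: f(x, a) = a1 * exp(a2 * x)."""
    a1, a2 = a
    return a1 * np.exp(a2 * x)

def grad_f(x, a):
    """Partial derivatives of f w.r.t. a1 and a2, stacked as rows (k x n)."""
    a1, a2 = a
    return np.array([np.exp(a2 * x),            # df/da1
                     a1 * x * np.exp(a2 * x)])  # df/da2

def chi2(x, y, a, sigma):
    """Weighted sum of squared residuals."""
    r = (y - f(x, a)) / sigma
    return np.sum(r**2)

def grad_chi2(x, y, a, sigma):
    """Gradient of chi^2: -2 * sum_i (y_i - f_i)/sigma_i^2 * df/da_j."""
    w = (y - f(x, a)) / sigma**2
    return -2.0 * grad_f(x, a) @ w

# --- synthetic data (assumed, for illustration only) ---
rng = np.random.default_rng(0)
x = np.linspace(0.0, 2.0, 50)
y = 1.5 * np.exp(0.8 * x) + rng.normal(0.0, 0.05, x.size)
sigma = np.full_like(x, 0.05)   # constant sigma when no per-point uncertainty is known

# --- gradient descent ---
a = np.array([1.0, 0.5])        # initial guess
gamma = 2e-7                    # small, because the 1/sigma_i^2 weights scale the gradient
for _ in range(20000):
    a = a - gamma * grad_chi2(x, y, a, sigma)

print("fitted parameters:", a)  # should approach the true values [1.5, 0.8]
print("final chi^2:", chi2(x, y, a, sigma))
```

In practice, dedicated solvers such as `scipy.optimize.curve_fit` handle the step-size question automatically using curvature information, in line with the Hessian note above.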