In [[Univariate Linear Regression]], we estimate $y$ by multiplying the observation $x_i$ with a slope parameter $m$ and adding a constant offset $c$.
$y(x_i,\mathbf a)=mx_i+c, \quad \mathbf a= \begin{bmatrix} m \\ c \end{bmatrix}$
In Non-Linear Least Squares, we generalize from fitting that [[Linear Functions of Random Variables|linear function]] to fitting an arbitrary non-linear function of the observations $x_i$ and the parameter vector $\mathbf a$.
$y=f(x_i, \mathbf a), \quad \mathbf a= \begin{bmatrix} a_1\\ \vdots \\ a_k\end{bmatrix}$
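As a purely illustrative example (not prescribed by the method), $f$ could be an exponential decay with a constant offset, written as a small NumPy function that the sketches below reuse:

```python
import numpy as np

def f(x, a):
    """Hypothetical non-linear model: exponential decay plus offset.

    a[0]: amplitude, a[1]: decay rate, a[2]: constant offset.
    """
    return a[0] * np.exp(-a[1] * x) + a[2]
```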
**Loss function:** The goal is to optimize $\mathbf a$ such that $f(x_i, \mathbf a)$ fits the data $(x_i, y_i)$ as closely as possible. This is equivalent to minimizing the loss function $\chi^2$.
$ \chi^2 = \sum_{i=1}^n \frac{\big(y_i-f(x_i, \mathbf a)\big)^2}{\sigma_i^2} $
where:
- $y_i$: Observed value
- $f(x_i, \mathbf a)$: Model prediction
- $\sigma_i$: External weight representing the uncertainty of the observation $y_i$. The larger $\sigma_i$, the less its residual contributes to the loss (the data point is down-weighted).
>[!note:]
>$\sigma_i$ is either provided externally or derived from variance estimates. If no such information is available, a constant $\sigma$ is assumed for all observations.
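A minimal sketch of this loss, assuming NumPy arrays `x`, `y`, `sigma` and a model callable `f(x, a)` like the one above; if no uncertainties are available, `sigma` can simply be an array of ones.

```python
import numpy as np

def chi_squared(a, x, y, sigma, f):
    """Weighted sum of squared residuals: chi^2 = sum_i ((y_i - f(x_i, a)) / sigma_i)^2."""
    residuals = (y - f(x, a)) / sigma
    return np.sum(residuals ** 2)
```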
**Gradient:** The [[Gradient Descent#Gradient Vector|gradient vector]] of $\chi^2$ contains the [[Partial Derivative|partial derivatives]] w.r.t. each parameter $a_j$:
$ \nabla \chi^2 = \begin{bmatrix}
\frac{\partial \chi^2}{\partial a_1}\\ \vdots \\
\frac{\partial \chi^2}{\partial a_k} \end{bmatrix} $
For each parameter $a_j$ we apply the [[Differentiation Rules#Chain Rule|Chain Rule]] to compute:
$ \frac{\partial \chi^2}{\partial a_j} = -2\sum_{i=1}^n \frac{y_i-f(x_i, \mathbf a)}{\sigma_i^2} \, \frac{\partial f}{\partial a_j} $
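A sketch of this gradient under the same assumed array layout; the partial derivatives $\partial f / \partial a_j$ are approximated by forward differences here, and analytic derivatives can be substituted when they are known.

```python
import numpy as np

def chi_squared_gradient(a, x, y, sigma, f, eps=1e-8):
    """Gradient of chi^2: d(chi^2)/da_j = -2 * sum_i (y_i - f(x_i, a)) / sigma_i^2 * df/da_j."""
    a = np.asarray(a, dtype=float)
    weighted_residuals = (y - f(x, a)) / sigma ** 2   # (y_i - f(x_i, a)) / sigma_i^2
    grad = np.zeros_like(a)
    for j in range(a.size):
        a_step = a.copy()
        a_step[j] += eps
        df_daj = (f(x, a_step) - f(x, a)) / eps       # numerical df/da_j
        grad[j] = -2.0 * np.sum(weighted_residuals * df_daj)
    return grad
```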
**Optimization:** Using the gradient $\nabla \chi^2$, we iteratively update the parameter vector $\mathbf a$ to minimize $\chi^2$. The learning rate $\gamma$ thereby defines the step size of updates per iteration.
$ \mathbf a_{\text{new}} =\mathbf a_{\text{current}} - \gamma \nabla \chi^2 $
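A bare-bones descent loop built on `chi_squared_gradient` from the sketch above, assuming a fixed learning rate `gamma` and a fixed iteration budget; practical solvers add a convergence check on $\chi^2$ or on the parameter change.

```python
import numpy as np

def fit(a0, x, y, sigma, f, gamma=1e-3, n_iter=5000):
    """Repeatedly apply a_new = a_current - gamma * grad(chi^2)."""
    a = np.asarray(a0, dtype=float)
    for _ in range(n_iter):
        a = a - gamma * chi_squared_gradient(a, x, y, sigma, f)
    return a
```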
> [!note:]
> In some solvers, the [[Hessian]] is used to estimate a suitable step size at each iteration: the stronger the curvature, the smaller the step should be.
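One common way to use curvature information is a Gauss-Newton style step, where $J^\top W J$ (with $W_{ii}=1/\sigma_i^2$) approximates the Hessian of $\chi^2/2$. The sketch below is only one possible realization of the idea in the note above, again with finite-difference derivatives.

```python
import numpy as np

def gauss_newton_step(a, x, y, sigma, f, eps=1e-8):
    """One curvature-scaled update: solve (J^T W J) delta = J^T W r with W = diag(1/sigma^2).

    Directions with strong curvature (large entries in J^T W J) automatically
    receive smaller steps than a fixed learning rate would give them.
    """
    a = np.asarray(a, dtype=float)
    r = y - f(x, a)                      # residuals y_i - f(x_i, a)
    w = 1.0 / sigma ** 2                 # per-point weights
    J = np.empty((len(x), a.size))       # Jacobian df/da_j, one column per parameter
    for j in range(a.size):
        a_step = a.copy()
        a_step[j] += eps
        J[:, j] = (f(x, a_step) - f(x, a)) / eps
    JTW = J.T * w                        # J^T W via broadcasting over the data axis
    delta = np.linalg.solve(JTW @ J, JTW @ r)
    return a + delta
```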