Finding the optimal parameters $\theta, \theta_0$ for the [[Support Vector Machine]] means minimizing its objective function $J(\theta, \theta_0)$. We will derive the [[Partial Derivative|Partial Derivatives]] $\frac{\partial J}{\partial \theta}$, $\frac{\partial J}{\partial \theta_0}$ and apply [[Gradient Descent]].

![[Support Vector Machine#^6f5abf]]

We see that the objective function is the sum of two parts: the average *hinge loss* and the *regularization* term. Therefore we can find the partial derivative of each term separately and add them up again.

## Differentiate Hinge Loss

$
\begin{align}
\frac{\partial}{\partial\theta} &\frac{1}{n}\sum_{i=1}^{n}\text{Loss}_h \big(y^{(i)} (\theta \cdot x^{(i)}+\theta_0)\big) \tag{1}\\[10pt]
=&\frac{1}{n}\sum_{i=1}^{n} \frac{\partial}{\partial\theta} \text{Loss}_h \big(y^{(i)} (\theta \cdot x^{(i)}+\theta_0)\big) \tag{2}
\end{align}
$

(2) The derivative of a sum is equal to the sum of the derivatives, so we can move the differentiation inside the sum.

The hinge loss is split into two cases that have to be treated separately. Only the case $z < 1$ is interesting, because the other case is the constant $0$ and its derivative vanishes.

$
\text{Loss}_h\big( \underbrace{y^{(i)}(\theta \cdot x^{(i)}+\theta_0)}_{z}\big)=
\begin{cases}
0 & \text{if } z \ge 1\\
1-z & \text{if } z <1\\
\end{cases}
$

Differentiate $\text{Loss}_h(z)=1-z$ w.r.t. $\theta$ when $z<1$:

$
\frac{\partial}{\partial\theta}\text{Loss}_h(z)=\frac{\partial}{\partial\theta} \Big(1-\overbrace{y^{(i)}(\theta \cdot x^{(i)}+\theta_0)}^{z}\Big) = -y^{(i)}x^{(i)}
$

Differentiate $\text{Loss}_h(z)=1-z$ w.r.t. $\theta_0$ when $z<1$:

$
\frac{\partial}{\partial\theta_0}\text{Loss}_h(z)=\frac{\partial}{\partial\theta_0} \Big(1-\overbrace{y^{(i)}(\theta \cdot x^{(i)}+\theta_0)}^{z}\Big) = -y^{(i)}
$

Applying these derivatives to the case statements of the hinge loss tells us how to update the parameters $\theta, \theta_0$:

$
\frac{\partial}{\partial\theta}\text{Loss}_h\big( \underbrace{y^{(i)}(\theta \cdot x^{(i)}+\theta_0)}_{z}\big)=
\begin{cases}
0 & \text{if } z \ge 1\\
-y^{(i)}x^{(i)} & \text{if } z <1\\
\end{cases}
$

$
\frac{\partial}{\partial\theta_0}\text{Loss}_h\big( \underbrace{y^{(i)}(\theta \cdot x^{(i)}+\theta_0)}_{z}\big)=
\begin{cases}
0 & \text{if } z \ge 1\\
-y^{(i)} & \text{if } z <1\\
\end{cases}
$

## Differentiate Regularization

We differentiate the regularization term w.r.t. $\theta$ and $\theta_0$ separately. Since the term does not depend on $\theta_0$, its derivative w.r.t. $\theta_0$ is zero.

$
\frac{\partial}{\partial \theta} \frac{\lambda}{2} \lVert \theta \rVert^2 = \lambda \theta, \quad \frac{\partial}{\partial \theta_0} \frac{\lambda}{2} \lVert \theta \rVert^2 = 0
$

## Gradient Descent

To apply gradient descent we iteratively update $\theta, \theta_0$ in the direction opposite to their derivatives. The procedure is as follows:

- Initialize $\theta, \theta_0$ at some value.
- Repeatedly update $\theta$ with $\theta = \theta - \eta \,\frac{\partial J}{\partial \theta}$.
- Repeatedly update $\theta_0$ with $\theta_0 = \theta_0 - \eta \,\frac{\partial J}{\partial \theta_0}$.

The pseudocode below applies the update per training example (stochastic gradient descent) rather than on the full average. Note that the hinge-loss derivative w.r.t. $\theta_0$ is $-y^{(i)}$, so the corresponding update *adds* $\eta\, y_i$ to $\theta_0$.

**Algorithm:**

```
Initialize theta=0, theta_0=0
for t = 1 to T:
    for i = 1 to N:
        if y_i * (theta * x_i + theta_0) < 1:
            theta   -= eta * (lambda * theta - y_i * x_i)
            theta_0 += eta * y_i
        else:
            theta   -= eta * (lambda * theta)
```
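A minimal sketch of this training loop in Python/NumPy, assuming the data is stored as a matrix `X` of shape `(N, d)` and the labels `y` take values in $\{-1, +1\}$. The function name `svm_sgd` and the default hyperparameters are illustrative choices, not part of the derivation above.

```python
import numpy as np

def svm_sgd(X, y, lam=0.01, eta=0.1, T=100):
    """Stochastic (sub)gradient descent for the SVM objective
    J = (1/n) * sum hinge(y_i * (theta . x_i + theta_0)) + (lam/2) * ||theta||^2."""
    n, d = X.shape
    theta = np.zeros(d)
    theta_0 = 0.0
    for t in range(T):
        for i in range(n):
            z = y[i] * (X[i] @ theta + theta_0)
            if z < 1:
                # inside the margin or misclassified: hinge-loss gradient is -y_i * x_i
                theta -= eta * (lam * theta - y[i] * X[i])
                theta_0 += eta * y[i]  # derivative w.r.t. theta_0 is -y_i, hence the plus sign
            else:
                # hinge loss is 0 here; only the regularization term contributes
                theta -= eta * (lam * theta)
    return theta, theta_0

# Usage example on a tiny, linearly separable toy set (made-up data)
X = np.array([[2.0, 2.0], [1.5, 3.0], [-1.0, -1.5], [-2.0, -0.5]])
y = np.array([1, 1, -1, -1])
theta, theta_0 = svm_sgd(X, y)
print(theta, theta_0)
print(np.sign(X @ theta + theta_0))  # expected to match y for this separable toy set
```

Updating per example keeps each step cheap; the $\frac{1}{n}$ averaging of the batch objective is traded for many small noisy steps, which is the usual stochastic gradient descent trade-off.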