The Support Vector Machine ("SVM") extends the [[Perceptron Algorithm]] by handling training sets that are not linearly separable. It also returns a more generalizable [[Linear Classifier#Decision Boundary|Decision Boundary]] by maximizing the margin around the decision boundary.
**Decision boundary:** The set of points $x$ which satisfy:
$ \theta \cdot x + \theta_0 = 0 $
**Margin boundary:** The set of points $x$ which satisfy:
$ \theta \cdot x + \theta_0 = \pm 1 $
![[support-vector-machine.png|center|400]]
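As a quick sanity check, the sketch below evaluates $\theta \cdot x + \theta_0$ for a few points, using hypothetical parameters $\theta = (2, 1)$ and $\theta_0 = -1$ chosen only for illustration: the value is $0$ on the decision boundary and $\pm 1$ on the margin boundaries.
```python
import numpy as np

# Hypothetical parameters, chosen only for illustration.
theta = np.array([2.0, 1.0])
theta_0 = -1.0

def boundary_value(x):
    """theta · x + theta_0: 0 on the decision boundary, +1 / -1 on the margin boundaries."""
    return np.dot(theta, x) + theta_0

print(boundary_value(np.array([0.5, 0.0])))  # 0.0  -> on the decision boundary
print(boundary_value(np.array([1.0, 0.0])))  # +1.0 -> on the positive margin boundary
print(boundary_value(np.array([0.0, 0.0])))  # -1.0 -> on the negative margin boundary
```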
## Distance to Decision Boundary
To measure the distance of a point $x^{(i)}$ to the decision boundary, we also want to know whether $x^{(i)}$ lies on the correct side of the boundary, so we incorporate its label $y^{(i)}$. To make the distance invariant to the scale of $\theta$, we divide by the [[Vector Length#Norm Notation|Vector Norm]] $\lVert \theta \rVert$.
$ \gamma(\theta, \theta_0)=\frac{y^{(i)}(\theta \cdot x^{(i)}+\theta_0)}{\lVert \theta \rVert} $
Here $\gamma$ denotes the (signed) distance of the data point $x^{(i)}$ to the decision boundary: positive if the point lies on the correct side, negative if it is misclassified.
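A minimal sketch of this formula, assuming NumPy and hypothetical parameter values:
```python
import numpy as np

def signed_distance(theta, theta_0, x, y):
    """Signed distance gamma of a labeled point (x, y) to the decision boundary."""
    return y * (np.dot(theta, x) + theta_0) / np.linalg.norm(theta)

# Hypothetical parameters and points, for illustration only.
theta, theta_0 = np.array([2.0, 1.0]), -1.0
print(signed_distance(theta, theta_0, np.array([2.0, 1.0]), +1))  # positive: correct side
print(signed_distance(theta, theta_0, np.array([2.0, 1.0]), -1))  # negative: wrong side
```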
**Points on the Margin Boundaries:**
- If $y^{(i)}=+1$ and $x^{(i)}$ lies on the *positive margin boundary*, both factors of the numerator equal $1$, so the numerator is $1$.
- If $y^{(i)}=-1$ and $x^{(i)}$ lies on the *negative margin boundary*, both factors equal $-1$, so the numerator is again $1$.
Thus, every point lying on its margin boundary has a numerator of $1$, and its distance $\gamma_{\text{ref}}$ to the decision boundary is $\frac{1}{\lVert \theta \rVert}$.
$ \gamma_{\text{ref}}(\theta, \theta_0) = \frac{1}{\lVert \theta \rVert} $
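A small numeric check, again with hypothetical parameters, that a point on its margin boundary has signed distance exactly $\frac{1}{\lVert \theta \rVert}$:
```python
import numpy as np

# Hypothetical parameters; x_margin lies on the positive margin boundary (theta·x + theta_0 = +1).
theta, theta_0 = np.array([2.0, 1.0]), -1.0
x_margin, y_margin = np.array([1.0, 0.0]), +1

gamma = y_margin * (np.dot(theta, x_margin) + theta_0) / np.linalg.norm(theta)
gamma_ref = 1.0 / np.linalg.norm(theta)
print(np.isclose(gamma, gamma_ref))   # True
```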
## Hinge Loss
Hinge loss penalizes points based on their position relative to the margin:
1. If $x^{(i)}$ lies on or outside its margin boundary on the correct side, no penalty is applied.
2. If $x^{(i)}$ lies within the margin or on the wrong side of the decision boundary, a penalty is incurred.
The hinge loss is applied to the ratio $\frac{\gamma}{\gamma_{\text{ref}}}$, where the $\lVert \theta \rVert$ factors cancel:
$
\text{Loss}_h\Big(\frac{\gamma}{\gamma_{\text{ref}}}\Big) =\text{Loss}_h\left( \frac{ \frac{y^{(i)}(\theta \cdot x^{(i)}+\theta_0)}{\lVert\theta \rVert}}{ \frac{1}{\lVert\theta \rVert}}\right) = \text{Loss}_h \Big(y^{(i)}(\theta \cdot x^{(i)}+\theta_0)\Big)
$
This can be written as a case statement:
$
\text{Loss}_h\big( \underbrace{y^{(i)}(\theta \cdot x^{(i)}+\theta_0)}_{z}\big)= \begin{cases} 0 & \text{if } z \ge 1\\ 1-z & \text{if } z <1\\ \end{cases}
$
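A minimal sketch of this case statement, reusing the same hypothetical parameters as above:
```python
import numpy as np

def hinge_loss(theta, theta_0, x, y):
    """Hinge loss of one labeled point: 0 if z = y(theta·x + theta_0) >= 1, else 1 - z."""
    z = y * (np.dot(theta, x) + theta_0)
    return max(0.0, 1.0 - z)

theta, theta_0 = np.array([2.0, 1.0]), -1.0                  # hypothetical parameters
print(hinge_loss(theta, theta_0, np.array([2.0, 1.0]), +1))  # z = 4   -> loss 0 (outside margin)
print(hinge_loss(theta, theta_0, np.array([0.6, 0.0]), +1))  # z = 0.2 -> loss 0.8 (inside margin)
print(hinge_loss(theta, theta_0, np.array([2.0, 1.0]), -1))  # z = -4  -> loss 5 (wrong side)
```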
## Regularization
To maximize the margin width $\frac{1}{\lVert \theta \rVert}$, we equivalently minimize $\lVert \theta \rVert$. For mathematical convenience, the objective uses $\frac{1}{2} \lVert \theta \rVert^2$ instead, which has the same minimizer.
$ \max \left(\frac{1}{\lVert\theta \rVert}\right) \equiv \min \big(\lVert\theta \rVert\big) \implies \min \Big(\frac{1}{2} \lVert\theta \rVert^2 \Big) $
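One reason the factor $\frac{1}{2}$ is convenient: the gradient of the regularizer then reduces to $\theta$ itself, which keeps gradient-based updates clean.
$ \nabla_\theta \Big(\frac{1}{2} \lVert \theta \rVert^2\Big) = \theta $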
## Objective Function
We *combine the hinge loss* with the *regularization* term to obtain an objective function that we want to minimize. The regularization parameter $\lambda$ controls the trade-off between a small loss and a wide margin:
- *High lambda:* Prioritizes wider margins (more regularization).
- *Low lambda:* Focuses on minimizing classification errors (less regularization).
$
J(\theta, \theta_0)= \frac{\lambda}{2}\lVert \theta \rVert^2 + \frac{1}{n}\sum_{i=1}^n \text{Loss}_h\left(y^{(i)}(\theta \cdot x^{(i)}+\theta_0)\right)
$
^6f5abf
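A compact sketch of this objective, assuming NumPy and a tiny hypothetical training set (the parameter values and data points are illustrative only):
```python
import numpy as np

def svm_objective(theta, theta_0, X, y, lam):
    """J(theta, theta_0) = (lam/2) * ||theta||^2 + average hinge loss over the training set."""
    z = y * (X @ theta + theta_0)                    # agreements y^(i)(theta · x^(i) + theta_0)
    avg_hinge = np.mean(np.maximum(0.0, 1.0 - z))    # hinge losses, averaged over n points
    return lam / 2.0 * np.dot(theta, theta) + avg_hinge

# Tiny hypothetical training set, for illustration only.
X = np.array([[2.0, 1.0], [0.0, 0.0], [0.6, 0.0]])
y = np.array([+1, -1, +1])
print(svm_objective(np.array([2.0, 1.0]), -1.0, X, y, lam=0.5))
```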