The Support Vector Machine ("SVM") extends the [[Perceptron Algorithm]] by handling training sets that are not linearly separable. It also returns a more generalizable [[Linear Classifier#Decision Boundary|Decision Boundary]] by maximizing the margin around the decision boundary.
**Decision boundary:** The set of points $x$ which satisfy:
$ \theta \cdot x + \theta_0 = 0 $
**Margin boundary:** The set of points $x$ which satisfy:
$ \theta \cdot x + \theta_0 = \pm 1 $
![[support-vector-machine.png|center|400]]
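As a quick sanity check, the sketch below evaluates $\theta \cdot x + \theta_0$ for a few points, using hypothetical parameters $\theta = (2, 1)$ and $\theta_0 = -1$ chosen only for illustration: the value is $0$ on the decision boundary and $\pm 1$ on the margin boundaries.
```python
import numpy as np

# Hypothetical parameters, chosen only for illustration.
theta = np.array([2.0, 1.0])
theta_0 = -1.0

def boundary_value(x):
    """theta · x + theta_0: 0 on the decision boundary, +1 / -1 on the margin boundaries."""
    return np.dot(theta, x) + theta_0

print(boundary_value(np.array([0.5, 0.0])))  # 0.0  -> on the decision boundary
print(boundary_value(np.array([1.0, 0.0])))  # +1.0 -> on the positive margin boundary
print(boundary_value(np.array([0.0, 0.0])))  # -1.0 -> on the negative margin boundary
```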
## Distance to Decision Boundary
To measure the distance of a point $x^{(i)}$ to the decision boundary, we also want to know whether $x^{(i)}$ lies on the correct side of the boundary, so we incorporate its label $y^{(i)}$. To make the distance invariant to the scale of $\theta$, we divide by the [[Vector Length#Norm Notation|Vector Norm]] $\lVert \theta \rVert$.
$ \gamma(\theta, \theta_0)=\frac{y^{(i)}(\theta \cdot x^{(i)}+\theta_0)}{\lVert \theta \rVert} $
Here $\gamma$ denotes the (signed) distance of the data point $x^{(i)}$ to the decision boundary: positive if the point lies on the correct side, negative if it is misclassified.
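A minimal sketch of this formula, assuming NumPy and hypothetical parameter values:
```python
import numpy as np

def signed_distance(theta, theta_0, x, y):
    """Signed distance gamma of a labeled point (x, y) to the decision boundary."""
    return y * (np.dot(theta, x) + theta_0) / np.linalg.norm(theta)

# Hypothetical parameters and points, for illustration only.
theta, theta_0 = np.array([2.0, 1.0]), -1.0
print(signed_distance(theta, theta_0, np.array([2.0, 1.0]), +1))  # positive: correct side
print(signed_distance(theta, theta_0, np.array([2.0, 1.0]), -1))  # negative: wrong side
```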
**Points on the Margin Boundaries:**
- If $y^{(i)}=+1$ and $x^{(i)}$ lies on the *positive margin boundary*, both factors of the numerator equal $1$, so the numerator is $1$.
- If $y^{(i)}=-1$ and $x^{(i)}$ lies on the *negative margin boundary*, both factors equal $-1$, so the numerator is again $1$.
Thus, every point lying on its margin boundary has a numerator of $1$, and its distance $\gamma_{\text{ref}}$ to the decision boundary is $\frac{1}{\lVert \theta \rVert}$.
$ \gamma_{\text{ref}}(\theta, \theta_0) = \frac{1}{\lVert \theta \rVert} $
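A small numeric check, again with hypothetical parameters, that a point on its margin boundary has signed distance exactly $\frac{1}{\lVert \theta \rVert}$:
```python
import numpy as np

# Hypothetical parameters; x_margin lies on the positive margin boundary (theta·x + theta_0 = +1).
theta, theta_0 = np.array([2.0, 1.0]), -1.0
x_margin, y_margin = np.array([1.0, 0.0]), +1

gamma = y_margin * (np.dot(theta, x_margin) + theta_0) / np.linalg.norm(theta)
gamma_ref = 1.0 / np.linalg.norm(theta)
print(np.isclose(gamma, gamma_ref))   # True
```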
## Hinge Loss
Hinge loss penalizes points based on their position relative to the margin:
1. If $x^{(i)}$ lies on or outside its margin boundary on the correct side, no penalty is applied.
2. If $x^{(i)}$ lies within the margin or on the wrong side of the decision boundary, a penalty is incurred.
The hinge loss is applied to the ratio $\frac{\gamma}{\gamma_{\text{ref}}}$, where the $\lVert \theta \rVert$ factors cancel:
$
\text{Loss}_h\Big(\frac{\gamma}{\gamma_{\text{ref}}}\Big) =\text{Loss}_h\left( \frac{ \frac{y^{(i)}(\theta \cdot x^{(i)}+\theta_0)}{\lVert\theta \rVert}}{ \frac{1}{\lVert\theta \rVert}}\right) = \text{Loss}_h \Big(y^{(i)}(\theta \cdot x^{(i)}+\theta_0)\Big)
$
This can be written as a case statement:
$
\text{Loss}_h\big( \underbrace{y^{(i)}(\theta \cdot x^{(i)}+\theta_0)}_{z}\big)= \begin{cases} 0 & \text{if } z \ge 1\\ 1-z & \text{if } z <1\\ \end{cases}
$
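A minimal sketch of this case statement, reusing the same hypothetical parameters as above:
```python
import numpy as np

def hinge_loss(theta, theta_0, x, y):
    """Hinge loss of one labeled point: 0 if z = y(theta·x + theta_0) >= 1, else 1 - z."""
    z = y * (np.dot(theta, x) + theta_0)
    return max(0.0, 1.0 - z)

theta, theta_0 = np.array([2.0, 1.0]), -1.0                  # hypothetical parameters
print(hinge_loss(theta, theta_0, np.array([2.0, 1.0]), +1))  # z = 4   -> loss 0 (outside margin)
print(hinge_loss(theta, theta_0, np.array([0.6, 0.0]), +1))  # z = 0.2 -> loss 0.8 (inside margin)
print(hinge_loss(theta, theta_0, np.array([2.0, 1.0]), -1))  # z = -4  -> loss 5 (wrong side)
```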
## Regularization
To maximize the margin width $\frac{1}{\lVert \theta \rVert}$, we equivalently minimize $\lVert \theta \rVert$. For mathematical convenience, the objective uses $\frac{1}{2} \lVert \theta \rVert^2$ instead, which has the same minimizer.
$ \max \left(\frac{1}{\lVert\theta \rVert}\right) \equiv \min \big(\lVert\theta \rVert\big) \implies \min \Big(\frac{1}{2} \lVert\theta \rVert^2 \Big) $
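One reason the factor $\frac{1}{2}$ is convenient: the gradient of the regularizer then reduces to $\theta$ itself, which keeps gradient-based updates clean.
$ \nabla_\theta \Big(\frac{1}{2} \lVert \theta \rVert^2\Big) = \theta $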
## Objective Function
We *combine the hinge loss* with the *regularization* term to obtain an objective function that we want to minimize. The regularization parameter $\lambda$ controls the trade-off between a small loss and a wide margin:
- *High lambda:* Prioritizes wider margins (more regularization).
- *Low lambda:* Focuses on minimizing classification errors (less regularization).
$
J(\theta, \theta_0)= \frac{\lambda}{2}\lVert \theta \rVert^2 + \frac{1}{n}\sum_{i=1}^n \text{Loss}_h\left(y^{(i)}(\theta \cdot x^{(i)}+\theta_0)\right)
$
^6f5abf
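A compact sketch of this objective, assuming NumPy and a tiny hypothetical training set (the parameter values and data points are illustrative only):
```python
import numpy as np

def svm_objective(theta, theta_0, X, y, lam):
    """J(theta, theta_0) = (lam/2) * ||theta||^2 + average hinge loss over the training set."""
    z = y * (X @ theta + theta_0)                    # agreements y^(i)(theta · x^(i) + theta_0)
    avg_hinge = np.mean(np.maximum(0.0, 1.0 - z))    # hinge losses, averaged over n points
    return lam / 2.0 * np.dot(theta, theta) + avg_hinge

# Tiny hypothetical training set, for illustration only.
X = np.array([[2.0, 1.0], [0.0, 0.0], [0.6, 0.0]])
y = np.array([+1, -1, +1])
print(svm_objective(np.array([2.0, 1.0]), -1.0, X, y, lam=0.5))
```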