The *Passive-Aggressive Algorithm* is a modification of the standard [[Perceptron Algorithm]]. It was designed to improve performance on linearly separable datasets by incorporating an adaptive learning rate and making an update only when the current example incurs a positive hinge loss.
## Objective Function
The algorithm (without offset) updates the parameter $\theta$ by minimizing the following objective function:
$ J(\theta)=\frac{\lambda}{2} \lVert \theta - \theta^{(k)} \rVert^2+ \text{Loss}_h(y^{(i)}\theta \cdot x^{(i)}) $
where:
- $\theta^{(k)}$: Current value of the parameter
- $\lambda$: The regularization parameter controlling the aggressiveness of the update.
- $\text{Loss}_h$: The [[Support Vector Machine#Hinge Loss|Hinge Loss]]
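As a concrete illustration, here is a minimal Python sketch (using NumPy) that evaluates this objective for a single training example. The function name `pa_objective` and the variable names are illustrative assumptions, and the hinge loss is taken to be $\text{Loss}_h(z) = \max(0, 1 - z)$ as in [[Support Vector Machine#Hinge Loss|Hinge Loss]]:
```python
import numpy as np

def pa_objective(theta, theta_k, x_i, y_i, lam):
    """Objective J(theta) for one example (x_i, y_i), relative to the current parameter theta_k."""
    regularization = 0.5 * lam * np.linalg.norm(theta - theta_k) ** 2
    hinge_loss = max(0.0, 1.0 - y_i * np.dot(theta, x_i))
    return regularization + hinge_loss
```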
## Gradient Derivation
The hinge loss is defined the same way as in [[Support Vector Machine|SVM]]. The larger $\lambda$ is, the smaller the update of $\theta$ that this algorithm makes; we can observe this by taking the derivative of the objective function and setting it to zero.
The gradient of a sum can be expressed as the sum of gradients.
$
\frac{\partial}{\partial \theta}
\left(\frac{\lambda}{2} \lVert \theta - \theta^{(k)} \rVert^2+
\text{Loss}_h(y^{(i)}\theta \cdot x^{(i)})\right) =
\frac{\partial}{\partial \theta} \left(\frac{\lambda}{2} \lVert \theta - \theta^{(k)} \rVert^2 \right) +
\frac{\partial}{\partial \theta} \left( \text{Loss}_h(y^{(i)}\theta \cdot x^{(i)})\right)
$
Derivative of the regularization:
$ \frac{\partial}{\partial \theta} \left(\frac{\lambda}{2} \lVert \theta - \theta^{(k)} \rVert^2 \right) = \lambda(\theta- \theta^{(k)}) $
Derivative of the hinge loss:
$ \frac{\partial}{\partial \theta} \left( \text{Loss}_h(y^{(i)}\theta \cdot x^{(i)})\right) = -y^{(i)}\cdot x^{(i)}$
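Strictly speaking, the hinge loss is not differentiable everywhere, so this expression is its (sub)gradient in the case where the loss is active. Written out piecewise:
$
\frac{\partial}{\partial \theta} \left( \text{Loss}_h(y^{(i)}\theta \cdot x^{(i)})\right) =
\begin{cases}
-y^{(i)}\cdot x^{(i)} & \text{if } y^{(i)}\theta \cdot x^{(i)} < 1 \\[6pt]
0 & \text{if } y^{(i)}\theta \cdot x^{(i)} > 1
\end{cases}
$
If the loss is zero, the objective is minimized by $\theta = \theta^{(k)}$ and no update occurs, so the derivation below covers the case of a positive loss.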
Combining terms, setting the gradient to zero, and solving for $\theta$:
$
\begin{align}
\lambda(\theta- \theta^{(k)})-(y^{(i)}\cdot x^{(i)})&\stackrel{!}{=}0 \\[12pt]
\lambda (\theta - \theta^{(k)}) &= y^{(i)}\cdot x ^{(i)} \\[6pt]
\theta &= \theta^{(k)}+ \frac{y^{(i)} \cdot x^{(i)}}{\lambda}
\end{align}
$
So we can say that $\frac{1}{\lambda}$ is the learning rate $\eta$ of this algorithm.
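Putting the pieces together, here is a minimal Python sketch (NumPy) of the resulting update step; the function name `pa_update` and the toy data are illustrative assumptions, and the update is applied only when the hinge loss is positive:
```python
import numpy as np

def pa_update(theta, x_i, y_i, lam):
    """One Passive-Aggressive step: update theta only if the hinge loss is active."""
    if y_i * np.dot(theta, x_i) < 1:       # positive hinge loss
        theta = theta + (y_i * x_i) / lam  # step size eta = 1 / lambda
    return theta                           # otherwise stay passive

# Toy usage: one pass over a tiny dataset (values chosen arbitrarily).
theta = np.zeros(2)
X = np.array([[1.0, 2.0], [-1.5, 0.5], [2.0, -1.0]])
y = np.array([1, -1, 1])
for x_i, y_i in zip(X, y):
    theta = pa_update(theta, x_i, y_i, lam=2.0)
```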