The *Passive-Aggressive Algorithm* is a modification of the standard [[Perceptron Algorithm]]. It was designed to improve performance on linearly separable datasets by incorporating an adaptive learning rate and making an update only when the current example incurs a positive hinge loss.
## Objective Function
The algorithm (without offset) updates the parameter $\theta$ by minimizing the following objective function:
$ J(\theta)=\frac{\lambda}{2} \lVert \theta - \theta^{(k)} \rVert^2+ \text{Loss}_h(y^{(i)}\theta \cdot x^{(i)}) $
where:
- $\theta^{(k)}$: Current value of the parameter
- $\lambda$: The regularization parameter controlling the aggressiveness of the update.
- $\text{Loss}_h$: The [[Support Vector Machine#Hinge Loss|Hinge Loss]]
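As a concrete illustration, here is a minimal Python sketch (using NumPy) that evaluates this objective for a single training example. The function name `pa_objective` and the variable names are illustrative assumptions, and the hinge loss is taken to be $\text{Loss}_h(z) = \max(0, 1 - z)$ as in [[Support Vector Machine#Hinge Loss|Hinge Loss]]:
```python
import numpy as np

def pa_objective(theta, theta_k, x_i, y_i, lam):
    """Objective J(theta) for one example (x_i, y_i), relative to the current parameter theta_k."""
    regularization = 0.5 * lam * np.linalg.norm(theta - theta_k) ** 2
    hinge_loss = max(0.0, 1.0 - y_i * np.dot(theta, x_i))
    return regularization + hinge_loss
```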
## Gradient Derivation
The hinge loss is defined the same way as in [[Support Vector Machine|SVM]]. The larger $\lambda$ is, the smaller the update of $\theta$ that this algorithm makes; we can observe this by taking the derivative of the objective function and setting it to zero.
The gradient of a sum can be expressed as the sum of gradients.
$
\frac{\partial}{\partial \theta}
\left(\frac{\lambda}{2} \lVert \theta - \theta^{(k)} \rVert^2+
\text{Loss}_h(y^{(i)}\theta \cdot x^{(i)})\right) =
\frac{\partial}{\partial \theta} \left(\frac{\lambda}{2} \lVert \theta - \theta^{(k)} \rVert^2 \right) +
\frac{\partial}{\partial \theta} \left( \text{Loss}_h(y^{(i)}\theta \cdot x^{(i)})\right)
$
Derivative of the regularization:
$ \frac{\partial}{\partial \theta} \left(\frac{\lambda}{2} \lVert \theta - \theta^{(k)} \rVert^2 \right) = \lambda(\theta- \theta^{(k)}) $
Derivative of the hinge loss:
$ \frac{\partial}{\partial \theta} \left( \text{Loss}_h(y^{(i)}\theta \cdot x^{(i)})\right) = -y^{(i)}\cdot x^{(i)}$
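Strictly speaking, the hinge loss is not differentiable everywhere, so this expression is its (sub)gradient in the case where the loss is active. Written out piecewise:
$
\frac{\partial}{\partial \theta} \left( \text{Loss}_h(y^{(i)}\theta \cdot x^{(i)})\right) =
\begin{cases}
-y^{(i)}\cdot x^{(i)} & \text{if } y^{(i)}\theta \cdot x^{(i)} < 1 \\[6pt]
0 & \text{if } y^{(i)}\theta \cdot x^{(i)} > 1
\end{cases}
$
If the loss is zero, the objective is minimized by $\theta = \theta^{(k)}$ and no update occurs, so the derivation below covers the case of a positive loss.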
Combining terms, setting the gradient to zero, and solving for $\theta$:
$
\begin{align}
\lambda(\theta- \theta^{(k)})-(y^{(i)}\cdot x^{(i)})&\stackrel{!}{=}0 \\[12pt]
\lambda (\theta - \theta^{(k)}) &= y^{(i)}\cdot x ^{(i)} \\[6pt]
\theta &= \theta^{(k)}+ \frac{y^{(i)} \cdot x^{(i)}}{\lambda}
\end{align}
$
So we can say that $\frac{1}{\lambda}$ is the learning rate $\eta$ of this algorithm.
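Putting the pieces together, here is a minimal Python sketch (NumPy) of the resulting update step; the function name `pa_update` and the toy data are illustrative assumptions, and the update is applied only when the hinge loss is positive:
```python
import numpy as np

def pa_update(theta, x_i, y_i, lam):
    """One Passive-Aggressive step: update theta only if the hinge loss is active."""
    if y_i * np.dot(theta, x_i) < 1:       # positive hinge loss
        theta = theta + (y_i * x_i) / lam  # step size eta = 1 / lambda
    return theta                           # otherwise stay passive

# Toy usage: one pass over a tiny dataset (values chosen arbitrarily).
theta = np.zeros(2)
X = np.array([[1.0, 2.0], [-1.5, 0.5], [2.0, -1.0]])
y = np.array([1, -1, 1])
for x_i, y_i in zip(X, y):
    theta = pa_update(theta, x_i, y_i, lam=2.0)
```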