Softmax is a generalization of the sigmoid activation function used for multi-class classification problems. It maps raw prediction scores for $k$ classes to probabilities, ensuring the output satisfies the [[Probability Axioms]] of non-negativity and normalization.
## Sigmoid vs. Softmax
- **Sigmoid function:** An activation function used for binary classification. With labels encoded as classes $\{0,1\}$, it returns the probability of class $1$.
$ \sigma(z) = \frac{1}{1+\exp(-z)} $
![[sigmoid.png|center|400]]
- **Softmax:** Extends sigmoid to handle multi-class classification. It maps a vector of $k$ raw scores to probabilities for $k$ classes.
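To make the connection concrete, here is a minimal NumPy sketch (not part of the original note; the score $z = 1.3$ and the function names are illustrative) showing that, for two classes with raw scores $[z, 0]$, softmax reduces to the sigmoid of $z$:

```python
import numpy as np

def sigmoid(z):
    # Probability of class 1 for a single raw score z.
    return 1.0 / (1.0 + np.exp(-z))

def softmax(scores):
    # Map a vector of raw scores to probabilities that sum to 1.
    exp_scores = np.exp(scores)
    return exp_scores / exp_scores.sum()

z = 1.3  # arbitrary example score
# With two classes and the class-0 score fixed at 0, softmax reduces to sigmoid:
# exp(z) / (exp(z) + exp(0)) = 1 / (1 + exp(-z))
print(sigmoid(z))                      # ≈ 0.786
print(softmax(np.array([z, 0.0]))[0])  # same value
```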
## Softmax Function
We evaluate a single observation $x$ consisting of $d$ features $(x \in \mathbb R^d)$. To obtain a prediction score for class $j$, we take the inner product $x \cdot \theta_j$, where $\theta_j$ is the $1 \times d$ learned weight vector for class $j$. This returns the prediction score $z_j$ as a scalar $\in \mathbb R$.
$ z_j = \overbrace{ \begin{bmatrix} x^{(0)}\\ \vdots \\[3pt] x^{(d-1)} \end{bmatrix}}^{x} \cdot \overbrace{ \begin{bmatrix} \theta_j^{(0)}\\ \vdots \\[3pt] \theta_j^{(d-1)} \end{bmatrix}}^{\theta_j} $
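A minimal NumPy sketch of this step, assuming toy sizes ($d = 4$ features, $k = 3$ classes) and random illustrative weights:

```python
import numpy as np

rng = np.random.default_rng(0)

d, k = 4, 3                      # toy sizes: 4 features, 3 classes
x = rng.normal(size=d)           # one observation, x ∈ R^d
Theta = rng.normal(size=(k, d))  # row j holds the weight vector θ_j

z = Theta @ x                    # z_j = θ_j · x for every class at once
print(z.shape)                   # (3,) -- one raw score per class
```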
- *Input:* $1 \times k$ vector, where element $j$ is the raw score $z_j \in \mathbb R$ for class $j$.
- *Output:* $1 \times k$ vector, where element $j$ is the probability of belonging to class $j$.
$
\underbrace{ \begin{bmatrix} z_0 \\ \vdots \\ z_{k-1} \end{bmatrix} }_{\text{Input}}
\mapsto \underbrace{ \begin{bmatrix} \exp(z_0) \\ \vdots \\ \exp(z_{k-1}) \end{bmatrix} * \frac{1}{\sum_{j=0}^{k-1} \exp(z_j)} }_{\text{Output}}
$
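A sketch of this mapping in NumPy; subtracting the maximum score before exponentiating is a common numerical-stability trick and, since softmax is shift-invariant, does not change the output (the example scores are illustrative):

```python
import numpy as np

def softmax(z):
    """Map a vector of k raw scores to k probabilities."""
    # Subtracting the max score is a common numerical-stability trick;
    # softmax is shift-invariant, so the result is unchanged.
    exp_z = np.exp(z - np.max(z))
    return exp_z / exp_z.sum()

z = np.array([2.0, 1.0, 0.1])  # example raw scores for k = 3 classes
p = softmax(z)
print(p)        # non-negative entries
print(p.sum())  # sums to 1.0
```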
## Softmax Temperature
We can tune the returned probabilities with this hyperparameter. It divides each prediction score $z_j = \theta_j \cdot x$ by the temperature $\tau$ before taking the exponential.
$
h(x) =
\begin{bmatrix} \exp(z_0 / \tau) \\ \vdots \\ \exp(z_{k-1} / \tau) \end{bmatrix} * \frac{1}{\sum_{j=0}^{k-1} \exp(z_j/\tau)}
$
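A sketch of the temperature-scaled version, again in NumPy with illustrative scores; `softmax_temperature` and the chosen $\tau$ values are only examples:

```python
import numpy as np

def softmax_temperature(z, tau=1.0):
    """Softmax with temperature: divide each score by tau before exponentiating."""
    z = np.asarray(z, dtype=float) / tau
    exp_z = np.exp(z - np.max(z))  # max-subtraction for numerical stability
    return exp_z / exp_z.sum()

z = [2.0, 1.0, 0.1]
print(softmax_temperature(z, tau=1.0))   # standard softmax
print(softmax_temperature(z, tau=0.1))   # sharper: mass concentrates on the top class
print(softmax_temperature(z, tau=10.0))  # flatter: closer to uniform
```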
**Effects of tau:**
- $\tau = 1$: Standard Softmax behavior.
- $\tau \to 0^+$: Sharpens probabilities, focusing on the class with the highest score.
- $\tau \to \infty$: Flattens probabilities, making them more uniform.
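As a quick numerical illustration, take two classes with scores $z = (2, 0)$; applying $h(x)$ at three temperatures gives, rounded:
$ \tau = 0.5: (0.98,\ 0.02) \qquad \tau = 1: (0.88,\ 0.12) \qquad \tau = 5: (0.60,\ 0.40) $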
**Applications:**
- *Lower tau:* Useful for confident predictions when classes are well-separated.
- *Higher tau:* Encourages exploration, which can be advantageous for imbalanced datasets.