Softmax is a generalization of the sigmoid activation function used for multi-class classification problems. It maps raw prediction scores for $k$ classes to probabilities, ensuring the output satisfies the [[Probability Axioms]] of non-negativity and normalization.
## Sigmoid vs. Softmax
- **Sigmoid function:** An activation function used for binary classification. With labels encoded as classes $\{0,1\}$, it returns the probability of class $1$.
$ \sigma(z) = \frac{1}{1+\exp(-z)} $
![[sigmoid.png|center|400]]
- **Softmax:** Extends sigmoid to handle multi-class classification. It maps a vector of $k$ raw scores to probabilities for $k$ classes.
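To make the connection concrete, here is a minimal NumPy sketch (not part of the original note; the score $z = 1.3$ and the function names are illustrative) showing that, for two classes with raw scores $[z, 0]$, softmax reduces to the sigmoid of $z$:

```python
import numpy as np

def sigmoid(z):
    # Probability of class 1 for a single raw score z.
    return 1.0 / (1.0 + np.exp(-z))

def softmax(scores):
    # Map a vector of raw scores to probabilities that sum to 1.
    exp_scores = np.exp(scores)
    return exp_scores / exp_scores.sum()

z = 1.3  # arbitrary example score
# With two classes and the class-0 score fixed at 0, softmax reduces to sigmoid:
# exp(z) / (exp(z) + exp(0)) = 1 / (1 + exp(-z))
print(sigmoid(z))                      # ≈ 0.786
print(softmax(np.array([z, 0.0]))[0])  # same value
```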
## Softmax Function
We evaluate a single observation $x$ consisting of $d$ features $(x \in \mathbb R^d)$. To obtain a prediction score for class $j$, we take the inner product $x \cdot \theta_j$, where $\theta_j$ is the $1 \times d$ learned weight vector for class $j$. This returns the prediction score $z_j$ as a scalar $\in \mathbb R$.
$ z_j = \overbrace{ \begin{bmatrix} x^{(0)}\\ \vdots \\[3pt] x^{(d-1)} \end{bmatrix}}^{x} \cdot \overbrace{ \begin{bmatrix} \theta_j^{(0)}\\ \vdots \\[3pt] \theta_j^{(d-1)} \end{bmatrix}}^{\theta_j} $
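A minimal NumPy sketch of this step, assuming toy sizes ($d = 4$ features, $k = 3$ classes) and random illustrative weights:

```python
import numpy as np

rng = np.random.default_rng(0)

d, k = 4, 3                      # toy sizes: 4 features, 3 classes
x = rng.normal(size=d)           # one observation, x ∈ R^d
Theta = rng.normal(size=(k, d))  # row j holds the weight vector θ_j

z = Theta @ x                    # z_j = θ_j · x for every class at once
print(z.shape)                   # (3,) -- one raw score per class
```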
- *Input:* $1 \times k$ vector, where element $j$ is the raw score $z_j \in \mathbb R$ for class $j$.
- *Output:* $1 \times k$ vector, where element $j$ is the probability of belonging to class $j$.
$
\underbrace{ \begin{bmatrix} z_0 \\ \vdots \\ z_{k-1} \end{bmatrix} }_{\text{Input}}
\mapsto \underbrace{ \begin{bmatrix} \exp(z_0) \\ \vdots \\ \exp(z_{k-1}) \end{bmatrix} * \frac{1}{\sum_{j=0}^{k-1} \exp(z_j)} }_{\text{Output}}
$
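A sketch of this mapping in NumPy; subtracting the maximum score before exponentiating is a common numerical-stability trick and, since softmax is shift-invariant, does not change the output (the example scores are illustrative):

```python
import numpy as np

def softmax(z):
    """Map a vector of k raw scores to k probabilities."""
    # Subtracting the max score is a common numerical-stability trick;
    # softmax is shift-invariant, so the result is unchanged.
    exp_z = np.exp(z - np.max(z))
    return exp_z / exp_z.sum()

z = np.array([2.0, 1.0, 0.1])  # example raw scores for k = 3 classes
p = softmax(z)
print(p)        # non-negative entries
print(p.sum())  # sums to 1.0
```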
## Softmax Temperature
We can tune the returned probabilities with this hyperparameter. It divides each prediction score $z_j = \theta_j \cdot x$ by the temperature $\tau$ before taking the exponential.
$
h(x) =
\begin{bmatrix} \exp(z_0 / \tau) \\ \vdots \\ \exp(z_{k-1} / \tau) \end{bmatrix} * \frac{1}{\sum_{j=0}^{k-1} \exp(z_j/\tau)}
$
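A sketch of the temperature-scaled version, again in NumPy with illustrative scores; `softmax_temperature` and the chosen $\tau$ values are only examples:

```python
import numpy as np

def softmax_temperature(z, tau=1.0):
    """Softmax with temperature: divide each score by tau before exponentiating."""
    z = np.asarray(z, dtype=float) / tau
    exp_z = np.exp(z - np.max(z))  # max-subtraction for numerical stability
    return exp_z / exp_z.sum()

z = [2.0, 1.0, 0.1]
print(softmax_temperature(z, tau=1.0))   # standard softmax
print(softmax_temperature(z, tau=0.1))   # sharper: mass concentrates on the top class
print(softmax_temperature(z, tau=10.0))  # flatter: closer to uniform
```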
**Effects of tau:**
- $\tau = 1$: Standard Softmax behavior.
- $\tau \to 0^+$: Sharpens probabilities, focusing on the class with the highest score.
- $\tau \to \infty$: Flattens probabilities, making them more uniform.
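As a quick numerical illustration, take two classes with scores $z = (2, 0)$; applying $h(x)$ at three temperatures gives, rounded:
$ \tau = 0.5: (0.98,\ 0.02) \qquad \tau = 1: (0.88,\ 0.12) \qquad \tau = 5: (0.60,\ 0.40) $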
**Applications:**
- *Lower tau:* Useful for confident predictions when classes are well-separated.
- *Higher tau:* Encourages exploration, which can be advantageous for imbalanced datasets.