## Loss Function
Assume a multiclass classification problem with $k$ classes, like image classification between bananas 🍌 , mangos 🥭 and oranges 🍊. We can include the [[Softmax]] function in the cost function to make our [[Gradient Descent]] updates smoother (compared to a hard $0/1$ output vector).
$
J(\theta) = -\frac{1}{n}\Bigg[
\sum _{i=1}^ n \sum _{j=0}^{k-1} \mathbf 1_{y^{(i)}=j} \log {\frac{e^{\theta _ j \cdot x^{(i)} / \tau }} {\sum _{l=0}^{k-1} e^{\theta _ l \cdot x^{(i)} / \tau }}}\Bigg] + \frac{\lambda }{2}\sum _{j=0}^{k-1}\sum _{i=0}^{d-1} \theta _{ji}^2
$
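To make the formula concrete, here is a minimal NumPy sketch of $J(\theta)$; the function name `softmax_ce_loss` and the $(k, d)$ layout of `theta` (one row $\theta_j$ per class) are assumptions for illustration, not fixed by the notes above.
```python
import numpy as np

def softmax_ce_loss(theta, X, y, tau=1.0, lam=0.0):
    """Regularized softmax cross-entropy loss J(theta); a minimal sketch.

    theta : (k, d) weight matrix, one row theta_j per class
    X     : (n, d) observations x^(i)
    y     : (n,)   integer labels in {0, ..., k-1}
    tau   : temperature
    lam   : regularization strength lambda
    """
    n = X.shape[0]
    scores = X @ theta.T / tau                      # theta_j . x^(i) / tau, shape (n, k)
    scores -= scores.max(axis=1, keepdims=True)     # stabilize the exponentials
    log_probs = scores - np.log(np.exp(scores).sum(axis=1, keepdims=True))
    data_term = -log_probs[np.arange(n), y].mean()  # the indicator picks the correct class
    reg_term = 0.5 * lam * np.sum(theta ** 2)
    return data_term + reg_term
```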
We see that $J(\theta)$ sums over all $n$ observations $i$, and over all $k$ classes for each observation. However, the indicator variable $\mathbf 1_{y^{(i)}=j}$ only activates the log-Softmax term when the current class $j$ is the correct label $y^{(i)}$ for the current observation.
In the following we want to obtain the [[Gradient Descent#Gradient Vector|Gradient Vector]] $\nabla_{\theta_m} J(\theta)$ for the 🥭 class $m$, for which we will need the derivative of the Softmax function.
$
\frac{\partial J}{\partial \theta_m} = -\frac{1}{n}\left[
\sum _{i=1}^ n \sum _{j=0}^{k-1} \mathbf 1_{y^{(i)}=j} \frac{\partial}{\partial \theta_m}\log {\frac{e^{\theta _ j \cdot x^{(i)} / \tau }} {\sum _{l=0}^{k-1} e^{\theta _ l \cdot x^{(i)} / \tau }}}\right] +
\frac{\lambda }{2}\sum _{j=0}^{k-1}\sum _{i=0}^{d-1} \frac{\partial}{\partial \theta_m}\theta _{ji}^2
$
## Derivative of Softmax
Since the Softmax function returns probabilities, we can write it as $\mathbf P$ conditional on the observation $x^{(i)}$ and the parameters $\theta$.
$ \mathbf P(y^{(i)} = j | x^{(i)}, \theta) =\frac{\exp(\theta_j\cdot x^{(i)}/\tau)}{\sum_{l=0}^{k-1}\exp(\theta_l\cdot x^{(i)}/\tau)} $
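As a small sketch (the helper name `softmax_probs` is an assumption), these probabilities can be computed in a numerically stable way by shifting the scores before exponentiating, which cancels in the ratio:
```python
import numpy as np

def softmax_probs(theta, x, tau=1.0):
    """P(y = j | x, theta) for every class j; assumes theta is (k, d) and x is (d,)."""
    z = theta @ x / tau        # scores theta_j . x / tau
    z -= z.max()               # subtracting a constant cancels in the ratio
    e = np.exp(z)
    return e / e.sum()         # (k,) probabilities summing to 1
```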
To differentiate this probability, we need to apply the [[Differentiation Rules#Quotient Rule|Quotient Rule]]. Since we differentiate w.r.t. $\theta_m$, the derivative of the numerator looks different depending on whether the currently evaluated class $j$ is the 🥭 class $m$ or not.
- When $j=m$:
$ \begin{aligned}
u(x) &= \exp(\theta_m\cdot x^{(i)}/\tau) &
u^\prime(x) &=
\exp(\theta_m\cdot x^{(i)}/\tau) * (x^{(i)}/\tau) \\[4pt]
v(x) &= \sum_{l=0}^{k-1}\exp(\theta_l\cdot x^{(i)}/\tau) &
v^\prime(x) &= \exp(\theta_m\cdot x^{(i)}/\tau) * (x^{(i)}/\tau)
\end{aligned} $
- When $j \not = m$:
$ \begin{aligned}
u(x) &= \exp(\theta_j\cdot x^{(i)}/\tau) &
u^\prime(x) &=0 \\[4pt]
v(x) &= \sum_{l=0}^{k-1}\exp(\theta_l\cdot x^{(i)}/\tau) &
v^\prime(x) &= \exp(\theta_m\cdot x^{(i)}/\tau) * (x^{(i)}/\tau)
\end{aligned} $
For shorter notation we will denote the prediction score $z_j = \theta_j \cdot x^{(i)}$ and the probability $\mathbf P_j = \mathbf P(y^{(i)} = j \mid x^{(i)}, \theta)$. Now we apply the quotient rule for both cases; a numerical sanity check follows after the two cases below.
- When $j=m$:
$
\begin{align}
\frac{\partial \mathbf P_m}{\partial \theta_m}&= \frac{
\exp (\frac{z_m}{\tau}) *
\frac{x^{(i)}}{\tau} *
\sum_{l=0}^{k-1}\exp (\frac{z_l}{\tau}) -
\exp (\frac{z_m}{\tau}) *
\exp (\frac{z_m}{\tau}) *
\frac{x^{(i)}}{\tau}
}
{
\left[\sum_{l=0}^{k-1}\exp(\frac{z_l}{\tau})\right]^2
} \\[10pt]
&=\frac {
\exp (\frac{z_m}{\tau}) *
\frac{x^{(i)}}{\tau} *
\left(\sum_{l=0}^{k-1}\exp (\frac{z_l}{\tau})-\exp (\frac{z_m}{\tau} )\right)
}
{\left[\sum_{l=0}^{k-1}\exp (\frac{z_l}{\tau} )\right]^2} \\[10pt]
&=\frac{
\exp \left(\frac{z_m}{\tau} \right) *
\frac{x^{(i)}}{\tau}
}
{\sum_{l=0}^{k-1}\exp \left(\frac{z_l}{\tau} \right)} *
\frac {
\sum_{l=0}^{k-1}\exp \left( \frac{z_l}{\tau} \right)-
\exp \left(\frac{z_m}{\tau} \right)}
{\sum_{l=0}^{k-1}\exp \left (\frac{z_l}{\tau} \right)}\\[10pt]
&=\mathbf P \left(y^{(i)} = m \Big \vert x^{(i)}, \theta \right)*\frac{x^{(i)}}{\tau}* \left(\frac {\sum_{l=0}^{k-1}\exp(\frac{z_l}{\tau})}{\sum_{l=0}^{k-1}\exp(\frac{z_l}{\tau})}-\frac{\exp(\frac{z_m}{\tau})} {\sum_{l=0}^{k-1}\exp(\frac{z_l}{\tau})}\right) \\[10pt]
&=\mathbf P \left(y^{(i)} = m \Big \vert x^{(i)}, \theta \right)*\frac{x^{(i)}}{\tau}* \left(1-\mathbf P \left(y^{(i)} = m \Big \vert x^{(i)}, \theta \right)\right) \\[10pt]
&=\frac{x^{(i)}}{\tau}* \mathbf P_m* (1-\mathbf P_m)
\end{align}
$
- When $j \not = m$:
$
\begin{align}
\frac{\partial \mathbf P_j}{\partial \theta_m} &= \frac {0*\sum_{l=0}^{k-1}\exp(\frac{z_l}{\tau})-\exp(\frac{z_j}{\tau})*\exp(\frac{z_m}{\tau}) * \frac{x^{(i)}}{\tau}} {\Big[\sum_{l=0}^{k-1}\exp(\frac{z_l}{\tau})\Big]^2} \\[10pt]
&=-\frac{\exp(\frac{z_j}{\tau})}{\sum_{l=0}^{k-1}\exp(\frac{z_l}{\tau})}*
\frac{\exp(\frac{z_m}{\tau}) * \frac{x^{(i)}}{\tau}} {\sum_{l=0}^{k-1}\exp(\frac{z_l}{\tau})} \\[14pt]
&=-\mathbf P \left(y^{(i)} = j \Big \vert x^{(i)}, \theta \right)*\mathbf P \left (y^{(i)} = m \Big \vert x^{(i)}, \theta \right)*\frac{x^{(i)}}{\tau}\\[16pt]
&=-\frac{x^{(i)}}{\tau} *\mathbf P_j*\mathbf P_m
\end{align}
$
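Both cases can be sanity-checked numerically. The sketch below reuses the assumed `softmax_probs` helper from above and compares $\frac{x^{(i)}}{\tau}*\mathbf P_m*(1-\mathbf P_m)$ and $-\frac{x^{(i)}}{\tau}*\mathbf P_j*\mathbf P_m$ against central finite differences on toy data:
```python
import numpy as np

# Toy data; the shapes and values here are arbitrary assumptions for the check.
rng = np.random.default_rng(0)
k, d, tau = 3, 4, 2.0
theta = rng.normal(size=(k, d))
x = rng.normal(size=d)
m = 1                                               # the 🥭 class we differentiate w.r.t.

P = softmax_probs(theta, x, tau)
analytic = np.array([(x / tau) * P[m] * (1 - P[m]) if j == m
                     else -(x / tau) * P[j] * P[m]
                     for j in range(k)])            # row j holds dP_j / dtheta_m

eps = 1e-6
numeric = np.zeros((k, d))
for i in range(d):                                  # perturb one component of theta_m at a time
    bump = np.zeros_like(theta)
    bump[m, i] = eps
    numeric[:, i] = (softmax_probs(theta + bump, x, tau)
                     - softmax_probs(theta - bump, x, tau)) / (2 * eps)

assert np.allclose(analytic, numeric, atol=1e-6)
```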
## Derivative of Log Softmax
Since we take the $\log$ of the Softmax function, we need to apply the chain rule on $\log(\mathbf P_j)$, because $\mathbf P_j$ contains $\theta_m$ itself. Remember that $\frac{d}{dx} \log(f(x)) = \frac{1}{f(x)}* f^\prime(x)$.
- When $j=m$:
$
\frac{\partial}{\partial \theta_m} \log(\mathbf P_m)=
\frac{1}{\mathbf P_m}*\left(\frac{x^{(i)}}{\tau}* \mathbf P_m* (1-\mathbf P_m)\right) =
\frac{x^{(i)}}{\tau}* (1-\mathbf P_m)
$
- When $j \not = m$:
$
\frac{\partial}{\partial \theta_m} \log(\mathbf P_j)=
\frac{1}{\mathbf P_j}*\left(-\frac{x^{(i)}}{\tau} *\mathbf P_j*\mathbf P_m\right) = -\frac{x^{(i)}}{\tau} *\mathbf P_m
$
## Derivative of Single Observation
Now we can take the derivative of the summation over all $k$ classes for a single observation.
$
\begin{align}
\frac{\partial}{\partial \theta_m} \Bigg[ \sum_{j=0}^{k-1} \mathbf 1_{j=y^{(i)}} \log (\mathbf P_j) \Bigg]
& = \sum_{j=0; j \not = m}^{k-1}\left(\mathbf 1_{j=y^{(i)}}*-\frac{x^{(i)}}{\tau}*\mathbf P_m \right)+ \left(\mathbf 1_{m=y^{(i)}}*\frac{x^{(i)}}{\tau}*(1-\mathbf P_m) \right)\\[14pt]
&=\frac{x^{(i)}}{\tau}*\left[\sum_{j=0; j \not =m}^{k-1} \left(\mathbf 1_{j=y^{(i)}}*-\mathbf P_m \right) + \left(\mathbf 1_{m=y^{(i)}}*(1- \mathbf P_m)\right) \right]\\[16pt]
& =\frac{x^{(i)}}{\tau}\left( \mathbf 1_{m=y^{(i)}}-\mathbf P_m\right)
\end{align}
$
We can do the last simplification step by combining the indicator functions: exactly one of them is active per observation, so only two outcomes are possible (see the sketch after this list):
- When $y^{(i)}=m$: $\frac{x^{(i)}}{\tau}*(1-\mathbf P_m)$
- When $y^{(i)} \not = m$: $\frac{x^{(i)}}{\tau}*(0-\mathbf P_m)$
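Stacking this per-observation result over all classes $m$ gives one $(k \times d)$ gradient block; a minimal self-contained sketch, with the function name and shapes assumed as before:
```python
import numpy as np

def single_obs_grad(theta, x, y_i, tau=1.0):
    """Gradient of sum_j 1_{j=y} log P_j w.r.t. every theta_m, for one observation.

    Row m of the result is (x / tau) * (1_{m = y^(i)} - P_m).
    """
    z = theta @ x / tau
    P = np.exp(z - z.max())
    P /= P.sum()                              # probabilities P_m
    one_hot = np.eye(theta.shape[0])[y_i]     # indicator 1_{m = y^(i)}
    return np.outer(one_hot - P, x / tau)     # shape (k, d)
```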
## Derivative of Loss Function
Finally we plug our derivative of a single observation into the full loss function, which iterates over all observations and adds a regularization term.
$
\begin{align}
\frac{\partial J}{\partial \theta_m}
&= -\frac{1}{n}\Bigg[
\sum _{i=1}^n \frac{\partial}{\partial \theta_m} \sum _{j=0}^{k-1} \mathbf 1_{y^{(i)}=j} \log {\frac{e^{\theta _ j \cdot x^{(i)} / \tau }} {\sum _{l=0}^{k-1} e^{\theta _ l \cdot x^{(i)} / \tau }}}\Bigg] + \frac{\lambda }{2}\sum _{j=0}^{k-1}\sum _{i=0}^{d-1} \frac{\partial}{\partial \theta_m}\theta _{ji}^2\\[16pt]
&=-\frac{1}{n}* \sum_{i=1}^n\left(\frac{x^{(i)}}{\tau}\left( \mathbf 1_{m=y^{(i)}}-\mathbf P_m\right)\right)+\lambda \theta_m\\[12pt]
&=-\frac{1}{\tau n}* \sum_{i=1}^n\left(x^{(i)}\left( \mathbf 1_{m=y^{(i)}}-\mathbf P_m\right)\right)+\lambda \theta_m
\end{align}
$
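Putting everything together, a sketch of the full gradient that mirrors the final line (function name and $(k, d)$ stacking are again assumptions):
```python
import numpy as np

def softmax_ce_grad(theta, X, y, tau=1.0, lam=0.0):
    """Gradient of J(theta) w.r.t. every theta_m, stacked into a (k, d) matrix.

    Row m implements -(1 / (tau * n)) * sum_i x^(i) (1_{m=y^(i)} - P_m) + lam * theta_m.
    """
    n, k = X.shape[0], theta.shape[0]
    scores = X @ theta.T / tau
    scores -= scores.max(axis=1, keepdims=True)
    P = np.exp(scores)
    P /= P.sum(axis=1, keepdims=True)         # row i holds P(y^(i) = m | x^(i), theta)
    one_hot = np.eye(k)[y]                    # (n, k) indicator matrix 1_{m = y^(i)}
    grad = -(one_hot - P).T @ X / (tau * n)   # (k, d) data term
    return grad + lam * theta                 # add the regularization gradient
```
Comparing this against a central finite difference of the `softmax_ce_loss` sketch from the first code block is a quick way to validate the algebra.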
**Intuition:** For our model we want to minimize the loss function, and gradient descent steps in the direction opposite to this gradient. The above formulation makes sense: when the model already assigns $\mathbf P_m \approx 1$ to examples whose correct label is $m$ and $\mathbf P_m \approx 0$ to examples of other classes, the terms $\mathbf 1_{m=y^{(i)}}-\mathbf P_m$ vanish and the gradient (up to the regularization term) approaches zero, so the parameters barely change.