## Loss Function
Assume a multiclass classification problem with $k$ classes, like image classification between bananas 🍌 , mangos 🥭 and oranges 🍊. We can include the [[Softmax]] function in the cost function to make our [[Gradient Descent]] updates smoother (compared to a hard $0/1$ output vector).
$
J(\theta) = -\frac{1}{n}\Bigg[
\sum _{i=1}^ n \sum _{j=0}^{k-1} \mathbf 1_{y^{(i)}=j} \log {\frac{e^{\theta _ j \cdot x^{(i)} / \tau }} {\sum _{l=0}^{k-1} e^{\theta _ l \cdot x^{(i)} / \tau }}}\Bigg] + \frac{\lambda }{2}\sum _{j=0}^{k-1}\sum _{i=0}^{d-1} \theta _{ji}^2
$
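To make the formula concrete, here is a minimal NumPy sketch of $J(\theta)$; the function name `softmax_ce_loss` and the $(k, d)$ layout of `theta` (one row $\theta_j$ per class) are assumptions for illustration, not fixed by the notes above.
```python
import numpy as np

def softmax_ce_loss(theta, X, y, tau=1.0, lam=0.0):
    """Regularized softmax cross-entropy loss J(theta); a minimal sketch.

    theta : (k, d) weight matrix, one row theta_j per class
    X     : (n, d) observations x^(i)
    y     : (n,)   integer labels in {0, ..., k-1}
    tau   : temperature
    lam   : regularization strength lambda
    """
    n = X.shape[0]
    scores = X @ theta.T / tau                      # theta_j . x^(i) / tau, shape (n, k)
    scores -= scores.max(axis=1, keepdims=True)     # stabilize the exponentials
    log_probs = scores - np.log(np.exp(scores).sum(axis=1, keepdims=True))
    data_term = -log_probs[np.arange(n), y].mean()  # the indicator picks the correct class
    reg_term = 0.5 * lam * np.sum(theta ** 2)
    return data_term + reg_term
```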
We see that $J(\theta)$ sums over all $n$ observations $i$, and over all $k$ classes for each observation. However, the indicator variable $\mathbf 1_{y^{(i)}=j}$ only activates the log-Softmax term when the current class $j$ is the correct label $y^{(i)}$ for the current observation.
In the following we want to obtain the [[Gradient Descent#Gradient Vector|Gradient Vector]] $\nabla_{\theta_m} J(\theta)$ for the 🥭 class $m$, for which we will need the derivative of the Softmax function.
$
\frac{\partial J}{\partial \theta_m} = -\frac{1}{n}\left[
\sum _{i=1}^ n \sum _{j=0}^{k-1} \mathbf 1_{y^{(i)}=j} \frac{\partial}{\partial \theta_m}\log {\frac{e^{\theta _ j \cdot x^{(i)} / \tau }} {\sum _{l=0}^{k-1} e^{\theta _ l \cdot x^{(i)} / \tau }}}\right] +
\frac{\lambda }{2}\sum _{j=0}^{k-1}\sum _{i=0}^{d-1} \frac{\partial}{\partial \theta_m}\theta _{ji}^2
$
## Derivative of Softmax
Since the Softmax function returns probabilities, we can write it as $\mathbf P$ conditional on the observation $x^{(i)}$ and the parameters $\theta$.
$ \mathbf P(y^{(i)} = j | x^{(i)}, \theta) =\frac{\exp(\theta_j\cdot x^{(i)}/\tau)}{\sum_{l=0}^{k-1}\exp(\theta_l\cdot x^{(i)}/\tau)} $
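As a small sketch (the helper name `softmax_probs` is an assumption), these probabilities can be computed in a numerically stable way by shifting the scores before exponentiating, which cancels in the ratio:
```python
import numpy as np

def softmax_probs(theta, x, tau=1.0):
    """P(y = j | x, theta) for every class j; assumes theta is (k, d) and x is (d,)."""
    z = theta @ x / tau        # scores theta_j . x / tau
    z -= z.max()               # subtracting a constant cancels in the ratio
    e = np.exp(z)
    return e / e.sum()         # (k,) probabilities summing to 1
```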
To differentiate this probability, we need to apply the [[Differentiation Rules#Quotient Rule|Quotient Rule]]. Since we differentiate w.r.t. $\theta_m$, the derivative of the numerator looks different depending on whether the currently evaluated class $j$ is the 🥭 class $m$ or not.
- When $j=m$:
$ \begin{aligned}
u(x) &= \exp(\theta_m\cdot x^{(i)}/\tau) &
u^\prime(x) &=
\exp(\theta_m\cdot x^{(i)}/\tau) * (x^{(i)}/\tau) \\[4pt]
v(x) &= \sum_{l=0}^{k-1}\exp(\theta_l\cdot x^{(i)}/\tau) &
v^\prime(x) &= \exp(\theta_m\cdot x^{(i)}/\tau) * (x^{(i)}/\tau)
\end{aligned} $
- When $j \not = m$:
$ \begin{aligned}
u(x) &= \exp(\theta_j\cdot x^{(i)}/\tau) &
u^\prime(x) &=0 \\[4pt]
v(x) &= \sum_{l=0}^{k-1}\exp(\theta_l\cdot x^{(i)}/\tau) &
v^\prime(x) &= \exp(\theta_m\cdot x^{(i)}/\tau) * (x^{(i)}/\tau)
\end{aligned} $
For shorter notation we will denote the prediction score $z_j = \theta_j \cdot x^{(i)}$ and the probability $\mathbf P_j = \mathbf P(y^{(i)} = j \mid x^{(i)}, \theta)$. Now we apply the quotient rule for both cases; a numerical sanity check follows after the two cases below.
- When $j=m$:
$
\begin{align}
\frac{\partial \mathbf P_m}{\partial \theta_m}&= \frac{
\exp (\frac{z_m}{\tau}) *
\frac{x^{(i)}}{\tau} *
\sum_{l=0}^{k-1}\exp (\frac{z_l}{\tau}) -
\exp (\frac{z_m}{\tau}) *
\exp (\frac{z_m}{\tau}) *
\frac{x^{(i)}}{\tau}
}
{
\left[\sum_{l=0}^{k-1}\exp(\frac{z_l}{\tau})\right]^2
} \\[10pt]
&=\frac {
\exp (\frac{z_m}{\tau}) *
\frac{x^{(i)}}{\tau} *
\left(\sum_{l=0}^{k-1}\exp (\frac{z_l}{\tau})-\exp (\frac{z_m}{\tau} )\right)
}
{\left[\sum_{l=0}^{k-1}\exp (\frac{z_l}{\tau} )\right]^2} \\[10pt]
&=\frac{
\exp \left(\frac{z_m}{\tau} \right) *
\frac{x^{(i)}}{\tau}
}
{\sum_{l=0}^{k-1}\exp \left(\frac{z_l}{\tau} \right)} *
\frac {
\sum_{l=0}^{k-1}\exp \left( \frac{z_l}{\tau} \right)-
\exp \left(\frac{z_m}{\tau} \right)}
{\sum_{l=0}^{k-1}\exp \left (\frac{z_l}{\tau} \right)}\\[10pt]
&=\mathbf P \left(y^{(i)} = m \Big \vert x^{(i)}, \theta \right)*\frac{x^{(i)}}{\tau}* \left(\frac {\sum_{l=0}^{k-1}\exp(\frac{z_l}{\tau})}{\sum_{l=0}^{k-1}\exp(\frac{z_l}{\tau})}-\frac{\exp(\frac{z_m}{\tau})} {\sum_{l=0}^{k-1}\exp(\frac{z_l}{\tau})}\right) \\[10pt]
&=\mathbf P \left(y^{(i)} = m \Big \vert x^{(i)}, \theta \right)*\frac{x^{(i)}}{\tau}* \left(1-\mathbf P \left(y^{(i)} = m \Big \vert x^{(i)}, \theta \right)\right) \\[10pt]
&=\frac{x^{(i)}}{\tau}* \mathbf P_m* (1-\mathbf P_m)
\end{align}
$
- When $j \not = m$:
$
\begin{align}
\frac{\partial \mathbf P_j}{\partial \theta_m} &= \frac {0*\sum_{l=0}^{k-1}\exp(\frac{z_l}{\tau})-\exp(\frac{z_j}{\tau})*\exp(\frac{z_m}{\tau}) * \frac{x^{(i)}}{\tau}} {\Big[\sum_{l=0}^{k-1}\exp(\frac{z_l}{\tau})\Big]^2} \\[10pt]
&=-\frac{\exp(\frac{z_j}{\tau})}{\sum_{l=0}^{k-1}\exp(\frac{z_l}{\tau})}*
\frac{\exp(\frac{z_m}{\tau}) * \frac{x^{(i)}}{\tau}} {\sum_{l=0}^{k-1}\exp(\frac{z_l}{\tau})} \\[14pt]
&=-\mathbf P \left(y^{(i)} = j \Big \vert x^{(i)}, \theta \right)*\mathbf P \left (y^{(i)} = m \Big \vert x^{(i)}, \theta \right)*\frac{x^{(i)}}{\tau}\\[16pt]
&=-\frac{x^{(i)}}{\tau} *\mathbf P_j*\mathbf P_m
\end{align}
$
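Both cases can be sanity-checked numerically. The sketch below reuses the assumed `softmax_probs` helper from above and compares $\frac{x^{(i)}}{\tau}*\mathbf P_m*(1-\mathbf P_m)$ and $-\frac{x^{(i)}}{\tau}*\mathbf P_j*\mathbf P_m$ against central finite differences on toy data:
```python
import numpy as np

# Toy data; the shapes and values here are arbitrary assumptions for the check.
rng = np.random.default_rng(0)
k, d, tau = 3, 4, 2.0
theta = rng.normal(size=(k, d))
x = rng.normal(size=d)
m = 1                                               # the 🥭 class we differentiate w.r.t.

P = softmax_probs(theta, x, tau)
analytic = np.array([(x / tau) * P[m] * (1 - P[m]) if j == m
                     else -(x / tau) * P[j] * P[m]
                     for j in range(k)])            # row j holds dP_j / dtheta_m

eps = 1e-6
numeric = np.zeros((k, d))
for i in range(d):                                  # perturb one component of theta_m at a time
    bump = np.zeros_like(theta)
    bump[m, i] = eps
    numeric[:, i] = (softmax_probs(theta + bump, x, tau)
                     - softmax_probs(theta - bump, x, tau)) / (2 * eps)

assert np.allclose(analytic, numeric, atol=1e-6)
```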
## Derivative of Log Softmax
Since we take the $\log$ of the Softmax function, we need to apply the chain rule on $\log(\mathbf P_j)$, because $\mathbf P_j$ contains $\theta_m$ itself. Remember that $\frac{d}{dx} \log(f(x)) = \frac{1}{f(x)}* f^\prime(x)$.
- When $j=m$:
$
\frac{\partial}{\partial \theta_m} \log(\mathbf P_m)=
\frac{1}{\mathbf P_m}*\left(\frac{x^{(i)}}{\tau}* \mathbf P_m* (1-\mathbf P_m)\right) =
\frac{x^{(i)}}{\tau}* (1-\mathbf P_m)
$
- When $j \not = m$:
$
\frac{\partial}{\partial \theta_m} \log(\mathbf P_j)=
\frac{1}{\mathbf P_j}*\left(-\frac{x^{(i)}}{\tau} *\mathbf P_j*\mathbf P_m\right) = -\frac{x^{(i)}}{\tau} *\mathbf P_m
$
## Derivative of Single Observation
Now we can take the derivative of the summation over all $k$ classes for a single observation.
$
\begin{align}
\frac{\partial}{\partial \theta_m} \Bigg[ \sum_{j=0}^{k-1} \mathbf 1_{j=y^{(i)}} \log (\mathbf P_j) \Bigg]
& = \sum_{j=0; j \not = m}^{k-1}\left(\mathbf 1_{j=y^{(i)}}*-\frac{x^{(i)}}{\tau}*\mathbf P_m \right)+ \left(\mathbf 1_{m=y^{(i)}}*\frac{x^{(i)}}{\tau}*(1-\mathbf P_m) \right)\\[14pt]
&=\frac{x^{(i)}}{\tau}*\left[\sum_{j=0; j \not =m}^{k-1} \left(\mathbf 1_{j=y^{(i)}}*-\mathbf P_m \right) + \left(\mathbf 1_{m=y^{(i)}}*(1- \mathbf P_m)\right) \right]\\[16pt]
& =\frac{x^{(i)}}{\tau}\left( \mathbf 1_{m=y^{(i)}}-\mathbf P_m\right)
\end{align}
$
We can do the last simplification step by combining the indicator functions: exactly one of them is active per observation, so only two outcomes are possible (see the sketch after this list):
- When $y^{(i)}=m$: $\frac{x^{(i)}}{\tau}*(1-\mathbf P_m)$
- When $y^{(i)} \not = m$: $\frac{x^{(i)}}{\tau}*(0-\mathbf P_m)$
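Stacking this per-observation result over all classes $m$ gives one $(k \times d)$ gradient block; a minimal self-contained sketch, with the function name and shapes assumed as before:
```python
import numpy as np

def single_obs_grad(theta, x, y_i, tau=1.0):
    """Gradient of sum_j 1_{j=y} log P_j w.r.t. every theta_m, for one observation.

    Row m of the result is (x / tau) * (1_{m = y^(i)} - P_m).
    """
    z = theta @ x / tau
    P = np.exp(z - z.max())
    P /= P.sum()                              # probabilities P_m
    one_hot = np.eye(theta.shape[0])[y_i]     # indicator 1_{m = y^(i)}
    return np.outer(one_hot - P, x / tau)     # shape (k, d)
```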
## Derivative of Loss Function
Finally we plug our derivative of a single observation into the full loss function, which iterates over all observations and adds a regularization term.
$
\begin{align}
\frac{\partial J}{\partial \theta_m}
&= -\frac{1}{n}\Bigg[
\sum _{i=1}^n \frac{\partial}{\partial \theta_m} \sum _{j=0}^{k-1} \mathbf 1_{y^{(i)}=j} \log {\frac{e^{\theta _ j \cdot x^{(i)} / \tau }} {\sum _{l=0}^{k-1} e^{\theta _ l \cdot x^{(i)} / \tau }}}\Bigg] + \frac{\lambda }{2}\sum _{j=0}^{k-1}\sum _{i=0}^{d-1} \frac{\partial}{\partial \theta_m}\theta _{ji}^2\\[16pt]
&=-\frac{1}{n}* \sum_{i=1}^n\left(\frac{x^{(i)}}{\tau}\left( \mathbf 1_{m=y^{(i)}}-\mathbf P_m\right)\right)+\lambda \theta_m\\[12pt]
&=-\frac{1}{\tau n}* \sum_{i=1}^n\left(x^{(i)}\left( \mathbf 1_{m=y^{(i)}}-\mathbf P_m\right)\right)+\lambda \theta_m
\end{align}
$
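Putting everything together, a sketch of the full gradient that mirrors the final line (function name and $(k, d)$ stacking are again assumptions):
```python
import numpy as np

def softmax_ce_grad(theta, X, y, tau=1.0, lam=0.0):
    """Gradient of J(theta) w.r.t. every theta_m, stacked into a (k, d) matrix.

    Row m implements -(1 / (tau * n)) * sum_i x^(i) (1_{m=y^(i)} - P_m) + lam * theta_m.
    """
    n, k = X.shape[0], theta.shape[0]
    scores = X @ theta.T / tau
    scores -= scores.max(axis=1, keepdims=True)
    P = np.exp(scores)
    P /= P.sum(axis=1, keepdims=True)         # row i holds P(y^(i) = m | x^(i), theta)
    one_hot = np.eye(k)[y]                    # (n, k) indicator matrix 1_{m = y^(i)}
    grad = -(one_hot - P).T @ X / (tau * n)   # (k, d) data term
    return grad + lam * theta                 # add the regularization gradient
```
Comparing this against a central finite difference of the `softmax_ce_loss` sketch from the first code block is a quick way to validate the algebra.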
**Intuition:** For our model we want to minimize the loss function, and gradient descent steps in the direction opposite to this gradient. The above formulation makes sense: when the model already assigns $\mathbf P_m \approx 1$ to examples whose correct label is $m$ and $\mathbf P_m \approx 0$ to examples of other classes, the terms $\mathbf 1_{m=y^{(i)}}-\mathbf P_m$ vanish and the gradient (up to the regularization term) approaches zero, so the parameters barely change.