Consider the problem of personalized movie recommendations. We are given a sparse input matrix $Y$ of dimensions $n \times m$, where $n$ is the number of users and $m$ is the number of movies.
We want to generate an output matrix $X$ in which every element is filled in. The elements that are already given in the input matrix form the set $D$:
$ D= \{(a,i) \, \vert \, Y_{ai} \text{ is given}\} $
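As a minimal sketch in NumPy (encoding missing entries as `np.nan` is our own convention), the example matrix used later in this note and its index set $D$ can be set up as follows:

```python
import numpy as np

# Sparse rating matrix Y: rows are users (n = 2), columns are movies (m = 3).
Y = np.array([[5.0, np.nan, 7.0],
              [1.0, 2.0, np.nan]])

# D collects the (user, movie) index pairs where a rating is observed.
D = {(a, i) for a in range(Y.shape[0]) for i in range(Y.shape[1])
     if not np.isnan(Y[a, i])}
# D == {(0, 0), (0, 2), (1, 0), (1, 1)}
```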
## Objective Function (Naive Approach)
By minimizing the following function, we want to achieve two things:
$ J(X) = \frac{1}{2}\sum_{(a,i) \in D} (Y_{ai} - X_{ai})^2 + \frac{\lambda }{2} \underbrace{\sum_{a=1}^n \sum_{i=1}^m X_{ai}^2}_{\vert \vert X \vert \vert^2} $
- *Matching the input matrix:* For the elements in $D$ we already have labels in the input matrix. We want these elements to be represented in the output matrix with values $X_{ai}$ as close as possible to their given labels $Y_{ai}$.
- *Regularization:* We make sure that the elements of the output matrix (known & unknown) do not blow up in magnitude. Therefore we introduce a regularization term, weighted by $\lambda$, that drags all elements equally towards $0$.
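A direct transcription of $J(X)$, continuing the NumPy sketch above (`lam` stands for $\lambda$):

```python
def naive_objective(X, Y, D, lam):
    """J(X): squared error on observed entries plus L2 penalty on all entries."""
    error = 0.5 * sum((Y[a, i] - X[a, i]) ** 2 for (a, i) in D)
    reg = 0.5 * lam * np.sum(X ** 2)
    return error + reg
```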
We can find the gradient of this cost function for each element $X_{ai}$ and then update each element using [[Gradient Descent]]. Note that the gradient looks different depending on whether $(a,i) \in D$ or not.
If $(a,i) \in D:$
$ \begin{aligned} \frac{\partial J(X)}{\partial X_{ai}} = - (Y_{ai} - X_{ai}) + \lambda X_{ai} &\stackrel{!}{=}0 \\[8pt] X_{ai} + \lambda X_{ai} &=Y_{ai} \\[8pt] X_{ai}(1+\lambda)&=Y_{ai} \\[4pt] X_{ai} &= \frac{Y_{ai}}{1+\lambda} \end{aligned} $
If $(a,i) \not \in D:$
$ \begin{aligned} \frac{\partial J(X)}{\partial X_{ai}} = \lambda X_{ai} &\stackrel{!}{=}0 \\[4pt] X_{ai}&=0 \end{aligned} $
So we can see that the element-wise updates of $X_{ai}$ lead to trivial (not helpful) solutions. In fact, any observed entry with $(a,i) \in D$ is merely shrunk by the regularization term, while all unobserved entries with $(a,i) \not\in D$ are set to $0$.
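A sketch of this trivial minimizer, following the two cases above (continuing the NumPy setup):

```python
def trivial_minimizer(Y, D, lam):
    """Element-wise minimizer of J(X): shrunk labels on D, zeros everywhere else."""
    X = np.zeros(Y.shape)
    for (a, i) in D:
        X[a, i] = Y[a, i] / (1.0 + lam)   # observed entries are shrunk towards 0
    return X

print(trivial_minimizer(Y, D, lam=1.0))
# [[2.5 0.  3.5]
#  [0.5 1.  0. ]]
```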
>[!note:]
>The reason this is not helpful is that we model each entry independently, so we capture no patterns between users and movies.
## Applying Matrix Factorization
Therefore we make the assumption that our output matrix $X$ has low [[Matrix Rank]]. That is, we assume there are patterns, i.e. relationships across the $n$ users and $m$ movies, that can be summarized by a smaller number of dimensions.
Assuming this low rank allows us to represent $X$ in a more compact form via [[Matrix Factorization]]. Since we can now decompose $X$ into two smaller matrices $U$ ($n \times k$) and $V$ ($m \times k$), we can substitute them into the objective function.
$ X=UV^T $
$ J(U, V) = \underbrace{\frac{1}{2}\sum _{(a,i) \in D} (Y_{ai} - [UV^ T]_{ai})^2}_{\text {Squared Error}} + \underbrace{\frac{\lambda }{2}\sum _{a=1}^ n \sum _{j=1}^ k U_{aj}^2 + \frac{\lambda }{2}\sum _{i=1}^ m \sum _{j=1}^ k V_{ij}^2}_{\text {Regularization}} $
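As a sketch in code (continuing the NumPy setup above; $U$ is $n \times k$ and $V$ is $m \times k$):

```python
def factorized_objective(U, V, Y, D, lam):
    """J(U, V): squared error on D plus separate L2 penalties on U and V."""
    X = U @ V.T                  # reconstructed (n x m) rating matrix
    error = 0.5 * sum((Y[a, i] - X[a, i]) ** 2 for (a, i) in D)
    reg = 0.5 * lam * (np.sum(U ** 2) + np.sum(V ** 2))
    return error + reg
```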
**Process:**
1. Initialize $V$ with some values.
2. Express $UV^T$ using these initialized, now fixed, values of $V$.
3. Insert the resulting $UV^T$ into $J(U,V)$ and differentiate w.r.t. each $u_a$ separately.
	1. Apply a gradient descent update to each $u_a$ OR
	2. Set each derivative $=0$ to directly find the $u_a$ values that minimize the objective function individually.
4. Update $U$ with the minimizing values.
5. Repeat the procedure for $V$ and alternate between the two until both converge (see the sketch below).
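A minimal sketch of this alternating scheme for the rank-$1$ case, using the closed-form updates from step 3.2 (the function name and NumPy encoding are our own; it assumes $\lambda > 0$):

```python
import numpy as np

def als_rank1(Y, D, lam, v_init, n_iters=50):
    """Alternate closed-form updates of u and v in the rank-1 model X = u v^T."""
    n, m = Y.shape
    u, v = np.zeros(n), np.asarray(v_init, dtype=float).copy()
    for _ in range(n_iters):
        for a in range(n):   # update each u_a with v held fixed
            cols = [i for i in range(m) if (a, i) in D]
            u[a] = sum(Y[a, i] * v[i] for i in cols) / (lam + sum(v[i]**2 for i in cols))
        for i in range(m):   # update each v_i with u held fixed
            rows = [a for a in range(n) if (a, i) in D]
            v[i] = sum(Y[a, i] * u[a] for a in rows) / (lam + sum(u[a]**2 for a in rows))
    return u, v

# e.g. with the example below: als_rank1(Y, D, lam=1.0, v_init=[2.0, 7.0, 8.0])
```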
>[!note:]
>The original matrix $X$ has $n \cdot m$ parameters. By assuming rank $1$ we are left with only $n+m$ parameters (e.g., $n=1000$ users and $m=2000$ movies give $2{,}000{,}000$ versus $3000$). However, such a low-rank assumption might not reflect the complex relationships between different user groups and segments of movies.
>[!note:]
>Theoretically, substituting $X=UV^T$ into the regularizer would yield a complex sum of $([UV^T]_{ai})^2$ terms. In matrix factorization, however, it is conventional to instead regularize the squared norms of $U$ and $V$ separately.
## Example
In the rank-$1$ case, $U$ and $V$ reduce to the single column vectors $u \in \mathbb{R}^n$ and $v \in \mathbb{R}^m$. We can interpret them as the “user” vector $u$ that captures the rating tendency of each user, and the “movie” vector $v$ that captures the overall popularity of each movie.
$ Y= \begin{bmatrix} 5&?&7\\ 1&2&? \end{bmatrix}, \quad u= \begin{bmatrix} u_1\\u_2 \end{bmatrix} \quad v= \begin{bmatrix} v_1\\v_2\\v_3 \end{bmatrix} $
We find the optimal parameters $u,v$ iteratively: we initialize one of the vectors ($u$ or $v$) and then update the other, alternating between the two.
Here we initialize $v$ with some values and express $X$ in terms of the unknown entries of $u$.
$ v = \begin{bmatrix}2\\7\\8\end{bmatrix} \implies X=uv^T= \begin{bmatrix} 2u_1 & 7u_1 & 8u_1\\ 2u_2 & 7u_2 & 8 u_2 \end{bmatrix} $
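This expansion can be checked symbolically, e.g. with SymPy (a small sketch reusing the initialization above):

```python
import sympy as sp

u1, u2 = sp.symbols("u1 u2")
u = sp.Matrix([u1, u2])      # unknown user vector
v = sp.Matrix([2, 7, 8])     # initialized movie vector
print(u * v.T)
# Matrix([[2*u1, 7*u1, 8*u1], [2*u2, 7*u2, 8*u2]])
```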
**Preparing for Differentiation:**
- We can minimize over $u_1$ and $u_2$ separately, as their respective minimizers do not depend on each other.
- Differentiating w.r.t. $u_1$ treats all other $u_a$ terms as constants. Therefore the original summation $\sum_{(a,i) \in D}$ over all users and movies simplifies to a sum over the observed movies of the fixed user $a=1$.
$
\begin{align}
J(U,V) &= \frac{1}{2} \sum_{(a,i) \in D} \Big(y_{ai}-u_av_i \Big)^2 + \frac{\lambda}{2}\sum_{a=1}^n u_a^2+\frac{\lambda}{2}\sum_{i=1}^m v_i^2 \\[10pt]
J(u_1,V) &= \frac{1}{2} \sum_{(a,i) \in D; \, a=1} \Big(y_{1i}-u_1v_i \Big)^2 + \frac{\lambda}{2} u_1^2 \\[10pt]
J(u_2,V) &= \frac{1}{2} \sum_{(a,i) \in D; \, a=2} \Big(y_{2i}-u_2v_i\Big)^2 + \frac{\lambda}{2} u_2^2
\end{align}
$
where, going from (1) to (2):
- Since we differentiate w.r.t. $u_1$, the regularization sum over all $v_i$ terms is a constant and can be dropped.
- Likewise, all $u_a$ terms with $a \neq 1$ are constants, so only the $\frac{\lambda}{2} u_1^2$ penalty remains; in the data term, $u_a$ becomes $u_1$ and the sum is restricted to $a=1$.
**Differentiation:** For user $1$ the observed entries are $y_{11}=5$ (with $v_1=2$) and $y_{13}=7$ (with $v_3=8$), so:
$
\begin{align}
J(u_1,V) &= \frac{1}{2}\Big((5-2u_1)^2+(7-8u_1)^2\Big)+ \frac{\lambda}{2}u_1^2 \\[10pt]
\frac{\partial J}{\partial u_1} &= -2(5-2u_1) - 8(7-8u_1) + \lambda u_1 = 68u_1 - 66 + \lambda u_1 \stackrel{!}{=} 0 \\[10pt]
u_1 &= \frac{66}{68+\lambda}
\end{align}
$
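This closed form can be double-checked symbolically, e.g. with SymPy (a sketch; $\lambda$ is kept symbolic):

```python
import sympy as sp

u1, lam = sp.symbols("u1 lambda", positive=True)
J = sp.Rational(1, 2) * ((5 - 2*u1)**2 + (7 - 8*u1)**2) + lam / 2 * u1**2
dJ = sp.expand(sp.diff(J, u1))     # -> lambda*u1 + 68*u1 - 66
print(sp.solve(sp.Eq(dJ, 0), u1))  # [66/(lambda + 68)]
```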