# Implementing Probabilistic Matrix Factorization in PyTorch¶

I found myself wanting to learn more about models for recommendation systems. After a bit of digging, I found what appears to be one of the better options for collaborative filtering called Probabilistic Matrix Factorization (PMF). What really excited me about this particular model is that it is a pretty straightforward Bayesian model. Also, implementing it with PyTorch would be quite fun. I'll outline the idea of PMF in this post, you can read more in the original paper.

We're going to use the MovieLens dataset from here. This dataset has movie ratings from a number of different users. There are different sizes provided, but I'll use the smallest dataset to keep things simple. The dataset I'm using has a bit over 100,000 ratings from 610 users, for 9724 movies.

With this model, we want to predict for our users which movies they'll like. We can do this by finding feature vectors for our users and movies. The vectors are learned from users' ratings and encode each users movie preferences as well as hidden information about the movies.

We can use those feature vectors to compare users and movies. Users with similar feature vectors should have similar tastes in movies, so we can make recommendations that way. Movies with similar feature vectors should have similar qualities such as being in the same genre. Again, we can use these movie features to make recommendations because if movie A and movie B are similar, then a user who loves movie A should also love movie B. Finally, we can take the inner product of a user's feature vector and a movie's feature vector to calculate an estimated rating.

We start with a big matrix $R$ of movie ratings, with users as rows and movies as columns. Since users don't rate all movies, this is a pretty sparse matrix with a lot of missing values. What we can do is factor $R$ into two matrices $U$ and $V$ representing latent feature vectors for users and movies. With these latent features, we can compare users with each other, movies with each other, and predict ratings for movies users haven't seen.

The big idea behind this paper is that we're going to treat the latent vectors as parameters in a Bayesian model. As a reminder, Bayes theorem:

$$ \large P(\theta \mid D) \propto P(D \mid \theta) P(\theta) $$In our model we try to predict the rating matrix $R$ with $U$ and $V$

$$ \large \hat{R} = U^T V$$In our model we assume the ratings are drawn from a normal distribution with mean $\hat{R}$. What's really cool is that we can place priors on our latent features.

$$ \begin{aligned} R &\sim \mathrm{Normal}(U^T V, \sigma^2) \\ U &\sim \mathrm{Normal}(0, \sigma_U^2) \\ V &\sim \mathrm{Normal}(0, \sigma_V^2) \end{aligned} $$They also put priors on the parameters $\sigma_U$ and $\sigma_V$.

$$ \begin{aligned} \sigma_U &\sim \mathrm{Normal}(0, \Theta_U^2) \\ \sigma_V &\sim \mathrm{Normal}(0, \Theta_V^2) \end{aligned} $$The authors of the paper go further and build a hierachical structure for the user vectors that takes into account users with low numbers of reviews.

$$ \begin{aligned} U_i &= Y_i + \frac{\sum_k^M I_{ik}W_k}{\sum_k^M I_{ik}} \\ Y &\sim \mathrm{Normal}(0, \sigma_U^2) \\ W &\sim \mathrm{Normal}(0, \sigma_W^2) \end{aligned} $$With this model, we can maximize the posterior probability with respect to $U$, $V$, $\sigma_U^2$, and $\sigma_U^2$ (or $V$, $Y$, $W$, and $\sigma_U^2$ for the hierarchical model). In effect this is just a linear model with fancy regularization.

The authors also do things like converting the ratings to be between 0 and 1, then taking the sigmoid of $U^T V$. For empirical reasons.

```
import pandas as pd
import torch
```

First, going to load in and pre-process the data. The data comes in as a table of user IDs, movie IDs, and ratings. We need to pivot these columns to get our rating matrix.

```
ratings = pd.read_csv('ml-latest-small/ratings.csv')
```

```
ratings.head()
```

```
rating_matrix = ratings.pivot(index='userId', columns='movieId', values='rating')
n_users, n_movies = rating_matrix.shape
# Scaling ratings to between 0 and 1, this helps our model by constraining predictions
min_rating, max_rating = ratings['rating'].min(), ratings['rating'].max()
rating_matrix = (rating_matrix - min_rating) / (max_rating - min_rating)
# Replacing missing ratings with -1 so we can filter them out later
rating_matrix[rating_matrix.isnull()] = -1
rating_matrix = torch.FloatTensor(rating_matrix.values)
```

Here I'm defining the model as a PyTorch module. Our model will return a value proportional to the posterior, given the parameters (our feature vectors).

$$ P(U, V, \sigma) \propto P(R \mid U, V, \sigma^2) P(U) P(V) P(\sigma) $$I want to actually use the log probabilities for all these because the products become sums and we can avoid problems with floating point precision due to some probabilites being very close to zero or one. So the model returns

$$ p = \log P(R \mid U, V, \sigma^2) + \log P(U) + \log P(V) + \log P(\sigma) $$I also put a prior on $\sigma$, the prediction error, as it helps with finding the best feature vectors. I also tried placing priors on $\sigma_u$ and $\sigma_v$ but I couldn't get the training to converge using SGD. The authors of the PMT paper used expectation maximation to optimize $\sigma_u$ and $\sigma_v$ so I'm guessing they ran into a similar problem. I didn't implement that part of the paper, so there are improvements to my code here.

We can use PyTorch's `distributions`

module to calculate log probabilities for our model. Then we just sum up the likelihood and priors for each of our parameters.

```
from torch.distributions import Normal, HalfNormal
class PMFLoss(torch.nn.Module):
def __init__(self, sigma_u=0.3, sigma_v=0.3):
super().__init__()
self.sigma = torch.tensor([1.], requires_grad=True)
self.sigma_u = sigma_u
self.sigma_v = sigma_v
def forward(self, matrix, u_features, v_features):
predicted = torch.sigmoid(torch.mm(u_features, v_features.t()))
likelihood = Normal(predicted, self.sigma)
llh = likelihood.log_prob(matrix) #log-likelihood
# We only want to sum up the likelihood where we actually have data
# so set all missing data to 0 before we sum
non_zero_mask = (matrix != -1).type(torch.FloatTensor)
total_llh = torch.sum(llh * non_zero_mask)
u_prior = Normal(0, self.sigma_u)
v_prior = Normal(0, self.sigma_v)
logp_u = u_prior.log_prob(u_features).sum()
logp_v = v_prior.log_prob(v_features).sum()
logp_sig = HalfNormal(100.).log_prob(self.sigma).sum()
return (total_llh + logp_u + logp_v + logp_sig)
```

```
# Actual training loop now
# Defining the user and movie feature vectors. The number of latent features
# is arbitrary. The more vectors you use, the better your predictions will be
# but the slower the model trains.
latent_vectors = 30
user_features = torch.randn(n_users, latent_vectors, requires_grad=True)
user_features.data.mul_(0.01)
movie_features = torch.randn(n_movies, latent_vectors, requires_grad=True)
movie_features.data.mul_(0.01)
pmferror = PMFLoss()
optimizer = torch.optim.Adam([user_features, movie_features, pmferror.sigma], lr=0.03)
epochs = 1000
for step in range(epochs):
optimizer.zero_grad()
# SGD minimizes, but we want to maximize the posterior probability so negate it!
loss = -pmferror(rating_matrix, user_features, movie_features)
loss.backward()
optimizer.step()
if step % 50 == 0:
print(f"Step {step}, {loss:.3f}")
```

With our model trained, we can use the learned feature vectors to predict ratings for users. As an example, here I'm testing to see if our model can accurately predict existing ratings for one user.

```
# Checking if our model can reproduce the true user ratings
user_idx = 4
user_ratings = rating_matrix[user_idx, :]
true_ratings = user_ratings != -1
predictions = torch.sigmoid(torch.mm(user_features[user_idx, :].view(1, -1), movie_features.t()))
predicted_ratings = (predictions.squeeze()[true_ratings]*(max_rating - min_rating) + min_rating).round()
actual_ratings = (user_ratings[true_ratings]*(max_rating - min_rating) + min_rating).round()
print("Predictions: \n", predicted_ratings)
print("Truth: \n", actual_ratings)
```

There's a lot more you can do with this. As I noted above, the authors of the original paper extended the model to set priors on $\sigma_u$ and $\sigma_v$ and built a hierarchical model for the user vectors to improve predictions for users with few ratings. You can also use cosine similarity or other measures to identify similar users or movies. I believe in general you'd want to use this model in an ensemble with other collaborative filtering models for best results.