A Recommendation Model with PyTorch

Posted by Mat Leonard on Mon 08 April 2019

Implementing Probabilistic Matrix Factorization in PyTorch

I found myself wanting to learn more about models for recommendation systems. After a bit of digging, I found what appears to be one of the better options for collaborative filtering called Probabilistic Matrix Factorization (PMF). What really excited me about this particular model is that it is a pretty straightforward Bayesian model. Also, implementing it with PyTorch would be quite fun. I'll outline the idea of PMF in this post, you can read more in the original paper.

We're going to use the MovieLens dataset from here. This dataset has movie ratings from a number of different users. There are different sizes provided, but I'll use the smallest dataset to keep things simple. The dataset I'm using has a bit over 100,000 ratings from 610 users, for 9724 movies.

With this model, we want to predict for our users which movies they'll like. We can do this by finding feature vectors for our users and movies. The vectors are learned from users' ratings and encode each users movie preferences as well as hidden information about the movies.

We can use those feature vectors to compare users and movies. Users with similar feature vectors should have similar tastes in movies, so we can make recommendations that way. Movies with similar feature vectors should have similar qualities such as being in the same genre. Again, we can use these movie features to make recommendations because if movie A and movie B are similar, then a user who loves movie A should also love movie B. Finally, we can take the inner product of a user's feature vector and a movie's feature vector to calculate an estimated rating.

matrix factorization

We start with a big matrix $R$ of movie ratings, with users as rows and movies as columns. Since users don't rate all movies, this is a pretty sparse matrix with a lot of missing values. What we can do is factor $R$ into two matrices $U$ and $V$ representing latent feature vectors for users and movies. With these latent features, we can compare users with each other, movies with each other, and predict ratings for movies users haven't seen.

The big idea behind this paper is that we're going to treat the latent vectors as parameters in a Bayesian model. As a reminder, Bayes theorem:

$$ \large P(\theta \mid D) \propto P(D \mid \theta) P(\theta) $$

In our model we try to predict the rating matrix $R$ with $U$ and $V$

$$ \large \hat{R} = U^T V$$

In our model we assume the ratings are drawn from a normal distribution with mean $\hat{R}$. What's really cool is that we can place priors on our latent features.

$$ \begin{aligned} R &\sim \mathrm{Normal}(U^T V, \sigma^2) \\ U &\sim \mathrm{Normal}(0, \sigma_U^2) \\ V &\sim \mathrm{Normal}(0, \sigma_V^2) \end{aligned} $$

They also put priors on the parameters $\sigma_U$ and $\sigma_V$.

$$ \begin{aligned} \sigma_U &\sim \mathrm{Normal}(0, \Theta_U^2) \\ \sigma_V &\sim \mathrm{Normal}(0, \Theta_V^2) \end{aligned} $$

The authors of the paper go further and build a hierachical structure for the user vectors that takes into account users with low numbers of reviews.

$$ \begin{aligned} U_i &= Y_i + \frac{\sum_k^M I_{ik}W_k}{\sum_k^M I_{ik}} \\ Y &\sim \mathrm{Normal}(0, \sigma_U^2) \\ W &\sim \mathrm{Normal}(0, \sigma_W^2) \end{aligned} $$

With this model, we can maximize the posterior probability with respect to $U$, $V$, $\sigma_U^2$, and $\sigma_U^2$ (or $V$, $Y$, $W$, and $\sigma_U^2$ for the hierarchical model). In effect this is just a linear model with fancy regularization.

The authors also do things like converting the ratings to be between 0 and 1, then taking the sigmoid of $U^T V$. For empirical reasons.

In [1]:
import pandas as pd
import torch

First, going to load in and pre-process the data. The data comes in as a table of user IDs, movie IDs, and ratings. We need to pivot these columns to get our rating matrix.

In [2]:
ratings = pd.read_csv('ml-latest-small/ratings.csv')
In [3]:
ratings.head()
Out[3]:
userId movieId rating timestamp
0 1 1 4.0 964982703
1 1 3 4.0 964981247
2 1 6 4.0 964982224
3 1 47 5.0 964983815
4 1 50 5.0 964982931
In [4]:
rating_matrix = ratings.pivot(index='userId', columns='movieId', values='rating')
n_users, n_movies = rating_matrix.shape

# Scaling ratings to between 0 and 1, this helps our model by constraining predictions
min_rating, max_rating = ratings['rating'].min(), ratings['rating'].max()
rating_matrix = (rating_matrix - min_rating) / (max_rating - min_rating)

# Replacing missing ratings with -1 so we can filter them out later
rating_matrix[rating_matrix.isnull()] = -1
rating_matrix = torch.FloatTensor(rating_matrix.values)

Here I'm defining the model as a PyTorch module. Our model will return a value proportional to the posterior, given the parameters (our feature vectors).

$$ P(U, V, \sigma) \propto P(R \mid U, V, \sigma^2) P(U) P(V) P(\sigma) $$

I want to actually use the log probabilities for all these because the products become sums and we can avoid problems with floating point precision due to some probabilites being very close to zero or one. So the model returns

$$ p = \log P(R \mid U, V, \sigma^2) + \log P(U) + \log P(V) + \log P(\sigma) $$

I also put a prior on $\sigma$, the prediction error, as it helps with finding the best feature vectors. I also tried placing priors on $\sigma_u$ and $\sigma_v$ but I couldn't get the training to converge using SGD. The authors of the PMT paper used expectation maximation to optimize $\sigma_u$ and $\sigma_v$ so I'm guessing they ran into a similar problem. I didn't implement that part of the paper, so there are improvements to my code here.

We can use PyTorch's distributions module to calculate log probabilities for our model. Then we just sum up the likelihood and priors for each of our parameters.

In [5]:
from torch.distributions import Normal, HalfNormal

class PMFLoss(torch.nn.Module):
    def __init__(self, sigma_u=0.3, sigma_v=0.3):
        super().__init__()
        self.sigma = torch.tensor([1.], requires_grad=True)
        self.sigma_u = sigma_u
        self.sigma_v = sigma_v
    
    def forward(self, matrix, u_features, v_features):

        predicted = torch.sigmoid(torch.mm(u_features, v_features.t()))
        likelihood = Normal(predicted, self.sigma)
        llh = likelihood.log_prob(matrix) #log-likelihood
        
        # We only want to sum up the likelihood where we actually have data
        # so set all missing data to 0 before we sum
        non_zero_mask = (matrix != -1).type(torch.FloatTensor)
        total_llh = torch.sum(llh * non_zero_mask)

        u_prior = Normal(0, self.sigma_u)
        v_prior = Normal(0, self.sigma_v)
        logp_u = u_prior.log_prob(u_features).sum()
        logp_v = v_prior.log_prob(v_features).sum()
        
        logp_sig = HalfNormal(100.).log_prob(self.sigma).sum()
        
        return (total_llh + logp_u + logp_v + logp_sig)
In [6]:
# Actual training loop now

# Defining the user and movie feature vectors. The number of latent features
# is arbitrary. The more vectors you use, the better your predictions will be
# but the slower the model trains.
latent_vectors = 30
user_features = torch.randn(n_users, latent_vectors, requires_grad=True)
user_features.data.mul_(0.01) 
movie_features = torch.randn(n_movies, latent_vectors, requires_grad=True)
movie_features.data.mul_(0.01)

pmferror = PMFLoss()
optimizer = torch.optim.Adam([user_features, movie_features, pmferror.sigma], lr=0.03)
epochs = 1000
for step in range(epochs):
    optimizer.zero_grad()
    # SGD minimizes, but we want to maximize the posterior probability so negate it!
    loss = -pmferror(rating_matrix, user_features, movie_features)
    loss.backward()
    optimizer.step()
    if step % 50 == 0:
        print(f"Step {step}, {loss:.3f}")
Step 0, 8585.745
Step 50, -109049.531
Step 100, -165917.859
Step 150, -174563.375
Step 200, -176913.188
Step 250, -177872.328
Step 300, -178502.625
Step 350, -178917.453
Step 400, -179248.922
Step 450, -179598.094
Step 500, -179811.031
Step 550, -179658.922
Step 600, -180062.516
Step 650, -180042.797
Step 700, -180428.688
Step 750, -180534.766
Step 800, -180664.047
Step 850, -180735.406
Step 900, -180906.047
Step 950, -180980.672

With our model trained, we can use the learned feature vectors to predict ratings for users. As an example, here I'm testing to see if our model can accurately predict existing ratings for one user.

In [7]:
# Checking if our model can reproduce the true user ratings
user_idx = 4
user_ratings = rating_matrix[user_idx, :]
true_ratings = user_ratings != -1
predictions = torch.sigmoid(torch.mm(user_features[user_idx, :].view(1, -1), movie_features.t()))
predicted_ratings = (predictions.squeeze()[true_ratings]*(max_rating - min_rating) + min_rating).round()
actual_ratings = (user_ratings[true_ratings]*(max_rating - min_rating) + min_rating).round()

print("Predictions: \n", predicted_ratings)
print("Truth: \n", actual_ratings)
Predictions: 
 tensor([4., 4., 4., 4., 3., 4., 5., 4., 3., 3., 4., 5., 3., 4., 3., 1., 5., 5.,
        3., 2., 3., 3., 3., 2., 3., 4., 2., 3., 4., 4., 5., 3., 5., 4., 3., 4.,
        3., 5., 3., 5., 5., 5., 3., 3.], grad_fn=<RoundBackward>)
Truth: 
 tensor([4., 4., 4., 4., 3., 4., 5., 4., 3., 3., 4., 5., 3., 4., 3., 1., 5., 5.,
        3., 2., 3., 3., 3., 2., 3., 4., 2., 3., 4., 4., 5., 3., 5., 4., 3., 4.,
        3., 5., 3., 5., 5., 5., 3., 3.])

There's a lot more you can do with this. As I noted above, the authors of the original paper extended the model to set priors on $\sigma_u$ and $\sigma_v$ and built a hierarchical model for the user vectors to improve predictions for users with few ratings. You can also use cosine similarity or other measures to identify similar users or movies. I believe in general you'd want to use this model in an ensemble with other collaborative filtering models for best results.