I found myself wanting to learn more about models for recommendation systems. After a bit of digging, I found what appears to be one of the better options for collaborative filtering: Probabilistic Matrix Factorization (PMF). What really excited me about this particular model is that it's a pretty straightforward Bayesian model, and implementing it with PyTorch would be quite fun. I'll outline the idea of PMF in this post and show how we can implement it with PyTorch.
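To make the idea concrete up front, here's a minimal toy sketch (my own, not the post's implementation) of PMF fit by MAP estimation in PyTorch: the Gaussian priors on the latent user and item factors show up as L2 penalties, and the observed ratings enter through a squared-error likelihood. The data, shapes, and hyperparameters here are all illustrative.

```python
import torch

torch.manual_seed(0)
n_users, n_items, k = 50, 40, 5

# Fake ratings matrix; in PMF only a subset of entries is observed
ratings = torch.randint(1, 6, (n_users, n_items)).float()
mask = torch.rand(n_users, n_items) < 0.2  # ~20% observed

# Latent user and item factor matrices
U = torch.randn(n_users, k, requires_grad=True)
V = torch.randn(n_items, k, requires_grad=True)
opt = torch.optim.Adam([U, V], lr=0.05)

losses = []
for _ in range(200):
    opt.zero_grad()
    pred = U @ V.T
    # Squared error on observed entries (Gaussian likelihood)
    # plus L2 penalties on the factors (Gaussian priors)
    loss = ((pred - ratings)[mask] ** 2).sum() \
           + 0.1 * (U ** 2).sum() + 0.1 * (V ** 2).sum()
    loss.backward()
    opt.step()
    losses.append(loss.item())
```

This is the MAP view of PMF; the full Bayesian treatment, which is where it gets interesting, comes later in the post.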
Way back in the year 2015, I discovered Bayesian statistics and immediately fell in love. At the time I found a lot of use for Bayesian models in my research. However, the Python libraries built for Bayesian data analysis were difficult to use for my purposes. So, as any good postdoc will do, I spent a couple months building my own library, Sampyl. Fast forward a bit and I'm teaching deep learning at Udacity. Initially we used TensorFlow, as it was by far the most used and usable deep learning framework available. However, PyTorch was released in 2017 and it was a revelation. Again, I immediately fell in love. Building neural networks wasn't just easy, it was fun. Turns out, the exact features that make PyTorch great also make it an excellent choice as a backend for a Bayesian data analysis library. Here is my initial work building a library for Bayesian data analysis using PyTorch.
I've been working on building a Bayesian model to infer the firing rate of neurons over time. Python has a few packages available which use Markov Chain Monte Carlo (MCMC) methods to sample from the posterior distribution of a Bayesian model. The one I'm most familiar with is PyMC; the newest version, PyMC3, implements the No-U-Turn Sampler (NUTS) developed by Matthew Hoffman and Andrew Gelman. NUTS is based on Hamiltonian MCMC, which uses the gradient of the posterior log probability to avoid the random-walk behavior of Metropolis-Hastings samplers, but unlike plain Hamiltonian MCMC it requires very little tuning by the user. In the new version, the authors of PyMC chose to use Theano to calculate the log-p gradients. This is where I ran into problems.
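To illustrate why the gradient machinery matters, here's a toy sketch (my own, not Sampyl's or PyMC's API) of getting the gradient of a log-posterior with PyTorch's autograd. This gradient is exactly the quantity Hamiltonian MCMC needs at every step, and autograd gives it to us without writing any derivative code by hand.

```python
import torch

def log_p(theta):
    # Unnormalized log-posterior of a standard normal, as a toy example
    return -0.5 * torch.sum(theta ** 2)

theta = torch.tensor([1.0, -2.0], requires_grad=True)
logp = log_p(theta)
logp.backward()  # autograd computes d(log-p)/d(theta)
print(theta.grad)  # -theta, i.e. tensor([-1., 2.])
```

The appeal over Theano here is that we just write a plain Python function for the log probability and PyTorch handles the rest.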
For the past few months, I've been commuting to work on my bicycle. I've always been a walker, but I've been out of shape and slowly gaining fat for some time now. The new activity has led to some obvious weight loss. This has inspired me to keep working at it and track my progress. As part of this, I wanted to measure my percent body fat using tools I have around my apartment. You can find calculators on the internet which give you a single point estimate. Being a scientist, though, I want some sense of the uncertainty in that estimate. So I decided to build my own model from data, one I can use to estimate my body fat percentage along with its uncertainty.
In my down time, I've been writing documentation for Sampyl, a necessary and sometimes fun task. I built the documentation with Sphinx, a very nice package that lets you focus on the content. Then I wanted to find somewhere to host the documentation online for free. My first attempt was with Read the Docs. After running into some problems getting Read the Docs to build my documentation, I tried hosting on GitHub. That didn't work immediately either, but after fixing the issues, I wanted to share my experience to help guide others.
Every time we present results in science, we must also address our uncertainty. This is true any time a measurement is presented, such as in election polling, which is why you see polls reporting that "56% of likely voters prefer candidate A, with a margin of error of 4 points." A result without a statement of its uncertainty is basically meaningless. Yet common statistics such as p-values, which are supposed to help us gauge uncertainty, are highly uncertain themselves. In this post, I'll explore the uncertainty inherent in hypothesis testing and why p-values are a poor way to measure uncertainty.
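To preview the point, here's a quick simulation sketch (my own illustration, with made-up parameters): even when a real effect exists, running the identical experiment over and over produces wildly different p-values.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)

# Simulate 1000 identical experiments comparing two groups,
# where a true effect of 0.5 standard deviations exists
p_values = []
for _ in range(1000):
    a = rng.normal(0.0, 1.0, size=30)  # control group
    b = rng.normal(0.5, 1.0, size=30)  # treatment group, true effect
    _, p = stats.ttest_ind(a, b)
    p_values.append(p)

p_values = np.array(p_values)
print(p_values.min(), np.median(p_values), p_values.max())
```

The same underlying reality yields p-values spanning several orders of magnitude, which is the variability this post digs into.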
For a while I've been thinking about Yelp reviews, in particular about the information lost by distilling the reviews down to one number. It isn't clear how this number, the average rating, is calculated either. Is it an average over all time? Does it only consider the last month? Or is it weighted such that more recent reviews have a larger effect on the average? A lot of the information lost is in the time domain: the change in a business's ratings over time. Presumably, a change in ownership or management could change the quality of a business, positively or negatively. Also, a business that just opened might get poor reviews but improve over time by addressing feedback or as the staff gains experience. These sorts of changes should be present in user reviews on Yelp. I'd like to find a way to see these changes to get a better sense of the quality of a business.
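As one illustration of the weighting question, here's a toy sketch (purely my own assumption, not Yelp's actual method) of an exponentially weighted average, which lets recent reviews pull the score more strongly than old ones:

```python
def weighted_rating(ratings, alpha=0.3):
    """Exponentially weighted average; ratings ordered oldest -> newest.

    alpha controls how much a new review moves the running average.
    """
    avg = ratings[0]
    for r in ratings[1:]:
        avg = alpha * r + (1 - alpha) * avg
    return avg

ratings = [2, 2, 3, 4, 5, 5]  # a business improving over time
print(round(weighted_rating(ratings), 2))       # 3.93
print(round(sum(ratings) / len(ratings), 2))    # plain average: 3.5
```

For this hypothetical improving business, the time-weighted score (3.93) reflects the recent quality much better than the all-time average (3.5), which is exactly the kind of information a single flat average hides.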