Matatat.org

Yelp Rating Timelines

For a while I've been thinking about Yelp reviews, in particular about the information lost by distilling the reviews down to one number. It isn't clear how this number, the average rating, is calculated either. Is it an average over all time? Does it only consider the last month? Or is it weighted so that more recent reviews have a larger effect on the average?

A lot of the information lost is in the time domain, the change in a business's ratings over time. Presumably, a change in ownership or management could result in a change in the quality of a business, positively or negatively. Likewise, a business that just opened might get poor reviews at first but improve over time by addressing feedback or as the staff gains experience. These sorts of changes should be visible in user reviews on Yelp. I'd like to find a way to see them and get a better sense of the quality of a business.
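
Just to make the idea concrete before getting into the data, here's a minimal sketch of the kind of timeline I have in mind: a rolling average of a business's star ratings ordered by review date. The ratings below are made up for illustration, not real Yelp data.

    import pandas as pd

    # Made-up review dates and star ratings for a single business
    reviews = pd.DataFrame({
        'date': pd.to_datetime(['2014-01-05', '2014-02-11', '2014-03-20',
                                '2014-05-02', '2014-06-15', '2014-08-01']),
        'rating': [2, 3, 2, 4, 5, 4]
    })

    # Order by date and smooth with a rolling mean over the last three reviews
    timeline = (reviews.sort_values('date')
                       .set_index('date')['rating']
                       .rolling(window=3, min_periods=1)
                       .mean())
    print(timeline)

A plot of a series like this against time is what I mean by a rating timeline.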

First, we'll need to grab a bunch of reviews from Yelp. Yelp provides an API for searching and an API for requesting business data. An API is an Application Programming Interface, basically a set of instructions for interfacing with Yelp's data. The way these work is similar to viewing a web page. When you point your browser to a website, you do it with a URL (http://www.yelp.com for instance). Yelp sends back data containing HTML, CSS, and JavaScript, and your browser uses this data to construct the page that you see. The API works similarly: you request data with a URL (http://api.yelp.com/stuff), but instead of getting HTML and such, you get data formatted as JSON.

Using the APIs, we can search like you would on the website, but through Python code. We can also request data about businesses we find from searching. I'll get started by creating a function to perform API requests.
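
As a rough sketch of what such a function might look like, using the requests library. The endpoint and parameters here are just placeholders, and Yelp's API also requires authentication, which I'm leaving out for brevity.

    import requests

    API_HOST = 'http://api.yelp.com'

    def request(path, params=None):
        """Send a GET request to the API and return the parsed JSON response.

        path is the endpoint (a placeholder like '/v2/search' here) and
        params is a dict of query parameters. Authentication is omitted.
        """
        response = requests.get(API_HOST + path, params=params)
        response.raise_for_status()   # complain loudly about bad status codes
        return response.json()

    # Illustrative usage:
    # results = request('/v2/search', {'term': 'coffee', 'location': 'Berkeley, CA'})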

Men have journals, women have diaries

Last night I found an old Moleskine notebook I had hidden under some papers on my desk. As I looked through it, I found a few pages of random projects I had worked on once. Thinking of my girlfriend's recent nights spent writing in her diary, I decided to use ...

Hypothesis testing and the origin of p-values

Every time we present results in science, we must also address our uncertainty. This is true any time a measurement is presented, such as in election polling, which is why you see polls that report "56% of likely voters prefer candidate A with a margin of 4 points." A result without a statement of its uncertainty is basically meaningless.
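
As a rough aside on where a margin like that comes from (assuming a simple random sample and a proportion near 50%): the 95% margin of error is about \(1.96\sqrt{p(1-p)/n}\), which for a poll of roughly 600 people works out to about 4 points.

    import numpy as np

    # Rough 95% margin of error for a poll proportion near 50%
    n = 600   # number of people polled (illustrative)
    margin = 1.96 * np.sqrt(0.5 * 0.5 / n)
    print('{:.1%}'.format(margin))   # about 4%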

There are two types of uncertainty we come across in experiments: systematic errors and random errors. A systematic error is uncertainty introduced by the experimental process itself. For instance, a measurement device could be miscalibrated, or there could be a loose cable that results in faster-than-light neutrinos. You can also introduce systematic errors through the model you choose. Often, you won't be able to account for every factor affecting your measurements, and this will introduce uncertainty into the results.

Random errors are due to fluctuations in the measurement. The uncertainty from random errors can be reduced by repeating the experiment. This type of uncertainty can be quantified by statistical analysis, unlike systematic errors.
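
As a quick toy illustration of that last point: if you repeat a noisy measurement \(n\) times, the standard error of the mean falls off roughly like \(\sigma/\sqrt{n}\). The numbers below are simulated, just to show the effect.

    import numpy as np

    np.random.seed(42)
    true_value, noise = 10.0, 2.0

    # Repeat a noisy measurement n times and look at the uncertainty of the mean
    for n in [5, 50, 500]:
        measurements = true_value + noise * np.random.randn(n)
        std_err = measurements.std(ddof=1) / np.sqrt(n)
        print('n = {:3d}: mean = {:.2f} +/- {:.2f}'.format(n, measurements.mean(), std_err))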

Many experiments, especially in the life sciences, report uncertainty using hypothesis testing. Hypothesis testing compares experimental results against the null hypothesis that the effect doesn't exist. This is done because we know that, due to random errors, we might see a large experimental result even if there is no true effect.

For instance, if you are testing a pharmaceutical drug, you typically have a control (placebo) group and a treatment group. You find that the drug has some effect - it lowers cholesterol in the treatment group, maybe - and then you ask, "What is the probability I would see this effect due to random fluctuations if there were actually no effect?" Here the null hypothesis is that the two groups have equal means, \(\mu_c = \mu_t\). The analysis of this question leads to a p-value, the probability of seeing an equal or greater effect under the null hypothesis. When the p-value is below some critical value, typically \(p < 0.05\), the result is declared statistically significant and the null hypothesis is rejected. The p-value is at the heart of a massive controversy currently occurring in science, to the point where some journals are banning hypothesis testing completely.
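
For example, with two groups of measurements the standard two-sample t-test gives exactly this kind of p-value. Here's a quick sketch using SciPy and made-up cholesterol numbers:

    import numpy as np
    from scipy import stats

    np.random.seed(0)

    # Made-up cholesterol levels for a control (placebo) and a treatment group
    control = np.random.normal(200, 20, size=50)
    treatment = np.random.normal(190, 20, size=50)

    # Test the null hypothesis that the two groups have equal means
    t_stat, p_value = stats.ttest_ind(control, treatment)
    print('t = {:.3f}, p = {:.4f}'.format(t_stat, p_value))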

I will offer my opinions about p-values later, but for now I want to discuss where the p-value comes from and how it is calculated. Partly I'm doing this because I've seen few places online that explain how it is calculated, presumably because it is considered too complicated, which, as you will see, is not true. These days, most statistical testing is done with software packages like R, using methods that are basically black boxes. You put your data in, out comes a p-value, and then you either get to publish your results or collect more data. Here, I will try to dig into that black box and reveal what is happening.
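
As a preview of what's inside that box, here is one common case, the two-sample t-test assuming equal variances, computed directly: you form a t-statistic from the difference in group means and a pooled standard deviation, then ask how much of Student's t distribution lies beyond it.

    import numpy as np
    from scipy import stats

    def two_sample_p_value(a, b):
        """Two-sided p-value for the equal-variance two-sample t-test,
        computed directly from the t-statistic and the t distribution."""
        n1, n2 = len(a), len(b)
        # Pooled variance of the two samples
        sp2 = ((n1 - 1) * np.var(a, ddof=1) + (n2 - 1) * np.var(b, ddof=1)) / (n1 + n2 - 2)
        # t-statistic: difference in means scaled by its standard error
        t = (np.mean(a) - np.mean(b)) / np.sqrt(sp2 * (1.0 / n1 + 1.0 / n2))
        # Probability of a result at least this extreme under the null hypothesis
        return 2 * stats.t.sf(abs(t), n1 + n2 - 2)

On the same data this should match the p-value returned by scipy.stats.ttest_ind above.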

Hosting Sphinx documentation on GitHub

In my down time, I've been writing documentation for Sampyl, a necessary and sometimes fun task. I built the documentation with Sphinx, a very nice package that lets you focus on the content. Then I wanted to find somewhere to host the documentation online for free. My first attempt was with Read the Docs, but after running into some problems getting it to build my documentation, I tried hosting on GitHub. That didn't work immediately either, but once I fixed the issues, I wanted to share my experience to help guide others.

Predicting body fat percentage

For the past few months, I've been commuting to work on my bicycle. I've always been a walker, but I've been out of shape and slowly gaining fat for some time now. The new activity has led to some obvious weight loss, which has inspired me to keep working at it and track my progress. As part of this, I wanted to measure my percent body fat using tools I have around my apartment. You can find calculators on the internet which give you a single estimate. Being a scientist, though, I want some sense of the uncertainty in that estimate. So I decided to build my own model from data, one I can use to get an estimate of my body fat percentage along with its uncertainty.

I found data from a study that recorded the body density and various anatomical measurements (such as neck and chest circumference) of a group of men. From my research, I found that body density can be measured accurately using water or air displacement. However, converting density into body fat percentage is less clear-cut, because you must assume something about the proportions of lean and fatty tissue. There are more than a few methods, but for this analysis I'm going to use Brozek's method.
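
For reference, Brozek's formula converts body density \(\rho\) (in g/cm³) to percent body fat as \(457/\rho - 414.2\). A tiny helper for that conversion:

    def brozek_body_fat(density):
        """Convert body density (g/cm^3) to percent body fat
        using Brozek's formula: %BF = 457 / density - 414.2."""
        return 457.0 / density - 414.2

    # Example: a body density of 1.06 g/cm^3 is roughly 17% body fat
    print(brozek_body_fat(1.06))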

First, we'll import the packages we need and then load the data.
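
Something along these lines, where the file name is just a placeholder for wherever the study data lives:

    import numpy as np
    import pandas as pd

    # Load the body fat study data (placeholder file name)
    data = pd.read_csv('body_fat_data.csv')
    print(data.head())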

A/B Testing with Sampyl

I've been working on building a Bayesian model to infer the firing rate of neurons over time. Python has a few packages available which use Markov Chain Monte Carlo (MCMC) methods to sample from the posterior distribution of a Bayesian model. The one I'm most familiar with is ...