In How Not To Sort By Average Rating, Evan Miller presents two wrong ways to generate an aggregate rating from a collection of positive and negative votes, and one method he considers correct. But the "correct" method is complicated, poorly motivated, insufficiently parameterized, and founded on frequentist statistics. A much simpler model, based on a beta prior distribution, has a more solid theoretical foundation and would give more accurate results.
Evan mentions the sad reality that big organizations are using obviously naive methods; in contrast, more dynamic sites such as Reddit have adopted the model he suggests. But I fear that irreparable damage will be done if the world settles on this solution.
Should anything be done about it? What can be done?
This is also somewhat meta, in that LW itself aggregates ratings, and I believe changing the model was once discussed (and maybe the beta model was suggested).
In the Bayesian model, as in Evan's, we assume that for every item there is some true probability p of being upvoted, representing its quality and the rating we wish to estimate. Every vote is a Bernoulli trial which gives information about p. The prior for p is a beta distribution with some parameters a and b. After observing n actual votes, of which k are positive, the posterior distribution is Beta(a+k, b+(n-k)), so the posterior mean of p is (a+k)/(a+b+n). This is the best estimate of the true quality, and it reproduces the desired behavior: as votes accumulate the score converges to the observed proportion of positive votes, while items with insufficient data are pulled towards the prior mean.
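As a minimal sketch of this scoring rule (the parameters a and b are as in the text; the function name and type hints are my own):

```python
def quality_score(upvotes: int, downvotes: int, a: float, b: float) -> float:
    """Posterior mean of an item's upvote probability p under a Beta(a, b) prior.

    With k = upvotes positive votes out of n = upvotes + downvotes Bernoulli
    trials, the posterior is Beta(a + k, b + (n - k)), whose mean is
    (a + k) / (a + b + n).
    """
    return (a + upvotes) / (a + b + upvotes + downvotes)
```

An item with no votes scores the prior mean a/(a+b), and the more votes it gathers, the closer its score gets to its observed fraction of upvotes.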
The specific parameters a and b depend on the quality distribution in the specific system. a/(a+b) is the average quality, and can be taken to be simply the empirical proportion of positive votes among all votes in the system. a+b is an inverse measure of variance - a high value means most items are of average quality, and a low value means items are either extremely good or extremely bad. It is harder to calibrate, but can still be estimated from the overall data (e.g., by maximum likelihood over the entire voting history).
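One concrete way to calibrate both parameters at once is empirical Bayes: maximize the beta-binomial marginal likelihood of the site's per-item vote counts. A rough sketch, assuming SciPy is available (the function name and the choice of optimizer are mine):

```python
import numpy as np
from scipy.optimize import minimize
from scipy.special import betaln

def fit_beta_prior(upvotes, totals):
    """Maximum-likelihood Beta(a, b) prior fitted to per-item vote counts.

    Each item contributes a beta-binomial likelihood for observing upvotes[i]
    positive votes out of totals[i]; the binomial coefficient is constant in
    (a, b) and therefore omitted.
    """
    upvotes = np.asarray(upvotes, dtype=float)
    totals = np.asarray(totals, dtype=float)

    def neg_log_lik(log_params):
        a, b = np.exp(log_params)  # work in log-space to keep a, b > 0
        return -np.sum(betaln(upvotes + a, totals - upvotes + b) - betaln(a, b))

    result = minimize(neg_log_lik, x0=np.log([1.0, 1.0]), method="Nelder-Mead")
    a, b = np.exp(result.x)
    return a, b
```

Both parameters are fitted jointly here; alternatively one can pin a/(a+b) to the overall upvote proportion, as suggested above, and fit only a+b.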
For the specific problem of sorting, there are other considerations besides mere quality. A comment can enjoy universal agreement yet not be otherwise interesting or notable; such comments may not deserve as prominent a place as controversial ones which provoke stronger reactions. For this purpose, the sorting rating can be multiplied by some function of the total number of votes, such as the square root. If the identity function is used instead, this becomes similar to simply taking the difference between the numbers of positive and negative votes.
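Continuing the sketch above (this reuses the quality_score function from earlier; the square root is just one of the weighting choices the text mentions):

```python
import math

def sort_score(upvotes: int, downvotes: int, a: float, b: float) -> float:
    """Quality estimate weighted by attention: posterior mean times sqrt(total votes)."""
    n = upvotes + downvotes
    return quality_score(upvotes, downvotes, a, b) * math.sqrt(n)
```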
By "insufficiently parameterized" I mean that the model used per item doesn't have enough parameters to encode what we know about the specific domain (where a domain is "Reddit comments", "Urban Dictionary definitions", etc.).
The formulas under discussion define a mapping from pairs (positive votes, negative votes) to a quality score. In Miller's model, the same mapping is used everywhere, without regard to the characteristics of the specific domain. In my model, there are parameters a and b (or, alternatively, a/(a+b) and a+b) that we first fit per domain and then apply per item.
For example, let's say you want to decide the order of a (5, 0) item and a (40, 10) item. Miller's model gives just one answer. My model gives different answers, depending on the following (a numerical check follows the list):
The average quality - if the overall item quality is high (say, most items have 100% positive votes), the (5,0) item should be higher because it's likely one of those 100% items, while (40,10) has proven itself to be of lower quality. If, however, most items have low quality, (40,10) will be higher because it has proven itself to be one of the rare high-quality items, while (5,0) is more likely to be a low-quality item which lucked out.
The variance in quality - say the average quality is 50%. If the variance in quality is low, (5,0) will be lower because it is likely to be an average item which lucked out, while (40, 10) has proven to be of high quality. If the variance is high (with most items being either 100% or 0%), (5,0) will be higher because in all likelihood it is one of the 100% items, while (40, 10) has proven to be only 80%.
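To make this concrete, here is a small check of the four scenarios above, using the quality_score sketch from earlier; the particular prior parameters are mine, chosen only to illustrate each regime:

```python
# Hypothetical priors, one per scenario described above.
scenarios = {
    "high average quality (a=9, b=1)":        (9.0, 1.0),
    "low average quality (a=1, b=9)":         (1.0, 9.0),
    "mean 50%, low variance (a=20, b=20)":    (20.0, 20.0),
    "mean 50%, high variance (a=0.1, b=0.1)": (0.1, 0.1),
}
for name, (a, b) in scenarios.items():
    s5_0 = quality_score(5, 0, a, b)
    s40_10 = quality_score(40, 10, a, b)
    winner = "(5, 0)" if s5_0 > s40_10 else "(40, 10)"
    print(f"{name}: (5,0) -> {s5_0:.3f}, (40,10) -> {s40_10:.3f}, higher: {winner}")
```

Under the first and last priors (5, 0) comes out on top; under the other two (40, 10) does, matching the reasoning above.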
In short, using a cookie-cutter model with no domain-specific parameters doesn't make the most efficient possible use of the data.