# novalis comments on Why We Can't Take Expected Value Estimates Literally (Even When They're Unbiased) - Less Wrong

67 18 August 2011 11:34PM

## Comments (243)

Comment author: 18 August 2011 05:30:18PM 8 points [-]

How Not To Sort By Average Rating explains how you should actually choose which restaurant to go to. BeerAdvocate's method is basically a hack, with no general validity. Why is the minimum number of reviews ten? That number should in fact depend on the variance of reviews, I think.
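For readers who haven't seen the linked article: as I understand it, it recommends ranking by the lower bound of the Wilson score interval on the fraction of positive votes. A minimal sketch, assuming binary (positive/negative) votes rather than BeerAdvocate's numeric scale; the restaurant names and vote counts are made up for illustration.

```python
import math

def wilson_lower_bound(positive: int, total: int, z: float = 1.96) -> float:
    """Lower bound of the Wilson score interval for a proportion.

    z = 1.96 corresponds to a two-sided 95% interval.
    """
    if total == 0:
        return 0.0
    p_hat = positive / total
    z2 = z * z
    centre = p_hat + z2 / (2 * total)
    margin = z * math.sqrt(p_hat * (1 - p_hat) / total + z2 / (4 * total * total))
    return (centre - margin) / (1 + z2 / total)

# Hypothetical data: (name, positive votes, total votes)
restaurants = [("A", 2, 2), ("B", 90, 100), ("C", 9, 10)]

# Sort by the lower bound, best first. A perfect 2-for-2 record ranks last
# because two votes say very little.
ranked = sorted(restaurants, key=lambda r: wilson_lower_bound(r[1], r[2]), reverse=True)
for name, pos, total in ranked:
    print(name, round(wilson_lower_bound(pos, total), 3))
```

The effective penalty for small samples comes out of the interval itself, instead of being a fixed cutoff like "ten reviews".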

Comment author: 18 August 2011 08:21:16PM 5 points [-]

This is the frequentist answer to the same question. Cue standard Bayesian vs. frequentist debate.

Of course, you're right that BeerAdvocate could get more accurate rankings with a more fine-tuned prior, but other than that I don't see what's wrong with their method.
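To make "prior" concrete: the sketch below assumes BeerAdvocate's formula has the usual shrinkage form, where each beer's average is padded with m pseudo-reviews at the site-wide mean. The function name, the site mean of 3.7, and the example beers are illustrative assumptions, not taken from their site.

```python
def weighted_rating(mean_rating: float, num_reviews: int,
                    prior_mean: float, m: float = 10) -> float:
    """Shrink a beer's mean rating toward prior_mean.

    m acts as the weight of the prior, measured in pseudo-reviews:
    with m = 10, a beer needs 10 real reviews before its own average
    counts as much as the site-wide mean.
    """
    if num_reviews + m == 0:
        raise ValueError("need at least one review or a positive m")
    return (num_reviews * mean_rating + m * prior_mean) / (num_reviews + m)

site_mean = 3.7  # hypothetical site-wide average

# Two reviews at 5.0 barely move off the prior; 200 reviews at 4.5 dominate it.
print(weighted_rating(5.0, 2, site_mean))
print(weighted_rating(4.5, 200, site_mean))
```

Read this way, a "more fine-tuned prior" would mean choosing m (and possibly the prior mean) from the data rather than fixing m at ten.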

Comment author: 31 August 2011 10:47:46PM 1 point [-]

I think the "basically a hack" argument isn't entirely without merit in this case, Bayesian or frequentist - from what is said in the article, BeerAdvocate chose m (the minimum number of reviews) without a lot of attention to:

• frequentist hat: the relative rate of Type I and Type II errors.

• Bayesian hat: the relative probability of a rating increasing versus decreasing with the addition of more reviews.

Comment author: 18 August 2011 08:34:00PM -1 points [-]

Well, that's just the entire point of this LW post -- what prior to choose matters a lot. It even matters specifically in the case of BeerAdvocate, who apparently got appreciably different results from changing their value of 10 reviews.

Comment author: 18 August 2011 08:42:22PM *  3 points [-]

And the solution you suggest is to just go with p=0.05 and pretend this problem is avoidable, right?

Comment author: 18 August 2011 09:26:19PM 1 point [-]

I think I specifically said that variance matters. I'll also say that your application matters -- when choosing beers, I would be OK with p much worse than 0.05 since I can afford to order another beer. When choosing charities, it is a harder question.

Comment author: 18 August 2011 07:04:49PM 3 points [-]

Of course, the use of the parameter .95 is pretty arbitrary as well. :)

Comment author: 18 August 2011 07:47:40PM 0 points [-]

P = 0.05 is the standard value for "statistically significant" in science articles, so it's actually not that arbitrary. The website does also explain how to adjust for a different statistical certainty if desired :)

Comment author: 18 August 2011 08:59:05PM 15 points [-]

That's precisely why it is arbitrary -- it's a cultural artifact, not an inherently meaningful level.

Comment author: 18 August 2011 09:46:14PM 4 points [-]

What would an "inherently meaningful" confidence level look like?

Comment author: 27 December 2011 09:43:15PM 2 points [-]

necroreply: Back up to the actual use of the data, which is identification of tasty beers - an "inherently meaningful" confidence level is one which provides the most useful recommendations to the end user. This is reflected in the way the post describes BeerAdvocate changing their system - they had their confidence level set so high that only extremely popular beers could move significantly away from the average, and they concluded that this was reducing the value of their ratings.

Comment author: 12 January 2012 11:27:12PM 1 point [-]

Fair, but I think capturing that is possibly beyond the scope of their article. If you can come up with a good way to evaluate that beyond gut instinct and vague heuristics on how a specific data set "ought" to behave/look, I would love to hear it - it's been an area I've had trouble with before :)

Comment author: 13 January 2012 02:55:06AM 0 points [-]

I can think of two possibilities right off the bat - there are probably others (customer satisfaction surveys?) that I'm not thinking of that would work:

1. Measure the ability of the scoring rubric to correlate with trusted expert rankings.

2. Measure the ability of the scoring rubric to predict future votes.

(Of course, 2 has the problem that it is basically measuring the variable that Bayesians maximize...)
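A rough sketch of what option 2 could look like in practice: split each item's votes into "early" and "later", score the item from the early votes with a given prior weight m, and see which m best predicts the later average. Everything here is an assumption for illustration, including the synthetic data, the candidate values of m, and the use of squared error.

```python
import random

def shrunk_mean(votes, prior_mean, m):
    """Average of `votes` padded with m pseudo-votes at prior_mean."""
    return (sum(votes) + m * prior_mean) / (len(votes) + m)

def holdout_error(items, m, prior_mean):
    """Mean squared error when early votes (shrunk with m) predict the later mean.

    `items` is a list of (early_votes, later_votes) pairs, both non-empty.
    """
    errors = []
    for early, later in items:
        predicted = shrunk_mean(early, prior_mean, m)
        observed = sum(later) / len(later)
        errors.append((predicted - observed) ** 2)
    return sum(errors) / len(errors)

def make_synthetic_items(n_items=200, seed=0):
    """Hypothetical data: each item has a true quality; votes are noisy draws around it."""
    rng = random.Random(seed)
    items = []
    for _ in range(n_items):
        quality = rng.gauss(3.7, 0.4)
        early = [rng.gauss(quality, 1.0) for _ in range(rng.randint(1, 20))]
        later = [rng.gauss(quality, 1.0) for _ in range(50)]
        items.append((early, later))
    return items

items = make_synthetic_items()
# Use the pooled mean of the early votes as the prior mean.
prior_mean = sum(sum(e) for e, _ in items) / sum(len(e) for e, _ in items)
candidates = [0, 1, 2, 5, 10, 20, 50]
best_m = min(candidates, key=lambda m: holdout_error(items, m, prior_mean))
print("prior weight with lowest held-out error:", best_m)
```

On real data the split would be by time, and the answer would depend on how noisy individual votes are relative to the spread between items, which is the "variance matters" point made earlier in the thread.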

Comment author: 19 January 2012 12:20:17AM *  0 points [-]

Item 1 would only seem useful when you have sufficient trusted expert rankings to calibrate, but still need to use the votes to extrapolate elsewhere (and where you expect trusted experts to align with your audience - if experts routinely downvote dark ales, and your audience prefers them, you're going to get a wonky heuristic). Basically, at that point, you're JUST using votes as a method to try predicting and extrapolating expert rankings, and I'd expect there are usually better heuristics for that which don't require user votes.

Item 2 strikes me as clever and ideal, but I'd think you'd need quite a lot of data before you'd be able to actually calibrate that. So you're stuck using 0.05 until you have quite a lot of data.

(Customer satisfaction surveys, etc. also run into the "resource intensive" issue)

(edit: apparently pound makes the whole row a header or something)

Comment author: 19 January 2012 03:38:00AM 0 points [-]

> Item 1 would only seem useful when you have sufficient trusted expert ranking to calibrate, but still need to use the votes to extrapolate elsewhere [...]

Exactly. Remember, the whole point of this procedure is to tweak how much credibility you give to voters as a function of the number of voters you have - the only reason I mention experts is that they bypass the sample size problem.

> (and where you expect trusted experts to align with your audience - if experts routinely downvote dark ales, and your audience prefers them, you're going to get a wonky heuristic)

Okay, that's a problem. I think it falls as a subset of the earlier problem of finding trusted expert rankings, however.

> Item 2 strikes me as clever and ideal, but I'd think you'd need quite a lot of data before you'd be able to actually calibrate that. So you're stuck using 0.05 until you have quite a lot of data.

If you don't have a lot of data, you're not going to have much to offer your users anyway.

Comment author: 18 August 2011 10:04:32PM 1 point [-]

I'm not sure there is one.

It seems to me that posterior probability density : confidence interval :: topographical map : contour. (Roughly, ignoring the important distinction between confidence intervals and credible intervals.) They're useful summaries, but discard much information. Different choices of contours or confidence intervals may be more or less useful for particular problems.

Comment author: 18 August 2011 10:10:30PM 1 point [-]

It seems like the most useful rating system would be to show the whole topography, i.e. the full distribution of ratings, then? (Amazon and NewEgg both do this, but only once you've gone into the details of a review.)

For a simple one-value summary, it seems like this is probably a pretty good formula. You can, as mentioned, adjust the confidence if 95% gives you trouble with your data set.

It seems like "this is what scientific papers go with" is pretty sane as far as defaults go, and as "non-arbitrary" as a default value really could be.

Comment author: 18 August 2011 10:39:05PM *  0 points [-]

I wouldn't expect there to be one -- though you could always use 1/2 -- but the original requirement is not for a confidence level but for a way of generating a total ordering. I.e. the arbitrariness is that you have a free parameter in the first place, not so much the choice of it. That's what makes it a free parameter, really!
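The free-parameter point is easy to see with a toy case: the same two items can swap places depending on the confidence level. This sketch assumes ranking by the Wilson lower bound on binary votes; the items and their vote counts are invented.

```python
import math
from statistics import NormalDist

def z_for_confidence(confidence: float) -> float:
    """z-score for a two-sided interval at the given confidence level."""
    return NormalDist().inv_cdf(1 - (1 - confidence) / 2)

def wilson_lower_bound(positive: int, total: int, z: float) -> float:
    """Lower bound of the Wilson score interval for a proportion."""
    p_hat = positive / total
    z2 = z * z
    centre = p_hat + z2 / (2 * total)
    margin = z * math.sqrt(p_hat * (1 - p_hat) / total + z2 / (4 * total * total))
    return (centre - margin) / (1 + z2 / total)

# Hypothetical items: name -> (positive votes, total votes)
items = {"X": (9, 10), "Y": (80, 100)}

def ranking(confidence: float) -> list:
    """Item names ordered by Wilson lower bound, best first."""
    z = z_for_confidence(confidence)
    return sorted(items, key=lambda name: wilson_lower_bound(*items[name], z), reverse=True)

print(ranking(0.50))  # X first: at 50% the small sample is penalised only lightly
print(ranking(0.95))  # Y first: at 95% the larger sample wins
```

Neither ordering is "the" correct one; the confidence level decides how much ten votes at 90% are worth against a hundred votes at 80%.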

Comment author: 19 August 2011 06:01:22AM 6 points [-]

But because it is the standard value, you can be more confident that they didn't "shop around" for the p value that was most convenient for the argument they wanted to make. It's the same reason people like to see quarterly data for a company's performance - if a company is trying to raise capital and reports its earnings for the period "January 6 - April 12", you can bet that there were big expenses on January 5 and April 13 that they'd rather not include. This is much less of a worry if they are using standard accounting periods.

Comment author: 19 August 2011 10:21:54PM 1 point [-]

It's still a pretty significant worry. If you know that some fiscal quarter or year will be used to qualify you for something important, it is often possible to arrange for key revenue and expenses to move around the boundaries to suit what you wish to portray in your report.

Comment author: 19 August 2011 07:34:17AM 1 point [-]

That's true. Arbitrary means different things, from "not chosen by nature", to "not chosen by an outside standard".

Comment author: 19 August 2011 01:30:13PM 2 points [-]

I think if we tabooed (taboo'd?) "arbitrary", we would all find ourselves in agreement about our actual predictions.

Comment author: 19 August 2011 12:24:45PM 2 points [-]

In this case, the fact that it evidently was chosen to conform to scientific culture, and not for some ulterior motive, is Bayesian evidence in favor of the validity of the frequentist methodology.