In my journeys across the land, I have, to date, encountered four sets of probability calibration tests. (If you just want to make bets on your predictions, you can use Intrade or another prediction market, but these generally don't record calibration data, only which of your bets paid out.) If anyone knows of other tests, please do mention them in the comments, and I'll add them to this post. To avoid spoilers, please do not post what you guessed for the calibration questions, or what the answers are.

The first, to boast shamelessly, is my own, at http://www.acceleratingfuture.com/tom/?p=129. My tests use fairly standard trivia questions (samples: "George Washington actually fathered how many children?", "Who was Woody Allen's first wife?", "What was Paul Revere's occupation?"), with an emphasis on history and pop culture. The quizzes are scored automatically (by computer), and you assign a probability of 96%, 90%, 75%, 50%, or 25% to each of your answers. There are five quizzes with fifty questions each: Quiz #1, Quiz #2, Quiz #3, Quiz #4 and Quiz #5.

The second is a project by John Salvatier (LW account) of the University of Washington, at http://calibratedprobabilityassessment.org/. There are three sets of fifty questions each: two sets of general trivia, and one set about relative distances between American cities (the fourth set, unfortunately, does not appear to be working at this time). The questions do not rotate, but are re-ordered upon refreshing. The probabilities are again multiple choice, with ranges of 51-60%, 61-70%, 71-80%, 81-90%, and 91-100%, for whichever answer you think is more probable. These quizzes are also scored by computer, but instead of spitting back numbers, the computer generates a graph showing the discrepancy between your real accuracy rate and your claimed accuracy rate. Links: US cities, trivia #1, trivia #2.
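
The site's internals aren't public, but the idea behind that graph is easy to reproduce from your own records. Here is a minimal Python sketch (the record format is hypothetical) that buckets answers by claimed probability and compares claimed vs. observed accuracy:

```python
from collections import defaultdict

# Hypothetical record of quiz answers: (claimed probability, whether you were right).
answers = [(0.55, True), (0.65, False), (0.75, True), (0.95, True), (0.95, False)]

buckets = defaultdict(list)
for claimed, correct in answers:
    buckets[claimed].append(correct)

for claimed in sorted(buckets):
    results = buckets[claimed]
    observed = sum(results) / len(results)
    print(f"claimed {claimed:.0%}: observed {observed:.0%} over {len(results)} answers")
```

Plotting observed accuracy against claimed probability for each bucket, with the diagonal as the reference line, gives the same kind of picture the site draws.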

The third is a quiz by Steven Smithee of Black Belt Bayesian (LW account here) at http://www.acceleratingfuture.com/steven/?p=96. There are three sets, of five questions each, about history, demographics, and Google rankings, and two sets of (non-testable) questions about the future and historical counterfactuals. (EDIT: Steven has built three more tests in addition to this one, at http://www.acceleratingfuture.com/steven/?p=102, http://www.acceleratingfuture.com/steven/?p=106, and http://www.acceleratingfuture.com/steven/?p=136). This test must be graded manually, and the answers are in one of the comments below the test (don't look at the comments if you don't want spoilers!).

The fourth is a website by Tricycle Developments, the web developers who built Less Wrong, at http://predictionbook.com/. You can make your own predictions about real-world events, or bet on other people's predictions, at whatever probability you want, and the website records how often you were right relative to the probabilities you assigned. However, since all predictions are made in advance of real-world events, it may take quite a while (on the order of months to years) before you can find out how accurate you were.


Advice for future creators of tests: There are people who live outside the US. No one outside the US cares about the 3rd person to be the second dead uncle of the fourth president of the US.

For instance, a majority of tommccabe's quiz questions are highly US-specific.

The point here is that non-Americans will end up guessing almost all questions, making the whole exercise painful and useless.


The best calibration exercises I was able to find, IMO (which also work for non-Americans), can be downloaded from the website of How to Measure Anything.

http://www.howtomeasureanything.com/

Noted, but I didn't write those questions; they were taken from the open-source MisterHouse project. If you know of any sources of free trivia questions that aren't US-specific, please do PM me.

You can get a ton of free non-U.S.-specific trivia from the CIA World Factbook.

It seems like it'd be pretty easy to write your own trivia questions by permitting yourself to surf Wikipedia for a while and extract facts from the articles. What's the advantage to trivia questions you don't write yourself - just speed, or something else too?

Just speed. At two minutes per trivia question it would take a full day to make another set of 250.

Why isn't there a 33% option for your test? What if I'm pretty certain that 1 of the answers is wrong, but have no clue which of the others is most likely to be right? Then my confidence is exactly 33%, and I have to either overestimate or underestimate it. The 50% and 25% options seem to cover the other two versions of this scenario (I can eliminate either 2 or 0 of the options almost certainly) but this appears to be a gap.

(incidentally, this only occurred to me because it happened to be the case for the first question on the first of your quizzes...)

There probably should be, mea culpa.

Part of the output of your quizzes is a line of the form "Your chance of being well calibrated, relative to the null hypothesis, is 50.445538580926 percent." How is this number computed?

I chose "25% confident" for 25 questions and got 6 of them (24%) right. That seems like a pretty good calibration ... but 50.44% chance of being well calibrated relative to null doesn't seem that good. Does that sentence mean that an observer, given my test results, would assign a 50.44% probability to my being well calibrated and a 49.56% probability to my not being well calibrated? (or to my randomly choosing answers?) Or something else?
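
For reference, a standard way to test this sort of thing (not necessarily what the quiz actually does; this is only my guess at one sensible interpretation) is a two-sided binomial test of the hit count against the claimed rate:

```python
from scipy.stats import binomtest

# My numbers from above: 6 correct out of 25 answers marked "25% confident".
result = binomtest(k=6, n=25, p=0.25, alternative="two-sided")

# A p-value near 1 means the data are consistent with the claimed 25% rate.
# Note this is NOT "the probability of being well calibrated" -- turning it
# into that kind of statement would require a prior over calibration levels.
print(result.pvalue)
```

That still wouldn't explain where the 50.44% figure comes from, hence the question.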

It's also completely ridiculous, with a sample size of ~10 questions, to give the success rate and probability of being well calibrated as percentages with 12 decimals. Since the uncertainty in such a small sample is on the order of several percent, just round to the nearest percentage.

It probably just computes it as a float and then prints the whole float.

(I do recognize the silliness of replying to a three-year old comment that itself is replying to a six-year old comment.)

It's not silly. I still find these newer comments useful.

And here we are one year later!

Yes, do it for posterity!

I would like to chime in and point out that, as of today, the domain "acceleratingfuture (dot) com" is owned by a Russian bookmaker.

Just launched my own version of a calibration test here: https://calibration.lazdini.lv/. It is pretty much identical to http://confidence.success-equation.com/, except the questions should be different each time you visit the site, allowing for regular calibration and recalibration. Questions are retrieved from the free API provided by https://opentdb.com/.
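
For anyone who wants to do something similar, pulling questions from that API takes only a few lines. A minimal sketch (the handling of the response fields reflects my reading of the opentdb.com docs):

```python
import html
import json
import random
import urllib.request

# Fetch ten multiple-choice questions from the Open Trivia Database.
url = "https://opentdb.com/api.php?amount=10&type=multiple"
with urllib.request.urlopen(url) as resp:
    data = json.load(resp)

for item in data["results"]:
    options = item["incorrect_answers"] + [item["correct_answer"]]
    random.shuffle(options)  # don't leak the answer through option order
    print(html.unescape(item["question"]))  # questions/answers arrive HTML-entity encoded by default
    for option in options:
        print("  -", html.unescape(option))
```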

I would like to see a calibration test with open-ended questions rather than multiple choice. Multiple choice makes it easier to judge confidence, but I'm afraid the calibrations won't transfer well to other domains.

(The test-taker would have to grade their test, since open ended questions may have multiple answers, and typos and minor variations shouldn't count as errors. But other than that, the test would be pretty much the same.)

An open-ended probability calibration test is something I've been planning to build. I'd be curious to hear your thoughts on how the specifics should be implemented. How should test-takers grade their own tests in a way that avoids bias and still gives useful results?

I have seen a problem with selection bias in calibration tests, where trick questions are overrepresented. For example, in this PDF article, the authors ask subjects to provide a 90% confidence interval estimating the number of employees IBM has. They find that fewer than 90% of subjects select a suitable range, which they conclude results from overconfidence. However, IBM has almost 400,000 employees, which is atypically high (more than 4x Microsoft). The results of this study have just as much to do with the question asked as with the overconfidence of the subjects.

Similarly, trivia questions are frequently (though not always) designed to have interesting/unintuitive answers, making them problematic for a calibration quiz where people are expecting straightforward questions. I don't know that to be the case for the AcceleratingFuture quizzes, but it is an issue in general.

That really shouldn't matter. Your calibration should include the chances of the question being a "trick question". If fewer than 90% of subjects give confidence intervals containing the actual number of employees, they're being overconfident by underestimating the probability that the question has an unexpected answer.

Imagine an experiment where we randomize subjects into two groups. All subjects are given a 20-question quiz that asks them to provide a confidence interval on the temperatures in various cities around the world on various dates in the past year. However, the cities and dates for group 1 are chosen at random, whereas the cities and dates for group 2 are chosen because they were record highs or lows.

This will result in two radically different estimates of overconfidence. The fact that the result of a calibration test depends heavily on the questions being asked should suggest that the methodology is problematic.

What this comes down to is: how do you estimate the probability that a question has an unexpected answer? See this quiz: maybe the quizzer is trying to trick you, maybe he's trying to reverse-trick you, or maybe he just chose his questions at random. It's a meaningless exercise because you're being asked to estimate values from an unknown distribution. The only rational thing to do is guess at random.

People taking a calibration test should first see the answers to a sample of the data set they will be tested on.

I think the two of you are looking at different parts of the process.

"Amount of trickiness" is a random variable that is rolled once per quiz. Averaging over a sufficiently large number of quizzes will eliminate any error it causes, which makes it a contribution to variance, not systematic bias.

On the other hand, "estimate of the average trickiness of quizzes" is a single question that people can be wrong about. No amount of averaging will reduce the influence of that question on the results, so unless your reason for caring about calibration is to get that particular question right, it does cause a systematic bias when applying the results to every other situation.
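
A quick simulation makes the distinction concrete (all the numbers below are made up purely for illustration):

```python
import random

random.seed(0)

BASE_ACCURACY = 0.80         # made-up: accuracy on a straightforward question
TRUE_MEAN_TRICKINESS = 0.15  # made-up: average accuracy lost to trick questions
QUIZZES = 10_000
QUESTIONS_PER_QUIZ = 20

def average_gap(assumed_trickiness):
    """Claimed accuracy minus observed accuracy, averaged over many quizzes."""
    claimed = BASE_ACCURACY - assumed_trickiness
    correct = total = 0
    for _ in range(QUIZZES):
        # Trickiness is rolled once per quiz, so it varies from quiz to quiz.
        trickiness = random.uniform(0, 2 * TRUE_MEAN_TRICKINESS)
        p = BASE_ACCURACY - trickiness
        for _ in range(QUESTIONS_PER_QUIZ):
            correct += random.random() < p
            total += 1
    return claimed - correct / total

print(average_gap(TRUE_MEAN_TRICKINESS))  # ~0: per-quiz trickiness averages out (variance)
print(average_gap(0.0))                   # ~0.15: a wrong trickiness estimate never averages out (bias)
```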

Wow, hmm. I took quiz 1 so far, and all my high-confidence answer groups scored much lower than claimed. For now I blame too much experience with easy multiple-choice tests. I only got 19 out of 50 overall.

Another problem, I think:

You marked your answers to 17 questions as '25% accurate'. Out of these, 1 answers were correct, for a success rate of 5.8823529411765 percent.

Now I thought I was choosing 25% when I didn't know the answer, but this seems to indicate that I had some information, and was biased against playing my (sometimes correct) hunches when marking 25%.

(I believe this is his LW account, but feel free to correct me)

This is my current LW account.

There were sequels to the Aumann game here, here, and here; these have better questions, but the lack of auto-scoring probably makes them not worth the effort.

Added, thanks!

If anyone is thinking about creating their own, I would suggest questions with numerical answers, so you can give upper and lower bounds at varying confidence levels, rather than picking a confidence for a binary question and trying to force binning or do some sort of filtering.

Also, this lets you give several probability estimates for each question.
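
Scoring such a test is straightforward: check, for each confidence level, what fraction of your intervals contained the true value. A minimal sketch (the questions and intervals are made up):

```python
from collections import defaultdict

# Made-up data: (true value, list of (confidence level, low, high) intervals from the test-taker).
answers = [
    (8_849, [(0.50, 8_000, 9_500), (0.90, 7_000, 11_000)]),  # height of Everest, metres
    (1_969, [(0.50, 1_965, 1_970), (0.90, 1_960, 1_975)]),   # year of the first Moon landing
]

hits = defaultdict(int)
counts = defaultdict(int)
for true_value, intervals in answers:
    for level, low, high in intervals:
        counts[level] += 1
        hits[level] += low <= true_value <= high

for level in sorted(counts):
    print(f"{level:.0%} intervals: {hits[level]}/{counts[level]} contained the true value")
```

A well-calibrated taker's 50% intervals should contain the true value about half the time, and the 90% intervals about nine times in ten.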

Douglas Hubbard writes on the topic of calibration as well. He focuses on real-world application of this stuff, and calibration is clearly a part of that.

His 1st book: http://www.amazon.com/How-Measure-Anything-Intangibles-ebook/dp/B001BPE8ZQ/ref=sr_1_3?ie=UTF8&s=books&qid=1258133710&sr=8-3

His site: http://www.hubbardresearch.com/dotnetnuke/

I found How to Measure Anything pretty interesting in its thorough application of calibration and Fermi calculation to all sorts of problems, although I didn't find the digressions into Excel very useful. Definitely recommended if you don't already have the mental knack for Fermi stuff.

I get 404s on the quizzes from Tom McCabe.

For me it says the site contains malicious software.

Question 40 of Salvatier's test:

#40: Which city is closer to Los Angeles, Calif.?

  • Phoenix, Ariz.
  • Phoenix, Ariz.

Wait... What?

The first one; with probability epsilon.

(alternatively: The second one; with probability epsilon.)

Next question! :D