In my journeys across the land, I have, to date, encountered four sets of probability calibration tests. (If you just want to make bets on your predictions, you can use Intrade or another prediction market, but these generally don't record calibration data, only which of your bets paid out.) If anyone knows of other tests, please do mention them in the comments, and I'll add them to this post. To avoid spoilers, please do not post what you guessed for the calibration questions, or what the answers are.
The first, to boast shamelessly, is my own, at http://www.acceleratingfuture.com/tom/?p=129. My tests use fairly standard trivia questions (samples: "George Washington actually fathered how many children?", "Who was Woody Allen's first wife?", "What was Paul Revere's occupation?"), with an emphasis on history and pop culture. The quizzes are scored automatically (by computer), and you assign a probability of 96%, 90%, 75%, 50%, or 25% to each answer. There are five quizzes of fifty questions each: Quiz #1, Quiz #2, Quiz #3, Quiz #4 and Quiz #5.
The second is a project by John Salvatier (LW account) of the University of Washington, at http://calibratedprobabilityassessment.org/. There are three working sets of fifty questions each: two of general trivia, and one on relative distances between American cities (a fourth set, unfortunately, does not appear to be working at this time). The questions do not rotate, but are re-ordered each time you refresh the page. The probabilities are again multiple choice, with ranges of 51-60%, 61-70%, 71-80%, 81-90%, and 91-100%, for whichever answer you think is more probable. These quizzes are also scored by computer, but instead of spitting back numbers, the computer generates a graph showing the discrepancy between your actual accuracy rate and your claimed accuracy rate. Links: US cities, trivia #1, trivia #2.
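If you're curious what a graph like that is plotting, it's essentially a per-bucket comparison of claimed and actual accuracy. Here's a rough Python sketch of the calculation (the data format and function name are mine, not taken from either site's code):

```python
from collections import defaultdict

def calibration_summary(answers):
    """Compare claimed accuracy to actual accuracy, bucketed by claimed probability.

    `answers` is a list of (claimed_probability, was_correct) pairs, e.g.
    (0.75, True) for a question you tagged as 75% and got right.
    """
    buckets = defaultdict(lambda: [0, 0])  # claimed prob -> [answered, correct]
    for claimed, correct in answers:
        buckets[claimed][0] += 1
        buckets[claimed][1] += int(correct)

    for claimed in sorted(buckets):
        total, right = buckets[claimed]
        actual = right / total
        print(f"claimed {claimed:.0%}: actual {actual:.0%} ({right}/{total}); "
              f"gap {actual - claimed:+.0%}")

# Example: eight answers, five tagged 90% and three tagged 50%.
calibration_summary([(0.9, True), (0.9, True), (0.9, True), (0.9, True), (0.9, False),
                     (0.5, True), (0.5, False), (0.5, False)])
```

For a well-calibrated quiz-taker the per-bucket gaps hover around zero; consistently negative gaps in the high-probability buckets are the usual signature of overconfidence.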
The third is a quiz by Steven Smithee of Black Belt Bayesian (LW account here) at http://www.acceleratingfuture.com/steven/?p=96. There are three sets of five questions each, covering history, demographics, and Google rankings, plus two sets of (non-testable) questions about the future and historical counterfactuals. (EDIT: Steven has built three more tests in addition to this one, at http://www.acceleratingfuture.com/steven/?p=102, http://www.acceleratingfuture.com/steven/?p=106, and http://www.acceleratingfuture.com/steven/?p=136.) This test must be graded manually, and the answers are in one of the comments below the test (don't look at the comments if you don't want spoilers!).
The fourth is a website by Tricycle Developments, the web developers who built Less Wrong, at http://predictionbook.com/. You can make your own predictions about real-world events, or bet on other people's predictions, at whatever probability you want, and the website records how often you were right relative to the probabilities you assigned. However, since all predictions are made in advance of real-world events, it may take quite a while (on the order of months to years) before you can find out how accurate you were.
That really shouldn't matter. Your calibration should include the chance that the question is a "trick question". If fewer than 90% of subjects give 90% confidence intervals containing the actual number of employees, they are being overconfident: they are underestimating the probability that the question has an unexpected answer.
Imagine an experiment where we randomize subjects into two groups. All subjects are given a 20-question quiz that asks them to provide a confidence interval on the temperatures in various cities around the world on various dates in the past year. However, the cities and dates for group 1 are chosen at random, whereas the cities and dates for group 2 are chosen because they were record highs or lows.
This will result in two radically different estimates of overconfidence. The fact that the result of a calibration test depends heavily on the questions being a...
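To make the point concrete, here is a toy simulation of that experiment (every number in it, from the interval widths to the temperature spreads to the 365-draw definition of a "record", is invented for illustration):

```python
import random

random.seed(0)

def subject_interval(city_mean):
    """A subject's nominal 90% interval: centered near the city's typical
    temperature, and wide enough to cover ~90% of ordinary days."""
    guess = city_mean + random.gauss(0, 2)   # imperfect knowledge of the typical temp
    return guess - 13.5, guess + 13.5        # about the right width when day-to-day sd is 8

def coverage(true_temps, city_means, n_subjects=500):
    """Fraction of (subject, question) pairs whose interval contains the true temperature."""
    hits = trials = 0
    for _ in range(n_subjects):
        for mean, true_temp in zip(city_means, true_temps):
            lo, hi = subject_interval(mean)
            hits += lo <= true_temp <= hi
            trials += 1
    return hits / trials

n_questions = 20
city_means = [random.uniform(-5, 30) for _ in range(n_questions)]

# Group 1: randomly chosen dates, i.e. an ordinary draw around each city's typical temperature.
group1_temps = [m + random.gauss(0, 8) for m in city_means]

# Group 2: record highs or lows, i.e. the most extreme of a year's worth of draws.
group2_temps = [m + (max if random.random() < 0.5 else min)(
                    random.gauss(0, 8) for _ in range(365))
                for m in city_means]

print("Group 1 (random dates): coverage of nominal 90% intervals =", coverage(group1_temps, city_means))
print("Group 2 (record dates): coverage of nominal 90% intervals =", coverage(group2_temps, city_means))
```

The simulated subjects behave identically in both groups, but group 1's measured coverage should come out near the nominal 90%, while group 2's should collapse toward zero, so a naive reading of the results would call group 2 wildly overconfident.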