There's actually a big problem with using Brier scores for open-ended questions like this, which is that the optimal option if you're, say, 50% confident you have the right answer, is to instead report "Don't know / bleeblabloo, probability 0.0001". Then you get a good Brier score for knowing you would be wrong.
We ran this at our meetup today and it was the subject of much discussion. A big conclusion seemed to be that Brier scores work best when there is a fixed, limited number of possibilities to guess from; when the number of possibilities is large/unknown and you can guess "I don't know," you get this bad behavior.
We came up with a kind of hacky solution that gave you negative points for wrong answers and positive points for right ones, scaled to the probability you gave, plus regular Brier scores for the True/False questions. It's unlikely that solution was a proper scoring rule, but it was somewhat better in removing the incentive to always guess "[wrong answer] with probability epsilon."
The quick hack I'd use if I didn't want people to be able to easily guess wrong with high certainty would be to use True/False or multiple choice questions. That said, I don't currently think of this as a big problem?
There are two scores: Calibration and Correct Answers. If someone has remarkably good calibration and almost no correct answers, then they're probably deliberately guessing outlandish answers and being sure that they're wrong. That's not worth bragging rights; it's the equivalent of running to the side of the obstacles on an obstacle course. Someone who's correctly 20% confident on most of the questions can get a worse Brier score but six Correct Answer points, or an excellent Brier score and zero Correct Answer points, and the former is (to me) more impressive. If you are actually totally clueless, then "[wrong answer] with probability epsilon" is actually the right response.
"I notice that I don't actually know this" is (in my opinion) a useful skill to pick up, if you can avoid also picking up "I should pretend that I know nothing." Still, the option to make it multiple choice exists, and there might be a better scoring rule. (I deliberately avoided making some kind of combined score, because I didn't want less obvious strategic exchange rates between correct answers and calibration.)
The people with the best calibration scores will not be those with the most skill at calibration. It will be those who "don't guess" on the trivia questions -- they either know it or they don't (100% or 0% chance of getting it right). This is because if you guess and have (e.g.) a 50% chance of getting it right, then even if you are perfectly calibrated about that 50%, you will still get an expected Brier score of 0.25, as opposed to a score of 0 for someone who "doesn't guess".
Consequently, I don't really see this game as being very useful at measuring calibration.
Feedback and suggestions for improvement are very welcome!
It's true that someone can easily get an excellent calibration score at the cost of getting no points. This tends to be very obvious when you read out the leaderboard. A quick patch is to turn all the questions into statements and have people estimate how likely it is that each statement is true. "What is the element with Atomic Number 29" becomes "The element with Atomic Number 29 is Copper." Then there is no easy path to excellent scores of either kind.
That version is a little less fun and I don't think the change is necessary. I'm curious whether that patch would satisfy your objection? It might be relevant that I don't view the goal as measuring calibration, but as training it. When I've run this, I often see a rapid change in confidences over the course of the first dozen questions as some people who hadn't previously practiced the skill begin to use numbers other than the highest and lowest available.
Sure, that patch wouldn't have the problem I described.
Anyway, do whatever works for you -- if you find this exercise helps people train their calibration, then I suppose that's a good thing. I guess my main point would be not to take too seriously what this method tells us about who is "best" at calibration -- and I guess you're saying people already don't take it seriously in the case of someone who is doing badly at the trivia portion, but I think the failure mode is a bit more general than that. Anyway, I guess it doesn't matter too much.
If you and your audience have smartphones, we suggest making use of a copy of this spreadsheet and google form.
are "spreadsheet" and "google form" meant to be linked to something?
Updated with a couple of variations and a link to a google drive folder with multiple question sets. The True/False version is from the comments and suggestions people left. The range version is from Ben Orlin's Outrangeous.
The link labeled "Calibration Trivia Sets" goes to a single slideshow labeled "Calibration Trivia Set 1 TF" rather than a folder with multiple sets; I assume (with 95% probability :) ) that this is a mistake?
You are correct, that's a mistake, and one which I believe is now fixed. Thanks for pointing it out!
Summary: A game of trivia where you answer factual questions about the world while also stating how sure you are that you’re right and trying to be well calibrated.
Tags: Large, Repeatable
Purpose: Calibration Trivia is designed to practice proper calibration – recognizing when you're very sure of something vs when you aren't very sure of it.
Materials: Minimally, you need a list of trivia questions and some writing implements for your audience. If you and your audience have smartphones, I suggest making use of Fatebook.io (if you're using true/false questions) or a google form (if you have a spreadsheet set up to score it.) In both cases, a timer can be useful to time each question, though it's perfectly acceptable to just advance to the next question after what feels like a couple of minutes or when it looks like most people are done.
Announcement: We’re planning to host a trivia game with a twist! If you’ve never been to a trivia night before, the person running it will call out questions, we'll write our answers, and a good time is had by all. In addition to answering the question, however, you'll be able to write down how confident you are in your guess, and at the end we check if you're well calibrated – that is, do you know when you do and do not know the answer? Categories are Literature, Math and Science, History, Sports, and Tabletop Roleplaying Games.
Please bring a smartphone or similar device, as you'll need it to enter your answers!
Note: You should make sure to change the categories to match whatever you're using. You should also remove the smartphone line if you're using another method, such as having people write down their answers and hand them to you.
Description:
1. Describe the following rules to the participants.
"This is a game of trivia, with a special tweak. For anyone unfamiliar, the way trivia works is that I'll present a question, and you'll have a couple of minutes to write down an answer. Then I'll reveal the answer, and if you got it right then you'll get one point. Feel free to chat with each other once you're done guessing and while you're waiting for the next question."
"The tweak is, in addition to writing your answer down, you will also write down how confident you are that your answer is correct in the form of a percentage. If you are very confident, you might write 95, which means if you were this sure about twenty questions you'd expect to only get one of them wrong. If you were guessing wildly, you might write down 1, which means if you were that uncertain about a hundred things, you think you'd get one of them right mostly by coincidence. You'll be scored on calibration according to what's called a Brier Score, which is a Strictly Proper Scoring Rule for predictions – that means that you want to give your actual estimation of how likely you are to be right. You'll generally do worse if you try to answer higher or lower than your actual estimation. Does anyone have any questions?"
Note: The scoring mechanism suggested is (1-their probability)^2 if they're right, and (0-their probability)^2 if they're wrong. Average the scores from each question together. Someone who correctly answered with a 90% confidence gets scored (1-.9)^2=.01. The best theoretical Brier Score would be 0, which is impossible to achieve but one can try and get close.
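If you're scoring by hand or in your own script rather than a spreadsheet, here's a minimal sketch of that scoring in Python. The function names and the layout of the answer sheet (a list of confidence/correctness pairs) are my own assumptions for illustration, not part of any existing scoresheet.

```python
def brier_score(confidence_pct, was_correct):
    """Brier score for one answer. confidence_pct is the written percentage (0-100)."""
    p = confidence_pct / 100.0
    outcome = 1.0 if was_correct else 0.0
    return (outcome - p) ** 2


def average_brier(answers):
    """Average Brier score over a list of (confidence_pct, was_correct) pairs."""
    if not answers:
        raise ValueError("need at least one answer to score")
    return sum(brier_score(c, ok) for c, ok in answers) / len(answers)


def correct_answers(answers):
    """Number of questions answered correctly (the ordinary trivia points)."""
    return sum(1 for _, ok in answers if ok)


# Hypothetical sheet for one player: (confidence in percent, got it right?)
sheet = [(90, True), (60, False), (20, False)]
print("Correct Answers:", correct_answers(sheet))
print("Calibration (lower is better):", round(average_brier(sheet), 4))
```

The two numbers are kept separate on purpose, matching the two leaderboards: one for Correct Answers and one for Best Calibrated.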
2. One at a time, read each question aloud. (A collection of questions is included below, under "Calibration Trivia Questions.") Be sure to speak clearly and loudly enough for everyone to hear. If you happen to have a projector or screen, it can help to put the question up there as well.
Every six questions, announce or display the current points and scores. If you have a very large crowd, it can speed things up to only announce the top five for Correct Answers and the top five for Best Calibrated. In both cases, I suggest it's more fun to announce from the bottom up, starting with the worst scorer and ending with the best.
Repeat until the entire set of questions has been worked through.
3. Announce the final points and scores.
Notes: You'll want a venue where you can talk loud enough for everyone to hear you. You may also want to adjust the question list or the number of questions based on the interests of your group or how long you wish the event to run for.
Calibration Trivia Questions: Calibration Trivia Sets, example scoresheet 1
Variations: Brier scores are used to judge between two binary options, correct or incorrect. Here, I'm abusing it a bit by having people write their answer from all the possibilities, then guess if they're right or wrong. The easy patch is to make all the questions in the form of statements, and then ask if those statements are true or false. In the Calibration Trivia Sets, any set marked TF is in the form of statements which are either true or false, meaning people just need to answer with their confidence in its truth.
(If you're using Fatebook, I suggest setting up the questions in a tournament, clicking the option to hide other people's answers, and making the question titles just "Trivia Question 1" and so on, then displaying the text of the question on a projector.)
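To make the reason for the true/false patch concrete, here's a rough sketch of the expected Brier score for a player on an open-ended question. The function name is made up for illustration, and I'm assuming a deliberately wrong throwaway answer has essentially zero chance of being right.

```python
def expected_brier(true_p, reported_p):
    """Expected Brier score when the written answer is right with probability
    true_p and the player writes down reported_p as their confidence."""
    return true_p * (1 - reported_p) ** 2 + (1 - true_p) * reported_p ** 2


# An honest, perfectly calibrated 50/50 guess.
honest = expected_brier(0.5, 0.5)

# The same 50/50 guess, but overstating confidence as 70%.
overconfident = expected_brier(0.5, 0.7)

# A deliberately wrong answer reported at 0.01% (assumed never correct).
throwaway = expected_brier(0.0, 0.0001)

print(honest, overconfident, throwaway)
```

For a fixed answer, reporting your real confidence does best (the honest guess beats the overconfident one), which is what makes the rule proper. The trouble in the open-ended format is that the player also picks which answer to write down, and a throwaway answer gets a much better expected Brier score than an honest guess. With true/false statements there is no throwaway answer to pick.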
Another variation in how to write question sets is to make all questions have a numerical answer, and then ask for a range. You can score this by having the narrowest correct range win, or ask for 90% confidence intervals and see how often people are right.
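Both ways of scoring the range version can be sketched in a few lines of Python. This is only an illustration under my own assumptions: the function names, the player names, and the numbers are made up, and ties for the narrowest range simply go to whoever appears first.

```python
def contains(low, high, truth):
    """True if the true value falls inside the guessed range (inclusive)."""
    return low <= truth <= high


def narrowest_correct(guesses, truth):
    """guesses maps player name -> (low, high). Returns the player whose range
    contains the truth and is narrowest, or None if nobody's range contains it."""
    widths = {
        name: high - low
        for name, (low, high) in guesses.items()
        if contains(low, high, truth)
    }
    if not widths:
        return None
    return min(widths, key=widths.get)


def hit_rate(intervals, truths):
    """Fraction of one player's intervals that contained the true value.
    For 90% confidence intervals, a well calibrated player lands near 0.9."""
    hits = sum(contains(low, high, t) for (low, high), t in zip(intervals, truths))
    return hits / len(truths)


# Made-up example data for one question and for one player's three intervals.
guesses = {"Ana": (20, 40), "Ben": (28, 30), "Cy": (30, 35)}
print(narrowest_correct(guesses, truth=29))

print(hit_rate([(20, 40), (1900, 1950), (5, 10)], [29, 1969, 7]))
```

With only a handful of questions the hit rate is very noisy, so I'd treat it as a conversation starter rather than a verdict on anyone's calibration.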
Outrangeous and Breaking Rank are trivia games in their own right, not just variations of Calibration Trivia. That said, if you want something like Calibration Trivia but different, or you want a format where you don't need to have a set of questions prepared in advance, give them a try!
Notes: I advise giving several minutes for each question, longer than is needed to just write down the answer. Some people will spend more time thinking than you expect. Often people who have finished writing their answer will talk and socialize with each other in the gaps.