D_Malik comments on Open thread, Feb. 23 - Mar. 1, 2015 - Less Wrong
People use PredictionBook to make predictions about many heterogeneous questions, in order to train calibration. Couldn't we train calibration more efficiently by making a very large number of predictions about a fairly small, homogeneous group of questions?
For instance, at the moment people are producing a single probability for each of n questions about, e.g., what will happen in HPMOR's final arc. This has a high per-question cost (people must think up individual questions, formalize them, judge edge cases, etc.), and you only get one piece of data from each question (the probability assigned to the correct outcome).
Suppose instead we pick some repeatable, homogeneous question template with a numerical answer, e.g. "what time is it?", "how many dots are in this randomly-generated picture?", or "how long is the Wikipedia article named _?". Then instead of producing only one probability per question, you give your {1,5,10,20,...,90,95,99}-percentile estimates (a rough sketch of this follows below). Possible advantages of this approach:
Possible disadvantages of this approach:
Overall, I'd guess:
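To make the template idea concrete, here is a rough sketch of what such a trainer could look like, using the dot-counting template. Everything in it is an illustrative assumption rather than an existing tool: the "picture" is just a long line of asterisks printed to the terminal, and the scoring rule is the standard one for quantile estimates (for a well-calibrated forecaster, the true answer should fall at or below the p-th percentile estimate about p% of the time).

```python
"""Rough sketch of a percentile-based calibration trainer (dot-counting
template). A real version would render an actual picture; here the
"picture" is a line of asterisks. All names and parameters are
illustrative assumptions."""
import random

PERCENTILES = [1, 5, 10, 20, 30, 40, 50, 60, 70, 80, 90, 95, 99]

def ask_one(rng):
    """Pose one instance of the template; collect all percentile estimates."""
    true_count = rng.randint(50, 400)
    print("\nEstimate how many asterisks are in this 'picture':")
    print("*" * true_count)
    # One question yields len(PERCENTILES) data points, not just one.
    estimates = {p: float(input(f"  {p}th-percentile estimate: "))
                 for p in PERCENTILES}
    return true_count, estimates

def report(results):
    """Per percentile, how often did the truth fall at or below the estimate?"""
    for p in PERCENTILES:
        hits = sum(truth <= est[p] for truth, est in results)
        print(f"p{p:>2}: truth fell below your estimate "
              f"{100 * hits / len(results):.0f}% of the time (ideal: {p}%)")

if __name__ == "__main__":
    rng = random.Random()
    report([ask_one(rng) for _ in range(5)])
```

Note the arithmetic behind the mass-production advantage: five questions at 13 percentiles each already give 65 calibration data points, versus five from five one-probability questions.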
An alternative, roughly between the two groups discussed above, would be to find some repeatable way of generating questions that are at least slightly interesting. For instance, play online Mafia and privately make lots of predictions about which players have which roles, who will be lynched or murdered, etc. Or predict chess or poker games. Or predict the karma scores of LW/Reddit comments. Or use a spaced-repetition system, but estimate the probability that you got the answer right before the answer is shown. Any better ideas?
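The spaced-repetition variant in particular is easy to prototype. Below is a minimal sketch, assuming nothing more than a hard-coded list of (prompt, answer) cards; it is not wired into any real SRS, and the probability-bucketing scheme is an arbitrary choice:

```python
"""Minimal sketch of the spaced-repetition variant: state a probability of
being right before the answer is revealed, then compare stated confidence
with actual hit rates. The cards are hypothetical stand-ins."""
from collections import defaultdict

CARDS = [  # illustrative (prompt, answer) pairs, not a real deck
    ("Capital of Australia?", "canberra"),
    ("Boiling point of water at sea level, in degrees C?", "100"),
    ("Year the World Wide Web was proposed?", "1989"),
]

def review(cards):
    records = []  # (stated probability, was the answer correct?)
    for prompt, answer in cards:
        guess = input(f"{prompt} ").strip().lower()
        p = float(input("Probability you're right (0-1): "))
        correct = guess == answer
        print(f"Answer: {answer} ({'right' if correct else 'wrong'})")
        records.append((p, correct))
    return records

def calibration_report(records, n_buckets=5):
    """Bucket answers by stated probability; compare with observed hit rate."""
    buckets = defaultdict(list)
    for p, correct in records:
        buckets[min(int(p * n_buckets), n_buckets - 1)].append(correct)
    for b in sorted(buckets):
        outcomes = buckets[b]
        print(f"stated {b/n_buckets:.0%}-{(b+1)/n_buckets:.0%}: "
              f"actually right {sum(outcomes)/len(outcomes):.0%} "
              f"({len(outcomes)} cards)")

if __name__ == "__main__":
    calibration_report(review(CARDS))
```

The same loop could presumably be bolted onto a real SRS by wrapping whatever step reveals the answer.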
The Credence game does this, for better or worse. Saying which of two metals has the higher boiling point, on the other hand, isn't that fun.
The key issue is getting interesting question templates.
I think homogeneous mass prediction gets you better calibrated at predicting the kind of questions in the sample group, but doesn't help a lot otherwise. If you have ever played the calibration game that someone made a couple of years ago, it's a lot like that: after using it for a while you get better and better at it, but I don't think that made me better calibrated in general.
You can predict how long tasks/projects will take you (stopwatch and/or calendar time). Even if calibration doesn't generalize, it's potentially useful on its own there. And while you can't quite mass-produce questions/predictions, it's not much of a hassle to rack up a lot if you do them in batches. Malcolm Ocean wrote about doing this with a spreadsheet, and I threw together an Android todo-with-predictions app for a similar self-experiment.
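For what it's worth, the core of the batch-logging version fits in a few lines. A sketch with made-up tasks and numbers, purely for illustration (this mirrors the spirit of the spreadsheet approach, not Malcolm Ocean's actual format or the app above):

```python
"""Sketch of task-duration calibration: log a 50th- and 90th-percentile
time estimate per task plus the actual time, then check what fraction of
tasks finished within each estimate. All entries are fabricated examples."""

# (task, p50 estimate in minutes, p90 estimate in minutes, actual minutes)
LOG = [
    ("write report",    60, 120,  95),
    ("fix login bug",   30,  90, 150),
    ("review PR",       15,  30,  20),
    ("weekly planning", 20,  40,  25),
]

def check_calibration(log):
    """Ideally ~50% of actuals land under p50 and ~90% under p90."""
    for label, idx in (("p50", 1), ("p90", 2)):
        hits = sum(1 for row in log if row[3] <= row[idx])
        print(f"{label}: finished within estimate on "
              f"{100 * hits / len(log):.0f}% of tasks (ideal: {int(label[1:])}%)")

if __name__ == "__main__":
    check_calibration(LOG)
```

Two estimate columns and one actuals column are all a spreadsheet version needs either, which is presumably why doing these in batches keeps the bookkeeping cheap.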