D_Malik comments on Open thread, Feb. 23 - Mar. 1, 2015 - Less Wrong

Post author: MrMind 23 February 2015 08:01AM


Comment author: D_Malik 23 February 2015 10:05:05PM 6 points

People use PredictionBook to make predictions about many heterogeneous questions, in order to train calibration. Couldn't we train calibration more efficiently by making a very large number of predictions about a fairly small, homogeneous group of questions?

For instance, at the moment people are producing a single probability for each of n questions about e.g. what will happen in HPMOR's final arc. This has a high per-question cost (people must think up individual questions, formalize them, judge edge cases, etc.) and you only get one piece of data from each question (the probability assigned to the correct outcome).

Suppose instead we get some repeatable, homogeneous question-template with a numerical answer, e.g. "what time is it?", "how many dots are in this randomly-generated picture?", or "how long is the Wikipedia article named _?". Then instead of producing only one probability for each question, you give your {1,5,10,20,...,90,95,99}-percentile estimates. Possible advantages of this approach:

  • Questions are mass-produced. We can write a program to ask the same question over and over, for different times / pictures / Wikipedia articles. Each question gives more data, since you're producing several percentile estimates for each rather than just a single probability.
  • The task is simpler; more time is spent converting intuition into numbers. There's less System 2 thinking to do, and more just e.g. guessing how many dots are in a picture. You only need to mentally construct a single probability distribution over a 1-dimensional answer-space and then read off some numbers, rather than constructing a distribution over some high-dimensional answer-space (e.g. what will happen in the final HPMOR arc), deciding which outcomes count as "true" vs. "false" for your question, and then summing up all the mass counted as "true".
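
As a purely hypothetical sketch of this template idea: a script could generate the true answer for each question instance, collect the user's percentile estimates, and store one hit/miss bit per percentile. The template, the dot-count range, and the function names below are illustrative assumptions, not an existing tool.

```python
import random

# Hypothetical sketch of the mass-produced-question idea; ranges and
# names are illustrative assumptions, not a real tool.

PERCENTILES = [1, 5, 10, 20, 30, 40, 50, 60, 70, 80, 90, 95, 99]

def generate_question(rng):
    """One instance of the 'how many dots?' template: the true dot
    count (rendering the actual picture is omitted here)."""
    return rng.randint(10, 500)

def record_response(true_count, estimates):
    """estimates maps each percentile to the user's estimated count.
    Returns, per percentile, whether the true count fell at or below
    that estimate; this is the raw data calibration is scored from."""
    return {p: true_count <= estimates[p] for p in PERCENTILES}

rng = random.Random(0)
true_count = generate_question(rng)
# A made-up user response: estimates rising with the percentile.
resp = record_response(true_count,
                       {p: 250 + 5 * (p - 50) for p in PERCENTILES})
```

Each answered instance yields thirteen hit/miss bits rather than a single resolved probability, which is the "more data per question" advantage described above.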

Possible disadvantages of this approach:

  • Less entertaining - speculating about HPMOR's final arc is more fun than speculating about the number of dots in a randomly-generated picture. IMO purchasing fun and skill separately is better than trying to purchase both at the same time.
  • If calibration training doesn't generalize, then you'll only get well-calibrated about numbers of dots, not about something actually important like HPMOR. I'm pretty sure that calibration generalizes, though.
  • Making predictions trains not only calibration but also discrimination, i.e. it decreases the entropy of your probability distribution. Discrimination doesn't generalize much. Improved discrimination about numbers of dots is less useful than about other things.
  • Heterogeneous "messy" questions are probably more representative of the sorts of questions we actually care about, e.g. "when will AGI come?", or "how useful would it be to know more math?". So insofar as calibration and discrimination do not generalize, messy questions are better.
  • The estimates you mass-produce will tend to be correlated, so I think they would provide less information about how well-calibrated you are than the same number of estimates produced more independently.

Overall, I'd guess:

  • The homogeneous mass-prediction approach is better at training calibration.
  • You should use domain-specific training in the domain you want to predict things in, to develop discrimination.
  • It's inefficient to make heterogeneous predictions in domains you don't care very much about.
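
One way to score such percentile estimates (my own sketch, not something from PredictionBook) is to check empirical coverage: for a well-calibrated predictor, the true answer should fall at or below their p-th percentile estimate about p% of the time.

```python
# Illustrative scoring sketch: hits_per_question is a list of dicts,
# one per answered question, mapping percentile -> True if the true
# answer was at or below the estimate given for that percentile.

def coverage(hits_per_question, percentile):
    """Empirical fraction of questions whose true answer fell at or
    below the estimate for this percentile. For a well-calibrated
    estimator this should be close to percentile / 100."""
    hits = [h[percentile] for h in hits_per_question]
    return sum(hits) / len(hits)

# Simulated data: the 90th-percentile estimate covered 87 of 100
# answers, so coverage is 0.87, a bit overconfident vs. the ideal 0.90.
data = [{90: i < 87} for i in range(100)]
rate = coverage(data, 90)
```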

An alternative, roughly between the two groups discussed above, would be to find some repeatable way of generating questions that are at least slightly interesting. For instance, play online Mafia and privately make lots of predictions about which players have which roles, who will be lynched or murdered, etc. Or predict chess or poker. Or predict karma scores of LW/Reddit comments. Or use a spaced repetition system, but before showing the answer estimate the probability that you got the answer right. Any better ideas?
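
The spaced-repetition variant above could be scored with a Brier score over your stated probabilities. A minimal sketch, with the review data made up for illustration:

```python
# Minimal Brier-score sketch for the SRS idea: before revealing each
# answer, you state a probability that you got it right, then record
# whether you actually did. Lower mean squared error = better calibrated
# and more discriminating.

def brier_score(records):
    """records: list of (stated_probability, was_correct) pairs."""
    return sum((p - int(ok)) ** 2 for p, ok in records) / len(records)

# Made-up reviews: 90% sure (right), 60% sure (wrong), 80% sure (right).
reviews = [(0.9, True), (0.6, False), (0.8, True)]
score = brier_score(reviews)  # (0.01 + 0.36 + 0.04) / 3, about 0.137
```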

Comment author: ChristianKl 24 February 2015 10:19:10AM 4 points

The Credence game does this, for better or worse. Saying which of two metals has the higher boiling point, on the other hand, isn't that fun.

The key issue is getting interesting question templates.

Comment author: drethelin 23 February 2015 10:41:10PM 4 points

I think homogeneous mass prediction gets you better calibrated at predicting the kind of questions in the sample group, but doesn't help a lot otherwise. If you have ever played the calibration game that someone made a couple of years ago, it's a lot like that. After using it for a while you get better and better at it, but I don't think that made me better calibrated in general.

Comment author: whales 27 February 2015 05:34:13AM 2 points

You can predict how long tasks/projects will take you (stopwatch and/or calendar time). Even if calibration doesn't generalize, it's potentially useful on its own there. And while you can't quite mass-produce questions/predictions, it's not such a hassle to rack up a lot if you do them in batches. Malcolm Ocean wrote about doing this with a spreadsheet, and I threw together an Android todo-with-predictions app for a similar self experiment.