D_Malik comments on Open Thread, Jul. 27 - Aug 02, 2015 - Less Wrong
There has been far less writing on improving rationality here on LW during the last few years. Has everything important been said about the subject, or have you just given up on trying to improve your rationality? Are there diminishing returns on improving rationality? Is it related to the fact that it's very hard to get rid of most cognitive biases, no matter how hard you try to focus on them? Or have people moved these discussions to different forums, or to real life?
Or is it like Yvain said in the 2014 Survey results?
About that survey... Suppose I ask you to guess the result of a biased coin which comes up heads 80% of the time. I ask you to guess 100 times, of which ~80 times the right answer is "heads" (these are the "easy" or "obvious" questions) and ~20 times the right answer is "tails" (these are the "hard" or "surprising" questions). Then the correct guess, if you aren't told whether a given question is "easy" or "hard", is to guess heads with 80% confidence, for every question. Then you're underconfident on the "easy" questions, because you guessed heads with 80% confidence but heads came up 100% of the time. And you're overconfident on the "hard" questions, because you guessed heads with 80% confidence but got heads 0% of the time.
So you can get apparent under/overconfidence on easy/hard questions respectively, even if you're perfectly calibrated, if you aren't told in advance whether a question is easy or hard. Maybe the effect Yvain is describing does exist, but his post does not demonstrate it.
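A minimal sketch of the coin argument, assuming an idealized run with exactly 80 heads and 20 tails (the list `flips` and the helper `hit_rate` are illustrative names, not anything from the survey). It shows that a guesser who says "heads" at 80% every time matches the overall hit rate, yet looks miscalibrated once the flips are split by how they actually came out:

```python
def hit_rate(outcomes, guess="heads"):
    """Fraction of outcomes on which the fixed guess was right."""
    return sum(o == guess for o in outcomes) / len(outcomes)

# Hypothetical idealized run: exactly 80 heads and 20 tails out of 100 flips.
flips = ["heads"] * 80 + ["tails"] * 20
stated_confidence = 0.80  # the same confidence is stated on every flip

# Calibration over all flips: stated confidence vs. actual hit rate.
overall = hit_rate(flips)

# Partitioning by observed outcome ("easy" = heads, "hard" = tails)
# is what creates the apparent under/overconfidence.
easy = hit_rate([f for f in flips if f == "heads"])
hard = hit_rate([f for f in flips if f == "tails"])

print(f"stated {stated_confidence:.0%}, overall hit rate {overall:.0%}")
print(f"'easy' subset hit rate: {easy:.0%}")
print(f"'hard' subset hit rate: {hard:.0%}")
```

The overall hit rate equals the stated 80%, while the two subsets come out at 100% and 0% purely because of how they were selected.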
Wow, that's a great point. We can't measure anyone's "true" calibration by asking them a specific set of questions, because we're not drawing questions from the same distribution as nature! That's up there with the obvious-in-retrospect point that the placebo effect gets stronger or weaker depending on the size of the placebo group in the experiment. Good work :-)
I am probably misunderstanding something here, but doesn't this
Basically say, "if you have no calibration whatsoever?" If there are distinct categories of questions (easy and hard) and you can't tell which questions belong to which category, then simply guessing according to your overall base rate will make your calibration look terrible - because it is.
Replace "if you don't know" with "if you aren't told". If you believe 80% of them are easy, then you're perfectly calibrated as to whether or not a question is easy, and the apparent under/overconfidence remains.
I am still confused.
You don't measure calibration by asking "Which percentage of this set of questions is easy?". You measure it by offering each question one by one and asking "Is this one easy? What about that one?".
Calibration applies to individual questions, not to aggregates. If, for some reason, you believe that 80% of the questions in the set are easy but you have no idea which ones, you are not perfectly calibrated; in fact, your calibration sucks because you cannot distinguish easy and hard.
Calibration for single questions doesn't make any sense. Calibration applies to individuals, and is about how their subjective probability of being right about questions in some class relates to what proportion of the questions in that class they are right about.
Well, let's walk through the scenario.
Alice is given 100 calibration questions. She knows that some of them are easy and some are hard. She doesn't know how many are easy and how many are hard.
Alice goes through the 100 questions and at the end -- according to how I understand D_Malik's scenario -- she says "I have no idea whether any particular question is hard or easy, but I think that out of this hundred 80 questions are easy. I just don't know which ones". And, under the assumption that 80 questions were indeed easy, this is supposed to represent perfect calibration.
That makes no sense to me at all.
D_Malik's scenario illustrates that it doesn't make sense to partition the questions based on observed difficulty and then measure calibration, because this will induce a selection effect. The correct procedure is to partition the questions based on expected difficulty and then measure calibration.
For example, I say "heads" every time for the coin, with 80% confidence. That says to you that I think all flips are equally hard to predict prospectively. But if you were to compare my track record for heads and tails separately--that is, look at the situation retrospectively--then you would think that I was simultaneously underconfident and overconfident.
To make it clearer what it should look like normally, suppose there are two coins, red and blue. The red coin lands heads 80% of the time and the blue coin lands heads 70% of the time, and we alternate between flipping the red coin and the blue coin.
If I always answer heads, with 80% when it's red and 70% when it's blue, I will be as calibrated as someone who always answers heads with 75%, but will have more skill. But retrospectively, one will be able to make the claim that we are underconfident and overconfident.
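To put numbers on "as calibrated, but more skill", here is a rough sketch. The comment does not say how skill is measured; the expected Brier score (mean squared error of the stated probability, lower is better) is assumed here as one common choice. The names `expected_brier`, `informed` and `uninformed` are illustrative.

```python
def expected_brier(forecast, p_heads):
    """Expected squared error of stating `forecast` for heads
    when the coin actually lands heads with probability `p_heads`."""
    return p_heads * (1 - forecast) ** 2 + (1 - p_heads) * forecast ** 2

# True heads probabilities from the two-coin example; flips alternate,
# so each coin accounts for half of the forecasts.
coins = {"red": 0.8, "blue": 0.7}

informed = {"red": 0.8, "blue": 0.7}      # distinguishes the coins
uninformed = {"red": 0.75, "blue": 0.75}  # always says 75%

def average_brier(forecasts):
    """Expected Brier score averaged over the two equally frequent coins."""
    return sum(expected_brier(forecasts[c], p) for c, p in coins.items()) / len(coins)

# Both are calibrated: each stated probability matches the hit rate
# over the flips on which it was stated (75% is the average of 80% and 70%).
print(f"informed:   {average_brier(informed):.4f}")
print(f"uninformed: {average_brier(uninformed):.4f}")
```

Under this assumed measure the forecaster who tells the coins apart scores slightly better (lower), even though both are calibrated prospectively.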
Yes, I agree with that. However it still seems to me that the example with coins is misleading and that the given example of "perfect calibration" is anything but. Let me try to explain.
Since we're talking about calibration, let's not use coin flips but use calibration questions.
Alice gets 100 calibration questions. To each one she provides an answer plus her confidence in her answer expressed as a percentage.
In both your and D_Malik's examples the confidence given is the same for all questions. Let's say it is 80%. That is an important part: Alice gives her confidence for each question as 80%. This means that for her the difficulty of each question is the same -- she cannot distinguish between them on the basis of difficulty.
Let's say the correctness of the answer is binary -- it's either correct or not. It is quite obvious that if we collect all Alice's correct answers in one pile and all her incorrect answers in another pile, she will look to be miscalibrated, both underconfident (for the correct pile) and overconfident (for the incorrect pile).
But now we have the issue that some questions are "easy" and some are "hard". My understanding of these terms is that the test-giver, knowing Alice, can forecast which questions she'll be able to mostly answer correctly (those are the easy ones) and which questions she will not be able to mostly answer correctly (those are the hard ones). If this is so (and assuming the test-giver is right about Alice which is testable by looking at the proportions of easy and hard questions in the correct and incorrect piles), then Alice fails calibration because she cannot distinguish easy and hard questions.
You are suggesting, however, that there is an alternate definition of "easy" and "hard" which is the post-factum assignment of the "easy" label to all questions in the correct pile and of the "hard" label to all questions in the incorrect pile. That makes no sense to me, being an obviously stupid thing to do, but it may be that the original post argued exactly against this kind of stupidity.
P.S. And, by the way, the original comment which started this subthread quoted Yvain and then D_Malik pronounced Yvain's conclusions suspicious. But Yvain did not condition on the outcomes (correct/incorrect answers), he conditioned on confidence! It's a perfectly valid exercise to create a subset of questions where someone declared, say, 50% confidence, and then see if the proportion of correct answers is around that 50%.
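Conditioning on confidence can be sketched as below. The answers are made up for illustration and are not survey data; `calibration_table` is a hypothetical helper name.

```python
from collections import defaultdict

def calibration_table(answers):
    """Group (stated_confidence, was_correct) pairs by stated confidence.

    Returns {confidence: (number_of_answers, fraction_correct)}.
    """
    buckets = defaultdict(list)
    for confidence, correct in answers:
        buckets[confidence].append(correct)
    return {c: (len(v), sum(v) / len(v)) for c, v in sorted(buckets.items())}

# Made-up answers for illustration only.
answers = [
    (0.5, True), (0.5, False), (0.5, True), (0.5, False),
    (0.9, True), (0.9, True), (0.9, True), (0.9, False),
]

for confidence, (n, frac) in calibration_table(answers).items():
    print(f"stated {confidence:.0%}: {n} answers, {frac:.0%} correct")
```

Here the partition uses only what was stated before the outcome was known, so it does not induce the selection effect that partitioning by correct/incorrect does.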
Suppose that I am given a calibration question about a racehorse and I guess "Secretariat" (since that's the only horse I remember) and give a 30% probability (since I figure it's a somewhat plausible answer). If it turns out that Secretariat is the correct answer, then I'll look really underconfident.
But that's just a sample size of one. Giving one question to one LWer is a bad method for testing whether LWers are overconfident or underconfident (or appropriately confident). So, what if we give that same question to 1000 LWers?
That actually doesn't help much. "Secretariat" is a really obvious guess - probably lots of people who know only a little about horseracing will make the same guess, with low to middling probability, and wind up getting it right. On that question, LWers will look horrendously underconfident. The problem with this method is that, in a sense, it still has a sample size of only one, since tests of calibration are sampling both from people and from questions.
The LW survey had better survey design than that, with 10 calibration questions. But Yvain's data analysis had exactly this problem - he analyzed the questions one-by-one, leading (unsurprisingly) to the result that LWers looked wildly underconfident on some questions and wildly overconfident on others. That is why I looked at all 10 questions in aggregate. On average (after some data cleanup) LWers gave a probability of 47.9% and got 44.0% correct. Just 3.9 percentage points of overconfidence. For LWers with 1000+ karma, the average estimate was 49.8% and they got 48.3% correct - just a 1.4 percentage point bias towards overconfidence.
Being well-calibrated does not only mean "not overconfident on average, and not underconfident on average". It also means that your probability estimates track the actual frequencies across the whole range from 0 to 1 - when you say "90%" it happens 90% of the time, when you say "80%" it happens 80% of the time, etc. In D_Malik's hypothetical scenario where you always answer "80%", we aren't getting any data on your calibration for the rest of the range of subjective probabilities. But that scenario could be modified to show calibration across the whole range (e.g., several biased coins, with known biases). My analysis of the LW survey in the previous paragraph also only addresses overconfidence on average, but I also did another analysis which looked at slopes across the range of subjective probabilities and found similar results.
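The difference between "calibrated on average" and "calibrated across the range" can be sketched with several biased coins. This is not the analysis run on the survey data; it assumes idealized flips (exactly 6, 7, 8 and 9 heads out of 10) and an ordinary least-squares slope of outcome on stated probability, and the helper names are illustrative.

```python
def mean_gap(pairs):
    """Average stated probability minus fraction correct (positive = overconfident)."""
    n = len(pairs)
    return sum(p for p, _ in pairs) / n - sum(y for _, y in pairs) / n

def calibration_slope(pairs):
    """Least-squares slope of outcome (0/1) on stated probability.

    Returns None when the stated probabilities never vary, since the
    slope is then undefined.
    """
    n = len(pairs)
    mean_p = sum(p for p, _ in pairs) / n
    mean_y = sum(y for _, y in pairs) / n
    var_p = sum((p - mean_p) ** 2 for p, _ in pairs)
    if var_p < 1e-12:
        return None
    cov = sum((p - mean_p) * (y - mean_y) for p, y in pairs)
    return cov / var_p

# Hypothetical coins with known biases, 10 idealized flips each.
flips = []  # (bias, outcome) with outcome 1 = heads
for bias in [0.6, 0.7, 0.8, 0.9]:
    heads = round(bias * 10)
    flips += [(bias, 1)] * heads + [(bias, 0)] * (10 - heads)

tracks_bias = [(bias, y) for bias, y in flips]  # states each coin's bias
always_75 = [(0.75, y) for _, y in flips]       # states 75% every time

print(mean_gap(tracks_bias), calibration_slope(tracks_bias))
print(mean_gap(always_75), calibration_slope(always_75))
```

Both forecasters show no overconfidence on average, but only the one who varies the stated probability yields a slope (close to 1 here); the constant 75% forecaster gives no information about the rest of the range.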
I agree that if Yvain had predicted what percentage of survey-takers would get each question correct before the survey was released, that would be useful as a measure of the questions' difficulty and an interesting analysis. That was not done in this case.
The labeling is not obviously stupid--what questions the LW community has a high probability of getting right is a fact about the LW community, not about Yvain's impression of the LW community. The usage of that label for analysis of calibration does suffer from the issue D_Malik raised, which is why I think Unnamed's analysis is more insightful than Yvain's and their critiques are valid.
It is according to what calibration means in the context of probabilities. Like Unnamed points out, if you are unhappy that we are assigning a property of correct mappings ('calibration') to a narrow mapping ("80%"->80%) instead of a broad mapping ("50%"->50%, "60%"->60%, etc.), it's valid to be skeptical that the calibration will generalize--but it doesn't mean the assessment is uncalibrated.