asd comments on Open Thread, Jul. 27 - Aug 02, 2015 - Less Wrong Discussion

Post author: MrMind 27 July 2015 07:16AM

Comment author: [deleted] 27 July 2015 02:42:56PM *  6 points

There's been far less writing on improving rationality here on LW during the last few years. Has everything important already been said about the subject, or have you just given up on trying to improve your rationality? Are there diminishing returns on improving rationality? Is it related to the fact that it's very hard to get rid of most cognitive biases, no matter how hard you try to focus on them? Or have people moved these discussions to different forums, or to real life?

Or, as Yvain said in the 2014 survey results:

It looks to me like everyone was horrendously underconfident on all the easy questions, and horrendously overconfident on all the hard questions. To give an example of how horrendous, people who were 50% sure of their answers to question 10 got it right only 13% of the time; people who were 100% sure only got it right 44% of the time. Obviously those numbers should be 50% and 100% respectively.

This builds upon results from previous surveys in which your calibration was also horrible. This is not a human universal - people who put even a small amount of training into calibration can become very well calibrated very quickly. This is a sign that most Less Wrongers continue to neglect the very basics of rationality and are incapable of judging how much evidence they have on a given issue. Veterans of the site do no better than newbies on this measure.

Comment author: sixes_and_sevens 27 July 2015 03:31:07PM 15 points

LW's strongest, most dedicated writers all seem to have moved on to other projects or venues, as has the better part of its commentariat.

In some ways, this is a good thing. There is now, for example, a wider rationalist blogosphere, including interesting people who were previously put off by idiosyncrasies of Less Wrong. In other ways, it's less good; LW is no longer a focal point for this sort of material. I'm not sure if such a focal point exists any more.

Comment author: Baughn 28 July 2015 08:00:05AM *  4 points

Where, exactly? All I've noticed is that there's less interesting material to read, and I don't know where to go for more.

Okay, SSC. That's about it.

Comment author: Vaniver 28 July 2015 07:18:52PM *  3 points

Here's one discussion. One thing that came out of it is the RationalistDiaspora subreddit.

Comment author: Username 07 August 2015 12:39:51PM 1 point

Agentfoundations.org

Comment author: Benito 28 July 2015 03:37:55PM -2 points

Tumblr is the new place.

Comment author: Lumifer 27 July 2015 04:17:50PM 2 points

LW as an incubator?

Comment author: sixes_and_sevens 27 July 2015 04:34:23PM 0 points

Or a host for a beautiful parasitic wasp?

Comment author: Lumifer 27 July 2015 04:46:59PM 3 points

Toxoplasmosis is a better metaphor if you want to go that way :-D

Comment author: [deleted] 27 July 2015 09:09:18PM 5 points

A lot of this has moved to blogs. See malcolmocean.com, mindingourway.com, themindsui.com, agentyduck.blogspot.com, and slatestarcodex.com for more of this discussion.

That being said, I think writing/reading about rationality is very different from becoming good at it. I think someone who did a weekend at CFAR, or the Hubbard Research AIE level 2 workshop, would rank much higher on rationality than someone who spent months reading through all the sequences.

Comment author: D_Malik 27 July 2015 04:43:20PM *  11 points

About that survey... Suppose I ask you to guess the result of a biased coin which comes up heads 80% of the time. I ask you to guess 100 times, of which ~80 times the right answer is "heads" (these are the "easy" or "obvious" questions) and ~20 times the right answer is "tails" (these are the "hard" or "surprising" questions). Then the correct guess, if you aren't told whether a given question is "easy" or "hard", is to guess heads with 80% confidence, for every question. Then you're underconfident on the "easy" questions, because you guessed heads with 80% confidence but heads came up 100% of the time. And you're overconfident on the "hard" questions, because you guessed heads with 80% confidence but got heads 0% of the time.

So you can get apparent under/overconfidence on easy/hard questions respectively, even if you're perfectly calibrated, if you aren't told in advance whether a question is easy or hard. Maybe the effect Yvain is describing does exist, but his post does not demonstrate it.
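
Here's a minimal simulation of this selection effect (a sketch in Python; the sample size and seed are arbitrary):

```python
import random

random.seed(0)

N = 100_000
P_HEADS = 0.8  # the coin's true bias

# A perfectly calibrated guesser: predicts heads, with 80% confidence, every time.
flips = [random.random() < P_HEADS for _ in range(N)]  # True = heads

# Partition by observed outcome, as if the outcome told us the question's difficulty:
easy = [f for f in flips if f]      # heads came up (the "easy"/"obvious" questions)
hard = [f for f in flips if not f]  # tails came up (the "hard"/"surprising" questions)

print(f"confidence on every flip: {P_HEADS:.0%}")
print(f"accuracy on 'easy' flips: {sum(easy) / len(easy):.0%}")  # 100% -> looks underconfident
print(f"accuracy on 'hard' flips: {sum(hard) / len(hard):.0%}")  # 0% -> looks overconfident
```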

Comment author: cousin_it 27 July 2015 06:54:34PM *  4 points

Wow, that's a great point. We can't measure anyone's "true" calibration by asking them a specific set of questions, because we're not drawing questions from the same distribution as nature! That's up there with the obvious-in-retrospect point that the placebo effect gets stronger or weaker depending on the size of the placebo group in the experiment. Good work :-)

Comment author: tim 28 July 2015 02:24:27AM *  2 points

I am probably misunderstanding something here, but doesn't this

Then the correct guess, if you don't know whether a given question is "easy" or "hard"...

basically say, "if you have no calibration whatsoever"? If there are distinct categories of questions (easy and hard) and you can't tell which questions belong to which category, then simply guessing according to your overall base rate will make your calibration look terrible - because it is.

Comment author: D_Malik 28 July 2015 04:50:03PM 0 points

Replace "if you don't know" with "if you aren't told". If you believe 80% of them are easy, then you're perfectly calibrated as to whether or not a question is easy, and the apparent under/overconfidence remains.

Comment author: Lumifer 28 July 2015 05:15:04PM -1 points

If you believe 80% of them are easy, then you're perfectly calibrated as to whether or not a question is easy, and the apparent under/overconfidence remains.

I am still confused.

You don't measure calibration by asking "Which percentage of this set of questions is easy?". You measure it by offering each question one by one and asking "Is this one easy? What about that one?".

Calibration applies to individual questions, not to aggregates. If, for some reason, you believe that 80% of the questions in the set are easy but you have no idea which ones, you are not perfectly calibrated; in fact, your calibration sucks because you cannot distinguish easy questions from hard ones.

Comment author: tut 28 July 2015 06:08:24PM *  2 points

Calibration for single questions doesn't make any sense. Calibration applies to individuals, and is about how their subjective probability of being right about questions in some class relates to what proportion of the questions in that class they are right about.

Comment author: Lumifer 28 July 2015 06:30:09PM *  1 point

Well, let's walk through the scenario.

Alice is given 100 calibration questions. She knows that some of them are easy and some are hard. She doesn't know how many are easy and how many are hard.

Alice goes through the 100 questions and at the end -- according to how I understand D_Malik's scenario -- she says "I have no idea whether any particular question is hard or easy, but I think that out of this hundred 80 questions are easy. I just don't know which ones". And, under the assumption that 80 questions were indeed easy, this is supposed to represent perfect calibration.

That makes no sense to me at all.

Comment author: Vaniver 28 July 2015 07:11:54PM *  3 points

D_Malik's scenario illustrates that it doesn't make sense to partition the questions based on observed difficulty and then measure calibration, because this will induce a selection effect. The correct procedure is to partition the questions based on expected difficulty and then measure calibration.

For example, I say "heads" every time for the coin, with 80% confidence. That says to you that I think all flips are equally hard to predict prospectively. But if you were to compare my track record for heads and tails separately--that is, look at the situation retrospectively--then you would think that I was simultaneously underconfident and overconfident.

To make it clearer what it should look like normally, suppose there are two coins, red and blue. The red coin lands heads 80% of the time and the blue coin lands heads 70% of the time, and we alternate between flipping the red coin and the blue coin.

If I always answer heads, with 80% when it's red and 70% when it's blue, I will be as calibrated as someone who always answers heads with 75%, but will have more skill. But retrospectively, one will be able to make the claim that we are underconfident and overconfident.
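
Here's a sketch of that distinction (Python; an invented setup mirroring the two-coin example, using the Brier score as the measure of skill, which is one standard choice). Both forecasters below are perfectly calibrated, but the one who can tell the coins apart scores better:

```python
import random

random.seed(0)

N = 100_000  # alternating red/blue coin flips

def brier(forecasts):
    """Mean squared error of probability forecasts; lower means more skill."""
    return sum((p - outcome) ** 2 for p, outcome in forecasts) / len(forecasts)

flips = []
for i in range(N):
    p_true = 0.8 if i % 2 == 0 else 0.7  # red coin: 80% heads; blue coin: 70% heads
    flips.append((p_true, 1.0 if random.random() < p_true else 0.0))

# Forecaster A tells the coins apart and forecasts the true rate each time;
# forecaster B always says 75%. Both are perfectly calibrated.
score_a = brier(flips)                          # expected ~0.1850
score_b = brier([(0.75, o) for _, o in flips])  # expected ~0.1875
print(f"discriminating forecaster: {score_a:.4f}")
print(f"constant-75% forecaster:   {score_b:.4f}")
```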

Comment author: Lumifer 28 July 2015 07:41:02PM *  1 point

D_Malik's scenario illustrates that it doesn't make sense to partition the questions based on observed difficulty and then measure calibration, because this will induce a selection effect. The correct procedure is to partition the questions based on expected difficulty and then measure calibration.

Yes, I agree with that. However it still seems to me that the example with coins is misleading and that the given example of "perfect calibration" is anything but. Let me try to explain.

Since we're talking about calibration, let's not use coin flips but use calibration questions.

Alice gets 100 calibration questions. To each one she provides an answer plus her confidence in her answer expressed as a percentage.

In both your example and D_Malik's, the confidence given is the same for all questions. Let's say it is 80%. That is an important part: Alice gives her confidence for each question as 80%. This means that for her the difficulty of each question is the same -- she cannot distinguish between them on the basis of difficulty.

Let's say the correctness of the answer is binary -- it's either correct or not. It is quite obvious that if we collect all Alice's correct answers in one pile and all her incorrect answers in another pile, she will look to be miscalibrated, both underconfident (for the correct pile) and overconfident (for the incorrect pile).

But now we have the issue that some questions are "easy" and some are "hard". My understanding of these terms is that the test-giver, knowing Alice, can forecast which questions she'll be able to mostly answer correctly (those are the easy ones) and which questions she will not be able to mostly answer correctly (those are the hard ones). If this is so (and assuming the test-giver is right about Alice which is testable by looking at the proportions of easy and hard questions in the correct and incorrect piles), then Alice fails calibration because she cannot distinguish easy and hard questions.

You are suggesting, however, that there is an alternate definition of "easy" and "hard" which is the post-factum assignment of the "easy" label to all questions in the correct pile and of the "hard" label to all questions in the incorrect pile. That makes no sense to me, as it is an obviously stupid thing to do, but it may be that the original post argued exactly against this kind of stupidity.

P.S. And, by the way, the original comment which started this subthread quoted Yvain and then D_Malik pronounced Yvain's conclusions suspicious. But Yvain did not condition on the outcomes (correct/incorrect answers), he conditioned on confidence! It's a perfectly valid exercise to create a subset of questions where someone declared, say, 50% confidence, and then see if the proportion of correct answers is around that 50%.

Comment author: Unnamed 28 July 2015 09:25:30PM 3 points

Suppose that I am given a calibration question about a racehorse and I guess "Secretariat" (since that's the only horse I remember) and give a 30% probability (since I figure it's a somewhat plausible answer). If it turns out that Secretariat is the correct answer, then I'll look really underconfident.

But that's just a sample size of one. Giving one question to one LWer is a bad method for testing whether LWers are overconfident or underconfident (or appropriately confident). So, what if we give that same question to 1000 LWers?

That actually doesn't help much. "Secretariat" is a really obvious guess - probably lots of people who know only a little about horseracing will make the same guess, with low to middling probability, and wind up getting it right. On that question, LWers will look horrendously underconfident. The problem with this method is that, in a sense, it still has a sample size of only one, since tests of calibration are sampling both from people and from questions.

The LW survey had better survey design than that, with 10 calibration questions. But Yvain's data analysis had exactly this problem - he analyzed the questions one-by-one, leading (unsurprisingly) to the result that LWers looked wildly underconfident on some questions and wildly overconfident on others. That is why I looked at all 10 questions in aggregate. On average (after some data cleanup) LWers gave a probability of 47.9% and got 44.0% correct. Just 3.9 percentage points of overconfidence. For LWers with 1000+ karma, the average estimate was 49.8% and they got 48.3% correct - just a 1.4 percentage point bias towards overconfidence.

Being well-calibrated does not only mean "not overconfident on average, and not underconfident on average". It also means that your probability estimates track the actual frequencies across the whole range from 0 to 1 - when you say "90%" it happens 90% of the time, when you say "80%" it happens 80% of the time, etc. In D_Malik's hypothetical scenario where you always answer "80%", we aren't getting any data on your calibration for the rest of the range of subjective probabilities. But that scenario could be modified to show calibration across the whole range (e.g., several biased coins, with known biases). My analysis of the LW survey in the previous paragraph also only addresses overconfidence on average, but I also did another analysis which looked at slopes across the range of subjective probabilities and found similar results.
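
To sketch the difference between those two checks (Python, with a small invented dataset of (stated confidence, got it right) records):

```python
from collections import defaultdict

# Invented records: (stated confidence, answered correctly)
answers = [(0.5, True), (0.5, False), (0.5, False), (0.7, True),
           (0.7, True), (0.9, True), (0.9, False), (1.0, True)]

# Check 1: over/underconfidence on average.
mean_conf = sum(p for p, _ in answers) / len(answers)
accuracy = sum(c for _, c in answers) / len(answers)
print(f"mean confidence {mean_conf:.1%} vs. accuracy {accuracy:.1%}")

# Check 2: does each stated probability track its actual hit rate?
by_level = defaultdict(list)
for p, correct in answers:
    by_level[p].append(correct)
for p in sorted(by_level):
    hits = by_level[p]
    print(f"said {p:.0%}: right {sum(hits) / len(hits):.0%} of the time (n={len(hits)})")
```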

Comment author: Vaniver 28 July 2015 11:41:53PM *  0 points

My understanding of these terms is that the test-giver, knowing Alice, can forecast which questions she'll be able to mostly answer correctly (those are the easy ones) and which questions she will not be able to mostly answer correctly (those are the hard ones).

I agree that if Yvain had predicted what percentage of survey-takers would get each question correct before the survey was released, that would be useful as a measure of the questions' difficulty and an interesting analysis. That was not done in this case.

That makes no sense to me as being an obviously a stupid thing to do, but it may be that the original post argued exactly against this kind of stupidity.

The labeling is not obviously stupid--what questions the LW community has a high probability of getting right is a fact about the LW community, not about Yvain's impression of the LW community. The usage of that label for analysis of calibration does suffer from the issue D_Malik raised, which is why I think Unnamed's analysis is more insightful than Yvain's and their critiques are valid.

However it still seems to me that the example with coins is misleading and that the given example of "perfect calibration" is anything but.

It is, according to what calibration means in the context of probabilities. As Unnamed points out, if you are unhappy that we are assigning a property of correct mappings ('calibration') to a narrow mapping ("80%"->80%) instead of a broad mapping ("50%"->50%, "60%"->60%, etc.), it's valid to be skeptical that the calibration will generalize--but that doesn't mean the assessment is uncalibrated.

Comment author: Viliam 28 July 2015 09:09:25AM 4 points

1) There are diminishing returns on talking about improving rationality.

2) Becoming more rational could make you spend less time online, including on LessWrong. (The time you would have spent in the past writing beautiful and highly upvoted blog articles is now spent making money or doing science.) Note: This argument is not true if building a stronger rationalist community would generate more good than whatever you are doing alone instead. However, there may be a problem with capturing the generated value. (Eliezer indirectly gets paid for having published on LessWrong. But most of the others don't.)

Comment author: Unnamed 27 July 2015 06:23:59PM 9 points

I re-analyzed the calibration data, looking at all 10 questions averaged together (which I think is a better approach than going question-by-question, for roughly the reasons that D_Malik gives), and found that veterans did better than newbies (and even newbies were pretty well calibrated). I also found similar results for other biases on the 2012 LW survey.

Comment author: pcm 27 July 2015 07:01:13PM 0 points

Some of the discussion has moved to CFAR, although that involves more focus on how to get better cooperation between System 1 and System 2, and less on avoiding specific biases.

Maybe the most rational people don't find time to take surveys?