satt points out that (via the Bienaymé formula) “An RCT with a sample size of e.g. 400 would still be 10 times better than 4 self-experiments by this metric.”

Since this has come up again, I may as well point out that this is a very abstruse argument.

First of all, if the standard error in a random variable is low to begin with, or I've already done many experiments, decreasing the standard error of my estimate by a factor of 10 is much less valuable.
And second of all, this analysis doesn't connect with anything actionable. What does decreasing the standard error of my estimate by a factor of 10 even mean in actionable terms? How often will this actually end up changing what I do?

The way I'm thinking about this argument is to picture a normal distribution representing my uncertainty about some value. When I do 100 times as many experiments, the distribution

1) becomes skinnier by a factor of 10, and
2) centers itself at a new location, where the probability of the new location is determined by the original distribution.

If my original distribution is especially wide, more experiments could be valuable, especially if the new distribution ends up jumping somewhere far from the center of the original distribution. But if my original distribution was plenty skinny to begin with, making it skinnier won't help me.

See also this comment of mine, which does math showing that just a few perfectly done self-experiments can be quite valuable in actionable terms: http://lesswrong.com/lw/bs0/knowledge_value_knowledge_quality_domain/6d9k

Case Study: Testing Confirmation Bias

by gwern

2nd May 2012

Master copy lives on gwern.net

Confirmation Bias

63 comments, sorted by top scoring

[-]orthonormal13y250

I support your focus on testing confirmation bias, but I don't think that it was worth it to explicitly falsify results (for a short time), compared to saying "oh well" and repeating the process until you do legitimately get an inconvenient result on a self-experiment. You've demonstrated that you're willing to break the taboo (or injunction) of never falsifying object-level results of scientific experience, which makes all of your data less valuable.

I found this to be a good and informative post, nonetheless.

[-]John_Maxwell13y130

I found this to be a good and informative post, nonetheless.

Really? Are you really surprised that people are reluctant to broadcast data that doesn't fit their theory? Have you read any political blogs?

By my model, it takes a pretty unusual person to give anywhere near equal weight to confirming and disconfirming evidence. We're holding Seth Roberts to a very high standard here--one that Gwern himself has not necessarily achieved. Criticizing is easy.

This is a great example of what Frank Adamek talked about in his recent post re: lowering other people's status. The reason folks subconsciously avoid disconfirming evidence is so they can preserve their status. In an ideal world preserving status would be a nonissue and disconfirming evidence would be fine. But then someone like Gwern comes along and snipes someone's status, validating the concern with status that leads to confirmation bias in the first place.

(Stop violating useful social norms Gwern! Punish the norm violator! Just kidding, saying that would make me a hypocrite. I'll assume Gwern posted this in good faith and didn't mean to erode useful social norms.)

So can future articles on individual irrationality please be restricted to people writing about themselves?

[-]orthonormal13y20

To clarify: I'm in support of doing psychological tests on small scales and writing up the results on Less Wrong. I'm not in support of breaking certain ethical injunctions in the process.

If gwern had legitimately gotten a different self-experiment than Seth Roberts, and then the same process had transpired, I'd be entirely in favor of this post. It's an important caveat to self-experimentation that you need to really watch out for confirmation bias, and trust people more if they're willing to publicize negative results as well as positive ones.

But falsifying results to achieve that, even temporarily, was a bad choice (and it makes me less willing to invest my time in reading gwern's self-experimentation in the future).

[-]John_Maxwell13y20

Since I'm objecting, I may as well clarify: I would appreciate it if people told me unimportant lies (that were corrected later) in order to test me for biases, as long as the results of the test were just between the two of us, and possibly also in other circumstances. (Let's say you have to pay me one dollar for each additional person who knows up to the first 20 people, with additional people being free after the first 20.)

[-]gwern13y00

How would you test yourself for confirmation bias?

[-][anonymous]13y20

If I were Seth Roberts, I would look into my blog archive for the initial anecdotal results I posted on experiments now proven to have negative results. If most of these posts seemed positive, I probably have confirmation bias.

[-]gwern13y20

I don't think that can be done, since I don't know of any of his theories which have been 'now proven to have negative results'. I think a post linked somewhere here accuses Roberts of actively avoiding clinical trials, where Roberts replies that he worked with a SUNY professor on 20 case-studies for the Shangri-La diet. Since the diet is his centerpiece and the subject of his only book (AFAIK), it probably represents the best-case testing of any of his theories...

EDIT: http://andrewgelman.com/2010/03/clippin_it/#comment-53303

[-]jsalvatier13y00

Perhaps you're thinking of Andrew Gelman's recent post.

[This comment is no longer endorsed by its author]

[-]lavalamp13y240

This was possibly an expensive experiment in terms of social capital...

I think it would have been better to have waited longer. After only three days, his response seems reasonable. Perhaps after two weeks, it would be more difficult to believe that he would have ever published your data.

[-]John_Maxwell13y80

He doesn't even tell us what the publication lag for the first experiment was.

[-]gwern13y10

The first experiment? You mean the SIAI habit formation thing? I thought it was obvious from the intro specifying when the call for applicants went up and when I posted, but I've edited it to be more explicit.

Or do you mean the vitamin D evening experiment? The results didn't contradict any of his theories, and to the extent it matters to the theory at all, his theory predicts that it ought to damage sleep in the evening since it's influencing circadian rhythms and it isn't a mere matter of vitamin D deficiency.

[-]John_Maxwell13y20

How long before he linked to your initial vitamin D results?

[-]gwern13y-20

Dunno. As I said, it didn't matter.

It just occurred to me - I have an active experiment going with deleting random external links on Wikipedia, but even though this affects a rough minimum of ~335,445 readers of Wikipedia articles (based on the summed March statistics of the affected articles), I will probably catch far less flak when I post my results on the WikiEN-l mailing list than I have already caught for this post here. Humans!

[-]Normal_Anomaly13y140

It just occurred to me - I have an active experiment going with deleting random external links on Wikipedia,

I object to this more than I object to the experiment in the OP.

[-]gwern13y00

Bless your soul! I was completely disheartened at the disinterest of even Wikipedians in my earlier experiment demonstrating that suggestions for adding external links get ignored. Anger is better than apathy.

[-]John_Maxwell13y10

I agree, the number of people affected by an amateur experiment you perform is a good measure of how much flak you should catch.

[-]khafra13y40

On the other hand, people would be reading his site and drawing the wrong conclusions about D supplementation for two weeks. That's some further-spread epistemic pollution costs.

[-]gwern13y50

Yes, that was a major reason for only 3 days. Roberts makes it sound like he was going to do an in-depth analysis or whatever before discussing my data, but I don't believe this: if you look at the vitamin D category, you see he posts plenty of people's reports without formally analyzing their data but just describing it, and he had time to post something like 3 blog posts before I published this, one of which was a link roundup perfect for linking my results.

I didn't realize that people would see the 3 day waiting period as super-questionable. Thinking about it some more, I realize now what I should have done: I should have created a separate page on my site just for the fake results, and sent the subject that but linked it nowhere else. The subject would have no reason to be suspicious, the page would indeed be public, but it would not actually get any traffic from normal readers; hence, I could leave the fake page up for months.

(At some point I could even put up the real results on the main page (for the normal readers), since it would be unlikely for the subject to just randomly visit the page and notice the discrepancy.)

[-]SarahNibs13y150

Andrew Gelman also has a post up today about Seth Roberts not being diligent about seeing disconfirmatory evidence: Selection Bias

Edit: Seth makes several responses, including this succinct claim to have avoided confirmation bias: comment

[-]J_Taylor13y130

And testing confirmation bias in this fashion is intrinsically deceptive, so I probably have damaged my online reputation as well.

I have no way to back this up, and just posting this tastes like hindsight bias to me. Nonetheless, this was my train of thought when I read this post:

Gwern is using what looks like normal communication as a means to experiment.
Gwern is a data-crunching, prediction-making, experiment-performing, freak-of-nature. (I mean this in an extremely complimentary fashion.) Therefore:
This wasn't the first time Gwern did something like this and it will not be the last time.

Of course, as Rational!Beyonce would say "If I thought it then I should have put a prediction on it."

[-]magfrump13y40

Of course, as Rational!Beyonce would say "If I thought it then I should have put a prediction on it."

Amazing.

[-]thomblake13y130

And testing confirmation bias in this fashion is intrinsically deceptive, so I probably have damaged my online reputation as well.

Yes, this test might not have passed a university ethics review - not only did you deceive your subject, which must be done carefully, but he didn't even know he was participating in an experiment!

[-]gwern13y260

Amateur Science -
I do what I must,
because, I can.
For the good of all of us.
Except the ones who were tricked.
But there's no sense crying over all the missed frills,
You just keep on trying until you run out of pills.
And the Science* was fun,
And you get neat posts done
For the people who are, still alive.

* This parody has not been approved by the FDA for any medicinal purposes nor has it been replicated.

[-]Zack_M_Davis13y80

This is cute, but I think it would have been worth trying to salvage the rhyme scheme. (Maybe "... all the missed frills"/"pills" and "And the Science helps most"/"post"?) You also missed a huge opportunity by not starting a line earlier with the words "Amateur science" (sound-alike with Aperture).

[-]gwern13y40

Amateur was a major omission, yeah. I don't think 'helps most/post' really works because it diverges too much. 'Done' is hard to rhyme with, but checking a rhyming dictionary, I realized there was a perfect rhyme: 'fun'. As in, 'And the Science was fun / and you make a neat post'. The rhythms scan, the rhyme is perfect, and it makes sense by reinforcing 'neat'.

[-]TimS13y20

I like how this response is just as relevant to your non-IRB approvable study on what stimuli cause more effort.

Please tell me that I am not the first person to note the (ethically difficult) deceptiveness of that study, even if the results are interesting.

[-]Incorrect13y30

Why is it ethically difficult? (I don't particularly care whether it passes a university ethics review, I base whether I consider it ethical on my emotional reaction to it and emotional reaction to its consequences)

[-]TimS13y00

There is some ethical difficulty in running a study (and performing an intervention) on subjects who don't even know that they are in a study. In the application for a research position, people thought they were applying for a position, not being tested for their responses to stimuli about the use of their work.

Obviously, certain kinds of results are impossible to obtain when you tell the subjects that they are part of an experiment. The value of the results might justify the deception, particularly in cases like these two studies in which the participants did not suffer any significant harm. But it is unrealistic to pretend that no deception occurred, or that deception is not a flag for potential ethical difficulty.

[-][anonymous]13y00

Section 8.02, for starters. And yes, 8.05 and 8.07 provide an exception, but it's debatable whether it applies here, where harm was actually done to Seth Roberts' reputation. It's less debatable that it applies to the researcher study, where apparently no harm was done.

[-]gwern13y30

Arguably, the researchers were harmed inasmuch as they were induced to apply more effort/time than they otherwise would have.

[-]TimS13y-10

Putting in more work to get another interesting experimental result is a harm to the researcher? On what planet?

[-]gwern13y30

Well, presumably if working harder on their submission was their utility-maximizing choice, they would have done so already sans experimental manipulation; if any more quality time was used up, it probably came at the expense of some other activity...

[-]TimS13y00

It looks like I badly misunderstood your comment. When you wrote "the researchers," I thought that was a coy way of referring to yourself in reference to the two experimental results of which I questioned the ethics.

I'm not arguing for the optimality of compliance with an IRB or other "ethical" guidelines - I'm doubtful they do a reasonable job of creating morally optimal research protocols, and they clearly prevent the discovery of certain interesting or useful results - like your results from these posts that relied on deception. And that doesn't even account for the compliance costs that I now realize were the point of your comment. Oops.

[-][anonymous]13y00

I can't find it in my comment history right now, but I've also brought up the apparent lack of ethical oversight in SI-related experiments before. I think the first time was during the first rationality mini-camp.

[-]Douglas_Knight13y10

It would have been passed by the IRB of Gwern U. As long as Gwern is not affiliated with another IRB, that's all that matters.

[-][anonymous]13y130

And testing confirmation bias in this fashion is intrinsically deceptive, so I probably have damaged my online reputation as well.

It's interesting that you mentioned that, because I spent several minutes after reading this being confused, and then I thought:

"It is at least hypothetically possible Gwern is testing confirmation bias with THIS post to Less Wrong."

That was followed by "What evidence would I be looking for to confirm or disprove this hypothesis?"

Which was followed by "It would probably take too much reading to plausibly gather sufficient evidence to make any kind of judgement, and I don't have that much time to assess this post. I'll just put my assessment of this experiment on hold until more verification arrives either way on whether it is or is not a meta experiment."

Was this the kind of reputation damage you were expecting?

[-]gwern13y290

Was this the kind of reputation damage you were expecting?

More 'anything gwern says is a lie and his emails should be ignored and anyone reading his stuff be told he is a self-confessed liar'. (I don't think this is a fair appraisal, since I just wrote the lie up in exhaustive detail, and I only falsified 1 out of 9 results for ~3 days while keeping it as low-key as possible. I could have sent the fake results to Roberts privately, but then his assent or dissent would not be as credible as compared to actually posting it or not.)

For what it's worth, I actually had intended to post this as an Article and not a Discussion if Roberts did fail, but only as a Discussion if he passed. Then I realized this was a publication bias - giving higher billing to positive findings - which leads to confirmation bias, so I resolved to post it as a Discussion no matter the result.

[-]Luke_A_Somers13y100

Upon reading this, I categorized it with "Towards a progressive hermeneutics of quantum gravity" in the deception department (though not in the 'should have been easily caught' department) - the lie was temporary, used as a delicate test of someone else's honesty, and has probably earned you an enemy and gotten a bunch of other people to trust you less.

Speaking of which, it would be best if you could avoid gratuitous deception (like, if you do something like the volunteer experiment, use the data in all cases but neglect to inform half the cohort).

[-]drethelin13y80

Post summary: I, Gwern, am better than Seth Roberts. Watch as I trick him with my superior science skills! This is important, because I have proven that a human is biased, something no one has ever done before.

[-]TrE13y100

Perhaps a bit rude, but this is what one could read from this. I don't think gwern intended at all to show that he's better than Seth; he simply tells an interesting story. Still, there's major signalling going on here, and Seth does have every right to be offended, in my opinion.

[-][anonymous]13y70

In the end, what did we learn from the results of this experiment?

(Please take a moment to think about this yourself before proceeding, to avoid priming.)

My answer: Nyzbfg abguvat. Ernfbaf jr qvqa'g yrnea nalguvat nobhg Eboregf:

1) Onfrq ba jung lbh jebgr urer, vg ybbxf yvxr Eboregf punatrq uvf zvaq nobhg gur rssrpgvirarff bs Mrb orgjrra jura lbhe rkcrevzrag fgnegrq naq jura vg raqrq. Guvf tvirf uvz n cresrpgyl yrtvgvzngr ernfba gb cebgrfg ntnvafg lbhe qngn.

2) Bar vagrenpgvba vf fvzcyl abg rabhtu gb tnhtr ubj ovnfrq n crefba vf. Gurer ner znal pbasbhaqvat inevnoyrf urer, naq guvf vf na a = 1 fnzcyr. Eboregf' erfcbafr znl unir orra nssrpgrq ol uvf zbbq, ubj zhpu ur ngr sbe oernxsnfg, naq nal ahzore bs bgure guvatf. Nf lbh fnvq va gur cbfg, a = 1 rkcrevzragf whfg nera'g tbbq fbheprf bs qngn.

3) Rira vs gurfr ceboyrzf jrer pbeerpgrq sbe, tvira gung gur onfr engr sbe pbzzvggvat pbasvezngvba ovnf vf cebonoyl irel uvtu, C(Eboregf vf zber cebar gb pbasvezngvba ovnf guna gur nirentr vaqvivqhny|guvf rkcrevzrag) vf oneryl uvture guna C(Eboregf vf rdhnyyl cebar gb pbasvezngvba ovnf pbzcnerq gb gur nirentr vaqvivqhny|guvf rkcrevzrag). Ubjrire, vs lbh pbhyq pbeerpg sbe gur ceboyrzf V yvfgrq nobir, guvf xvaq bs rkcrevzrag jbhyq or fbzr rivqrapr ntnvafg "Eboregf eneryl snyyf cerl gb pbasvezngvba ovnf."

Gur rkcrevzrag cebirq rira yrff nobhg pbasvezngvba ovnf, zbfgyl sbe gur ernfbaf fgngrq nobir. Nqqvgvbanyyl, Eboregf vfa'g n ercerfragngvir fnzcyr bs nyy uhzna orvatf, naq bar vapvqrag yvxr guvf vfa'g rabhtu gb tvir rivqrapr nobhg n oebnq gurbel yvxr pbasvezngvba ovnf.

Va fhzznel: Guvf rkcrevzragny erfhyg vf vaqvfgvathvfunoyr sebz abvfr, naq qenjvat pbapyhfvbaf sebz vg jbhyq or hawhfgvsvrq.

[-][anonymous]13y00

Not only did you write an entire post in rot13, but in addition your rot13 link is broken.

[-][anonymous]13y00

Whoops. I didn't notice that. Thanks.

[-][anonymous]13y00

Svkrq yvax: uggc://jjj.ebg13.pbz/

[This comment is no longer endorsed by its author]

[-]Incorrect13y70

And testing confirmation bias in this fashion is intrinsically deceptive, so I probably have damaged my online reputation as well.

If anything, you should have more deceptive and convoluted plots if you want to maximize interesting drama.

[-]gwern13y20

I'm not in Chaos Army - fear, uncertainty, and doubt is not my goal.

[-]John_Maxwell13y60

satt points out that (via the Bienaymé formula) “An RCT with a sample size of e.g. 400 would still be 10 times better than 4 self-experiments by this metric.”

Since this has come up again, I may as well point out that this is a very abstruse argument.

First of all, if the standard error in a random variable is low to begin with, or I've already done many experiments, decreasing the standard error of my estimate by a factor of 10 is much less valuable.
And second of all, this analysis doesn't connect with anything actionable. What does decreasing the standard error of my estimate by a factor of 10 even mean in actionable terms? How often will this actually end up changing what I do?

The way I'm thinking about this argument is to picture a normal distribution representing my uncertainty about some value. When I do 100 times as many experiments, the distribution

1) becomes skinnier by a factor of 10, and
2) centers itself at a new location, where the probability of the new location is determined by the original distribution.

If my original distribution is especially wide, more experiments could be valuable, especially if the new distribution ends up jumping somewhere far from the center of the original distribution. But if my original distribution was plenty skinny to begin with, making it skinnier won't help me.
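(Editorial sketch, not part of the original comment.) The factor of 10 in the quoted claim follows from the standard error of a mean shrinking as 1/sqrt(n), so 100 times the sample gives a distribution 10 times narrower. The Python below illustrates that, and also one rough way to make "how often would this change what I do?" concrete: the chance that the sample mean lands on the wrong side of zero. The function names, the unit standard deviation, and the effect sizes are all assumptions chosen for illustration; a normal sampling distribution with known spread is assumed throughout.

```python
import math


def standard_error(sigma: float, n: int) -> float:
    """Standard error of the mean of n independent measurements with SD sigma."""
    return sigma / math.sqrt(n)


def prob_wrong_sign(true_effect: float, sigma: float, n: int) -> float:
    """Chance the sample mean has the opposite sign from true_effect.

    Assumes the sample mean is normally distributed around true_effect
    with spread standard_error(sigma, n).
    """
    z = abs(true_effect) / standard_error(sigma, n)
    # Lower-tail normal probability P(Z < -z), written with erfc.
    return 0.5 * math.erfc(z / math.sqrt(2))


sigma = 1.0  # hypothetical measurement SD, in arbitrary units

# 4 self-experiments versus a 400-person RCT: 100x the data, 10x narrower.
ratio = standard_error(sigma, 4) / standard_error(sigma, 400)
print(f"standard error ratio: {ratio:.1f}")

# Hypothetical effect sizes: one small, one large relative to sigma.
for effect in (0.1, 2.0):
    for n in (4, 400):
        p = prob_wrong_sign(effect, sigma, n)
        print(f"effect={effect}, n={n}: P(wrong sign) = {p:.5f}")
```

Under these assumed numbers, a small effect (0.1 SD) comes out with the wrong sign about 42% of the time at n = 4 and about 2% of the time at n = 400, so the extra precision matters. For a large effect (2 SD) the wrong-sign probability is already negligible at n = 4, which is the "plenty skinny to begin with" case.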

[-]Stuart_Armstrong13y40

Great work!

[-]erratio13y40

Just out of curiosity, what would you have done if he'd published and linked to the incorrect results? While this version of the story causes you some social damage (gwern is deceptive), wouldn't the damage have been much worse if he'd passed the confirmation bias test?

[-]gwern13y20

I was monitoring my email and RSS closely the night I sent the email. I had already written and proofread the real version, and written an email explaining I had discovered a mistake; so to change my site was just a matter of issuing a single revision-control command (darcs rollback) & re-syncing my site, and then sending the email. I don't think the bad version would have been up for more than 10 or 20 minutes past him posting or replying clearly that he would post it.

[-][anonymous]13y00

Very nicely done! While the sensible thing would have been for him to link to your results, going after Zeo seems like a reasonable thing to do, given [reviews like this one](http://www.amazon.com/review/R1S2WQ7K7ZWERL/ref=cm_cr_dp_title?ie=UTF8&ASIN=B002IY65V4&nodeID=3760901&store=hpc).

[This comment is no longer endorsed by its author]

[-]orthonormal13y00

Two technical glitches: the graph is too large and the relevant portion is blocked by the sidebar, and clicking the gwern.net link sends me to a different essay.

[-]gwern13y00

Image should be fixed.

clicking the gwern.net link sends me to a different essay.

The link sends you to the subsection of that essay (on predictions) corresponding to this page (on confirmation bias), does it not?

[-]orthonormal13y20

It does now, but it didn't earlier. Odd.

[-]Paul Crowley13y00

Excellent work - thank you!

[+]fivelier13y-80