shminux comments on Too good to be true - LessWrong

Post author: PhilGoetz 11 July 2014 08:16PM


Comment author: shminux 11 July 2014 09:57:43PM *  1 point [-]

Simple statistics, but eye-opening. I wonder if gwern would be interested enough to do a similar analysis, or maybe he already has.

Comment author: gwern 11 July 2014 11:49:03PM *  20 points [-]

Goetz is re-inventing a meta-analytic wheel here (which is nothing to be ashamed of). It certainly is the case that a body of results can be too good to be true. To Goetz's examples, I'll add acupuncture, but wait, that's not all! We can add everything to the list: "Do Certain Countries Produce Only Positive Results? A Systematic Review of Controlled Trials" is a fun** paper which finds

In studies that examined interventions other than acupuncture ['all papers classed as “randomized controlled trial” or “controlled clinical trial”'], 405 of 1100 abstracts met the inclusion criteria. Of trials published in England, 75% gave the test treatment as superior to control. The results for China, Japan, Russia/USSR, and Taiwan were 99%, 89%, 97%, and 95%, respectively. No trial published in China or Russia/USSR found a test treatment to be ineffective.

'Excess significance' is not a new concept (fun fact: people even use the phrase 'too good to be true' to summarize it, just like Goetz does) and is a valid sign of bias in whatever set of studies one is looking at, and as he says, you can treat it as a binomial to calculate the odds of n studies failing to hit their quota of 5% false positives and instead delivering 0% or whatever. But 5% here is just the lower bound: you can substantially improve on it by taking statistical power into account, which is basically how Schimmack's 'incredibility index' works*. More recent is the p-curve approach, but I don't understand that as well.
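
(To make the binomial concrete, here's a minimal Python sketch; the 60-study count, the 5% alpha, and the 50% power are illustrative assumptions, not figures from any particular literature.)

```python
from scipy.stats import binom

alpha = 0.05   # conventional false-positive rate
power = 0.5    # assumed typical statistical power
n, k = 60, 60  # hypothetical: 60 studies, all 60 reporting positive results

# Best case: every study tested a true effect, but each succeeds only
# with probability `power`, so k or more positives out of n has probability:
print(binom.sf(k - 1, n, power))  # P(X >= 60) = 0.5^60, ~9e-19

# And if instead every hypothesis were null, 60 straight false positives:
print(alpha ** n)                 # ~9e-79
```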

To some extent, you can also diagnose this problem in funnel plots: if study datapoints clump 'too tightly' within the cone of precision vs significance and you don't see any small/low-power studies wandering over into the 'bad' area of point-estimates where random noise should be bouncing at least some of them, then there's something funny going on with the data.
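
(Here's a toy simulation of what an unbiased funnel plot looks like; the true effect size, sample-size range, and study count are all arbitrary assumptions. In a biased literature, the widely-scattered low-precision points would be conspicuously missing from the 'negative' side of the funnel.)

```python
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(0)
true_effect = 0.2                          # assumed true effect size
n_per_study = rng.integers(10, 500, 200)   # assumed range of study sizes

# Each study's estimate is noisy, with standard error shrinking as 1/sqrt(n):
se = 1 / np.sqrt(n_per_study)
estimates = rng.normal(true_effect, se)

plt.scatter(estimates, 1 / se, s=10)
plt.axvline(true_effect, linestyle="--")
plt.xlabel("estimated effect")
plt.ylabel("precision (1/SE)")
plt.title("Unbiased funnel: low-precision studies scatter widely")
plt.show()
```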

* I say 'basically' because Schimmack intends his II for use in psychology papers of the sort which report, say, 5 experiments testing a particular hypothesis, and mirabile dictu, all 5 support the authors' theory.

Now, if we considered only false positives, the odds of none of the 5 positives being a false positive are 0.95^5, or 77.4% - so 5 positives isn't especially damning, nothing like 60 papers all claiming positive results. But we can do better, by looking at the other kind of error.

Schimmack points out that you can look instead at the other side of the coin from alpha/false positives: statistical power, the odds of finding a statistically-significant result assuming the effect actually exists. Given that experiments usually have low power, like 50%, half of the paper's experiments should have 'failed' even if they were right. So now we ask instead, 'since half the experiments should have failed even in the best case, where we're testing a true hypothesis, how likely is it that all 5 succeeded?' The calculation is 0.5^5, or 3% - so their results are truly incredible!
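
(Both calculations, as a quick sketch:)

```python
alpha, power, n = 0.05, 0.5, 5

# Taking each positive at face value, the chance that none of
# the 5 is a false positive:
print((1 - alpha) ** n)  # 0.7737... - 5/5 positives alone isn't damning

# But with 50% power, even 5 true hypotheses should mostly fail somewhere;
# the chance all 5 experiments succeed:
print(power ** n)        # 0.03125 - 'incredible' in Schimmack's sense
```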

(If I understand the logic of NHST correctly, 5% is merely the guaranteed lower bound of error, due to the choice of 0.05 for alpha. But unless every experiment is run with a billion subjects and has statistical power of 100%, the real percentage of 'failed' studies should be much higher, with the exact amount based on how bad the power is.)

** Did I say 'fun'? I actually meant, 'incredibly depressing' and 'makes me fear for the future of science if so much cargo cult science can be done in non-Western countries which have the benefit of centuries of scientific work and philosophy and many of whose scientists trained in the West, and yet somehow, it seems that the spirit of science just didn't get conveyed, and science there has been corrupted into a hollow mockery of itself, creating legions of witch-doctors who run "experiments" and write "papers" and do "statistics" none of which means anything'.

Comment author: IlyaShpitser 12 July 2014 01:03:04AM 4 points [-]

Science is not a magic bullet against bad incentives. I am more optimistic: we are getting a lot done despite bad incentives.

Comment author: gwern 12 July 2014 01:40:17AM *  16 points [-]

Science is not a magic bullet against bad incentives.

But none of the incentives seem particularly strong there. It's not offensive to any state religion, it's not objectionable to local landlords, it's not a subversive creed espoused by revolutionaries who want to depose the emperor. The bad incentives here seem to be small bureaucratic ones, along the lines of it being easier to judge academics for promotion based on how many papers they publish. If genuine science can't survive that and will degenerate into cargo cult science when hit by such weak incentives...

Comment author: IlyaShpitser 14 July 2014 04:58:14PM 5 points [-]

But none of the incentives seem particularly strong there.

The bad incentives here seem to be small bureaucratic ones, along the lines of it being easier to judge academics for promotion based on how many papers they publish.

People respond strongly to this in the West also -- "least publishable units", etc.

it seems that the spirit of science just didn't get conveyed

This is almost mystical wording. There is bad science in the West, and good science in the East. I would venture to guess that the crappy state of science in e.g. China is just due to the weak institutions/high corruption levels in their society. If you think you can get away with dumping plastic in milk, a little data faking is the least of your problems. As that gets better, science will get better too.

Comment author: Azathoth123 15 July 2014 02:19:04AM 8 points [-]

I would venture to guess that the crappy state of science in e.g. China is just due to the weak institutions/high corruption levels in their society. If you think you can get away with dumping plastic in milk, a little data faking is the least of your problems.

That explains China and Russia/USSR; it doesn't explain Japan and Taiwan.

Comment author: private_messaging 21 July 2014 05:42:39AM *  3 points [-]

The study was looking at English texts, not Russian, Chinese, or Japanese texts.

edit: a study on foreign-language bias in German-speaking countries.

Only 35% of German-language articles, compared with 62% of English-language articles, reported significant (p < 0.05) differences in the main endpoint between study and control groups (p = 0.002 by McNemar's test)

And that's Germans, for whom it is piss easy to learn English (compared to Russians, Chinese, or Japanese).

Comment author: gwern 21 July 2014 02:33:45PM 1 point [-]

Why did you omit the part where a third of the sample was published in both English and German, and hence weakens the bias? (That is comparable to the overlap for Chinese & English publications.)

Comment author: gwern 15 July 2014 05:27:51PM *  7 points [-]

People respond strongly to this in the West also -- "least publishable units", etc.

And yet, at least clinical trials fail here, and we don't have peer-review rings being busted or people throwing bales of money out the window as the police raid them for assisting academic fraud. (To name some recent Chinese examples.)

I would venture to guess that the crappy state of science in e.g. China is just due to the weak institutions/high corruption levels in their society.

Again, what incentives? If science cannot survive some 'weak institutions' abroad, which don't strike me as any worse than, say, the Gilded Age in America (and keep in mind the relative per capita GDPs of China now and, say, the golden age of German science before WWII), how long can one expect it to last?

This is almost mystical wording.

It's gesturing to society-wide factors of morality, values, and personality, yes, since it doesn't seem to be related to more mundane factors like per capita GDP.

As that gets better, science will get better too.

Japan is a case in point here. Almost as bad as China on the trial metric despite over a century of Western-style science and a generally uncorrupt society which went through its growing pains decades ago.

Comment author: private_messaging 21 July 2014 05:29:12AM *  2 points [-]

This is almost mystical wording.

There's something that just didn't get conveyed: the English language. That paper, with its idiot finding, was looking at studies downloaded from Medline and presumably published in English, or at least with an English abstract (the search was done for English terms and no translation efforts were mentioned).

As long as researchers retain the freedom to either write their study up in English or not, there's going to be an additional publication-in-a-very-foreign-language bias.

With regards to acupuncture, one thing that didn't happen is the Soviet Union being full of acupuncture centres and posters about the awesomeness of acupuncture everywhere on the walls - something that would have happened if there was indeed such a high prevalence of positive findings in the locally available literature.

Comment author: V_V 25 July 2014 09:27:50AM 2 points [-]

As long as researchers retain the freedom to either write their study up in English or not, there's going to be an additional publication-in-a-very-foreign-language bias.

As a rule of thumb, I would say that any research published after the early 1990s in a language other than English is most likely crap.

Comment author: gwern 25 July 2014 07:11:13PM 2 points [-]

Why do you think it changed, and in the early 1990s specifically? (The original study I posted only examined '90s papers and so couldn't show any time-series like that, so it can't be why you think that.)

Comment author: V_V 25 July 2014 08:59:00PM 3 points [-]

I suppose that before the 1990s respectable Soviet scientists published primarily in Russian.

Comment author: gwern 21 July 2014 02:33:32PM *  2 points [-]

As long as researchers retain the freedom to either write their study up in English or not, there's going to be an additional publication-in-a-very-foreign-language bias.

Yes, but it's not sufficient to explain the results. To use your German example, even a doubling of significance rates between the vernacular and English doesn't give one a ~100% success rate in evaluating treatments, since their net success rate across the 3 categories is going to be something like 40%. Nor is publishing in English going to be a rare and special event, regardless of how hard English is to learn, because publishing in high-impact English-language journals is part of how Chinese universities are ranked and people are rewarded.
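
(A back-of-the-envelope sketch of this argument, using the rates from the German-language study quoted upthread; the ~40% net success rate is the assumption from this paragraph.)

```python
german_rate, english_rate = 0.35, 0.62    # rates quoted from the German-language study
bias_factor = english_rate / german_rate  # ~1.77x inflation from language bias

net_rate = 0.40  # assumed net success rate across treatment categories
print(min(net_rate * bias_factor, 1.0))   # ~0.71 - far from the 97-100% observed
```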

With regards to acupuncture, one thing that didn't happen is the Soviet Union being full of acupuncture centres and posters about the awesomeness of acupuncture everywhere on the walls

Uh huh. But acupuncture is not part of the Russian cultural heritage. What I do see instead is, to name one example (what with my not being a Russian familiar with the particular pathologies of Russian science), tons of bogus nootropics studies (they come up on /r/nootropics periodically as people discover yet another translated abstract on Pubmed of a sketchy substance cursorily tested in animals), because interest in human enhancement is part of Russian culture.

Unsurprisingly, pseudo-medicine and pseudo-science will vary by region - which is, after all, the point of comparing acupuncture studies in the West to studies in East Asia! (If there were millions of acupuncture fanatics in Russia and the UK and the USA just like in China/Korea/Japan, then what would we learn, exactly, from comparing studies?) We expect there to be regional differences and that the West will be less committed & more disinterested than East Asia, closer to the ground truth, and hence the difference gives us a lower bound on how big the biases are.

Comment author: private_messaging 21 July 2014 09:09:50PM *  1 point [-]

Nor is publishing in English going to be a rare and special event

Publication in general doesn't have to be rare and special; only the publication of negative results has to be uncommon. People just care less about publishing negative results and prefer to publish positive results; if publication in a foreign language takes X amount of effort, and the positive studies already use up all of the X, no X is left for negative results... There are other issues, e.g. how many of those tests were re-testing simple, effective FDA-approved drugs and such?

Also, for the Soviet Union, there would be a certain political advantage in finding no efficacy of drugs that are expensive to manufacture or import. And one big aspect of Soviet backwardness was always the disbelief that something actually works.

Even assuming that the publications always found whatever the experimenter wanted to find, it wouldn't explain why predominantly an effect is found. What of the chemical safety studies? There's a very strong bias to fail to disprove the null hypothesis.

Unsurprisingly, pseudo-medicine and pseudo-science will vary by region

Yet your paper somehow found a ridiculously high positive rate for acupuncture. The way I think it would work: first things first, it's very difficult to blind acupuncture studies, and inadequately blinded experiments should find a positive result from the placebo effect; secondly, because that's the case, nobody really cares about that effect; and thirdly, de facto the system did not result in the construction of acupuncture centres.

I haven't really noticed nootropics being a big thing, and various rat-maze studies were and are largely complete crap anyway - to the point that the impact of the experimenter's gender was only discovered recently.

edit: also, if we're looking at Russia from 1991 to 1998, that was the time when scientists and other such government employees were literally not getting paid their wages. I remember that time; my parents were not paid for months at a time, and they were reselling shampoo on the side to get some cash.

Comment author: private_messaging 21 July 2014 09:53:12PM *  3 points [-]

Oh, and to add: one big 'thing' in the Soviet Union was research in phage therapy, hoping to replace antibiotics with it, but somehow they didn't end up replacing antibiotics with homebrew phage therapy - something I'd expect to happen if they were simply finding what they wanted to find and otherwise not doing science. To summarize, I see this allegation of some grave fault, but I fail to see the consequences of this fault. Nor did they end up having all the workers take some 'nootropics' that don't work, or anything likewise stupid.

Comment author: gwern 22 July 2014 01:36:50AM *  3 points [-]

Publication in general doesn't have to be rare and special; only the publication of negative results has to be uncommon.

I realize that, and I've already pointed out why the difference in rates is not going to be that large & that your cite does not explain the excess significance in their sample.

There are other issues, e.g. how many of those tests were re-testing simple, effective FDA-approved drugs and such?

Doesn't matter that much. Power, usually quite low, sets the upper limit on how many of the results should have been positive even if we assume every single one was testing a known-efficacious drug (which hypothesis raises its own problems: how is that consistent with your claims about the language bias towards publishing cool new results?)
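
(A sketch of how power caps the positive rate; the 100-trial count and the generous 80% power are illustrative assumptions, not the paper's per-country figures.)

```python
from scipy.stats import binom

n, power = 100, 0.8  # hypothetical: 100 trials, all of efficacious drugs

# Power caps the expected positive rate at 80%, so a 99%+ positive rate
# is wildly improbable even in this most charitable scenario:
print(binom.sf(98, n, power))  # P(>= 99 positives) ~ 5e-9
```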

Also, for the Soviet Union, there would be a certain political advantage in finding no efficacy of drugs that are expensive to manufacture or import.

So? I don't care why the Russian literature is biased, just that it is.

What of the chemical safety studies? There's a very strong bias to fail to disprove the null hypothesis.

Yes, but toxicology studies done by industry are not aimed at academic publication, and the ones aimed at academic publication have the usual incentives to find something, and so are part of the overall problem.

Yet your paper somehow found a ridiculously high positive rate for acupuncture. The way I think it would work: first things first, it's very difficult to blind acupuncture studies, and inadequately blinded experiments should find a positive result from the placebo effect;

Huh? The paper finds that acupuncture success rates vary by region: USA/Sweden/Germany 53%/59%/63%, China/Japan/Taiwan 100%, etc.

secondly, because that's the case, nobody really cares about that effect; and thirdly, de facto the system did not result in the construction of acupuncture centres.

How much have you looked? There's plenty of acupuncture centres in the USA despite a relatively low acupuncture success rate.

I haven't really noticed nootropics being a big thing

Does a fish notice water? But fine, maybe you don't; feel free to supply your own example of Russian pseudoscience and traditional medicine. I doubt Russian science is a shining jewel of perfection with no faults given its 91% acupuncture success rate (admittedly on a small base).

but somehow they didn't end up replacing antibiotics with homebrew phage therapy

Not sure that's a good example, as Wikipedia seems to disagree about homebrew phage therapy not being applied: https://en.wikipedia.org/wiki/Phage_therapy#History

When antibiotics were discovered in 1941 and marketed widely in the U.S. and Europe, Western scientists mostly lost interest in further use and study of phage therapy for some time.[12] Isolated from Western advances in antibiotic production in the 1940s, Russian scientists continued to develop already successful phage therapy to treat the wounds of soldiers in field hospitals. During World War II, the Soviet Union used bacteriophages to treat many soldiers infected with various bacterial diseases e.g. dysentery and gangrene. Russian researchers continued to develop and to refine their treatments and to publish their research and results. However, due to the scientific barriers of the Cold War, this knowledge was not translated and did not proliferate across the world.

Anyway,

To summarize, I see this allegation of some grave fault, but I fail to see the consequences of this fault.

How do you see the unseen? Unless someone has done a large definitive RCT, how does one ever prove that a result was bogus? Nobody is ever going to take the time and resources to refute those shitty animal experiments with a much better experiment. Most scientific findings never get that sort of black-and-white refutation; they just get quietly forgotten and buried, and even the specialists don't know about them. Most bad science doesn't look like Lysenko. Or look at evidence-based medicine in the West: rubbish medicine doesn't look like a crazy doc slicing open patients with a scalpel, it just looks like regular old medicine which 'somehow' turns up no benefit when rigorously tested and is quietly dropped from the medical textbooks.

To diagnose bad science, you need to look at overall metrics and indirect measures - like excess significance. Like 91% of acupuncture studies working.

Comment author: private_messaging 22 July 2014 05:33:13AM *  -2 points [-]

Doesn't matter that much. Power, usually quite low...

If you want to persist in your mythical ideas regarding Western civilization by postulating whatever you need and making shit up, there's nothing I or anyone else can do about it.

So? I don't care why the Russian literature is biased, just that it is.

Your study is making a more specific claim than mere bias in research: it's claiming bias in one particular direction.

Not sure that's a good example, as Wikipedia seems to disagree about homebrew phage therapy not being applied:

The point is that the SU was, mostly, using antibiotics (once production was set up, i.e. from some time after WW2).

There's plenty of acupuncture centres in the USA despite a relatively low acupuncture success rate.

Well, and there weren't plenty in the Soviet Union, despite the supposedly higher success rate.

Huh? The paper finds that acupuncture success rates vary by region: USA/Sweden/Germany 53%/59%/63%, China/Japan/Taiwan 100%, etc.

If you don't know the correct rate, you can't tell which specific rate is erroneous. It's not realistically possible to construct a blind study of acupuncture, so, unlike, say, homoeopathy, it is a very shitty measure of research errors.

To diagnose bad science, you need to look at overall metrics and indirect measures - like excess significance. Like 91% of acupuncture studies working.

I really doubt that 91% of Russian-language acupuncture studies published in the Soviet Union found a positive effect (I dunno about 1991-1998 Russia; it was fucked up beyond belief at that time), and I don't know how many studies should have found a positive effect (followed by a note that more adequate blinding must be invented to study it properly).

And we know that, whatever was the case, there was no Soviet abandonment of normal medicine in favour of acupuncture - the system somehow worked out OK in the end.

Comment author: [deleted] 25 July 2014 11:32:00AM 1 point [-]

Does a fish notice water?

Well, humans do notice air some of the time. (SCNR.)

Comment author: dvasya 11 July 2014 10:56:12PM 2 points [-]

Well, perhaps a bit too simple. Consider this. You set your confidence level at 95% and start flipping a coin. You observe 100 tails out of 100 flips. You publish a report saying "the coin has tails on both sides at a 95% confidence level", because that's what you chose during the design. Then 99 other researchers repeat your experiment with the same coin, arriving at the same 95%-confidence conclusion. But you would expect to see about 5 reports claiming otherwise! The paradox is resolved when somebody comes up with a trick using a mirror to observe both sides of the coin at once, finally concluding that the coin is two-tailed with 100% confidence.

What was the mistake?

Comment author: Douglas_Knight 12 July 2014 08:07:25PM 3 points [-]

I don't know if the original post was changed, but it explicitly addresses this point:

Note: This does not apply in the same way to reviews that show a link between X and Y

Comment author: Vaniver 15 July 2014 01:49:48AM 2 points [-]

The actual situation is described this way:

I have a coin which I claim is fair: that is, there is equal chance that it lands on heads and tails, and each flip is independent of every other flip.

But when we look at 60 trials of the coin flipped 5 times (that is, 300 total flips), we see that there are no trials in which either 0 heads were flipped or 5 heads were flipped. Every time, it's 1 to 4 heads.

This is odd - for a fair coin, there's a 6.25% chance that we would see 5 tails in a row or 5 heads in a row in a set of 5 flips. To not see that 60 times in a row has a probability of only 2.1%, which is rather unlikely! We can state with some confidence that this coin does not look fair; there is some structure to it that suggests the flips are not independent of each other.
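
(Checking those numbers, as a quick sketch:)

```python
p_extreme = 2 * 0.5 ** 5        # P(0 or 5 heads in 5 fair flips)
print(p_extreme)                # 0.0625 - the 6.25% figure
print((1 - p_extreme) ** 60)    # ~0.021 - chance of never seeing it in 60 trials
```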

Comment author: Caspian 14 July 2014 11:15:37PM 2 points [-]

One mistake is treating 95% as the chance of the study indicating two-tailed coins, given that they were two-tailed coins. More likely it was meant as the chance of the study not indicating two-tailed coins, given that they were not two-tailed coins.

Try this:

You want to test whether a coin is biased towards heads. You flip it 5 times, and consider 5 heads a positive result, 4 heads or fewer negative. You're aiming for 95% confidence but have to settle for 31/32 = 96.875%. Treating 4 heads as a positive result wouldn't work either, as that would get you less than 95% confidence.
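
(The arithmetic, as a sketch:)

```python
from math import comb

flips = 5
p_5_heads = comb(flips, flips) / 2**flips      # P(5 heads | fair) = 1/32
print(1 - p_5_heads)                           # 0.96875 - the achievable confidence
p_4_or_5 = (comb(flips, 4) + comb(flips, 5)) / 2**flips
print(1 - p_4_or_5)                            # 0.8125 - counting 4+ heads misses 95%
```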

Comment author: The_Duck 12 July 2014 01:55:53AM *  2 points [-]

This doesn't seem like a good analogy to any real-world situation. The null hypothesis ("the coin really has two tails") predicts the exact same outcome every time, so every experiment should get a p-value of 1, unless the null hypothesis is false, in which case someone will eventually get a p-value of 0. This is a bit of a pathological case which bears little resemblance to real statistical studies.

Comment author: dvasya 12 July 2014 08:52:37AM 2 points [-]

While the situation admittedly is oversimplified, it does seem to have the advantage that anyone can replicate it exactly at a very moderate expense (a two-headed coin will also do, with a minimum amount of caution). In that respect it may actually be more relevant to real world than any vaccine/autism study.

Indeed, every experiment should get a pretty strong p-value (though never exactly 1), but what gets reported is not the actual p but whether the confidence exceeds .95 (an arbitrary threshold proposed once by Fisher, who never intended it to play the role it currently plays in science, but merely as a rule of thumb to see whether a hypothesis is worth a follow-up at all). But even the exact p-values refer to only one possible type of error, and the probability of the other is generally not (1-p), much less (1-alpha).

Comment author: wedrifid 11 July 2014 11:09:16PM *  1 point [-]

What was the mistake?

Neglecting all of the hypotheses which would produce the mirrored observation but do not involve the coin being two-tailed. The mistake in your question is the "the". The final overconfidence is the least of the mistakes in the story.

Mistakes more relevant to practical empiricism: Treating ">= 95%" as "= 95%" is a reasoning error, resulting in overtly wrong beliefs. Choosing to abandon all information apart from the single boolean is a (less serious) efficiency error. Listeners can still be subjectively-objectively 'correct', but they will be less informed.

Comment author: dvasya 11 July 2014 11:20:19PM 1 point [-]

Treating ">= 95%" as "= 95%" is a reasoning error

Hence my question in another thread: was that "exactly 95% confidence" or "at least 95% confidence"? However, when researchers say "at a 95% confidence level" they typically mean "p < 0.05", and reporting the actual p-values is often even explicitly discouraged (let's not digress into whether that is justified).

Yet the mistake I had in mind (as opposed to other, less relevant, merely "a" mistakes) involves Type I and Type II error rates. Just because you are 95% (or more) confident of not making one type of error doesn't guarantee you an automatic 5% chance of getting the other.

Comment author: shminux 11 July 2014 11:27:01PM 0 points [-]

I don't see a paradox. After 100 experiments one can conclude that either the confidence level was set too low, or the papers are all biased toward two-tailed coins. But which is it?

Comment author: dvasya 11 July 2014 11:36:40PM 1 point [-]

(1) is obvious, of course - in hindsight. However, changing your confidence level after the observation is generally advised against. But (2) seems to confuse Type I and Type II error rates.

On another level, I suppose it can be said that of course they are all biased! But by the actual two-tailed coin, rather than the researchers' prejudice against normal coins.