Here's another excellent book from roughly the same time: "The Phenomenon of Science" by Valentin F. Turchin (http://pespmc1.vub.ac.be/posbook.html). It starts from largely similar concepts and proceeds through the evolution of the nervous system to language to math to science. I suspect it may be even more AI-relevant than Powers.

Off-topic: you seem to be one of the organizers of the Houston meetup. I'll be in town the week of Nov 16, feel free to let me know if there is anything scheduled.

Hi shminux. Sorry, just saw your comment. We don't seem to have a date set for November yet, but let me check with the others. Typically we meet on Saturdays, are you still around on the 22nd? Or we could try Sunday the 16th. Let me know.

The Planning Fallacy explanation makes a lot of sense.

This doesn't seem like a good analogy to any real-world situation. The null hypothesis ("the coin really has two tails") predicts the exact same outcome every time, so every experiment should get a p-value of 1, unless the null hypothesis is false, in which case someone will eventually get a p-value of 0. This is a bit of a pathological case which bears little resemblance to real statistical studies.

While the situation is admittedly oversimplified, it does seem to have the advantage that anyone can replicate it exactly at a very moderate expense (a two-headed coin will also do, with a minimum amount of caution). In that respect it may actually be more relevant to the real world than any vaccine/autism study.

Indeed, every experiment should get a pretty strong p-value (though never exactly 1), but what gets reported is not the actual p but whether it is above .95. (That is an arbitrary threshold once proposed by Fisher, who never intended it to play the role it currently plays in science, but merely as a rule of thumb for deciding whether a hypothesis is worth a follow-up at all.) But even the exact p-values refer to only one possible type of error, and the probability of the other is generally *not* (1-p), much less (1-alpha).

I don't see a paradox. After 100 experiments one can conclude that either the confidence level was set too low, or the papers are all biased toward two-tailed coins. But which is it?

(1) is obvious, of course, in hindsight. However, changing your confidence level after the observation is generally advised against. But (2) seems to be confusing Type I and Type II error rates.

On another level, I suppose it can be said that *of course* they are all biased! But biased by the actual two-tailed coin, rather than by researchers' prejudice against normal coins.

What was the mistake?

Neglecting all of the hypotheses that would produce the mirrored observation without the coin being two-tailed. The mistake in your question is the "the". The final overconfidence is the least of the mistakes in the story.

Mistakes more relevant to practical empiricism: Treating ">= 95%" as "= 95%" is a reasoning error, resulting in overtly wrong beliefs. Choosing to abandon all information apart from the single boolean is a (less serious) efficiency error. Listeners can still be subjectively-objectively 'correct', but they will be less informed.

Treating ">= 95%" as "= 95%" is a reasoning error

Hence my question in another thread: Was that "exactly 95% confidence" or "at least 95% confidence"? However, when researchers say "at a 95% confidence level" they typically mean "*p* < 0.05", and reporting the actual *p*-values is often even explicitly discouraged (let's not digress into whether it is justified).

Yet *the* mistake I had in mind (as opposed to other, less relevant, merely "*a*" mistakes) involves Type I and Type II error rates. Just because you are 95% (or more) confident of not making one type of error doesn't guarantee you an automatic 5% chance of getting the other.
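To make the Type I / Type II point concrete, here is a minimal sketch. It assumes a framing the thread does not spell out: a one-sided test whose null hypothesis is "the coin is fair", rejected when the number of tails is large enough. The function names (`tail_prob`, `critical_value`, `type_ii_rate`) are made up for illustration. The Type I rate is capped at 5% by design, while the Type II rate depends entirely on how biased the coin really is.

```python
from math import comb

def tail_prob(n, k, p):
    """P(X >= k) for X ~ Binomial(n, p)."""
    return sum(comb(n, i) * p**i * (1 - p)**(n - i) for i in range(k, n + 1))

def critical_value(n, alpha=0.05):
    """Smallest k such that a fair coin shows >= k tails with probability <= alpha."""
    for k in range(n + 2):
        if tail_prob(n, k, 0.5) <= alpha:
            return k

def type_ii_rate(n, p_true, alpha=0.05):
    """Chance of failing to reject 'fair coin' when P(tails) is really p_true."""
    k = critical_value(n, alpha)
    return 1 - tail_prob(n, k, p_true)

n = 100
k = critical_value(n)
print("reject 'fair' when tails >=", k)
print("actual Type I rate:", tail_prob(n, k, 0.5))  # at most 0.05 by construction
for p_true in (0.55, 0.6, 0.7, 1.0):
    # Type II rate changes with the true bias; nothing pins it to 5%
    print(p_true, type_ii_rate(n, p_true))
```

For a genuinely two-tailed coin (`p_true = 1.0`) the Type II rate is exactly 0, so there is no reason to expect 5% of reports to disagree.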

Simple statistics, but eye-opening. I wonder if gwern would be interested enough to do a similar analysis, or maybe he already has.

Well, perhaps a bit too simple. Consider this. You set your confidence level at 95% and start throwing a coin. You observe 100 tails out of 100. You publish a report saying "the coin has tails on both sides at a 95% confidence level" because that's what you chose during design. Then 99 other researchers repeat your experiment with the same coin, arriving at the same 95%-confidence conclusion. But you would expect to see about 5 reports claiming otherwise! The paradox is resolved when somebody comes up with a trick using a mirror to observe both sides of the coin at once, finally concluding that the coin *is* two-tailed with 100% confidence.

What was the mistake?
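The story can be replayed as a small simulation. This is only a sketch under stated assumptions: each researcher tests the null hypothesis "the coin is fair" with a one-sided exact binomial test at alpha = 0.05, and `p_value_vs_fair` and `count_rejections` are hypothetical helper names, not anything from the thread.

```python
import random
from math import comb

def p_value_vs_fair(tails, flips):
    """One-sided p-value: chance that a fair coin gives at least `tails` tails."""
    return sum(comb(flips, i) for i in range(tails, flips + 1)) / 2**flips

def count_rejections(p_tails, researchers=100, flips=100, alpha=0.05, seed=0):
    """How many researchers reject 'the coin is fair' at level alpha."""
    rng = random.Random(seed)
    rejections = 0
    for _ in range(researchers):
        # random() is in [0, 1), so p_tails = 1.0 always yields tails
        tails = sum(rng.random() < p_tails for _ in range(flips))
        if p_value_vs_fair(tails, flips) <= alpha:
            rejections += 1
    return rejections

# Two-tailed coin: every one of the 100 reports reaches the same conclusion.
print(count_rejections(p_tails=1.0))
# Fair coin: this is where the "about 5 in 100" false alarms would show up.
print(count_rejections(p_tails=0.5))
```

Under these assumptions the 5% describes how often a fair coin gets wrongly flagged, not how many reports should dissent when the coin really is two-tailed.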

No; it's standard to set the threshold for your statistical test for 95% confidence. Studies with larger samples can detect smaller differences between groups with that same statistical power.

How does your choice of threshold (made beforehand) affect your actual data and the information about the actual phenomenon contained therein?

You seem to be exaggerating a bit here: that's a 99.65% reduction. Hope it's the only inaccuracy in your estimates!