Yes, this also seems related to a lot of the stuff I've been talking about with Fermi Modeling, and also a bunch of stuff that Peter Thiel calls Pyrrhonian Skepticism.
Pyrrhonian Skepticism as described there sounds like the thing I'm arguing against as not-quite-right: looking for the negation. The idea implies that you're attached to a hypothesis. It sets a low bar, where you come up with one other hypothesis. I won't deny that this is a useful mental tool, but false dichotomies are almost as bad as attachment to single beliefs, and for the same reason, and it sets up a misleading standard of evidence. The idea that you generate experiments by trying to falsify a hypothesis is confusing. It's better than trying to confirm, but only because it starts to point toward the real thing. You generate experiments, and evidence, by trying to differentiate.
EDIT: Ok, that seems too strong. "trying to disprove"/"looking for the negation" is a convenient whipping-boy for my argument, because it's a pervasive idea which value-of-information beats. Nonetheless, asking "what if I'm wrong about that?" is more like the starting point for generating multiple hypotheses, than it is an alternative. So, the method is inexact because it is incomplete. It's likely, for example, that the way Peter Thiel employs the method amounts to the whole picture I'm gesturing at. But, there's a different way you can employ the method, where the negation of your hypothesis is interpreted to imply absurd things. In this version, you can think you're making the right motions (not falling prey to confirmation bias), and be wrong.
The steel-man of Pyrrhonian Skepticism is something like "look for cruxes" in the double-crux sense. Look for variables which have high value of information for you. Look for things which differentiate between the most plausible hypotheses.
Great article, but as a sidenote: it seems that the spike experiment would be worth running even if G-complex theory isn't the only theory that could predict the spike. After all, if no current theory predicts the spike, those theories may very well need to be modified or even abandoned in light of the experiment. Further, challenging theories that are popular seems more important than challenging theories that no one believes yet.
I agree, you still probably run the experiment. At least it codifies something which was previously implicit or "folk-knowledge" among the scientists. G-complex theory is a (possibly wrong) precise statement of existing intuitions. I would now say that Dr. Y has made a gears-level explanation where one did not previously exist.
If all we care about is gears-level understanding, then assigning high credence to Dr. Y's explanation after the experiment succeeds is a good thing. This is largely true in science; the theory deserves the credit for the explicit prediction.
However, if all we care about is accuracy, something has gone wrong. People are likely to assign too much credence to G-complex theory. Everyone familiar with the area already knew how the experiment would turn out, without G-complex theory.
A possible remedy is to gather predictions from other scientists before performing the experiment. The other scientists might report low endorsement of G-complex theory, but still expect the spike.
If everyone expects a spike and it doesn't happen, you now know that it's really a pretty big deal.
There are at least two types of confirmation bias.
The first is selective attention: a tendency to pay attention to, or recall, that which confirms the hypothesis you are thinking about rather than that which speaks against it.
The second is selective experimentation: a tendency to do experiments which will confirm, rather than falsify, the hypothesis.
The standard advice for both cases seems to be "explicitly look for things which would falsify the hypothesis". I think this advice is helpful, but it is subtly wrong, especially for the selective-experimentation type of confirmation bias. Selective attention is relatively straightforward, but selective experimentation is much more complex than it initially sounds.
Looking for Falsification
What the standard (Popperian) advice tells you to do is try as hard as you can to falsify your hypothesis. You should think up experiments where your beloved hypothesis really could fail.
What this advice definitely does do is guard against the mistake of designing experiments which could not falsify your hypothesis. Such a test either violates conservation of expected evidence (by claiming to provide evidence one way without having any possibility of providing evidence the other way), or provides only very weak evidence for your claim (by looking much the same whether your claim is true or false). Looking for tests which could falsify your hypothesis steers you towards tests which would provide strong evidence, and helps you avoid violating conservation of expected evidence.
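To make the conservation-of-expected-evidence point concrete, here is a minimal sketch in Python, with made-up numbers: the prior must equal the probability-weighted average of the possible posteriors, so a test that could only ever move your credence upward cannot provide evidence at all.

```python
# Conservation of expected evidence: the prior equals the expected posterior.
# All numbers below are invented purely for illustration.

prior = 0.6                 # P(H): credence in the hypothesis before the test
p_pass_given_h = 0.9        # P(test passes | H)
p_pass_given_not_h = 0.3    # P(test passes | not H)

p_pass = prior * p_pass_given_h + (1 - prior) * p_pass_given_not_h

posterior_if_pass = prior * p_pass_given_h / p_pass
posterior_if_fail = prior * (1 - p_pass_given_h) / (1 - p_pass)

expected_posterior = p_pass * posterior_if_pass + (1 - p_pass) * posterior_if_fail
assert abs(expected_posterior - prior) < 1e-9  # holds whatever numbers you plug in

# A test that could only ever raise your credence (both posteriors >= prior)
# would force both posteriors to equal the prior: it provides no evidence at all.
print(posterior_if_pass, posterior_if_fail, expected_posterior)
```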
However, there are more subtle ways in which confirmation bias can act.
Predicting Results in Advance
You can propose a test which would indeed fit your hypothesis if it came out one way, and which would disconfirm your hypothesis if it came out the other way -- but where you can predict the outcome in advance. It's easy to not realize you are doing this. You'll appear to provide significant evidence for your hypothesis, but actually you've cherry-picked your evidence before even looking at it; you knew enough about the world to know where to look to see what you wanted to see.
Suppose Dr. Y studies a rare disease, Swernish syndrome. Many scientists have formed an intuition that Swernish syndrome has something to do with a chemical G-complex. Dr. Y is thinking on this one night, when the intuition crystallizes into G-complex theory, which would provide a complete explanation of how Swernish syndrome develops. G-complex theory makes the novel prediction that G-complex in the bloodstream will spike during early onset of the disease; if this were false, G-complex theory would have to be false. Dr. Y does the experiment, and finds that the spike does occur. No one has measured this before, nor has anyone else put forward a model which makes that prediction. However, it happens that anyone familiar with the details of Dr. Y's experimental results over the past decade would have strongly suspected the same spike to occur, whether or not they endorsed G-complex theory. Does the experimental result constitute significant evidence?
This is a subtle kind of double-counting of evidence. You have enough evidence to know the result of the experiment; that same evidence is also what caused you to generate the hypothesis. You cannot then claim the success of the experiment as more evidence for your hypothesis: you already knew what would happen, so observing it cannot shift your credence in the hypothesis.
If we're dealing only with personal rationality, we could invoke conservation of expected evidence again: if you already predict the outcome with high probability, you cannot simultaneously derive much evidence from it. However, in group rationality, there are plenty of cases where you want to predict an experiment in advance and then claim it as evidence. You may already be convinced, but you need to convince skeptics. So, we can't criticize someone just for being able to predict their experimental results in advance. That would be absurd. The problem is, the hypothesis isn't what did the work of predicting the outcome. Dr. Y had general world-knowledge which allowed him to select an experiment whose results would be in line with his theory.
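Here is a toy Bayesian calculation of the situation (all numbers invented for illustration): when the alternatives to G-complex theory are informed by the same background knowledge, they predict the spike too, and observing it barely moves the needle.

```python
# Toy Bayesian update. "Background knowledge" stands in for the world-knowledge
# that already predicts the spike, whether or not G-complex theory is true.

def posterior(prior_g, p_spike_given_g, p_spike_given_not_g):
    """Credence in G-complex theory after observing the spike."""
    p_spike = prior_g * p_spike_given_g + (1 - prior_g) * p_spike_given_not_g
    return prior_g * p_spike_given_g / p_spike

prior_g = 0.3

# Naive view: without G-complex theory, the spike would be surprising.
print(posterior(prior_g, 0.95, 0.10))   # ~0.80 -- looks like strong evidence

# Honest view: the background knowledge that generated the theory already
# predicts the spike even if the theory is false.
print(posterior(prior_g, 0.95, 0.90))   # ~0.31 -- almost no update
```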
To Dr. Y, it just feels like "if I am right, we will see the spike. If I am wrong, we won't see it." From the outside, we might be tempted to say that Dr. Y is not "trying hard enough to falsify G-complex theory". But how can Dr. Y use this advice to avoid the mistake? A hypothesis is an explicit model of the world, which guides your predictions. When asked to try to falsify, though, what's your guide? If you find your hypothesis very compelling, you may have difficulty imagining how it could be false. A hypothesis is solid, definite. The negation of a hypothesis includes anything else. As a result, "try to falsify your hypothesis" is very vague advice. It doesn't help that the usual practice is to test against a null hypothesis. Dr. Y tests against the spike not being there, and thinks this sufficient.
Implicit Knowledge
Part of the problem here is that it isn't clear what could and could not have been predicted in advance. There's an interaction between your general world-knowledge, which is not explicitly articulated, and your scientific knowledge, which is.
If all of your knowledge were explicit scientific knowledge, many biases would disappear. You couldn't possibly have hindsight bias; each hypothesis would predict the observation with a precise probability, which you could calculate.
Similarly, the failure mode I'm describing would become impossible. You could easily notice that it's not really your new hypothesis doing the work of telling you which experimental result to expect; you would know exactly what other world-knowledge you're using to design your experiment.
I think this is part of why it is useful to orient toward gears-level models. If our understanding of a subject is explicit rather than implicit, we can do a lot more to correct our reasoning. However, we'll always have large amounts of implicit, fuzzy knowledge coming into our reasoning process; so, we have to be able to deal with that.
Is "Sufficient Novelty" The Answer?
In some sense, the problem is that Dr. Y's experimental result isn't novel enough. It might be a "novel prediction" in the sense that it hasn't been explicitly predicted by anyone, but it is a prediction that could have been made without Dr. Y's new hypothesis. Extraordinary claims require extraordinary evidence, right? It isn't enough that a hypothesis makes a prediction which is new. The hypothesis should make a prediction which is really surprising.
But, this rule wouldn't be any good for practical science. How surprising a result feels is too subjective, and it is too easy for hindsight bias to make it feel as if the result of the experiment could have been predicted. Besides: if you want science to be able to provide compelling evidence to skeptics, you can't throw out experiments as unscientific just because most people can predict their outcome.
Method of Multiple Hypotheses
So, how could Dr. Y have avoided the mistake?
It is meaningless to confirm or falsify a hypothesis in isolation; all you can really do is provide evidence which helps distinguish between hypotheses. This will guide you away from "mundane" tests where you actually could have predicted the outcome without your hypothesis, because there will likely be many other hypotheses which would be able to predict the outcome of that test. It guides you toward corner cases, where otherwise similar hypotheses make very different predictions.
We can unpack "try to falsify" as "come up with as many plausible alternative hypotheses as you can, and look for experiments which would rule out the others." But actually, "come up with alternative hypotheses" is more than an unpacking of "try to falsify"; it shifts you to trying to distinguish between many hypotheses, rather than focusing on "your" hypothesis as central.
The actual, exactly correct criterion for an experiment is its value of information. "Try to falsify your hypothesis" is a lousy approximation of this, which judges experiments by how likely they are to provide evidence against your hypothesis, or by the likelihood ratio against your hypothesis in the case where the experiment doesn't go as your hypothesis predicts, or something. Don't optimize for the wrong metric; things'll tend to go poorly for you.
Some might object that trying-to-falsify is a good heuristic, since value of information is too difficult to compute. I'd say a much better heuristic is to pretend that identifying the right hypothesis is equally valuable no matter which one it turns out to be, and to look for experiments that maximally differentiate between the hypotheses. Come up with as many possibilities as you can, and try to differentiate between the most plausible ones.
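As a concrete sketch of that heuristic (all hypothesis names and numbers below are hypothetical), one can score candidate experiments by their expected information gain over the set of hypotheses, which treats every hypothesis as equally valuable to distinguish. The spike experiment scores poorly because every plausible hypothesis predicts the spike; a corner-case experiment that splits the hypotheses scores much better.

```python
import math

def expected_information_gain(priors, likelihoods):
    """Expected reduction in entropy over hypotheses from one binary-outcome experiment.

    priors: dict hypothesis -> prior probability
    likelihoods: dict hypothesis -> P(positive outcome | hypothesis)
    """
    def entropy(dist):
        return -sum(p * math.log2(p) for p in dist.values() if p > 0)

    gain = 0.0
    for positive in (True, False):
        # P(outcome) and the posterior over hypotheses given that outcome
        joint = {h: priors[h] * (likelihoods[h] if positive else 1 - likelihoods[h])
                 for h in priors}
        p_outcome = sum(joint.values())
        if p_outcome == 0:
            continue
        posterior = {h: joint[h] / p_outcome for h in joint}
        gain += p_outcome * (entropy(priors) - entropy(posterior))
    return gain

# Hypothetical numbers: three hypotheses, two candidate experiments.
priors = {"G-complex theory": 0.4, "alternative A": 0.4, "alternative B": 0.2}

spike_test = {"G-complex theory": 0.95, "alternative A": 0.90, "alternative B": 0.85}
corner_case_test = {"G-complex theory": 0.90, "alternative A": 0.15, "alternative B": 0.50}

print(expected_information_gain(priors, spike_test))        # tiny: everyone predicts a spike
print(expected_information_gain(priors, corner_case_test))  # much larger: it discriminates
```

Expected information gain is only a stand-in for full value of information (which would also weigh how much each distinction actually matters), but it captures the "maximally differentiate" idea.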
Given that the data was already very suggestive of a G-complex spike, Dr. Y would most likely generate other hypotheses which also involve a G-complex spike. This would make the experiment which tests for the spike uninteresting, and suggest other more illuminating experiments.
I think "coming up with alternatives" is a somewhat underrated debiasing technique. It is discussed more in Heuer's Psychology of Intelligence Analysis and Chamberlin's Method of Multiple Working Hypotheses.