Confound it! Correlation is (usually) not causation! But why not?
It is widely understood that statistical correlation between two variables ≠ causation. But despite this admonition, people are routinely overconfident in claiming correlations to support particular causal interpretations and are surprised by the results of randomized experiments, suggesting that they are biased & systematically underestimating the prevalence of confounds/common-causation. I speculate that in realistic causal networks or DAGs, the number of possible correlations grows faster than the number of possible causal relationships. So confounds really are that common, and since people do not think in DAGs, the imbalance also explains the overconfidence.
I’ve noticed I seem to be unusually willing to bite the correlation≠causation bullet, and I think it’s due to an idea I had some time ago about the nature of reality.
1.1 The Problem
One of the constant problems I face in my reading is that I want to know about causal relationships but usually have only correlational data, and as we all know, correlation≠causation. If the general public naively thinks correlation=causation, then most geeks know better and that correlation≠causation, but then some go meta and point out that correlation and causation do tend to correlate, and so correlation weakly implies causation. But how much evidence…? If I suspect that A→B, and I collect data and establish beyond doubt that A & B correlate at r=0.7, how much evidence do I have that A→B?
Now, the correlation could be an illusory correlation thrown up by all the standard statistical problems we all know about, such as too-small n, false positives from sampling error (A & B just happened to sync together due to randomness), multiple testing, p-hacking, data snooping, selection bias, publication bias, misconduct, inappropriate statistical tests, etc. I’ve read about those problems at length, and despite knowing about all that, there still seems to be a problem: I don’t think those issues explain away all the correlations which turn out to be confounds; correlation too often ≠ causation.
To measure this directly you need a clear set of correlations which are proposed to be causal, randomized experiments to establish what the true causal relationship is in each case, and both categories need to be sharply delineated in advance to avoid issues of cherry-picking and retroactively confirming a correlation. Then you’d be able to say something like ‘11 out of the 100 proposed A→B causal relationships panned out’, and start with a prior of 11% that in your case, A→B. This sort of dataset is pretty rare, although the few examples I’ve found from medicine tend to indicate that our prior should be under 10%. Not great. Why are our best guesses at causal relationships so bad?
We’d expect that the a priori odds are good: 1/3! After all, you can divvy up the possibilities as:
1. A causes B
2. B causes A
3. both A and B are caused by a C (possibly in a complex way like Berkson’s paradox or conditioning on unmentioned variables, like a phone-based survey inadvertently generating conclusions valid only for the phone-using part of the population, causing amusing pseudo-correlations)
If it’s either #1 or #2, we’re good and we’ve found a causal relationship; it’s only outcome #3 which leaves us baffled & frustrated. Even if we were guessing at random, you’d expect us to be right at least 33% of the time, if not much more often, because of all the other knowledge we can draw on, like temporal order or biological plausibility. For example, in medicine you can generally rule out some of the relationships this way: if you find a correlation between taking superdupertetrohydracyline™ and pancreas cancer remission, it seems unlikely that #2 (curing pancreas cancer causes a desire to take superdupertetrohydracyline™) holds, so the causal relationship is probably either #1 (superdupertetrohydracyline™ cures cancer) or #3 (a common cause, like ‘doctors prescribe superdupertetrohydracyline™ to patients who are getting better’).
I think a lot of people tend to put a lot of weight on any observed correlation because of this intuition that a causal relationship is normal & probable because, well, “how else could this correlation happen if there’s no causal connection between A & B?” And fair enough: there’s no grand cosmic conspiracy arranging matters to fool us by always putting in place a C factor to cause scenario #3, right? If you question people, of course they know correlation doesn’t necessarily mean causation (everyone knows that) since there’s always a chance of a lurking confound, and it would be great if you had a randomized experiment to draw on; but you think with the data you have, not the data you wish you had, and can’t let the perfect be the enemy of the better. So when someone finds a correlation between A and B, it’s no surprise that suddenly their language & attitude change and they seem to place great confidence in their favored causal relationship even if they piously acknowledge “Yes, correlation is not causation, but… [obviously hanging out with fat people can be expected to make you fat] [surely giving babies antibiotics will help them] [apparently female-named hurricanes increase death tolls] etc etc”.
So, correlations tend to not be causation because it’s almost always #3, a shared cause. This commonness is contrary to our expectations, based on a simple & unobjectionable observation that of the 3 possible relationships, 2 are causal; and so we often reason as though correlation were strong evidence for causation. This leaves us with a paradox: experimental results seem to contradict intuition. To resolve the paradox, I need to offer a clear account of why shared causes/confounds are so common, and hopefully motivate a different set of intuitions.
1.2 What a Tangled Net We Weave When First We Practice to Believe
Here’s where Bayes nets & causal networks (seen previously on LW & Michael Nielsen) come in. When networks are inferred on real-world data, they often start to look pretty gnarly: tons of nodes, tons of arrows pointing all over the place. Daphne Koller, early on in her Probabilistic Graphical Models course, shows an example from a medical setting where the network has something like 600 nodes and you can’t understand it at all. When you look at a biological causal network like this:
You start to appreciate how everything might be correlated with everything, but not cause each other.
This is not too surprising if you step back and think about it: life is complicated, we have limited resources, and everything has a lot of moving parts. (How many discrete parts does an airplane have? Or your car? Or a single cell? Or think about a chess player analyzing a position: ‘if my bishop goes there, then the other pawn can go here, which opens up a move there or here, but of course, they could also do that or try an en passant in which case I’ll be down in material but up on initiative in the center, which causes an overall shift in tempo…’) Fortunately, these networks are still simple compared to what they could be, since most nodes aren’t directly connected to each other, which tamps down on the combinatorial explosion of possible networks. (How many different causal networks are possible if you have 600 nodes to play with? The exact answer is complicated, but it’s much larger than 2^{600}, so very large!)
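To get a feel for just how large, the exact number of labeled DAGs on n nodes can be computed with Robinson’s recurrence (a standard combinatorial result; the Python sketch below is my own illustration):

```python
from functools import lru_cache
from math import comb

@lru_cache(maxsize=None)
def count_dags(n):
    """Number of labeled DAGs on n vertices (OEIS A003024), via Robinson's
    recurrence: a(n) = sum_{k=1..n} (-1)^(k+1) C(n,k) 2^(k(n-k)) a(n-k),
    an inclusion-exclusion over the k vertices with no incoming edges."""
    if n == 0:
        return 1
    return sum((-1) ** (k + 1) * comb(n, k) * 2 ** (k * (n - k)) * count_dags(n - k)
               for k in range(1, n + 1))

print([count_dags(n) for n in range(1, 6)])  # [1, 3, 25, 543, 29281]
```

Already at 5 nodes there are ~29,000 distinct DAGs, and the count grows roughly like 2^(n²/2), dwarfing 2^n almost immediately.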
One interesting thing I managed to learn from PGM (before concluding it was too hard for me and I should try it later) was that in a Bayes net even if two nodes were not in a simple direct correlation relationship A→B, you could still learn a lot about A from setting B to a value, even if the two nodes were ‘way across the network’ from each other. You could trace the influence flowing up and down the pathways to some surprisingly distant places if there weren’t any blockers.
The bigger the network, the more possible pairs of nodes to check for a pairwise correlation (e.g. if there are 10 nodes/variables and you are looking at bivariate correlations, then you have 10 choose 2 = 45 possible comparisons; with 20 nodes, 190; and with 40, 780. 40 variables is not that many for many real-world problems.) A lot of these combos will yield some sort of correlation. But does the number of causal relationships go up as fast? I don’t think so (although I can’t prove it).
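The quadratic growth of pairwise comparisons is easy to spot-check:

```python
from math import comb

# Number of distinct pairwise correlations one could test among n variables:
for n in (10, 20, 40, 100):
    print(n, comb(n, 2))  # 45, 190, 780, 4950
```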
If not, then as causal networks get bigger, the number of genuine correlations will explode but the number of genuine causal relationships will increase slower, and so the fraction of correlations which are also causal will collapse.
(Or more concretely: suppose you generated a randomly connected causal network with x nodes and y arrows, perhaps using the algorithm in Kuipers & Moffa 2012, where each arrow carries some random noise; count how many pairs of nodes are in a causal relationship; then, n times, initialize the root nodes to random values and generate a possible state of the network, storing the values for each node; count how many pairwise correlations there are between all the nodes across the n samples (using an appropriate significance test & alpha if one wants); divide the # of causal relationships by the # of correlations, and store the fraction; return to the beginning and repeat with x+1 nodes and y+1 arrows… As one graphs each value of x against its respective estimated fraction, does the fraction head toward 0 as x increases? My thesis is that it does. Or, since there must be at least as many causal relationships in a graph as there are arrows, you could simply use that as an upper bound on the fraction.)
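A minimal version of this simulation might look like the following (my own sketch, not the Kuipers & Moffa 2012 algorithm: a random upper-triangular adjacency matrix gives a DAG, each node is a noisy linear function of its parents, and we compare the count of causally related node pairs against the count of sample correlations exceeding a threshold):

```python
import numpy as np

def causal_vs_correlated(x, n=1000, density=0.2, r_thresh=0.1, seed=0):
    """Random linear-Gaussian DAG on x nodes: count node pairs in a causal
    relationship vs. pairs whose sample correlation exceeds r_thresh."""
    rng = np.random.default_rng(seed)
    # Random DAG: allow edges only from lower- to higher-numbered nodes.
    adj = np.triu(rng.random((x, x)) < density, k=1)
    # Causally related pairs = transitive closure of the arrows (Warshall).
    reach = adj.copy()
    for k in range(x):
        reach |= reach[:, [k]] & reach[[k], :]
    n_causal = int(reach.sum())
    # Each node is a linear function of its parents plus Gaussian noise.
    W = adj * rng.normal(0.0, 1.0, (x, x))
    data = np.zeros((n, x))
    for j in range(x):  # node indices are already a topological order
        data[:, j] = data @ W[:, j] + rng.normal(0.0, 1.0, n)
    corr = np.corrcoef(data, rowvar=False)
    iu = np.triu_indices(x, k=1)
    n_corr = int((np.abs(corr[iu]) > r_thresh).sum())
    return n_causal, n_corr

for x in (5, 10, 20, 40):
    print(x, causal_vs_correlated(x))
```

The thesis predicts that as x grows, the first count falls ever further behind the second; the density, threshold, and linear-Gaussian parameterization here are all arbitrary choices for illustration.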
It turns out, we weren’t supposed to be reasoning ‘there are 3 categories of possible relationships, so we start with 33%’, but rather: ‘there is only one explanation “A causes B”, only one explanation “B causes A”, but there are many explanations of the form “C_{1} causes A and B”, “C_{2} causes A and B”, “C_{3} causes A and B”…’, and the more nodes in a field’s true causal networks (psychology or biology vs physics, say), the bigger this last category will be.
The real world is the largest of causal networks, so it is unsurprising that most correlations are not causal, even after we clamp down our data collection to narrow domains. Hence, our prior for “A causes B” is not 50% (it’s either true or false), nor is it 33% (either A causes B, B causes A, or a mutual cause C), but something much smaller: the number of causal relationships divided by the number of pairwise correlations for a graph, a ratio which can be roughly estimated on a field-by-field basis by looking at existing work, or directly for a particular problem (perhaps one could derive the fraction based on the properties of the smallest inferable graph that fits large datasets in that field). And since the larger a correlation is relative to the usual correlations for a field, the more likely the two nodes are to be close in the causal network and hence the more likely to be joined causally, one could even give causality estimates based on the size of a correlation (e.g. an r=0.9 leaves less room for confounding than an r of 0.1, but how much less will depend on the causal network).
This is exactly what we see. How do you treat cancer? Thousands of treatments get tried before one works. How do you deal with poverty? Most programs are not even wrong. Or how do you fix societal woes in general? Most attempts fail miserably, and the higher-quality your studies, the worse the attempts look (leading to Rossi’s Metallic Rules). This even explains why ‘everything correlates with everything’ and Andrew Gelman’s dictum that coefficients are never zero: the reason datasets like those mentioned by Cohen or Meehl find most of their variables to have non-zero correlations (often reaching statistical significance) is because the data is being drawn from large complicated causal networks in which almost everything really is correlated with everything else.
And thus I was enlightened.
1.3 Comment
Since I know so little about causal modeling, I asked our local causal researcher Ilya Shpitser to leave a comment about whether the above was trivially wrong / already proven / well-known folklore / etc; for convenience, I’ll excerpt the core of his comment:
But does the number of causal relationships go up just as fast? I don’t think so (although at the moment I can’t prove it).
I am not sure exactly what you mean, but I can think of a formalization where this is not hard to show. We say A “structurally causes” B in a DAG G if and only if there is a directed path from A to B in G. We say A is “structurally dependent” with B in a DAG G if and only if there is a marginal d-connecting path from A to B in G.
A marginal d-connecting path between two nodes is a path with no consecutive edges of the form * -> * <- * (that is, no colliders on the path). In other words, all directed paths are marginal d-connecting but the opposite isn’t true.
The justification for this definition is that if A “structurally causes” B in a DAG G, then if we were to intervene on A, we would observe B change (but not vice versa) in “most” distributions that arise from causal structures consistent with G. Similarly, if A and B are “structurally dependent” in a DAG G, then in “most” distributions consistent with G, A and B would be marginally dependent (e.g. what you probably mean when you say ‘correlations are there’).
I qualify with “most” because we cannot simultaneously represent dependences and independences by a graph, so we have to choose. People have chosen to represent independences. That is, if in a DAG G some arrow is missing, then in any distribution (causal structure) consistent with G, there is some sort of independence (missing effect). But if the arrow is not missing we cannot say anything. Maybe there is dependence, maybe there is independence. An arrow may be present in G, and there may still be independence in a distribution consistent with G. We call such distributions “unfaithful” to G. If we pick distributions consistent with G randomly, we are unlikely to hit on unfaithful ones (the subset of all distributions consistent with G that is unfaithful to G has measure zero), but Nature does not pick randomly… so unfaithful distributions are a worry. They may arise for systematic reasons (maybe the equilibrium of a feedback process in bio?)
If you accept above definition, then clearly for a DAG with n vertices, the number of pairwise structural dependence relationships is an upper bound on the number of pairwise structural causal relationships. I am not aware of anyone having worked out the exact combinatorics here, but it’s clear there are many many more paths for structural dependence than paths for structural causality.
But what you actually want is not a DAG with n vertices, but another type of graph with n vertices. The “Universe DAG” has a lot of vertices, but what we actually observe is a very small subset of these vertices, and we marginalize over the rest. The trouble is, if you start with a distribution that is consistent with a DAG, and you marginalize over some things, you may end up with a distribution that isn’t well represented by a DAG. Or “DAG models aren’t closed under marginalization.”
That is, if our DAG is A -> B <- H -> C <- D, and we marginalize over H because we do not observe H, what we get is a distribution where no DAG can properly represent all conditional independences. We need another kind of graph.
In fact, people have come up with a mixed graph (containing -> arrows and <-> arrows) to represent margins of DAGs. Here -> means the same as in a causal DAG, but <-> means “there is some sort of common cause/confounder that we don’t want to explicitly write down.” Note: <-> is not a correlative arrow, it is still encoding something causal (the presence of a hidden common cause or causes). I am being loose here – in fact it is the absence of arrows that means things, not the presence.
I do a lot of work on these kinds of graphs, because these graphs are the sensible representation of the data we typically get – drawn from a marginal of a joint distribution consistent with a big unknown DAG.
But the combinatorics work out the same in these graphs – the number of marginal d-connecting paths is much bigger than the number of directed paths. This is probably the source of your intuition. Of course, what often happens is you do have a (weak) causal link between A and B, but a much stronger non-causal link between A and B through an unobserved common parent. So the causal link is hard to find without “tricks.”
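Ilya’s path-based definitions can be checked by brute force on three-node examples (my own illustrative sketch; here a collider is an interior path node with both adjacent edges pointing into it):

```python
def count_paths(edges, nodes):
    """Count simple paths between ordered node pairs in a DAG, classifying
    each as directed (every step follows an arrow head-to-tail) and/or
    marginal d-connecting (no collider anywhere along the path)."""
    edge_set = set(edges)
    adj = {v: set() for v in nodes}
    for a, b in edges:
        adj[a].add(b)
        adj[b].add(a)
    directed = dconn = 0

    def dfs(path):
        nonlocal directed, dconn
        if len(path) >= 2:
            # arrows[i] is True iff step i traverses an edge head-to-tail
            arrows = [(a, b) in edge_set for a, b in zip(path, path[1:])]
            if all(arrows):
                directed += 1
            # collider at node i+1: step i points in, step i+1 points back in
            if not any(arrows[i] and not arrows[i + 1] for i in range(len(arrows) - 1)):
                dconn += 1
        for nxt in adj[path[-1]] - set(path):
            dfs(path + [nxt])

    for v in nodes:
        dfs([v])
    return directed, dconn

fork     = [('C', 'A'), ('C', 'B')]  # A <- C -> B: common cause
collider = [('A', 'C'), ('B', 'C')]  # A -> C <- B
chain    = [('A', 'B'), ('B', 'C')]  # A -> B -> C
for name, g in [('fork', fork), ('collider', collider), ('chain', chain)]:
    print(name, count_paths(g, ['A', 'B', 'C']))
# fork (2, 6): A & B are d-connected yet neither causes the other
# collider (2, 4): the path A-C-B is blocked by the collider at C
# chain (3, 6)
```

Even in these tiny graphs the marginal d-connecting paths outnumber the directed ones, and the fork shows the prototypical confound: dependence without causation in either direction.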
1.4 Heuristics & Biases
Now assuming the foregoing to be right (which I’m not sure about; in particular, I’m dubious that correlations in causal nets really do increase much faster than causal relations do), what’s the psychology of this? I see a few major ways that people might be incorrectly reasoning when they overestimate the evidence given by a correlation:

1. they might be aware of the imbalance between correlations and causation, but underestimate how much more common correlation becomes compared to causation.

   This could be shown by giving causal diagrams and seeing how elicited probability changes with the size of the diagrams: if the probability is constant, then the subjects would seem to be considering the relationship in isolation and ignoring the context. It might be remediable by showing a network and jarring people out of a simplistic comparison approach.
2. they might not be reasoning in a causal-net framework at all, but starting from the naive 33% base rate you get when you treat all 3 kinds of causal relationships equally.

   This could be shown by eliciting estimates and seeing whether the estimates tend to look like base rates of 33% and modifications thereof. Sterner measures might be needed: could we draw causal nets with not just arrows showing influence but also another kind of arrow showing correlations? For example, the causal arrows could be drawn in black, inverse correlations in red, and regular correlations in green. The picture would be rather messy, but simply by comparing how few black arrows there are to how many green and red ones, it might visually make the case that correlation is much more common than causation.
3. alternately, they may really be reasoning causally and suffer from a truly deep & persistent cognitive illusion: that when people say ‘correlation’ it’s really a kind of causation, and they don’t understand the technical meaning of ‘correlation’ in the first place (which is not as unlikely as it may sound, given examples like David Hestenes’s demonstration of the persistence of Aristotelian folk-physics in physics students, as all they had learned was guessing passwords; on the test used, see e.g. Halloun & Hestenes 1985 & Hestenes et al 1992); in which case it’s not surprising that if they think they’ve been told a relationship is ‘causation’, then they’ll think the relationship is causation. Ilya remarks:
Pearl has this hypothesis that a lot of probabilistic fallacies/paradoxes/biases are due to the fact that causal and not probabilistic relationships are what our brain natively thinks about. So e.g. Simpson’s paradox is surprising because we intuitively think of a conditional distribution (where conditioning can change anything!) as a kind of “interventional distribution” (no Simpson’s-type reversal under interventions: “Understanding Simpson’s Paradox”, Pearl 2014 [see also Pearl’s comments on Nielsen’s blog]).
This hypothesis would claim that people who haven’t looked into the math just interpret statements about conditional probabilities as about “interventional probabilities” (or whatever their intuitive analogue of a causal thing is).
This might be testable by trying to identify simple examples where the two approaches diverge, similar to Hestenes’s quiz for diagnosing belief in folk-physics.
This was originally posted to an open thread but due to the favorable response I am posting an expanded version here.
Comments (34)
Hi, I will put responses to your comment in the original thread here. I will do them slightly out of order.
A Bayesian network is a statistical model. A statistical model is a set of joint distributions (under some restrictions). A Bayesian network model of a DAG G with vertices X1,...,Xk = X is a set of joint distributions that Markov factorize according to this DAG. This set will include distributions of the form p(x1,...,xk) = p(x1) * ... * p(xk) which (trivially!) factorize with respect to any DAG including G, but which also have additional independences between any Xi and Xj even if G has an edge between Xi and Xj.

When we are talking about trying to learn a graph from a particular dataset, we are talking about a particular joint distribution in the set (in the model). If we happen to observe a dependence between Xi and Xj in the data then of course the corresponding edge will be "real" – in the particular distribution that generated the data. I am just saying the DAG corresponds to a set rather than any specific distribution for any particular dataset, and makes no universally quantified statements over the set about dependence, only about independence.

Same comment applies to causal models – but we aren't talking about just an observed joint anymore. The dichotomy between a "causal structure" and a causal model (a set of causal structures) still applies. A causal model only makes universally quantified statements about independences in "causal structures" in its set.
I will try to clarify this (assuming you are ok w/ interventions). Your question is "why is correlation usually not causation?"
One way you proposed to think about it is combinatorial, for all pairwise relationships: if we look at all possible DAGs of n vertices, then you conjectured that the number of "pairwise causal relationships" is much smaller than the number of "pairwise associative relationships." I think your conjecture is basically correct, and can be reduced to counting certain types of paths in DAGs. Specifically, pairwise causal relationships just correspond to directed paths, and pairwise associative relationships (assuming we aren't conditioning on anything) correspond to marginally d-connected paths, which is a much larger set – so there are many more of them. However, I have not worked out the exact combinatorics, in part because even counting DAGs isn't easy.
Another way to look at it, which is what Sander did in his essay, is to see how often we can reduce causal relationships to associative relationships. What I mean by that is that if we are interested in a particular pairwise causal relationship, say whether X affects Y, which we can study by looking at p(y | do(x)), then as we know, in general we will not be able to say anything by looking at p(y | x). This is because in general p(y | do(x)) is not equal to p(y | x). But in some DAGs it is! And in other DAGs p(y | do(x)) is not equal to p(y | x), but is equal to some other function of observed data. If we can express p(y | do(x)) as a function of observed data this is very nice because we don't need to run a randomized trial to obtain p(y | do(x)), we can just do an observational study. When people "adjust for confounders" what they are trying to do is express p(y | do(x)) as a function \sum_c p(y | x,c) p(c) of the observed data, for some set C.
So the question is, how often can we reduce p(y | do(x)) to some function of observed data (a weaker notion of "causation might be some sort of association if we massage the data enough"). It turns out, not surprisingly, that if we pick certain causal DAGs G containing X and Y (possibly with hidden variables), there will not be any function of the observed data equal to p(y | do(x)). What that means is that there exist two causal structures consistent with G which disagree on p(y | do(x)) but agree on the observed joint density. So the mapping from causal structures (which tell you what causal relationships there are) to joint distributions (which tell you what associative relationships there are) is many to one in general.
It will thus generally (but not always, given some assumptions) be the case that a causal model will contain causal structures which disagree about the p(y | do(x)) of interest, but agree on the joint distribution. So there is just not enough information in the joint distribution to get causality. To get around this, we need assumptions on our causal model to prevent this. What Sander is saying is that the assumptions we need to equate p(y | do(x)) with some function of the observed data are generally quite unrealistic in practice.
Another interesting combinatorial question here is: if we pick a pair X,Y, and then pick a DAG (w/ hidden variables potentially) at random, how likely is p(y | do(x)) to be some function of the observed joint (that is, there is "some sense" in which causation is a type of association). Given a particular such DAG and X,Y I have a polytime algorithm that will answer YES/NO, which may prove helpful.
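The "adjust for confounders" reduction mentioned above can be made concrete with a toy numerical example (all numbers invented for illustration): a binary confounder C drives both treatment X and outcome Y, so the adjusted quantity \sum_c p(y | x,c) p(c) recovers the interventional p(y | do(x)) while the naive conditional p(y | x) is biased:

```python
# Toy confounded model (all probabilities made up for illustration):
# C ~ Bernoulli(0.5); X depends on C; Y depends on both X and C.
p_c = {0: 0.5, 1: 0.5}
p_x1_given_c = {0: 0.2, 1: 0.8}             # confounder pushes treatment
p_y1_given_xc = {(1, 0): 0.4, (1, 1): 0.8,  # outcome depends on X and C
                 (0, 0): 0.1, (0, 1): 0.5}

# Interventional: p(y=1 | do(x=1)) = sum_c p(y=1 | x=1, c) p(c)
p_do = sum(p_y1_given_xc[(1, c)] * p_c[c] for c in (0, 1))

# Observational: p(y=1 | x=1) = sum_c p(y=1 | x=1, c) p(c | x=1)
p_x1 = sum(p_x1_given_c[c] * p_c[c] for c in (0, 1))
p_c_given_x1 = {c: p_x1_given_c[c] * p_c[c] / p_x1 for c in (0, 1)}
p_obs = sum(p_y1_given_xc[(1, c)] * p_c_given_x1[c] for c in (0, 1))

print(p_do, p_obs)  # 0.6 vs 0.72: naive conditioning overstates the effect
```

Because C both raises the outcome and makes treatment more likely, the treated group is enriched for high-C patients, inflating the naive conditional; adjustment re-weights by the population distribution of C.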
I understand what you are saying, but I don't like your specific proposal because it is conflating two separate issues: a combinatorial issue (if we had infinite data, we would still have many more associative than causal relationships) and a statistical issue (at finite samples it might be hard to detect independences). I think we can do an empirical investigation of asymptotic behavior by just path counting, and avoid statistical issues (and issues involving "unfaithful" or "nearly unfaithful" (faithful but hard to tell at finite samples) distributions).
Nerd sniping question:
What is "\sum_{G a DAG w/ n vertices} \sum_{r is a directed path in G} 1" as a function of n?
What is "\sum_{G a DAG w/ n vertices} \sum_{r is a marginal d-connecting path in G} 1" as a function of n?
A path is marginal d-connecting if it does not contain * -> * <- * as a subpath.
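For tiny n these sums can be brute-forced (my own sketch): enumerate every labeled digraph on n vertices, keep the acyclic ones, and tally both kinds of paths in each:

```python
from itertools import combinations, permutations, product

def is_dag(edges, n):
    # A digraph is acyclic iff some vertex ordering is a topological order.
    return any(all(order.index(a) < order.index(b) for a, b in edges)
               for order in permutations(range(n)))

def count_paths(edges, n):
    # Count directed and marginal d-connecting simple paths (ordered pairs).
    edge_set = set(edges)
    adj = {v: set() for v in range(n)}
    for a, b in edges:
        adj[a].add(b)
        adj[b].add(a)
    directed = dconn = 0
    def dfs(path):
        nonlocal directed, dconn
        if len(path) >= 2:
            arrows = [(a, b) in edge_set for a, b in zip(path, path[1:])]
            if all(arrows):
                directed += 1
            if not any(arrows[i] and not arrows[i + 1] for i in range(len(arrows) - 1)):
                dconn += 1  # no collider anywhere on the path
        for nxt in adj[path[-1]] - set(path):
            dfs(path + [nxt])
    for v in range(n):
        dfs([v])
    return directed, dconn

n = 3
pairs = list(combinations(range(n), 2))
n_dags = total_dir = total_dconn = 0
# Each unordered pair gets one of: no edge, a->b, or b->a.
for choice in product(range(3), repeat=len(pairs)):
    edges = [(a, b) if c == 1 else (b, a)
             for (a, b), c in zip(pairs, choice) if c]
    if is_dag(edges, n):
        n_dags += 1
        d, m = count_paths(edges, n)
        total_dir += d
        total_dconn += m
print(n_dags, total_dir, total_dconn)  # 25 DAGs on 3 labeled vertices
```

Since every directed path is also marginal d-connecting, the second sum necessarily dominates the first; the brute force just shows by how much for small n (the permutation-based acyclicity test is only viable at this scale).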
Edit: I realized this might be confusing, so I will clarify something. I mentioned above that within a given causal model (a set of causal structures) the mapping from causal structures (elements of a "causal model" set) to joint distributions (elements of a "statistical model consistent with a causal model" set) is in general many to one. That is, if our causal model is of a DAG A -> B <- H -> A (H not observed), then there exist two causal structures in this model that disagree on p(b | do(a)), but agree on p(a,b) (the observed marginal density).
In addition, the mapping from causal models (sets) to statistical models (sets) consistent with a given causal model is also many to one. That is, the following two causal models, A -> B -> C and A <- B <- C, both map onto a statistical model which asserts that A is independent of C given B. This issue is different from what I was talking about. In both causal models above, we can obtain p(y | do(x)) for any Y,X from { A, B, C } as a function of observed data. For example p(c | do(a)) = p(c | a) in A -> B -> C, and p(c | do(a)) = p(c) in A <- B <- C. So in some sense the mapping from causal structures to joint distributions is one to one in DAGs with all nodes observed. We just don't know which mapping to apply if we just look at a joint distribution, because we can't tell different causal models apart. That is, these two distinct causal models are observationally indistinguishable given the data (both imply the same statistical model with the same independence). To tell these models apart we need to perform experiments, e.g. in a gene network try to knock out A, and see if C changes.
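The observational indistinguishability of the two chains can be verified mechanically (toy numbers of my own): parameterize A -> B -> C, compute its joint, refactor that same joint in the reverse direction A <- B <- C, and confirm the two factorizations produce an identical distribution:

```python
from itertools import product

# Model 1: A -> B -> C with arbitrary (made-up) binary conditional tables.
pA = {0: 0.3, 1: 0.7}
pB_A = {(0, 0): 0.9, (1, 0): 0.1, (0, 1): 0.4, (1, 1): 0.6}  # p(b | a)
pC_B = {(0, 0): 0.2, (1, 0): 0.8, (0, 1): 0.7, (1, 1): 0.3}  # p(c | b)

joint = {(a, b, c): pA[a] * pB_A[(b, a)] * pC_B[(c, b)]
         for a, b, c in product((0, 1), repeat=3)}

# Model 2: A <- B <- C, with its tables *derived from the same joint*.
pC = {c: sum(joint[(a, b, c)] for a in (0, 1) for b in (0, 1)) for c in (0, 1)}
pB_C = {(b, c): sum(joint[(a, b, c)] for a in (0, 1)) / pC[c]
        for b in (0, 1) for c in (0, 1)}
pA_B = {(a, b): sum(joint[(a, b, c)] for c in (0, 1)) /
               sum(joint[(a2, b, c)] for a2 in (0, 1) for c in (0, 1))
        for a in (0, 1) for b in (0, 1)}

joint2 = {(a, b, c): pC[c] * pB_C[(b, c)] * pA_B[(a, b)]
          for a, b, c in product((0, 1), repeat=3)}

max_diff = max(abs(joint[k] - joint2[k]) for k in joint)
print(max_diff)  # ~0: both causal models imply the same joint distribution
```

The reverse factorization p(c)·p(b|c)·p(a|b) is exact here precisely because the forward chain implies A ⊥ C | B, which is the shared statistical model; no amount of observational data can separate the two.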
Naively, I would expect it to be closer to 600^600 (the number of possible directed graphs with 600 nodes).
And in fact, it is some complicated thing that seems to scale much more like n^n than like 2^n: http://en.wikipedia.org/wiki/Directed_acyclic_graph#Combinatorial_enumeration
There's an asymptotic approximation in the OEIS: a(n) ~ n!·2^(n(n-1)/2)/(M·p^n), with M and p constants. So log(a(n)) = O(n^2), as opposed to log(2^n) = O(n), log(n!) = O(n log(n)), log(n^n) = O(n log(n)).
It appears I've accidentally nerd-sniped everyone! I was just trying to give an idea that it was really, really big. (I had done some googling for the exact answer, but the answers all seemed rather complicated, and rather than try for an exact answer and get it wrong, I just gave a lower bound.)
If we allow cycles, then there are three possibilities for an edge between a pair of vertices in a directed graph: no edge, or an arrow in either direction. Since a graph of n vertices has n choose 2 pairs, the total number of DAGs of n vertices has an upper bound of 3^(n choose 2). This is much smaller than n^n.
edit: the last sentence is wrong.
Gwern, thanks for writing more, I will have more to say later.
It is much larger: 3^(n choose 2) = 3^(n(n-1)/2) = (3^((n-1)/2))^n, and 3^((n-1)/2) is much larger than n.
3^(10 choose 2) is about 10^21.
Since the nodes of these graphs are all distinguishable, there is no need to factor out by graph isomorphism, so 3^(n choose 2) is the exact number.
The precise asymptotic is a(n) ~ n!·2^(n(n-1)/2)/(λ·ω^n), as shown on page 4 of this article. Here lambda and omega are constants between 1 and 2.
That's the number of all directed graphs, some of which certainly have cycles.
So it is. 3^(n choose 2) >> n^n stands though.
A lower bound for the number of DAGs can be found by observing that if we drop the directedness of the edges, there are 2^(n choose 2) undirected graphs on a set of n distinguishable vertices, and each of these corresponds to at least 1 DAG. Therefore there are at least that many DAGs, and 2^(n choose 2) is also much larger than n.
Yup you are right, re: what is larger.
The main way to correct for this bias toward seeing causation where there is only correlation follows from introspection: be more imaginative about how the correlation could have happened (other than by direct causation).
[The causation bias (does it have a name?) seems to express the availability bias. So, the corrective is to increase the availability of the other possibilities.]
Maybe. I tend to doubt that eliciting a lot of alternate scenarios would eliminate the bias.
We might call it 'hyperactive agent detection', borrowing a page from the etiology of religious belief: https://en.wikipedia.org/wiki/Agent_detection which, now that I think about it, might stem from the same underlying belief: that things must have clear underlying causes. In one context, it gives rise to belief in gods; in another, to interpreting statistical findings like correlation as causation.
Hmm, a very interesting idea.
Related to the human tendency to find patterns in everything, maybe?
Yes. Even more generally, it might be an over-application of Occam's razor: insisting everything be maximally simple. When A and B correlate, it's maximally simple to infer that one of them causes the other (instead of postulating a common cause C); it's maximally simple to explain inexplicable events as due to a supernatural agent (instead of postulating a universe of complex underlying processes whose full explication fills up libraries without end and is still poorly understood).
That is another aspect, I think, but I'd probably consider the underlying drive to be not the desire for simplicity but the desire for the world to make sense. To support this, let me point out another universal human tendency: the yearning for stories, narratives that impose some structure on the surrounding reality (and these maps do not seek to match the territory as well as they can) and so provide the illusion of understanding and control.
In other words, humans are driven to always have some understandable map of the world around them, any map, even if not very good and even if it's pretty bad. The lack of some map, the lack of understanding (even if false) of what's happening is wellknown to lead to severe stress and general unhappiness.
That sounds more like a poor understanding of Occam's razor. Complex ontologically basic processes is not simpler than a handful of strict mathematical rules.
Of course it's wrong. But if that's what's going on, it'll manifest as a different pattern of errors and useful interventions than a kind of availability bias: availability bias will be cured by forcing generation of scenarios, but a preference for oversimplification will cause the error even if you lay out the various scenarios on a silver platter, because the subject will still prefer the maximally simple version where A -> B rather than A <- C -> B.
Seems to me like a special case of privileging the hypothesis?
You're missing a 4th possibility: A & B are not meaningfully linked at all. This is very important when dealing with large sets of variables. Your measure of correlation will have a certain rate of false positives, and that possibility must not be discounted. If the probability of a false positive is 1/X, you should expect one false correlation for every X comparisons.
XKCD provides an excellent example: the jelly beans comic.
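To put a rough number on that, here is a quick pure-stdlib simulation (the sample size, variable count, and the |r| cutoff are my own illustrative choices, not from the thread) counting how many "significant" correlations appear among variables that are independent by construction:

```python
import random

def pearson_r(x, y):
    """Sample Pearson correlation of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / (sxx * syy) ** 0.5

random.seed(0)
n, k = 100, 20  # 100 samples of 20 mutually independent variables
data = [[random.gauss(0, 1) for _ in range(n)] for _ in range(k)]

# |r| > 0.197 approximates a two-sided p < 0.05 test at n = 100
tests = false_positives = 0
for i in range(k):
    for j in range(i + 1, k):
        tests += 1
        if abs(pearson_r(data[i], data[j])) > 0.197:
            false_positives += 1

print(tests, false_positives)  # expect roughly 5% of the 190 tests to "pass"
```

With only 20 variables there are already 190 pairs, so around ten spurious "significant" correlations are expected even though nothing is linked to anything.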
I'm pointing out that your list isn't complete, and not considering this possibility when we see a correlation is irresponsible. There are a lot of apparent correlations, and your three possibilities provide no means to reject false positives.
You are fighting the hypothetical. In the least convenient possible world where no dataset is smaller than a petabyte and no one has ever heard of sampling error, would you magically be able to spin the straw of correlation into the gold of causation? No. Why not? That's what I am discussing here.
I suggest you move that point closer to the list of 3 possibilities; I too read that list and immediately thought, "...and also coincidence."
The quote you posted above ("And we can't explain away...") is an unsupported assertion (a correct one, in my opinion), but it really doesn't do enough to direct attention away from false positive correlations. I suggest that you make it explicit in the OP that you're talking about a hypothetical in which random coincidences are excluded from the start. (Upvoted the OP, FWIW.)
(Also, if I understand it correctly, Ramsey theory suggests that coincidences are inevitable even in the absence of sampling error.)
I agree with gwern's decision to separate statistical issues from issues which arise even with infinite samples. Statistical issues are also extremely important and deserve careful study; however, we should divide and conquer complicated subjects.
I also agree  I'm recommending that he make that split clearer to the reader by addressing it up front.
I see. I really didn't expect this to be such an issue and come up in both the open thread & Main... I've tried rewriting the introduction a bit. If people still insist on getting snagged on that, I give up.
It ends with “etc.” for Pete's sake!
...no it doesn't?
A critical mistake in the lead analysis is the false assumption that where there is a causal relation between two variables, they will be correlated. This ignores that causes often cancel out. (Not perfectly, of course, but enough to make raw correlation a generally poor guide to causality.)
I think you have a fundamentally mistaken epistemology, gwern: you don't see that correlations only support causality when they are predicted by a causal theory.
If two variables are d-separated given a third, there is no partial correlation between the two, and the converse holds for almost all probability distributions consistent with the causal model. This is a theorem (Pearl 1.2.4). It's true that not all causal effects are identifiable from statistical data, but there are general rules for determining which effects in a model are identifiable (e.g., the front-door and back-door criteria).
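To make the d-separation claim concrete, here is a toy pure-stdlib simulation (model and coefficients are my own): C is a common cause of A and B, so A and B are d-separated given C, and the substantial marginal correlation vanishes once C is partialled out:

```python
import random

def pearson_r(x, y):
    """Sample Pearson correlation of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / (sxx * syy) ** 0.5

def residualize(x, z):
    """Residuals of x after least-squares regression on z."""
    n = len(x)
    mx, mz = sum(x) / n, sum(z) / n
    beta = (sum((a - mx) * (c - mz) for a, c in zip(x, z))
            / sum((c - mz) ** 2 for c in z))
    return [a - mx - beta * (c - mz) for a, c in zip(x, z)]

random.seed(1)
n = 100_000
C = [random.gauss(0, 1) for _ in range(n)]  # common cause
A = [c + random.gauss(0, 1) for c in C]     # C -> A
B = [c + random.gauss(0, 1) for c in C]     # C -> B

r_marginal = pearson_r(A, B)  # theory: 0.5 despite no A-B edge
r_partial = pearson_r(residualize(A, C), residualize(B, C))  # theory: 0
print(round(r_marginal, 3), round(r_partial, 3))
```

Partialling out C is done here by regressing each variable on C and correlating the residuals, which for linear-Gaussian models matches the partial correlation the theorem talks about.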
Therefore I don't see how something like "causes often cancel out" could be true. Do you have any mathematical evidence?
I see nothing of this "fundamentally mistaken epistemology" that you claim to see in gwern's essay.
Causes do cancel out in some structures, and Nature does not select randomly (e.g. evolution might select for cancellation for homeostasis reasons). So the argument that most models are faithful is not always convincing.
This is a real issue: a causal version of a related issue in statistics where two types of statistical dependence cancel out, producing a conditional independence in the data even though the underlying phenomena are related.
I don't think gwern has a mistaken epistemology, however, because this issue exists. The issue just makes causal (and statistical) inference harder.
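The cancellation case can be simulated directly. In this pure-stdlib sketch (coefficients are my own, chosen so the direct path A→B and the indirect path A→C→B cancel exactly), A genuinely appears in B's structural equation, yet the raw A–B correlation is essentially zero:

```python
import random

def pearson_r(x, y):
    """Sample Pearson correlation of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / (sxx * syy) ** 0.5

random.seed(2)
n = 100_000
A = [random.gauss(0, 1) for _ in range(n)]
C = [a + random.gauss(0, 1) for a in A]  # A -> C
# A -> B directly with weight -0.5, plus A -> C -> B with net weight +0.5:
# the two causal paths cancel, so corr(A, B) ~ 0 despite real edges.
B = [-0.5 * a + 0.5 * c + random.gauss(0, 1) for a, c in zip(A, C)]

print(round(pearson_r(A, B), 3))  # ~0: an "unfaithful" distribution
print(round(pearson_r(A, C), 3))  # the A -> C edge remains clearly visible
```

This is exactly the "unfaithfulness" worry: the graph has causal edges from A to B, but the distribution hides them, so correlation-based discovery would miss them.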
I agree completely.
So, um ... how do we assess the likelihood of causation, assuming we can't conduct an impromptu experiment on the spot?
The keywords are 'causal discovery' and 'structure learning.' There is a large literature.