It is widely understood that statistical correlation between two variables ≠ causation. But despite this admonition, people are routinely overconfident in citing correlations to support particular causal interpretations, and are surprised by the results of randomized experiments, suggesting that they are biased & systematically underestimate the prevalence of confounds/common-causation. I speculate that in realistic causal networks or DAGs, the number of possible correlations grows faster than the number of possible causal relationships. So confounds really are that common, and since people do not think in DAGs, the imbalance also explains the overconfidence.
Full article: http://www.gwern.net/Causality
Hi, I will put my responses to your comment in the original thread here, slightly out of order.
A Bayesian network is a statistical model. A statistical model is a set of joint distributions (under some restrictions). A Bayesian network model of a DAG G with vertices X1, ..., Xk is a set of joint distributions that Markov factorize according to this DAG. This set will include distributions of the form p(x1, ..., xk) = p(x1) ... p(xk), which (trivially!) factorize with respect to any DAG, including G, but which also have additional independences between any Xi and Xj, even if G has an edge between Xi and Xj.

When we are talking about trying to learn a graph from a particular dataset, we are talking about a particular joint distribution in the set (in the model). If we happen to observe a dependence between Xi and Xj in the data, then of course the corresponding edge will be "real" -- in the particular distribution that generated the data. I am just saying that the DAG corresponds to a set rather than to any specific distribution for any particular dataset, and makes no universally quantified statements over the set about dependence, only about independence.

The same comment applies to causal models -- except we aren't talking about just an observed joint anymore. The dichotomy between a "causal structure" and a causal model (a set of causal structures) still applies: a causal model only makes universally quantified statements about independences in the "causal structures" in its set.
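To make the "set of distributions" point concrete, here is a tiny numerical sketch (my own toy example, not anything from the thread; the marginals are arbitrary): a joint in which X1 and X2 are independent still Markov factorizes according to the DAG X1 -> X2, because p(x2 | x1) simply does not vary with x1.

```python
import itertools

# Joint over two binary variables, built as a product of marginals,
# so X1 and X2 are independent by construction.
p1 = {0: 0.3, 1: 0.7}          # marginal p(x1)
p2 = {0: 0.6, 1: 0.4}          # marginal p(x2)
joint = {(a, b): p1[a] * p2[b] for a, b in itertools.product((0, 1), repeat=2)}

# Markov factorization for the DAG X1 -> X2: p(x1, x2) = p(x1) p(x2 | x1).
def marginal_x1(a):
    return sum(joint[(a, b)] for b in (0, 1))

def conditional_x2(b, a):
    return joint[(a, b)] / marginal_x1(a)

for a, b in joint:
    assert abs(joint[(a, b)] - marginal_x1(a) * conditional_x2(b, a)) < 1e-12

# The extra independence: p(x2 | x1) is identical for both values of x1,
# so this joint lies in the model of X1 -> X2 *and* of the empty graph.
assert abs(conditional_x2(1, 0) - conditional_x2(1, 1)) < 1e-12
print("independent joint factorizes according to X1 -> X2")
```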
I will try to clarify this (assuming you are ok w/ interventions). Your question is "why is correlation usually not causation?"
One way you proposed to think about it is combinatorial for all pairwise relationships -- if we look at all possible DAGs of n vertices, then you conjectured that the number of "pairwise causal relationships" is much smaller than the number of "pairwise associative relationships." I think your conjecture is basically correct, and can be reduced to counting certain types of paths in DAGs. Specifically, pairwise causal relationships just correspond to directed paths, and pairwise associative relationships (assuming we aren't conditioning on anything) correspond to marginally d-connected paths, which is a much larger set -- so there are many more of them. However, I have not worked out the exact combinatorics, in part because even counting DAGs isn't easy.
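To illustrate the imbalance on a single graph, here is a quick sketch (my own code; the diamond DAG is an arbitrary choice). It enumerates all simple paths between every ordered pair of vertices, and counts how many are directed versus marginally d-connected (no collider, in the sense defined below):

```python
import itertools

# Diamond DAG: 0 -> 1 -> 3, 0 -> 2 -> 3 (an arbitrary example).
dag = {0: [1, 2], 1: [3], 2: [3], 3: []}
vertices = list(dag)
edges = {(u, v) for u in dag for v in dag[u]}

def paths(u, v, visited):
    """All simple paths from u to v in the undirected skeleton."""
    if u == v:
        yield [u]
        return
    for w in vertices:
        if w not in visited and ((u, w) in edges or (w, u) in edges):
            for rest in paths(w, v, visited | {w}):
                yield [u] + rest

def is_directed(path):
    return all((a, b) in edges for a, b in zip(path, path[1:]))

def has_no_collider(path):
    # A collider is a vertex b with both path edges pointing into it.
    return not any((a, b) in edges and (c, b) in edges
                   for a, b, c in zip(path, path[1:], path[2:]))

directed = d_connected = 0
for u, v in itertools.permutations(vertices, 2):
    for p in paths(u, v, {u}):
        directed += is_directed(p)
        d_connected += has_no_collider(p)

print(directed, d_connected)   # prints 6 18: many more associative paths
```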
Another way to look at it, which is what Sander did in his essay, is to ask how often we can reduce causal relationships to associative relationships. What I mean is: if we are interested in a particular pairwise causal relationship, say whether X affects Y, we can study it by looking at p(y | do(x)); but as we know, in general we will not be able to say anything by looking at p(y | x), because in general p(y | do(x)) is not equal to p(y | x). But in some DAGs it is! And in other DAGs p(y | do(x)) is not equal to p(y | x), but is equal to some other function of the observed data. If we can express p(y | do(x)) as a function of observed data, this is very nice, because we don't need to run a randomized trial to obtain p(y | do(x)): we can just do an observational study. When people "adjust for confounders", what they are trying to do is express p(y | do(x)) as a function \sum_c p(y | x,c) p(c) of the observed data, for some set of covariates C.
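To see adjustment work (and naive conditioning fail), here is a small exact-computation sketch (my own construction; the DAG is C -> X, C -> Y, X -> Y, all binary, with arbitrary parameters):

```python
# DAG: C -> X, C -> Y, X -> Y, all binary; parameters are arbitrary.
p_c = {0: 0.5, 1: 0.5}
p_x1_given_c = {0: 0.2, 1: 0.8}             # p(X=1 | c)
p_y1_given_xc = {(0, 0): 0.1, (0, 1): 0.5,  # p(Y=1 | x, c)
                 (1, 0): 0.6, (1, 1): 0.9}

def joint(c, x, y):
    """Observed joint p(c, x, y) via the chain rule."""
    px = p_x1_given_c[c] if x else 1 - p_x1_given_c[c]
    py = p_y1_given_xc[(x, c)] if y else 1 - p_y1_given_xc[(x, c)]
    return p_c[c] * px * py

def p_y1_cond(x):
    """Naive conditional p(Y=1 | X=x)."""
    num = sum(joint(c, x, 1) for c in (0, 1))
    den = sum(joint(c, x, y) for c in (0, 1) for y in (0, 1))
    return num / den

def p_y1_adjusted(x):
    """Adjustment sum_c p(Y=1 | x, c) p(c); intervening cuts C -> X,
    so in this DAG this equals the true p(Y=1 | do(x))."""
    return sum(p_y1_given_xc[(x, c)] * p_c[c] for c in (0, 1))

for x in (0, 1):
    print(x, round(p_y1_cond(x), 3), round(p_y1_adjusted(x), 3))
# x=0: 0.18 vs 0.30; x=1: 0.84 vs 0.75 -- conditioning overstates the effect.
```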
So the question is, how often can we reduce p(y | do(x)) to some function of observed data (a weaker notion of "causation might be some sort of association if we massage the data enough"). It turns out, not surprisingly, that if we pick certain causal DAGs G containing X and Y (possibly with hidden variables), there will not be any function of the observed data equal to p(y | do(x)). What that means is that there exist two causal structures consistent with G which disagree on p(y | do(x)) but agree on the observed joint density. So the mapping from causal structures (which tell you what causal relationships there are) to joint distributions (which tell you what associative relationships there are) is many to one in general.
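Here is a minimal concrete instance of that many-to-one mapping (my own toy construction): two structural models, both consistent with a DAG where X -> Y and a hidden H points into both, that generate the identical observed joint p(x, y) but disagree completely on p(y | do(x)).

```python
import random
from collections import Counter

random.seed(0)

# Both models are consistent with the graph X -> Y, X <- H -> Y.
# Model 1: X causes Y outright; the hidden H is irrelevant.
def model1(do_x=None):
    x = random.randint(0, 1) if do_x is None else do_x
    y = x
    return x, y

# Model 2: hidden H drives both; X has no effect on Y at all.
def model2(do_x=None):
    h = random.randint(0, 1)
    x = h if do_x is None else do_x
    y = h
    return x, y

# Observationally identical: both joints put mass 1/2 on (0,0) and (1,1).
n = 100_000
print(Counter(model1() for _ in range(n)))
print(Counter(model2() for _ in range(n)))

# Interventionally different: p(Y=1 | do(X=1)) is 1.0 vs roughly 0.5.
print(sum(model1(do_x=1)[1] for _ in range(n)) / n)
print(sum(model2(do_x=1)[1] for _ in range(n)) / n)
```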
It will thus generally (though not always, depending on our assumptions) be the case that a causal model contains causal structures which disagree about the p(y | do(x)) of interest but agree on the joint distribution. So there is just not enough information in the joint distribution to get causality; to get around this, we need assumptions on our causal model that rule such cases out. What Sander is saying is that the assumptions we need to equate p(y | do(x)) with some function of the observed data are generally quite unrealistic in practice.
Another interesting combinatorial question here is: if we pick a pair X,Y, and then pick a DAG (potentially w/ hidden variables) at random, how likely is p(y | do(x)) to be some function of the observed joint (that is, there is "some sense" in which causation is a type of association)? Given a particular such DAG and X,Y, I have a poly-time algorithm that answers YES/NO, which may prove helpful.
I understand what you are saying, but I don't like your specific proposal because it is conflating two separate issues -- a combinatorial issue (if we had infinite data, we would still have many more associative than causal relationships) and a statistical issue (at finite samples it might be hard to detect independences). I think we can do an empirical investigation of asymptotic behavior by just path counting, and avoid statistical issues (and issues involving "unfaithful" or "nearly unfaithful" (faithful but hard to tell at finite samples) distributions).
Nerd sniping questions:
What is "\sum_{G a DAG w/ n vertices} \sum_{r a directed path in G} 1" as a function of n?
What is "\sum_{G a DAG w/ n vertices} \sum_{r a marginally d-connected path in G} 1" as a function of n?
A path is marginally d-connected if it does not contain a collider, i.e. a subpath of the form * -> * <- *.
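For what it's worth, here is a brute-force sketch of both sums for small n (my own code; it enumerates all labeled DAGs, so it is exponential and only feasible for tiny n, and it counts each d-connected path once per direction, since d-connection is symmetric under path reversal):

```python
import itertools

def is_acyclic(n, edges):
    """Kahn's algorithm: repeatedly strip vertices with no incoming edges."""
    indeg = {v: 0 for v in range(n)}
    for _, v in edges:
        indeg[v] += 1
    remaining = set(range(n))
    queue = [v for v in range(n) if indeg[v] == 0]
    while queue:
        u = queue.pop()
        remaining.discard(u)
        for a, b in edges:
            if a == u:
                indeg[b] -= 1
                if indeg[b] == 0:
                    queue.append(b)
    return not remaining

def all_dags(n):
    """All labeled DAGs on n vertices -- exponential, tiny n only."""
    pairs = [(u, v) for u in range(n) for v in range(n) if u != v]
    for bits in itertools.product((0, 1), repeat=len(pairs)):
        edges = {p for p, bit in zip(pairs, bits) if bit}
        if is_acyclic(n, edges):
            yield edges

def count_paths(n, edges):
    """Counts of (directed, marginally d-connected) simple ordered paths."""
    directed = d_conn = 0

    def walk(path, visited):
        nonlocal directed, d_conn
        if len(path) >= 2:
            if all((a, b) in edges for a, b in zip(path, path[1:])):
                directed += 1
            # Marginally d-connected = no collider a -> b <- c on the path.
            if not any((a, b) in edges and (c, b) in edges
                       for a, b, c in zip(path, path[1:], path[2:])):
                d_conn += 1
        last = path[-1]
        for w in range(n):
            if w not in visited and ((last, w) in edges or (w, last) in edges):
                walk(path + [w], visited | {w})

    for v in range(n):
        walk([v], {v})
    return directed, d_conn

for n in (2, 3, 4):
    tot_dir = tot_dconn = 0
    for edges in all_dags(n):
        d, m = count_paths(n, edges)
        tot_dir += d
        tot_dconn += m
    print(n, tot_dir, tot_dconn)   # n=2 gives 2 and 4
```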
Edit: I realized this might be confusing, so I will clarify something. I mentioned above that within a given causal model (a set of causal structures) the mapping from causal structures (elements of a "causal model" set) to joint distributions (elements of a "statistical model consistent with a causal model" set) is in general many to one. That is, if our causal model is of a DAG with edges A -> B, H -> A, and H -> B (H not observed), then there exist two causal structures in this model that disagree on p(b | do(a)), but agree on p(a,b) (the observed marginal density).
In addition, the mapping from causal models (sets) to statistical models (sets) consistent with a given causal model is also many to one. That is, the two causal models A -> B -> C and A <- B <- C both map onto the statistical model which asserts that A is independent of C given B. This issue is different from the one I was talking about. In both causal models above, we can obtain p(y | do(x)) for any X, Y in { A, B, C } as a function of observed data: for example, p(c | do(a)) = p(c | a) in A -> B -> C, and p(c | do(a)) = p(c) in A <- B <- C. So in some sense the mapping from causal structures to joint distributions is one to one in DAGs with all nodes observed. We just don't know which mapping to apply if we only look at a joint distribution, because we can't tell different causal models apart: these two distinct causal models are observationally indistinguishable given the data (both imply the same statistical model, with the same independence). To tell them apart, we need to perform experiments -- e.g., in a gene network, try to knock out A and see if C changes.
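A simulation sketch of this last point (my own construction; the "knockout" is just setting A by fiat, and the 0.9 copy probability is arbitrary): both chains imply the same statistical model, but intervening on A moves C only in the first.

```python
import random

random.seed(1)

def noisy(bit):
    # Child copies its parent with probability 0.9.
    return bit if random.random() < 0.9 else 1 - bit

def chain_forward(do_a=None):           # A -> B -> C
    a = random.randint(0, 1) if do_a is None else do_a
    b = noisy(a)
    c = noisy(b)
    return a, b, c

def chain_backward(do_a=None):          # A <- B <- C
    c = random.randint(0, 1)
    b = noisy(c)
    a = noisy(b) if do_a is None else do_a
    return a, b, c

# Observationally, both models imply the same independence
# (A independent of C given B); the knockout tells them apart.
def p_c1_do_a(model, do_a, n=200_000):
    return sum(model(do_a)[2] for _ in range(n)) / n

for model in (chain_forward, chain_backward):
    print(model.__name__,
          round(p_c1_do_a(model, 0), 2), round(p_c1_do_a(model, 1), 2))
# chain_forward:  ~0.18 vs ~0.82 -- setting A moves C
# chain_backward: ~0.50 vs ~0.50 -- setting A does nothing to C
```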