RichardKennaway comments on Causal Diagrams and Causal Models - Less Wrong

61 points · Post author: Eliezer_Yudkowsky, 12 October 2012 09:49PM


Comment author: Pfft, 05 November 2012 01:14:53AM, 0 points

I guess there are two points here.

First, authors like Pearl do not use "causality" to mean just that there is a directed edge in a Bayesian network (i.e. that certain conditional independence properties hold). Rather, Pearl uses it to mean that the model describes what happens under interventions. One can see the difference by comparing Rain -> WetGrass with WetGrass -> Rain (which are equivalent as Bayesian networks). Of course, maybe he is confused and the difference will dissolve under more careful consideration, but I think this shows one should be careful in claiming that Bayesian networks encode our best understanding of causality.

Second, do we need Bayesian networks to economically represent distributions? This is slightly subtle.

We do not need the directed arrows when representing a particular distribution. For example, suppose a distribution P(A,B,C) is represented by the Bayesian network A -> B <- C. Expanding the definition, this means that the joint distribution can be factored as

P(A=a,B=b,C=c) = P1(A=a) P2(B=b|A=a,C=c) P3(C=c)

where P1 and P3 are the marginal distributions of A and C, and P2 is the conditional distribution of B given A and C. So the data we needed to specify P were two one-column tables specifying P1 and P3, and a three-column table specifying P2(b|a,c) for all values of a, b, c. But now note that we do not gain very much by knowing that these are probability distributions. To save space it is enough to note that P factors as

P(A=a,B=b,C=c) = F1(a) F2(b,a,c) F3(c)

for some real-valued functions F1, F2, and F3. In other words, that P is represented by a Markov network A - B - C. The directions on the edges were not essential.
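To make this concrete, here is a small numeric sketch (the binary tables are hypothetical, not from the thread) checking that the Bayesian-network product and the generic factor product are literally the same computation:

```python
import itertools

vals = [0, 1]  # binary variables, a hypothetical toy example

# Bayesian network A -> B <- C: marginals for A and C, conditional for B.
P1 = {0: 0.3, 1: 0.7}              # P(A=a)
P3 = {0: 0.6, 1: 0.4}              # P(C=c)
P2 = {(b, a, c): p                 # P(B=b | A=a, C=c)
      for a, c in itertools.product(vals, vals)
      for b, p in [(0, 0.25), (1, 0.75)]}

# The joint as the Bayesian-network product P1(a) P2(b|a,c) P3(c)...
joint = {(a, b, c): P1[a] * P2[(b, a, c)] * P3[c]
         for a, b, c in itertools.product(vals, vals, vals)}

# ...is exactly a product of generic factors F1(a) F2(b,a,c) F3(c).
# The Markov-network view keeps the same three tables, but forgets the
# directions and that each table is a (conditional) distribution.
F1, F2, F3 = P1, P2, P3
for a, b, c in itertools.product(vals, vals, vals):
    assert joint[(a, b, c)] == F1[a] * F2[(b, a, c)] * F3[c]

# Being built from normalized tables, the joint sums to 1 automatically.
assert abs(sum(joint.values()) - 1.0) < 1e-12
```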

And indeed, typical algorithms for inference given a probability distribution, such as belief propagation, do not make use of the Bayesian structure. They work equally well for directed and undirected graphs.

Rather, the point of Bayesian versus Markov networks is that the classes of probability distributions that can be represented by them are different. So they are useful when we try to learn a probability distribution and want to cut down the search space by constraining the distribution with some independence relations that we know a priori.

Bayesian networks are popular because they let us write down many independence assumptions that we know hold for practical problems. However, we then have to ask how we know those particular independence relations hold. And that's because they correspond to causal relations! The reason Bayesian networks are popular with human researchers is that they correspond well with the notion of causality that humans use. We don't know that the Armchairians would also find them useful.

Comment author: RichardKennaway, 05 November 2012 09:16:21AM, 0 points

To save space it is enough to note that P factors as

P(A=a,B=b,C=c) = F1(a) F2(b,a,c) F3(c)

for some real-valued functions F1, F2, and F3. In other words, that P is represented by a Markov network A - B - C. The directions on the edges were not essential.

Can't the directions be recovered automatically from that expression, though? That is, discarding the directions from the notation of conditional probabilities doesn't actually discard them.

The reconstruction algorithm would label every function argument as "primary" or "secondary", begin with no arguments labelled, and repeatedly do this:

For every function with no primary variable and exactly one unlabelled variable, label that variable as primary and all of its occurrences as arguments to other functions as secondary.

When all arguments are labelled, make a graph of the variables with an arrow from X to Y whenever X and Y occur as arguments to the same function, with X secondary and Y primary. If the functions F1, F2, etc. originally came from a Bayesian network, won't this recover that precise network?

If the original graph was A <- B -> C, the expression would have been F1(a,b) F2(b) F3(c,b).
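The labelling procedure above can be sketched in code (a hypothetical Python sketch, not from the thread; each factor is represented simply as the tuple of its argument names):

```python
def recover_network(factors):
    """Recover directed edges from a list of factors, e.g.
    [("a",), ("b", "a", "c"), ("c",)] for F1(a) F2(b,a,c) F3(c)."""
    # Per-factor labels: variable -> "primary" or "secondary".
    labels = [{} for _ in factors]
    changed = True
    while changed:
        changed = False
        for i, args in enumerate(factors):
            if "primary" in labels[i].values():
                continue  # this factor already has its primary variable
            unlabelled = [v for v in args if v not in labels[i]]
            if len(unlabelled) == 1:
                # Label it primary here, secondary everywhere else it occurs.
                v = unlabelled[0]
                labels[i][v] = "primary"
                for j, other in enumerate(factors):
                    if j != i and v in other:
                        labels[j][v] = "secondary"
                changed = True
    # Draw X -> Y whenever X (secondary) and Y (primary) share a factor.
    edges = set()
    for i, args in enumerate(factors):
        primaries = [v for v in args if labels[i].get(v) == "primary"]
        secondaries = [v for v in args if labels[i].get(v) == "secondary"]
        for y in primaries:
            for x in secondaries:
                edges.add((x, y))
    return edges
```

On the two examples in the thread, this yields A -> B <- C from F1(a) F2(b,a,c) F3(c), and A <- B -> C from F1(a,b) F2(b) F3(c,b).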

Comment author: Pfft, 05 November 2012 05:04:22PM, 0 points

If the functions F1, F2, etc. originally came from a Bayesian network, won't this recover that precise network?

I think this is right: if you know that the factors were learned by fitting them to a Bayesian network, you can recover what that network must have been. And you can go even further: if you only have a joint distribution, you can use the techniques of the original article to see which Bayesian networks could be consistent with it.

But there is a separate question about why we are interested in Bayesian networks in the first place. SilasBarta seemed to claim that you are naturally led to them if you are interested in representing probability distributions efficiently. But for that purpose (I claim), you only need the idea of factors, not the directed graph structure. E.g. a probability distribution which fits the (equivalent) Bayesian networks A -> B -> C or A <- B <- C or A <- B -> C can be efficiently represented as F1(a,b) F2(b,c). You would not think of representing it as F1(a) F2(a,b) F3(b,c) unless you were already interested in causality.
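A quick numeric sketch of that last point (again with hypothetical binary tables): fold P(a) and P(b|a) into a single two-argument factor, and the chain is stored as two undirected tables with no leftover directed structure:

```python
import itertools

vals = [0, 1]
# Hypothetical chain A -> B -> C over binary variables.
Pa   = {0: 0.4, 1: 0.6}                                        # P(a)
Pb_a = {(0, 0): 0.9, (1, 0): 0.1, (0, 1): 0.2, (1, 1): 0.8}    # P(b|a), key (b, a)
Pc_b = {(0, 0): 0.7, (1, 0): 0.3, (0, 1): 0.5, (1, 1): 0.5}    # P(c|b), key (c, b)

# Joint via the directed factorization P(a) P(b|a) P(c|b).
joint = {(a, b, c): Pa[a] * Pb_a[(b, a)] * Pc_b[(c, b)]
         for a, b, c in itertools.product(vals, vals, vals)}

# Fold P(a) into the first factor: two two-argument tables F1(a,b), F2(b,c)
# suffice, and nothing in them singles out a direction on the edges.
F1 = {(a, b): Pa[a] * Pb_a[(b, a)] for a, b in itertools.product(vals, vals)}
F2 = {(b, c): Pc_b[(c, b)] for b, c in itertools.product(vals, vals)}

for a, b, c in itertools.product(vals, vals, vals):
    assert abs(joint[(a, b, c)] - F1[(a, b)] * F2[(b, c)]) < 1e-12
```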