From Finite Factors to Bayes Nets

J Bostock

I was hoping I could produce a one-to-one mapping between diagrams of overlapping factors (from FFS) and Bayes nets. This is not possible, but instead I have been able to produce a similar mapping between factor overlap diagrams and sets of Bayes nets, subject to certain compatibility relationships.

Bayes Nets

The Bayes nets over some variables can be thought of as a category of objects related by "bookkeeping" (and similar) relationships. To be more specific, we can say that a unique morphism exists between two Bayes nets $N_{i} \to N_{j}$ exists if and only if, for all probability distributions $P$ over the elements of $N_{i}$ , the approximation of $N_{j}$ over $P$ is no worse than the approximation of $N_{i}$ over $P$ . For two variables we need the Markov-chain-re-rooting and arrow-adding rules:

We could (with some effort) do three variables, but since there are three options for each pair of variables ( $A \to B$ , $A \leftarrow B$ , no arrow) the number of Bayes nets over $n$ variables is $3^{\frac{1}{2} n (n - 1)}$ . For $n = 3$ this is 27. Let's draw in the adding-an-arrow and Markov-chain-re-rooting relationships:

As you can see it's a total goddamn mess! Even when we exclude the forbidden Bayes nets, ignore the identity morphisms (from an object to itself), use illicit double-headed arrows on the bottom layer, and also use "implied" arrows (i.e. don't bother with $A \to C$ if we have $A \to B \to C$ ) we get a very confusing scenario!

To this problem, we will add another problem! But a problem which will lead us to a solution.

From Bayes Nets to Equivalence Classes

Say we have a mapping from a Finite Factored Set ontology to a Bayes net. This means we can apply a probability distribution over our factors to get a probability distribution over our downstream variables, with the condition that this must satisfy the Bayes net. The problem here is that many different Bayes nets will be satisfied by the probability distribution.

What we want is a one-to-one relationship (ideally). To improve this, we can start by grouping Bayes nets together under an equivalence relation. This is some relationship which has the following properties:

Reflexive: $a \sim a$
Symmetric: $a \sim b ⟺ b \sim a$
Transitive: $a \sim b, b \sim c ⟹ a \sim c$

We can define an equivalence relation over the category of Bayes nets over $X$ trivially: $X_{i} \sim X_{j} ⟺ X_{i} \to X_{j} \land X_{j} \to X_{i}$ . For simpler Bayes nets, we can use undirected lines to make it easier to see what we are doing. As an example over our 3-element Bayes nets:

The square bracket notation $[N]$ means an equivalence class and should be read as "The equivalence class containing $N$ . On the right we've replaced our directed arrow with an undirected one. This is simply a change (perhaps abuse) of notation, which helps to remind us of the equivalence class that is present. We can do this for any Markov chain in our Bayes net, but we don't have to, especially when things get complicated.

Redrawing our map of the three-element Bayes nets:

Much nicer! Only 11 objects now! But we have made a critical mistake! We've assumed that we know all of the "bookkeeping" operations. We've missed one one. I think of this one as the "arrow forgetting" operation [1], and it makes more sense on equivalence classes:

I don't know the generalization of this to more than three elements. Our corrected category of equivalence classes looks like this:

Each equivalence class defines a unique set of constraints on our probability distribution $P (A, B, C)$ .
Our morphisms define a partial ordering on the category we've created. We have an initial object which we will call the discrete Bayes net $N_{D i s}$ , and a terminal object which we will call the indiscrete Bayes net $N_{I n d}$ .

Because of 2, we can define $\leq$ on our objects, such that $N_{D i s} \geq N_{i} \geq N_{I n d}$ . Our choice of the direction of $\leq$ is such that the discrete Bayes net gives us the most information about the independence of variables. We say that $N_{i}$ is finer than $N_{j}$ if $N_{i} \geq N_{j}$ , because it encodes more information about the independence of variables.

Finite Factored Sets

FFS has a different notion of causality, which is both weaker and stronger in important ways. FFS defines the history of an event $X$ to be the minimal set of factors which uniquely specify an element of $X$ . Then an event $Y$ is thought of as "after" $X$ (written $X \leq Y$ ) if the history of $X$ is a subset of the history of $Y$ .

This makes one huge departure from Bayes nets: no information is allowed to be lost. If we have a system where $X$ directly causes $Y$ , but there is some "noise" in $X$ which does not affect the downstream value of $Y$ , then $X$ is not "before" $Y$ . This is especially relevant when we are accessing measurements of $X$ and $Y$ , and not the world directly.

Now let us consider three variables, again $X$ , $Y$ and $Z$ , and the factors making up their histories. We can use a Venn diagram to demonstrate the presence or absence of shared factors, and also write down the time ordering of $X$ , $Y$ , and $Z$ according to FFS. I will call these "factor overlap diagrams".

Note that the networks in the middle column are not Bayes nets, they instead show which of our "downstream" variables depend on which factors.

I won't draw them out, but there are $2^{7} = 128$ different possible Venn diagrams we could draw here.

Mapping Them Together

This is where things become tricky. We might want to begin by noticing that we can ignore the three "outer" parts of the Venn diagram (in the drawings below I'll colour them a paler yellow) because Bayes nets are agnostic to additional factors as long as information is conserved.

This reduces us to just sixteen cases, and using symmetry this reduces to just eight.

So which equivalence classes do these match up to? Well we run into an issue immediately because the first element here ( $X$ , $Y$ , $Z$ independent) clearly corresponds to the discrete Bayes net equivalence class, but therefore it fits just as well to any other equivalence class! We can solve this by only considering a net $N$ to fit the factor diagram if no $M$ exists such that $M$ also fits the factor diagram and $M \geq N$ .

Here we go! [2]

Ugh. Our relationship isn't one-to-one. It isn't even many-to-one. It's many-to-many. The problem is with the (lack of) specificity of equivalence classes. Just because a probability distribution satisfies one equivalence class, doesn't mean it can't also satisfy others. We can avoid this when one Bayes net is finer than the other, but because our ordering is only partial we cannot always define a total ordering.

From Equivalence Classes to Compatible Sets

We will define a set of Bayes nets over $n$ variables as compatible if there exists some probability distribution which satisfies all of the Bayes nets, and no other Bayes nets over $n$ variables. This generally means all of the Bayes nets in one or more equivalence classes, plus all of the Bayes nets in equivalence classes coarser than those. [3]

We designate all of the compatible sets of Bayes nets based on the equivalence classes they descend from. We can also define a partial ordering on the compatible sets by inclusion, i.e. $N \to M ⟺ N \geq M ⟺ N \supseteq M$ . We note that our "finer" compatible sets contain more sets, therefore they impose more constraints on the underlying probability distribution.

We can also create some new notation to describe certain compatible sets:

This notation is unlikely to be ideal, but it does allow us to create the following diagram:

Now we have fifteen objects: four symmetric triples and three singletons. This is one less than our set of factor overlap diagrams. To make the connections one-to-one, we must merge two factor overlap diagrams. The choice is already made for us: two factor diagrams both map to the set containing just the indiscrete Bayes net. (An added benefit of moving to compatible sets is that it decreases any ominous resemblance present in diagram of equivalence classes)

We can make the following equivalence rule for our three elements: if $h (X_{i}) \cap h (X_{j}) \neq {} \forall X_{j} X_{j} \in X$ then we are agnostic over whether $⋂ {h (X_{i}) \forall X_{i} \in X} = {}$ . We'll continue to use pale yellow to denote agnosticism over the presence of elements in a Venn diagram.

We can draw the diagram of possible factor overlap diagrams. We can add arrows to denote whether a given factor overlap diagram contains factors in all of the locations of another one:

Hmm, looks familiar!

Thinking about compatible sets of Bayes nets allows us to think about the algebraic rules of Bayes nets more sensibly. If we notate the set of compatible sets of Bayes nets over $n$ variables ${C S B}_{n}$ , we can think of rules like the Frankenstein rule as maps of the form ${C S B}_{n} \times {C S B}_{n} \to {C S B}_{n}$ i.e. taking two sets of Bayes nets (which are satisfied by a probability distribution) and then returning a new set of Bayes nets. In fact there will only be one general map of this form, which should contain all rules for combining Bayes nets over the same variables as a special case. The stitching rule can be thought of as a map of the form ${C S B}_{l + m} \times {C S B}_{m + n} \to {C S B}_{l + m + n}$ .

(Strictly speaking this might not be true, because we haven't made sure that our morphisms are mapped properly)

This might also relate to our ability to compute the factors of a distribution. It is computable which Bayes nets a distribution fits to, and therefore going from the relevant compatible set to a factor diagram should also be possible.

A few Illustrative Examples

What does it actually mean for certain distributions to obey multiple non-equivalent Bayes nets?

Example 3: $X$ = temp in F, $Y$ = temp in C, $Z$ = phase of water. This satisfies two different Bayes equivalence classes, corresponding to a single compatible set.

Example 2: $X$ = temp in F, $Y$ = temp in C, $Z$ = temp in K. All three depend on the same factor (actual physical temperature). This satisfies the following three equivalence classes:

Conjectures and Limitations

Firstly I conjecture that I've drawn the correct form of the category of compatible sets of Bayes nets over three variables. This requires that I've both found all the important bookkeeping rules, and that I've not missed any possible compatible sets.

Secondly I conjecture that the isomorphism between partially-agnostic Venn diagrams and compatible sets of Bayes nets holds for any $n$ . There are 185 equivalence classes of Bayes nets over 4 variables, and therefore many more compatible sets. Likewise there are $2^{n} - (n + 1)$ shared compartments in a Venn diagram of $n$ variables, meaning there are $2^{2^{n} - (n + 1)}$ possible factor overlap diagrams (minus our agnosticisms). For four diagrams this turns out to be 1617 with a first guess at what the agnosticisms should be. I don't want to try and find 1617 different sets of 185 elements. Since it's not entirely clear how we generalize our agnosticism, this problem gets even more complex.

We also have a couple of limitations on this work:

I've not proven any rules for probability distributions which almost satisfy Bayes nets. Since it's rare to get a distribution which fully satisfies a Bayes net, it's unclear where to go from here.

I'm not sure if all of this works when we consider joint probability distributions with significant loss of information. For example, if we have three measurements of temperature, each with an independent error, then $P [T_{1} | T_{2}, T_{3}] \neq P [T_{1} | T_{2}]$ , since getting both $T_{2}$ and $T_{3}$ will change our estimate of $T_{1}$ compared to $T_{2}$ . Maybe this can be solved by resampling somehow. We could also ignore this, because getting $T_{2}$ gives us far, far more information (in expectation) than getting $T_{3}$ .

Our other option is to change our concepts of agnosticism over our factor diagrams, so that if anything is shared between three variables, we are agnostic about individual connections, as any corresponding Bayes net must be fully connected. Then again, finite factored sets also don't handle loss of information well.

Proofs

1: To prove the arrow forgetting rule on three elements, consider the constraint implied by the first one:

$P [A, B, C] = P [A] P [B | A, C] P [C]$
$P [A, B, C] = P [A] P [B, C | A]$
$P [A, B, C] = P [A] P [B | A] P [C | B, A]$

But since the first diagram also gives us that $A$ and $C$ are independent:

$P [A, B, C] = P [A] P [B | A] P [C | B] . □$

Which is the Markov chain $(A \to B \to C)$ , an element of our second equivalence class. I don't know the generalization of this to more elements.

2: Proof that the given factor diagrams correspond to the given Bayes nets. Correspondences in order:

All variables are independent. $□$
As there are shared factors between all three elements, none are independent of each other, so the Bayes net must connect all three elements. As no factors are only pairwise shared, each element d-separates the other two. Therefore all Markov chains are satisfied. $□$
As there are shared factors between each pair of elements, no pair can be d-separated, so all elements must be connected pairwise. $□$
As 3. $□$
$X$ and $Y$ both share factors. Therefore they must be connected. $Z$ has no shared factors so may be disconnected from both $X$ and $Y$ . $□$
$X$ and $Y$ both share factors not shared by $Z$ so must be connected. All three also share common factors so all three must be connected. $X$ d-separates $Y$ and $Z$ , likewise $Y$ d-separates $X$ and $Z$ , so both Markov chains $X - Y - Z$ and $Y - X - Z$ are satisfied. $□$
$X$ and $Z$ share factors not shared by $Y$ , so must be connected. Likewise $Y$ and $Z$ share factors not shared by $X$ so must be separated. $X$ and $Y$ share no factors so must be independent, so Markov chains are not satisfied, leaving only one Bayes net. $□$
$X$ and $Z$ share factors so must be connected. $Y$ and $Z$ share factors so must be connected. There are factors shared between all three, so a Markov chain must be present, but $Z$ d-separates $X$ and $Y$ , so the only option is the Markov chain $X - Z - Y$ . $□$

3: Formalizing compatible sets: a compatible set $S$ of Bayes nets over $n$ variables is a set of Bayes nets which satisfies two conditions. The first is that there exists a probability distribution over the variables which satisfies every Bayes net in the set. The second is that no superset $S^{'}$ of $S$ exists which also satisfies every probability distribution that $S$ satisfies.

[-]tailcalled1y70

FFS has a different notion of causality, which is both weaker and stronger in important ways. FFS defines the history of an event to be the minimal set of factors which uniquely specify an element of $X$ . Then an event $Y$ is thought of as "after" $X$ (written $X \leq Y$ ) if the history of $X$ is a subset of the history of $Y$ .
This makes one huge departure from Bayes nets: no information is allowed to be lost. If we have a system where $X$ directly causes $Y$ , but there is some "noise" in $X$ which does not affect the downstream value of $Y$ , then $X$ is not "before" $Y$ .

Is this actually different from Bayes nets? Like let's say you have a probability distribution $X \to Y \to Z$ , but then you measure each variable with noise, $X \to^X$ , $Y \to^Y$ , $Z \to^Z$ . In that case the measurements won't satisfy $^X \to^Y \to^Z$ , because $^Z$ will be dependent on $^X$ even after conditioning on $^Y$ .

Yes there will be some special-cases where Bayes nets support it, e.g. if you've got only two variables $X \to Y$ , but these special cases rapidly fail as you have even slightly more complexity.

[-]J Bostock1y10

I was thinking about causality in terms of forced directional arrows in Bayes nets, rather than in terms of d-separation. I don't think your example as written is helpful because Bayes nets rely on the independence of variables to do causal inference: is equivalent to $X \leftarrow Y \leftarrow Z$ .

It's more important to think about cases like $X \to Y \leftarrow Z$ where causality can be inferred. If we change this to $^X,^Y,^Z$ by adding noise then we still get a distribution satisfying $^X \to^Y \leftarrow^Z$ (as $^X$ and $^Z$ are still independent).

Even if we did have other nodes forcing $X \to Y \to Z$ (such as a node $U$ which is parent to $Y$ , and another node $V$ which is parent to $Z$ ), then I still don't think adding noise lets us swap the orders round.

On the other hand, there are certainly issues in Bayes nets of more elements, particularly the "diamond-shaped" net with arrows $W \to X, W \to Y, X \to Z, Y \to Z$ . Here adding noise does prevent effective temporal inference, since, if $^X$ and $^Y$ are no longer d-separated by $^W$ , we cannot prove from correlations alone that no information goes between them through $^Z$ .

[-]Morpheus1y20

Neat!

I plugged 15 and 1617 into OEIS which I assume you have tried already, but only got “recreational math” results.

I had forgotten about OEIS! Anyway Ithink the actual number might be 1577 rather than 1617 (this also gives no answers). I was only assuming agnosticism over factors in the overlap region if all pairs $A B, A C, . . ., C D$ had factors, but I think that is missing some examples. My current guess is that any overlap region like $A B C D$ should be agnostic iff all of the overlap regions "surrounding" it in the Venn diagram ( $A B C$ , $A B D$ , $A C D$ , $B C D$ ) in this situation either have a factor present or agnostic. This gives the series 1, 2, 15, 1577, 3397521 (my computer has not spat out the next element). This also gives nothing on the OEIS.

My reasoning for this condition is that we should be able to "remove" an observable from the system without trouble. If we have an agnosticism, in the intersection $A B C D$ , then we can only remove observable $B$ if this doesn't cause trouble for the new intersection $A B D$ , which is only true if we already have an factor in $A B D$ (or are agnostic about it).

[-]Magdalena Wache1y10

Are you sure the "arrow forgetting" operation is correct?
You say there should be an arrow when

for all probability distributions over the elements of $N_{i}$ , the approximation of $N_{j}$ over $P$ is no worse than the approximation of $N_{i}$ over $P$

wherein $[N_{i}] = [A \to B \leftarrow C]$ here, and $[N_{j}] = [A - B - C]$ . If we take a distribution in which $A ⊥/ C | B$ , then $N_{j}$ is actually a worse approximation, because $A ⊥ C | B$ must hold in any distribution which can be expressed by a graph in $[N_{j}]$ right?

Also, I really appreciate the diagrams and visualizations throughout the post!

Thanks for the feedback. There's a condition which I assumed when writing this which I have realized is much stronger than I originally thought, and I think I should've devoted more time to thinking about its implications.

When I mentioned "no information being lost", what I meant is that in the interaction , each value $b \in B$ (where $B$ is the domain of $P_{B}$ ) corresponds to only one value of $a \in A$ . In terms of FFS, this means that each variable must be the maximally fine partition of the base set which is possible with that variable's set of factors.

Under these conditions, I am pretty sure that $A ⊥ C ⟹ A ⊥ C | B$

I am confused about the last correspondence between Venn Diagram and BN equivalence class (this one: ). The graph X-Z-Y implies that X⊥Y|Z. Could you explain how the Venn diagram encodes this independence/orthogonality?

LESSWRONG
LW

38