Finite Factored Sets

This is the edited transcript of a talk introducing finite factored sets. For most readers, it will probably be the best starting point for learning about factored sets.

Video: 

 (Lightly edited) slides: https://intelligence.org/files/Factored-Set-Slides.pdf

 

1. Short Combinatorics Talk

1m. 

Scott: So I want to start with some context. For people who are not already familiar with my work:

  • My main motivation is to reduce existential risk.
  • I try to do this by trying to figure out how to align advanced artificial intelligence.
  • I try to do this by trying to become less confused about intelligence and optimization and agency and various things in that cluster.
  • My main strategy here is to develop a theory of agents that are embedded in the environment that they're optimizing. I think there are a lot of open hard problems around doing this.
  • This leads me to do a bunch of weird math and philosophy. This talk is going to be an example of some weird math and philosophy.

For people who are already familiar with my work, I just want to say that according to my personal aesthetics, the subject of this talk is about as exciting as Logical Induction, which is to say I'm really excited about it. And I'm really excited about this audience; I'm excited to give this talk right now.

 

1t. 

This talk can be split into 2 parts:

  • Part 1: a short pure-math combinatorics talk.

I suspect that if I were better, I would instead be giving a short pure-math category theory talk; but I'm trained as a combinatorialist, so I'm giving a combinatorics talk upfront.

  • Part 2: a more applied and philosophical main talk.

This talk can also be split into 4 parts differentiated by color: , and . Combining these gives us 8 parts (some of which are not contiguous):

 Part 1: Short TalkPart 2: The Main Talk
1m. Some Context2m. The Pearlian Paradigm
1t. Factoring the Talk2t. We Can Do Better
1b. Set Partitions, etc.2b. Time and Orthogonality, etc.
1e. Enumerating Factorizations2e. Game of Life, etc.

 

1b. 

All right. Here's some background math:

  • A partition of a set  is a set  of non-empty subsets of , called parts, such that for each  there exists a unique part in  that contains .
  • Basically, a partition of  is a way to view  as a disjoint union. We have parts that are disjoint from each other, and they union together to form .
  • We'll write  for the set of all partitions of S.
  • We'll say that a partition  is trivial if it has exactly one part.
  • We'll use bracket notation, , to denote the unique part in  containing . So this is like the equivalence class of a given element.
  • And we'll use the notation  to say that two elements  and  are in the same part in .

You can also think of partitions as being like variables on your set . Viewed in that way, the values of a partition  correspond to which part an element is in.

Or you can think of  as a question that you could ask about a generic element of . If I have an element of  and it's hidden from you and you want to ask a question about it, each possible question corresponds to a partition that splits up  according to the different possible answers.

 We're also going to use the lattice structure of partitions:

  • We'll say that  ( is finer than , and  is coarser than ) if  makes all of the distinctions that  makes (and possibly some more distinctions), i.e., if for all  implies . You can break your set  into parts, , and then break it into smaller parts, .
  •  (the common refinement of  and ) is the coarsest partition that is finer than both  and . This is the unique partition that makes all of the distinctions that either  or  makes, and no other distinctions. This is well-defined, which I'm not going to show here.

Hopefully this is mostly background. Now I want to show something new.

 

1b. 

A factorization of a set  is a set  of nontrivial partitions of , called factors, such that for each way of choosing one part from each factor in , there exists a unique element of  in the intersection of those parts.

So this is maybe a little bit dense. My short tagline of this is: "A factorization of  is a way to view  as a product, in the exact same way that a partition was a way to view  as a disjoint union."

If you take one definition away from this first talk, it should be the definition of factorization. I'll try to explain it from a bunch of different angles to help communicate the concept.

If  is a factorization of , then there exists a bijection between  and  given by . This bijection comes from sending an element of  to the tuple consisting only of parts containing that element. And as a consequence of this bijection, .

So we're really viewing  as a product of these individual factors, with no additional structure.

Although we won't prove this here, something else you can verify about factorizations is that all of the parts in a factor have to be of the same size.

We'll write  for the set of all factorizations of , and we'll say that a finite factored set is a pair , where  is a finite set and .

Note that the relationship between  and  is somewhat loopy. If I want to define a factored set, there are two strategies I could use. I could first introduce the , and break it into factors. Alternatively, I could first introduce the . Any time I have a finite collection of finite sets , I can take their product and thereby produce an , modulo the degenerate case where some of the sets are empty. So  can just be the product of a finite collection of arbitrary finite sets.

To my eye, this notion of factorization is extremely natural. It's basically the multiplicative analog of a set partition. And I really want to push that point, so here's another attempt to push that point:

A partition is a set  of 
 of  such that the obvious
function  the  of
the elements of    is a bijection.
A factorization is a set  of 
 of  such that the obvious
function  the  of
the elements of    is a bijection.

I can take a slightly modified version of the partition definition from before and dualize a whole bunch of the words, and get out the set factorization definition.

Hopefully you're now kind of convinced that this is an extremely natural notion.

 

Andrew Critch: Scott, in one sense, you're treating "subset" as dual to partition, which I think is valid. And then in another sense, you're treating "factorization" as dual to partition. Those are both valid, but maybe it's worth talking about the two kinds of duality.

Scott: Yeah. I think what's going on there is that there are two ways to view a partition. You can view a partition as "that which is dual to a subset," and you can also view a partition as something that is built up out of subsets. These two different views do different things when you dualize.

Ramana Kumar: I was just going to check: You said you can start with an arbitrary  and then build the  from it. It can be literally any set, and then there's always an ...

Scott: If none of them are empty, yes, you could just take a collection of sets that are kind of arbitrary elements. And you can take their product, and you can identify with each of the elements of a set the subset of the product that projects on to that element.

Ramana Kumar: Ah. So the  in that case will just be tuples.

Scott: That's right.

Brendan Fong: Scott, given a set, I find it very easy to come up with partitions. But I find it less easy to come up with factorizations. Do you have any tricks for...?

Scott: For that, I should probably just go on to the examples.

Joseph Hirsh: Can I ask one more thing before you do that? You allow factors to have one element in them?

Scott: I said "nontrivial," which means it does not have one element.

Joseph Hirsh: "Nontrivial" means "not have one element, and not have no elements"?

Scott: No, the empty set has a partition (with no parts), and I will call that nontrivial. But the empty set thing is not that critical.

 

I'm now going to move on to some examples.

 

1e. 

Exercise! What are the factorizations of the set ?

Spoiler space:

.

.

.

.

.

.

.

.

.

.

First, we're going to have a kind of trivial factorization:

We only have one factor, and that factor is the discrete partition. You can do this for any set, as long as your set has at least two elements. 

Recall that in the definition of factorization, we wanted that for each way of choosing one part from each factor, we had a unique element in the intersection of those parts. Since we only have one factor here, satisfying the definition just requires that for each way of choosing one part from the discrete partition, there exists a unique element that is in that part.

And then we want some less trivial factorizations. In order to have a factorization, we're going to need some partitions. And the product of the cardinalities of our partitions are going to have to equal the cardinality of our set , which is 4. 

The only way to express 4 as a nontrivial product is to express it as . Thus we're looking for factorizations that have 2 factors, where each factor has 2 parts.

We noted earlier that all of the parts in a factor have to be of the same size. So we're looking for 2 partitions that each break our 4-element set into 2 sets of size 2.

So if I'm going to have a factorization of  that isn't this trivial one, I'm going to have to pick 2 partitions of my 4-element set that each break the set into 2 parts of size 2. And there are 3 partitions of a 4-element sets that break it up into 2 parts of size 2. For each way of choosing a pair of these 3 partitions, I'm going to get a factorization.

So there will be 4 factorizations of a 4-element set.

In general you can ask, "How many factorizations are there of a finite set of size ?". Here's a little chart showing the answer for :

01
11
21
31
44
51
661
71
81681
95041
1015121
111
1213638241
131
148648641
151816214401
16181880899201
171
1845951781075201
191
203379365788198401
211689515283456001
2214079294028801
231
244454857103544668620801
25538583682060103680001

You'll notice that if  is prime, there will be a single factorization, which hopefully makes sense. This is the factorization that only has one factor.

A very surprising fact to me is that this sequence did not show up on OEIS, which is this database that combinatorialists use to check whether or not their sequence has been studied before, and to see connections to other sequences.

To me, this just feels like the multiplicative version of the Bell numbers. The Bell numbers count how many partitions there are of a set of size . It's sequence number 110 on OEIS out of over 300,000; and this sequence just doesn't show up at all, even when I tweak it and delete the degenerate cases and so on.

I am very confused by this fact. To me, factorizations seem like an extremely natural concept, and it seems to me like it hasn't really been studied before.

 

This is the end of my short combinatorics talk.

 

Ramana Kumar: If you're willing to do it, I'd appreciate just stepping through one of the examples of the factorizations and the definition, because this is pretty new to me.

Scott: Yeah. Let's go through the first nontrivial factorization of :

In the definition, I said a factorization should be a set of partitions such that for each way of choosing one part from each of the partitions, there will be a unique element in the intersection of those parts.

Here, I have a partition that's separating the small numbers from the large numbers: . And I also have a partition that's separating the even numbers from the odd numbers: 

And the point is that for each way of choosing either "small" or "large" and also choosing "even" or "odd", there will be a unique element of  that is the conjunction of these two choices.

In the other two nontrivial factorizations, I replace either "small and large" or "even and odd" with an "inner and outer" distinction.

David Spivak: For partitions and for many things, if I know the partitions of a set  and the partitions of a set , then I know some partitions of  (the disjoint union) or I know some partitions of . Do you know any facts like that for factorizations?

Scott: Yeah. If I have two factored sets, I can get a factored set over their product, which sort of disjoint-unions the two collections of factors. For the additive thing, you're not going to get anything like that because prime sets don't have any nontrivial factorizations.

 

All right. I think I'm going to move on to the main talk.

 

2. The Main Talk (It's About Time)

2m. 

We can't talk about time without talking about Pearlian causal inference. I want to start by saying that I think the Pearlian paradigm is great. This buys me some crackpot points, but I'll say it's the best thing to happen to our understanding of time since Einstein.

I'm not going to go into all the details of Pearl's paradigm here. My talk will not be technically dependent on it; it's here for motivation.

Given a collection of variables and a joint probability distribution over those variables, Pearl can infer causal/temporal relationships between the variables. (In this talk I'm going to use "causal" and "temporal" interchangeably, though there may be more interesting things to say here philosophically.)

Pearl can infer temporal data from statistical data, which is going against the adage that "correlation does not imply causation." It's like Pearl is taking the combinatorial structure of your correlation and using that to infer causation, which I think is just really great.

 

Ramana Kumar: I may be wrong, but I think this is false. Or I think that that's not all Pearl needs—just the joint distribution over the variables. Doesn't he also make use of intervention distributions?

Scott: In the theory that is described in chapter two of the book Causality, he's not really using other stuff. Pearl builds up this bigger theory elsewhere. But you have some strong ability, maybe assuming simplicity or whatever (but not assuming you have access to extra information), to take a collection of variables and a joint distribution over those variables, and infer causation from correlation.

Andrew Critch: Ramana, it depends a lot on the structure of the underlying causal graph. For some causal graphs, you can actually recover them uniquely with no interventions. And only assumptions with zero-measure exceptions are needed, which is really strong.

Ramana Kumar: Right, but then the information you're using is the graph.

Andrew Critch: No, you're not. Just the joint distribution.

Ramana Kumar: Oh, okay. Sorry, go ahead.

Andrew Critch: There exist causal graphs with the property that if nature is generated by that graph and you don't know it, and then you look at the joint distribution, you will infer with probability 1 that nature was generated by that graph, without having done any interventions.

Ramana Kumar: Got it. That makes sense. Thanks.

Scott: Cool.

 

I am going to (a little bit) go against this, though. I'm going to claim that Pearl is kind of cheating when making this inference. The thing I want to point out is that in the sentence "Given a collection of variables and a joint probability distribution over those variables, Pearl can infer causal/temporal relationships between the variables.", the words "Given a collection of variables" are actually hiding a lot of the work.

The emphasis is usually put on the joint probability distribution, but Pearl is not inferring temporal data from statistical data alone. He is inferring temporal data from statistical data and factorization data: how the world is broken up into these variables.

I claim that this issue is also entangled with a failure to adequately handle abstraction and determinism. To point at that a little bit, one could do something like say:

"Well, what if I take the variables that I'm given in a Pearlian problem and I just forget that structure? I can just take the product of all of these variables that I'm given, and consider the space of all partitions on that product of variables that I'm given; and each one of those partitions will be its own variable. And then I can try to do Pearlian causal inference on this big set of all the variables that I get by forgetting the structure of variables that were given to me."

And the problem is that when you do that, you have a bunch of things that are deterministic functions of each other, and you can't actually infer stuff using the Pearlian paradigm.

So in my view, this cheating is very entangled with the fact that Pearl's paradigm isn't great for handling abstraction and determinism.

 

2t. 

The main thing we'll do in this talk is we're going to introduce an alternative to Pearl that does not rely on factorization data, and that therefore works better with abstraction and determinism.

Where Pearl was given a collection of variables, we are going to just consider all partitions of a given set. Where Pearl infers a directed acyclic graph, we're going to infer a finite factored set.

In the Pearlian world, we can look at the graph and read off properties of time and orthogonality/independence. A directed path between nodes corresponds to one node being before the other, and two nodes are independent if they have no common ancestor. Similarly, in our world, we will be able to read time and orthogonality off of a finite factored set.

(Orthogonality and independence are pretty similar. I'll use the word "orthogonality" when I'm talking about a combinatorial notion, and I'll use "independence" when I'm talking about a probabilistic notion.)

In the Pearlian world, d-separation, which you can read off of the graph, corresponds to conditional independence in all probability distributions that you can put on the graph. We're going to have a fundamental theorem that will say basically the same thing: conditional orthogonality corresponds to conditional independence in all probability distributions that we can put on our factored set.

In the Pearlian world, d-separation will satisfy the compositional graphoid axioms. In our world, we're just going to satisfy the compositional semigraphoid axioms. The fifth graphoid axiom is one that I claim you shouldn't have even wanted in the first place.

Pearl does causal inference. We're going to talk about how to do temporal inference using this new paradigm, and infer some very basic temporal facts that Pearl's approach can't. (Note that Pearl can also sometimes infer temporal relations that we can't—but only, from our point of view, because Pearl is making additional factorization assumptions.)

And then we'll talk about a bunch of applications.

PearlThis Talk
A Given Collection of VariablesAll Partitions of a Given Set
Directed Acyclic GraphFinite Factored Set
Directed Path Between Nodes"Time"
No Common Ancestor"Orthogonality"
d-Separation"Conditional Orthogonality"
Compositional GraphoidCompositional Semigraphoid
d-Separation ↔ Conditional IndependenceThe Fundamental Theorem
Causal InferenceTemporal Inference
Many Many ApplicationsMany Many Applications

Excluding the motivation, table of contents, and example sections, this table also serves as an outline of the two talks. We've already talked about set partitions and finite factored sets, so now we're going to talk about time and orthogonality.

 

2b. 

I think that if you capture one definition from this second part of the talk, it should be this one. Given a finite factored set as context, we're going to define the history of a partition.

Let  be a finite factored set. And let  be partitions of .

The history of , written , is the smallest set of factors  such that for all , if  for all , then .

The history of , then, is the smallest set of factors —so, the smallest subset of —such that if I take an element of  and I hide it from you, and you want to know which part in  it is in, it suffices for me to tell you which part it is in within each of the factors in .

So the history  is a set of factors of , and knowing the values of all the factors in  is sufficient to know the value of , or to know which part in  a given element is going to be in. I'll give an example soon that will maybe make this a little more clear.

We're then going to define time from history. We'll say that  is weakly before , written , if . And we'll say that  is strictly before , written , if .

One analogy one could draw is that these histories are like the past light cones of a point in spacetime. When one point is before another point, then the backwards light cone of the earlier point is going to be a subset of the backwards light cone of the later point. This helps show why "before" can be like a subset relation.

We're also going to define orthogonality from history. We'll say that two partitions  and  are orthogonal, written , if their histories are disjoint: .

Now I'm going to go through an example.

 

2e. 

Let  be the set of all Game of Life computations starting from an  board.

Let  (i.e., cells computable from the initial  board). For , let  be the set of all computations such that the cell at row  and column  is alive at time .

(Minor footnote: I've done some small tricks here in order to deal with the fact that the Game of Life is normally played on an infinite board. We want to deal with the finite case, and we don't want to worry about boundary conditions, so we're only going to look at the cells that are uniquely determined by the initial board. This means that the board will shrink over time, but this won't matter for our example.)

 is the set of all Game of Life computations, but since the Game of Life is deterministic, the set of all computations is in bijective correspondence with the set of all initial conditions. So , the number of initial board states.

This also gives us a nice factorization on the set of all Game of Life computations. For each cell, there's a partition that separates out the Game of Life computations in which that cell is alive at time 0 from the ones where it's dead at time 0. Our factorization, then, will be a set of  binary factors, one for each question of "Was this cell alive or dead at time 0?".

Formally: For , let . Let , where .

There will also be other partitions on this set of all Game of Life computations that we can talk about. For example, you can take a cell and a time  and say, "Is this cell alive at time ?", and there will be a partition that separates out the computations where that cell is alive at time  from the computations where it's dead at time .

Here's an example of that:

The lowest grid shows a section of the initial board state.

The blue, green, and red squares on the upper boards are (cell, time) pairs. Each square corresponds to a partition of the set of all Game of Life computations, "Is that cell alive or dead at the given time ?"

The history of that partition is going to be all the cells in the initial board that go into computing whether the cell is alive or dead at time . It's everything involved in figuring out that cell's state. E.g., knowing the state of the nine light-red cells in the initial board always tells you the state of the red cell in the second board.

In this example, the partition corresponding to the red cell's state is strictly before the partition corresponding to the blue cell. The question of whether the red cell is alive or dead is before the question of whether the blue cell is alive or dead.

Meanwhile, the question of whether the red cell is alive or dead is going to be orthogonal to the question of whether the green cell is alive or dead.

And the question of whether the blue cell is alive or dead is not going to be orthogonal to the question of whether the green cell is alive or dead, because they intersect on the cyan cells.

Generalizing the point, fix , where . Then:

  • .
  •  if and only if  and .
  •  if and only if  or .

We can also see that the blue and green cells look almost orthogonal. If we condition on the values of the two cyan cells in the intersection of their histories, then the blue and green partitions become orthogonal. That's what we're going to discuss next.

 

David Spivak: A priori, that would be a gigantic computation—to be able to tell me that you understand the factorization structure of that Game of Life. So what intuition are you using to be able to make that claim, that it has the kind of factorization structure you're implying there?

Scott: So, I've defined the factorization structure.

David Spivak: You gave us a certain factorization already. So somehow you have a very good intuition about history, I guess. Maybe that's what I'm asking about.

Scott: Yeah. So, if I didn't give you the factorization, there's this obnoxious number of factorizations that you could put on the set here. And then for the history, the intuition I'm using is: "What do I need to know in order to compute this value?"

I actually went through and I made little gadgets in Game of Life to make sure I was right here, that every single cell actually could in some situations affect the cells in question. But yeah, the intuition that I'm working from is mostly about the information in the computation. It's "Can I construct a situation where if only I knew this fact, I would be able to compute what this value is? And if I can't, then it can take two different values."

David Spivak: Okay. I think deriving that intuition from the definition is something I'm missing, but I don't know if we have time to go through that.

Scott: Yeah, I think I'm not going to here.

 

2b. 

So, just to set your expectations: Every time I explain Pearlian causal inference to someone, they say that d-separation is the thing they can't remember. d-separation is a much more complicated concept than "directed paths between nodes" and "nodes without any common ancestors" in Pearl; and similarly, conditional orthogonality will be much more complicated than time and orthogonality in our paradigm. Though I do think that conditional orthogonality has a much simpler and nicer definition than d-separation.

We'll begin with the definition of conditional history. We again have a fixed finite set as our context. Let  be a finite factored set, let , and let .

The conditional history of  given , written , is the smallest set of factors  satisfying the following two conditions:

  • For all , if  for all , then .
  • For all  and , if  for all  and  for all , then .

The first condition is much like the condition we had in our definition of history, except we're going to make the assumption that we're in . So the first condition is: if all you know about an object is that it's in , and you want to know which part it's in within , it suffices for me to tell you which part it's in within each factor in the history .

Our second condition is not actually going to mention . It's going to be a relationship between  and . And it says that if you want to figure out whether an element of  is in , it's sufficient to parallelize and ask two questions:

  • "If I only look at the values of the factors in , is 'this point is in ' compatible with that information?"
  • "If I only look at the values of the factors in , is 'this point is in ' compatible with that information?"

If both of these questions return "yes", then the point has to be in .

I am not going to give an intuition about why this needs to be a part of the definition. I will say that without this second condition, conditional history would not even be well-defined, because it wouldn't be closed under intersection. And so I wouldn't be able to take the smallest set of factors in the subset ordering.

Instead of justifying this definition by explaining the intuitions behind it, I'm going to justify it by using it and appealing to its consequences.

We're going to use conditional history to define conditional orthogonality, just like we used history to define orthogonality. We say that  and  are orthogonal given , written , if the history of  given  is disjoint from the history of  given .

We say  and  are orthogonal given , written , if  for all . So what it means to be orthogonal given a partition is just to be orthogonal given each individual way that the partition might be, each individual part in that partition.

I've been working with this for a while and it feels pretty natural to me, but I don't have a good way to push the naturalness of this condition. So again, I instead want to appeal to the consequences.

 

2b. 

Conditional orthogonality satisfies the compositional semigraphoid axioms, which means finite factored sets are pretty well-behaved. Let  be a finite factored set, and let  be partitions of . Then:

  • If , then . (symmetry)
  • If , then  and . (decomposition)
  • If , then . (weak union)
  • If  and , then . (contraction)
  • If  and If , then . (composition)

The first four properties here make up the semigraphoid axioms, slightly modified because I'm working with partitions rather than sets of variables, so union is replaced with common refinement. There's another graphoid axiom which we're not going to satisfy; but I argue that we don't want to satisfy it, because it doesn't play well with determinism.

The fifth property here, composition, is maybe one of the most unintuitive, because it's not exactly satisfied by probabilistic independence.

Decomposition and composition act like converses of each other. Together, conditioning on  throughout, they say that  is orthogonal to both  and  if and only if  is orthogonal to the common refinement of  and .

 

2b. 

In addition to being well-behaved, I also want to show that conditional orthogonality is pretty powerful. The way I want to do this is by showing that conditional orthogonality exactly corresponds to conditional independence in all probability distributions you can put on your finite factored set. Thus, much like d-separation in the Pearlian picture, conditional orthogonality can be thought of as a combinatorial version of probabilistic independence.

A probability distribution on a finite factored set  is a probability distribution  on  that can be thought of as coming from a bunch of independent probability distributions on each of the factors in . So  for all .

This effectively means that your probability distribution factors the same way your set factors: the probability of any given element is the product of the probabilities of each of the individual parts that it's in within each factor.

The fundamental theorem of finite factored sets says: Let  be a finite factored set, and let  be partitions of . Then  if and only if for all probability distributions  on , and all , and , we have . I.e.,  is orthogonal to  given  if and only conditional independence is satisfied across all probability distributions.

This theorem, for me, was a little nontrivial to prove. I had to go through defining certain polynomials associated with the subsets, and then dealing with unique factorization in the space of these polynomials; I think the proof was eight pages or something.

The fundamental theorem allows us to infer orthogonality data from probabilistic data. If I have some empirical distribution, or I have some Bayesian distribution, I can use that to infer some orthogonality data. (We could also imagine orthogonality data coming from other sources.) And then we can use this orthogonality data to get temporal data.

So next, we're going to talk about how to get temporal data from orthogonality data.

 

2b. 

We're going to start with a finite set , which is our sample space.

One naive thing that you might think we would try to do is infer a factorization of . We're not going to do that because that's going to be too restrictive. We want to allow for  to maybe hide some information from us, for there to be some latent structure and such.

There may be some situations that are distinct without being distinct in . So instead, we're going to infer a factored set model of : some other set , and a factorization of , and a function from  to .

A model of  is a pair , where  is a finite factored set and . ( need not be injective or surjective.)

Then if I have a partition of , I can send this partition backwards across  and get a unique partition of . If , then  is given by .

Then what we're going to do is take a bunch of orthogonality facts about , and we're going to try to find a model which captures the orthogonality facts.

We will take as given an orthogonality database on , which is a pair , where  (for "orthogonal") and  (for "not orthogonal") are each sets of triples  of partitions of . We'll think of these as rules about orthogonality.

What it means for a model  to satisfy a database  is:

  •  whenever , and
  •  whenever .

So we have these orthogonality rules we want to satisfy, and we want to consider the space of all models that are consistent with these rules. And even though there will always be infinitely many models that are consistent with my database, if at least one is—you can always just add more information that you then delete with ⁠—we would like to be able to sometimes infer that for all models that satisfy our database,  is before .

And this is what we're going to mean by inferring time. If all of our models  that are consistent with the database  satisfy some claim about time , we'll say that .

 

2e. 

So we've set up this nice combinatorial notion of temporal inference. The obvious next questions are:

  • Can we actually infer interesting facts using this method, or is it vacuous?
  • And: How does this framework compare to Pearlian temporal inference?

Pearlian temporal inference is really quite powerful; given enough data, it can infer temporal sequence in a wide variety of situations. How powerful is the finite factored sets approach by comparison?

To address that question, we'll go to an example. Let  and  be two binary variables. Pearl asks: "Are  and  independent?" If yes, then there's no path between the two. If no, then there may be a path from  to , or from  to , or from a third variable to both  and .

In either case, we're not going to infer any temporal relationships.

To me, it feels like this is where the adage "correlation does not imply causation" comes from. Pearl really needs more variables in order to be able to infer temporal relationships from more rich combinatorial structures.

However, I claim that this Pearlian ontology in which you're handed this collection of variables has blinded us to the obvious next question, which is: is  independent of ?

In the Pearlian world,  and  were our variables, and  is just some random operation on those variables. In our world,  instead is a variable on the same footing as  and . The first thing I do with my variables  and  is that I take the product  and then I forget the labels  and .

So there's this question, "Is  independent of ?". And if  is independent of , we're actually going to be able to conclude that  is before !

So not only is the finite factored set paradigm non-vacuous, and not only is it going to be able to keep up with Pearl and infer things Pearl can't, but it's going to be able to infer a temporal relationship from only two variables.

So let's go through the proof of that.

 

2e. 

Let , and let , and  be the partitions (/questions):

  • . (What is the first bit?)
  • . (What is the second bit?)
  • . (Do the bits match?)

Let , where  and . If we'd gotten this orthogonality database from a probability distribution, then we would have more than just two rules, since we would observe more orthogonality and non-orthogonality than that. But temporal inference is monotonic with respect to adding more rules, so we can just work with the smallest set of rules we'll need for the proof.

The first rule says that  is orthogonal to . The second rule says that  is not orthogonal to itself, which is basically just saying that  is non-deterministic; it's saying that both of the parts in  are possible, that both are supported under the function . The  indicates that we aren't making any conditions.

From this, we'll be able to prove that .

 

Proof. First, we'll show that that  is weakly before . Let  satisfy . Let  be shorthand for , and likewise let  and .

Since , we have that ; and since , we have that .

Since —that is, since  can be computed from  together with . (Because a partition's history is the smallest set of factors needed to compute that partition.)

And since , this implies , so  is weakly before .

To show the strict inequality, we'll assume for the purpose of contradiction that  = .

Notice that  can be computed from  together with —that is, —and therefore  (i.e., ). It follows that . But since  is also disjoint from , this means that , a contradiction.

Thus , so , so , so 

 

When I'm doing temporal inference using finite factored sets, I largely have proofs that look like this. We collect some facts about emptiness or non-emptiness of various Boolean combinations of histories of variables, and we use these to conclude more facts about histories of variables being subsets of each other.

I have a more complicated example that uses conditional orthogonality, not just orthogonality; I'm not going to go over it here.

One interesting point I want to make here is that we're doing temporal inference—we're inferring that  is before —but I claim that we're also doing conceptual inference.

Imagine that I had a bit, and it's either a 0 or a 1, and it's either blue or green. And these two facts are primitive and independently generated. And I also have this other concept that's like, "Is it grue or bleen?", which is the  of blue/green and 0/1.

There's a sense in which we're inferring  is before , and in that case, we can infer that blueness is before grueness. And that's pointing at the fact that blueness is more primitive, and grueness is a derived property.

In our proof,  and  can be thought of as these primitive properties, and  is a derived property that we're getting from them. So we're not just inferring time; we're inferring facts about what are good, natural concepts. And I think that there's some hope that this ontology can do for the statement "you can't really distinguish between blue and grue" what Pearl can do to the statement "correlation does not imply causation".

 

2b. 

The future work I'm most excited by with finite factored sets falls into three rough categories: inference (which involves more computational questions), infinity (more mathematical), and embedded agency (more philosophical).

Research topics related to inference:

  • Decidability of Temporal Inference
  • Efficient Temporal Inference
  • Conceptual Inference
  • Temporal Inference from Raw Data and Fewer Ontological Assumptions
  • Temporal Inference with Deterministic Relationships
  • Time without Orthogonality
  • Conditioned Factored Sets

There are a lot of research directions suggested by questions like "How do we do efficient inference in this paradigm?". Some of the questions here come from the fact that we're making fewer assumptions than Pearl, and are in some sense more coming from the raw data.

Then I have the applications that are about extending factored sets to the infinite case:

  • Extending Definitions to the Infinite Case
  • The Fundamental Theorem of Finite-Dimensional Factored Sets
  • Continuous Time
  • New Lens on Physics

Everything I've presented in this talk was under the assumption of finiteness. In some cases this wasn't necessary—but in a lot of cases it actually was, and I didn't draw attention to this.

I suspect that the fundamental theorem can be extended to finite-dimensional factored sets (i.e., factored sets where  is finite), but it can not be extended to arbitrary-dimension factored sets.

And then, what I'm really excited about is applications to embedded agency:

  • Embedded Observations
  • Counterfactability
  • Cartesian Frames Successor
  • Unraveling Causal Loops
  • Conditional Time
  • Logical Causality from Logical Induction
  • Orthogonality as Simplifying Assumptions for Decisions
  • Conditional Orthogonality as Abstraction Desideratum

 

I focused on the temporal inference aspect of finite factored sets in this talk, because it's concrete and tangible to be able to say, "Ah, we can do Pearlian temporal inference, only we can sometimes infer more structure and we rely on fewer assumptions."

But really, a lot of the applications I'm excited about involve using factored sets to model situations, rather than inferring factored sets from data.

Anywhere that we currently model a situation using graphs with directed edges that represent information flow or causality, we might instead be able to use factored sets to model the situation; and this might allow our models to play more nicely with abstraction.

I want to build up the factored set ontology as an alternative to graphs when modeling agents interacting with things, or when modeling information flow. And I'm really excited about that direction.

New Comment


97 comments, sorted by Click to highlight new comments since:
Some comments are truncated due to high volume. (⌘F to expand all)Change truncation settings

Here is how I'm currently thinking about this framework and especially inference, in case it's helpful for other folks who have similar priors to mine (or in case something is still wrong).

A description of traditional causal models:

  • A causal graph with N nodes can be viewed as a model with 2N variables, one for each node of the graph and a corresponding noise variable for each. Each real variable is a deterministic function of its corresponding noise variable + its parents in the causal graph.
  • When we talk about causal inference, we often consider probabilities in "generic position": we ask what kind of graphs would give rise to the observed conditional independence relations (and no others) regardless of the probability distributions and functions.

In the factored sets framework:

  • We still posit a bunch of independent noise variables, and our observations are still a deterministic functions of these noise variables.
  • But we no longer make any structural assumption about an underlying graph---the deterministic functions can be arbitrary.
  • It's easy to prove that for any deterministic function there is a unique "smallest" set of variables on which it potentially depends. We call this the his
... (read more)

I think the definition of history is the most natural way to recover something like causal structure in these models.

I'm not sure how much it's about causality. Imagine there's a bunch of envelopes with numbers inside, and one of the following happens:

  1. Alice peeks at three envelopes. Bob peeks at ten, which include Alice's three.

  2. Alice peeks at three envelopes and tells the results to Bob, who then peeks at seven more.

  3. Bob peeks at ten envelopes, then tells Alice the contents of three of them.

Under the FFS definition, Alice's knowledge in each case is "strictly before" Bob's. So it seems less of a causal relationship and more like "depends on fewer basic facts".

Agree it's not totally right to call this a causal relationship.

That said:

  • The contents of 3 envelopes does seems causally upstream of the contents of 10 envelopes
  • If Alice's perception is imperfect (in any possible world), then "what Alice perceived" is not identical to "the contents of 3 envelopes" and so is not strictly before "what Bob perceived" (unless there is some other relationship between them).
  • If Alice's perception is perfect in every possible world, then there is no possible way to intervene on Alice's perception without intervening on the contents of the 3 envelopes. So it seems like a lot rests on whether you are restricting your attention to possible worlds.
  • Even if Alice's perception is perfect (or if Bob is guaranteed to tell Alice the contents of the 3 envelopes) we can still imagine an intervention on Alice's perception, and in your stories it seems like that's what makes it feel like Alice's perception isn't upstream of Bob's perception. But it feels to me like this imagination ought to track subjective possibility, even if in fact it is probably logically necessary that Alice perceives correctly / Bob reports correctly / whatever.

So I do feel like there's a case t... (read more)

I think I (at least locally) endorse this view, and I think it is also a good pointer to what  seems to me to be the largest crux between the my theory of time and Pearl's theory of time.

8cousin_it
I feel that interpreting "strictly before" as causality is making me more confused. For example, here's a scenario with a randomly changed message. Bob peeks at ten regular envelopes and a special envelope that gives him a random boolean. Then Bob tells Alice the contents of either the first three envelopes or the second three, depending on the boolean. Now Alice's knowledge depends on six out of ten regular envelopes and the special one, so it's still "strictly before" Bob's knowledge. And since Alice's knowledge can be computed from Bob's knowledge but not vice versa, in FFS terms that means the "cause" can be (and in fact is) computed from the "effect", but not vice versa. My causal intuition is just blinking at all this. Here's another scenario. Alice gets three regular envelopes and accurately reports their contents to Bob, and a special envelope that she keeps to herself. Then Bob peeks at seven more envelopes. Now Alice's knowledge isn't "before" Bob's, but if later Alice predictably forgets the contents of her special envelope, her knowledge becomes "before" Bob's. Even though the special envelope had no effect on the information Alice gave to Bob, didn't affect the causal arrow in any possible world. And if we insist that FFS=causality, then by forgetting the envelope, Alice travels back in time to become the cause of Bob's knowledge in the past. That's pretty exotic.
4Scott Garrabrant
I partially agree, which is partially why I am saying time rather than causality. I still feel like there is an ontological disagreement in that it feels like you are objecting to saying the physical thing that is Alice's knowledge is (not) before the physical thing that is Bob's knowledge. In my ontology: 1) the information content of Alice's knowledge is before the information content of Bob's knowledge. (I am curios if this part is controversial.) and then, 2) there is in some sense no more to say about the physical thing that is e.g. Alice's knowledge beyond the information content. So, I am not just saying Alice is before Bob, I am also saying e.g. Alice is before Alice+Bob, and I can't disentangle these statements because Alice+Bob=Bob. I am not sure what to say about the second example. I am somewhat rejecting the dynamics. "Alice travels back in time" is another way of saying that the high level FFS time disagrees with the standard physical time, which is true. The "high level" here is pointing to the fact that we are only looking at the part of Alice's brain that is about the envelopes, and thus talking about coarser variables than e.g. Alice's entire brain state in physical time. And if we are in the ontology where we are only looking at the information content, taking a high level version of a variable is the kind of thing that can change its temporal properties, since you get an entirely new variable. I suspect most of the disagreement is in the sort of "variable nonrealism" of reducing the physical thing that is Alice's knowledge to its information content?
9cousin_it
Not sure we disagree, maybe I'm just confused. In the post you show that if X is orthogonal to X XOR Y, then X is before Y, so you can "infer a temporal relationship" that Pearl can't. I'm trying to understand the meaning of the thing you're inferring - "X is before Y". In my example above, Bob tells Alice a lossy function of his knowledge, and Alice ends up with knowledge that is "before" Bob's. So in this case the "before" relationship doesn't agree with time, causality, or what can be computed from what. But then what conclusions can a scientist make from an inferred "before" relationship?
8Scott Garrabrant
I don't have a great answer, which isn't a great sign. I think the scientist can infer things like. "algorithms reasoning about the situation are more likely to know X but not Y than they are to know Y but not X, because reasonable processes for learning Y tend to learn learn enough information to determine X, but then forget some of that information." But why should I think of that as time? I think the scientist can infer things like "If I were able to factor the world into variables, and draw a DAG (without determinism) that is consistent with the distribution with no spurious independencies (including in deterministic functions of the variables), and X and Y happen to be variables in that DAG, then there will be a path from X to Y." The scientist can infer that if Z is orthogonal to Y, then Z is also orthogonal to X, where this is important because Z is orthogonal to Y can be thought of as saying that Z is useless for learning about Y. (and importantly a version of useless for learning that is closed under common refinement, so if you collect a bunch of different Z orthogonal to Y, you can safely combine them, and the combination will be orthogonal to Y.) This doesn't seem to get at why we want to call it before. Hmm. Maybe I should just list a bunch of reasons why it feels like time to me (in no particular order): 1. It seems like it gets a very reasonable answer in the Game of Life example 2. Prior to this theory, I thought that it made sense to think of time as a closure property on orthogonality, and this definition of time is exactly that closure property on orthogonality, where X is weakly before Y if whenever Z is orthogonal to Y, Z is also orthogonal to X. (where the definition of orthogonality is justified with the fundamental theorem.) 3. If Y is a refinement of X, then Y cannot be strictly before X. (I notice that I don't have a thing to say about why this feels like time to me, and indeed it feels like it is in direct opposition to your "does
4cousin_it
Thanks for the response! Part of my confusion went away, but some still remains. In the game of life example, couldn't there be another factorization where a later step is "before" an earlier one? (Because the game is non-reversible and later steps contain less and less information.) And if we replace it with a reversible game, don't we run into the problem that the final state is just as good a factorization as the initial?
4Scott Garrabrant
Yep, there is an obnoxious number of factorizations of a large game of life computation, and they all give different definitions of "before."
2cousin_it
I think your argument about entropy might have the same problem. Since classical physics is reversible, if we build something like a heat engine in your model, all randomness will be already contained in the initial state. Total "entropy" will stay constant, instead of growing as it's supposed to, and the final state will be just as good a factorization as the initial. Usually in physics you get time (and I suspect also causality) by pointing to a low probability macrostate and saying "this is the start", but your model doesn't talk about macrostates yet, so I'm not sure how much it can capture time or causality. That said, I like really like how your model talks only about information, without postulating any magical arrows. Maybe it has a natural way to recover macrostates, and from them, time?
2Scott Garrabrant
Wait, I misunderstood, I was just thinking about the game of life combinatorially, and I think you were thinking about temporal inference from statistics. The reversible cellular automaton story is a lot nicer than you'd think. if you take a general reversible cellular automaton (critters for concreteness), and have a distribution over computations in general position in which initial conditions cells are independent, the cells may not be independent at future time steps. If all of the initial probabilities are 1/2, you will stay in the uniform distribution, but if the probabilities are in general position, things will change, and time 0 will be special because of the independence between cells. There will be other events at later times that will be independent, but those later time events will just represent "what was the state at time 0." For a concrete example consider the reversible cellular automaton that just has 2 cells, X and Y, and each time step it keeps X constant and replaces Y with X xor Y. 
2cousin_it
Wait, can you describe the temporal inference in more detail? Maybe that's where I'm confused. I'm imagining something like this: 1. Check which variables look uncorrelated 2. Assume they are orthogonal 3. From that orthogonality database, prove "before" relationships Which runs into the problem that if you let a thermodynamical system run for a long time, it becomes a "soup" where nothing is obviously correlated to anything else. Basically the final state would say "hey, I contain a whole lot of orthogonal variables!" and that would stop you from proving any reasonable "before" relationships. What am I missing?
2Scott Garrabrant
I think that you are pointing out that you might get a bunch of false positives in your step 1 after you let a thermodynamical system run for a long time, but they are are only approximate false positives.  
2[comment deleted]
2Scott Garrabrant
I think my model has macro states. In game of life, if you take the entire grid at time t, that will have full history regardless of t. It is only when you look at the macro states (individual cells) that my time increases with game of life time.
2Scott Garrabrant
As for entropy, here is a cute observation (with unclear connection to my framework): whenever you take two independent coin flips (with probabilities not 0,1, or 1/2), their xor will always be high entropy than either of the individual coin flips. 
6Scott Garrabrant
Thanks Paul, this seems really helpful. As for the name I feel like "FFS" is a good name for the analog of "DAG", which also doesn't communicate that much of the intuition, but maybe doesn't make as much sense for name of the framework.

I was originally using the name Time Cube, but my internal PR center wouldn't let me follow through with that :)

6gwillen
That sounds like the right choice, but a part of me is incredibly disappointed that you didn't go for it.
9paulfchristiano
I think FFS makes sense as an analog of DAG, and it seems reasonable to think of the normal model as DAG time and this model as FFS time. I think the name made me a bit confused by calling attention to one particular diff between this model and Pearl (factored sets vs variables), whereas I actually feel like that diff was basically a red herring and it would have been fastest to understand if the presentation had gone in the opposite direction by demphasizing that diff (e.g. by presenting the framework with variables instead of factors).  That said, even the DAG/FFS analogy still feels a bit off to me (with the caveat that I may still not have a clear picture / don't have great aesthetic intuitions about the domain).  Factorization seems analogous to describing a world as a set of variables (and to the extent it's not analogous it seems like an aesthetic difference about whether to take the world or variables as fundamental, rather than a substantive difference in the formalism) rather than to the DAG that relates the variables. The structural changes seem more like (i) replacing a DAG with a bipartite graph, (ii) allowing arrows to be deterministic (I don't know how typically this is done in causal models). And then those structural changes lead to generalizing the usual concepts about causality so that they remain meaningful in this setting. All that said, I'm terrible at both naming things and presenting ideas, and so don't want to make much of a bid for changes in either department.

Makes sense. I think a bit of my naming and presentation was biased by being so surprised by the not on OEIS fact.

I think I disagree about the bipartite graph thing. I think it only feels more natural when comparing to Pearl. The talk frames everything in comparison to Pearl, but I think if you are not looking at Pearl, I think graphs don’t feel like the right representation here. Comparing to Pearl is obviously super important, and maybe the first introduction should just be about the path from Pearl to FFS, but once we are working within the FFS ontology, graphs feel not useful. One crux might be about how I am excited for directions that are not temporal inference from statistical data.

My guess is that if I were putting a lot of work into a very long introduction for e.g. the structure learning community, I might start the way you are emphasizing, but then eventually convert to throwing all the graphs away.

(The paper draft I have basically only ever mentions Pearl/graphs for motivation at the beginning and in the applications section.)

4paulfchristiano
I agree that bipartite graphs are only a natural way of thinking about it if you are starting from Pearl. I'm not sure anything in the framework is really properly analogous to the DAG in a causal model.
5Koen.Holtman
My thoughts on naming this finite factored sets: I agree with Paul's observation that | Factorization seems analogous to describing a world as a set of variables By calling this 'finite factored sets', you are emphasizing the process of coming up with individual random variables, the variables that end up being the (names of the) nodes in a causal graph. With s∈S representing the entire observable 4D history of a world (like a computation starting from a single game of life board state), a factorisation B={b1,b2,⋯bn} splits such s into a tuple of separate, more basic observables (bb1,bb2,⋯,bbn). where bb1∈b1, etc. In the normal narrative that explains Pearl causal graphs, this splitting of the world into smaller observables is not emphasized. Also, the splitting does not necessarily need to be a bijection. It may loose descriptive information with respect to s. So I see the naming finite factored sets as a way to draw attention to this splitting step, it draws attention to the fact that if you split things differently, you may end up with very different causal graphs. This leaves open the question of course is if really want to name your framework in a way that draws attention to this part of the process. Definitely you spend a lot of time on creating an equivalent to the arrows between the nodes too.

Planned summary for the Alignment Newsletter. I'd be particularly keen for someone to check that I didn't screw up in the section "Orthogonality in Finite Factored Sets":

This newsletter is a combined summary + opinion for the [Finite Factored Sets sequence](https://www.alignmentforum.org/s/kxs3eeEti9ouwWFzr) by Scott Garrabrant. I (Rohin) have taken a lot more liberty than I usually do with the interpretation of the results; Scott may or may not agree with these interpretations.

## Motivation

One view on the importance of deep learning is that it allows you to automatically _learn_ the features that are relevant for some task of interest. Instead of having to handcraft features using domain knowledge, we simply point a neural net at an appropriate dataset, and it figures out the right features. Arguably this is the _majority_ of what makes up intelligent cognition; in humans it seems very analogous to [System 1](https://en.wikipedia.org/wiki/Thinking,_Fast_and_Slow), which we use for most decisions and actions. We are also able to infer causal relations between the resulting features.

Unfortunately, [existing models](https://en.wikipedia.org/wiki/The_Book_of_Why) of causal inference d

... (read more)

Are there any interesting pure combinatorics problems about finite factored sets that you're interested in?

8Scott Garrabrant
1. Given a finite set Ω of cardinality n, find a computable upper bound on the largest finite factored set model that is combinatorially different from all smaller finite factored set models. (We say that two FFS models are combinatorially different if they say the same thing about the emptiness of all boolean combinations of histories and conditional histories of partitions of Ω.) (Such an upper bound must exist because there are only finitely many combinatorially distinct FFS models, but a computable upper bound, would tell us that temporal inference is computable.) 2. Prove the fundamental theorem for finite dimensional factored sets. (Seems likely combinatorial-ish, but I am not sure.) 3. Figure out how to write a program to do efficient temporal inference on small examples. (I suspect this requires a bunch of combinatorial insights. By default this is very intractable, but we might be able to use math to make it easier.) 4. Axiomatize complete consistent orthogonality databases (consistent means consistent with some model, complete means has an opinion on every possible conditional orthogonality) (To start, is it the case that compositional semigraphoid axioms already work?) If by "pure" you mean "not related to history/orthogonality/time," then no, the structure is simple, and I don't have much to ask about it.
6alkjash
Right, the structure is quite simple. The only thing that came to mind about finite factored sets as combinatorial objects was studying the L-function of the number of them, which surely has some nice Euler product. Maybe you can write it as a product of standard zeta functions or something? 

I've thought a good amount about Finite Factored Sets in the past year or two, but I do sure keep going back to thinking about the world primarily in the form of Pearlian causal influence diagrams, and I am not really sure why. 

I do think this one line by Scott at the top gave me at least one pointer towards what was happening: 

but I'm trained as a combinatorialist, so I'm giving a combinatorics talk upfront.

In the space of mathematical affinities, combinatorics is among the branches of math I feel most averse to, and I think that explains a good... (read more)

3Scott Garrabrant
This underrated post is pretty good at explaining how to translate between FFSs and DAGs.

Let's try category theory.

A partition is a family of sets called parts. A partition morphism X->X' has a function from each part of X to some part of X'. It witnesses that X is finer than X'¹.

The underlying set of a partition is its disjoint union. Call the discrete partition of S DS. The functions S->⊔X correspond to the partition morphisms DS->X. Call the trivial partition of S TS. The functions ⊔X->S correspond to the partition morphisms X->TS. In terser notation, we have D ⊣ ⊔ ⊣ T.

A factorization is a family of partitions called fac... (read more)

4Gurkenglas
Let 1 be the category with one object • and one morphism. Let Δx be the constant functor to x. A set is a family of • called elements. A set morphism S->S' has a 1-morphism between each element of S and some element of S'. The 1-morphisms •->Δ•S correspond to the set morphisms Δ∅•->S. The 1-morphisms Δ•S->• correspond to the set morphisms S->Δ1•. We have Δ∅ ⊣ Δ• ⊣ Δ1. Let 0 the empty category. • is the empty family. A 1-morphism has nothing to prove. There's no forgetful functor 1->0 so the buck stops here.
2Gurkenglas
Call the index set of X IX. Call the partition into empty parts indexed by S 0S. We have 0 ⊣ I ⊣ D ⊣ ⊔ ⊣ T. None of the our three adjunction strings can be extended further. Let's apply the construction that gave us histories at the other 5 ends. Niceness is implicit. - The right construction of TS->X is the terminal S->S' with a TS'->X: The image of ⊔(TS->X). - The left construction of X->0S is the initial S'->S with a X->0S': The image of I(X->0S). - The left construction of B->FX is the initial X'->X with a B->FX': The image of ∨(B->FX). - The right construction of Δ1•->S is the terminal •->• with a Δ1•->S: The image of Δ•(Δ1•->S). - The left construction of S->Δ∅• is absurd, but can still be written as the image of Δ•(S->Δ∅•). - The history of ∨B->X is the terminal B->B' with a ∨B'->X: Breaks the pattern! F(∨B->X) does not have the information to determine the history. In fact, ⊔T, I0, ∨F, Δ•Δ1 and Δ•Δ∅ are all identity, only F∨ isn't.

I was thinking of some terminology that might make it easier to thinking about factoring and histories and whatnot.

A partition can be thought of as a (multiple-choice) question. Like for a set of words, you could have the partition corresponding to the question "Which letter does the word start with?" and then the partition groups together elements with the same answer.

Then a factoring is set of questions, where the set of answers will uniquely identify an element. The word that comes to mind for me is "signature", where an element's signature is the set o... (read more)

4Scott Garrabrant
Yep, this all seems like a good way of thinking about it.

I have read most of the article, but not yet carefully combed the math or watched the video. 

The OEIS gap seems suggestive of "being on the track of something novel (and thus maybe novel and good)".

Reading this, some things clicked for me as "possibly related and worth looking at" that I hadn't really noticed before. 

Specifically, I was reminded of how "TF * IDF" is this old pragmatic "it works in practice" mathematical kludge for information retrieval that just... gets the job done better "with" than "without" nearly all of the time? People have... (read more)

Let , where  and 

[...] The second rule says that  is orthogonal to itself

Should that be "is not orthogonal to itself"? I thought the  meant non-orthogonal, so would think  means that  is not orthogonal to itself.

(The transcript accurately reflects what was said in the talk, but I'm asking whether Scott misspoke.)

6Scott Garrabrant
Yeah, you are right. I will change it. Thanks.

And if X is independent of X XOR Y, we’re actually going to be able to conclude that X is before Y!

It's interesting to translate that to the language of probabilities. For example, your condition holds for any X,Y (possibly dependent) such that P(X)=P(Y)=1/2, but it doesn't make sense to say that X is before Y in every such pair. For a real world example, take X = "person has above median height" and Y = "person has above median age".

9Scott Garrabrant
So you should probably not work with probabilities equal to 1/2 in this framework, unless you are doing so for a specific reason. Just like in Pearlian causality, we are mostly talking about probabilities in general position. I have some ideas about how to deal with probability 1/2 (Have a FFS, together with a group of symmetry constraints, which could swap factors, or swap parts within a factor), but that is outside of the scope of what I am doing here. To give more detail, the uniform distribution on four elements does not satisfy the compositional semigraphoid axioms, since if we take X, Y, Z to be the three partitions into two parts of size two, X is independent with Y and X is independent with Z, but X is not independent with the common refinement of Y and Z. Thus, if we take the orthogonality database generated by this probability distribution, you will find that it is not satisfied by any models.
4Scott Garrabrant
The swapping within a factor allows for considering rational probabilities to be in general position, and the swapping factors allows IID samples to be considered in general position. I think this is an awesome research direction to go in, but it will make the story more complicated, since will not be able to depend on the fundamental theorem, since we are allowing for a new source of independence that is not orthogonality. (I want to keep calling the independence that comes from disjoint factors orthogonality, and not use "orthogonality" to describe the new independences that come from the principle of indifference.)
6cousin_it
Yeah, that's what I thought, the method works as long as certain "conspiracies" among probabilities don't happen. (1/2 is not the only problem case, it's easy to find others, but you're right that they have measure zero.) But there's still something I don't understand. In the general position, if X is before Y, it's not always true that X is independent of X XOR Y. For example, if X = "person has a car on Monday" and Y = "person has a car on Tuesday", and it's more likely that a car-less person gets a car than the other way round, the independence doesn't hold. It requires a conspiracy too. What's the intuitive difference between "ok" and "not ok" conspiracies?
4Scott Garrabrant
I don't understand what conspiracy is required here. X being orthogonal to X XOR Y implies X is before Y, we don't get the converse.

Well, imagine we have three boolean random variables. In "general position" there are no independence relations between them, so we can't say much. Constrain them so two of the variables are independent, that's a bit less "general", and we still can't say much. Constrain some more so the xor of all three variables is always 1, that's even less "general", now we can use your method to figure out that the third variable is downstream of the first two. Constrain some more so that some of the probabilities are 1/2, and the method stops working. What I'd like to understand is the intuition, which real world cases have the particular "general position" where the method works.

4Scott Garrabrant
Ok, makes sense. I think you are just pointing out that when I am saying "general position," that is relative to a given structure, like FFS or DAG or symmetric FFS. If you have a probability distribution, it might be well modeled by a DAG, or a weaker condition is that it is well modeled by a FFS, or an even weaker condition is that it is well modeled by a SFFS (symmetric finite factored set).  We have a version of the fundamental theorem for DAGs and d-seperation, we have a version of the fundamental theorem for FFS and conditional orthogonality, and we might get a version of the fundamental theorem for SFFS and whatever corresponds to conditional independence in that world. However, I claim that even if we can extend to a fundamental theorem for SFFS, I still want to think of the independences in a SFFS as having different sources. There are the independences coming from orthogonality, and there are there the independences coming from symmetry (or symmetry together with orthogonality. In this world, orthogonality won't be as inferable because it will only be a subset of independence, but it will still be an important concept. This is similar to what I think will happen when we go to the infinite dimensional factored sets case.
6cousin_it
Can you give some more examples to motivate your method? Like the smoking/tar/cancer example for Pearl's causality, or Newcomb's problem and counterfactual mugging for UDT.

Hmm, first I want to point out that the talk here sort of has natural boundaries around inference, but I also want to work in a larger frame that uses FFS for stuff other than inference.

If I focus on the inference question, one of the natural questions that I answer is where I talk about grue/bleen in the talk. 

I think for inference, it makes the most sense to think about FFS relative to Pearl. We have this problem with looking at smoking/tar/cancer, which is what if we carved into variables the wrong way. What if instead of tar/cancer, we had a variable for "How much bad stuff is in your body?" and "What is the ratio of tar to cancer?" The point is that our choice of variables both matters for the Pearlian framework, and is not empirically observable. I am trying to do all the good stuff in Pearl without the dependence on the variables

Indeed, I think the largest crux between FFS and Pearl is something about variable realism. To FFS, there is no realism to a variable beyond its information content, so it doesn't make sense to have two variables X, X' with the same information, but different temporal properties. Pearl's ontology, on the other hand, has these graphs with variabl... (read more)

2Scott Garrabrant
Here is a more concrete example of me using FFS the way I intend them to be used outside of the inference problem. (This is one specific application, but maybe it shows how I intend the concepts to be manipulated). I can give an example of embedded observation maybe, but it will have to come after a more formal definition of observation (This is observation of a variable, rather than the observation of an event above): Definition: Given a FFS F=(S,B), and A, W, X, which are partitions of S, where X={x1,…,xn}, we say A observes X relative to W if: 1) A⊥X, 2) A can be expressed in the form A=A0∨S⋯∨SAn, and  3) Ai⊥W|(S∖xi). (This should all be interpreted combinatorially, not probabilistically.) The intuition of what is going on here is that to observe an event, you are being promised that you 1) do not change whether the event holds, and 3) do not change anything that matters in the case where that event does not hold. Then, to observe a variable, you can basically 2) split yourself up into different fragments of your policy, where each policy fragment observes a different value of that variable. (This whole thing is very updateless.) Example 1: (non-observation) An agent A={L,R} does not observe a coinflip X={H,T}, and chooses to raise either his left or right hand. Our FFS F=(S,B) is given by S=A×X,  and B={A,X}. (I am abusing notation here slightly by conflating A with the partition you get on A×X by projecting onto the A coordinate.) Then W is the discrete partition on A×X. In this example, we do not have observation. Proof: A only has two parts, so if we express A as a common refinement of 2 partitions, at least one of these two partitions must be equal to A. However, A is not orthogonal to W given H and A is not orthogonal to W given T. (hF(A|H)=hF(W|H)=hF(A|T)=hF(W|T)={A}). Thus we must violate condition 3. Example 2: (observation) An agent A={LL,LR,RL,RR} does observe a coinflip X={H,T}, and chooses to raise either his left or right hand. We can thin
4[comment deleted]
3acgt
I’m confused what necessary work the Factorisation is doing in these temporal examples - in your example A and B are independent and C is related to both - the only assignment of “upstream/downstream” relations that makes sense is that C is downstream of both. Is the idea that factorisation is what carves your massive set of possible worlds up into these variables in the first place? Feel like I’m in a weird position where the math makes sense but I’m missing the motivational intuition for why we want to switch to this framework in the first place
4Scott Garrabrant
I are note sure what you are asking (indeed I am not sure if you are responding to me or cousin_it.) One thing that I think is going on is that I use "factorization" in two places. Once when I say Pearl is using factorization data, and once where I say we are inferring a FFS. I think this is a coincidence. "Factorization" is just a really general and useful concept. So the carving into A and B and C is a factorization of the world into variables, but it is not the kind of factorization that shows up in the FFS, because disjoint factors should be independent in the FFS. As for why to switch to this framework, the main reason (to me) is that it has many of the advantages of Pearl with also being able to talk about some variables being coarse abstract versions of other variables. This is largely because I am interested in embedded agency applications.  Another reason is that we can't tell a compelling story about where the variables came from in the Pearlian story. Another reason is that sometimes we can infer time where Pearl cannot. 
3acgt
What would such a distribution look like? The version where X XOR Y is independent of both X and Y makes sense but I’m struggling to envisage a case where it’s independent of only 1 variable.
4Scott Garrabrant
It looks like X and V are independent binary variables with different probabilities in general position, and Y is defined to be X XOR V. (and thus V=X XOR Y).
1Eigil Rischel
Thanks (to both of you), this was confusing for me as well.

Cool topic!
Here are some critiques on part 1 of your presentation: Short Combinatorial Talk

1 Colour coding slides by content is a nifty idea. I hadn't seen this before. Unfortunately, even with as little as five headings it is difficult to recall the correspondence between colours and content. Why not try something else. Maybe a designation in the upper right hand corner of each slide?

2. That looks like an interesting diagram on slide 4. Why didn't you explain it?

3. You tend to introduce succinct definitions first, motivating examples later. This worked al... (read more)

4Liron
Agree with #3, presenting definitions with examples first. Congrats on this research, feels like you’re onto something huge!

Curated. This is a fascinating framework that (to the best of my understanding) makes substantive improvements on the Pearlian paradigm. It's also really exciting that you found a new simple sequence. 

Re: the writeup, it's explained very clearly, the Q&A interspersed is a very nice touch. I like that the talk factorizes.

I really appreciate the research exploration you do around ideas of agency and am very happy to celebrate the writeups like this when you produce them.

How did you count the number of factorizations of sets of size 0-25?

6Scott Garrabrant
If you look at the draft edits for this sequence that is still awaiting approval on OEIS, you'll find some formulas. 

This is really cool!

The example of inferring from the independence of and reminds me of some techniques discussed in Elements of Causal Inference. They discuss a few different techniques for 2-variable causal inference.

One of them, which seems to be essentially analogous to this example, is that if are real-valued variables, then if the regression error (i.e for some constant ) is independent of , it's highly likely that is downstream of . It sounds like factored sets (or some extension to capture continuous-valued variables) migh... (read more)

Possible examples:  After staring at the definition of a set factorization for a minute, it clicked for me when I thought about Quarto.

Quarto is a simple board game played with 16 pieces (and a 4x4 grid) where each piece is (short or tall) and (light or dark) and (round or square) and (solid or hollow).  There's exactly one piece with each combination of attributes; for example, there's exactly one tall dark round hollow piece.

Thus, the full set of 16 pieces can be factored into {{short, tall}, {light, dark}, {round, square}, {solid, hollow}}. &n... (read more)

1jimv
What about Dobble / Spot-It? They are cards designed so each pair of cards has exactly one shared symbol between them.
1Dweomite
What elements of that game are you suggesting would correspond to a set factorization?  I'm not seeing one.

You said that you thought that this could be done in a categorical way. I attempted something which appears to describe the same thing when applied to the category FinSet , but I'm not sure it's the sort of thing you meant by when you suggested that the combinatorial part could potentially be done in a categorical way instead, and I'm not sure that it is fully categorical.

Let S be an object.
For i from 1 to k, let  be an object, (which is not anything isomorphic to the product of itself with itself, or at least is not the terminal object) .
Let&n... (read more)

6Scott Garrabrant
I have not thought much about applying to things other than finite sets. (I looked at infinite sets enough to know there is nontrivial work to be done.) I do think it is good that you are thinking about it, but I don't have any promises that it will work out. What I meant when I think that this can be done in a categorical way is that I think I can define a nice symmetric monodical category of finite factored sets such that things like orthogonality can be given nice categorical definitions. (I see why this was a confusing thing to say.)

I'm confused by the definition of conditional history, because it doesn't seem to be a generalisation of history. I would expect , but both of the conditions in the definition of  are vacuously true if . This is independent of what  is, so . Am I missing something?

3Scott Garrabrant
E is the event you are conditioning on, so the thing you should expect is that hF(X|S)=hF(X), which does indeed hold.
3FjolleJagt
Thanks, that makes sense! Could you say a little about why the weak union axiom holds? I've been struggling to prove that from your definitions. I was hoping that hF(X|z,w)⊆hF(X|z) would hold, but I don't think that hF(X|z) satisfies the second condition in the definition of conditional history for hF(X|z,w).
4Scott Garrabrant
When I prove it, I prove and use (a slight notational variation on) these two lemmas. 1. If hF(X|E)∩hF(Y|E)={}, then hF(X|E)=hF(X|(y∩E)) for all y∈Y. 2. hF((X∨SY)|E)=hF(X|E)∪⋃x∈XhF(Y|x∩E). (These are also the two lemmas that I have said elsewhere in the comments look suspiciously like entropy.) These are not trivial to prove, but they might help.

I don't understand the motivation behind the fundamental theorem, and I'm wondering if you could say a bit more about it. In particular, it suggests that if I want to propose a family of probability distributions that "represent" observations somehow ("represents" maybe means in the sense of Bayesian predictions or in the sense of frequentist limits), I want to also consider this family to arise from some underlying mutually independent family along with some function. I'm not sure why I would want to propose an underlying family and a function at all, and... (read more)

Hmm, I am not sure what to say about the fundamental theorem, because I am not really understanding the confusion. Is there something less motivating about the fundamental theorem, than the analogous theorem about d-seperation being equivalent to conditional independence in all distributions comparable with the DAG?

Maybe this helps? (probably not): I am mostly imagining interacting with only a single distributions in the class, and the claim about independence in all probability distributions comparable with the structure can be replaced with instead independence in a general position probability distribution comparable with the structure.

I am not thinking of it as related to a maximum entropy argument.

The point about SEMs having more structure that I am ignoring is correct. I think that the largest philosophical difference between my framework and Pearlian one is that I am denying realism of anything beyond the "apparently identical." Another way to think about it is that I am denying realism about there being anything about the variables beyond their information. All of my definitions are only based on the information content of the variables, and so, for example, if you have two... (read more)

1passinglunatic
I think the motivation for the representability of some sets of conditional independences with a DAG is pretty clear, because people already use probability distributions all the time, they sometimes have conditional independences and visuals are nice. On the other hand the fundamental theorem relates orthogonality to independences in a family of distributions generated in a particular way. Neither of these things are natural properties of probability distributions in the way that conditional independence is. If I am using probability distributions, it seems to me I'd rather avoid introducing them if I can. Even if the reasons are mysterious, it might be useful to work with models of this type - I was just wondering if there were reasons for doing that are apparent before you derive any useful results.  Alternatively, is it plausible that you could derive the same results just using probability + whatever else you need anyway? For example, you could perhaps define X to be prior to Y if, relative to some ordering of functions by "naturalness", there is a more natural f(X,Y) such that X⊥f(X,Y) and X⊥/f(X,Y)|Y than any g(X,Y) such that Y⊥g(X,Y) etc. I have no idea if that actually works! However, I'm pretty sure you'll need something like a naturalness ordering in order to separate "true orthogonality" from "merely apparent orthogonality", which is why I think it's fair to posit it as an element of "whatever else you need anyway". Maybe not.

On the last example with the XOR temporal inference - since the partitions/queries we’re asking about are also possible factors, doesn’t the temporal data in terms of history etc depend on which choice of factorisation we go with?

We have a choice of 2 out of 3 factors each of which corresponds to one of the partitions in question, so surely by factorising in different ways we can make any two of the variables have history of 1 and thus automatically orthogonal?

2Scott Garrabrant
So we are allowing S to have more than 4 elements (although we dont need that in this case), so it is not just looking at a small number of factorizations of a 4 element set. This is because we want an FFS model, not just a factorization of the sample space. If you factor in a different way, X will not be before Y, but if you do this it will not be the case that X is orthogonal to X XOR Y. The theorem in this example is saying that X being orthogonal to X XOR Y implies that X is before Y.

I was thinking about the difficulty of finite factored sets not understanding the uniform distribution over 4 elements, and it makes me feel like something fundamental needs to be recast. An analogy came to mind about eigenvectors vs. eigenspaces.

What we might like to be true about the unit eigenvectors of a matrix is that they are the unique unit vectors for which the linear transformation preserves direction. But if two eigenvectors have the same eigenvalue, the choice of eigenvectors is not unique--we could choose any pair on that plane. So really, it s... (read more)

3Scott Garrabrant
Hmm, I doubt the last paragraph about sets of partitions is going to be valuable, bet the eigenspace thinking might be useful.  Note that I gave my thoughts about how to deal with the uniform distribution over 4 elements in the thread responding to cousin_it.

I'm a bit confused about how X can be independent of both Y and of (X xor Y). What would a probability distribution where this holds look like?

8Scott Garrabrant
Indeed, If X is independent of both Y and X xor Y, that violates the compositional semigraphiod axioms (assuming X is nondeterministic.) Although it could still happen e.g. in the uniform distribution on X x Y. In the example in the post, I mean for X to be independent of X xor Y and for X to not be independent of Y.  
3tailcalled
I think one thing that confuses me is, wouldn't Y also be before X then?
9Scott Garrabrant
Nope, we have X⊥X⊕Y, but not Y⊥X⊕Y. That breaks the symmetry.
5tailcalled
Ah of course! So many symbols to keep track of 😅
1tailcalled
Or wait, I'm dumb, that can definitely happen if X and Y are coin flips. But I feel like this doesn't add up with the other stuff, will need to read more carefully.

Imagine that I had a bit, and it’s either a 0 or a 1, and it’s either blue or green. And these two facts are primitive and independently generated. And I also have this other concept that’s like, “Is it grue or bleen?”, which is the XOR of blue/​green and 0⁄1.

There’s a sense in which we’re inferring X is before Y, and in that case, we can infer that blueness is before grueness. And that’s pointing at the fact that blueness is more primitive, and grueness is a derived property.

I found it helpful to work through this example. Let's say is "is the bit 0 ... (read more)

Some general comments:

Overcoming blindness

You mention above that Pearl's ontology 'has blinded us to the obvious next question'. I am very sympathetic to research programmes that try to overcome such blindness, this is the kind or research I have been doing myself recently. The main type of blindness that I have been trying to combat is blindness to complex types of self-referencing and indirect representation that can be present inside online machine learning agents, specifically in my recent work I have added a less blind viewpoint by modifying and... (read more)

"Well, what if I take the variables that I'm given in a Pearlian problem and I just forget that structure? I can just take the product of all of these variables that I'm given, and consider the space of all partitions on that product of variables that I'm given; and each one of those partitions will be its own variable.

How can a partition be a variable?  Should it be "part" instead?

3Ramana Kumar
Partitions (of some underlying set) can be thought of as variables like this: * The number of values the variable can take on is the number of parts in the partition. * Every element of the underlying set has some value for the variable, namely, the part that that element is in. Another way of looking at it: say we're thinking of a variable v:S→D as a function from the underlying set S to v's domain D. Then we can equivalently think of v as the partition {{s∈S∣v(s)=d}∣d∈D}∖∅ of S with (up to) |D| parts. In what you quoted, we construct the underlying set by taking all possible combinations of values for the "original" variables. Then we take all partitions of that to produce all "possible" variables on that set, which will include the original ones and many more.
3Vivek Hebbar
Makes perfect sense, thanks!

Definition paraphrasing attempt / question:

Can we say "a factorization B of a set S is a set of nontrivial partitions of S such that  " (cardinality not taken)? I.e., union(intersect(t in tuple) for tuple in cartesian_product(b in B)) = S. I.e., can we drop the intermediate requirement that each intersection has a unique single element, and only require the union of the intersections is equal to S?

3Scott Garrabrant
If I understand correctly, that definition is not the same. In particular, it would say that you can get nontrivial factorizations of a 5 element set: {{{0,1},{2,3,4}},{{0,2,4},{1,3}}}.
1lemonhope
That misses element 4 right? >>> from itertools import product >>> B = [[{0, 1}, {2, 3, 4}], [{0, 2, 3}, {1, 3}]] >>> list(product(*B)) [({0, 1}, {0, 2, 3}), ({0, 1}, {1, 3}), ({2, 3, 4}, {0, 2, 3}), ({2, 3, 4}, {1, 3})] >>> [set.intersection(*tup) for tup in product(*B)] [{0}, {1}, {2, 3}, {3}] >>> set.union(*[set.intersection(*tup) for tup in product(*B)]) {0, 1, 2, 3}
2Scott Garrabrant
Looks like you copied it wrong. Your B only has one 4.

I'm using some of the terminology I suggested here.

A factoring is a set of questions such that each signature of possible answers identifies a unique element. In 20 questions, you can tailor the questions depending on the answers to previous questions, and ultimately each element will have a bitstring signature depending on the history of yesses and nos. I guess you can define the question to include xors with previous questions, so that it effectively changes depending on the answers to others. But it's sometimes useful that the bitstrings are allowed to ... (read more)

7Scott Garrabrant
I think that the answers to both the concern about 7 elements, and the desire to have questions depend of previous questions come out of thinking about FFS models, rather than FFS. If you want to have 7 elements in Ω, that just means you will probably have more than 7 elements in S. If I want to model a situation where some questions I ask depend on other questions, I can just make a big FFS that asks all the questions, and have the model hide some of the answers.  For example, Let's say I flip a biased coin, and then if heads I roll a biased 6 sided die, and if tails I roll a biased 10 sided die. There are 16 outcomes in Ω.  I can build a 3 dimensional factored set 2x6x10, which I will imagine as sitting on my table with height 2. heads is on the bottom, and tails is on the top. f:S→Ω will then merge together the rows on the bottom, and the columns on the top, so it will look a little like the game Jenga. In this way, I am imagining there is some hidden data about each world in which I get heads and roll the 6 sided die, which is the answer to the question "what would have happened if I rolled the 10 sided die. Adding in all this counterfactual data gives a latent structure of 120 possible worlds, even though we can only distinguish 16 possible worlds.

I don't follow the math completely, but think I understand a little about how a couple of adjustments to the framework you've been using seems to point to the entanglement of the different values in the sets, and the ability to infer space/time relationships between them with a smaller subset of the data than would 'normally' be required. What caused me to comment was that a fair amount of my work has involved a few of the same topics and at least a few of the same trains of thought you deal with in this presentation. That is to say, I can see that what yo... (read more)

2Scott Garrabrant
Sure, if you want to send me an email and propose some times, we could set up a half hour chat. (You could also wait until I post all the math over the next couple weeks.)