This post was written during Alex Altair's agent foundations fellowship program, funded by the LTFF. Thanks to Alex Altair, Alfred Harwood, and Daniel C for feedback and comments.
Introduction
The selection theorems agenda aims to prove statements of the following form: "agents selected under criterion X have property Y," where Y is something like world models, general-purpose search, modularity, etc. We're going to focus on world models.
But what is the intuition that makes us expect to be able to prove such things in the first place? Why expect world models?
Because: assuming the world is a Causal Bayesian Network with the agent's actions corresponding to the D (decision) node, if its actions can robustly control the U (utility) node despite various "perturbations" in the world, then intuitively it must have learned the causal structure of how U's parents influence U in order to take them into account in its actions.
And the same for the causal structure of how U's parents' parents influence U's parents ... and by induction, it must have further learned the causal structure of the entire world upstream of the utility variable.
This is the intuitive argument that the paper Robust Agents Learn Causal World Models by Jonathan Richens and Tom Everitt formalizes.
Informally, its main theorem can be translated as: if an agent responds to various environment interventions by prescribing policies that overall yield low regret, then it's possible to appropriately query the agent to reconstruct an implicit world model that matches up with the ground truth causal structure.
I will refer to this result as the "Causal Good Regulator Theorem". This post opens the sequence Thoughts on the Causal Good Regulator Theorem with a self-contained explanation of the paper and my takes on it; the next post will work through the proof itself.
Basic Setup
World
The world is a Causal Bayesian Network G over a set of variables comprising the environment variables C, a utility node U, and a decision node D. The differences from a normal Causal Bayesian Network are that (1) U is a deterministic function U(PaU) of its parents, and (2) P(D∣PaD), the conditional probability distribution for D, is left undetermined: it is what our agent will select.
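To make this concrete, here is a minimal sketch in Python. The toy network (C1 → C2, with PaD = {C1} and PaU = {C2, D}) and all names are my own, not from the paper:

```python
# A toy instance of the setup: chance variables C1 -> C2, a decision D
# observing C1, and a deterministic utility U = 1[D = C2].
# P(D | Pa_D) is deliberately left unspecified: the agent chooses it.

import itertools

P_C1 = {0: 0.7, 1: 0.3}                       # P(C1)
P_C2_given_C1 = {0: {0: 0.9, 1: 0.1},         # P(C2 | C1)
                 1: {0: 0.2, 1: 0.8}}

def utility(c2, d):
    # U(Pa_U) is a deterministic function of its parents Pa_U = {C2, D}.
    return 1.0 if d == c2 else 0.0

def expected_utility(policy, P_C2=P_C2_given_C1):
    # Once the agent fixes P(D | Pa_D) (here Pa_D = {C1}), every CPD of
    # the network is determined, so E[U] is well defined.
    eu = 0.0
    for c1, c2, d in itertools.product([0, 1], repeat=3):
        p = P_C1[c1] * P_C2[c1][c2] * policy(c1)[d]
        eu += p * utility(c2, d)
    return eu
```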
Agent as a Policy Oracle
In this paper's setup, the agent is treated as a mechanism that, informally, takes in an intervention on the world and returns a policy, i.e., it chooses P(D∣PaD).
Following this, we formally define an agent via a policy oracle ΠΣ: a function that takes in an intervention σ∈Σ (where Σ is the set of all allowed interventions on C) and returns a policy πσ(D∣PaD).
With P(D∣PaD) under σ set to πσ, all the conditional probability distributions of G under σ are determined, so we can e.g. calculate the expected utility Eπσ[U].
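Continuing the toy example, a policy oracle might look like the sketch below. Encoding an intervention as "a replacement CPD for C2" is my own simplification, not the paper's formalism:

```python
def policy_oracle(sigma_P_C2):
    # Pi_Sigma: an intervention sigma (here: a replacement CPD for C2)
    # goes in, a policy pi_sigma(D | Pa_D) comes out.
    def policy(c1):
        # D only observes C1, so it bets on the likelier value of C2 given C1.
        d_star = max([0, 1], key=lambda d: sigma_P_C2[c1][d])
        return {d: 1.0 if d == d_star else 0.0 for d in [0, 1]}
    return policy

pi = policy_oracle(P_C2_given_C1)      # sigma = "no change"
print(expected_utility(pi))            # E_{pi_sigma}[U] = 0.87
```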
This definition makes intuitive sense - if a human is told about changes in the environment (e.g., the environment has changed to "raining"), they would change their policy accordingly (e.g., a greater chance of taking the umbrella when going out).
But it is also unnatural in the sense that the oracle directly receives the environmental intervention as input, unlike real agents, which have to infer it through sensory organs that are themselves embedded in the world. This will be discussed further later in the post.
"Robustness" as δ-optimality under interventions
By a "robust" agent, intuitively we mean an agent that can consistently maximize its utility despite its environment being subject to various interventions. We formalize this in terms of "regret bounds".
We say a policy oracle ΠΣ has regret δ if, for all allowed interventions σ∈Σ, the policy πσ the oracle prescribes attains expected utility lower than the maximum expected utility attainable in Gσ by at most δ (that is, Eπσ[U]≥Eπ∗[U]−δ), bounding its suboptimality.
We denote a δ-optimal policy oracle by ΠδΣ, and a 0-optimal one by Π∗Σ.
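As a sketch of what the regret bound means operationally, still in the toy example (with Σ encoded as a list of replacement CPDs, my convention):

```python
def det_policy(d0, d1):
    # The deterministic policy playing d0 when C1 = 0 and d1 when C1 = 1.
    return lambda c1: {d: float(d == (d0 if c1 == 0 else d1)) for d in [0, 1]}

def regret(oracle, Sigma):
    worst = 0.0
    for sigma in Sigma:
        eu = expected_utility(oracle(sigma), P_C2=sigma)
        # Best attainable expected utility: brute force over the four
        # deterministic policies (one of them is always optimal here).
        best = max(expected_utility(det_policy(d0, d1), P_C2=sigma)
                   for d0 in [0, 1] for d1 in [0, 1])
        worst = max(worst, best - eu)
    return worst  # the oracle is delta-optimal iff regret <= delta

Sigma = [P_C2_given_C1,
         {0: {0: 0.5, 1: 0.5}, 1: {0: 0.5, 1: 0.5}},
         {0: {0: 0.1, 1: 0.9}, 1: {0: 0.6, 1: 0.4}}]
print(regret(policy_oracle, Sigma))   # -> 0.0: this oracle is exactly optimal
```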
Now we have to choose some class of allowed interventions Σ. This is put in a collapsible section because it doesn't matter too much for the high-level discussion of the first post.
Choice of Σ
Now the question is the choice of Σ - how broad a set of perturbations do we want "robustness" to hold under? Ideally we would want it to be small while still letting us prove interesting theorems - the broader this class is, the broader the set of environments in which our oracle is assumed to return a δ-optimal policy, making the agent less realistic.
Recall hard interventions: σ=do(Ci=α) replaces P(Ci∣Pai) with a delta distribution P(Ci=ci∣Pai=pai;σ):=1[ci=α], so the distribution factorizes as follows:
Pdo(V=v′)(c) = ∏i:Ci∉V P(Ci=ci∣Pai=pai) if c is consistent with v′, and 0 otherwise.
Soft interventions instead change P(Ci∣Pai) more generally - the result need not be a delta function. A soft intervention could even change the parent set Pai to a different set Pa′i, as long as it doesn't introduce a cycle, of course.
The set of all soft interventions seems like a good choice of Σ, until you realize that the condition "as long as it doesn't introduce a cycle" assumes we already know the graph structure of G - but that's exactly what we want to discover!
So the paper considers a restricted form of soft intervention called "local interventions": on top of P(Ci∣Pai), apply a function f:Val(Ci)→Val(Ci). Because f only remaps the values of Ci, it does not change the fact that Ci depends on Pai. The intervention σ=do(Ci=f(ci)) yields P(Ci=c′i∣Pai=pai;σ):=∑ci:f(ci)=c′i P(ci∣pai).
Examples: a hard intervention do(Ci=α) is the special case where f is the constant function f(ci)=α; flipping a binary variable, f(ci)=1−ci, is another local intervention.
The paper extends this class further, considering "mixtures of local interventions": a mixed intervention σ∗=∑ipiσi with ∑ipi=1 denotes randomly performing local intervention σi with probability pi.
Examples: with probability p leave Ci alone (f = identity) and with probability 1−p set it to a constant α, randomizing between the null intervention and the hard intervention do(Ci=α). A sketch of both constructions follows.
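Here is that sketch in the toy encoding (my own; note that for interventions on the same node, the mixed network's CPD at that node is just the q-mixture of the intervened CPDs, since all other factors agree):

```python
def local_intervention(P_Ci_given_Pa, f):
    # P(Ci = c' | pa; sigma) = sum over {c : f(c) = c'} of P(c | pa):
    # f only remaps the values of Ci, so Ci still depends on Pa_i.
    out = {}
    for pa, dist in P_Ci_given_Pa.items():
        new = {v: 0.0 for v in dist}
        for c, p in dist.items():
            new[f(c)] += p
        out[pa] = new
    return out

# A hard intervention do(C2 = 1) is the local intervention with constant f:
hard = local_intervention(P_C2_given_C1, lambda c: 1)

def mix(q, P_a, P_b):
    # Mixture q*sigma_a + (1-q)*sigma_b of two interventions on the
    # same node, expressed as the q-mixture of the resulting CPDs.
    return {pa: {v: q * P_a[pa][v] + (1 - q) * P_b[pa][v] for v in P_a[pa]}
            for pa in P_a}

print(mix(0.5, hard, P_C2_given_C1))
```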
Aside from these "local" interventions, which only depend on the value of the node being intervened on, the paper extends the intervention class with a specific form of structural intervention on the decision node D.
Note: The paper mentions that this can be implemented by local interventions, but I don't think so, since this is a structural intervention that doesn't just depend on the values of D. A set of hard interventions setting PaD∖Pa′D to constants wouldn't work either, because then we're not just masking inputs to D, but also masking inputs to the other descendants of PaD∖Pa′D.
Assumptions
1) Unmediated Decision Task states that DesD∩AncU=∅, i.e., no descendant of D is an ancestor of U, so D can influence U only directly. This is a pretty major restriction.
2) Domain dependence states that there exist distributions P(C) and P′(C) over the chance variables (compatible with M) such that argmaxπEπP[U]≠argmaxπEπP′[U].
Together, these imply that the optimal policy genuinely depends on the distribution over the chance variables, so the oracle's responses to interventions carry information about the environment.
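As a quick sanity check in the toy example (my illustration, not from the paper; the unmediated assumption also holds there, since D's only descendant is U), domain dependence is easy to exhibit:

```python
# Two environment distributions (encoded as CPDs for C2) with different
# optimal policies: the toy task satisfies domain dependence.
P_a = {0: {0: 0.9, 1: 0.1}, 1: {0: 0.2, 1: 0.8}}
P_b = {0: {0: 0.1, 1: 0.9}, 1: {0: 0.8, 1: 0.2}}

print(policy_oracle(P_a)(0), policy_oracle(P_b)(0))
# -> {0: 1.0, 1: 0.0} vs {0: 0.0, 1: 1.0}: the argmax policy differs.
```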
Main Theorem
With this basic setup explained, here is the main theorem of the paper (Theorem 1), stated informally: given a 0-optimal policy oracle Π∗Σ, there is an algorithm that, by querying the oracle, identifies the true conditional distributions P(Ci∣Pai) for every variable Ci upstream of U;
and a second theorem for the approximate case (Theorem 2), again informally: given a δ-optimal policy oracle ΠδΣ, the algorithm recovers approximate conditionals ^P(Ci∣Pai) whose error scales linearly with δ for small δ.
High-level argument
The proof is essentially a formalization of the argument given in the introduction, restated here in the framework introduced so far.
The world is a Causal Bayesian Network that contains, among its nodes, the agent's decision node D and the utility node U.
Suppose we're given an oracle that takes an intervention σ as input and returns a policy (a conditional distribution for the decision node given its parents) that attains maximum utility under that intervention. This oracle operationalizes the notion of a "robust agent."
The question of "How robust?" is determined by the class of interventions considered. The broader this class is, the broader the set of environments in which our oracle is assumed to return an optimal policy, and hence the less realistic the agent.
So how do we use the oracle?
Suppose you have two interventions σ and σ′, and you "interpolate"[1] between them by intervening under σ with a probability of q, and intervening under σ′ with a probability of 1−q. Denote such an intervention ~σ(q)=qσ+(1−q)σ′.
If some decision, say, d4 is the optimal decision returned by the oracle under σ, and d1 is the optimal decision returned under σ′, then as you gradually change q from 0 to 1, the optimal decision will switch from d1 to other decisions (d2,d3,…) and eventually to d4.
Call the value of q at which the decision finally switches over to d4 "qcrit".
The critical insight is that qcrit can be estimated by querying the oracle with various values of q, and it can also be expressed as an equation involving terms corresponding to the conditional probability distributions of the Bayes net (call them P(Ci∣Pai) where Ci is some node).
By cleverly choosing σ and equating the estimate with the expression, all of the P(Ci∣Pai) terms for Ci upstream of U can be solved for, working backwards from U's parents to their parents and so on - this is where the earlier "induction" intuition comes in.
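To make the querying step concrete, here is a bisection sketch for estimating qcrit in the two-decision toy example. The real algorithm handles more decisions and multiple switch points; this is only my illustration, reusing the earlier toy definitions:

```python
def decision_at(oracle, sigma, c1=0):
    # The (deterministic) decision the oracle's policy takes at observation c1.
    pi = oracle(sigma)
    return max([0, 1], key=lambda d: pi(c1)[d])

def q_crit(oracle, sigma, sigma_prime, tol=1e-6):
    # Bisect on q in ~sigma(q) = q*sigma + (1-q)*sigma', looking for the
    # point where the prescribed decision switches to the sigma-optimal one.
    d_hi = decision_at(oracle, sigma)
    lo, hi = 0.0, 1.0
    while hi - lo > tol:
        q = (lo + hi) / 2
        if decision_at(oracle, mix(q, sigma, sigma_prime)) == d_hi:
            hi = q
        else:
            lo = q
    return (lo + hi) / 2

sigma       = {0: {0: 0.1, 1: 0.9}, 1: {0: 0.2, 1: 0.8}}   # favors C2 = 1
sigma_prime = {0: {0: 0.9, 1: 0.1}, 1: {0: 0.2, 1: 0.8}}   # favors C2 = 0
print(q_crit(policy_oracle, sigma, sigma_prime))  # -> ~0.5
# Equating this estimate with its closed-form expression in the CPDs is
# what lets us solve for the P(Ci | Pa_i) terms.
```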
Furthermore, all of this can be relaxed to approximately-optimal policy oracles.
The claim for the approximate case is that, given an oracle that is "imperfect by an amount δ", it is possible to identify a subgraph of G corresponding to the variables upstream of U, and that the estimated conditional distribution ^P(Ci|Pai) will differ from the true conditional distribution P(Ci|Pai) by an amount that scales linearly with δ for small values of δ.
Discussion
Viewed in terms of the selection theorems agenda, I think the results of this paper signify real progress: (1) they formalize exactly the intuitive argument often used to argue for the existence of world models, (2) they incorporate causal information, (3) they provide an explicit algorithm for world-model discovery, and (4) they extend the proof to the approximately optimal oracle case - each of which is an important advancement on its own.
However, this work still leaves much room for improvement.
A policy oracle is not a good model of an agent.
Real agents respond to changes in the world through sensory organs that are embedded in the world. The use of a policy oracle, however, implies that the agent can directly perceive the environmental intervention; the policy oracle acts as a sort of sensory organ that lives outside the causal graph of the world.
That's why, even in a case where D has no parents at all, the policy oracle can produce policies sensitive to environmental interventions. Loosely speaking, the agent's senses are only partially embedded in the environment (D's connection to its parents), and the rest (its ability to sense interventions) lives outside the causal graph of the environment.
The Causal Good Regulator Theorem isn't a structural theorem.
I think the term "world model" in the paper's title is misleading.
The existence of a "world model" is an inherently structural claim. It is a question about whether a mind's cognition explicitly uses/queries a modular subsystem that "abstracts away" the real world in some sense.
But the theorems of this paper are purely about behavioral properties - namely that some algorithm, given access to the policy oracle and the set of variables in G, can query the policy oracle in a particular way to reconstruct G accurately. This says nothing about whether this reconstructed G is actually used internally within the agent's cognition!
This is akin to how, e.g., the VNM representation theorem is not structural: while it is possible to reconstruct a utility function from the betting behavior of agents satisfying certain axioms, that does not imply that the agent internally represents this utility function and argmaxes it.
I think this paper's result is better understood as a way to derive an implicit behavioral causal world model of an agent over a given set of variables, in the sense of answering the question: "given that I represent the world in terms of these variables C, what causal relationships among these variables does the agent believe in?"
This is itself a very cool result! The algorithm infers the agent's implied beliefs about causality with respect to any choice of variable ontology (i.e., the choice of C used to represent the world).
For example, I can literally imagine ...
But again, it would be misleading to call these C, alongside their inferred causal relationships, the "world model" of the human/LLM.
Conclusion
The paper Robust Agents Learn Causal World Models signifies real progress on the selection theorems agenda, proving that it is possible to derive an implicit behavioral causal world model from low-regret policy oracles (agents) by appropriately querying them. But it leaves plenty of room for improvement, especially in making its claims more "structural."
I hope this post has served as a self-contained explanation of the paper, along with my take on it.
However, my primary motivation for reading this paper was to more rigorously understand the proof of the theorem, with the goal of identifying general proof strategies that might be applied to proving future selection theorems. This is especially relevant because this paper appears to be the first that proves a selection theorem of substantial content while being rooted in the language of causality, and I believe causality will play a critical role in future selection theorems.
For that, stay tuned for the next post in the sequence, to be published soon.