An Illustrated Summary of "Robust Agents Learn Causal World Model"

Dalcy

This post was written during Alex Altair's agent foundations fellowship program, funded by LTFF. Thank you Alex Altair, Alfred Harwood, Daniel C for feedback and comments.

Introduction

The selection theorems agenda aims to prove statements of the following form: "agents selected under criteria has property $Y$ ," where $Y$ are things such as world models, general purpose search, modularity, etc. We're going to focus on world models.

But what is the intuition that makes us expect to be able to prove such things in the first place? Why expect world models?

Because: assuming the world is a Causal Bayesian Network with the agent's actions corresponding to the $D$ (decision) node, if its actions can robustly control the $U$ (utility) node despite various "perturbations" in the world, then intuitively it must have learned the causal structure of how $U$ 's parents influence $U$ in order to take them into account in its actions.

And the same for the causal structure of how $U$ 's parents' parents influence $U$ 's parents ... and by induction, it must have further learned the causal structure of the entire world upstream of the utility variable.

This is the intuitive argument that the paper Robust Agents Learn Causal World Model by Jonathan Richens and Tom Everitt formalizes.

Informally, its main theorem can be translated as: if an agent responds to various environment interventions by prescribing policies that overall yield low regret, then it's possible to appropriately query the agent to reconstruct an implicit world model that matches up with the ground truth causal structure.

I will refer to this result as the "Causal Good Regulator Theorem". This sequence Thoughts on the Causal Good Regulator Theorem will do the following:

[1] a precise statement of the theorem alongside a high-level argument of its proof and discussions of the paper's results in the context of the selection theorem agenda - basically a self-complete distillation of the paper. The rest of the posts in this sequence are optional. (this post)
[2] detailed proof of the paper's theorems, presented in a way that focuses on how one could've discovered the paper's results by themselves, with an emphasis on intuition.
[3] extension of the paper's theorem by showing how one of the main assumptions can be relaxed without much trouble.

Basic Setup

World

The world is a Causal Bayesian Network $G$ over the set of variables corresponding to the environment $C$ , utility node $U$ , and decision node $D$ . The differences from a normal Causal Bayesian Network is that (1) $U$ is a deterministic function of its parents $U ({P a}_{U})$ , and (2) $P (D ∣ {P a}_{D})$ , the conditional probability distribution for $D$ , is undetermined—it's something that our agent will select.

Agent as a Policy Oracle

In this paper's setup, the agent is treated as a mechanism that, informally, takes in an intervention to the world, and returns a policy i.e. it chooses $P (D ∣ {P a}_{D})$ .

Following this, we formally define an agent via a policy oracle $Π_{Σ}$ which is a function that takes in an intervention $σ \in Σ$ (where $Σ$ represents the set of all allowed interventions over $C$ ) and returns a policy $π_{σ} (D ∣ {P a}_{D})$ .

With the $P (D ∣ {P a}_{D})$ determined under $σ$ as $π_{σ}$ , all the conditional probability distributions of $G$ under $σ$ are determined, meaning we can e.g., calculate the expected utility, $E^{π_{σ}} [U]$ .

This definition makes intuitive sense - if a human is told about changes in the environment (e.g., The environment has changed to "raining"), they would change their policy accordingly (e.g., greater change of taking the umbrella when going out).

But also, it is unnatural in the sense that it directly receives environmental intervention as input, unlike real agents which have to figure it out with its sensory organs that are also embedded in the world. This will be further discussed later in the post.

"Robustness" as $δ$ -optimality under interventions

By a "robust" agent, intuitively we mean an agent that can consistently maximize its utility despite its environment being subject to various interventions. We formalize this in terms of "regret bounds".

We say a policy oracle $Π_{Σ}$ has a regret $δ$ if for all interventions allowed $σ \in Σ$ , the $π_{σ}$ the oracle prescribes attains expected utility that is lower than the maximum expected utility attainable in $G_{σ}$ by at most $δ$ (that is, $E^{π_{σ}} [U] \geq E^{π^{*}} [U] - δ$ ), indicating a bound on suboptimality.

We denote a $δ$ -optimal policy oracle $Π_{Σ}^{δ}$ , and $0$ -optimal one $Π_{Σ}^{*}$ .

Now we have to choose some class of allowed interventions $Σ$ . This is put in a collapsible section because it doesn't matter too much for the high-level discussion of the first post.

Choice of $Σ$

Now the question is the choice of $Σ$ - how broad of a set of perturbation do we want "robustness" to hold in? Ideally we would want it to be small while letting us prove interesting theorems - the broader this class is, the broader the set of environments our oracle is assumed to return a $δ$ -optimal policy, making the agent less realistic.

Recall hard interventions: $σ = d o (C_{i} = α)$ replaces $P (C_{i} ∣ {P a}_{i})$ to a delta distribution $P (C_{i} = c_{i} ∣ {P a}_{i} = {p a}_{i}; σ) := 1 [c_{i} = α]$ , and so the distribution factorizes to the following:

$P_{d o (V = v^{'})} (c) = {\begin{matrix} \prod_{i : C_{i} \notin V} P (C_{i} = c_{i} ∣ {P a}_{i} = {p a}_{i}) & if c consistent with v^{'} 0 & otherwise. \end{matrix}$

Soft interventions instead change $P (C_{i} ∣ {P a}_{i})$ more generally - not necessarily a delta function. It could even change the parent set ${P a}_{i}$ to a different set ${P a}_{i}^{'}$ as long as it doesn't introduce a cycle, of course.

The set of all soft interventions seem like a good choice of $Σ$ , until you realize that the condition "as long as it doesn't introduce a cycle" assumes we already know the graph structure of $G$ - but that's what we want to discover!

So the paper considers a restricted form of soft intervention called "local interventions." On top of $P (C_{i} ∣ {P a}_{i})$ , they apply a function $f : Val (C_{i}) \to Val (C_{i})$ . Because the function only maps the values of $C_{i}$ locally, it does not change the fact that $C_{i}$ depends on ${P a}_{i}$ . $σ = d o (C_{i} = f (c_{i}))$ yields $P (C_{i} = c_{i}^{'} ∣ {P a}_{i} = {p a}_{i}; σ) := \sum_{c_{i} : f (c_{i}) = c_{i}^{'}} P (c_{i} ∣ {p a}_{i})$ .

Examples:

$f (c_{i}) = α$ makes $P (c_{i}^{'} ∣ {pa}_{i}; σ) = \sum_{c_{i} : α = c_{i}^{'}} P (c_{i} ∣ {pa}_{i}) = 1 [c_{i} = α]$ , corresponding to the hard intervention $C_{i} = α$ .
$f (c_{i}) = c_{i} + k$ makes $P (c_{i}^{'} ∣ {pa}_{i}; σ) = \sum_{c_{i} : c_{i} + k = c_{i}^{'}} P (c_{i} ∣ {pa}_{i}) = P (c_{i}^{'} - k ∣ {pa}_{i})$ , intuitively corresponding to e.g., shifting $C_{i}$ by $k$ ; maybe $C_{i}$ represents the location of some object in an RL world.

The paper extends this class further, considering "mixtures of local interventions," where we denote a mixed intervention $σ^{*} = \sum p_{i} σ_{i}$ for $\sum p_{i} = 1$ , which denotes randomly performing a local intervention $σ_{i}$ with probability $p_{i}$ .

Examples:

Suppose $C_{i}$ takes $k = Val (C_{i})$ values $v_{1}, \dots, v_{k}$ . Then consider a local intervention $f_{j} (c_{i}) = v_{j}$ , $σ_{j} = do (C_{i} = f_{j} (c_{i}))$ , and a mixture $σ^{*} = \sum p_{j} σ_{j}$ for some $p_{j}$ that sums to $1$ . This lets us set $C_{i}$ to literally any arbitrary distribution we want!
- This now makes it seem like a very broad class of interventions thus bad. But that's what we have for now.

Aside from these "local" interventions that only depend on the value of a node we're intervening on, the paper extends the intervention class by adding a specific form of structural intervention on the decision node $D$ .

Specifically, they consider masking inputs to $D$ such that it only depends on ${P a^{'}}_{D} \subseteq {P a}_{D}$ . This is somewhat reasonable (despite being structural) if we assume prior knowledge of which nodes influence $D$ , e.g., perhaps by design, and control over what the decision node reads as the values of these variables.

Note: The paper mentions that this can be implemented by local interventions, but I don't think so since this is a structural intervention that doesn't just depend on values of $D$ . A set of hard interventions that set ${P a}_{D} ∖ {P a^{'}}_{D}$ to a constant wouldn't work, because then we're not just masking inputs to D, but also masking inputs to other descendants of ${P a}_{D} ∖ {P a^{'}}_{D}$ .

Assumptions

1) Unmediated Decision Task states that ${D e s}_{D} \cap {A n c}_{U} = \emptyset$ . This is pretty major.

The left diagram doesn't satisfy the Unmediated Decision Task assumption, the right one does.

2) Domain dependence states that there exists distributions over the chance variables $P (C)$ and $P^{'} (C)$ (compatible with $M$ ) such that $arg {max}_{π} E_{P}^{π} [U] \neq arg {max}_{π} E_{P^{'}}^{π} [U]$ .

This is very reasonable. If domain dependence does not hold, then the optimal policy is just a constant function.

These together imply:

There does not exist a decision $d \in dom (D)$ that is optimal, i.e. $arg {max}_{d} U (d, x)$ across all $x \in X = {P a}_{U} ∖ {D}$ .
$D \in {P a}_{U}$ , i.e. there can't be any intermediate nodes between $D$ and $U$ , and all causal effects from $D$ to $U$ must be direct.

Main Theorem

With this basic setup explained, here is the main theorem of the paper (Theorem 1):

For almost all worlds $(G, P)$ satisfying assumption 1 and 2, we can identify the directed acyclic graph $G$ and the joint distribution $P$ over all variables upstream of $U$ , given that we have access to a $0$ -optimal policy oracle.

and a second theorem, which is the approximate case (Theorem 2):

For almost all worlds $(G, P)$ satisfying assumption 1 and 2 and some other new assumptions (explained below), we can identify the directed acyclic graph $G$ and the joint distribution $P$ over some subset of variables upstream of $U$ , and the quality of estimation for each of the conditional distributions scale linearly with $δ$ .

High-level argument

The proof is basically a formalization of the argument given in the introduction, but here we state it in the framework so far introduced.

The world is a Causal Bayesian Network, also containing two nodes each corresponding to an agent's decision node $D$ and utility node $U$ .

Suppose we're given an oracle that takes in an intervention $σ$ as input and returns a policy (conditional distribution of the decision node given its parents) that attains maximum utility under that intervention in the environment. This oracle is an operationalizes the notion of a "robust agent."

The question of "How robust?" is determined by the class of interventions to be considered. The broader this class is, the broader the set of environments our oracle is assumed to return an optimal policy, hence the agent is less realistic.

So how do we use the oracle?

Suppose you have two interventions $σ$ and $σ^{'}$ , and you "interpolate"^[1] between them by intervening under $σ$ with a probability of $q$ , and intervening under $σ^{'}$ with a probability of $1 - q$ . Denote such an intervention $~ σ (q) = q σ + (1 - q) σ^{'}$ .

If some decision, say, $d_{4}$ is the optimal decision returned by the oracle under $σ$ , and $d_{1}$ is the optimal decision returned under $σ^{'}$ , then as you gradually change $q$ from $0$ to $1$ , the optimal decision will switch from $d_{1}$ to other decisions ( $d_{2}, d_{3}, \dots$ ) and eventually to $d_{4}$ .

Call the value of $q$ at which the decision eventually switches over to $d_{4}$ , $q_{crit}$ .

The critical insight is that $q_{crit}$ can be estimated by querying the oracle with various values of q, and it also can expressed as an equation involving a number of terms corresponding to the conditional probability distributions of the Bayes Net (call it $P (C_{i} ∣ {P a}_{i})$ where $C_{i}$ is some node).

By cleverly setting $σ$ and equating the estimate and the expression, all of the $P (C_{i} ∣ {P a}_{i})$ terms for $C_{i}$ upstream of $U$ can be solved. This is where we involve the earlier intuition of "induction":

Assume some directed path $C_{k} \to \dots \to C_{1}$ where $C_{1} \in {P a}_{U}$
Step $1$ , $σ$ is set such that the expression for $q_{crit}$ only contains $P (C_{1} ∣ {P a}_{1})$ , so it can be immediately solved by comparing with the estimate.
Step k, assume we know $P (C_{i} ∣ {P a}_{i})$ for $i \in 1, \dots, k - 1$ , and set $σ$ such that the expression contains $P (C_{i} ∣ {P a}_{i})$ for all $i \in 1, \dots, k - 1, k$ . The only unknown here is $P (C_{k} ∣ {P a}_{k})$ .
Thus, by induction, we should be able to infer the conditional probability distribution table of $P (C_{i} ∣ {P a}_{i})$ for every variable upstream of $U$ .

Furthermore, all of this can be relaxed to approximately-optimal policy oracles.

The claim for the approximate case is that given an oracle that is "imperfect by $δ$ amount" it is possible to identify a subgraph of the graph corresponding to variables upstream of $U$ , and that the conditional distribution it estimates $^P (C_{i} | {P a}_{i})$ will differ from the true conditional distribution $P (C_{i} | {P a}_{i})$ by an amount that scales linearly with $δ$ for small values of $δ$ .

Discussion

When viewed in terms of the selection theorem agenda, I think the results of this paper signify real advancement - (1) in its attempt at formalizing exactly the intuitive argument often used to argue for the existence of world models, (2) incorporation of causal information, (3) providing an explicit algorithm for world model discovery, and (4) relaxing its proof to the approximately optimal oracle case - many of which are important advancements on their own.

However, this work still leaves many room for improvement.

Policy oracle is not a good model of an agent.

Real agents respond to changes in the world through its sensory organs that are embedded in the world. However, the use of policy oracles imply that the agent can directly perceive the environmental intervention, where the policy oracle is sort of a sensory organ that lives outside of the causal graph of the world.

That's why even in cases like the above, where D has no input, the policy oracle can produce policies sensitive to environmental interventions. Loosely speaking, the agent's sensory nodes are only partially embedded into the environment (D's connection to its parents), and the rest (its ability to sense interventions) are outside of the causal graph of the environment.

The Causal Good Regulator Theorem isn't a structural theorem.

I think the term "world model" in the paper's title is misleading.

The existence of a "world model" is an inherently structural claim. It is a question about whether a mind's cognition explicitly uses/queries a modular subsystem that "abstracts away" the real world in some sense.

But the theorems of this paper are purely about behavioral properties - namely that some algorithm, given access to the policy oracle and the set of variables in $G$ , can query the policy oracle in a particular way to reconstruct $G$ accurately. This says nothing about whether this reconstructed $G$ is actually used internally within the agent's cognition!

If we take the output of this algorithm and view it literally as a "world model" of the agent, then that implies the agent's "world model" is perfectly isomorphic to a subset of the environment's causal model. That would be very wrong. It shouldn't be that the "world model" recovered by the algorithm should depend on choice of variables used in $G$ , the world model should have its own ontology!

This is akin to how e.g., the VNM representation theorem is not structural: While it is possible to reconstruct a utility function from the betting behavior of agents satisfying certain axioms, that does not imply e.g., that the agent internally represents this utility function and argmaxes it.

I think this paper's result is better understood as a way to derive an implicit behavioral causal world model of an agent over a given set of variables, in the sense of answering the question: "given that I represent the world in terms of these variables $C$ , what causal relationship among these variables do the agent believe in?"

This itself is a very cool result!! The algorithm is inferring the agent's implied belief about causality with respect to any choice of variable ontology (i.e. choice of $C$ used to represent the world).

For example, I can literally imagine ...

... having $C_{1}$ to consist of variables like "sodium intake, blood pressure, arterial stiffness, ... vascular inflammation", and $C_{2}$ to consist of variables like "dietary habits, cardiovascular risk, life expectancy"
... then running the proof's algorithm on a human/LLM by treating them as a policy oracle (where $σ$ can be formulated as a text input, like providing them the text "your dietary habits are set to X ... ")
... and being able to infer the human/LLM's implied causal beliefs on these two different ontologies about healthcare issues!

But again, it would be misleading to call these $C$ alongside their inferred causal relationship the "world model" of the human/LLM.

Conclusion

The paper Robust Agents Learn Causal World Model signifies a real advancement in the selection theorem agenda, proving that it is possible to derive an implicit behavioral causal world model from policy oracles (agents) with low regret by appropriately querying them. But it has a lot of room for improvement, especially in making its claims more "structural."

I believe this post was a self-contained explanation of the paper along with my take on it.

However, my primary motivation for reading this paper was to more rigorously understand the proof of the theorem, with the goal of identifying general proof strategies that might be applied to proving future selection theorems. This is especially relevant because this paper appears to be the first that proves a selection theorem of substantial content while being rooted in the language of causality, and I believe causality will play a critical role in future selection theorems.

For this, wait for the next post in the sequence soon to be published.

[-]Jonathan Richens7mo*140

This is such a good deep dive into our paper, which I will be pointing people to in the future. Thanks for writing it!

Agree that conditioning on the intervention is unnatural for agents. One way around this is to note that adapting to an unknown distributional shift given only sensory inputs Pa_D is strictly harder than adapting to a known distributional shift (given Pa_D and sigma). It follows that any agent capable of adapting given only its sensory inputs must have learned a CWM (footnotes, p6).

[-]Lucas Teixeira7mo100

Strong upvote. Very clearly written and communicated. I've been recently thinking about digging deeper into this paper with the hopes of potentially relating it to some recent causality based interpretability work and reading this distillation has accelerated my understanding of the paper. Looking forward to the rest of the sequence!

LESSWRONG
LW

67

An Illustrated Summary of "Robust Agents Learn Causal World Model"

67

Ω 23

Introduction

Basic Setup

World

Agent as a Policy Oracle

"Robustness" as $δ$ -optimality under interventions

Assumptions

Main Theorem

High-level argument

Discussion

Policy oracle is not a good model of an agent.

The Causal Good Regulator Theorem isn't a structural theorem.

Conclusion

New to LessWrong?

67

Ω 23

67

An Illustrated Summary of "Robust Agents Learn Causal World Model"

67

Ω 23

Introduction

Basic Setup

World

Agent as a Policy Oracle

"Robustness" as δ-optimality under interventions

Assumptions

Main Theorem

High-level argument

Discussion

Policy oracle is not a good model of an agent.

The Causal Good Regulator Theorem isn't a structural theorem.

Conclusion

New to LessWrong?

67

Ω 23

"Robustness" as $δ$ -optimality under interventions