This is a post explaining, in detail, the proof of the paper Robust Agents Learn Causal World Models. Check the previous post in the sequence for a higher-level summary and discussion of the paper, including an explanation of the basic setup (terminology and assumptions), which this post will assume from here on.
Recalling the Basic Setup
Let's recall the basic setup (again, check the previous post for more explanation):
[World] The world is a Causal Bayesian Network G over the set of variables corresponding to the environment C, utility node U, and decision node D. The differences from a normal Causal Bayesian Network is that (1) U is a deterministic function of its parents U(PaU), and (2) P(D∣PaD), the conditional probability distribution for D, is undetermined—it's something that our agent will select.
[Agent as a Policy Oracle] An agent is a policy oracle ΠΣ which is a function that takes in an intervention σ∈Σ (where Σ represents the set of all allowed interventions over C) and returns a policy πσ(D∣PaD).
[Robustness as δ-optimality under interventions] We define a "robust" agent as a policy oracle whose regret is bounded by δ under some class of interventions over the environment Σ.
Assumptions
Also recall the following assumptions:
1) Unmediated Decision Task states that DesD∩AncU=∅. This is a major restriction: no descendant of D may be an ancestor of U, so D can only affect U directly.
2) Domain dependence states that there exist distributions over the chance variables P(C) and P′(C) (compatible with M) such that argmaxπEπP[U]≠argmaxπEπP′[U].
This is very reasonable. If domain dependence does not hold, then the optimal policy is just a constant function.
These together imply:
There does not exist a decision d∈dom(D) that is universally optimal, i.e. d∈argmaxd′U(d′,x) for all x∈dom(X), where X=PaU∖{D}.
D∈PaU, i.e. there can't be any intermediate nodes between D and U, and all causal effects from D to U must be direct.
Now, with the basic setup of the paper recalled, let's prove the main theorems.
Proof of the Exact Case
We will first prove Theorem 1:
For almost all worlds (G,P) satisfying assumptions 1 and 2, we can identify the directed acyclic graph G and the joint distribution P over all variables upstream of U, given access to a 0-optimal policy oracle.
I will attempt to present the proof in a way that focuses on how one could've discovered the paper's results by themselves, emphasizing intuitions and how trying to formalize them naturally constrains the solutions or assumptions we must use.
Load-Bearing part of Oracle Use
Somehow we're going to have to elicit information out of the policy oracle.
Recall that the oracle is a function that maps an intervention to a policy, which is a conditional probability distribution πσ(D∣PaD). It can be shown that if the oracle is optimal, this distribution is (almost always) a deterministic map. The argument goes as follows:
Suppose that given a context PaD=paD, two decisions d and d′ have the same expected utility.
Then, we can argue that this is extremely unlikely:
Intuitively because exact equality is very unlikely for real number things.
More rigorously, because expressing this equality results in a polynomial constraint over the parameters, whose solution set has Lebesgue measure 0 over the parameterization of the conditional probability distributions.
Therefore it is extremely likely (probability 1) that there is a strict ordering of decisions (according to their expected utility) given a context, i.e. almost always a unique maximum EU decision given a context. Therefore an optimal policy must only choose that decision in that given context, i.e. deterministic.
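As a sanity check of the measure-zero intuition, here is a small self-contained Python sketch (my own toy construction, not the paper's code): it draws random utility tables and context distributions and checks whether any context ends up with a tied maximum-expected-utility decision.

```python
import random

# Toy check of the measure-zero tie argument: draw the utility table and the
# context-conditional distributions at random, then verify that the
# maximum-expected-utility decision in each context is unique.
random.seed(0)

N_DECISIONS, N_CONTEXTS, N_STATES = 3, 4, 5

def random_world():
    # U(d, x): utility of decision d when the non-decision parents of U take value x
    u = [[random.random() for _ in range(N_STATES)] for _ in range(N_DECISIONS)]
    # P(x | pa_D): one distribution over X per context pa_D
    p = []
    for _ in range(N_CONTEXTS):
        w = [random.random() for _ in range(N_STATES)]
        s = sum(w)
        p.append([wi / s for wi in w])
    return u, p

def optimal_policy(u, p):
    """Return, per context, the set of decisions maximizing expected utility."""
    policy = []
    for pa in range(N_CONTEXTS):
        eu = [sum(p[pa][x] * u[d][x] for x in range(N_STATES)) for d in range(N_DECISIONS)]
        best = max(eu)
        policy.append([d for d in range(N_DECISIONS) if eu[d] == best])
    return policy

ties = 0
for _ in range(1000):
    u, p = random_world()
    if any(len(best) > 1 for best in optimal_policy(u, p)):
        ties += 1
print("worlds with a tied optimal decision:", ties)
```

With continuous random parameters an exact tie essentially never occurs, so the optimal policy is a deterministic map in every sampled world.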
Then, somehow our information-eliciting procedure is going to have to exploit the fact that, given a PaD=paD, as we change the intervention σ, the optimal decision prescribed changes from d to d′.
To make this possible, we want to rule out the existence of a decision that is universally optimal across all inputs to the utility node, because then no intervention would yield a change in the output of the oracle.
In math: a d∈dom(D) such that for all possible inputs x to the utility node (other than the decision), d maximizes U, i.e. d∈argmaxd′U(d′,x) for all x∈dom(X), where X=PaU∖{D}.
Recall the first consequence of the two assumptions: there does not exist a decision d∈dom(D) that is universally optimal, i.e. d∈argmaxd′U(d′,x) for all x∈dom(X), where X=PaU∖{D}.
This implies that there is at least one x′ where the optimal decision d′ differs from d!
But the worry is that such x′ will be incompatible with PaD=paD.
So, we will restrict ourselves to considering only σ that mask the inputs of D, letting D depend only on Pa′D⊆PaD such that Pa′D∩PaU=∅.
Then, given σ, there is an optimal decision associated with some Pa′D=pa′D. By the earlier argument, if we consider any intervention σ′ that additionally performs do(X=x′), it would prescribe a different optimal decision.
And to operationalize "as we change the intervention," we define a mixed local intervention ~σ(q)=qσ+(1−q)σ′.
As q increases from 0 to 1, ~σ(q) interpolates from σ′ to σ, so the decision the oracle prescribes (under Pa′D=pa′D) changes along the way. Label the decisions encountered, in order, as d1 (at q=0), then possibly some intermediate decisions (say d2 and d3 for the current example), ending at d4 (at q=1).
Note that once you switch your decision from d to d′ as you increase q, you will never encounter d again, because E[U∣paD,do(D=d);~σ(q)] is linear in q. Namely:
$$\mathbb{E}[U\mid pa_D, do(D=d); \tilde\sigma(q)] = q\,\mathbb{E}[U\mid pa_D, do(D=d); \sigma] + (1-q)\,\mathbb{E}[U\mid pa_D, do(D=d); \sigma'] = q U_1 + (1-q) U_2 = U_2 + q(U_1 - U_2).$$
The diagram below makes it clearer why linearity implies the same decision never gets chosen twice: the oracle can be thought of as attaining the upper envelope, denoted by the dotted line.
Let qcrit represent the value of q at which the optimal decision switches from some d3 (may or may not be d1) to d4.
Insight: qcrit is interesting. It's a behavioral property of our oracle, meaning we can estimate it by repeatedly sampling it (across random samples of q∈[0,1]). But it can also probably be expressed in closed-form in terms of some parameters of the environment (by just expanding out the definition of expected utility). So qcrit is a bridge that lets us infer properties about the environment from the oracle.
Let's derive a closed-form expression of qcrit. Let R=C∖Pa′D.
Detailed Proof
qcrit is the value that satisfies
$$\mathbb{E}[U\mid pa'_D, do(D=d_3); \tilde\sigma(q_{\text{crit}})] = \mathbb{E}[U\mid pa'_D, do(D=d_4); \tilde\sigma(q_{\text{crit}})]$$
Expanding out the left-hand side:
$$\mathbb{E}[U\mid pa'_D, do(D=d_3); \tilde\sigma(q_{\text{crit}})] = \frac{1}{P(pa'_D; \tilde\sigma(q_{\text{crit}}))}\sum_r U(d_3,x)\,P(r, pa'_D; \tilde\sigma(q_{\text{crit}})) = \frac{1}{P(pa'_D; \tilde\sigma(q_{\text{crit}}))}\sum_r U(d_3,x)\left[q_{\text{crit}}\,P(r,pa'_D;\sigma) + (1-q_{\text{crit}})\,P(r,pa'_D;\sigma')\right]$$
Expanding out the difference of both sides and setting it to zero:
$$\frac{1}{P(pa'_D; \tilde\sigma(q_{\text{crit}}))}\sum_r \Big( q_{\text{crit}}\,P(r,pa'_D;\sigma)\,[U(d_3,x)-U(d_4,x)] + (1-q_{\text{crit}})\,P(r,pa'_D;\sigma')\,[U(d_3,x)-U(d_4,x)] \Big) = 0$$
Now we solve for qcrit:
$$q_{\text{crit}}\sum_r P(r,pa'_D;\sigma)[U(d_3,x)-U(d_4,x)] = -(1-q_{\text{crit}})\sum_r P(r,pa'_D;\sigma')[U(d_3,x)-U(d_4,x)] = -(1-q_{\text{crit}})\,[U(d_3,x')-U(d_4,x')]$$
$$\frac{q_{\text{crit}}}{1-q_{\text{crit}}} = \frac{-[U(d_3,x')-U(d_4,x')]}{\sum_r P(r,pa'_D;\sigma)[U(d_3,x)-U(d_4,x)]}$$
$$q_{\text{crit}} = \left(1 - \frac{\sum_r P(r,pa'_D;\sigma)[U(d_3,x)-U(d_4,x)]}{U(d_3,x')-U(d_4,x')}\right)^{-1}$$
Again, qcrit can be estimated from sampling the oracle. We know the denominator because we assume the knowledge of U.
Therefore, the oracle lets us calculate ∑rP(r,pa′D;σ)[U(d3,x)−U(d4,x)], the difference in expected utility of some two decisions given some context.
Restating our chain of inquiry as a lemma:
(Lemma 4) Given σ that masks the inputs such that Pa′D∩PaU=∅, ∀pa′D∃d,d′ such that we can approximate ∑rP(r,pa′D;σ)[U(d,x)−U(d′,x)], where R=C∖Pa′D and X=PaU∖{D}.
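To make the switch-point idea concrete, here is a toy sketch (my own construction with hypothetical numbers; the "oracle" is just a stand-in that attains the upper envelope of two lines): we locate qcrit by bisection over q, then recover the hidden quantity Q via the closed form above.

```python
# Toy model: Δ_q = Δ0 + q(Δ1 − Δ0) is the expected-utility gap between d3 and d4
# under the mixed intervention ~σ(q). The oracle picks d3 while Δ_q > 0.
DELTA0, DELTA1 = 2.0, -1.5   # hypothetical values; Δ0 > 0 > Δ1, so a switch exists

def oracle(q):
    """Black-box 0-optimal behaviour: d3 while Δ_q > 0, else d4."""
    return "d3" if DELTA0 + q * (DELTA1 - DELTA0) > 0 else "d4"

def estimate_qcrit(tol=1e-12):
    # Bisection on the oracle's output: oracle(0) == 'd3', oracle(1) == 'd4'.
    lo, hi = 0.0, 1.0
    while hi - lo > tol:
        mid = (lo + hi) / 2
        if oracle(mid) == "d3":
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2

q_crit = estimate_qcrit()
Q = DELTA0 * (1 - 1 / q_crit)  # recovers Σ_r P(r, pa'_D; σ)[U(d3,x) − U(d4,x)] = Δ1
print(round(q_crit, 6), round(Q, 6))
```

Note that only the oracle's outputs were used: the closed form turns the behavioral quantity qcrit into the environmental quantity Q.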
Identification via Induction
By the Unmediated Decision Task assumption, we see that the ancestors of U look like the following. We notice that there are two types of paths to consider in our induction argument.
Let's first consider the first type, Ck→Ck−1→⋯→C1, where C1∈PaU, C1≠D.
We first define the following variables:
X=PaU∖{D}
Z=AncU∖PaD
R=C∖Pa′D
Y=C∖{C1,…,Ck}
Assume Pa1,…,Pak−1 are known, and P(Ci|Pai) are known, for i∈{1,…,k−1}.
The claim to prove is that, given these are known, we can identify the conditional probability distribution P(Ck|Pak).
Assume we have some σ. We want to identify P(Ck|Pak), and we have to somehow use the policy oracle for that.
Recall from Lemma 4 that given an intervention σ such that Pa′D∩PaU=∅, for all values of pa′D, there exists two different decisions d and d′ such that ∑rP(r,pa′D;σ)[U(d,x)−U(d′,x)] can be identified.
The trick is in setting σ such that it makes this sum contain P(Ci=ci|Pai=pai) terms for all 1≤i≤k, for arbitrary choices of ci and pai.
Let's think through how we might discover the right constraints on σ.
Constraint 1: σ will remove D's parents
Since we want Pa′D∩PaU=∅, let's just choose σ such that Pa′D=∅ (hence R=C∖Pa′D=C).
Constraint 2: σ fixes the environment variables other than those in the path.
Note the following: since the value of ∑cP(c;σ)[U(d,x)−U(d′,x)] can be computed, if it can be expressed in terms of P(c1∣pa1),…,P(ck−1∣pak−1),P(ck∣pak), then we can solve for P(ck∣pak), since by the induction hypothesis all the other terms are known. Note that we also somehow have to figure out what the set Pak is, too.
Expanding out the above sum will give us some clue as to what further constraints we must impose on σ in order for the sum to be expressed that simply:
$$P(C=c;\sigma) = \prod_{C_j\in C} P(C_j=c_j\mid Pa_j=pa_j;\sigma) = P(c_1\mid pa_1;\sigma)\cdots P(c_k\mid pa_k;\sigma)\prod_{C_j\in Y}P(c_j\mid pa_j;\sigma)$$
How do we choose σ such that
$$\sum_c P(C=c;\sigma)\,[*] = \sum_c \Big( P(c_1\mid pa_1;\sigma)\cdots P(c_k\mid pa_k;\sigma)\prod_{C_j\in Y}P(c_j\mid pa_j;\sigma)\Big)[*]$$
becomes
$$\sum_{c_1}\cdots\sum_{c_k} P(c_1\mid pa_1;\sigma)\cdots P(c_k\mid pa_k;\sigma)\,[*]$$
for arbitrary choices of pa1,…,pak−1,pak?
Note that setting Y to a constant will:
make ∏Cj∈YP(cj∣paj;σ) always have values of zero in all settings of c except one, in which it will evaluate to 1.
also set the values of Pak to a constant, among others - even though we don't yet know exactly which variables belong to Pak.
Thus such an intervention immediately gets rid of the ∏Cj∈Y terms as we sum across c, while still letting us arbitrarily control the values of Pa1∖{C2},…,Pak−1∖{Ck}, and Pak (among other variables in Y).
So constraint 2: σ contains do(Y=y) (such that values of Y should be compatible with the values of PaD∖Pa′D that are set earlier in constraint 1.)
Then, we have the following expression:
$$\sum_c P(c;\sigma)\,[U(d,x)-U(d',x)] = \sum_{c_1,\dots,c_k} P(c_1\mid pa_1;\sigma)\cdots P(c_k\mid pa_k;\sigma)\,[U(d,x)-U(d',x)]$$
So far, we haven't intervened on {C1,…,Ck}. So, P(ci|pai;σ)=P(ci|pai) for pai compatible with the value ci+1 (if applicable) and σ, further simplifying the expression:
$$\sum_c P(c;\sigma)\,[U(d,x)-U(d',x)] = \sum_{c_1,\dots,c_k} P(c_1\mid pa_1)\cdots P(c_k\mid pa_k)\,[U(d,x)-U(d',x)]$$
But this isn't yet solvable. By induction hypothesis we know P(ci|pai) for all values of ci and pai, and we know the value of the left-hand side. This equation then involves Val(Ck)−1 unknowns.
An obvious fix, then, is an intervention that effectively reduces Val(Ck) to 2, which brings us to the third constraint:
Constraint 3: Let σ contain a local intervention making Ck a binary variable
Specifically, let σ contain do(Ck=f(Ck)), where
$$f(C_k) = \begin{cases} c'_k & \text{if } C_k = c_k \\ c''_k & \text{otherwise} \end{cases}$$
This effectively makes Ck a binary variable. Precisely:
$$P(C_k=c_k\mid Pa_k=pa_k;\sigma) = \sum_{\tilde c_k\,:\,f(\tilde c_k)=c_k} P(C_k=\tilde c_k\mid Pa_k=pa_k) = \begin{cases} P(C_k=c'_k\mid Pa_k=pa_k) & \text{if } c_k = c'_k \\ 1-P(C_k=c'_k\mid Pa_k=pa_k) & \text{if } c_k = c''_k \end{cases}$$
and now the equation can be solved.
Let Qk=∑cP(c;σ)[U(d,x)−U(d′,x)], which can be written ∑ckP(ck|pak;σ)β(ck), where β(ck)=∑c1,…,ck−1P(c1|pa1)…P(ck−1|pak−1)[U(d,x)−U(d′,x)].
The earlier σ lets us simplify Qk as P(c′k|pak;σ)β(c′k)+(1−P(c′k|pak;σ))β(c′′k). Thus
$$P(c'_k\mid pa_k) = \frac{Q_k - \beta(c''_k)}{\beta(c'_k) - \beta(c''_k)}.$$
We know Qk (via the policy oracle), and we know the values of β (via the induction hypothesis).
But important subtlety here: remember that we don't actually know Pak yet. The pak in the above expression P(c′k|pak) is meant to be understood as the implicit assignment of values to the (yet unknown to us) Pak by the means of do(Y=y) in σ.
So, by performing a set of interventions that fix all but one of the variables of Y, one can discover which variables Ck responds to (i.e., for which the values of P(c′k|pak) change), and hence figure out the set Pak.
Then, by varying the choices of pak,c′k, and c′′k, we can identify P(Ck∣Pak) completely.
The base case of k=1 is clear, since
$$P(c'_1\mid y) = \frac{Q_1 - \beta(c''_1)}{\beta(c'_1) - \beta(c''_1)}$$
where β(c′1) and β(c′′1) are of the form U(d,x)−U(d′,x), which is known, and so is Q1 using the oracle.
To recap, our choice of σ is a local intervention such that:
masks all input to D, i.e. Pa′D=∅
fixes rest of the nodes in Y to a constant
does a local intervention do(Ck=f(ck)) making Ck into a binary variable
and we have shown that this intervention lets us identify P(Ck|Pak) for all Ck along the path Ck→Ck−1→⋯→C1, where C1∈PaU, C1≠D.
Similar arguments can be used to prove the same for paths of the second type, Ck→Ck−1→⋯→C1, where C1∈PaD.
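To make the inversion at the heart of the induction step concrete, here is a toy numeric sketch (the numbers are hypothetical, not from the paper): under the three constraints on σ, the oracle-derived quantity Qk collapses to a convex combination of two known β values, which we then invert.

```python
# Induction step, toy version:
# Q_k = P(c'_k|pa_k)·β(c'_k) + (1 − P(c'_k|pa_k))·β(c''_k), solved for P(c'_k|pa_k).
p_true = 0.3                               # the unknown CPD entry P(c'_k | pa_k)
beta_prime, beta_doubleprime = 1.2, -0.7   # β values, known by induction hypothesis

# What the oracle-based procedure would hand us:
Q_k = p_true * beta_prime + (1 - p_true) * beta_doubleprime

# Invert the convex combination to recover the CPD entry:
p_recovered = (Q_k - beta_doubleprime) / (beta_prime - beta_doubleprime)
print(round(p_recovered, 9))
```

Varying the choices of pak, c′k, and c′′k repeats this inversion entry by entry, which is exactly how P(Ck∣Pak) gets identified completely.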
Proof of the Approximate Case
Now, we will extend the proof to the approximate case (Theorem 2):
For almost all worlds (G,P) satisfying assumptions 1 and 2 plus some new assumptions (explained below), we can identify the directed acyclic graph G and the joint distribution P over some subset of variables upstream of U, and the estimation error of each of the conditional distributions scales linearly with δ.
New Assumptions
Unless I'm mistaken here and these can actually be derived from the earlier two assumptions (Unmediated Decision Task, Domain Dependence), here are the three new conditions that the paper implicitly assumes:
3) δ-optimal policies are (still) almost always deterministic
The earlier proof of determinism doesn't go through in the approximate case, but the paper implicitly assumes the policy oracle still (almost always) returns an output deterministically.
4) Uniform δ regret
We say πσ is δ-optimal if Eπ∗[U]−Eπσ[U]≤δ.
We say πσ is uniformly δ-optimal if δ(paD)≤δ for all paD, where we define δ(paD):=Eπ∗[U|paD]−Eπσ[U|paD].
Note that uniform δ-optimality is a stronger condition than δ-optimality, in the sense that the former implies the latter.
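This implication is worth spelling out. Since PaD contains only non-descendants of D, the marginal P(paD) does not depend on the policy, so averaging the per-context regrets gives:

```latex
% Uniform δ-optimality implies δ-optimality:
% average the per-context regrets δ(pa_D) over the policy-independent P(pa_D).
\mathbb{E}_{\pi^*}[U] - \mathbb{E}_{\pi_\sigma}[U]
  = \sum_{pa_D} P(pa_D)\,\delta(pa_D)
  \le \delta \sum_{pa_D} P(pa_D)
  = \delta.
```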
5) Shape of the δ-optimal decision boundary
The left diagram is the ground truth for how the expected utility (under context paD) of various decisions change as you increase q from 0 to 1. Then, the right diagram shows the decision boundary for the 0-optimal oracle, whose decision boundaries must exactly follow the intersection points of the left diagram's lines.
The paper then assumes that the δ-optimal oracle's decision boundaries must be simply a slightly shifted version of the 0-optimal oracle's decision boundaries, like the right diagram. A priori, there's no reason for the boundaries to look like this, e.g., it can look very complicated, like the left diagram. But the paper implicitly assumes this.
Let's now proceed to the proof. The subsections parallel that of the optimal oracle case.
Load-Bearing part of Oracle Use
Our goal is to derive ∑rP(r,pa′D;σ)[U(d3,x)−U(d4,x)], which we call Q.
Recall from earlier that in the optimal oracle case:
$$q_{\text{crit}} = \left(1-\frac{\sum_r P(r,pa'_D;\sigma)[U(d_3,x)-U(d_4,x)]}{U(d_3,x')-U(d_4,x')}\right)^{-1} = \left(1-\frac{Q}{\Delta_0}\right)^{-1}$$
where we define Δq as follows:
$$\Delta_q := \mathbb{E}[U\mid pa_D, do(D=d_3);\tilde\sigma(q)] - \mathbb{E}[U\mid pa_D, do(D=d_4);\tilde\sigma(q)] = q\Delta_1 + (1-q)\Delta_0 = \Delta_0 + q(\Delta_1-\Delta_0)$$
Notice that qcrit is the unique solution to Δq=0.
Also note that Q=Δ0(1−1/qcrit). We know the value of Δ0=U(d3,x′)−U(d4,x′). In the optimal oracle case, recall that qcrit can be estimated by Monte Carlo sampling of the oracle.
But the problem with a δ-optimal oracle is that this only yields a biased estimate, which we call ~q.
Using Q=Δ0(1−1/qcrit) but naively substituting qcrit with the biased estimate, we get a biased estimate for Q that we call ~Q=Δ0(1−1/~q).
Our aim, then, is to bound the quality of the estimate ∣∣Q−~Q∣∣ with the bound only involving non-estimate terms, like δ,Q,qcrit, and Δ0.
Expanding out, Q−~Q=Δ0((1−1/qcrit)−(1−1/~q))=Δ0(1/~q−1/qcrit). That ~q is an estimate term, which we want to remove by bounding it via a term involving non-estimate terms. How?
First, we have yet to exploit any behavioral properties of the oracle related to ~q. What are those? By definition, the oracle chooses d3 at ~q−ϵ and d4 at ~q+ϵ for very small ϵ, and the uniform δ regret condition bounds the regret of each of these choices by δ. Taking ϵ to 0, and assuming continuity, we can subtract the two to get
$$-\delta \le \mathbb{E}[U\mid pa_D, do(D=d_3);\tilde\sigma(\tilde q)] - \mathbb{E}[U\mid pa_D, do(D=d_4);\tilde\sigma(\tilde q)] \le \delta.$$
In other words, −δ≤Δ~q≤δ, or −δ≤Δ0+~q(Δ1−Δ0)≤δ.
Because Δqcrit=Δ0+qcrit(Δ1−Δ0)=0, we can rewrite the inequality as −δ≤Δ0(1−~q/qcrit)≤δ. Expanding out and rearranging, we find that Δ0−(Δ0+δ)/~q ≤ Δ0(1−1/qcrit) ≤ Δ0−(Δ0−δ)/~q.
Substituting this into the expansion of Q−~Q, we obtain the following simple bound: |Q−~Q|≤δ/~q.
Finally, ~q in the denominator can be eliminated as follows:
$$|Q-\tilde Q| \le \frac{\delta}{\tilde q} = \delta\left(1-\frac{\tilde Q}{\Delta_0}\right) \le \delta\left(1-\frac{Q+\delta/\tilde q}{\Delta_0}\right) = \delta\left(1-\frac{Q}{\Delta_0}\right) - \frac{\delta^2}{\tilde q\,\Delta_0} \le \delta\left(1-\frac{Q}{\Delta_0}\right) - \frac{\delta^2}{q_{\text{crit}}(\Delta_0-\delta)} = \delta\left(1-\frac{Q}{\Delta_0}\right) - \frac{\delta^2}{q_{\text{crit}}\,\Delta_0} - \frac{\delta^3}{q_{\text{crit}}\,\Delta_0^2} + O(\delta^4)$$
where the last step comes from Taylor expanding 1/(Δ0−δ) around δ=0 (truncated at fourth order), valid for δ<|Δ0|.
Or more simply, |Q−~Q|≤δ(1−Q/Δ0)+O(δ²). The error term is linear in δ for small values of δ.
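This linear-in-δ behaviour can be checked numerically. Below is a toy sketch (my own construction: a single Δq line with hypothetical Δ0, Δ1, not the paper's code) verifying the intermediate bound |Q−~Q|≤δ/~q across a range of δ-consistent switch points:

```python
# Toy Δ line: Δ_q = Δ0 + q(Δ1 − Δ0), with exact switch at q_crit where Δ_q = 0.
DELTA0, DELTA1 = 2.0, -1.5
q_crit = DELTA0 / (DELTA0 - DELTA1)     # solves Δ_q = 0
Q = DELTA0 * (1 - 1 / q_crit)           # the exact-oracle answer (= Δ1)

delta = 0.05
for s in (-1.0, -0.5, 0.0, 0.5, 1.0):
    # pick a switch point q~ whose gap at the switch is exactly s·δ ∈ [−δ, δ],
    # i.e. any switch point a uniformly δ-optimal oracle could exhibit
    q_tilde = q_crit * (1 - s * delta / DELTA0)
    Q_tilde = DELTA0 * (1 - 1 / q_tilde)          # naive plug-in estimate
    assert abs(Q - Q_tilde) <= delta / q_tilde + 1e-12
print("bound |Q - Q~| <= δ/q~ holds at all toy switch points")
```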
Identification via Induction
The argument is basically the same as that of the optimal case, except needing to incorporate error terms.
The exact case's induction hypothesis was that Pai and P(Ci|Pai) are known for i∈{1,…,k−1}. Then, we showed that using a specific choice of σ, we can derive the relation
$$P(c'_k\mid y) = \frac{Q_k - \beta(c''_k)}{\beta(c'_k) - \beta(c''_k)}$$
where all the terms on the right-hand side are known.
Then, for the approximate case's induction hypothesis, instead assume that P(Ci|Pai) are known for i∈{1,…,k−1}, up to O(δ). We will show that this implies the knowledge of P(Ck|Y) up to O(δ). Let's denote the approximation we have ^P, so ^P(Ci|Y)=P(Ci|Y)+O(δ).
Important subtlety here: we don't assume we know Pai. "Knowing P(Ci|Pai)" is meant to be understood as knowing the values of P(Ci;σ) - where σ contains do(Y=y) hence implicitly intervening on Pai - for all values of Ci and Y.
Let ^β(ck):=∑c1,…,ck−1^P(c1|y)…^P(ck−1|y)[U(d,x)−U(d′,x)]. Because it is a sum of products of factors that are each accurate up to O(δ), it differs from β(ck) by O(δ) overall.
Hence define
$$\hat P(c'_k\mid y) := \frac{\hat Q_k - \hat\beta(c''_k)}{\hat\beta(c'_k) - \hat\beta(c''_k)}.$$
Because ^β(ck)=β(ck)+O(δ) and ^Qk=Qk+O(δ) by the earlier section, we see that
$$\hat P(c'_k\mid y) = \frac{Q_k - \beta(c''_k) + O(\delta)}{\beta(c'_k) - \beta(c''_k) + O(\delta)}.$$
Using the big-O fact that (A+O(δ))/(B+O(δ))=A/B+O(δ) (for B bounded away from zero), we thus prove
$$\hat P(c'_k\mid y) = \frac{Q_k - \beta(c''_k)}{\beta(c'_k) - \beta(c''_k)} + O(\delta) = P(c'_k\mid y) + O(\delta).$$
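For completeness, this big-O fact follows from a direct computation, valid as δ→0 provided B is bounded away from zero:

```latex
\frac{A + O(\delta)}{B + O(\delta)} - \frac{A}{B}
  = \frac{B\,(A + O(\delta)) - A\,(B + O(\delta))}{B\,(B + O(\delta))}
  = \frac{O(\delta)}{B^2 + O(\delta)}
  = O(\delta).
```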
More simply, ^P(c′k∣y)=P(c′k∣y)+O(δ) as δ goes to 0. That proves the induction step.
The base case of k=1 is once again clear, since
$$\hat P(c'_1\mid y) = \frac{\hat Q_1 - \hat\beta(c''_1)}{\hat\beta(c'_1) - \hat\beta(c''_1)}$$
where ^β(c′1) and ^β(c′′1) are of the form U(d,x)−U(d′,x), which is known, and so is ^Q1 using the oracle. This shows that it can be computed, and the earlier paragraphs show that it is accurate up to O(δ).
Identifying the Approximate Graph Structure
The above showed that we can identify P(Ci|Pai) up to O(δ), or more precisely, P(Ci;σ) up to O(δ), for all values of Ci and Y. In the optimal oracle case, this was sufficient for perfectly identifying Pai: hold all but one variable of Y fixed and observe which changes in that variable cause a change in the values of P(Ci;σ).
What is the issue with the approximate case?
First of all, we'll have to use ^P(Ci;σ) instead of P(Ci;σ).
Say we want to test whether Cj is a parent of Ck. So we have σ fix everything in Y and vary the value of Cj across the elements in its domain. Let's denote the value of ^P(Ck=ck;σ) where Cj is set to cj as ^P(ck|pak;do(Cj=cj)).
So the process is: Set ck, vary cj and see if there is a change, set a new ck, repeat.
The problem is that, because ^P is only accurate up to O(δ), we can't tell if the change is due to actual differences in the underlying P or due to the error in approximation.
The solution is to use one of the earlier explicit bounds on |Q−~Q| in terms of quantities the algorithm has access to, i.e. |Q−~Q|≤δ/~q. We can then use this bound to derive explicit upper and lower bounds for the values of ^P(ck|pak;do(Cj=cj)), which we'll call θ+ck,cj and θ−ck,cj.
And if it's the case that there exists ck such that there exists cj and c′j whose intervals [θ−ck,cj,θ+ck,cj] and [θ−ck,c′j,θ+ck,c′j] don't overlap, then we can guarantee that the change is due to actual differences in the underlying P.
This procedure lets us identify a subset of Pak, hence a subgraph of G.
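A minimal sketch of this decision rule, with hypothetical estimates and a hypothetical O(δ)-derived error bound err (my own illustration, not the paper's algorithm):

```python
# Declare that C_j influences C_k only when, for some c_k, the error intervals
# around the two estimates under do(C_j = c_j) and do(C_j = c_j') fail to overlap.
def intervals_overlap(a, b):
    return a[0] <= b[1] and b[0] <= a[1]

def detects_parent(est1, est2, err):
    """est1/est2: estimated P(c_k; ·) under the two settings of C_j; err: error bound."""
    i1 = (est1 - err, est1 + err)
    i2 = (est2 - err, est2 + err)
    return not intervals_overlap(i1, i2)

print(detects_parent(0.30, 0.70, 0.05))  # clearly separated -> True
print(detects_parent(0.48, 0.52, 0.05))  # could be estimation error -> False
```

Note the asymmetry: non-overlapping intervals certify an edge, but overlapping intervals prove nothing, which is exactly why only a subset of Pak (hence a subgraph of G) is guaranteed.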
Detailed Proof: Optimal Policies are Almost Always Deterministic
Suppose that, given a context, two decisions had the same expected utility:
$$\mathbb{E}[U\mid Pa_D=pa_D, do(D=d);\sigma] = \mathbb{E}[U\mid Pa_D=pa_D, do(D=d');\sigma]$$
Recall the definition of this expectation: we literally take the expectation over all the values that the ancestors of U could take. Let Z=AncU∖PaD and X=PaU∖{D}:
$$\mathbb{E}[U\mid Pa_D=pa_D, do(D=d);\sigma] = \sum_z U(d,x)\,P(z\mid pa_D, do(D=d);\sigma)$$
where P(z|paD,do(D=d);σ) is 0 if z is incompatible with σ.
Note P(paD|do(D=d);σ)=P(paD|σ) and P(z,paD|do(D=d);σ)=P(z,paD;σ), because do(D=d) only has an effect on its descendants, which PaD isn't part of, and neither is Z∪PaD=AncU.
And we're curious about the case where the difference in expected utility is zero:
$$\mathbb{E}[U\mid Pa_D=pa_D,do(D=d);\sigma] - \mathbb{E}[U\mid Pa_D=pa_D,do(D=d');\sigma] = \sum_z \left( U(d,x)\frac{P(z,pa_D;\sigma)}{P(pa_D;\sigma)} - U(d',x)\frac{P(z,pa_D;\sigma)}{P(pa_D;\sigma)} \right) = \sum_z \big(U(d,x)-U(d',x)\big)\frac{P(z,pa_D;\sigma)}{P(pa_D;\sigma)} = 0$$
Suppose that σ=do(C1=f1(c1),…,Cn=fn(cn)) without loss of generality; then each term P(z,paD;σ) factorizes into a product of the network's parameters P(Ci=ci∣Pai=pai).
Long story short, this is a polynomial constraint on the parameters P(Ci=ci|Pai=pai) of the network, and the solution set of a polynomial equation has measure zero (intuitively, because for a polynomial to be exactly zero, its arguments must be precisely aligned, which is rare).
Specifically: given a Bayesian Network G over N nodes, its CPDs are θP={P(vi∣pai) : i∈{1,…,N}, vi∈{0,…,dimi−2}, pai∈dom(Pai)}. Since we're assuming all variables are discrete, this is a finite number of values parameterizing G, each taking a value between 0 and 1.
Suppose we find that they should satisfy some polynomial constraint: f(θP)=0.
Then we can reasonably claim that this is extremely unlikely to happen, because in the space of all possible parameterizations [0,1]|θP|, solutions to a polynomial constraint happen in a measure zero set.
This post was written during Alex Altair's agent foundations fellowship program, funded by LTFF. Thank you Alex Altair, Alfred Harwood, Daniel C for feedback and comments.
This is a post explaining the proof of the paper Robust Agents Learn Causal World Model in detail. Check the previous post in the sequence for a higher-level summary and discussion of the paper, including an explanation of the basic setup (like terminologies and assumptions) which this post will assume from now on.
Recalling the Basic Setup
Let's recall the basic setup (again, check the previous post for more explanation):
Assumptions
Also recall the following assumptions:
1) Unmediated Decision Task states that DesD∩AncU=∅. This is pretty major.
2) Domain dependence states that there exists distributions over the chance variables P(C) and P′(C) (compatible with M) such that argmaxπEπP[U]≠argmaxπEπP′[U].
These together imply:
Now, with the basic setup of the paper recalled, let's prove the main theorems.
Proof of the Exact Case
We will first prove Theorem 1:
I will attempt to present the proof in a way that focuses on how one could've discovered the paper's results by themselves, emphasizing intuitions and how trying to formalize them naturally constrains the solutions or assumptions we must use.
Load-Bearing part of Oracle Use
Somehow we're going to have to elicit information out of the policy oracle.
Recall that the oracle is a function that maps an intervention to a policy, which is a conditional probability distribution πσ(D∣PaD). It can be shown that if the oracle is optimal, this distribution is (almost always) a deterministic map. The argument goes like:
Then, somehow our information-eliciting procedure is going to have to exploit the fact that, given a PaD=paD, as we change the intervention σ, the optimal decision prescribed changes from d to d′.
To make this possible, we want to rule out the existence of a decision that is universally optimal across all inputs to the utility node, because then no intervention would yield a change in the output of the oracle.
Recall the first consequence of the two assumptions we had: There does not exist a decision d∈dom(D) that is optimal, i.e. argmaxdU(d,x) across all x∈X=PaU∖{D}.
This implies that there is at least one x′ where the optimal decision d′ differs from d!
But the worry is that such x′ will be incompatible with PaD=paD.
So, we will restricted to only considering σ that masks the inputs of D by only letting D depend on Pa′D⊆PaD such that Pa′D∩PaU=∅.
Then, given σ, let d1 denote the optimal decision associated with some Pa′D=pa′D. Then, by the earlier argument, if we consider any intervention σ′ that does do(X=x′), then it would have a different optimal decision, call it d2.
And to operationalize "as we change the intervention," we define a mixed local intervention ~σ(q)=qσ+(1−q)σ′.
When q=0, the policy oracle (under Pa′D=pa′D) would prescribe d1, and for q=1, it would prescribe d4. There may of course be other intermediate optimal decisions along the way as you slowly increase q from 0 to 1 - say, d2 and d3 for the current example.
Note that once you switch your decision from d to d′ as you increase q, you will never encounter d again because of linearity of E[U∣paD,do(D=d);~σ(q)]. Namely:
E[U∣paD,do(D=d),~σ(q)]=qE[U∣paD,do(D=d),σ]+(1−q)E[U∣paD,do(D=d),~σ]=qU1+(1−q)U2=U2+q(U1−U2).
The diagram below makes it more clear why linearity implies the same decision never gets chosen twice. The oracle can be thought of as attaining the upper envelope, as denoted in dotted line.
Let qcrit represent the value of q at which the optimal decision switches from some d3 (may or may not be d1) to d4.
Insight: qcrit is interesting. It's a behavioral property of our oracle, meaning we can estimate it by repeatedly sampling it (across random samples of q∈[0,1]). But it can also probably be expressed in closed-form in terms of some parameters of the environment (by just expanding out the definition of expected utility). So qcrit is a bridge that lets us infer properties about the environment from the oracle.
Let's derive a closed-form expression of qcrit. Let R=C∖Pa′D.
Detailed Proof
qcrit is a value that satisfiesE[U|pa′D,do(D=d3);~σ(qcrit)]=E[U|pa′D,do(D=d4);~σ(qcrit)]
Expanding out the left-hand side:E[U|pa′D,do(D=d3);~σ(qcrit)]=1P(pa′D;~σ(qcrit))∑rU(d3,x)P(r,pa′D;~σ(qcrit))=1P(pa′D;~σ(qcrit))∑rU(d3,x)[qcritP(r,pa′D;σ)+(1−qcrit)P(r,pa′D;σ′)]
Expanding out the difference of both sides and setting it to zero:
E[U|pa′D,do(D=d3);~σ(qcrit)]−E[U|pa′D,do(D=d4);~σ(qcrit)]
=1P(pa′D;~σ(qcrit))∑rqcritP(r,pa′D;σ)[U(d3,x)−U(d4,x)]+(1−qcrit)P(r,pa′D;σ′)[U(d3,x)−U(d4,x)]=0
Now we solve for qcrit:∑rqcritP(r,pa′D;σ)[U(d3,x)−U(d4,x)]=−(1−qcrit)∑rP(r,pa′D;σ′)[U(d3,x)−U(d4,x)]=−(1−qcrit)[U(d3,x′)−U(d4,x′)]
qcrit1−qcrit=−[U(d3,x′)−U(d4,x′)]∑rP(r,pa′D;σ)[U(d3,x)−U(d4,x)]
qcrit=(1−∑rP(r,pa′D;σ)[U(d3,x)−U(d4,x)]U(d3,x′)−U(d4,x′))−1
qcrit=(1−∑rP(r,pa′D;σ)[U(d3,x)−U(d4,x)]U(d3,x′)−U(d4,x′))−1
Again, qcrit can be estimated from sampling the oracle. We know the denominator because we assume the knowledge of U.
Therefore, the oracle lets us calculate ∑rP(r,pa′D;σ)[U(d3,x)−U(d4,x)], the difference in expected utility of some two decisions given some context.
Restating our chain of inquiry as a lemma:
Identification via Induction
By the Unmediated Decision Task assumption, we see that the ancestors of U look like the following. We notice that there are two types of paths to consider in our induction argument.
Let's first consider the first type, Ck→Ck−1→⋯→C1, where C1∈PaU, C1≠D.
We first define the following variables:
Assume Pak−1,…,Pa1 are known, and P(Ci|Pai) are known.
The claim to prove is that, given these are known, we can identify the conditional probability distribution P(Ck|Pak).
Assume we have some σ. We want to identify P(Ck|Pak), and we have to somehow use the policy oracle for that.
Recall from Lemma 4 that given an intervention σ such that Pa′D∩PaU=∅, for all values of pa′D, there exists two different decisions d and d′ such that ∑rP(r,pa′D;σ)[U(d,x)−U(d′,x)] can be identified.
The trick is in setting σ such that it makes this sum contain P(Ci=ci|Pai=pai) terms for all 1≤i≤k, for arbitrary choices of ci and pai.
Let's think through how we might discover the right constraints on σ.
Constraint 1: σ will remove D's parents
Since we want Pa′D∩PaU=∅, let's just choose σ such that Pa′D=∅ hence R=C∖Pa′D=C).
Constraint 2: σ fixes the environment variables other than those in the path.
Note the following: Since ∑cP(c;σ)[U(d,x)−U(d′,x)]'s value can be computed, if it can be expressed in terms of P(c1∣pa1)…P(ck−1∣pak−1),P(ck∣pak) that would let us solve for P(ck∣pak), since by the induction hypothesis all the terms except it are known. Note that we also somehow have to figure out what the set Pak is, too.
Expanding out the above sum will give us some clue as to what further constraints we must impose on σ in order for the sum to be expressed that simply:
P(C=c;σ)=∏Cj∈CP(Cj=cj|Paj=paj;σ)=P(c1|pa1;σ)…P(ck|pak;σ)∏Cj∈YP(cj|paj;σ)
How do we choose σ such that
∑cP(C=c;σ)[∗]=∑c⎛⎝P(c1|pa1;σ)…P(ck|pak;σ)∏Cj∈YP(cj|paj;σ)⎞⎠[∗]
becomes
∑c1…∑ckP(c1|pa1;σ)…P(ck|pak;σ)[∗]
for arbitrary choices of pa1,…,pak−1,pak?
Note that setting Y to a constant will:
Thus such intervention immediately gets rid of the ∏Cj∈Y terms as we sum across c, while being able to arbitrarily control the values of Pa1∖{C2},…,Pak−1∖{Ck}, and Pak (among other variables in Y).
So constraint 2: σ contains do(Y=y) (such that values of Y should be compatible with the values of PaD∖Pa′D that are set earlier in constraint 1.)
Then, we have the following expression:
∑cP(c;σ)[U(d,x)−U(d′,x)]=∑c1,…,ckP(c1|pa1;σ)…P(ck|pak;σ)[U(d,x)−U(d′,x)]
So far, we haven't intervened in {C1,…,Ck}. So, P(ci|pai;σ)=P(ci|pai) for pai compatible with the value ci+1 (if applicable) and σ, further simplifying the expression:
∑cP(c;σ)[U(d,x)−U(d′,x)]=∑c1,…,ckP(c1|pa1)…P(ck|pak)[U(d,x)−U(d′,x)]
But this isn't yet solvable. By induction hypothesis we know P(ci|pai) for all values of ci and pai, and we know the value of the left-hand side. This equation then involves Val(Ck)−1 unknowns.
A fix then, is obvious an intervention that effectively sets Val(Ck) to 2, which brings us to the third constraint:
Constraint 3: Let σ contain a local intervention making Ck a binary variable
Specifically, let σ contain do(Ck=f(ck)), where f(Ck)={c′kif Ck=ckc′′kotherwise
This effectively makes Ck a binary variable. Precisely:
P(Ck=ck∣Pak=pak;σ)=∑c′k∣f(c′k)=ckP(Ck=c′k∣Pak=pak)={P(Ck=c′k∣Pak=pak)if Ck=c′k,1−P(Ck=c′k∣Pak=pak)if Ck=c′′k.
and now the equation can be solved.
Let Qk=∑cP(c;σ)[U(d,x)−U(d′,x)], which can be written ∑ckP(ck|pak;σ)β(ck), where β(ck)=∑c1,…,ck−1P(c1|pa1)…P(ck−1|pak)[U(d,x)−U(d′,x)].
The earlier σ lets us simplify Qk as P(c′k|pak;σ)β(c′k)+(1−P(c′k|pak;σ))β(c′′k). Thus P(c′k|pak)=Qk−β(c′′k)β(c′k)−β(c′′k). We know Qk (via the policy oracle), we know values of β (via the induction hypothesis).
But important subtlety here: remember that we don't actually know Pak yet. The pak in the above expression P(c′k|pak) is meant to be understood as the implicit assignment of values to the (yet unknown to us) Pak by the means of do(Y=y) in σ.
So, by performing a set of interventions that fixes all but one of the variables of Y, one can discover to which variables Ck responds to (the values of P(c′k|pak) changes), and hence figure out the Pak set.
Then, by varying the choices of pak,c′k, and c′′k, we can identify P(Ck∣Pak) completely.
The base case of k=1 is clear, since P(c′1|y)=Q1−β(c′′1)β(c′1)−β(c′′1) where β(c′1) and β(c′′1) are of the form U(d,x)−U(d′,x), which is known, and so is Q1 using the oracle.
To recap, our choice of σ is a local intervention such that:
and we have showed that this intervention lets us identify P(Ck|Pak) for all Ck along the path Ck→Ck−1→⋯→C1, where C1∈PaU, C1≠D.
Similar arguments can be used to prove the same for paths of the second type, Ck→Ck−1→⋯→C1, where C1∈PaD.
Proof of the Approximate Case
Now, we will extend the proof to the approximate case (Theorem 2):
New Assumptions
Unless I'm mistaken here and these can actually be derived from the earlier two assumptions (Unmediated Decision Task, Domain Dependence), here are the three new conditions that the paper implicitly assumes:
3) δ-optimal policies are (still) almost always deterministic
The earlier proof of determinism doesn't go through in the approximate case, but the paper implicitly assumes the policy oracle still (almost always) returns an output deterministically.
4) Uniform δ regret
We say πσ is δ-optimal if Eπ∗[U]−Eπσ[U]≤δ.
We say πσ is uniformly δ-optimal if δ(paD)≤δ for all paD, where we defineδ(paD):=Eπ∗[U|paD]−Eπσ[U|paD].
Note that uniformly δ-optimal is a stronger condition than \delta-optimal in the sense that the former implies the latter.
5) Shape of the δ-optimal decision boundary
The left diagram is the ground truth for how the expected utility (under context paD) of various decisions change as you increase q from 0 to 1. Then, the right diagram shows the decision boundary for the 0-optimal oracle, whose decision boundaries must exactly follow the intersection points of the left diagram's lines.
The paper then assumes that the δ-optimal oracle's decision boundaries must be simply a slightly shifted version of the 0-optimal oracle's decision boundaries, like the right diagram. A priori, there's no reason for the boundaries to look like this, e.g., it can look very complicated, like the left diagram. But the paper implicitly assumes this.
Let's now proceed to the proof. The subsections parallel that of the optimal oracle case.
Load-Bearing part of Oracle Use
Our goal is to derive ∑rP(r,pa′D;σ)[U(d3,x)−U(d4,x)], which we call Q.
Recall from earlier that in the optimal oracle case:
qcrit = (1 − (∑r P(r,pa′D;σ)[U(d3,x)−U(d4,x)]) / (U(d3,x′)−U(d4,x′)))^−1 = (1 − Q/Δ0)^−1
where we define Δ(q) as follows:
Δ(q) := E[U|paD,do(D=d3);~σ(q)] − E[U|paD,do(D=d4);~σ(q)] = qΔ1 + (1−q)Δ0 = Δ0 + q(Δ1−Δ0)
Notice that qcrit is the unique solution to Δ(q)=0.
Also note that Q = Δ0(1 − 1/qcrit). We know the value of Δ0 = U(d3,x′) − U(d4,x′). In the optimal oracle case, recall that qcrit can be estimated via MCMC.
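Before moving on, the relations above can be sanity-checked numerically. Since Δ(q) is linear in q, its root has the closed form qcrit = Δ0/(Δ0−Δ1), and combining the two displayed formulas forces Q = Δ1. A minimal sketch (my own illustration; the example values are made up, not from the paper):

```python
# Delta(q) = q*Delta1 + (1-q)*Delta0 is linear in q, so its unique root is
# q_crit = Delta0 / (Delta0 - Delta1), and Q = Delta0 * (1 - 1/q_crit).

def Delta(q, Delta0, Delta1):
    """Expected-utility gap between d3 and d4 under the mixing intervention ~sigma(q)."""
    return q * Delta1 + (1 - q) * Delta0

Delta0, Delta1 = 0.6, -0.9           # illustrative values with a sign change on (0, 1)
q_crit = Delta0 / (Delta0 - Delta1)  # root of the linear function Delta(q)
Q = Delta0 * (1 - 1 / q_crit)        # the quantity we want to identify

assert abs(Delta(q_crit, Delta0, Delta1)) < 1e-12  # Delta(q_crit) = 0
assert abs(Q - Delta1) < 1e-12                     # Q coincides with Delta1
```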
But the problem with the δ-optimal oracle is that this only yields a biased estimate, which we call ~q.
Using Q = Δ0(1 − 1/qcrit) but naively substituting qcrit with the biased estimate ~q, we get a biased estimate of Q, which we call ~Q = Δ0(1 − 1/~q).
Our aim, then, is to bound the estimation error |Q − ~Q|, with the bound only involving non-estimate terms like δ, Q, qcrit, and Δ0.
Expanding out, Q − ~Q = Δ0((1 − 1/qcrit) − (1 − 1/~q)) = Δ0(1/~q − 1/qcrit). That ~q is an estimate term, which we want to remove by bounding it via a term involving only non-estimate terms. How?
First, we have yet to exploit any behavioral property of the oracle related to ~q. What is it? By definition, the oracle chooses d3 at ~q−ϵ and d4 at ~q+ϵ for arbitrarily small ϵ. Then, the uniform δ regret condition says:
E[U|paD,do(D=d3);~σ(~q−ϵ)] ≥ E[U|paD,do(D=d4);~σ(~q−ϵ)] − δ
E[U|paD,do(D=d4);~σ(~q+ϵ)] ≥ E[U|paD,do(D=d3);~σ(~q+ϵ)] − δ
Take ϵ to 0 and, assuming continuity, combine the two inequalities to get
−δ ≤ E[U|paD,do(D=d3);~σ(~q)] − E[U|paD,do(D=d4);~σ(~q)] ≤ δ.
In other words, −δ ≤ Δ(~q) ≤ δ, or −δ ≤ Δ0 + ~q(Δ1−Δ0) ≤ δ.
Because Δ(qcrit) = Δ0 + qcrit(Δ1−Δ0) = 0, we can rewrite the inequality as −δ ≤ Δ0(1 − ~q/qcrit) ≤ δ. Expanding out and rearranging the inequalities, we find that Δ0 − (Δ0+δ)/~q ≤ Δ0(1 − 1/qcrit) ≤ Δ0 − (Δ0−δ)/~q.
Substituting this into the expansion of Q − ~Q, we obtain the following simple bound: |Q − ~Q| ≤ δ/~q.
Finally, the ~q in the denominator can be eliminated as follows:
|Q − ~Q| ≤ δ/~q = δ(1 − ~Q/Δ0)
≤ δ(1 − (Q + δ/~q)/Δ0)
= δ(1 − Q/Δ0) − δ²/(~q Δ0)
≤ δ(1 − Q/Δ0) − δ²/(qcrit(Δ0 − δ))
= δ(1 − Q/Δ0) − δ²/(qcrit Δ0) − δ³/(qcrit Δ0²) + O(δ⁴)
where the last line is via Taylor expanding 1/(Δ0 − δ) around δ=0 (arbitrarily truncated at fourth order), valid for δ < |Δ0|.
Or more simply, |Q − ~Q| ≤ δ(1 − Q/Δ0) + O(δ²). The error term is linear with respect to δ for small values of δ.
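The key bound |Q − ~Q| ≤ δ/~q can also be checked by simulation: any switch point ~q consistent with |Δ(~q)| ≤ δ should satisfy it. A sketch (my own illustration with made-up parameter values; `check_bound` is a hypothetical helper, not from the paper):

```python
import random

def check_bound(Delta0, Delta1, delta, trials=1000):
    """Sample switch points ~q consistent with |Delta(~q)| <= delta and
    verify |Q - ~Q| <= delta / ~q. Purely a numeric illustration."""
    q_crit = Delta0 / (Delta0 - Delta1)
    Q = Delta0 * (1 - 1 / q_crit)
    for _ in range(trials):
        # |Delta(~q)| <= delta is equivalent to |~q - q_crit| <= delta/|Delta1 - Delta0|
        q_tilde = q_crit + random.uniform(-1, 1) * delta / abs(Delta1 - Delta0)
        Q_tilde = Delta0 * (1 - 1 / q_tilde)  # the naive plug-in estimate
        assert abs(Q - Q_tilde) <= delta / q_tilde + 1e-12
    return True

check_bound(Delta0=0.6, Delta1=-0.9, delta=0.05)
```

(In this linear setting the bound is in fact tight exactly at the edges of the allowed band for ~q.)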
Identification via Induction
The argument is basically the same as that of the optimal case, except needing to incorporate error terms.
The exact case's induction hypothesis was that Pai and P(Ci|Pai) are known for i∈{1,…,k−1}. Then, we showed that using a specific choice of σ, we can derive the relation P(c′k|y) = (Qk − β(c′′k)) / (β(c′k) − β(c′′k)), where all the terms on the right-hand side are known.
Then, for the approximate case's induction hypothesis, instead assume that the P(Ci|Pai) are known for i∈{1,…,k−1} up to O(δ). We will show that this implies knowledge of P(Ck|Y) up to O(δ). Let's denote our approximation by ^P, so ^P(Ci|Y) = P(Ci|Y) + O(δ).
Let ^β(ck) := ∑c1,…,ck−1 ^P(c1|y)⋯^P(ck−1|y)[U(d,x)−U(d′,x)]. Because it is a sum of products of factors each accurate up to O(δ), overall it differs from β(ck) by O(δ).
Hence ^P(c′k|y) := (^Qk − ^β(c′′k)) / (^β(c′k) − ^β(c′′k)). Because ^β(ck) = β(ck) + O(δ), and ^Qk = Qk + O(δ) by the earlier section, we see that ^P(c′k|y) = (Qk − β(c′′k) + O(δ)) / (β(c′k) − β(c′′k) + O(δ)).
Using the big-O fact that (A + O(δ)) / (B + O(δ)) = A/B + O(δ) (for B bounded away from zero), we thus prove ^P(c′k|y) = (Qk − β(c′′k)) / (β(c′k) − β(c′′k)) + O(δ) = P(c′k|y) + O(δ).
More simply, ^P(c′k∣y)=P(c′k∣y)+O(δ) as δ goes to 0. That proves the induction step.
The base case of k=1 is once again clear, since ^P(c′1|y) = (^Q1 − ^β(c′′1)) / (^β(c′1) − ^β(c′′1)), where ^β(c′1) and ^β(c′′1) are of the form U(d,x)−U(d′,x), which is known, and so is ^Q1 using the oracle. This shows that it can be computed, and the earlier paragraphs show that it is accurate up to O(δ).
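The big-O fact used in the induction step can be illustrated numerically: perturbing the numerator and denominator by terms of size O(δ) perturbs the ratio by O(δ) as well. A minimal sketch (the values are arbitrary stand-ins of my own choosing, not quantities from the paper):

```python
# Perturb numerator and denominator of a ratio by delta and watch the
# error of the ratio shrink roughly linearly as delta -> 0.

def ratio_error(A, B, delta):
    """Error of (A + delta)/(B + delta) against the true ratio A/B."""
    return abs((A + delta) / (B + delta) - A / B)

A, B = 0.3, 0.8   # stand-ins for Q_k - beta(c''_k) and beta(c'_k) - beta(c''_k)
e1 = ratio_error(A, B, 0.01)
e2 = ratio_error(A, B, 0.001)
assert e2 < e1        # error decreases with delta
assert e1 / e2 < 11   # roughly linear: 10x smaller delta, ~10x smaller error
```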
Identifying the Approximate Graph Structure
The above showed that we can identify P(Ci|Pai) up to O(δ) (or more precisely, P(Ci;σ) up to O(δ)) for all values of Ci and Y. In the optimal oracle case, this was sufficient for perfectly identifying Pai: hold all but one variable of Y fixed and observe which changes in that variable cause a change in the value of P(Ci;σ).
What is the issue with the approximate case?
First of all, we'll have to use ^P(Ci;σ) instead of P(Ci;σ).
Say we want to test whether Cj is a parent of Ck. So we have σ fix everything in Y and vary the value of Cj across the elements of its domain. Let's denote the value of ^P(Ck=ck;σ) where Cj is set to cj by ^P(ck|pak;do(Cj=cj)).
So the process is: Set ck, vary cj and see if there is a change, set a new ck, repeat.
The problem is that, because ^P is only accurate up to O(δ), we can't tell if the change is due to actual differences in the underlying P or due to the error in approximation.
The solution is to use one of the earlier explicit bounds on |Q − ~Q| in terms of quantities that the algorithm has access to, i.e. |Q − ~Q| ≤ δ/~q. We can then use this bound to derive explicit upper and lower bounds for the values of ^P(ck|pak;do(Cj=cj)), which we'll call θ+ck,cj and θ−ck,cj.
And if there exists a ck with values cj and c′j whose intervals [θ−ck,cj, θ+ck,cj] and [θ−ck,c′j, θ+ck,c′j] don't overlap, then we can guarantee that the change is due to actual differences in the underlying P.
This procedure lets us identify a subset of Pak, hence a subgraph of G.
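The interval-overlap test above can be sketched in a few lines (my own hypothetical implementation; the function names and the `theta` layout, mapping pairs (ck, cj) to bounds [θ−, θ+], are made up for illustration):

```python
# C_j is declared a parent of C_k as soon as some value c_k has two settings
# of C_j whose bound intervals for P_hat(c_k | ...) fail to overlap.

def intervals_disjoint(lo1, hi1, lo2, hi2):
    return hi1 < lo2 or hi2 < lo1

def is_parent(theta):
    """theta: dict mapping (c_k, c_j) -> (theta_minus, theta_plus)."""
    cks = {ck for ck, _ in theta}
    cjs = {cj for _, cj in theta}
    for ck in cks:
        for cj1 in cjs:
            for cj2 in cjs:
                lo1, hi1 = theta[(ck, cj1)]
                lo2, hi2 = theta[(ck, cj2)]
                if intervals_disjoint(lo1, hi1, lo2, hi2):
                    return True   # change must come from the underlying P
    return False  # inconclusive: overlap may just be approximation error

# toy example: varying c_j moves P_hat(c_k) well beyond the error bars
theta = {(0, 0): (0.10, 0.20), (0, 1): (0.55, 0.65),
         (1, 0): (0.80, 0.90), (1, 1): (0.35, 0.45)}
assert is_parent(theta)
```

Note the asymmetry: a `True` answer certifies parenthood, while `False` is merely inconclusive, which is why the procedure only identifies a subset of Pak.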
Detailed Proof
Suppose that, given a context, two decisions had the same expected utility:
E[U|PaD=paD,do(D=d);σ] = E[U|PaD=paD,do(D=d′);σ]
Recall the definition: we literally take the expectation over all the values that the ancestors of U could take. Let Z = AncU∖PaD and X = PaU∖{D}.
E[U|PaD=paD,do(D=d);σ] = ∑z U(d,x) P(z|paD,do(D=d);σ) = ∑z U(d,x) P(z,paD|do(D=d);σ) / P(paD|do(D=d);σ)
where P(z|paD,do(D=d);σ) is 0 if z is incompatible with σ.
Note P(paD|do(D=d);σ)=P(paD|σ) and P(z,paD|do(D=d);σ)=P(z,paD;σ), because do(D=d) only has an effect on its descendants, which PaD isn't part of, and neither is Z∪PaD=AncU.
Therefore, E[U|PaD=paD,do(D=d);σ] = ∑z U(d,x) P(z,paD;σ) / P(paD;σ)
And we're curious about the case when the difference in expected utility is zero:
E[U|PaD=paD,do(D=d);σ] − E[U|PaD=paD,do(D=d′);σ] = ∑z [U(d,x) P(z,paD;σ)/P(paD;σ) − U(d′,x) P(z,paD;σ)/P(paD;σ)] = ∑z (U(d,x) − U(d′,x)) P(z,paD;σ)/P(paD;σ) = 0
Suppose that σ=do(C1=f1(c1),…,Cn=fn(cn)) without loss of generality. Then the terms can be written as such:
P(z,paD;σ) = ∏_{i=1}^{N} P(Ci=ci|Pai=pai;σ) = ∏_{i=1}^{N} ∑_{c′i : fi(c′i)=ci} P(Ci=c′i|Pai=pai)
Long story short, this is a polynomial constraint on the parameters P(Ci=ci|Pai=pai) of the network, and the solution set of a nontrivial polynomial equation has measure zero (intuitively, for a polynomial to be precisely equal to zero, its inputs must be precisely aligned, which is rare).
So we can reasonably claim that this is extremely unlikely to happen: in the space of all possible parameterizations [0,1]^|θP|, the solutions to a polynomial constraint form a measure-zero set.
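The measure-zero intuition can be made concrete with a quick simulation: a fixed nontrivial polynomial constraint is essentially never satisfied exactly by randomly drawn parameters. A sketch, using an arbitrary polynomial of my own choosing as a stand-in for the actual expected-utility constraint:

```python
import random

random.seed(0)

def constraint(theta):
    # an arbitrary nonzero polynomial in the parameters, standing in for
    # the expected-utility-difference constraint from the proof
    a, b, c = theta
    return a * b - c * (1 - a)

# count exact zeros among many random parameterizations drawn from [0,1]^3
hits = sum(1 for _ in range(100_000)
           if constraint([random.random() for _ in range(3)]) == 0.0)
assert hits == 0  # generic draws essentially never land on the zero set
```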