This is a post explaining, in detail, the proof of the paper Robust Agents Learn Causal World Models. Check the previous post in the sequence for a higher-level summary and discussion of the paper, including an explanation of the basic setup (terminology and assumptions), which this post will assume from here on.
Recalling the Basic Setup
Let's recall the basic setup (again, check the previous post for more explanation):
[World] The world is a Causal Bayesian Network G over the set of variables corresponding to the environment C, utility node U, and decision node D. The differences from a normal Causal Bayesian Network is that (1) U is a deterministic function of its parents U(PaU), and (2) P(D∣PaD), the conditional probability distribution for D, is undetermined—it's something that our agent will select.
[Agent as a Policy Oracle] An agent is a policy oracle ΠΣ which is a function that takes in an intervention σ∈Σ (where Σ represents the set of all allowed interventions over C) and returns a policy πσ(D∣PaD).
[Robustness as δ-optimality under interventions] We define a "robust" agent as a policy oracle whose regret is bounded by δ under some class of interventions over the environment Σ.
Assumptions
Also recall the following assumptions:
1) Unmediated Decision Task states that DesD∩AncU=∅. This is a major restriction: no descendant of D may be an ancestor of U, so D can only affect U directly.
2) Domain dependence states that there exist distributions over the chance variables P(C) and P′(C) (compatible with M) such that argmaxπEπP[U]≠argmaxπEπP′[U].
This is very reasonable. If domain dependence does not hold, then the optimal policy is just a constant function.
These together imply:
There does not exist a decision d∈dom(D) that is universally optimal, i.e. d∈argmaxd′U(d′,x) for all x∈dom(X), where X=PaU∖{D}.
D∈PaU, i.e. there can't be any intermediate nodes between D and U, and all causal effects from D to U must be direct.
Now, with the basic setup of the paper recalled, let's prove the main theorems.
Proof of the Exact Case
We will first prove Theorem 1:
For almost all worlds (G,P) satisfying assumptions 1 and 2, we can identify the directed acyclic graph G and the joint distribution P over all variables upstream of U, given access to a 0-optimal policy oracle.
I will attempt to present the proof in a way that focuses on how one could've discovered the paper's results by themselves, emphasizing intuitions and how trying to formalize them naturally constrains the solutions or assumptions we must use.
Load-Bearing part of Oracle Use
Somehow we're going to have to elicit information out of the policy oracle.
Recall that the oracle is a function that maps an intervention to a policy, which is a conditional probability distribution πσ(D∣PaD). It can be shown that if the oracle is optimal, this distribution is (almost always) a deterministic map. The argument goes as follows:
Suppose that given a context PaD=paD, two decisions d and d′ have the same expected utility.
Then, we can argue that this is extremely unlikely:
Intuitively because exact equality is very unlikely for real number things.
More rigorously, because expressing this equality results in a polynomial constraint over the parameters, whose solution set has Lebesgue measure 0 over the parameterization of the conditional probability distributions.
Therefore it is extremely likely (probability 1) that there is a strict ordering of decisions (according to their expected utility) given a context, i.e. almost always a unique maximum EU decision given a context. Therefore an optimal policy must only choose that decision in that given context, i.e. deterministic.
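As a sanity check of the measure-zero intuition, here is a small self-contained Python sketch (my own toy construction, not the paper's code): it draws random utility tables and context distributions and checks whether any context ends up with a tied maximum-expected-utility decision.

```python
import random

# Toy check of the measure-zero tie argument: draw the utility table and the
# context-conditional distributions at random, then verify that the
# maximum-expected-utility decision in each context is unique.
random.seed(0)

N_DECISIONS, N_CONTEXTS, N_STATES = 3, 4, 5

def random_world():
    # U(d, x): utility of decision d when the non-decision parents of U take value x
    u = [[random.random() for _ in range(N_STATES)] for _ in range(N_DECISIONS)]
    # P(x | pa_D): one distribution over X per context pa_D
    p = []
    for _ in range(N_CONTEXTS):
        w = [random.random() for _ in range(N_STATES)]
        s = sum(w)
        p.append([wi / s for wi in w])
    return u, p

def optimal_policy(u, p):
    """Return, per context, the set of decisions maximizing expected utility."""
    policy = []
    for pa in range(N_CONTEXTS):
        eu = [sum(p[pa][x] * u[d][x] for x in range(N_STATES)) for d in range(N_DECISIONS)]
        best = max(eu)
        policy.append([d for d in range(N_DECISIONS) if eu[d] == best])
    return policy

ties = 0
for _ in range(1000):
    u, p = random_world()
    if any(len(best) > 1 for best in optimal_policy(u, p)):
        ties += 1
print("worlds with a tied optimal decision:", ties)
```

With continuous random parameters an exact tie essentially never occurs, so the optimal policy is a deterministic map in every sampled world.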
Then, somehow our information-eliciting procedure is going to have to exploit the fact that, given a PaD=paD, as we change the intervention σ, the optimal decision prescribed changes from d to d′.
To make this possible, we want to rule out the existence of a decision that is universally optimal across all inputs to the utility node, because then no intervention would yield a change in the output of the oracle.
In math: a d∈dom(D) such that for all possible inputs x to the utility node (other than the decision), d maximizes U, i.e. d∈argmaxd′U(d′,x) for all x∈dom(X), where X=PaU∖{D}.
Recall the first consequence of the two assumptions: there does not exist a decision d∈dom(D) that is universally optimal, i.e. d∈argmaxd′U(d′,x) for all x∈dom(X), where X=PaU∖{D}.
This implies that there is at least one x′ where the optimal decision d′ differs from d!
But the worry is that such x′ will be incompatible with PaD=paD.
So, we will restrict ourselves to considering only σ that mask the inputs of D, letting D depend only on Pa′D⊆PaD such that Pa′D∩PaU=∅.
Then, given σ, there is an optimal decision associated with some Pa′D=pa′D. By the earlier argument, if we consider any intervention σ′ that additionally performs do(X=x′), it would prescribe a different optimal decision.
And to operationalize "as we change the intervention," we define a mixed local intervention ~σ(q)=qσ+(1−q)σ′.
As q increases from 0 to 1, ~σ(q) interpolates from σ′ to σ, so the decision the oracle prescribes (under Pa′D=pa′D) changes along the way. Label the decisions encountered, in order, as d1 (at q=0), then possibly some intermediate decisions (say d2 and d3 for the current example), ending at d4 (at q=1).
Note that once you switch your decision from d to d′ as you increase q, you will never encounter d again, because E[U∣paD,do(D=d);~σ(q)] is linear in q. Namely:
$$\mathbb{E}[U\mid pa_D, do(D=d); \tilde\sigma(q)] = q\,\mathbb{E}[U\mid pa_D, do(D=d); \sigma] + (1-q)\,\mathbb{E}[U\mid pa_D, do(D=d); \sigma'] = q U_1 + (1-q) U_2 = U_2 + q(U_1 - U_2).$$
The diagram below makes it clearer why linearity implies the same decision never gets chosen twice: the oracle can be thought of as attaining the upper envelope, denoted by the dotted line.
Let qcrit represent the value of q at which the optimal decision switches from some d3 (may or may not be d1) to d4.
Insight: qcrit is interesting. It's a behavioral property of our oracle, meaning we can estimate it by repeatedly sampling it (across random samples of q∈[0,1]). But it can also probably be expressed in closed-form in terms of some parameters of the environment (by just expanding out the definition of expected utility). So qcrit is a bridge that lets us infer properties about the environment from the oracle.
Let's derive a closed-form expression of qcrit. Let R=C∖Pa′D.
Detailed Proof
qcrit is the value that satisfies
$$\mathbb{E}[U\mid pa'_D, do(D=d_3); \tilde\sigma(q_{\text{crit}})] = \mathbb{E}[U\mid pa'_D, do(D=d_4); \tilde\sigma(q_{\text{crit}})]$$
Expanding out the left-hand side:
$$\mathbb{E}[U\mid pa'_D, do(D=d_3); \tilde\sigma(q_{\text{crit}})] = \frac{1}{P(pa'_D; \tilde\sigma(q_{\text{crit}}))}\sum_r U(d_3,x)\,P(r, pa'_D; \tilde\sigma(q_{\text{crit}})) = \frac{1}{P(pa'_D; \tilde\sigma(q_{\text{crit}}))}\sum_r U(d_3,x)\left[q_{\text{crit}}\,P(r,pa'_D;\sigma) + (1-q_{\text{crit}})\,P(r,pa'_D;\sigma')\right]$$
Expanding out the difference of both sides and setting it to zero:
$$\frac{1}{P(pa'_D; \tilde\sigma(q_{\text{crit}}))}\sum_r \Big( q_{\text{crit}}\,P(r,pa'_D;\sigma)\,[U(d_3,x)-U(d_4,x)] + (1-q_{\text{crit}})\,P(r,pa'_D;\sigma')\,[U(d_3,x)-U(d_4,x)] \Big) = 0$$
Now we solve for qcrit:
$$q_{\text{crit}}\sum_r P(r,pa'_D;\sigma)[U(d_3,x)-U(d_4,x)] = -(1-q_{\text{crit}})\sum_r P(r,pa'_D;\sigma')[U(d_3,x)-U(d_4,x)] = -(1-q_{\text{crit}})\,[U(d_3,x')-U(d_4,x')]$$
$$\frac{q_{\text{crit}}}{1-q_{\text{crit}}} = \frac{-[U(d_3,x')-U(d_4,x')]}{\sum_r P(r,pa'_D;\sigma)[U(d_3,x)-U(d_4,x)]}$$
$$q_{\text{crit}} = \left(1 - \frac{\sum_r P(r,pa'_D;\sigma)[U(d_3,x)-U(d_4,x)]}{U(d_3,x')-U(d_4,x')}\right)^{-1}$$
Again, qcrit can be estimated from sampling the oracle. We know the denominator because we assume the knowledge of U.
Therefore, the oracle lets us calculate ∑rP(r,pa′D;σ)[U(d3,x)−U(d4,x)], the difference in expected utility of some two decisions given some context.
Restating our chain of inquiry as a lemma:
(Lemma 4) Given σ that masks the inputs such that Pa′D∩PaU=∅, ∀pa′D∃d,d′ such that we can approximate ∑rP(r,pa′D;σ)[U(d,x)−U(d′,x)], where R=C∖Pa′D and X=PaU∖{D}.
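To make the switch-point idea concrete, here is a toy sketch (my own construction with hypothetical numbers; the "oracle" is just a stand-in that attains the upper envelope of two lines): we locate qcrit by bisection over q, then recover the hidden quantity Q via the closed form above.

```python
# Toy model: Δ_q = Δ0 + q(Δ1 − Δ0) is the expected-utility gap between d3 and d4
# under the mixed intervention ~σ(q). The oracle picks d3 while Δ_q > 0.
DELTA0, DELTA1 = 2.0, -1.5   # hypothetical values; Δ0 > 0 > Δ1, so a switch exists

def oracle(q):
    """Black-box 0-optimal behaviour: d3 while Δ_q > 0, else d4."""
    return "d3" if DELTA0 + q * (DELTA1 - DELTA0) > 0 else "d4"

def estimate_qcrit(tol=1e-12):
    # Bisection on the oracle's output: oracle(0) == 'd3', oracle(1) == 'd4'.
    lo, hi = 0.0, 1.0
    while hi - lo > tol:
        mid = (lo + hi) / 2
        if oracle(mid) == "d3":
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2

q_crit = estimate_qcrit()
Q = DELTA0 * (1 - 1 / q_crit)  # recovers Σ_r P(r, pa'_D; σ)[U(d3,x) − U(d4,x)] = Δ1
print(round(q_crit, 6), round(Q, 6))
```

Note that only the oracle's outputs were used: the closed form turns the behavioral quantity qcrit into the environmental quantity Q.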
Identification via Induction
By the Unmediated Decision Task assumption, we see that the ancestors of U look like the following. We notice that there are two types of paths to consider in our induction argument.
Let's first consider the first type, Ck→Ck−1→⋯→C1, where C1∈PaU, C1≠D.
We first define the following variables:
X=PaU∖{D}
Z=AncU∖PaD
R=C∖Pa′D
Y=C∖{C1,…,Ck}
Assume Pa1,…,Pak−1 are known, and P(Ci|Pai) are known, for i∈{1,…,k−1}.
The claim to prove is that, given these are known, we can identify the conditional probability distribution P(Ck|Pak).
Assume we have some σ. We want to identify P(Ck|Pak), and we have to somehow use the policy oracle for that.
Recall from Lemma 4 that given an intervention σ such that Pa′D∩PaU=∅, for all values of pa′D, there exists two different decisions d and d′ such that ∑rP(r,pa′D;σ)[U(d,x)−U(d′,x)] can be identified.
The trick is in setting σ such that it makes this sum contain P(Ci=ci|Pai=pai) terms for all 1≤i≤k, for arbitrary choices of ci and pai.
Let's think through how we might discover the right constraints on σ.
Constraint 1: σ will remove D's parents
Since we want Pa′D∩PaU=∅, let's just choose σ such that Pa′D=∅ (hence R=C∖Pa′D=C).
Constraint 2: σ fixes the environment variables other than those in the path.
Note the following: since the value of ∑cP(c;σ)[U(d,x)−U(d′,x)] can be computed, if it can be expressed in terms of P(c1∣pa1),…,P(ck−1∣pak−1),P(ck∣pak), then we can solve for P(ck∣pak), since by the induction hypothesis all the other terms are known. Note that we also somehow have to figure out what the set Pak is, too.
Expanding out the above sum will give us some clue as to what further constraints we must impose on σ in order for the sum to be expressed that simply:
$$P(C=c;\sigma) = \prod_{C_j\in C} P(C_j=c_j\mid Pa_j=pa_j;\sigma) = P(c_1\mid pa_1;\sigma)\cdots P(c_k\mid pa_k;\sigma)\prod_{C_j\in Y}P(c_j\mid pa_j;\sigma)$$
How do we choose σ such that
$$\sum_c P(C=c;\sigma)\,[*] = \sum_c \Big( P(c_1\mid pa_1;\sigma)\cdots P(c_k\mid pa_k;\sigma)\prod_{C_j\in Y}P(c_j\mid pa_j;\sigma)\Big)[*]$$
becomes
$$\sum_{c_1}\cdots\sum_{c_k} P(c_1\mid pa_1;\sigma)\cdots P(c_k\mid pa_k;\sigma)\,[*]$$
for arbitrary choices of pa1,…,pak−1,pak?
Note that setting Y to a constant will:
make ∏Cj∈YP(cj∣paj;σ) always have values of zero in all settings of c except one, in which it will evaluate to 1.
also set the values of Pak to a constant, among others - even though we don't yet know exactly which variables belong to Pak.
Thus such an intervention immediately gets rid of the ∏Cj∈Y terms as we sum across c, while still letting us arbitrarily control the values of Pa1∖{C2},…,Pak−1∖{Ck}, and Pak (among other variables in Y).
So constraint 2: σ contains do(Y=y) (such that values of Y should be compatible with the values of PaD∖Pa′D that are set earlier in constraint 1.)
Then, we have the following expression:
$$\sum_c P(c;\sigma)\,[U(d,x)-U(d',x)] = \sum_{c_1,\dots,c_k} P(c_1\mid pa_1;\sigma)\cdots P(c_k\mid pa_k;\sigma)\,[U(d,x)-U(d',x)]$$
So far, we haven't intervened on {C1,…,Ck}. So, P(ci|pai;σ)=P(ci|pai) for pai compatible with the value ci+1 (if applicable) and σ, further simplifying the expression:
$$\sum_c P(c;\sigma)\,[U(d,x)-U(d',x)] = \sum_{c_1,\dots,c_k} P(c_1\mid pa_1)\cdots P(c_k\mid pa_k)\,[U(d,x)-U(d',x)]$$
But this isn't yet solvable. By induction hypothesis we know P(ci|pai) for all values of ci and pai, and we know the value of the left-hand side. This equation then involves Val(Ck)−1 unknowns.
An obvious fix, then, is an intervention that effectively reduces Val(Ck) to 2, which brings us to the third constraint:
Constraint 3: Let σ contain a local intervention making Ck a binary variable
Specifically, let σ contain do(Ck=f(Ck)), where
$$f(C_k) = \begin{cases} c'_k & \text{if } C_k = c_k \\ c''_k & \text{otherwise} \end{cases}$$
This effectively makes Ck a binary variable. Precisely:
$$P(C_k=c_k\mid Pa_k=pa_k;\sigma) = \sum_{\tilde c_k\,:\,f(\tilde c_k)=c_k} P(C_k=\tilde c_k\mid Pa_k=pa_k) = \begin{cases} P(C_k=c'_k\mid Pa_k=pa_k) & \text{if } c_k = c'_k \\ 1-P(C_k=c'_k\mid Pa_k=pa_k) & \text{if } c_k = c''_k \end{cases}$$
and now the equation can be solved.
Let Qk=∑cP(c;σ)[U(d,x)−U(d′,x)], which can be written ∑ckP(ck|pak;σ)β(ck), where β(ck)=∑c1,…,ck−1P(c1|pa1)…P(ck−1|pak−1)[U(d,x)−U(d′,x)].
The earlier σ lets us simplify Qk as P(c′k|pak;σ)β(c′k)+(1−P(c′k|pak;σ))β(c′′k). Thus
$$P(c'_k\mid pa_k) = \frac{Q_k - \beta(c''_k)}{\beta(c'_k) - \beta(c''_k)}.$$
We know Qk (via the policy oracle), and we know the values of β (via the induction hypothesis).
But important subtlety here: remember that we don't actually know Pak yet. The pak in the above expression P(c′k|pak) is meant to be understood as the implicit assignment of values to the (yet unknown to us) Pak by the means of do(Y=y) in σ.
So, by performing a set of interventions that fix all but one of the variables of Y, one can discover which variables Ck responds to (i.e., for which the values of P(c′k|pak) change), and hence figure out the set Pak.
Then, by varying the choices of pak,c′k, and c′′k, we can identify P(Ck∣Pak) completely.
The base case of k=1 is clear, since
$$P(c'_1\mid y) = \frac{Q_1 - \beta(c''_1)}{\beta(c'_1) - \beta(c''_1)}$$
where β(c′1) and β(c′′1) are of the form U(d,x)−U(d′,x), which is known, and so is Q1 using the oracle.
To recap, our choice of σ is a local intervention such that:
masks all input to D, i.e. Pa′D=∅
fixes rest of the nodes in Y to a constant
does a local intervention do(Ck=f(ck)) making Ck into a binary variable
and we have shown that this intervention lets us identify P(Ck|Pak) for all Ck along the path Ck→Ck−1→⋯→C1, where C1∈PaU, C1≠D.
Similar arguments can be used to prove the same for paths of the second type, Ck→Ck−1→⋯→C1, where C1∈PaD.
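To make the inversion at the heart of the induction step concrete, here is a toy numeric sketch (the numbers are hypothetical, not from the paper): under the three constraints on σ, the oracle-derived quantity Qk collapses to a convex combination of two known β values, which we then invert.

```python
# Induction step, toy version:
# Q_k = P(c'_k|pa_k)·β(c'_k) + (1 − P(c'_k|pa_k))·β(c''_k), solved for P(c'_k|pa_k).
p_true = 0.3                               # the unknown CPD entry P(c'_k | pa_k)
beta_prime, beta_doubleprime = 1.2, -0.7   # β values, known by induction hypothesis

# What the oracle-based procedure would hand us:
Q_k = p_true * beta_prime + (1 - p_true) * beta_doubleprime

# Invert the convex combination to recover the CPD entry:
p_recovered = (Q_k - beta_doubleprime) / (beta_prime - beta_doubleprime)
print(round(p_recovered, 9))
```

Varying the choices of pak, c′k, and c′′k repeats this inversion entry by entry, which is exactly how P(Ck∣Pak) gets identified completely.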
Proof of the Approximate Case
Now, we will extend the proof to the approximate case (Theorem 2):
For almost all worlds (G,P) satisfying assumptions 1 and 2 plus some new assumptions (explained below), we can identify the directed acyclic graph G and the joint distribution P over some subset of variables upstream of U, and the estimation error of each of the conditional distributions scales linearly with δ.
New Assumptions
Unless I'm mistaken here and these can actually be derived from the earlier two assumptions (Unmediated Decision Task, Domain Dependence), here are the three new conditions that the paper implicitly assumes:
3) δ-optimal policies are (still) almost always deterministic
The earlier proof of determinism doesn't go through in the approximate case, but the paper implicitly assumes the policy oracle still (almost always) returns an output deterministically.
4) Uniform δ regret
We say πσ is δ-optimal if Eπ∗[U]−Eπσ[U]≤δ.
We say πσ is uniformly δ-optimal if δ(paD)≤δ for all paD, where we define δ(paD):=Eπ∗[U|paD]−Eπσ[U|paD].
Note that uniform δ-optimality is a stronger condition than δ-optimality, in the sense that the former implies the latter.
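This implication is worth spelling out. Since PaD contains only non-descendants of D, the marginal P(paD) does not depend on the policy, so averaging the per-context regrets gives:

```latex
% Uniform δ-optimality implies δ-optimality:
% average the per-context regrets δ(pa_D) over the policy-independent P(pa_D).
\mathbb{E}_{\pi^*}[U] - \mathbb{E}_{\pi_\sigma}[U]
  = \sum_{pa_D} P(pa_D)\,\delta(pa_D)
  \le \delta \sum_{pa_D} P(pa_D)
  = \delta.
```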
5) Shape of the δ-optimal decision boundary
The left diagram is the ground truth for how the expected utility (under context paD) of various decisions change as you increase q from 0 to 1. Then, the right diagram shows the decision boundary for the 0-optimal oracle, whose decision boundaries must exactly follow the intersection points of the left diagram's lines.
The paper then assumes that the δ-optimal oracle's decision boundaries must be simply a slightly shifted version of the 0-optimal oracle's decision boundaries, like the right diagram. A priori, there's no reason for the boundaries to look like this, e.g., it can look very complicated, like the left diagram. But the paper implicitly assumes this.
Let's now proceed to the proof. The subsections parallel that of the optimal oracle case.
Load-Bearing part of Oracle Use
Our goal is to derive ∑rP(r,pa′D;σ)[U(d3,x)−U(d4,x)], which we call Q.
Recall from earlier that in the optimal oracle case:
$$q_{\text{crit}} = \left(1-\frac{\sum_r P(r,pa'_D;\sigma)[U(d_3,x)-U(d_4,x)]}{U(d_3,x')-U(d_4,x')}\right)^{-1} = \left(1-\frac{Q}{\Delta_0}\right)^{-1}$$
where we define Δq as follows:
$$\Delta_q := \mathbb{E}[U\mid pa_D, do(D=d_3);\tilde\sigma(q)] - \mathbb{E}[U\mid pa_D, do(D=d_4);\tilde\sigma(q)] = q\Delta_1 + (1-q)\Delta_0 = \Delta_0 + q(\Delta_1-\Delta_0)$$
Notice that qcrit is the unique solution to Δq=0.
Also note that Q=Δ0(1−1/qcrit). We know the value of Δ0=U(d3,x′)−U(d4,x′). In the optimal oracle case, recall that qcrit can be estimated by Monte Carlo sampling of the oracle.
But the problem with a δ-optimal oracle is that this only yields a biased estimate, which we call ~q.
Using Q=Δ0(1−1/qcrit) but naively substituting qcrit with the biased estimate, we get a biased estimate for Q that we call ~Q=Δ0(1−1/~q).
Our aim, then, is to bound the quality of the estimate ∣∣Q−~Q∣∣ with the bound only involving non-estimate terms, like δ,Q,qcrit, and Δ0.
Expanding out, Q−~Q=Δ0((1−1/qcrit)−(1−1/~q))=Δ0(1/~q−1/qcrit). That ~q is an estimate term, which we want to remove by bounding it via a term involving non-estimate terms. How?
First, we have yet to exploit any behavioral properties of the oracle related to ~q. What are those? By definition, the oracle chooses d3 at ~q−ϵ and d4 at ~q+ϵ for very small ϵ, and the uniform δ regret condition bounds the regret of each of these choices by δ. Taking ϵ to 0, and assuming continuity, we can subtract the two to get
$$-\delta \le \mathbb{E}[U\mid pa_D, do(D=d_3);\tilde\sigma(\tilde q)] - \mathbb{E}[U\mid pa_D, do(D=d_4);\tilde\sigma(\tilde q)] \le \delta.$$
In other words, −δ≤Δ~q≤δ, or −δ≤Δ0+~q(Δ1−Δ0)≤δ.
Because Δqcrit=Δ0+qcrit(Δ1−Δ0)=0, we can rewrite the inequality as −δ≤Δ0(1−~q/qcrit)≤δ. Expanding out and rearranging, we find that Δ0−(Δ0+δ)/~q ≤ Δ0(1−1/qcrit) ≤ Δ0−(Δ0−δ)/~q.
Substituting this into the expansion of Q−~Q, we obtain the following simple bound: |Q−~Q|≤δ/~q.
Finally, ~q in the denominator can be eliminated as follows:
$$|Q-\tilde Q| \le \frac{\delta}{\tilde q} = \delta\left(1-\frac{\tilde Q}{\Delta_0}\right) \le \delta\left(1-\frac{Q+\delta/\tilde q}{\Delta_0}\right) = \delta\left(1-\frac{Q}{\Delta_0}\right) - \frac{\delta^2}{\tilde q\,\Delta_0} \le \delta\left(1-\frac{Q}{\Delta_0}\right) - \frac{\delta^2}{q_{\text{crit}}(\Delta_0-\delta)} = \delta\left(1-\frac{Q}{\Delta_0}\right) - \frac{\delta^2}{q_{\text{crit}}\,\Delta_0} - \frac{\delta^3}{q_{\text{crit}}\,\Delta_0^2} + O(\delta^4)$$
where the last step comes from Taylor expanding 1/(Δ0−δ) around δ=0 (truncated at fourth order), valid for δ<|Δ0|.
Or more simply, |Q−~Q|≤δ(1−Q/Δ0)+O(δ²). The error term is linear in δ for small values of δ.
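This linear-in-δ behaviour can be checked numerically. Below is a toy sketch (my own construction: a single Δq line with hypothetical Δ0, Δ1, not the paper's code) verifying the intermediate bound |Q−~Q|≤δ/~q across a range of δ-consistent switch points:

```python
# Toy Δ line: Δ_q = Δ0 + q(Δ1 − Δ0), with exact switch at q_crit where Δ_q = 0.
DELTA0, DELTA1 = 2.0, -1.5
q_crit = DELTA0 / (DELTA0 - DELTA1)     # solves Δ_q = 0
Q = DELTA0 * (1 - 1 / q_crit)           # the exact-oracle answer (= Δ1)

delta = 0.05
for s in (-1.0, -0.5, 0.0, 0.5, 1.0):
    # pick a switch point q~ whose gap at the switch is exactly s·δ ∈ [−δ, δ],
    # i.e. any switch point a uniformly δ-optimal oracle could exhibit
    q_tilde = q_crit * (1 - s * delta / DELTA0)
    Q_tilde = DELTA0 * (1 - 1 / q_tilde)          # naive plug-in estimate
    assert abs(Q - Q_tilde) <= delta / q_tilde + 1e-12
print("bound |Q - Q~| <= δ/q~ holds at all toy switch points")
```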
Identification via Induction
The argument is basically the same as that of the optimal case, except needing to incorporate error terms.
The exact case's induction hypothesis was that Pai and P(Ci|Pai) are known for i∈{1,…,k−1}. Then, we showed that using a specific choice of σ, we can derive the relation
$$P(c'_k\mid y) = \frac{Q_k - \beta(c''_k)}{\beta(c'_k) - \beta(c''_k)}$$
where all the terms on the right-hand side are known.
Then, for the approximate case's induction hypothesis, instead assume that P(Ci|Pai) are known for i∈{1,…,k−1}, up to O(δ). We will show that this implies the knowledge of P(Ck|Y) up to O(δ). Let's denote the approximation we have ^P, so ^P(Ci|Y)=P(Ci|Y)+O(δ).
Important subtlety here: we don't assume we know Pai. "Knowing P(Ci|Pai)" is meant to be understood as knowing the values of P(Ci;σ) - where σ contains do(Y=y) hence implicitly intervening on Pai - for all values of Ci and Y.
Let ^β(ck):=∑c1,…,ck−1^P(c1|y)…^P(ck−1|y)[U(d,x)−U(d′,x)]. Because it is a sum of products of factors that are each accurate up to O(δ), it differs from β(ck) by O(δ) overall.
Hence define
$$\hat P(c'_k\mid y) := \frac{\hat Q_k - \hat\beta(c''_k)}{\hat\beta(c'_k) - \hat\beta(c''_k)}.$$
Because ^β(ck)=β(ck)+O(δ) and ^Qk=Qk+O(δ) by the earlier section, we see that
$$\hat P(c'_k\mid y) = \frac{Q_k - \beta(c''_k) + O(\delta)}{\beta(c'_k) - \beta(c''_k) + O(\delta)}.$$
Using the big-O fact that (A+O(δ))/(B+O(δ))=A/B+O(δ) (for B bounded away from zero), we thus prove
$$\hat P(c'_k\mid y) = \frac{Q_k - \beta(c''_k)}{\beta(c'_k) - \beta(c''_k)} + O(\delta) = P(c'_k\mid y) + O(\delta).$$
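For completeness, this big-O fact follows from a direct computation, valid as δ→0 provided B is bounded away from zero:

```latex
\frac{A + O(\delta)}{B + O(\delta)} - \frac{A}{B}
  = \frac{B\,(A + O(\delta)) - A\,(B + O(\delta))}{B\,(B + O(\delta))}
  = \frac{O(\delta)}{B^2 + O(\delta)}
  = O(\delta).
```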
More simply, ^P(c′k∣y)=P(c′k∣y)+O(δ) as δ goes to 0. That proves the induction step.
The base case of k=1 is once again clear, since
$$\hat P(c'_1\mid y) = \frac{\hat Q_1 - \hat\beta(c''_1)}{\hat\beta(c'_1) - \hat\beta(c''_1)}$$
where ^β(c′1) and ^β(c′′1) are of the form U(d,x)−U(d′,x), which is known, and so is ^Q1 using the oracle. This shows that it can be computed, and the earlier paragraphs show that it is accurate up to O(δ).
Identifying the Approximate Graph Structure
The above showed that we can identify P(Ci|Pai) up to O(δ), or more precisely, P(Ci;σ) up to O(δ), for all values of Ci and Y. In the optimal oracle case, this was sufficient for perfectly identifying Pai: hold all but one variable of Y fixed and observe which changes in that variable cause a change in the values of P(Ci;σ).
What is the issue with the approximate case?
First of all, we'll have to use ^P(Ci;σ) instead of P(Ci;σ).
Say we want to test whether Cj is a parent of Ck. So we have σ fix everything in Y and vary the value of Cj across the elements in its domain. Let's denote the value of ^P(Ck=ck;σ) where Cj is set to cj as ^P(ck|pak;do(Cj=cj)).
So the process is: Set ck, vary cj and see if there is a change, set a new ck, repeat.
The problem is that, because ^P is only accurate up to O(δ), we can't tell if the change is due to actual differences in the underlying P or due to the error in approximation.
The solution is to use one of the earlier explicit bounds on |Q−~Q| in terms of quantities the algorithm has access to, i.e. |Q−~Q|≤δ/~q. We can then use this bound to derive explicit upper and lower bounds for the values of ^P(ck|pak;do(Cj=cj)), which we'll call θ+ck,cj and θ−ck,cj.
And if it's the case that there exists ck such that there exists cj and c′j whose intervals [θ−ck,cj,θ+ck,cj] and [θ−ck,c′j,θ+ck,c′j] don't overlap, then we can guarantee that the change is due to actual differences in the underlying P.
This procedure lets us identify a subset of Pak, hence a subgraph of G.
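A minimal sketch of this decision rule, with hypothetical estimates and a hypothetical O(δ)-derived error bound err (my own illustration, not the paper's algorithm):

```python
# Declare that C_j influences C_k only when, for some c_k, the error intervals
# around the two estimates under do(C_j = c_j) and do(C_j = c_j') fail to overlap.
def intervals_overlap(a, b):
    return a[0] <= b[1] and b[0] <= a[1]

def detects_parent(est1, est2, err):
    """est1/est2: estimated P(c_k; ·) under the two settings of C_j; err: error bound."""
    i1 = (est1 - err, est1 + err)
    i2 = (est2 - err, est2 + err)
    return not intervals_overlap(i1, i2)

print(detects_parent(0.30, 0.70, 0.05))  # clearly separated -> True
print(detects_parent(0.48, 0.52, 0.05))  # could be estimation error -> False
```

Note the asymmetry: non-overlapping intervals certify an edge, but overlapping intervals prove nothing, which is exactly why only a subset of Pak (hence a subgraph of G) is guaranteed.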
Detailed Proof: Optimal Policies are Almost Always Deterministic
Suppose that, given a context, two decisions had the same expected utility:
$$\mathbb{E}[U\mid Pa_D=pa_D, do(D=d);\sigma] = \mathbb{E}[U\mid Pa_D=pa_D, do(D=d');\sigma]$$
Recall the definition of this expectation: we literally take the expectation over all the values that the ancestors of U could take. Let Z=AncU∖PaD and X=PaU∖{D}:
$$\mathbb{E}[U\mid Pa_D=pa_D, do(D=d);\sigma] = \sum_z U(d,x)\,P(z\mid pa_D, do(D=d);\sigma)$$
where P(z|paD,do(D=d);σ) is 0 if z is incompatible with σ.
Note P(paD|do(D=d);σ)=P(paD|σ) and P(z,paD|do(D=d);σ)=P(z,paD;σ), because do(D=d) only has an effect on its descendants, which PaD isn't part of, and neither is Z∪PaD=AncU.
And we're curious about the case where the difference in expected utility is zero:
$$\mathbb{E}[U\mid Pa_D=pa_D,do(D=d);\sigma] - \mathbb{E}[U\mid Pa_D=pa_D,do(D=d');\sigma] = \sum_z \left( U(d,x)\frac{P(z,pa_D;\sigma)}{P(pa_D;\sigma)} - U(d',x)\frac{P(z,pa_D;\sigma)}{P(pa_D;\sigma)} \right) = \sum_z \big(U(d,x)-U(d',x)\big)\frac{P(z,pa_D;\sigma)}{P(pa_D;\sigma)} = 0$$
Suppose that σ=do(C1=f1(c1),…,Cn=fn(cn)) without loss of generality; then each term P(z,paD;σ) factorizes into a product of the network's parameters P(Ci=ci∣Pai=pai).
Long story short, this is a polynomial constraint on the parameters P(Ci=ci|Pai=pai) of the network, and the solution set of a polynomial equation has measure zero (intuitively, because for a polynomial to be exactly zero, its arguments must be precisely aligned, which is rare).
Specifically: given a Bayesian Network G over N nodes, its CPDs are θP={P(vi∣pai) : i∈{1,…,N}, vi∈{0,…,dimi−2}, pai∈dom(Pai)}. Since we're assuming all variables are discrete, this is a finite number of values parameterizing G, each taking a value between 0 and 1.
Suppose we find that they should satisfy some polynomial constraint: f(θP)=0.
Then we can reasonably claim that this is extremely unlikely to happen, because in the space of all possible parameterizations [0,1]|θP|, solutions to a polynomial constraint happen in a measure zero set.
This post was written during Alex Altair's agent foundations fellowship program, funded by LTFF. Thank you Alex Altair, Alfred Harwood, Daniel C for feedback and comments.
This is a post explaining the proof of the paper Robust Agents Learn Causal World Model in detail. Check the previous post in the sequence for a higher-level summary and discussion of the paper, including an explanation of the basic setup (like terminologies and assumptions) which this post will assume from now on.
Recalling the Basic Setup
Let's recall the basic setup (again, check the previous post for more explanation):
Assumptions
Also recall the following assumptions:
1) Unmediated Decision Task states that DesD∩AncU=∅. This is pretty major.
2) Domain dependence states that there exists distributions over the chance variables P(C) and P′(C) (compatible with M) such that argmaxπEπP[U]≠argmaxπEπP′[U].
These together imply:
Now, with the basic setup of the paper recalled, let's prove the main theorems.
Proof of the Exact Case
We will first prove Theorem 1:
I will attempt to present the proof in a way that focuses on how one could've discovered the paper's results by themselves, emphasizing intuitions and how trying to formalize them naturally constrains the solutions or assumptions we must use.
Load-Bearing part of Oracle Use
Somehow we're going to have to elicit information out of the policy oracle.
Recall that the oracle is a function that maps an intervention to a policy, which is a conditional probability distribution πσ(D∣PaD). It can be shown that if the oracle is optimal, this distribution is (almost always) a deterministic map. The argument goes like:
Then, somehow our information-eliciting procedure is going to have to exploit the fact that, given a PaD=paD, as we change the intervention σ, the optimal decision prescribed changes from d to d′.
To make this possible, we want to rule out the existence of a decision that is universally optimal across all inputs to the utility node, because then no intervention would yield a change in the output of the oracle.
Recall the first consequence of the two assumptions we had: There does not exist a decision d∈dom(D) that is optimal, i.e. argmaxdU(d,x) across all x∈X=PaU∖{D}.
This implies that there is at least one x′ where the optimal decision d′ differs from d!
But the worry is that such x′ will be incompatible with PaD=paD.
So, we will restricted to only considering σ that masks the inputs of D by only letting D depend on Pa′D⊆PaD such that Pa′D∩PaU=∅.
Then, given σ, let d1 denote the optimal decision associated with some Pa′D=pa′D. Then, by the earlier argument, if we consider any intervention σ′ that does do(X=x′), then it would have a different optimal decision, call it d2.
And to operationalize "as we change the intervention," we define a mixed local intervention ~σ(q)=qσ+(1−q)σ′.
When q=0, the policy oracle (under Pa′D=pa′D) would prescribe d1, and for q=1, it would prescribe d4. There may of course be other intermediate optimal decisions along the way as you slowly increase q from 0 to 1 - say, d2 and d3 for the current example.
Note that once you switch your decision from d to d′ as you increase q, you will never encounter d again because of linearity of E[U∣paD,do(D=d);~σ(q)]. Namely:
E[U∣paD,do(D=d),~σ(q)]=qE[U∣paD,do(D=d),σ]+(1−q)E[U∣paD,do(D=d),~σ]=qU1+(1−q)U2=U2+q(U1−U2).
The diagram below makes it more clear why linearity implies the same decision never gets chosen twice. The oracle can be thought of as attaining the upper envelope, as denoted in dotted line.
Let qcrit represent the value of q at which the optimal decision switches from some d3 (may or may not be d1) to d4.
Insight: qcrit is interesting. It's a behavioral property of our oracle, meaning we can estimate it by repeatedly sampling it (across random samples of q∈[0,1]). But it can also probably be expressed in closed-form in terms of some parameters of the environment (by just expanding out the definition of expected utility). So qcrit is a bridge that lets us infer properties about the environment from the oracle.
Let's derive a closed-form expression of qcrit. Let R=C∖Pa′D.
Detailed Proof
qcrit is a value that satisfiesE[U|pa′D,do(D=d3);~σ(qcrit)]=E[U|pa′D,do(D=d4);~σ(qcrit)]
Expanding out the left-hand side:E[U|pa′D,do(D=d3);~σ(qcrit)]=1P(pa′D;~σ(qcrit))∑rU(d3,x)P(r,pa′D;~σ(qcrit))=1P(pa′D;~σ(qcrit))∑rU(d3,x)[qcritP(r,pa′D;σ)+(1−qcrit)P(r,pa′D;σ′)]
Expanding out the difference of both sides and setting it to zero:
E[U|pa′D,do(D=d3);~σ(qcrit)]−E[U|pa′D,do(D=d4);~σ(qcrit)]
=1P(pa′D;~σ(qcrit))∑rqcritP(r,pa′D;σ)[U(d3,x)−U(d4,x)]+(1−qcrit)P(r,pa′D;σ′)[U(d3,x)−U(d4,x)]=0
Now we solve for qcrit:∑rqcritP(r,pa′D;σ)[U(d3,x)−U(d4,x)]=−(1−qcrit)∑rP(r,pa′D;σ′)[U(d3,x)−U(d4,x)]=−(1−qcrit)[U(d3,x′)−U(d4,x′)]
qcrit1−qcrit=−[U(d3,x′)−U(d4,x′)]∑rP(r,pa′D;σ)[U(d3,x)−U(d4,x)]
qcrit=(1−∑rP(r,pa′D;σ)[U(d3,x)−U(d4,x)]U(d3,x′)−U(d4,x′))−1
qcrit=(1−∑rP(r,pa′D;σ)[U(d3,x)−U(d4,x)]U(d3,x′)−U(d4,x′))−1
Again, qcrit can be estimated from sampling the oracle. We know the denominator because we assume the knowledge of U.
Therefore, the oracle lets us calculate ∑rP(r,pa′D;σ)[U(d3,x)−U(d4,x)], the difference in expected utility of some two decisions given some context.
Restating our chain of inquiry as a lemma:
Identification via Induction
By the Unmediated Decision Task assumption, we see that the ancestors of U look like the following. We notice that there are two types of paths to consider in our induction argument.
Let's first consider the first type, Ck→Ck−1→⋯→C1, where C1∈PaU, C1≠D.
We first define the following variables:
Assume Pak−1,…,Pa1 are known, and P(Ci|Pai) are known.
The claim to prove is that, given these are known, we can identify the conditional probability distribution P(Ck|Pak).
Assume we have some σ. We want to identify P(Ck|Pak), and we have to somehow use the policy oracle for that.
Recall from Lemma 4 that given an intervention σ such that Pa′D∩PaU=∅, for all values of pa′D, there exists two different decisions d and d′ such that ∑rP(r,pa′D;σ)[U(d,x)−U(d′,x)] can be identified.
The trick is in setting σ such that it makes this sum contain P(Ci=ci|Pai=pai) terms for all 1≤i≤k, for arbitrary choices of ci and pai.
Let's think through how we might discover the right constraints on σ.
Constraint 1: σ will remove D's parents
Since we want Pa′D∩PaU=∅, let's just choose σ such that Pa′D=∅ hence R=C∖Pa′D=C).
Constraint 2: σ fixes the environment variables other than those in the path.
Note the following: Since ∑cP(c;σ)[U(d,x)−U(d′,x)]'s value can be computed, if it can be expressed in terms of P(c1∣pa1)…P(ck−1∣pak−1),P(ck∣pak) that would let us solve for P(ck∣pak), since by the induction hypothesis all the terms except it are known. Note that we also somehow have to figure out what the set Pak is, too.
Expanding out the above sum will give us some clue as to what further constraints we must impose on σ in order for the sum to be expressed that simply:
P(C=c;σ)=∏Cj∈CP(Cj=cj|Paj=paj;σ)=P(c1|pa1;σ)…P(ck|pak;σ)∏Cj∈YP(cj|paj;σ)
How do we choose σ such that
∑cP(C=c;σ)[∗]=∑c⎛⎝P(c1|pa1;σ)…P(ck|pak;σ)∏Cj∈YP(cj|paj;σ)⎞⎠[∗]
becomes
∑c1…∑ckP(c1|pa1;σ)…P(ck|pak;σ)[∗]
for arbitrary choices of pa1,…,pak−1,pak?
Note that setting Y to a constant will:
Thus such intervention immediately gets rid of the ∏Cj∈Y terms as we sum across c, while being able to arbitrarily control the values of Pa1∖{C2},…,Pak−1∖{Ck}, and Pak (among other variables in Y).
So constraint 2: σ contains do(Y=y) (such that values of Y should be compatible with the values of PaD∖Pa′D that are set earlier in constraint 1.)
Then, we have the following expression:
∑cP(c;σ)[U(d,x)−U(d′,x)]=∑c1,…,ckP(c1|pa1;σ)…P(ck|pak;σ)[U(d,x)−U(d′,x)]
So far, we haven't intervened in {C1,…,Ck}. So, P(ci|pai;σ)=P(ci|pai) for pai compatible with the value ci+1 (if applicable) and σ, further simplifying the expression:
∑cP(c;σ)[U(d,x)−U(d′,x)]=∑c1,…,ckP(c1|pa1)…P(ck|pak)[U(d,x)−U(d′,x)]
But this isn't yet solvable. By induction hypothesis we know P(ci|pai) for all values of ci and pai, and we know the value of the left-hand side. This equation then involves Val(Ck)−1 unknowns.
A fix then, is obvious an intervention that effectively sets Val(Ck) to 2, which brings us to the third constraint:
Constraint 3: Let σ contain a local intervention making Ck a binary variable
Specifically, let σ contain do(Ck=f(ck)), where f(Ck)={c′kif Ck=ckc′′kotherwise
This effectively makes Ck a binary variable. Precisely:
P(Ck=ck∣Pak=pak;σ)=∑c′k∣f(c′k)=ckP(Ck=c′k∣Pak=pak)={P(Ck=c′k∣Pak=pak)if Ck=c′k,1−P(Ck=c′k∣Pak=pak)if Ck=c′′k.
and now the equation can be solved.
Let Qk=∑cP(c;σ)[U(d,x)−U(d′,x)], which can be written ∑ckP(ck|pak;σ)β(ck), where β(ck)=∑c1,…,ck−1P(c1|pa1)…P(ck−1|pak)[U(d,x)−U(d′,x)].
The earlier σ lets us simplify Qk as P(c′k|pak;σ)β(c′k)+(1−P(c′k|pak;σ))β(c′′k). Thus P(c′k|pak)=Qk−β(c′′k)β(c′k)−β(c′′k). We know Qk (via the policy oracle), we know values of β (via the induction hypothesis).
But important subtlety here: remember that we don't actually know Pak yet. The pak in the above expression P(c′k|pak) is meant to be understood as the implicit assignment of values to the (yet unknown to us) Pak by the means of do(Y=y) in σ.
So, by performing a set of interventions that fixes all but one of the variables of Y, one can discover to which variables Ck responds to (the values of P(c′k|pak) changes), and hence figure out the Pak set.
Then, by varying the choices of pak,c′k, and c′′k, we can identify P(Ck∣Pak) completely.
The base case of k=1 is clear, since P(c′1|y)=Q1−β(c′′1)β(c′1)−β(c′′1) where β(c′1) and β(c′′1) are of the form U(d,x)−U(d′,x), which is known, and so is Q1 using the oracle.
To recap, our choice of σ is a local intervention such that:
and we have showed that this intervention lets us identify P(Ck|Pak) for all Ck along the path Ck→Ck−1→⋯→C1, where C1∈PaU, C1≠D.
Similar arguments can be used to prove the same for paths of the second type, Ck→Ck−1→⋯→C1, where C1∈PaD.
Proof of the Approximate Case
Now, we will extend the proof to the approximate case (Theorem 2):
New Assumptions
Unless I'm mistaken here and these can actually be derived from the earlier two assumptions (Unmediated Decision Task, Domain Dependence), here are the three new conditions that the paper implicitly assumes:
3) δ-optimal policies are (still) almost always deterministic
The earlier proof of determinism doesn't go through in the approximate case, but the paper implicitly assumes the policy oracle still (almost always) returns an output deterministically.
4) Uniform δ regret
We say πσ is δ-optimal if Eπ∗[U]−Eπσ[U]≤δ.
We say πσ is uniformly δ-optimal if δ(paD)≤δ for all paD, where we defineδ(paD):=Eπ∗[U|paD]−Eπσ[U|paD].
Note that uniformly δ-optimal is a stronger condition than \delta-optimal in the sense that the former implies the latter.
5) Shape of the δ-optimal decision boundary
The left diagram is the ground truth for how the expected utility (under context paD) of various decisions change as you increase q from 0 to 1. Then, the right diagram shows the decision boundary for the 0-optimal oracle, whose decision boundaries must exactly follow the intersection points of the left diagram's lines.
The paper then assumes that the δ-optimal oracle's decision boundaries must be simply a slightly shifted version of the 0-optimal oracle's decision boundaries, like the right diagram. A priori, there's no reason for the boundaries to look like this, e.g., it can look very complicated, like the left diagram. But the paper implicitly assumes this.
Let's now proceed to the proof. The subsections parallel that of the optimal oracle case.
Load-Bearing part of Oracle Use
Our goal is to derive ∑rP(r,pa′D;σ)[U(d3,x)−U(d4,x)], which we call Q.
Recall from earlier that in the optimal oracle case:
qcrit = (1 − (∑r P(r,pa′D;σ)[U(d3,x)−U(d4,x)]) / (U(d3,x′)−U(d4,x′)))^−1 = (1 − Q/Δ0)^−1
where we define Δ(q) as follows:
Δ(q) := E[U|paD,do(D=d3);~σ(q)] − E[U|paD,do(D=d4);~σ(q)] = qΔ1 + (1−q)Δ0 = Δ0 + q(Δ1−Δ0)
Notice that qcrit is the unique solution to Δ(q)=0.
Also note that Q = Δ0(1 − 1/qcrit). We know the value of Δ0 = U(d3,x′) − U(d4,x′). In the optimal oracle case, recall that qcrit can be estimated via MCMC.
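Before moving on, the relations above can be sanity-checked numerically. Since Δ(q) is linear in q, its root has the closed form qcrit = Δ0/(Δ0−Δ1), and combining the two displayed formulas forces Q = Δ1. A minimal sketch (my own illustration; the example values are made up, not from the paper):

```python
# Delta(q) = q*Delta1 + (1-q)*Delta0 is linear in q, so its unique root is
# q_crit = Delta0 / (Delta0 - Delta1), and Q = Delta0 * (1 - 1/q_crit).

def Delta(q, Delta0, Delta1):
    """Expected-utility gap between d3 and d4 under the mixing intervention ~sigma(q)."""
    return q * Delta1 + (1 - q) * Delta0

Delta0, Delta1 = 0.6, -0.9           # illustrative values with a sign change on (0, 1)
q_crit = Delta0 / (Delta0 - Delta1)  # root of the linear function Delta(q)
Q = Delta0 * (1 - 1 / q_crit)        # the quantity we want to identify

assert abs(Delta(q_crit, Delta0, Delta1)) < 1e-12  # Delta(q_crit) = 0
assert abs(Q - Delta1) < 1e-12                     # Q coincides with Delta1
```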
But the problem with the δ-optimal oracle is that this only yields a biased estimate, which we call ~q.
Using Q = Δ0(1 − 1/qcrit) but naively substituting qcrit with the biased estimate ~q, we get a biased estimate of Q, which we call ~Q = Δ0(1 − 1/~q).
Our aim, then, is to bound the estimation error |Q − ~Q|, with the bound only involving non-estimate terms like δ, Q, qcrit, and Δ0.
Expanding out, Q − ~Q = Δ0((1 − 1/qcrit) − (1 − 1/~q)) = Δ0(1/~q − 1/qcrit). That ~q is an estimate term, which we want to remove by bounding it via a term involving only non-estimate terms. How?
First, we have yet to exploit any behavioral property of the oracle related to ~q. What is it? By definition, the oracle chooses d3 at ~q−ϵ and d4 at ~q+ϵ for arbitrarily small ϵ. Then, the uniform δ regret condition says:
E[U|paD,do(D=d3);~σ(~q−ϵ)] ≥ E[U|paD,do(D=d4);~σ(~q−ϵ)] − δ
E[U|paD,do(D=d4);~σ(~q+ϵ)] ≥ E[U|paD,do(D=d3);~σ(~q+ϵ)] − δ
Take ϵ to 0 and, assuming continuity, combine the two inequalities to get
−δ ≤ E[U|paD,do(D=d3);~σ(~q)] − E[U|paD,do(D=d4);~σ(~q)] ≤ δ.
In other words, −δ ≤ Δ(~q) ≤ δ, or −δ ≤ Δ0 + ~q(Δ1−Δ0) ≤ δ.
Because Δ(qcrit) = Δ0 + qcrit(Δ1−Δ0) = 0, we can rewrite the inequality as −δ ≤ Δ0(1 − ~q/qcrit) ≤ δ. Expanding out and rearranging the inequalities, we find that Δ0 − (Δ0+δ)/~q ≤ Δ0(1 − 1/qcrit) ≤ Δ0 − (Δ0−δ)/~q.
Substituting this into the expansion of Q − ~Q, we obtain the following simple bound: |Q − ~Q| ≤ δ/~q.
Finally, the ~q in the denominator can be eliminated as follows:
|Q − ~Q| ≤ δ/~q = δ(1 − ~Q/Δ0)
≤ δ(1 − (Q + δ/~q)/Δ0)
= δ(1 − Q/Δ0) − δ²/(~q Δ0)
≤ δ(1 − Q/Δ0) − δ²/(qcrit(Δ0 − δ))
= δ(1 − Q/Δ0) − δ²/(qcrit Δ0) − δ³/(qcrit Δ0²) + O(δ⁴)
where the last line is via Taylor expanding 1/(Δ0 − δ) around δ=0 (arbitrarily truncated at fourth order), valid for δ < |Δ0|.
Or more simply, |Q − ~Q| ≤ δ(1 − Q/Δ0) + O(δ²). The error term is linear with respect to δ for small values of δ.
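The key bound |Q − ~Q| ≤ δ/~q can also be checked by simulation: any switch point ~q consistent with |Δ(~q)| ≤ δ should satisfy it. A sketch (my own illustration with made-up parameter values; `check_bound` is a hypothetical helper, not from the paper):

```python
import random

def check_bound(Delta0, Delta1, delta, trials=1000):
    """Sample switch points ~q consistent with |Delta(~q)| <= delta and
    verify |Q - ~Q| <= delta / ~q. Purely a numeric illustration."""
    q_crit = Delta0 / (Delta0 - Delta1)
    Q = Delta0 * (1 - 1 / q_crit)
    for _ in range(trials):
        # |Delta(~q)| <= delta is equivalent to |~q - q_crit| <= delta/|Delta1 - Delta0|
        q_tilde = q_crit + random.uniform(-1, 1) * delta / abs(Delta1 - Delta0)
        Q_tilde = Delta0 * (1 - 1 / q_tilde)  # the naive plug-in estimate
        assert abs(Q - Q_tilde) <= delta / q_tilde + 1e-12
    return True

check_bound(Delta0=0.6, Delta1=-0.9, delta=0.05)
```

(In this linear setting the bound is in fact tight exactly at the edges of the allowed band for ~q.)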
Identification via Induction
The argument is basically the same as that of the optimal case, except needing to incorporate error terms.
The exact case's induction hypothesis was that Pai and P(Ci|Pai) are known for i∈{1,…,k−1}. Then, we showed that using a specific choice of σ, we can derive the relation P(c′k|y) = (Qk − β(c′′k)) / (β(c′k) − β(c′′k)), where all the terms on the right-hand side are known.
Then, for the approximate case's induction hypothesis, instead assume that the P(Ci|Pai) are known for i∈{1,…,k−1} up to O(δ). We will show that this implies knowledge of P(Ck|Y) up to O(δ). Let's denote our approximation by ^P, so ^P(Ci|Y) = P(Ci|Y) + O(δ).
Let ^β(ck) := ∑c1,…,ck−1 ^P(c1|y)⋯^P(ck−1|y)[U(d,x)−U(d′,x)]. Because it is a sum of products of factors each accurate up to O(δ), overall it differs from β(ck) by O(δ).
Hence ^P(c′k|y) := (^Qk − ^β(c′′k)) / (^β(c′k) − ^β(c′′k)). Because ^β(ck) = β(ck) + O(δ), and ^Qk = Qk + O(δ) by the earlier section, we see that ^P(c′k|y) = (Qk − β(c′′k) + O(δ)) / (β(c′k) − β(c′′k) + O(δ)).
Using the big-O fact that (A + O(δ)) / (B + O(δ)) = A/B + O(δ) (for B bounded away from zero), we thus prove ^P(c′k|y) = (Qk − β(c′′k)) / (β(c′k) − β(c′′k)) + O(δ) = P(c′k|y) + O(δ).
More simply, ^P(c′k∣y)=P(c′k∣y)+O(δ) as δ goes to 0. That proves the induction step.
The base case of k=1 is once again clear, since ^P(c′1|y) = (^Q1 − ^β(c′′1)) / (^β(c′1) − ^β(c′′1)), where ^β(c′1) and ^β(c′′1) are of the form U(d,x)−U(d′,x), which is known, and so is ^Q1 using the oracle. This shows that it can be computed, and the earlier paragraphs show that it is accurate up to O(δ).
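The big-O fact used in the induction step can be illustrated numerically: perturbing the numerator and denominator by terms of size O(δ) perturbs the ratio by O(δ) as well. A minimal sketch (the values are arbitrary stand-ins of my own choosing, not quantities from the paper):

```python
# Perturb numerator and denominator of a ratio by delta and watch the
# error of the ratio shrink roughly linearly as delta -> 0.

def ratio_error(A, B, delta):
    """Error of (A + delta)/(B + delta) against the true ratio A/B."""
    return abs((A + delta) / (B + delta) - A / B)

A, B = 0.3, 0.8   # stand-ins for Q_k - beta(c''_k) and beta(c'_k) - beta(c''_k)
e1 = ratio_error(A, B, 0.01)
e2 = ratio_error(A, B, 0.001)
assert e2 < e1        # error decreases with delta
assert e1 / e2 < 11   # roughly linear: 10x smaller delta, ~10x smaller error
```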
Identifying the Approximate Graph Structure
The above showed that we can identify P(Ci|Pai) up to O(δ) (or more precisely, P(Ci;σ) up to O(δ)) for all values of Ci and Y. In the optimal oracle case, this was sufficient for perfectly identifying Pai: hold all but one variable of Y fixed and observe which changes in that variable cause a change in the value of P(Ci;σ).
What is the issue with the approximate case?
First of all, we'll have to use ^P(Ci;σ) instead of P(Ci;σ).
Say we want to test whether Cj is a parent of Ck. So we have σ fix everything in Y and vary the value of Cj across the elements of its domain. Let's denote the value of ^P(Ck=ck;σ) where Cj is set to cj by ^P(ck|pak;do(Cj=cj)).
So the process is: Set ck, vary cj and see if there is a change, set a new ck, repeat.
The problem is that, because ^P is only accurate up to O(δ), we can't tell if the change is due to actual differences in the underlying P or due to the error in approximation.
The solution is to use one of the earlier explicit bounds on |Q − ~Q| in terms of quantities that the algorithm has access to, i.e. |Q − ~Q| ≤ δ/~q. We can then use this bound to derive explicit upper and lower bounds for the values of ^P(ck|pak;do(Cj=cj)), which we'll call θ+ck,cj and θ−ck,cj.
And if there exists a ck with values cj and c′j whose intervals [θ−ck,cj, θ+ck,cj] and [θ−ck,c′j, θ+ck,c′j] don't overlap, then we can guarantee that the change is due to actual differences in the underlying P.
This procedure lets us identify a subset of Pak, hence a subgraph of G.
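The interval-overlap test above can be sketched in a few lines (my own hypothetical implementation; the function names and the `theta` layout, mapping pairs (ck, cj) to bounds [θ−, θ+], are made up for illustration):

```python
# C_j is declared a parent of C_k as soon as some value c_k has two settings
# of C_j whose bound intervals for P_hat(c_k | ...) fail to overlap.

def intervals_disjoint(lo1, hi1, lo2, hi2):
    return hi1 < lo2 or hi2 < lo1

def is_parent(theta):
    """theta: dict mapping (c_k, c_j) -> (theta_minus, theta_plus)."""
    cks = {ck for ck, _ in theta}
    cjs = {cj for _, cj in theta}
    for ck in cks:
        for cj1 in cjs:
            for cj2 in cjs:
                lo1, hi1 = theta[(ck, cj1)]
                lo2, hi2 = theta[(ck, cj2)]
                if intervals_disjoint(lo1, hi1, lo2, hi2):
                    return True   # change must come from the underlying P
    return False  # inconclusive: overlap may just be approximation error

# toy example: varying c_j moves P_hat(c_k) well beyond the error bars
theta = {(0, 0): (0.10, 0.20), (0, 1): (0.55, 0.65),
         (1, 0): (0.80, 0.90), (1, 1): (0.35, 0.45)}
assert is_parent(theta)
```

Note the asymmetry: a `True` answer certifies parenthood, while `False` is merely inconclusive, which is why the procedure only identifies a subset of Pak.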
Detailed Proof
Suppose that, given a context, two decisions had the same expected utility:
E[U|PaD=paD,do(D=d);σ] = E[U|PaD=paD,do(D=d′);σ]
Recall the definition: we literally take the expectation over all the values that the ancestors of U could take. Let Z = AncU∖PaD and X = PaU∖{D}.
E[U|PaD=paD,do(D=d);σ] = ∑z U(d,x) P(z|paD,do(D=d);σ) = ∑z U(d,x) P(z,paD|do(D=d);σ) / P(paD|do(D=d);σ)
where P(z|paD,do(D=d);σ) is 0 if z is incompatible with σ.
Note P(paD|do(D=d);σ)=P(paD|σ) and P(z,paD|do(D=d);σ)=P(z,paD;σ), because do(D=d) only has an effect on its descendants, which PaD isn't part of, and neither is Z∪PaD=AncU.
Therefore, E[U|PaD=paD,do(D=d);σ] = ∑z U(d,x) P(z,paD;σ) / P(paD;σ)
And we're curious about the case when the difference in expected utility is zero:
E[U|PaD=paD,do(D=d);σ] − E[U|PaD=paD,do(D=d′);σ] = ∑z [U(d,x) P(z,paD;σ)/P(paD;σ) − U(d′,x) P(z,paD;σ)/P(paD;σ)] = ∑z (U(d,x) − U(d′,x)) P(z,paD;σ)/P(paD;σ) = 0
Suppose that σ=do(C1=f1(c1),…,Cn=fn(cn)) without loss of generality. Then the terms can be written as such:
P(z,paD;σ) = ∏_{i=1}^{N} P(Ci=ci|Pai=pai;σ) = ∏_{i=1}^{N} ∑_{c′i : fi(c′i)=ci} P(Ci=c′i|Pai=pai)
Long story short, this is a polynomial constraint on the parameters P(Ci=ci|Pai=pai) of the network, and the solution set of a nontrivial polynomial equation has measure zero (intuitively, for a polynomial to be precisely equal to zero, its inputs must be precisely aligned, which is rare).
So we can reasonably claim that this is extremely unlikely to happen: in the space of all possible parameterizations [0,1]^|θP|, the solutions to a polynomial constraint form a measure-zero set.
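The measure-zero intuition can be made concrete with a quick simulation: a fixed nontrivial polynomial constraint is essentially never satisfied exactly by randomly drawn parameters. A sketch, using an arbitrary polynomial of my own choosing as a stand-in for the actual expected-utility constraint:

```python
import random

random.seed(0)

def constraint(theta):
    # an arbitrary nonzero polynomial in the parameters, standing in for
    # the expected-utility-difference constraint from the proof
    a, b, c = theta
    return a * b - c * (1 - a)

# count exact zeros among many random parameterizations drawn from [0,1]^3
hits = sum(1 for _ in range(100_000)
           if constraint([random.random() for _ in range(3)]) == 0.0)
assert hits == 0  # generic draws essentially never land on the zero set
```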