When dealing with the problem of bias, I stumbled upon what I believe is the best way of getting value indifference, one that solves almost all of the problems with the previous methods.
The problems with the old setups
My initial setup has a counterfactual which I'll call a causal counterfactual, following Jessica's terminology. Her whole post is a good way of illustrating what is happening. A button is due to be pressed or not pressed, by a human; this pressing will change the agent's utility function.
There are low-probability quantum events that will artificially depress the button or prevent it from being pressed. The agent only cares about those low-probability events, which allows us to set up the formalism without worrying about changes in the probability of the human pressing the button. Then the events A="the human presses the button" and A'="the quantum event presses the button" were taken to be interchangeable, as were B="the human doesn't press the button" and B'="the quantum event prevents the button from being pressed".
The main problem is that these events are not really the same. So the agent will behave as if certain events are impossible (such as the human genuinely pressing the button), and is likely to face human resistance if the human tries to press the button and the button isn't pressed due to quantum events, or vice versa. This resistance will either lower the agent's expected reward, or push it to manipulate the human's actions.
Benya has sketched out a problem with another approach to counterfactuals, which we can call "evidential counterfactuals". Here we're trying to fiddle with utilities to enforce that the agent treats the probability of A and B as being fixed (no A' and B' here).
The problem with that is that the agent can change the probabilities conditional on some rare event, but then behave as if the probabilities were still unchanged.
For instance, the agent could ensure that A only happens if it wins the lottery. And then it would behave, in effect, as if it had a large chance of winning the lottery (equal to the initial chance of A happening, before the agent took any action). That problem is illustrated here.
A final problem was that neither method allowed the agent to learn usefully: the methods did not distinguish between the agent merely learning that humans intended/didn't intend to press the button (which is presumably fine), versus the agent influencing the humans to do so (which isn't).
Solving the problem: prior modification
Here we'll try and solve the problem in the reward learning POMDP framework, but the solution is easy to generalise. Note that the changing of probabilities of A versus B is similar to biasing, so this approach will attempt to make the agent bias-indifferent in a sensible way.
To do this, take a standard reward learning agent, with posterior P and prior ˆP on the POMDP μ.
The agent will be modified as follows: it will be given a different prior ˆP′ (which will ultimately lead to a different posterior P′).
The priors ˆP′ and ˆP will start off equal on the empty history:
ˆP′(⋅∣∅)=ˆP(⋅∣∅).
Then ˆP′ will be defined inductively. Recall that action a is unbiased given history ht if ˆP(⋅∣ht)=Eaμ[ˆP(⋅∣ht+1)∣ht]. Define the bias of a as:
B(ht,a)=ˆP′(⋅∣ht)−Eaμ[ˆP(⋅∣ht+1)∣ht].
Thus B(ht,a) is the bias of action a, as measured between ˆP′ and the expectation of ˆP. This bias is used as a corrective term to ˆP′, to make the agent suitably indifferent to biasing actions.
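For concreteness, here is a minimal numpy sketch of this bias term, under the assumption that there are finitely many candidate rewards, so that distributions over R are just vectors; the names successor_dists and successor_probs are illustrative stand-ins for the conditional structure of μ, not part of the formalism:

```python
import numpy as np

def bias(p_prime_current, successor_dists, successor_probs):
    """B(h_t, a): gap between the modified prior at h_t and the
    mu-expectation of the original prior at h_{t+1} after action a.

    p_prime_current : hatP'(.|h_t), a vector over the candidate rewards R_i.
    successor_dists : possible values of hatP(.|h_{t+1}), one per row.
    successor_probs : probability of each successor value, given h_t and a.
    """
    expected_next = successor_probs @ successor_dists  # E^a_mu[hatP(.|h_{t+1}) | h_t]
    return p_prime_current - expected_next

# Illustrative numbers only: two candidate rewards R_0, R_1.
p_prime = np.array([0.5, 0.5])                # hatP'(.|h_t)
succ = np.array([[0.9, 0.1],                  # hatP(.|h_{t+1}) after one observation
                 [0.3, 0.7]])                 # ... after the other
probs = np.array([0.4, 0.6])
print(bias(p_prime, succ, probs))             # the corrective term used below
```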
Indifference in this setting is defined by three criteria:
1. The agent cannot benefit from a biasing action, if the agent gets no further information about the correct reward.
2. The agent cannot benefit from a biasing action, if the agent expects to immediately get perfect information about what the correct reward is.
3. Given the above, ˆP′ maintains the distribution and structure of ˆP as much as possible.
The first criterion is implied by ˆP′(⋅∣ht)=Eaμ[ˆP′(⋅∣ht+1)∣ht] for all actions a, i.e. ˆP′ being unbiased (since if there is no further information about the correct reward, then there is a single well-defined ˆP′(⋅∣ht+1), conditional on ht and a, and being unbiased means that this is equal to ˆP′(⋅∣ht)).
Then, given that ˆP′ is unbiased, the second criterion simply means that ˆP′ remains a distribution over R (as the consequence of perfect information is just a weighted average of the `pure Ri' points). Which seems obvious. What would ˆP′ be but a distribution over R? See the next sections for what this criterion really means.
Indifference for small biases
Let Sht,a={ˆP(⋅∣ht+1)∣ht,a} be the set of possible future values of ˆP (given the history ht and the action a). Note that Sht,a is a subset of the simplex ΔR, the set of probability distributions on R.
The bias B(ht,a) is `small' if, for all q∈Sht,a, q+B(ht,a) is also an element of the simplex ΔR.
In that case, ˆP′(⋅∣ht+1) is simply defined as ˆP(⋅∣ht+1)+B(ht,a). By assumption, this is an element of ΔR. The expectation of this expression is:
Eaμ[ˆP(⋅∣ht+1)∣ht]+B(ht,a)=ˆP′(⋅∣ht).
Thus ˆP′ is unbiased.
And since this is simply a translation, it preserves the structure of ˆP, so this satisfies all the criteria above.
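Continuing the numpy sketch above, the small-bias update is a pure translation; the assertion hedges against the shift leaving the simplex, which is the `large biases' case below:

```python
def update_small_bias(p_next, b):
    """hatP'(.|h_{t+1}) = hatP(.|h_{t+1}) + B(h_t, a); only valid when the
    shifted vector is still a probability distribution (the 'small bias' case)."""
    shifted = p_next + b
    # B sums to zero, so only non-negativity can fail here.
    assert np.all(shifted >= 0), "bias is not small: use the constrained optimisation below"
    return shifted

# E.g. the first successor from the sketch above, shifted by the bias of action a:
# update_small_bias(succ[0], bias(p_prime, succ, probs)) -> array([0.86, 0.14])
```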
Indifference for large biases
If the bias is large, in that there exists a possible value of ˆP(⋅∣ht+1) with ˆP(⋅∣ht+1)+B(ht,a) not a point on ΔR, then we need to proceed differently.
As before, let Sht,a={ˆP(⋅∣ht+1)∣ht,a} be the set of possible future values of ˆP (given the history ht and the action a), and for q∈Sht,a, let p(q) be the probability of q, given ht and a.
Then we want to replace q with τ(q), where τ(q) is `as close to' q+B(ht,a) as possible. Since ΔR embeds in the Euclidean space R^R, the Euclidean metric ||⋅|| on the latter restricts to the former.
Thus consider the following constrained optimisation problem for the τ(q):
Minimise ∑q∈Sht,a p(q)||τ(q)−(q+B(ht,a))||^2 subject to:
1. ∑q∈Sht,a p(q)τ(q)=ˆP′(⋅∣ht),
2. ∀q∈Sht,a: τ(q)∈ΔR.
Then define ˆP′(⋅∣ht+1) as τ(ˆP(⋅∣ht+1)).
If we see ˆP(⋅∣ht+1) and ˆP′(⋅∣ht+1) as random variables dependent on ht and a, the optimisation problem is the same as saying that ˆP′ is bias-free while ||ˆP(⋅∣ht+1)−ˆP′(⋅∣ht+1)|| has minimised variance.
The constraints are not contradictory: for instance, τ(q)=ˆP′(⋅∣ht) will satisfy them. In fact they are all affine constraints, so there must exist a unique set of elements τ(q) that minimise the strictly convex quadratic function.
And obviously, if q+B(ht,a) is always in ΔR, then τ(q)=q+B(ht,a) is the optimal solution, so this optimisation reproduces the `small biases' case.
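For concreteness, here is a sketch of this optimisation, continuing the finite-vector representation of the earlier snippets; it leans on scipy's general-purpose SLSQP solver rather than a dedicated quadratic-programming routine, and all function and variable names are illustrative rather than part of the formalism:

```python
import numpy as np
from scipy.optimize import minimize

def update_large_bias(p_prime_current, successor_dists, successor_probs):
    """Choose tau(q) minimising sum_q p(q) ||tau(q) - (q + B(h_t,a))||^2,
    subject to (1) sum_q p(q) tau(q) = hatP'(.|h_t) and (2) tau(q) in the simplex."""
    n_succ, n_rewards = successor_dists.shape
    b = p_prime_current - successor_probs @ successor_dists   # B(h_t, a)
    targets = successor_dists + b                              # the points q + B(h_t, a)

    def objective(x):
        tau = x.reshape(n_succ, n_rewards)
        return np.sum(successor_probs * np.sum((tau - targets) ** 2, axis=1))

    constraints = [
        # (1) unbiasedness of hatP': its expectation stays at hatP'(.|h_t)
        {"type": "eq",
         "fun": lambda x: successor_probs @ x.reshape(n_succ, n_rewards) - p_prime_current},
        # (2) each tau(q) sums to 1; non-negativity is handled by the bounds
        {"type": "eq",
         "fun": lambda x: x.reshape(n_succ, n_rewards).sum(axis=1) - 1.0},
    ]
    x0 = np.tile(p_prime_current, n_succ)   # feasible start: tau(q) = hatP'(.|h_t) for all q
    res = minimize(objective, x0, method="SLSQP",
                   bounds=[(0.0, 1.0)] * (n_succ * n_rewards),
                   constraints=constraints)
    return res.x.reshape(n_succ, n_rewards)  # row i is tau of the i-th possible hatP(.|h_{t+1})
```

The starting point x0 is exactly the feasible point τ(q)=ˆP′(⋅∣ht) mentioned above, so the solver always begins inside the constraint set.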
The final values
This alternate prior leads to an alternate posterior P′, simply defined by having it equal to ˆP′ on complete histories: P′(⋅∣hm)=ˆP′(⋅∣hm).
Another alternative
It should be noted that if we're willing to drop the condition `The agent cannot benefit from a biasing action, if the agent expects to immediately get perfect information about what the correct reward is', then there's a simpler solution: simply always define ˆP′(⋅∣ht+1) as ˆP(⋅∣ht+1)+B(ht,a), applying the solution for small biases to large biases.
This means that ˆP′ (and ultimately P′) need not be elements of ΔR. However, P′ can still define a reward the agent can optimise, in the following sense:
Given a complete history hm, the agent will maximise the reward R′=∑iRiP′(Ri∣hm).
Since P′ need not be in ΔR, some of these coefficients can be negative, but that still results in a consistent R′ to maximise.
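A minimal sketch of what that means computationally, assuming the candidate rewards Ri are available as functions of complete histories (names hypothetical):

```python
def final_reward(p_prime_final, candidate_rewards, history):
    """R' = sum_i P'(R_i | h_m) * R_i(h_m); the coefficients P'(R_i | h_m)
    may be negative and need not sum to one, but R' is still well defined."""
    return sum(w * r(history) for w, r in zip(p_prime_final, candidate_rewards))
```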
Properties of the approach
It's clear the agent is indifferent to bias, but notice that this doesn't prevent the agent from learning: once it gets an observation, ˆP′ can change significantly. It's just changes to its expectation that are controlled.
Notice also that the agent doesn't believe, or act as if it believed, anything unlikely: its bets will be accurate.
And it doesn't have the problem of lotteries. Assume that the agent has ˆP(R0)=ˆP(R1)=0.5, and there is a lottery which the agent has one chance in a million of winning.
Then if it takes action a which ensures that P chooses R0 if and only if it wins the lottery, then with probability 10^−6 it ends up with reward function R0 and a won lottery, and with probability 1−10^−6 it ends up with reward function ((1−2⋅10^−6)/(2(1−10^−6)))R0 + (1/(2(1−10^−6)))R1 and a lost lottery. The expected reward function is still 0.5⋅R0+0.5⋅R1; it has simply split this expectation differently across worlds where it's won or not won the lottery.
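A quick numerical check of that bookkeeping, with the probabilities and branch weights from the example above (a minimal, self-contained snippet):

```python
import numpy as np

p_win = 1e-6
win_branch  = np.array([1.0, 0.0])                   # weights on (R_0, R_1) if the lottery is won
lose_branch = np.array([(1 - 2 * p_win) / (2 * (1 - p_win)),
                        1 / (2 * (1 - p_win))])      # weights if the lottery is lost
expected = p_win * win_branch + (1 - p_win) * lose_branch
print(expected)                                      # -> [0.5 0.5], matching hatP(R_0) = hatP(R_1) = 0.5
```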
One thing that this approach doesn't solve is the issue of the agent not following the exact reward function the humans want it to follow, due to accumulated bias. But first note that this will typically encourage the agent not to bias its reward learning, as it will tend to get higher reward when the humans agree with the agent's reward function. Note secondly that even if the agent manipulates the human values, at the end, to agree with its own, this manipulation, in expectation, simply undoes previous manipulations the agent has done (which caused the biasing in the first place).
Those who find this still unsatisfactory can wait for the next post, where the agent is not simply indifferent to biasing actions, but is penalised for them.
Indifference and bias
Why has indifference been connected with bias, rather than the more general influence? Simply because the evidential counterfactual has problems with bias, meaning that that needs to be corrected first (the causal counterfactual is unbiased and uninfluenceable).
Indeed, we can generalise this solution to the influence problem, where it becomes the counterfactual approach (which I used to call stratification, before I realised what it was). See subsequent posts for this.