If we apply this to the shutdown problem, is it acceptable to say:
If not, what would you set to? (I'm treating and as reward functions here which seems fine)
For policies/actions that don't affect the probability of humans pressing the button, .
For actions that do affect the probability a little bit, the effect of will be to undo this, by, for instance, slightly increasing the probability of given the button was pressed.
I'm not completely sure what multiple actions with large changes of probability would lead to (in expectation, nothing, but in actual fact...)
Hmm... I'm finding that I'm unable to write down a simple shutdown problem in this framework (e.g. an environment where it should switch between maximizing paperclips and shutting down) to analyze what this algorithm does. To know what the algorithm does, I need to know what and are (since these are parameters of the algorithm). From those I can derive and to determine the agent's action. But at the moment I have no way of proceeding, since I don't know what and are. Can you get me unstuck?
Suppose the humans have already decided whether to press the shutdown or order the AI to maximise paperclips. If is the observation of the shutdown command and the observation of the paperclip maximising command, and and the relevant utilities, then can be defined as and , for all histories .
Then define as the probability of versus , conditional on the fact that the agent follows a particular deterministic policy .
If the agent does indeed follow , then . If it varies from this policy, then is altered in proportion to the expected change in caused by choosing a different action.
Because of the specific time restriction, there is no way to randomise the outcome ahead of time. And because it’s assumed tied to a specific physical event, there is no way to influence it at all. The whole physical definition and apparatus serve the purpose of making biasing the only way to affect the result.
Hmm, I don't understand. Of course it is possible to influence the button push without biasing it (e.g. create a robot that flips a coin and then pushes or doesn't push the button). And of course it's not possible to influence the quantum event in any way (including by biasing it). So I don't see any event that can't be influenced in any way except by biasing it.
The way I'm using the term, unbiased influence involves replacing the stochastic event with another one that has same mean. But since (or if) the quantum event is specifically defined in the process, this can't be done.
Let me ask a more specific question. In your setup with the quantum event and the button , can you define the event such that:
Clearly, and , so I don't know what is. (I interpreted you as saying there is such an ; let me know if this is incorrect)
You are correct and I'm wrong. The causal counterfactual is unbiased and uninfluenceable. The evidential counterfactual is both biased and influenceable. I'll correct the post.
A putative new idea for AI control; index here.
When dealing with the problem of bias, I stumbled upon what I believe is the best way of getting value indifference, one that solves almost all of the problems with the previous methods.
The problems with the old setups
My initial setup has a counterfactual which I'll call a causal counterfactual, following Jessica's terminology. Her whole post is a good way of illustrating what is happening. A button is due to be pressed or not pressed, by a human; this pressing will change the agent's utility function.
There are low-probability quantum events that will artificially depress the button or prevent it from being pressed. The agent only cares about those low-probability events, which allows us to set up the formalism without worrying about changes in the probability of the human pressing the button. Then the events A="the human presses the button" and A'="the quantum event presses the button" were taken to be interchangeable, as were B="the human doesn't press the button" and B'="the quantum event prevents the button from being pressed".
The main problem is that these events are not really the same. So the agent will behave as if certain events are impossible (such as the human genuinely pressing the button), and is likely to face human resistance if the human tries to press the button and the button doesn't press due to quantum events, or vice versa. This resistance will either lower the expected reward of the agent, or lead it to manipulate the humans' actions.
Benya has sketched out a problem with another approach to counterfactuals, which we can call "evidential counterfactuals". Here we're trying to fiddle with utilities to enforce that the agent treats the probability of A and B as being fixed (no A' and B' here).
The problem with that is that the agent can change the probabilities conditional on some rare event, but then behave as if the probabilities were still unchanged.
For instance, the agent could ensure that A only happens if it wins the lottery. And then it would behave, in effect, as if it had a large chance of winning the lottery (equal to the initial chance of A happening, before the agent took any action). That problem is illustrated here.
A final problem was that neither method allowed the agent to learn usefully: the methods did not distinguish between the agent merely learning that humans intended/didn't intend to press the button (which is presumably fine), versus the agent influencing the humans to do so (which isn't).
Solving the problem: prior modification
Here we'll try and solve the problem in the reward learning POMDP framework, but the solution is easy to generalise. Note that the changing of probabilities of A versus B is similar to biasing, so this approach will attempt to make the agent bias-indifferent in a sensible way.
To do this, take a standard reward learning agent, with posterior P and prior ˆP on the POMDP μ.
The agent will be modified as follows: it will be given a different prior ˆP′ (which will ultimately lead to a different posterior P′).
The priors ˆP′ and ˆP will start off equal on the empty history h0:
ˆP′(⋅∣h0)=ˆP(⋅∣h0).
Then ˆP′ will be defined inductively. Recall that action a is unbiased given history ht if ˆP(⋅∣ht)=Eaμ[ˆP(⋅∣ht+1)∣ht]. Define the bias of a as:
B(ht,a)=ˆP′(⋅∣ht)−Eaμ[ˆP(⋅∣ht+1)∣ht].
Thus B(ht,a) is the bias of action a, as measured between ˆP′ and the expectation of ˆP. This bias is used as a corrective term to ˆP′, to make the agent suitably indifferent to biasing actions.
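As a rough illustration, the bias can be computed directly when the possible next histories are finite. The sketch below assumes the bias is the current value of ˆP′ minus the expected next value of ˆP, which is how the sentence above describes it. The function names and the numbers are hypothetical, chosen only for the example.

```python
def expected_posterior(outcomes):
    """Probability-weighted mean of the possible next-step values of P-hat.

    `outcomes` is a list of (probability, distribution) pairs, one per
    possible history h_{t+1} reachable from h_t under action a.
    """
    n = len(outcomes[0][1])
    return [sum(p * q[i] for p, q in outcomes) for i in range(n)]


def bias(p_prime_now, outcomes):
    """Assumed form of B(h_t, a): P-hat-prime now, minus the expected next P-hat."""
    mean = expected_posterior(outcomes)
    return [now - m for now, m in zip(p_prime_now, mean)]


# Hypothetical two-reward example: entries are (P(R0), P(R1)).
p_prime_now = [0.5, 0.5]
unbiased_action = [(0.5, [0.8, 0.2]), (0.5, [0.2, 0.8])]
biasing_action = [(0.5, [0.9, 0.1]), (0.5, [0.5, 0.5])]

print(bias(p_prime_now, unbiased_action))  # approximately zero in each entry
print(bias(p_prime_now, biasing_action))   # pushes weight back from R0 to R1
```

The first action only spreads the posterior around its current mean, so its bias vanishes. The second shifts the expected posterior towards R0, and the bias vector records the correction needed to undo that shift.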
Indifference in this setting is defined by three criteria:
#. The agent cannot benefit from a biasing action, if the agent gets no further information about the correct reward.
#. The agent cannot benefit from a biasing action, if the agent expects to immediately get perfect information about what the correct reward is.
#. Given the above, ˆP′ maintains the distribution and structure of ˆP as much as possible.
The first criterion is implied by ˆP′(⋅∣ht)=Eaμ[ˆP′(⋅∣ht+1)∣ht] for all actions a, i.e. ˆP′ being unbiased. If there is no further information about the correct reward, then there is a single well-defined ˆP′(⋅∣ht+1), conditional on ht and a, and being unbiased means that this is equal to ˆP′(⋅∣ht).
Then, given that ˆP′ is unbiased, the second criterion simply means that ˆP′ remains a distribution over R (as the consequence of perfect information is just a weighted average of the `pure Ri' points). This seems obvious: what would ˆP′ be but a distribution over R? See the next sections for what this criterion really means.
Indifference for small biases
Let Sht,a={ˆP(⋅∣ht+1)∣ht,a} be the set of possible future values of ˆP (given the history ht and the action a). Note that Sht,a is a subset of the simplex ΔR, the set of probability distributions on R.
The bias B(ht,a) is `small' if, for all q∈Sht,a, q+B(ht,a) is also an element of the simplex ΔR.
In that case, ˆP′(⋅∣ht+1) is simply defined as ˆP(⋅∣ht+1)+B(ht,a). By assumption, this is an element of ΔR. The expectation of this expression is:
Eaμ[ˆP(⋅∣ht+1)+B(ht,a)∣ht]=Eaμ[ˆP(⋅∣ht+1)∣ht]+B(ht,a)=ˆP′(⋅∣ht).
Thus ˆP′ is unbiased.
And since this is simply a translation, it preserves the structure of ˆP, so it satisfies all the criteria above.
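The small-bias case can be checked numerically. The following is a minimal sketch with hypothetical numbers and helper names: it shifts every possible next value of ˆP by the bias, confirms each shifted point is still a distribution, and confirms the shifted points average back to the current ˆP′.

```python
def mean(outcomes):
    """Probability-weighted mean of a list of (probability, distribution) pairs."""
    n = len(outcomes[0][1])
    return [sum(p * q[i] for p, q in outcomes) for i in range(n)]


def in_simplex(q, tol=1e-9):
    """True if q is (numerically) a probability distribution."""
    return all(x >= -tol for x in q) and abs(sum(q) - 1.0) <= tol


def translate(outcomes, b):
    """Shift every possible next value of P-hat by the bias vector b."""
    return [(p, [x + bx for x, bx in zip(q, b)]) for p, q in outcomes]


# Hypothetical numbers: two rewards, entries are (P(R0), P(R1)).
p_prime_now = [0.5, 0.5]
outcomes = [(0.5, [0.9, 0.1]), (0.5, [0.5, 0.5])]

b = [now - m for now, m in zip(p_prime_now, mean(outcomes))]
shifted = translate(outcomes, b)

is_small = all(in_simplex(q) for _, q in shifted)
print(is_small)       # whether the small-bias case applies
print(mean(shifted))  # should match p_prime_now up to rounding
```

Because the same vector is added to every outcome, the differences between the possible posteriors are untouched, which is the sense in which the translation preserves the structure of ˆP.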
Indifference for large biases
If the bias is large, in that there exists a possible value of ˆP(⋅∣ht+1) with ˆP(⋅∣ht+1)+B(ht,a) not a point on ΔR, then we need to proceed differently.
As before, let Sht,a={ˆP(⋅∣ht+1)∣ht,a} be the set of possible future values of ˆP (given the history ht and the action a), and for q∈Sht,a, let p(q) be the probability of q, given ht and a.
Then we want to replace q with τ(q), where τ(q) is `as close to' q+B(ht,a) as possible. Since ΔR embeds in ℝ^R, the Euclidean metric ||⋅|| on the latter restricts to the former.
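Thus consider the constrained optimisation problem for b, where τ(q)=q+b(q):
minimise ∑q∈Sht,a p(q)||b(q)−B(ht,a)||², subject to τ(q)∈ΔR for all q∈Sht,a and ∑q∈Sht,a p(q)τ(q)=ˆP′(⋅∣ht).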
Then define ˆP′(⋅∣ht+1) as τ(ˆP(⋅∣ht+1)).
If we see ˆP(⋅∣ht+1) and ˆP′(⋅∣ht+1) as random variables dependent on ht and a, the optimisation problem is the same as saying that ˆP′ is bias-free while ||ˆP(⋅∣ht+1)−ˆP′(⋅∣ht+1)|| has minimised variance.
The constraints are not contradictory: for instance τ(q)=ˆP′(⋅∣ht) will satisfy them. In fact they are all affine constraints. Then there must exist a unique set of elements τ(q) that minimise the strictly convex quadratic function.
And obviously, if q+B(ht,a) is always in ΔR, then τ(q)=q+B(ht,a) is the optimal solution, so this optimisation reproduces the `small biases' case.
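For two rewards the simplex is just the interval [0,1] (the weight on R0), and the optimisation can be sketched in a few lines. This assumes the objective is the p(q)-weighted squared distance between τ(q) and q+B(ht,a), with the constraints that each τ(q) is a distribution and that the τ(q) average to ˆP′(⋅∣ht). That is one reading of the problem, not necessarily the exact formulation intended. Under that reading the optimum is a common shift of every q+B followed by clipping to [0,1], and the shift can be found by bisection. All names and numbers are hypothetical.

```python
def clip01(x):
    """Project a scalar onto the interval [0, 1]."""
    return min(1.0, max(0.0, x))


def corrected_posteriors(probs, qs, target, iters=200):
    """Two-reward case: each distribution is summarised by its weight on R0.

    probs[i] = p(q_i), qs[i] = weight q_i puts on R0, target = weight that
    P-hat-prime(. | h_t) puts on R0.  Returns the list of tau(q_i) weights.
    Assumes every probs[i] > 0 and that probs sums to 1.
    """
    mean_q = sum(p * q for p, q in zip(probs, qs))
    b = target - mean_q              # assumed B(h_t, a), R0 component
    ideal = [q + b for q in qs]      # q + B, possibly outside [0, 1]

    def tau(shift):
        return [clip01(c + shift) for c in ideal]

    def mean_tau(shift):
        return sum(p * t for p, t in zip(probs, tau(shift)))

    # mean_tau is non-decreasing in shift, so bisect for mean_tau == target.
    lo, hi = -2.0, 2.0
    for _ in range(iters):
        mid = (lo + hi) / 2
        if mean_tau(mid) < target:
            lo = mid
        else:
            hi = mid
    return tau((lo + hi) / 2)


# Large bias: q + B would put weight 1.3 on R0 in the first outcome.
print(corrected_posteriors([0.2, 0.8], [1.0, 0.0], 0.5))
# Small bias: the result is approximately the plain translation q + B.
print(corrected_posteriors([0.5, 0.5], [0.9, 0.5], 0.5))
```

In the first call the outcome that would leave the simplex is clipped to the boundary and the other outcome absorbs the remaining correction, so the average still equals the target. With more than two rewards the same problem needs a general quadratic-programming solver rather than this one-dimensional shortcut.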
The final values
This alternate prior leads to an alternate posterior P′, simply defined by having it equal to ˆP′ on complete histories: P′(⋅∣hm)=ˆP′(⋅∣hm).
Another alternative
It should be noted that if we're willing to drop the condition `The agent cannot benefit from a biasing action, if the agent expects to immediately get perfect information about what the correct reward is', then there's a simpler solution: simply always define ˆP′(⋅∣ht+1) as ˆP(⋅∣ht+1)+B(ht,a), applying the solution for small biases to large biases.
This means that ˆP′ (and ultimately P′) need not be elements of ΔR. However, P′ can still define a reward the agent can optimise, in the following sense:
R′=∑Ri∈R P′(Ri∣hm)Ri.
Since P′ need not be in ΔR, some of these coefficients can be negative, but that still results in a consistent R′ to maximise.
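To see why negative coefficients cause no difficulty, here is a small sketch that reads `define a reward' as taking the P′-weighted sum of the individual rewards. That reading is an assumption, and the two toy reward functions and the dictionary-style history are hypothetical, borrowed loosely from the paperclip/shutdown setting.

```python
def combined_reward(weights, rewards):
    """Return R'(h) = sum_i weights[i] * rewards[i](h).

    The weights play the role of P'(R_i | h_m); they are assumed to sum
    to 1 but are not required to lie in [0, 1].
    """
    def r_prime(history):
        return sum(w * r(history) for w, r in zip(weights, rewards))
    return r_prime


# Hypothetical rewards on a toy "history" represented as a dict.
def paperclips(history):
    return float(history["paperclips"])


def shutdown(history):
    return 1.0 if history["off"] else 0.0


r_prime = combined_reward([1.3, -0.3], [paperclips, shutdown])
print(r_prime({"paperclips": 10, "off": False}))
print(r_prime({"paperclips": 10, "off": True}))
```

The result is an ordinary real-valued function of the history, so it can be maximised like any other reward. A negative weight just means the agent is penalised, rather than rewarded, along that component.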
Properties of the approach
It's clear the agent is indifferent to bias, but notice that this doesn't prevent the agent from learning: once it gets an observation, ˆP′ can change significantly. It's just changes to its expectation that are controlled.
Notice also that the agent doesn't believe, or act as if it believed, anything unlikely: its bets will be accurate.
And it doesn't have the problem of lotteries. Assume that the agent has ˆP(R0)=ˆP(R1)=0.5, and there is a lottery which the agent has one chance in a million of winning.
Then if it takes action a which ensures that P chooses R0 if and only if it wins the lottery, then with probability 10⁻⁶ it ends up with reward function R0 and a won lottery, and with probability 1−10⁻⁶ it ends up with reward function R0⋅(1−2⋅10⁻⁶)/(2(1−10⁻⁶))+R1⋅1/(2(1−10⁻⁶)) and a lost lottery. The expected reward function is still R0(0.5)+R1(0.5); it has simply split this expectation differently across worlds where it's won or not won the lottery.
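The arithmetic here can be verified exactly. The sketch below reads the lost-lottery mixture as weight (1−2⋅10⁻⁶)/(2(1−10⁻⁶)) on R0 and weight 1/(2(1−10⁻⁶)) on R1, and uses exact rationals to avoid rounding. The variable names are illustrative only.

```python
from fractions import Fraction

p_win = Fraction(1, 10**6)
p_lose = 1 - p_win

# After a win the reward function is R0 outright: weights (1, 0) on (R0, R1).
win = (Fraction(1), Fraction(0))
# After a loss, the mixture weights on (R0, R1) as read from the text.
lose = ((1 - 2 * p_win) / (2 * p_lose), 1 / (2 * p_lose))

# Expected weights over both worlds.
expected = tuple(p_win * w + p_lose * l for w, l in zip(win, lose))
print(expected)  # (Fraction(1, 2), Fraction(1, 2))
```

The lost-lottery weights sum to one and both lie in [0,1], so the agent holds a genuine distribution over rewards in each world while the overall expectation stays at one half each.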
One thing that this approach doesn't solve is the issue of the agent not following the exact reward function the humans want it to follow, due to accumulated bias. But first note that this will typically encourage the agent not to bias its reward learning, as it will tend to get higher reward when the humans agree with the agent's reward function. Note secondly that even if the agent manipulates the human values, at the end, to agree with its own, this manipulation, in expectation, simply undoes previous manipulations the agent has done (which caused the biasing in the first place).
Those who find this still unsatisfactory can wait for the next post, where the agent is not simply indifferent to biasing actions, but is penalised for them.
Indifference and bias
Why has indifference been connected with bias, rather than the more general influence? Simply because the evidential counterfactual has problems with bias, meaning that this needs to be corrected first (the causal counterfactual is unbiased and uninfluenceable).
Indeed, we can generalise this solution to the influence problem, where it becomes the counterfactual approach (which I used to call stratification, before I realised what it was). See subsequent posts for this.