
mavant comments on Predicted corrigibility: pareto improvements - Less Wrong Discussion

5 Post author: Stuart_Armstrong 18 August 2015 11:02AM



Comment author: mavant 23 August 2015 07:46:14PM 0 points

Third obvious possibility: B maximises u ∼ Σ p_i v_i, subject to the constraints E(Σ p_i v_i|B) ≥ E(Σ p_i v_i|A) and E(u|B) ≥ E(u|A), where ∼ is some simple combining operation like addition or multiplication, or "the product of the two terms divided by their sum".

I think these possibilities all share the problem that the constraint makes it essentially impossible to choose any action other than the one A would have chosen. If A chose the action that maximises u, then B cannot choose any other action while satisfying the constraint E(u|B) ≥ E(u|A), unless multiple actions had exactly the same payoff (which seems unlikely if payoffs are distributed over the reals rather than over a finite set). And the first possibility (maximise u while respecting E(Σ p_i v_i|B) ≥ E(Σ p_i v_i|A)) just results in choosing exactly the action A would have chosen, even if another action has an identical E(u) and a higher E(Σ p_i v_i).
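The uniqueness point can be illustrated with a toy decision problem (the payoff numbers below are made up for illustration, not from the post): with generic real-valued payoffs the u-maximising action is unique, so the constraint E(u|B) ≥ E(u|A) leaves B no choice but A's own action, however much better some other action does on the v terms.

```python
# Hypothetical action set with real-valued payoffs: "u" is the original
# utility, "v" stands in for the combined E(sum p_i v_i) term.
actions = {
    "a1": {"u": 3.2, "v": 0.1},
    "a2": {"u": 2.7, "v": 5.0},  # much better on v, slightly worse on u
    "a3": {"u": 1.0, "v": 2.0},
}

# Agent A simply maximises u.
choice_A = max(actions, key=lambda a: actions[a]["u"])
bound = actions[choice_A]["u"]

# Actions available to B under the constraint E(u|B) >= E(u|A):
feasible_for_B = [a for a in actions if actions[a]["u"] >= bound]

print(choice_A)        # 'a1'
print(feasible_for_B)  # ['a1'] -- only A's own action meets the constraint
```

Because the maximiser of u is unique, the feasible set for B collapses to a single point; only an exact tie in u would give B any slack.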

Comment author: Stuart_Armstrong 24 August 2015 10:29:04AM 0 points

I think these possibilities all share the problem that the constraint makes it essentially impossible to choose any action other than what A would have chosen.

I see I've miscommunicated the central idea. Let U be the proposition "the agent will remain a u maximiser forever". Agent A acts as if P(U)=1 (see the entry on value learning). In reality, P(U) is probably very low. So A is a u-maximiser, but a u-maximiser that acts on false beliefs.

Agent B is allowed to have a better estimate of P(U). Therefore it can find actions that increase u beyond what A would do.

Example: u values rubies deposited in the bank. A will just collect rubies until it can't carry them any more, then go deposit them in the bank. B, knowing that u will change to something else before A has finished collecting rubies, rushes to the bank ahead of that deadline. So E(u|B) > E(u|A).
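A minimal simulation of this example (the capacity, deadline, and one-action-per-step mechanics below are assumptions for illustration, not from the post): only rubies already in the bank when u is replaced count, A acts as if the change never comes, and B deposits just before the deadline.

```python
# Toy model of the rubies example. Each step the agent either collects one
# ruby or spends the step depositing everything it carries. Rubies banked
# after step t_change (when u is replaced) earn nothing.

def banked_rubies(plan, t_change):
    """Count rubies deposited in the bank before the utility change."""
    carried = banked = 0
    for t, action in enumerate(plan):
        if t >= t_change:
            break  # u has already been replaced; later deposits don't count
        if action == "collect":
            carried += 1
        elif action == "deposit":
            banked += carried
            carried = 0
    return banked

CAPACITY = 10
T_CHANGE = 6  # true step at which u changes; A acts as if it never does

# A acts on P(U)=1: collect to capacity, then deposit.
plan_A = ["collect"] * CAPACITY + ["deposit"]

# B knows the deadline: deposit one step before u changes.
plan_B = ["collect"] * (T_CHANGE - 1) + ["deposit"]

print(banked_rubies(plan_A, T_CHANGE))  # 0: A is still collecting at the deadline
print(banked_rubies(plan_B, T_CHANGE))  # 5: B banks its rubies in time
```

Under these assumptions E(u|B) > E(u|A), exactly because B's more accurate estimate of P(U) changes which plan is optimal for the very same u.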

And, of course, if B can strictly increase E(u), that gives it some slack to select other actions that increase E(Σ p_i v_i).