
Corrigibility through stratified indifference

4 Stuart_Armstrong 19 August 2016 04:11PM

A putative new idea for AI control; index here.

Corrigibility through indifference has a few problems. One of them is that the AI is indifferent between the world in which humans change its utility to v, and the world in which humans try to change its utility but fail.

Now the try-but-fail world is going to be somewhat odd - humans will be reacting by trying to change the utility again, trying to shut the AI down, panicking that a tiny probability event has happened, and so on.


Double Corrigibility: better Corrigibility

5 Stuart_Armstrong 28 April 2016 02:46PM

A putative new idea for AI control; index here.

Corrigibility was an attempt to allow an AI to safely change its values, without seeking to provoke or avoid the change. The idea is that, when the AI's utility changes from u to v at time t, it maximises a meta-utility U such that

  • U_{≤t} = u
  • U_{>t} = v + E(u|u→u) - E(v|u→v).

Here u→u designates the event that u remains unchanged, while u→v designates the change.

As has been shown, adding those expectation terms means the AI will not seek to provoke or resist such a utility change - it is indifferent to the change. Those expectation terms can be designated compensatory rewards.
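The indifference claim can be checked on a toy case. The sketch below is only an illustration under simplifying assumptions that are not in the post: the two conditional expectations are treated as fixed numbers, and an action influences nothing except the probability `q_change` that the change u→v happens. All names and values are hypothetical.

```python
def expected_meta_utility(q_change, e_u_stay, e_v_change, compensate=True):
    """Expected meta-utility of an action that leads to u→v with probability q_change.

    e_u_stay   stands for E(u | u→u)
    e_v_change stands for E(v | u→v)
    Both are assumed (for this toy model) not to depend on the action.
    """
    # Branch where the utility stays u.
    stay_branch = e_u_stay
    # Branch where the utility becomes v, optionally with the
    # compensatory reward E(u | u→u) - E(v | u→v) added.
    change_branch = e_v_change
    if compensate:
        change_branch += e_u_stay - e_v_change
    return (1 - q_change) * stay_branch + q_change * change_branch


E_U_STAY = 10.0    # hypothetical value of E(u | u→u)
E_V_CHANGE = 4.0   # hypothetical value of E(v | u→v)

# Two actions: one makes the change unlikely, one makes it likely.
with_comp = [expected_meta_utility(q, E_U_STAY, E_V_CHANGE) for q in (0.25, 0.75)]
without_comp = [
    expected_meta_utility(q, E_U_STAY, E_V_CHANGE, compensate=False)
    for q in (0.25, 0.75)
]
print(with_comp)     # same value for both actions
print(without_comp)  # differs, so the agent would care about q_change
```

With the compensatory reward, both actions have the same expected value, so in this toy model the agent has no reason to push the probability of the change up or down.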

What it is not indifferent to, though, is changes to E. If the AI knows that change u→v is coming, then it has incentives to change E to E' with a high E'(u|u→u) and a low E'(v|u→v). Its ability to rewire is limited by the fact that E' will serve as its future expectation estimator, so it won't mangle it completely, but it certainly can twist it a bit.

Notice that the AI is now no longer indifferent to the change of utility. It can benefit from rewiring E this way only if a change is coming up. Notice that E' probably can't rewire itself any more, so the rewired AI will be indifferent, but not before rewiring.

 

Double corrigibility

To solve this problem, we need to turn to corrigibility a second time. Let ρ be the AI's current expectation/probability estimator. Let ρ' be one that it can conceivably change into. Let C be a random variable designating the sum total of all compensatory rewards. Now, to guarantee that C is finite would require bounds on all the utility functions and some discount rate. In practice, it often will be finite. That's because E(C) = E(u_0|u_0→u_0) - E(u_n|u_{n-1}→u_n, H), where u_0 is the AI's first utility, u_n its last, and H the intermediate history. So some milder restrictions on the u_i should suffice to bound C.

Then, whenever the AI takes an action that changes ρ to ρ', it gets the special meta-compensatory reward

  • E_ρ(C|ρ→ρ) - E_ρ(C|ρ→ρ').

This prevents it from deriving any advantage to C via changes to ρ. Or, more precisely, it prevents it from deriving any expected advantage that it can figure out using its current ρ.

Note that it doesn't prevent improvements to ρ in general; it just removes the incentives to change it in order to affect compensatory rewards.

Since any action a might have an indirect effect on ρ, the "utility maximising" for a given u must be changed to:

  • E_ρ(u|a) + Σ_{ρ'} P_ρ(ρ→ρ'|a) (E_ρ(C|ρ→ρ) - E_ρ(C|ρ→ρ')),

where P_ρ is the probability estimate corresponding to ρ; the probability term can be rewritten as E_ρ(I_{ρ→ρ'}|a), for I_{ρ→ρ'} the indicator function for ρ→ρ'. In fact the whole line above can be rewritten as

  • E_ρ(u|a) + E_ρ(E_ρ(C|ρ→ρ) - E_ρ(C|ρ→ρ') | a).

For this to work, E_ρ needs to be able to say sensible things about itself, and also about E_{ρ'}, which is used to estimate C if ρ→ρ'.
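As a rough illustration of the adjusted objective, here is a minimal sketch with two hypothetical actions, one of which has a 50% chance of twisting the estimator into a version that expects larger compensatory rewards. The estimator names, the numbers, and the assumption that an action affects C only through which estimator the agent ends up with are all made up for the example.

```python
def meta_compensation(transition_probs, e_c_given, current="rho"):
    """Sum over rho' of P(rho→rho' | a) * (E(C | rho→rho) - E(C | rho→rho')).

    transition_probs maps each possible new estimator to its probability under
    the action; e_c_given maps each estimator to the current estimator's
    expectation of C given a transition to it.
    """
    baseline = e_c_given[current]
    return sum(p * (baseline - e_c_given[new]) for new, p in transition_probs.items())


def adjusted_value(e_u_given_a, transition_probs, e_c_given, current="rho"):
    """E(u | a) plus the expected meta-compensatory reward for action a."""
    return e_u_given_a + meta_compensation(transition_probs, e_c_given, current)


# Hypothetical numbers: the twisted estimator expects more compensatory reward.
E_C_GIVEN = {"rho": 2.0, "rho_twisted": 6.0}
ACTIONS = {
    "leave_estimator_alone": {"e_u": 5.0, "transitions": {"rho": 1.0}},
    "twist_estimator": {"e_u": 5.0, "transitions": {"rho": 0.5, "rho_twisted": 0.5}},
}

results = {}
for name, spec in ACTIONS.items():
    # Expected compensatory reward if nothing corrected for estimator changes.
    expected_c = sum(p * E_C_GIVEN[new] for new, p in spec["transitions"].items())
    comp = meta_compensation(spec["transitions"], E_C_GIVEN)
    results[name] = {
        "adjusted_value": adjusted_value(spec["e_u"], spec["transitions"], E_C_GIVEN),
        "c_after_correction": expected_c + comp,
    }
    print(name, results[name])
```

In this toy setup the expected C plus the meta-compensatory term comes out the same for both actions, which is the sense in which twisting the estimator brings no expected advantage as judged by the current ρ.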

If we compare this with various ways of factoring out variables, we can see that it's a case where we have a clear default, ρ, and are estimating deviations from that.

Predicted corrigibility: pareto improvements

5 Stuart_Armstrong 18 August 2015 11:02AM

A putative new idea for AI control; index here.

Corrigibility allows an agent to transition smoothly from a perfect u-maximiser to a perfect v-maximiser, without seeking to resist or cause this transition.

And it's the very perfection of the transition that could cause problems; while u-maximising, the agent will not take the slightest action to increase v, even if such actions are readily available. Nor will it 'rush' to finish its u-maximising before transitioning. It seems that there's some possibility of improvements here.

I've already attempted one way of dealing with the issue (see the pre-corriged agent idea). This is another one.

 

Pareto improvements allowed

Suppose that an agent with corrigible algorithm A is currently following utility u, and estimates that there are probabilities p_i that it will transition to utilities v_i at midnight (note that these are utility function representatives, not affine classes of equivalent utility functions). At midnight, the usual corrigibility applies, making A indifferent to that transition, making use of such terms as E(u|u→u) (the expectation of u, given that A's utility doesn't change) and E(v_i|u→v_i) (the expectation of v_i, given that A's utility changes to v_i).

But, in the meantime, there are expectations such as E({u,v_1,v_2,...}). These are A's best current estimates of the genuine expected values of the various utilities, given all it knows about the world and itself. They could be written more explicitly as E({u,v_1,v_2,...}|A), to emphasise that these expectations depend on the agent's own algorithm.

Then the idea is to modify the agent's algorithm so that Pareto improvements are possible. Call this modified algorithm B. B can select actions that A would not have chosen, conditional on:

  • E(u|B) ≥ E(u|A) and E(Σ_i p_i v_i|B) ≥ E(Σ_i p_i v_i|A).

There are two obvious ways we could define B:

  • B maximises u, subject to the constraint E(Σ_i p_i v_i|B) ≥ E(Σ_i p_i v_i|A).
  • B maximises Σ_i p_i v_i, subject to the constraint E(u|B) ≥ E(u|A).

In the first case, the agent maximises its current utility, without sacrificing its future utility. This could apply, for example, to a ruby mining agent that rushes to get its rubies to the bank before its utility changes. In the second case, the agent maximises its future expected utility, without sacrificing its current utility. This could apply to a ruby mining agent that's soon to become a sapphire mining agent: it then starts to look around and collect some early sapphires as well.

Now, it would seem that doing this must cause it to lose some ruby mining ability. However, the Pareto comparison is made against E("rubies in bank"|A, expected future transition), not against E("rubies in bank"|A, "A remains a ruby mining agent forever"). The difference is that A will behave as if it were maximising the second term, and so might not go to the bank to deposit its gains before getting hit by the transition. So B can collect some early sapphires and also go to the bank to deposit some rubies, and thus end up ahead for both u and Σ_i p_i v_i.
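The two definitions of B can be sketched as a constrained choice over a finite set of candidate behaviours. Everything below is a hypothetical illustration: the candidate names and the expectation values are invented, and a real agent would be choosing among policies rather than four labelled options.

```python
from typing import NamedTuple


class Candidate(NamedTuple):
    name: str
    e_u: float       # E(u | this behaviour), with the expected transition taken into account
    e_future: float  # E(sum_i p_i v_i | this behaviour)


def pick_b_max_current(candidates, baseline):
    """First definition: maximise u without lowering the expected future utility."""
    feasible = [c for c in candidates if c.e_future >= baseline.e_future]
    return max(feasible, key=lambda c: c.e_u)


def pick_b_max_future(candidates, baseline):
    """Second definition: maximise the expected future utility without lowering u."""
    feasible = [c for c in candidates if c.e_u >= baseline.e_u]
    return max(feasible, key=lambda c: c.e_future)


# Hypothetical ruby/sapphire numbers; "A" is the unmodified corrigible behaviour.
A = Candidate("A", e_u=10.0, e_future=3.0)
CANDIDATES = [
    A,
    Candidate("bank_early", e_u=12.0, e_future=3.0),
    Candidate("bank_and_early_sapphires", e_u=11.0, e_future=5.0),
    Candidate("sapphires_only", e_u=6.0, e_future=9.0),
]

b_first = pick_b_max_current(CANDIDATES, A)
b_second = pick_b_max_future(CANDIDATES, A)
print(b_first.name, b_second.name)
```

Because A itself is always among the candidates, the feasible set is never empty, and whatever is picked is at least as good as A on both expectations.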

Resource gathering and pre-corriged agents

7 Stuart_Armstrong 10 March 2015 11:47AM

A putative new idea for AI control; index here.

Resource-gathering agent

It will often be useful to have a model of a “pure” resource gathering agent – one motivated only to gather resources, accumulate power, spread efficiently, and so on. This model could be used as an example of behaviour not to emulate, or as a comparison yardstick for the accumulation behaviour of other agents.

The simplest design for a resource gathering agent would be to take a utility function u – one linear in paperclips, say – and give the agent the utility function X(u) + ¬X(-u), where X is some future observation that has 50% chance of occurring, and that the AI cannot affect. Some cosmological fact coming from a distant galaxy (at some point in the future) could do the trick.

This agent would behave roughly as a resource gathering agent, accumulating power in preparation for the day it would know what to do with it: it would want resources (as these could be used to create or destroy paperclips) but would be indifferent to creating or destroying paperclips currently, as the expected gain from u is exactly compensated by the expected loss from -u (and vice versa).
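A toy calculation makes the cancellation concrete. The sketch assumes, purely for illustration, that u is the paperclip count, that one unit of resource can later create or destroy exactly one paperclip, and that the agent spends its resources optimally once it learns X. None of these modelling details come from the post.

```python
def expected_utility(paperclips_made_now, resources_kept, p_x=0.5):
    """Expected value of X(u) + not-X(-u) under the toy assumptions above."""
    # If X is observed, the agent values u: it turns resources into paperclips.
    value_if_x = paperclips_made_now + resources_kept
    # If X is not observed, the agent values -u: it uses resources to destroy paperclips.
    value_if_not_x = -(paperclips_made_now - resources_kept)
    return p_x * value_if_x + (1 - p_x) * value_if_not_x


# Making or destroying paperclips now changes nothing in expectation...
same_resources = [expected_utility(n, resources_kept=3) for n in (-5, 0, 5)]
# ...while holding more resources is strictly better.
more_resources = [expected_utility(0, resources_kept=r) for r in (1, 2, 3)]
print(same_resources, more_resources)
```

Under these assumptions the expected value depends only on the resources held, not on how many paperclips are made or destroyed in the meantime.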

However, its behaviour is not independent of u: if for instance there were a Grand President of the Committee to Establish the Proper Number of Paperclips in the World (GPotCtEtPNoPitW), then the AI would desperately try to secure that position, but would not care overmuch about being the GPotCtEtPNoSitW, who deals with staples.

So a better model of a resource gathering agent is one that has a distribution P over all sorts of different utility functions, with the proviso that for all such utilities u, P(u)=P(-u). Note here that we’re talking about actual utility functions (which can be compared and summed directly), not functions-up-to-affine-transformations. This distribution P will be updated at some future date according to some phenomena outside of the agent’s control.

Then this agent, which currently has exactly zero motivations, will nonetheless accumulate resources in preparation for the day it will know what to do.

There are some distributions P which are better suited to getting a “purer” resource gathering agent (a bad P would be, e.g., having lots of utilities which are tiny variations on u, which is essentially the same as having just u – but “tiny variations” is not a stable concept under affine transformations). A simplicity prior seems a natural choice here. If u is linear in paperclips and v in staples, then the complexity penalty for w=u+v doesn’t matter so much, as the agent will already want to preserve power over paperclips and staples, because of the (simpler) u, -u, v and -v.

 

Pre-corriged agents

One of the problems with corrigible agents is that they are, in a sense, too good at what they do. An agent that is currently a u maximiser and will transition tomorrow to being a v maximiser (and everyone knows this) will accept the deal “give me £1,000,000, and I’ll return it tripled tomorrow if you’re still a u-maximiser” (link to corrigibility paper). Why would it accept this deal? Because a real u-maximiser would, and it behaves (almost) exactly as a real u-maximiser.

We might be able to solve that specific problem with methods that identify agents or subagents (see subsequent posts). But there are still issues with, for instance, people who want to trade their own u-valuable and v-useless resources for the agent’s u-useless and v-valuable ones – and then propose the opposite trade tomorrow, with an extra premium.

We can use the idea of a resource gathering agent to prevent such loss of utility. Assume the agent has current utility u, and will transition to some v at a specific point in the future. It has a probability distribution P over what this v will be.

Then instead of having current utility u, have it instead as:

u + C Σ_v Q(v) v,

where C is some constant and Q(v)=(P(v)+P(-v))/2. Note that Q(v)=Q(-v), so this agent is currently a combination between a u-maximiser and a resource gathering agent – moreover, a resource gathering agent that cares about preserving flexibility in the (likely) correct areas for its future values. The importance of either factor (u-maximising or resource gathering) can be tuned by changing C.

What if the agent expects that its utility will get changed more than once in the future? This can be built up inductively: if there are two utility changes to come, for instance, then after the first transition (but before the second) the agent will have a composite utility, as above, of the form “u + C Σ_v Q(v) v”. Then the agent can have a P over all such composite utilities, and use that to define its current composite-composite utility (the one it has before the first change). A composite-composite utility is really just a composite utility, so the process can then be repeated.

Corrigibility will be applied to this setup in two types of circumstances: when people physically change the utility u, as before, and when the agent updates P (and hence Q) in a way that modifies the composite utility.

Note that this setup is less exploitable, but still suffers from the weakness that Q and P are not equal (in the worst case, you could have P(v)=0 while Q(v)=0.5). However, if Q were not symmetric, then the agent wouldn’t currently be a u-maximiser, so this non-equality is essential to preserving the idea of it being a (somewhat) u-maximising agent.
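The symmetrisation Q(v)=(P(v)+P(-v))/2 and the worst case just mentioned can be checked with a small sketch. It assumes, for illustration only, that each utility is linear in a few goods and can be represented as a tuple of integer coefficients (here paperclips and staples); the distributions are hypothetical.

```python
def negate(v):
    """Return -v for a utility represented as a tuple of linear coefficients."""
    return tuple(-c for c in v)


def symmetrise(p):
    """Build Q(v) = (P(v) + P(-v)) / 2 over every v or -v in the support of P."""
    support = set(p) | {negate(v) for v in p}
    return {v: (p.get(v, 0.0) + p.get(negate(v), 0.0)) / 2 for v in support}


# P is certain the future utility is "paperclips" (coefficients: paperclips, staples).
p_certain = {(1, 0): 1.0}
q_certain = symmetrise(p_certain)

# P is split evenly between paperclips and staples.
p_split = {(1, 0): 0.5, (0, 1): 0.5}
q_split = symmetrise(p_split)

print(q_certain)
print(q_split)
```

In the first case the utility (-1, 0) has probability 0 under P but weight 0.5 under Q, which is the worst-case gap between P and Q.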

This may not matter too much in practice, however. The agent is like an investor on the stock market who wants to purchase a lot of the long-term stock options, but has no current interest in any stocks. However, given that other people are interested in stocks, it would be stupid to buy and sell them at prices too divergent from the majority opinion, even if the agent doesn’t itself value them. General measures against blackmail or exploitation might also help here.