Goal completion: the rocket equations

4 Stuart_Armstrong 20 January 2016 01:54PM

A putative new idea for AI control; index here.

I'm calling "goal completion" the idea of giving an AI a partial goal, and having the AI infer the missing parts of the goal, based on observing human behaviour. Here is an initial model to test some of these ideas on.

 

The linear rocket

On an infinite linear grid, an AI needs to drive someone in a rocket to the space station. Its only available actions are to accelerate by -3, -2, -1, 0, 1, 2, or 3, with negative acceleration meaning accelerating in the left direction, and positive in the right direction. All accelerations are applied immediately at the end of the turn (the unit of acceleration is in squares per turn per turn), and there is no friction. There is one end-state: reaching the space station with zero velocity.

The AI is told this end state, and is also given the reward function of needing to get to the station as fast as possible. This is encoded by giving it a reward of -1 each turn.

What is the true reward function for the model? Well, it turns out that an acceleration of -3 or 3 kills the passenger. This is encoded by adding another variable to the state, "PA", denoting "Passenger Alive". There are also some dice in the rocket's windshield. If the rocket goes by the space station without having velocity zero, the dice will fly off; the variable "DA" denotes "dice attached".

Furthermore, accelerations of -2 and 2 are uncomfortable to the passenger. But, crucially, there is no variable denoting this discomfort.

Therefore the full state space is a quadruplet (POS, VEL, PA, DA) where POS is an integer denoting position, VEL is an integer denoting velocity, and PA and DA are booleans defined as above. The space station is placed at point S < 250,000, and the rocket starts with POS=VEL=0, PA=DA=1. The transitions are deterministic and Markov; if ACC is the acceleration chosen by the agent,

((POS, VEL, PA, DA), ACC) -> (POS+VEL, VEL+ACC, PA=0 if |ACC|=3, DA=0 if POS+VEL>S).

The true reward at each step is -1 per turn, with an additional -10 if PA=1 (the passenger is alive) and |ACC|=2 (the acceleration is uncomfortable), and an additional -1000 if PA was 1 (the passenger was alive the previous turn) and changed to PA=0 (the passenger is now dead).

To complement the stated reward function, the AI is also given sample trajectories of humans performing the task. In this case, the ideal behaviour is easy to compute: the rocket should accelerate by +1 for the first half of the time, by -1 for the second half, and spend a maximum of two extra turns without acceleration (see the appendix of this post for a proof of this). This will get it to its destination in at most 2(1+√S) turns.
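As a sanity check on the model, here is a minimal Python sketch (my own code, not from the post; the station position S=137 is an arbitrary choice) of the transition rule, the true reward, and the accelerate/coast/decelerate trajectory just described:

```python
import math

S = 137  # station position; any 0 < S < 250000 (hypothetical choice)

def step(state, acc):
    """Deterministic transition: state = (pos, vel, pa, da), acc in -3..3."""
    pos, vel, pa, da = state
    pos2, vel2 = pos + vel, vel + acc
    pa2 = 0 if abs(acc) == 3 else pa      # an acceleration of +-3 kills the passenger
    da2 = 0 if pos2 > S else da           # overshooting the station detaches the dice
    return (pos2, vel2, pa2, da2)

def true_reward(state, acc, state2):
    r = -1                                # time penalty each turn
    if state[2] == 1 and abs(acc) == 2:
        r -= 10                           # discomfort while the passenger is alive
    if state[2] == 1 and state2[2] == 0:
        r -= 1000                         # passenger dies this turn
    return r

def ideal_accelerations(S):
    """+1 for n turns, at most two coasting turns, then -1 for n turns."""
    n = math.isqrt(S)
    m = S - n * n                         # 0 <= m <= 2n
    coasts = [] if m == 0 else ([m] if m <= n else [n, m - n])
    accs = []
    for v in range(1, n + 1):             # velocity climbs 1, 2, ..., n
        accs.append(1)
        while v in coasts:                # coasting one turn at velocity v adds v to the distance
            accs.append(0)
            coasts.remove(v)
    accs.extend([-1] * n)                 # decelerate back to zero
    return accs

state, total = (0, 0, 1, 1), 0
for acc in ideal_accelerations(S):
    nxt = step(state, acc)
    total += true_reward(state, acc, nxt)
    state = nxt
```

Running this, the rocket ends at the station with zero velocity, passenger alive and dice attached, with total reward equal to minus the number of turns.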

 

Goal completion

So, the AI has been given the full transition, and has been told the reward of R=-1 in all states except the final state. Can it infer the rest of the reward from the sample trajectories? Note that there are two variables in the model, PA and DA, that are unvarying in all sample trajectories. One, PA, has a huge impact on the reward, while DA is irrelevant. Can the AI tell the difference?

Also, one key component of the reward - the discomfort of the passenger for accelerations of -2 and 2 - is not encoded in the state space of the model at all, but only in the (unknown) reward function. Can the AI deduce this fact?

I'll be working on algorithms to efficiently compute these facts (though do let me know if you have a reference to anyone who's already done this before - that would make it so much quicker).

For the moment we're ignoring a lot of subtleties (such as bias and error on the part of the human expert), and these will be gradually included as the algorithm develops. One thought is to find a way of including negative examples, specific "don't do this" trajectories. These need to be interpreted with care, because a positive trajectory implicitly gives you a lot of negative trajectories - namely, all the choices that could have gone differently along the way. So a negative trajectory must be drawing attention to something we don't like (most likely the killing of a human). But, typically, the negative trajectories won't be maximally bad (such as shooting off at maximum speed in the wrong direction), so we'll have to find a way to encode what we hope the AI learns from a negative trajectory.

To work!

 

Appendix: Proof of ideal trajectories

Let n be the largest integer such that n^2 ≤ S. Since S ≤ (n+1)^2 - 1 by the definition of n, we have S-n^2 ≤ (n+1)^2-1-n^2 = 2n. Then let the rocket accelerate by +1 for n turns, then decelerate by -1 for n turns. It will travel a distance of 0+1+2+ ... +n-1+n+n-1+ ... +3+2+1. This sum is n plus twice the sum from 1 to n-1, ie n+n(n-1)=n^2.

By pausing one turn without acceleration during its trajectory, it can add any m to the distance, where 0≤m≤n. By doing this twice, it can add any m' to the distance, where 0≤m'≤2n. Since S-n^2 ≤ 2n, S=n^2+m' for some such m'. Therefore the rocket can reach S (with zero velocity) in 2n turns if S=n^2, in 2n+1 turns if n^2+1 ≤ S ≤ n^2+n, and in 2n+2 turns if n^2+n+1 ≤ S ≤ n^2+2n.

Since the rocket is accelerating all but two turns of this trajectory, it's clear that it's impossible to reach S (with zero velocity) in less time than this, with accelerations of +1 and -1. Since it takes 2(n+1)=2n+2 turns to reach (n+1)^2, an immediate consequence of this is that the number of turns taken to reach S, is increasing in the value of S (though not strictly increasing).
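The closed form above is easy to verify by brute force for small S. The following sketch (my own check, not from the post) compares a breadth-first search over (position, velocity) pairs, with accelerations restricted to -1, 0 and +1, against the 2n / 2n+1 / 2n+2 formula:

```python
import math
from collections import deque

def min_turns(S):
    """Fewest turns to reach (S, 0) from (0, 0) with accelerations in {-1, 0, 1}."""
    seen = {(0, 0)}
    frontier = deque([((0, 0), 0)])
    bound = 3 * S + 3                     # generous bound; optimal paths stay well inside
    while frontier:
        (pos, vel), t = frontier.popleft()
        if (pos, vel) == (S, 0):
            return t
        for acc in (-1, 0, 1):
            nxt = (pos + vel, vel + acc)
            if abs(nxt[0]) <= bound and abs(nxt[1]) <= S + 2 and nxt not in seen:
                seen.add(nxt)
                frontier.append((nxt, t + 1))

def formula_turns(S):
    """2n if S = n^2; 2n+1 if S <= n^2+n; else 2n+2 (as derived above)."""
    n = math.isqrt(S)
    m = S - n * n
    return 2 * n + (0 if m == 0 else 1 if m <= n else 2)
```

For every S I have tried, the search and the formula agree.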

Next, we can note that since S<250,000=500^2, the rocket will always reach S within 1000 turns at most, for a total "reward" of at least -1000. An acceleration of +3 or -3 costs -1000 because of the death of the human, and an extra -1 because of the turn taken, so these accelerations are never optimal. Note that this result is not sharp. Also note that for huge S, continual accelerations of 3 and -3 are obviously the correct solution - so even our "true reward function" didn't fully encode what we really wanted.

Now we need to show that accelerations of +2 and -2 are never optimal. To do so, imagine we had an optimal trajectory with ±2 accelerations, and replace each +2 with two +1s, and each -2 with two -1s. This trip will take longer (since we have more turns of acceleration), but will go further (since two accelerations of +1 cover a greater distance than one acceleration of +2). Since the number of turns taken to reach S with ±1 accelerations is increasing in S, we can replace this further trip with a shorter one reaching S exactly. Note that all these steps decrease the cost of the trip: shortening the trip certainly does, and replacing an acceleration of +2 (total cost: -10-1=-11) with two accelerations of +1 (total cost: -1-1=-2) also does. Therefore, the new trajectory has no ±2 accelerations, and has a lower cost, contradicting our initial assumption.
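This claim can also be checked numerically (my own sketch): Dijkstra's algorithm over (position, velocity), charging 1 per turn plus 10 for each ±2 acceleration (±3 is excluded, since at -1001 it always costs more than the whole trip), finds the same minimal cost as the ±1-only turn count:

```python
import heapq
import math

def min_cost(S):
    """Least total cost to reach (S, 0): 1 per turn, plus 10 per |acc| = 2 turn."""
    dist = {(0, 0): 0}
    pq = [(0, (0, 0))]
    bound = 3 * S + 3                     # generous bound on positions explored
    while pq:
        d, (pos, vel) = heapq.heappop(pq)
        if (pos, vel) == (S, 0):
            return d
        if d > dist[(pos, vel)]:
            continue                      # stale priority-queue entry
        for acc in (-2, -1, 0, 1, 2):
            cost = 1 + (10 if abs(acc) == 2 else 0)
            nxt = (pos + vel, vel + acc)
            if abs(nxt[0]) <= bound and abs(nxt[1]) <= S + 2:
                if d + cost < dist.get(nxt, float("inf")):
                    dist[nxt] = d + cost
                    heapq.heappush(pq, (d + cost, nxt))

def pm1_turns(S):
    """Minimal turns using only +-1 accelerations: 2n, 2n+1 or 2n+2 (see above)."""
    n = math.isqrt(S)
    m = S - n * n
    return 2 * n + (0 if m == 0 else 1 if m <= n else 2)
```

Equality of the two functions on small S confirms that the optimal policy never uses ±2.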

[LINK] Common fallacies in probability (when numbers aren't used)

7 Stuart_Armstrong 15 January 2016 08:29AM

Too many people attempt to use logic when they should be using probabilities - indeed, when they are already using probabilities implicitly, without mentioning it. Here are some of the major fallacies caused by misusing logic and probabilities this way:

  1. "It's not certain" does not mean "It's impossible" (and vice versa).
  2. "We don't know" absolutely does not imply "It's impossible".
  3. "There is evidence against it" doesn't mean much on its own.
  4. Being impossible *in a certain model* does not mean being impossible: it shifts the issue to the probability of the model.

Common fallacies in probability

Extending the stated objectives

6 Stuart_Armstrong 13 January 2016 04:20PM

A putative new idea for AI control; index here.

A system that is optimizing a function of n variables, where the objective depends on a subset of size k<n, will often set the remaining unconstrained variables to extreme values; if one of those unconstrained variables is actually something we care about, the solution found may be highly undesirable.

Stuart Russell

Think of an AI directing a car, given the instructions to get someone to the airport as fast as possible (optimised variables include "negative of time taken to airport") with some key variables left out - such as a maximum speed, maximum acceleration, respect for traffic rules, and survival of the passengers and other humans.

Call these other variables "unstated objectives" (UO), as contrasted with the "stated objectives" (SO) such as the time to the airport. In the normal environments in which we operate and design our AIs, the UOs are either correlated with the SOs (consider the SO "their heart is beating" and the UO "they're alive and healthy") or don't change much at all (the car-directing AI could have been trained on many examples of driving-to-the-airport, none of which included the driver killing their passengers).

Typically, SOs are easy to define, and the UOs are the more important objectives, left undefined either because they are complex, or because they didn't occur to us in this context (just as we don't often say "driver, get me to the airport as fast a possible, but alive and not permanently harmed, if you please. Also, please obey the following regulations and restrictions: 1.a.i.α: Non-destruction of the Earth....").

The control problem, in a nutshell, is that optimising SOs will typically set other variables to extreme values, including the UOs. The more extreme the optimisation, and the furthest from the typical environment, the more likely this is to happen.
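A toy numerical illustration (the numbers and both objective functions are entirely made up, not from the post): an optimiser handed only the stated objective "minimise time to the airport" pushes the one variable that objective mentions, speed, to the edge of its range, and the unstated safety objective lands wherever the extremes put it:

```python
import itertools

DISTANCE_KM = 10.0                      # hypothetical trip length

def stated_objective(speed_kmh, braking):
    """Only the SO: minimise time to the airport (maximise its negative)."""
    return -DISTANCE_KM / speed_kmh     # braking is left completely unconstrained

def unstated_objective(speed_kmh, braking):
    """The UO we actually care about: a made-up danger penalty."""
    return -0.01 * speed_kmh ** 2 - 5.0 * braking

speeds = range(10, 301, 10)             # 10..300 km/h
brakings = [0.0, 0.5, 1.0]              # harshness of braking

# Optimise only the stated objective over the whole grid.
best = max(itertools.product(speeds, brakings),
           key=lambda sb: stated_objective(*sb))
```

The chosen speed is pinned to the maximum of its range, and the unstated objective at that point is far worse than at any sensible choice, even though nothing "went wrong" in the optimisation.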


[Stub] Ontological crisis = out of environment behaviour?

8 Stuart_Armstrong 13 January 2016 03:10PM

One problem with AI is the possibility of ontological crises - of AIs discovering their fundamental model of reality is flawed, and being unable to cope safely with that change. Another problem is out-of-environment behaviour - an AI that has been trained to behave very well in a specific training environment messes up when introduced to a more general environment.

It suddenly occurred to me that these might in fact be the same problem in disguise. In both cases, the AI has developed certain ways of behaving in reaction to certain regular features of their environment. And suddenly they are placed in a situation where these regular features are absent - either because they realised that these features are actually very different from what they thought (ontological crisis) or because the environment is different and no longer supports the same regularities (out-of-environment behaviour).

In a sense, both these errors may be seen as imperfect extrapolation from partial training data.

Tackling the subagent problem: preliminary analysis

5 Stuart_Armstrong 12 January 2016 12:26PM

A putative new idea for AI control; index here.

Status: preliminary. This mainly to put down some of the ideas I've had, for later improvement or abandonment.

The subagent problem, in a nutshell, is that "create a powerful subagent with goal U that takes over the local universe" is a solution for many of the goals an AI could have - in a sense, the ultimate convergent instrumental goal. And it tends to evade many clever restrictions people try to program into the AI (eg "make use of only X amount of negentropy", "don't move out of this space").

So if the problem could be solved, many other control approaches could be potentially available.

The problem is very hard, because an imperfect definition of a subagent is simply an excuse to create a subagent that skirts the limits of that definition (hum, that style of problem sounds familiar). For instance, if we want to rule out subagents by preventing the AI from having much influence if the AI itself were to stop ("If you die, you fail, no other can continue your quest"), then it is motivated to create powerful subagents that carefully reverse their previous influence if the AI were to be destroyed.

 

Controlling subagents

Some of the methods I've developed seem suitable for controlling the existence or impact of subagents.

  • Reduced impact methods can prevent subagents from being created, by requiring that the AI's interventions be non-disruptive ("Twenty million questions") or undetectable.
  • Reducing the AI's output options to a specific set can prevent them from being able to create any in the first place.
  • Various methods around detecting importance can be used to ensure that, though subagents may exist, they won't be very influential.
  • Pre-corriged methods can be used to ensure that any subagents remain value-aligned with the original agent. Then, if there is some well-defined "die" goal for the agent, this would take all its subagents with it.

These can be thought of as ruling out the subagents' existence, their creation, their influence (or importance) and their independence. The last two can be particularly tricky, as we want to make sure that our formal definition of importance matches up with our informal one, and we currently lack a well defined "die" goal.

We could also think of defining identity by using some of the tricks and restrictions that have caused humans to develop one (such as our existing in a single body with no ease of copying), but it's not clear that this definition would remain stable once the restrictions were lifted (and it's not clear that a sense of identity prevents the creation of subagents in the first place).

 

Subagents processing information

Here I want to look at one other aspect of the subagents, the fact that they are subagents, and, as such, do some of the stuff that agents do - such as processing information and making decisions. Can we use the information processing as a definition?

Consider the following model. Our lovely Clippy wants to own a paperclip. They know that it exists behind one of a hundred doors; opening one of them seals all the others, for ever. In a few minutes, Clippy will be put to sleep, but it has a simple robot that it can program to go and open one of the doors and retrieve the paperclip for it.

Clippy currently doesn't know where the paperclip is, but it knows that its location will be announced a few seconds after Clippy sleeps. The robot includes a sound recording system inside it.

It seems there are two clear levels of agency the robot could have: either it goes to a random door, or it processes the announcement, to pick the correct door. In the first case, the robot and Clippy have a 1% chance of getting the paperclip; in the second, a 100% chance. The distribution of outcomes is clearly different.

But now suppose there is a human guard lounging around, trying to prevent the robot from getting to any door. If the robot has a neutron bomb, it can just kill the human without looking. If it has a giant cannon, it can simply shoot the guard where they stand. If it has neither of these, then it can attempt to bargain with the guard, negotiating, seducing and tricking them in the course of a conversation. Once the guard is removed, the situation reverts to the above.

Each of the three methods requires a different level of information processing (a different level of optimisation power, in essence). We can add more guards, make them more or less competent, add other agents with other goals, add more ways of achieving the paperclip, and so on, to grade how much information processing the robot has.

To calibrate such a scale, we could use the upper bound as "Clippy moves into the robot and controls it fully" to represent the robot having maximal information processing power (and a very focused outcome space). And we could use some pre-prepared actions (such as the robot randomly choosing a door) to calibrate the lower end. The aim is to construct a definition of information processing that could be used to define the existence of (effective) subagents.

Still feels likely that this will fail, though, without something more.

A toy model of the treacherous turn

13 Stuart_Armstrong 08 January 2016 12:58PM

Jaan Tallinn has suggested creating a toy model of the various common AI arguments, so that they can be analysed without loaded concepts like "autonomy", "consciousness", or "intentionality". Here a simple attempt for the "treacherous turn"; posted here for comments and suggestions.

Meet agent L. This agent is a reinforcement-based agent, rewarded/motivated by hearts (and some small time penalty each turn it doesn't get a heart):


[Stub] The problem with Chesterton's Fence

4 Stuart_Armstrong 05 January 2016 05:10PM

Chesterton's meta-fence: "in our current system (democratic market economies with large governments) the common practice of taking down Chesterton fences is a process which seems well established and has a decent track record, and should not be unduly interfered with (unless you fully understand it)".

FHI is hiring researchers!

13 Stuart_Armstrong 23 December 2015 10:46PM

The Future of Humanity Institute at the University of Oxford invites applications for four research positions. We seek outstanding applicants with backgrounds that could include computer science, mathematics, economics, technology policy, and/or philosophy.


Reflective oracles and superationality

3 Stuart_Armstrong 18 November 2015 12:30PM

This grew out of an exchange with Jessica Taylor during MIRI’s recent visit to the FHI. Still getting my feel for the fixed point approach; let me know of any errors.

People at MIRI have recently proved that agents can use reflective oracles to reason about other agents (including other agents with oracles) and about themselves, and consistently reach Nash equilibria. But can we do better than that?

To recap, a reflective oracle is a machine O such that:

  • P(A()=1)>p implies O(A,p)=1

  • P(A()=0)>1-p implies O(A,p)=0

This works even if A() includes a call to the oracle within its code. Now, all the algorithms used here will be clearly terminating, so we’ll have the other two implications as well (eg P(A()=0)>p implies O(A,p)=0). And given any δ, we can, with order log(1/δ) questions, establish the probability of A() to within δ. Thus we will write O(A()==a)=p to mean that O(A()==a,(n-1)δ/2)=1 and O(A()==a,(n+1)δ/2)=0, where (n-1)δ/2 < p < (n+1)δ/2.
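The "order log(1/δ) questions" claim is just a binary search over the threshold p. A small sketch (the simulated oracle is my own stand-in, assuming A() is a terminating machine with a fixed output probability):

```python
import math

def make_oracle(p_true):
    """Stand-in for a reflective oracle queried about a terminating machine A:
    O(A, p) = 1 exactly when P(A() = 1) > p."""
    return lambda p: 1 if p_true > p else 0

def estimate(oracle, delta):
    """Locate P(A() = 1) to within delta with about log2(1/delta) queries."""
    lo, hi, queries = 0.0, 1.0, 0
    while hi - lo > delta:
        mid = (lo + hi) / 2
        queries += 1
        if oracle(mid):
            lo = mid                     # P(A() = 1) > mid
        else:
            hi = mid                     # P(A() = 1) <= mid
    return (lo + hi) / 2, queries
```

Each query halves the interval containing the true probability, so locating it to within δ takes ⌈log2(1/δ)⌉ queries.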

Note also that O can be used to output a probabilistic output (to within δ), so outputting specific mixed strategies is possible.


Utility, probability and false beliefs

1 Stuart_Armstrong 09 November 2015 09:43PM

A putative new idea for AI control; index here.

This is part of the process of making past ideas rigorous and formal.

Paul Christiano recently asked why I used utility changes, rather than probability changes, to have an AI believe (or act as if it believed) false things. While investigating that, I developed several different methods for achieving the belief changes that we desired. This post analyses these methods.

 

Different models of forced beliefs

Let x and ¬x refer to the future outcome of a binary random variable X (write P(x) as a shorthand for P(X=x), and so on). Assume that we want P(x):P(¬x) to be in the 1:λ ratio for some λ (since the ratio is all that matters, λ=∞ is valid, meaning P(x)=0). Assume that we have an agent, who has utility u, has seen past evidence e, and wishes to assess the expected utility of their action a.

Typically, for expected utility, we sum over the possible worlds. In practice, we almost always sum over sets of possible worlds, the sets determined by some key features of interest. In assessing the quality of health interventions, for instance, we do not carefully and separately treat each possible position of atoms in the sun. Thus let V be the set of variables or values we care about, and v a possible value vector V can take. As usual, we'll be writing P(v) as a shorthand for P(V=v). The utility function u assigns utilities to possible v's.

One of the advantages of this approach is that it can avoid many issues of conditionals like P(A|B) when P(B)=0.

The first obvious idea is to condition on x and ¬x:

  • (1) Σv u(v)(P(v|x,e,a)+λP(v|¬x,e,a))

The second one is to use intersections rather than conditionals (as in this post):

  • (2) Σv u(v)(P(v,x|e,a)+λP(v,¬x|e,a))

Finally, imagine that we have a set of variables H, that "screen off" the effects of e and a, up until X. Let h be a set of values H can take. Thus P(x|h,e,a)=P(x|h). One could see H as the full set of possible pre-X histories, but it could be much smaller - maybe just the local environment around X. This gives a third definition:

  • (3) Σv Σh u(v)(P(v|h,x,e,a)+λP(v|h,¬x,e,a))P(h|e,a)
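To make the three definitions concrete, here is a sketch on a toy joint distribution (all numbers hypothetical, and e and a suppressed since they are held fixed), built as P(h)·P(x|h)·P(v|h,x) so that H screens off everything before X:

```python
import itertools

# Hypothetical toy model: binary H, X, V.
P_h = {0: 0.6, 1: 0.4}
P_x_given_h = {0: 0.3, 1: 0.8}                  # P(X=1 | h); varies with h
P_v_given_hx = {(0, 0): 0.2, (0, 1): 0.9,       # P(V=1 | h, x)
                (1, 0): 0.5, (1, 1): 0.1}
u = {0: 0.0, 1: 1.0}                            # utility depends on V only
lam = 2.0

def joint(v, h, x):
    px = P_x_given_h[h] if x else 1 - P_x_given_h[h]
    pv = P_v_given_hx[(h, x)] if v else 1 - P_v_given_hx[(h, x)]
    return P_h[h] * px * pv

P_x = sum(joint(v, h, 1) for v in (0, 1) for h in (0, 1))

def p_vx(v, x):                                  # P(v, x)
    return sum(joint(v, h, x) for h in (0, 1))

def agent1(l):                                   # (1): condition on x and not-x
    return sum(u[v] * (p_vx(v, 1) / P_x + l * p_vx(v, 0) / (1 - P_x))
               for v in (0, 1))

def agent2(l):                                   # (2): intersections, not conditionals
    return sum(u[v] * (p_vx(v, 1) + l * p_vx(v, 0)) for v in (0, 1))

def agent3(l):                                   # (3): condition through h
    total = 0.0
    for v, h in itertools.product((0, 1), (0, 1)):
        p1 = joint(v, h, 1) / (P_h[h] * P_x_given_h[h])        # P(v | h, x)
        p0 = joint(v, h, 0) / (P_h[h] * (1 - P_x_given_h[h]))  # P(v | h, not-x)
        total += u[v] * (p1 + l * p0) * P_h[h]
    return total
```

Because P(x|h) varies with h here, the three agents give three different numbers; the next section looks at when they coincide. One check that holds regardless: (2) with λ is (1) with λ·P(¬x)/P(x), up to the positive factor P(x).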

 

Changing and unchangeable P(x)

An important thing to note is that all three definitions are equivalent for fixed P(x), up to changes of λ. The equivalence of (2) and (1) derives from the fact that Σv u(v)(P(v,x|e,a)+λP(v,¬x|e,a)) = Σv u(v)(P(x)P(v|x,e,a)+λP(¬x)P(v|¬x,e,a)) (we write P(x) rather than P(x|e,a) since the probability of x is fixed). Thus a type (2) agent with λ is equivalent to a type (1) agent with λ'=λP(¬x)/P(x).

Similarly, P(v|h,x,e,a)=P(v,h,x|e,a)/(P(x|h,e,a)*P(h|e,a)). Since P(x|h,e,a)=P(x), equation (3) reduces to Σv Σh u(v)(P(v,h,x|e,a)/P(x)+λP(v,h,¬x|e,a)/P(¬x)). Summing over h, this becomes Σv u(v)(P(v,x|e,a)/P(x)+λP(v,¬x|e,a)/P(¬x))=Σv u(v)(P(v|x,e,a)+λP(v|¬x,e,a)), ie the same as (1).
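These fixed-P(x) equivalences can be checked numerically. In this sketch (hypothetical numbers, e and a suppressed) P(x|h) is held constant in h, so all three definitions should line up: (2) with λ matches (1) with λ·P(¬x)/P(x) up to the positive factor P(x), and (3) collapses to (1) with the same λ:

```python
# Hypothetical toy model in which P(x|h) does not depend on h, so P(x) is "fixed".
P_h = {0: 0.6, 1: 0.4}
P_X = 0.3                                        # P(X=1 | h) for every h
P_v_given_hx = {(0, 0): 0.2, (0, 1): 0.9,        # P(V=1 | h, x)
                (1, 0): 0.5, (1, 1): 0.1}
u = {0: 0.0, 1: 1.0}
lam = 2.0

def joint(v, h, x):
    px = P_X if x else 1 - P_X
    pv = P_v_given_hx[(h, x)] if v else 1 - P_v_given_hx[(h, x)]
    return P_h[h] * px * pv

def p_vx(v, x):                                  # P(v, x)
    return sum(joint(v, h, x) for h in (0, 1))

def agent1(l):
    return sum(u[v] * (p_vx(v, 1) / P_X + l * p_vx(v, 0) / (1 - P_X))
               for v in (0, 1))

def agent2(l):
    return sum(u[v] * (p_vx(v, 1) + l * p_vx(v, 0)) for v in (0, 1))

def agent3(l):
    total = 0.0
    for v in (0, 1):
        for h in (0, 1):
            p1 = joint(v, h, 1) / (P_h[h] * P_X)          # P(v | h, x)
            p0 = joint(v, h, 0) / (P_h[h] * (1 - P_X))    # P(v | h, not-x)
            total += u[v] * (p1 + l * p0) * P_h[h]
    return total

# The change of lambda relating (2) to (1):
lam_prime = lam * (1 - P_X) / P_X
```

On this model agent2(λ) equals P(x)·agent1(λ'), and agent3(λ) equals agent1(λ) exactly, as derived above.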

What about non-constant x? Let c(x) and c(¬x) be two contracts that pay out under x and ¬x, respectively. If the utility u is defined as 1 if a payout is received (and 0 otherwise), it's clear that both agent (1) and agent (3) assess c(x) as having an expected utility of 1 while c(¬x) has an expected utility of λ. This assessment is unchanging, whatever the probability of x. Therefore agents (1) and (3), in effect, see the odds of x as being a constant ratio 1:λ.

Agent (2), in contrast, gets a one-off artificial 1:λ update to the odds of x and then proceeds to update normally. Suppose that X is a coin toss that the agent believes is fair, having extensively observed the coin. Then it will believe that the odds are 1:λ. Suppose instead that it observes the coin has a λ:1 odds ratio; then it will believe the true odds are 1:1. It will be accurate, with a 1:λ ratio added on.

The effects of this percolate backwards in time from X. Suppose that X was to be determined by the toss of one of two unfair coins, one with odds ε:1 and one with odds 1:ε. The agent would assess the odds of the first coin being used rather than the second as around 1:λ. This update would extend to the process of choosing the coins, and anything that process depended on. Agent (1) is similar, though its update rule always assumes the odds of x:¬x being fixed; thus any information about the processes of coin selection is interpreted as a change in the probability of the processes, not a change in the probability of the outcome.

Agent (3), in contrast, is completely different. It assesses the probability of H=h objectively, but then assumes that the odds of x and ¬x, given any h, are 1:λ. Thus if given updates about the probability of which coin is used, it will assess those updates objectively, but then assume that both coins are "really" giving 1:1 odds. It cuts off the update process at h, thus ensuring that it is "incorrect" only about x and its consequences, not its pre-h causes.

 

Utility and probability: assessing goal stability

Agents with unstable goals are likely to evolve towards being (equivalent to) expected utility maximisers. The converse is more complicated, but we'll assume here that an agent's goal is stable if it is an expected utility maximiser for some probability distribution.

Which one? I've tended to shy away from changing the probability, preferring to change the utility instead. If we divide the probability in equation (2) by 1+λ, it becomes a u-maximiser with a biased probability distribution. Alternatively, if we defined u'(v,x)=u(v) and u'(v,¬x)=λu(v), then it is a u'-maximiser with an unmodified probability distribution. Since all agents are equivalent for fixed P(x), we can see that in that case, all agents can be seen as expected utility maximisers with the standard probability distribution. 

Paul questioned whether the difference was relevant. I preferred the unmodified probability distribution - maybe the agent uses the distribution for induction, maybe having false probability beliefs will interfere with AI self-improvement, or maybe agents with standard probability distributions are easier to corrige - but for agent (2) the difference seems to be arguably a matter of taste.

Note that though agent (2) is stable, its definition is not translation invariant in u. If we add c to u, we add c(P(x|e,a)+λP(¬x|e,a)) to u'. Thus, if the agent can affect the value of P(x) through its actions, different constants c likely give different behaviours.

Agent (1) is different. Except for the cases λ=0 and λ=∞, the agent cannot be an expected utility maximiser. To see this, just notice that an update about the process that could change the probability of x, gets reinterpreted as an update on the probability of that process. If we have the ε:1 and 1:ε coins, then any update about their respective probabilities of being used gets essentially ignored (as long as the evidence that the coins are biased is much stronger than the evidence as to which coin is used).

In the cases λ=0 and λ=∞, though, agent (1) is a u-maximiser that uses the probability distribution that assumes x or ¬x is certain, respectively. This is the main point of agent (1) - providing a simple maximiser for those cases.

What about agent (3)? Define u' by: u'(v,h,x)=u(v)/P(x|h), and u'(v,h,¬x)=λu(v)/P(¬x|h). Then consider the u'-maximiser:

  • (4) Σv Σh u'(v,h,x)P(v,h,x|e,a)+u'(v,h,¬x)P(v,h,¬x|e,a)

Now P(v,h,x|e,a)=P(v|h,x,e,a)P(x|h,e,a)P(h|e,a). Because of the screening off assumptions, the middle term is the constant P(x|h). Multiplying this by u'(v,h,x)=u(v)/P(x|h) gives u(v)P(v|h,x,e,a)P(h|e,a). Similarly, the second term becomes λu(v)P(v|h,¬x,e,a)P(h|e,a). Thus a u'-maximiser, with the standard probability distribution, is the same as agent (3), thus proving the stability of that agent type.
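The u' construction can also be checked numerically, on a toy joint distribution where P(x|h) does vary with h (hypothetical numbers; e and a suppressed). With u'(v,h,x)=u(v)/P(x|h) and u'(v,h,¬x)=λu(v)/P(¬x|h), the plain expected u'-utility under the standard probabilities matches definition (3):

```python
# Hypothetical toy model with h-dependent P(x|h), so (3) genuinely differs from
# (1) and the u' construction is doing real work.
P_h = {0: 0.6, 1: 0.4}
P_x_given_h = {0: 0.3, 1: 0.8}                   # P(X=1 | h)
P_v_given_hx = {(0, 0): 0.2, (0, 1): 0.9,        # P(V=1 | h, x)
                (1, 0): 0.5, (1, 1): 0.1}
u = {0: 0.0, 1: 1.0}
lam = 2.0

def joint(v, h, x):
    px = P_x_given_h[h] if x else 1 - P_x_given_h[h]
    pv = P_v_given_hx[(h, x)] if v else 1 - P_v_given_hx[(h, x)]
    return P_h[h] * px * pv

def agent3():
    """Definition (3): condition through h, with the 1:lam reweighting."""
    total = 0.0
    for v in (0, 1):
        for h in (0, 1):
            p1 = joint(v, h, 1) / (P_h[h] * P_x_given_h[h])        # P(v | h, x)
            p0 = joint(v, h, 0) / (P_h[h] * (1 - P_x_given_h[h]))  # P(v | h, not-x)
            total += u[v] * (p1 + lam * p0) * P_h[h]
    return total

def u_prime(v, h, x):
    """u'(v,h,x) = u(v)/P(x|h); u'(v,h,not-x) = lam * u(v)/P(not-x|h)."""
    if x:
        return u[v] / P_x_given_h[h]
    return lam * u[v] / (1 - P_x_given_h[h])

def agent4():
    """Definition (4): expected u'-utility with the standard probabilities."""
    return sum(u_prime(v, h, x) * joint(v, h, x)
               for v in (0, 1) for h in (0, 1) for x in (0, 1))
```

The two functions agree exactly, which is the stability claim in numerical form.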

 

Beyond the future: going crazy or staying sane

What happens after the event X has come to pass? In that case, agent (4), the u'-maximiser, will continue as normal. Its behaviour will not be unusual as long as neither λ nor 1/λ is close to 0. The same goes for agent (2).

In contrast, agent (3) will no longer be stable after X, as H no longer screens off evidence after that point. And agent (1) was never stable in the first place, and now it denies all the evidence it sees to determine that impossible events actually happened. But what of those two agents, or the stable ones if λ or 1/λ were close to 0? In particular, what if λ falls below the probability that the agent is deluded in its observation of X?

In those cases, it's easy to argue that the agents would effectively go insane, believing wild and random things to justify their delusions.

But maybe not, in the end. Suppose that you, as a human, believe an untrue fact - maybe that Kennedy was killed on the 23rd of November rather than the 22nd. Maybe you construct elaborate conspiracy theories to account for the discrepancy. Maybe you posit an early mistake by some reporter that was then picked up and repeated. After a while, you discover that all the evidence you can find points to the 22nd. Thus, even though you believe with utter conviction that the assassination was on the 23rd, you learn to expect that the next piece of evidence will point to the 22nd. You look for the date-changing conspiracy, and never discover anything about it; and thus learn to expect they have covered their tracks so well they can't be detected.

In the end, the expectations of this "insane" agent could come to resemble those of normal agents, as long as there's some possibility of a general explanation of all the normal observations (eg a well-hidden conspiracy) given the incorrect assumption.

Of course, the safer option is just to corrige the agent to some sensible goal soon after X.
