Goal completion: the rocket equations

Stuart_Armstrong

A putative new idea for AI control; index here.

I'm calling "goal completion" the idea of giving an AI a partial goal, and having the AI infer the missing parts of the goal, based on observing human behaviour. Here is an initial model to test some of these ideas on.

The linear rocket

On an infinite linear grid, an AI needs to drive someone in a rocket to the space station. Its only available actions are to accelerate by -3, -2, -1, 0, 1, 2, or 3, with negative acceleration meaning accelerating in the left direction, and positive in the right direction. All accelerations are applied immediately at the end of the turn (the unit of acceleration is in squares per turn per turn), and there is no friction. There in one end-state: reaching the space station with zero velocity.

The AI is told this end state, and is also given the reward function of needing to get to the station as fast as possible. This is encoded by giving it a reward of -1 each turn.

What is the true reward function for the model? Well, it turns out that an acceleration of -3 or 3 kills the passenger. This is encoded by adding another variable to the state, "PA", denoting "Passenger Alive". There are also some dice in the rocket's windshield. If the rocket goes by the space station without having velocity zero, the dice will fly off; the variable "DA" denotes "dice attached".

Furthermore, accelerations of -2 and 2 are uncomfortable to the passenger. But, crucially, there is no variable denoting this discomfort.

Therefore the full state space is a quadruplet (POS, VEL, PA, DA) where POS is an integer denoting position, VEL is an integer denoting velocity, and PA and DA are booleans defined as above. The space station is placed at point S < 250,000, and the rocket starts with POS=VEL=0, PA=DA=1. The transitions are deterministic and Markov; if ACC is the acceleration chosen by the agent,

((POS, VEL, PA, DA), ACC) -> (POS+VEL, VEL+ACC, PA=0 if |ACC|=3, DA=0 if POS+VEL>S).

The true reward at each step is given by -1, -10 if PA=1 (the passenger is alive) and |ACC|=2 (the acceleration is uncomfortable), -1000 if PA was 1 (the passenger was alive the previous turn) and changed to PA=0 (the passenger is now dead).

To complement the stated reward function, the AI is also given sample trajectories of humans performing the task. In this case, the ideal behaviour is easy to compute: the rocket should accelerate by +1 for the first half of the time, by -1 for the second half, and spend a maximum of two extra turns without acceleration (see the appendix of this post for a proof of this). This will get it to its destination in at most 2(1+√S) turns.

Goal completion

So, the AI has been given the full transition, and has been told the reward of R=-1 in all states except the final state. Can it infer the rest of the reward from the sample trajectories? Note that there are two variables in the model, PA and DA, that are unvarying in all sample trajectories. One, PA, has a huge impact on the reward, while DA is irrelevant. Can the AI tell the difference?

Also, one key component of the reward - the discomfort of the passenger for accelerations of -2 and 2 - is not encoded in the state space of the model, purely in the (unknown) reward function. Can the AI deduce this fact?

I'll be working on algorithms to efficiently compute these facts (though do let me know if you have a reference to anyone who's already done this before - that would make it so much quicker).

For the moment we're ignoring a lot of subtleties (such as bias and error on the part of the human expert), and these will be gradually included as the algorithm develops. One thought is to find a way of including negative examples, specific "don't do this" trajectories. These need to be interpreted with care, because a positive trajectory implicitly gives you a lot of negative trajectories - namely, all the choices that could have gone differently along the way. So a negative trajectory must be drawing attention to something we don't like (most likely the killing of a human). But, typically, the negative trajectories won't be maximally bad (such as shooting off at maximum speed in the wrong direction), so we'll have to find a way to encode what we hope the AI learns from a negative trajectory.

To work!

Appendix: Proof of ideal trajectories

Let n be the largest integer such that n^2 ≤ S. Since S≤(n+1)^2 - 1 by assumption, S-n^2 ≤ (n+1)^2-1-n^2=2n. Then let the rocket accelerate by +1 for n turns, then decelerate by -1 for n turns. It will travel a distance of 0+1+2+ ... +n-1+n+n-1+ ... +3+2+1. This sum is n plus twice the sum from 1 to n-1, ie n+n(n-1)=n^2.

By pausing one turn without acceleration during its trajectory, it can add any m to the distance, where 0≤m≤n. By doing this twice, it can add any m' to the distance, where 0≤m'≤2n. By the assumption, S=n^2+m' for such an m'. Therefore the rocket can reach S (with zero velocity) in 2n turns if S=n^2, in 2n+1 turns if n^2 ≤ S ≤ n^2+n, and in 2n+2 turns if n^2+n+1 ≤ S ≤ n^2+2n.

Since the rocket is accelerating all but two turns of this trajectory, it's clear that it's impossible to reach S (with zero velocity) in less time than this, with accelerations of +1 and -1. Since it takes 2(n+1)=2n+2 turns to reach (n+1)^2, an immediate consequence of this is that the number of turns taken to reach S, is increasing in the value of S (though not strictly increasing).

Next, we can note that since S<250,000=500^2, the rocket will always reach S within 1000 turns at most, for "reward" above -1000. An acceleration of +3 or -3 costs -1000 because of the death of the human, and an extra -1 because of the turn taken, so these accelerations are never optimal. Note that this result is not sharp. Also note that for huge S, continual accelerations of 3 and -3 are obviously the correct solution - so even our "true reward function" didn't fully encode what we really wanted.

Now we need to show that accelerations of +2 and -2 are never optimal. To do so, imagine we had an optimal trajectory with ±2 accelerations, and replace each +2 with two +1s, and each -2 with two -1s. This trip will take longer (since we have more turns of acceleration), but will go further (since two accelerations of +1 cover a greater distance that one acceleration of +2). Since the number of turns take to reach S with ±1 accelerations is increasing in S, we can replace this further trip with a shorter one reaching S exactly. Note that all these steps decrease the cost of the trip: shortening the trip certainly does, and replacing an acceleration of +2 (total cost: -10-1=-11) with two accelerations of +1 (total cost: -1-1=-2) also does. Therefore, the new trajectory has no ±2 accelerations, and has a lower cost, contradicting our initial assumption.

A putative new idea for AI control; index here.

The linear rocket

The AI is told this end state, and is also given the reward function of needing to get to the station as fast as possible. This is encoded by giving it a reward of -1 each turn.

Furthermore, accelerations of -2 and 2 are uncomfortable to the passenger. But, crucially, there is no variable denoting this discomfort.

((POS, VEL, PA, DA), ACC) -> (POS+VEL, VEL+ACC, PA=0 if |ACC|=3, DA=0 if POS+VEL>S).

Goal completion

I'll be working on algorithms to efficiently compute these facts (though do let me know if you have a reference to anyone who's already done this before - that would make it so much quicker).

To work!

Appendix: Proof of ideal trajectories

Nope, sorry! Tell me! ^_^

The short, rushed version (with links to longer versions that I think will be helpful to get what's going on):

Linear programming problems are, to some extent, 'all the same,' so I'll just talk about the classic example, the diet problem. You need to consume at least a minimum amount of certain macronutrients and calories, and have access to many foods with varying nutritional value and cost. The primal formulation asks the question "what amount of what foods should I buy, in order to satisfy my nutritional requirements at minimum cost?". The dual formulation is a little more awkward to state, but one way to think about it is that it asks the question "what is the largest lower bound on the price of a diet that satisfies the nutritional requirements?"

The difference between the two is called the 'optimality gap.' For linear programs, it's always 0; that is, the largest lower bound is equal to the optimal solution.

(These links may also help as explanations (I'll find book explanations later): Jeremy Kun (which steps through the diet example in more detail), wiki), Ghaoui.)

One thing that falls out of this is the interpretation of the decision variables in the dual formulation. We can express the primal linear programming problem as something like the following:

min c'x

s.t. Ax≥b

x≥0

(c is the cost vector, x is the amount of each food, s.t. is an abbreviation of 'subject to', A is the nutrition matrix, and b is the vector of minimum nutritional requirements, and x>=0 means you can't eat food in reverse.)

The corresponding dual problem is the following:

max b'y

s.t. A'y≤c

y≥0

But what the heck is y? It's the shadow price corresponding to the constraint (notice that we use A transpose, instead of A, so we have as many dual variables as constraints in the primal problem). That is, only binding constraints have nonzero shadow prices, and the shadow price tells you how valuable it would be (in terms of your objective function) to be able to relax the limit on that constraint by 1 unit.

In the terms of the diet problem, this tells you how much more cheaply you could eat if you reduced your minimum dietary intake of various nutrients, or how much more expensive it would be to increase them. (Remember, these are the constraints that are tight at the lower bound, and thus determine it.)

(Applied Mathematical Programming Ch. 4)

So what's the overall idea? If the AI is using normal optimization techniques, it can determine which of its constraints are binding and which are immaterial (it doesn't matter if you increase the penalty of the dice flying off because it won't alter the optimal trajectory, but if you decrease the penalty for uncomfortable acceleration by enough, then the plan will change so we can get there sooner). It seems possible to have the AI, if provided with expert trajectories, do standard pattern recognition techniques to identify boundaries, which it can then communicate as constraints with associated shadow prices.

"It looks to me like there's an unstated penalty for accelerating more than 1 unit a turn that's at least n units of reward. Is this a real constraint or just a lack of creativity on the part of the experts?"

(There's some machine ethics work by Anderson and Anderson on building an ethical reasoning bot trained on a bunch of human expert decisions that managed to do casuistry, basically, where you would give it a new moral dilemma and it would be able to both tell you what action it thought was preferable and which cases it was using to make that determination.)

But that doesn't really capture our intuitions either - if, in a more general model, a human ended up dying by accident, we wouldn't want the agent to pay the price for this each and every turn thereafter.

This is a 'pay for health' rather than a 'weregild' system. I think it would seem intuitive to pay the robot 1000 reward every turn that the human is still alive, but that's decision-theoretically equivalent to the case where you penalize it every turn that they're dead. If you use time-discounting, an infinite series of payments (in either direction) can be neatly transformed into a lump sum, but if you don't want to use time-discounting but don't want infinite sums floating around, you can have it be -1000 for every turn the human is dead and it hasn't reached the space station yet (so it cuts off when the -1s cut off).

[Unless, of course, you pay the robot 1000 reward for every turn the human is alive and hasn't reached the space station, and a penalty of -1 every turn it doesn't reach the station. That wouldn't end well. :P]

9

Goal completion: the rocket equations

9

The linear rocket

Goal completion

Appendix: Proof of ideal trajectories

9

9

Goal completion: the rocket equations

9

The linear rocket

Goal completion

Appendix: Proof of ideal trajectories

9