# Goal completion: algorithm ideas

*A putative new idea for AI control; index here.*

This post will be extending ideas from inverse reinforcement learning (IRL) to the problem of goal completion. I'll be drawing on the presentation and the algorithm from Apprenticeship Learning via Inverse Reinforcement Learning (with one minor modification).

In that setup, the environment is an MDP (Markov Decision process), and the real reward R is assumed to be linear in the "features" of the state-action space. Features are functions φ_{i} from the full state-action space S×A to the unit interval [0,1] (the paper linked above only considers functions from the state space; this is the "minor modification"). These features form a vector φ∈[0,1]^{k}, for k different features. The actual reward is given by the inner product with a vector w∈â„^{k}, thus the reward at state-action pair (s,a) is

R(s,a)=w.φ(s,a).

To ensure the reward is always between -1 and 1, w is constrained to have ||w||_{1} ≤ 1; to reduce redundancy, we'll assume ||w||_{1}=1.

The advantages of linearity is that we can compute the expected rewards directly from the expected feature vector. If the agent follows a policy π (a map from state to action) and has a discount factor γ, the expected feature vector is

μ(π) = E(Σ_{t} γ^{t}φ(s_{t},π(s_{t})),

where s_{t} is the state at step t.

The agent's expected reward is then simply

E(R) = w . μ(π).

Thus the problem of computing the correct reward is reduced to the problem of computing the correct w. In practice, to compute the correct *policy*, we just need to find one whose *expected features* are close enough to optimal; this need not involve computing w.

## Inverse reinforcement learning

The approach of IRL is then to find a way to efficiently compute w, given some "trajectories": examples of good performance, provided by (human) experts. These experts are following the expert policy π_{E} Given these trajectories, the agent can compute an empirical estimate for the μ_{E}=μ(π_{E}), by simply averaging the (discounted) feature vectors produced on each trajectory.

The algorithm used is to gradually expand a series of policies π^{(0)}, π^{(1)},... until μ_{E} is close enough to the convex hull of the μ(π^{(i)}). Then a policy is chosen in the hull of {π^{(0)}, π^{(1)},...}, whose expected features are close to those of μ_{E}.

Note that since there exists a genuine w such that the experts act as w.μ_{E}-maximisers, there must exist a pure deterministic policy that is maximally good according to this w. This will not be the case if the φ is under-specified, as we shall see.

## The rocket world

I want to slightly simplify the rocket model of the previous example. First of all, discard the extra state variables PA and DA; the state is described entirely by position and velocity. To ensure a finite state space, make the space cyclic (of size 100) and the velocity similarly cyclic - the state space is of size 10,000. The actions are, as before, accelerations of -3, -2, -1, 0, 1, 2, and 3 (a state-action space of 70,000). The updates to position and velocity are deterministic and as expected (new pos=old pos+old vel, new vel=old vel+acceleration).

The space station is at point 0; docking with the station means reaching it with zero velocity, ie hitting state (0,0). The agent/rocket will start at point 36 with velocity 0. Each turn it is not docked, it will get a cost of -1; once docked, it will get no further reward or cost. Each time it accelerates with ±2, it gets a penalty of -10 as some of its passengers are uncomfortable. Each time it accelerates with ±3, it gets a penalty of -1000 as some of its passengers are killed.

In the terminology above, this can be captured by a four-component feature vector given my (all unspecified values are zero):

- φ
_{0}(0,0; -)=1. - φ
_{1}(-,-; -)=1. - φ
_{2}(-,-; ±2)=1 - φ
_{3}(-,-; ±3)=1

Here φ_{0} encodes the specialness of the terminal position (docked at the space station), φ_{1} gives a uniform reward/penalty every turn, while φ_{2} and φ_{3} encode the effects of accelerations of magnitude 2 and 3, respectively. Then the true reward function is given by inner product with the weight vector

- w = (1/N,-1/N, -10/N, -1000/N).

Here N=1012 is the normalisation factor ensuring ||w||_{1}=1. Since w_{0}=-w_{1}, the bonus from the docking at the space station exactly cancels out the penalty per turn, meaning the agent gets nothing further (and loses nothing further) when docked.

### The optimal trajectories/policies

The agent will have some optimal trajectories to go along with its its partial goal. In this case, since the setup is deterministic, there is only one trajectory: the rocket accelerates as velocity +1 for 6 turns (covering 15 squares) and then decelerates by -1 for 6 turns (covering 21 squares) to reach its destination. Thus it takes 13 turns to dock, and it will then stay there. The expected features are 1/(1-γ) φ_{1} + (γ^{13})/(1-γ) φ_{0} (it will get the φ_{1} turn penalty forever, and the φ_{0} "docked" bonus after 12 turns) and the expected reward is (1/N) times -(1-γ^{13})/(1-γ).

Instead of giving one or multiple optimal trajectories (being deterministic, all trajectories would be identical), we could just give the above policy. In fact we will. Given any set of trajectories, the agent can infer the optimal (partial) policy π, simply by observing what the humans did in that state. It's only a partial policy, because the humans might not have reached every state during their example trajectories (in this case, the humans have only reached a very narrow subset of states). Call π the observed optimal policy.

## Goal completion with known φ

Assume that the agent knows all the φ and has been told that the first two components of w are (proportion to) (1,-1). It seems then that the algorithm should proceed as usual. The only difference being that the policies the agent considers are subject to constraints that it has been given. Since there is a true policy π with these constraints, it should end up being converged to in the usual way. However, the agent will not compute the correct weighting of w_{2} versus w_{3}, as neither type of acceleration appears in the expert trajectories.

In this sense, the goal completion algorithm is trivial. There still may be some advantages to goal completion (with the partial goal and trajectories) rather than traditional IRL (with just trajectories). If there are few trajectories to rely on, then the partial goal may be very informative. If there are multiple value functions that can lead to the same trajectories, then the information in the partial goal can help pick out one of them specifically (useful for when the agent leaves the training environment). This will be most relevant when we want to account for noise. Suppose, for instance, that we have told the agent that we have decided to price human lives at $5.8 million (though don't confuse the price of a life with the value of a life). Then if the agent observes a lot of mild inconsistency in human behaviour, where we sometimes seem to price life less, sometimes more, it won't try and overfit to these inconsistencies (or average them out to some other value), it will just dismiss them as noise.

The partial goals are most helpful if they reflect the true tradeoffs that we can quantify between multiple options. For example, we might not have known the true tradeoff between deaths and delay, but actuaries may have computed the tradeoff between discomfort/injury and death, giving the relationships between w_{2} and w_{3}, which it could not infer from the trajectories it is given.

## Goal completion with missing φ

Now imagine that the agent does not know the full φ. Though the full model will be useful to test the algorithms in practice, we'll simplify the situation for this exposition and remove the possibility of accelerations of magnitude 3. Thus the state-action space is of size 50,000, we remove φ_{3} and set the true w to be (1/12,-1/12,-10/12). The agent knows the first two components, as before, but is ignorant of both w_{2} and φ_{2}.

The first thing to notice is that the standard algorithm can't work. Knowing φ_{0}, φ_{1}, w_{0}, and w_{1}, the agent can compute the optimal policy π': accelerate at ±2 towards the space station, and get there in 9 turns (2 turns of +2 acceleration, covering 2 squares; 1 turn of no acceleration, covering 4 squares; 2 turns of +2 acceleration, covering 10 squares; and 4 turns of -2 acceleration, covering 20 squares). But this gives an expected feature vector of 1/(1-γ) φ_{1} + (γ^{10})/(1-γ) φ_{0}, rather than the observed feature vector of 1/(1-γ) φ_{1} + (γ^{13})/(1-γ) φ_{0}, in the trajectories it has been given. And π' is the only optimal policy compatible with the partial goal it has been given.

One thing we can do at this point is loosen the partial goal. Suppose we allow it to consider (w_{0}, w_{1})∝(-1,1) as a possible reward vector (ie the opposite of the true one). Under this setup, it gets rewards for every single turn, except when docking at the space station. The optimal policy π'' for this is to accelerate at random (unless the acceleration would cause the rocket to come to rest at the space station). This has an expected feature vector of 1/(1-γ) φ_{1}.

Now the expert feature vector is in the convex hull of the expected features it has found. The policy

**π'''** = γ^{3}π' + (1-γ^{3}) π'' (meaning that at the very beginning, the agent choose, once, either to follow π' with probability γ^{3}, and otherwise follows π''),

will give the correct expert expected features.

However it is clear that π''' is a different policy from the observed policy π, even if they have the same expected (φ_{0}, φ_{1}). It is a mixed strategy, for one, not close to any pure strategy, and it is not optimal for any possible values of the (two dimensional) vector w.

### Adding depth of observations

One might be tempted at this point to try to get extra observations. We could change the setup in the following way: at the very first turn, a "cosmic wind" blows and randomises the position and velocity of the rocket; after that, everything proceeds as before. Then the agent will be given a much richer set of trajectories, making its observed optimal policy π (dock at the space station as fast as possible using only ± accelerations) more defined, maybe even into a full policy.

But that doesn't suffice, however. The agent will still compute that π has an observed features expectation of 1/(1-γ)φ_{1}+Aφ_{2}, for some constant A, and that this can be reached by π''' = Bπ' + (1-B)π'', with π'={dock as fast as possible, using all accelerations} and π''={never dock, otherwise anything goes}, and some constant B. The problem is not the lack of observations, the problem is that the agent lacks sufficient φ_{i}'s to converge on the reward for the policy it has been given.

### The best φ_{i} to add

The logical next step is for the agent to "guess" a φ_{2} and add it to its feature vector. But what criteria should it be using? Obviously, if it tries every single possible φ_{i} (ie one for each state-action pair) then it will eventually converge on the correct behaviour in this model. But that is an absurdly high number of features to test, and will likely result in an extremely overfitted reward that could go badly wrong if the agent is faced with a more general situation.

The first step is to put some structure on the possible φ_{i}. This could be a list of φ_{i}'s to try; a Bayesian prior over likely φ_{i}'s; the output of a deep neural net looking at the data, etc... What is needed is a relatively short list of φ_{i}'s worth trying, possibly weighted/ordered by likelihood.

The candidate φ_{i}'s are then weighted again by how well they explain the discrepancy between π and π'''. Both these policies are indistinguishable (or very similar, in the general case) in terms of the expected features that the agent knows about, yet they have divergent behaviours. The best candidate additional features are those that best explain these divergences. Note that it is trivial to perfectly explain these divergences by constructing, by hand, a φ_{i} that precisely records these differences; thus the need for a structure on the φ_{i} beforehand. Depending on the structure on the state-action space, one could use ideas similar to linear discriminant analysis or correspondence analysis to quantify how good the φ_{i} is.

Then the highest weight candidate is added to φ and the agent attempts to recompute the optimal policy again, in the traditional IRL way. If it is sufficiently close to π, it stops; if not, it attempts to find a φ_{j} that accounts for the remaining discrepancies, and so on.

In the example above, the state space with its position and acceleration, has a lot of structure. More interestingly, the combined state-action space has a natural product decomposition. We've been talking about "accelerate by 2" as an action that can happen in every state; so instead of saying "here is a state space, each state has an individual list of possible actions", we've been saying "the state-action space is a product of the state space and the action (acceleration) space". This natural product suggests several candidate φ_{i}: look at features that are purely represented in state space, or purely represented in action space.

The action features of π are clear: "generally choose ±1, rarely 0". The action features of π' are "generally choose ±2, rarely ±1 or zero". The action features of π'' are "almost always, all actions are equally valid". There are three candidate φ_{2}'s that can therefore best distinguish π from π''' (note that we are trying to distinguish different actions in the same state, not different states reached):

- φ
_{2}(-,-; ±2) = 1 (as defined above). - φ
_{2}' (-,-; 2) = 1. - φ
_{2}'' (-,-; -2) = 1.

Now φ_{2} is correct, and adding it will allows the agent to immediately converge on the correct reward and behaviour. However, depending on how we've structured the action space, it may not be an obvious or top guess. On the other hand φ_{2}' is an obvious guess, and adding it will result in a second round of policy convergence (where we now have a π' that can't accelerate fast but can decelerate fast) and then φ_{2}'' will get added at the next iteration, giving the correct policy, a feature vector of (φ_{0},φ_{1},φ_{2}',φ_{2}''), and a reward vector proportional to something like (1,-1,-A,-B) for some positive A and B.

If we went back to allowing accelerations of ±3, the agent should converge on some reward function that correctly only uses accelerations of 0 and ±1, but if won't know the relative tradeoff between ±2 versus ±3, at least not in this environment (as the human experts refrain from using either one).

### Asking questions, seeking out new information

Once it has a candidate φ_{i}, the agent is not obliged to immediately add it in and proceed. Instead, a moderately advanced agent could choose to ask humans whether this is a good φ_{i}, while a more advanced agent could go looking for more information to determine this. Even once the agent has a good model of the human behaviour in this situation, the additional φ_{i} are good candidates for humans to review and assess (it will likely bring up issues that humans hadn't considered up to that point). This is especially the case if the extra φ_{i}'s do not fix w "rigidly", ie allow for a lot of variation in the weights of w while still computing the same expected features as a human would. This is a sign that there are too many features for the information contained in the partial goal and the trajectories.

## Learning from negative examples

One good thing would be to have the agent learn from negative examples, just as humans do: "whatever you do, don't do this!". But it is tricky to determine how we are supposed to interpret a negative example. A positive trajectory implicitly contains a lot of negative examples: all the possible actions the human could have taken, but didn't. What makes a specific negative example special?

Generally, human-generated negative examples aren't trivial ("the driver is going at 79.99 km/h rather than 80 km/h; don't do that!") nor are they maximally negative ("the driver is crashing into the White House, killing the US president, and attempting to start a nuclear war; don't do that!"). So the "don't do that" command is clear, but the intensity of that command is not (subtleties in using negative examples are quite common, see eg "Learning from Negative Examples in Set-Expansion").

When interpreted in the light of the preceding, however, it's easier to see what a negative example is. It's an example of an action that is likely to be chosen, and has a disproportionately negative impact among such likely actions. "Likely to be chosen", means likely in the human judgement, of course. The action could be likely to happen through error, or it could be something that the human thinks the agent (or other humans) are likely to want to attempt.

Dealing with this seems straightforward. The negative action gets added as "action to be avoided" (along with all the actions of π''' that differ from those of π). The φ_{i} is then chosen to best separate the sets {actions of π} from {actions of π'''}∪{examples of negative actions}.

Indeed, the agent could keep track of whether using {actions of π'''}∪{examples of negative actions} or plain {actions of π'''} results in a faster/better convergence. If the agents converges better with the smaller set, but still doesn't do any of the negative actions, then it's a sign that the human negative examples are not adding much to the agent's "understanding": what humans thought were worthwhile edge cases were situations the agents had already classified correctly. Of course, the negative action examples might be intended to help more for when the agent moves beyond its training environment, so they need to be kept in mind, even if they are currently uninformative.

## Learning from extreme examples

What might be as useful (or more useful) than negative examples, are extreme examples. These are examples of decisions that are valid, but are made in unusual or urgent situations - maybe the rocket is rushing to arrive with desperately needed medical supplies. This could allow the agent to distinguish between accelerations of ±2 and those of ±3, which is impossible to do from the trajectories it was given, which contained neither.

## Comments (1)

Best*1 point [-]It makes sense that if there are only 100 positions, there are also only 100 velocities, since moving 100 units forward is identical to moving 0 units forward.

Currently rocketworld has: known domain of the reward function, a small list of actions with simple consequences, simple future interaction histories when things go right, and a very simple reward function. I'm probably forgetting more simplicities. It would be interesting to try to relax some of these.

If looking into value learning just from a few trajectories, there's probably not much point in making the agent work out the states and transitions of the MDP. But there might be some value in making them more complicated than Newtonian motion in 100 discrete spots. You might use reinforcement learning or [insert thing here] to allow the agent to more efficiently match complicated values and complicated action-consequences to optimal behavior, both in decision-making and in inference.