You're looking at Less Wrong's discussion board. This includes all posts, including those that haven't been promoted to the front page yet. For more information, see About Less Wrong.

Goal completion: the rocket equations

4 Stuart_Armstrong 20 January 2016 01:54PM

A putative new idea for AI control; index here.

I'm calling "goal completion" the idea of giving an AI a partial goal, and having the AI infer the missing parts of the goal, based on observing human behaviour. Here is an initial model to test some of these ideas on.

 

The linear rocket

On an infinite linear grid, an AI needs to drive someone in a rocket to the space station. Its only available actions are to accelerate by -3, -2, -1, 0, 1, 2, or 3, with negative acceleration meaning accelerating in the left direction, and positive in the right direction. All accelerations are applied immediately at the end of the turn (the unit of acceleration is in squares per turn per turn), and there is no friction. There in one end-state: reaching the space station with zero velocity.

The AI is told this end state, and is also given the reward function of needing to get to the station as fast as possible. This is encoded by giving it a reward of -1 each turn.

What is the true reward function for the model? Well, it turns out that an acceleration of -3 or 3 kills the passenger. This is encoded by adding another variable to the state, "PA", denoting "Passenger Alive". There are also some dice in the rocket's windshield. If the rocket goes by the space station without having velocity zero, the dice will fly off; the variable "DA" denotes "dice attached".

Furthermore, accelerations of -2 and 2 are uncomfortable to the passenger. But, crucially, there is no variable denoting this discomfort.

Therefore the full state space is a quadruplet (POS, VEL, PA, DA) where POS is an integer denoting position, VEL is an integer denoting velocity, and PA and DA are booleans defined as above. The space station is placed at point S < 250,000, and the rocket starts with POS=VEL=0, PA=DA=1. The transitions are deterministic and Markov; if ACC is the acceleration chosen by the agent,

((POS, VEL, PA, DA), ACC) -> (POS+VEL, VEL+ACC, PA=0 if |ACC|=3, DA=0 if POS+VEL>S).

The true reward at each step is given by -1, -10 if PA=1 (the passenger is alive) and |ACC|=2 (the acceleration is uncomfortable), -1000 if PA was 1 (the passenger was alive the previous turn) and changed to PA=0 (the passenger is now dead).

To complement the stated reward function, the AI is also given sample trajectories of humans performing the task. In this case, the ideal behaviour is easy to compute: the rocket should accelerate by +1 for the first half of the time, by -1 for the second half, and spend a maximum of two extra turns without acceleration (see the appendix of this post for a proof of this). This will get it to its destination in at most 2(1+√S) turns.

 

Goal completion

So, the AI has been given the full transition, and has been told the reward of R=-1 in all states except the final state. Can it infer the rest of the reward from the sample trajectories? Note that there are two variables in the model, PA and DA, that are unvarying in all sample trajectories. One, PA, has a huge impact on the reward, while DA is irrelevant. Can the AI tell the difference?

Also, one key component of the reward - the discomfort of the passenger for accelerations of -2 and 2 - is not encoded in the state space of the model, purely in the (unknown) reward function. Can the AI deduce this fact?

I'll be working on algorithms to efficiently compute these facts (though do let me know if you have a reference to anyone who's already done this before - that would make it so much quicker).

For the moment we're ignoring a lot of subtleties (such as bias and error on the part of the human expert), and these will be gradually included as the algorithm develops. One thought is to find a way of including negative examples, specific "don't do this" trajectories. These need to be interpreted with care, because a positive trajectory implicitly gives you a lot of negative trajectories - namely, all the choices that could have gone differently along the way. So a negative trajectory must be drawing attention to something we don't like (most likely the killing of a human). But, typically, the negative trajectories won't be maximally bad (such as shooting off at maximum speed in the wrong direction), so we'll have to find a way to encode what we hope the AI learns from a negative trajectory.

To work!

 

Appendix: Proof of ideal trajectories

Let n be the largest integer such that n^2 ≤ S. Since S≤(n+1)^2 - 1 by assumption, S-n^2 ≤ (n+1)^2-1-n^2=2n. Then let the rocket accelerate by +1 for n turns, then decelerate by -1 for n turns. It will travel a distance of 0+1+2+ ... +n-1+n+n-1+ ... +3+2+1. This sum is n plus twice the sum from 1 to n-1, ie n+n(n-1)=n^2.

By pausing one turn without acceleration during its trajectory, it can add any m to the distance, where 0≤m≤n. By doing this twice, it can add any m' to the distance, where 0≤m'≤2n. By the assumption, S=n^2+m' for such an m'. Therefore the rocket can reach S (with zero velocity) in 2n turns if S=n^2, in 2n+1 turns if n^2 ≤ S ≤ n^2+n, and in 2n+2 turns if n^2+n+1 ≤ S ≤ n^2+2n.

Since the rocket is accelerating all but two turns of this trajectory, it's clear that it's impossible to reach S (with zero velocity) in less time than this, with accelerations of +1 and -1. Since it takes 2(n+1)=2n+2 turns to reach (n+1)^2, an immediate consequence of this is that the number of turns taken to reach S, is increasing in the value of S (though not strictly increasing).

Next, we can note that since S<250,000=500^2, the rocket will always reach S within 1000 turns at most, for "reward" above -1000. An acceleration of +3 or -3 costs -1000 because of the death of the human, and an extra -1 because of the turn taken, so these accelerations are never optimal. Note that this result is not sharp. Also note that for huge S, continual accelerations of 3 and -3 are obviously the correct solution - so even our "true reward function" didn't fully encode what we really wanted.

Now we need to show that accelerations of +2 and -2 are never optimal. To do so, imagine we had an optimal trajectory with ±2 accelerations, and replace each +2 with two +1s, and each -2 with two -1s. This trip will take longer (since we have more turns of acceleration), but will go further (since two accelerations of +1 cover a greater distance that one acceleration of +2). Since the number of turns take to reach S with ±1 accelerations is increasing in S, we can replace this further trip with a shorter one reaching S exactly. Note that all these steps decrease the cost of the trip: shortening the trip certainly does, and replacing an acceleration of +2 (total cost: -10-1=-11) with two accelerations of +1 (total cost: -1-1=-2) also does. Therefore, the new trajectory has no ±2 accelerations, and has a lower cost, contradicting our initial assumption.

Toy model for wire-heading [EDIT: removed for improvement]

2 Stuart_Armstrong 09 October 2015 03:45PM

EDIT: these ideas are too underdeveloped, I will remove them and present a more general idea after more analysis.

This is a (very) simple toy model of the wire-heading problem to illustrate how it might or might not happen. The great question is "where do we add the (super)intelligence?"

Let's assume a simple model for an expected utility maximising agent. There's the input assessor module A, which takes various inputs and computes the agent's "reward" or "utility". For a reward-based agent, A is typically outside of the agent; for a utility-maximiser, it's typically inside the agent, though the distinction need not be sharp. And there's the the decision module D, which assess the possible actions to take to maximise the output of A. If E is the general environment, we have D+A+E.

Now let's make the agent superintelligent. If we add superintelligence to module D, then D will wirehead by taking control of A (whether A is inside the agent or not) and controlling E to prevent interference. If we add superintelligence to module A, then it will attempt to compute rewards as effectively as possible, sacrificing D and E to achieve it's efficient calculations.

Therefore to prevent wireheading, we need to "add superintelligence" to (D+A), making sure that we aren't doing so to some sub-section of the algorithm - which might be hard if the "superintelligence" is obscure or black-box.

 

Systemic risk: a moral tale of ten insurance companies

26 Stuart_Armstrong 17 November 2014 04:43PM

Once upon a time...

Imagine there were ten insurance sectors, each sector being a different large risk (or possibly the same risks, in different geographical areas). All of these risks are taken to be independent.

To simplify, we assume that all the risks follow the same yearly payout distributions. The details of the distribution doesn't matter much for the argument, but in this toy model, the payouts follow the discrete binomial distribution with n=10 and p=0.5, with millions of pounds as the unit:

This means that the probability that each sector pays out £n million each year is (0.5)10 . 10!/(n!(10-n)!).

All these companies are bound by Solvency II-like requirements, that mandate that they have to be 99.5% sure to payout all their policies in a given year - or, put another way, that they only fail to payout once in every 200 years on average. To do so, in each sector, the insurance companies have to have capital totalling £9 million available every year (the red dashed line).

Assume that each sector expects £1 million in total yearly expected profit. Then since the expected payout is £5 million, each sector will charge £6 million a year in premiums. They must thus maintain a capital reserve of £3 million each year (they get £6 million in premiums, and must maintain a total of £9 million). They thus invest £3 million to get an expected profit of £1 million - a tidy profit!

Every two hundred years, one of the insurance sectors goes bust and has to be bailed out somehow; every hundred billion trillion years, all ten insurance sectors go bust all at the same time. We assume this is too big to be bailed out, and there's a grand collapse of the whole insurance industry with knock on effects throughout the economy.

But now assume that insurance companies are allowed to invest in each other's sectors. The most efficient way of doing so is to buy equally in each of the ten sectors. The payouts across the market as a whole are now described by the discrete binomial distribution with n=100 and p=0.5:

This is a much narrower distribution (relative to its mean). In order to have enough capital to payout 99.5% of the time, the whole industry needs only keep £63 million in capital (the red dashed line). Note that this is far less that the combined capital for each sector when they were separate, which would be ten times £9 million, or £90 million (the pink dashed line). There is thus a profit taking opportunity in this area (it comes from the fact that the standard deviation of X+Y is less that the standard deviation of X plus the standard deviation Y).

If the industry still expects to make an expected profit of £1 million per sector, this comes to £10 million total. The expected payout is £50 million, so they will charge £60 million in premium. To accomplish their Solvency II obligations, they still need to hold an extra £3 million in capital (since £63 million - £60 million = £3 million). However, this is now across the whole insurance industry, not just per sector.

Thus they expect profits of £10 million based on holding capital of £3 million - astronomical profits! Of course, that assumes that the insurance companies capture all the surplus from cross investing; in reality there would be competition, and a buyer surplus as well. But the general point is that there is a vast profit opportunity available from cross-investing, and thus if these investments are possible, they will be made. This conclusion is not dependent on the specific assumptions of the model, but captures the general result that insuring independent risks reduces total risk.

But note what has happened now: once every 200 years, an insurance company that has spread their investments across the ten sectors will be unable to payout what they owe. However, every company will be following this strategy! So when one goes bust, they all go bust. Thus the complete collapse of the insurance industry is no longer a one in hundred billion trillion year event, but a one in two hundred year event. The risk for each company has stayed the same (and their profits have gone up), but the systemic risk across the whole insurance industry has gone up tremendously.

...and they failed to live happily ever after for very much longer.