Less Wrong is a community blog devoted to refining the art of human rationality. Please visit our About page for more information.

Comment author: gwern 03 February 2016 08:06:25PM *  4 points [-]

feep and I have implemented a slightly-tweaked version of this using a DQN agent in Reinforce.js. (Tabular turns out to be a bit infeasible.)

At the moment, if you want to modify settings like in Karpathy's demos, you'll have to do something like download it locally to edit, with a command like wget --mirror 'www.gwern.net/docs/rl/armstrong-controlproblem/index.html' && firefox ./www.gwern.net/docs/rl/armstrong-controlproblem/index.html

Comment author: Stuart_Armstrong 08 February 2016 10:59:32AM 0 points [-]

Thanks, most excellent!

Goal completion: algorithm ideas

4 Stuart_Armstrong 25 January 2016 05:36PM

A putative new idea for AI control; index here.

This post will be extending ideas from inverse reinforcement learning (IRL) to the problem of goal completion. I'll be drawing on the presentation and the algorithm from Apprenticeship Learning via Inverse Reinforcement Learning (with one minor modification).

In that setup, the environment is an MDP (Markov decision process), and the real reward R is assumed to be linear in the "features" of the state-action space. Features are functions φ_i from the full state-action space S×A to the unit interval [0,1] (the paper linked above only considers functions from the state space; this is the "minor modification"). These features form a vector φ∈[0,1]^k, for k different features. The actual reward is given by the inner product with a vector w∈ℝ^k, thus the reward at state-action pair (s,a) is

R(s,a) = w . φ(s,a).

To ensure the reward is always between -1 and 1, w is constrained to have ||w||_1 ≤ 1; to reduce redundancy, we'll assume ||w||_1 = 1.
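As a minimal sketch of the linear reward described above (not code from the post or the paper), the following Python normalises a weight vector to unit L1 norm and evaluates the inner product with a feature vector. The function names `l1_normalize` and `linear_reward` and the example numbers are assumptions chosen for illustration.

```python
from typing import List, Sequence


def l1_normalize(w: Sequence[float]) -> List[float]:
    """Rescale w so that the sum of absolute values is 1 (||w||_1 = 1)."""
    total = sum(abs(x) for x in w)
    if total == 0:
        raise ValueError("cannot normalise the zero vector")
    return [x / total for x in w]


def linear_reward(w: Sequence[float], phi: Sequence[float]) -> float:
    """Reward at one state-action pair: the inner product of w and phi."""
    if len(w) != len(phi):
        raise ValueError("w and phi must have the same length k")
    if any(not 0.0 <= f <= 1.0 for f in phi):
        raise ValueError("every feature must lie in [0, 1]")
    return sum(wi * fi for wi, fi in zip(w, phi))


# Hypothetical example with k = 3 features.
example_w = l1_normalize([2.0, -1.0, 1.0])
example_reward = linear_reward(example_w, [1.0, 0.0, 1.0])
```

Because every feature lies in [0,1] and the absolute weights sum to 1, the result cannot leave the interval [-1, 1], which is the point of the norm constraint.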

The advantage of linearity is that we can compute the expected rewards directly from the expected feature vector. If the agent follows a policy π (a map from states to actions) and has a discount factor γ, the expected feature vector is

μ(π) = E(Σ_t γ^t φ(s_t, π(s_t))),

where s_t is the state at step t.

The agent's expected reward is then simply

E(R) = w . μ(π).
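The two formulas above can be sketched in Python as follows. This is an illustrative estimate only: it assumes μ(π) is approximated by averaging the discounted feature sums over a finite set of sampled, finite-length trajectories, and the names `discounted_features`, `estimate_mu`, `expected_reward`, the toy feature map and the numbers are all hypothetical.

```python
from typing import Callable, List, Sequence, Tuple

Trajectory = Sequence[Tuple[int, int]]  # (state, action) pairs, one per step
FeatureMap = Callable[[int, int], Sequence[float]]


def discounted_features(trajectory: Trajectory, phi: FeatureMap,
                        gamma: float) -> List[float]:
    """Sum over t of gamma^t * phi(s_t, a_t) for a single trajectory."""
    total: List[float] = []
    for t, (s, a) in enumerate(trajectory):
        f = phi(s, a)
        if not total:
            total = [0.0] * len(f)
        for i, fi in enumerate(f):
            total[i] += (gamma ** t) * fi
    return total


def estimate_mu(trajectories: Sequence[Trajectory], phi: FeatureMap,
                gamma: float) -> List[float]:
    """Average the discounted feature sums over the sampled trajectories."""
    sums = [discounted_features(tr, phi, gamma) for tr in trajectories]
    k = len(sums[0])
    return [sum(s[i] for s in sums) / len(sums) for i in range(k)]


def expected_reward(w: Sequence[float], mu: Sequence[float]) -> float:
    """E(R) = w . mu(pi)."""
    return sum(wi * mi for wi, mi in zip(w, mu))


def toy_phi(state: int, action: int) -> List[float]:
    """Hypothetical features: 'accelerating forward' and 'accelerating back'."""
    return [1.0 if action > 0 else 0.0, 1.0 if action < 0 else 0.0]


toy_mu = estimate_mu([[(0, 1), (1, 1), (2, -1)]], toy_phi, gamma=0.5)
toy_value = expected_reward([0.5, -0.5], toy_mu)
```

Note that the estimate never needs the reward at individual steps: once the feature expectations are known, any candidate w can be scored with a single inner product.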

Thus the problem of computing the correct reward is reduced to the problem of computing the correct w. In practice, to compute the correct policy, we just need to find one whose expected features are close enough to optimal; this need not involve computing w.
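One way to see why matching expected features is enough, without ever recovering w, is the following short bound. It is a sketch under the assumptions already stated (||w||_1 = 1), with π* introduced here as a label for the policy whose features we are trying to match; the inequality in the middle is Hölder's inequality.

```latex
\begin{align*}
\left| w \cdot \mu(\pi) - w \cdot \mu(\pi^*) \right|
  &= \left| w \cdot \bigl( \mu(\pi) - \mu(\pi^*) \bigr) \right| \\
  &\le \lVert w \rVert_1 \, \lVert \mu(\pi) - \mu(\pi^*) \rVert_\infty \\
  &= \lVert \mu(\pi) - \mu(\pi^*) \rVert_\infty .
\end{align*}
```

So if every component of μ(π) is within ε of the corresponding component of μ(π*), the two policies' expected rewards differ by at most ε, whatever the true w happens to be.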

Comment author: SilentCal 21 January 2016 06:16:18PM 1 point [-]

I think we need to provide some kind of prior regarding unknown features of model and reward if we want the given model and reward to mean anything. Otherwise, for all the AI knows, the true reward has a +2-per-step term that reverses the reward-over-time feature. It can still infer the algorithm generating the sample trajectories, but the known reward is no help at all in doing so.

I think what we want is for the stated reward to function as a hint. One interpretation might be to expect that the stated reward should approximate the true reward well over the problem and solution domains humans have thought about. This works in, for instance, the case where you put an AI in charge of the paper clip factory with the stated reward '+1 per paper clip produced'.

Comment author: Stuart_Armstrong 22 January 2016 10:19:39AM 0 points [-]

Indeed. But I want to see if I can build up to this in the model.

Comment author: Lumifer 21 January 2016 03:25:45PM 1 point [-]

distinguishing noise from performance

That, actually, is a very very big problem.

Comment author: Stuart_Armstrong 22 January 2016 10:15:34AM 1 point [-]

Yes. But it might not be as bad as the general AI problem. If we end up with an AI that believes that some biases are values, it will waste a lot of effort, but might not be as terrible as it could be. But that requires a bit more analysis than my vague comment, I'll admit.

Comment author: HungryHobo 21 January 2016 12:43:14PM *  1 point [-]

OK, so the AI simply apes humans and doesn't attempt to optimize anything?

Let's imagine something we actually might want to optimize.

Let's imagine that the human pilots take 10 turns to run through a paper checklist at the start before any acceleration is applied. The turnover in the middle is 10 turns instead of 2 simply because the humans take time to manually work the controls and check everything.

Do we want our AI to decide that there simply must be some essential reason for this and sit still for no reason, aping humans without knowing why and without any real reason? It could do all checks in a single turn but doesn't actually start moving for 10 turns because of artificial superstition?

And now that we're into superstitions, how do we deal with spurious correlations?

If you give it a dataset of humans doing the same task many times with little variation, but by pure chance whenever [some condition is met] the pilot takes one extra turn to start moving (because he was picking his nose or just happened to take slightly longer), does the AI make sure to always take 11 turns when those conditions are met?

Self-driving cars which simply ape the car in front and don't attempt to optimise anything already exist, and are probably quite safe indeed, but they're not all that useful for coming up with better ways to do anything.

Comment author: Stuart_Armstrong 21 January 2016 02:13:26PM 0 points [-]

That's the next step, once we have the aping done correctly (i.e. distinguishing noise from performance).

Comment author: HungryHobo 20 January 2016 02:29:53PM *  2 points [-]

Is this AI assessing things from a subjective point of view or an objective one? I.e., is it optimizing for higher values on a counter on board the ship from one turn to the next, or is it optimizing for higher values on a counter kept in a god's-eye-view location?

You spent a great deal of your post explaining the testing scenario but I'm still entirely unclear about how your AI is supposed to learn from sample trajectories (training data?).

Assuming it's given a training set which includes many trajectories where human pilots have killed themselves by going to +3 or -3, you might also run into the problem that those trajectories are also likely to be ones where the ships never reached the station with velocity zero, because there were no humans alive on board to bring the ship in.

To the AI they would look like trajectories which effectively go to negative infinity because the ship has accelerated into the void never to be seen again, receiving unlimited -1 points per turn.

It might assume that +3 -3 accelerations cause automatic failure but not make any connection to PA.

To be clear, are you hoping that your AI will find statistical correlations between high scores and various behaviors without you needing to clearly outline them? How is this different from a traditional scoring function when given training data?

Also, I'm completely unclear about where "the discomfort of the passenger for accelerations" comes into the scoring. If the training data includes many human pilots taking +2 -2 trips and many pilots taking +1 -1 trips, and the otherwise identical +2 -2 trips get better scores, there's no reason (at least that you've given) for your AI to derive that +1 -1 is better. Indeed, it might consider the training data to have been included to teach it that +2 -2 is always better.

Comment author: Stuart_Armstrong 21 January 2016 11:18:56AM 2 points [-]

To be clear, in this model, the sample trajectories will all be humans performing perfectly - i.e. only accelerations of +1 and -1, for every turn (except maybe two turns).

Informally, what I expect the AI to do is think "hum, the goal I'm given should imply that accelerations of +3 and -3 give the optimal trajectories. Yet the trajectories I see do not use them. Since I know the whole model, I know that accelerations of +3 and -3 move me to PA=0. It seems likely that there's a penalty for reaching that state, so I will avoid it.

Hum, but I also see that no-one is using accelerations of +2 and -2. Strange. This doesn't cause the system to enter any new state. So there must be some intrinsic penalty in the real reward function for using such accelerations. Let's try and estimate what it could be..."
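The first step of that informal reasoning, noticing which available actions the human pilots never use and sorting them by whether the known model says they enter a new state, could be sketched as below. This is only an illustration: the action range -3..+3, the sample trajectories, and the function names `unused_actions` and `split_by_cause` are assumptions, and the sketch does not attempt to estimate how large any penalty is.

```python
from typing import Callable, Dict, Iterable, List, Sequence


def unused_actions(available: Iterable[int],
                   trajectories: Sequence[Sequence[int]]) -> List[int]:
    """Actions the model allows but that never appear in any sample trajectory."""
    used = {a for trajectory in trajectories for a in trajectory}
    return sorted(a for a in set(available) if a not in used)


def split_by_cause(unused: Sequence[int],
                   enters_new_state: Callable[[int], bool]) -> Dict[str, List[int]]:
    """Separate unused actions into those whose avoidance could be explained
    by a penalty on the state they lead to, and those that would need an
    intrinsic penalty on the action itself."""
    return {
        "state_penalty": [a for a in unused if enters_new_state(a)],
        "intrinsic_penalty": [a for a in unused if not enters_new_state(a)],
    }


# Hypothetical sample trajectories: accelerations per turn, mostly +1 and -1.
samples = [[1, 1, 0, -1, -1], [1, 1, 1, 0, -1, -1, -1]]

# Assumed action set; per the comment above, |a| = 3 moves the system to PA=0.
never_used = unused_actions(range(-3, 4), samples)
hypotheses = split_by_cause(never_used, lambda a: abs(a) == 3)
```

With these assumed inputs, the ±3 accelerations land in the "state penalty" group and the ±2 accelerations in the "intrinsic penalty" group, mirroring the two cases distinguished in the comment.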

Comment author: Kaj_Sotala 20 January 2016 03:44:34PM *  4 points [-]

I'm calling "goal completion" the idea of giving an AI a partial goal, and having the AI infer the missing parts of the goal, based on observing human behaviour. Here is an initial model to test some of these ideas on.

Given that we cannot exhaustively explain all of the definitions and assumptions behind a goal, and these have to be inferred based on background knowledge about us, isn't the process of giving an AI any non-trivial goal an example of goal completion?

Comment author: Stuart_Armstrong 21 January 2016 11:06:11AM 0 points [-]

In a sense, yes. But here we're trying to automate the process.

Comment author: Vaniver 20 January 2016 05:30:46PM 0 points [-]

Can the AI tell the difference?

It looks to me like the difference is that one constraint is binding, and the other isn't. Are you familiar with primal and dual formulations of optimization problems? (I have a specific idea, but want to know the level of detail I should explain it at / if I should go reference hunting.)

For the moment we're ignoring a lot of subtleties (such as bias and error on the part of the human expert), and these will be gradually included as the algorithm develops.

It seems to me that this is the real heart of this problem. I can see why you'd want to investigate solutions for simpler problems, in the hopes that they'll motivate a solution for the main problem, but I'm worried that you're calling this a subtlety.

so even our "true reward function" didn't fully encode what we really wanted.

Replacing it with -1000 if PA=0 should make it work, by the same reasoning that makes accelerations of ±2 never optimal.

Comment author: Stuart_Armstrong 21 January 2016 11:05:29AM 1 point [-]

Are you familiar with primal and dual formulations of optimization problems?

Nope, sorry! Tell me! ^_^

Replacing it with -1000 if PA=0 should make it work, by the same reasoning that makes accelerations of ±2 never optimal.

But that doesn't really capture our intuitions either - if, in a more general model, a human ended up dying by accident, we wouldn't want the agent to pay the price for this each and every turn thereafter.

Comment author: SilentCal 19 January 2016 05:23:51PM 0 points [-]

I'm not sure I've encountered these more advanced versions. Is there a link?

Comment author: Stuart_Armstrong 20 January 2016 02:02:35PM 0 points [-]
Comment author: Lumifer 19 January 2016 06:08:09PM 0 points [-]

De gustibus, of course, but I prefer limited and hard definitions to ones that fuzzily expand until they're "almost equivalent" to a large and vague concept...

Comment author: Stuart_Armstrong 20 January 2016 02:00:39PM *  0 points [-]

I see your point, but I don't think the original Chesterton's fence is a stable concept. Knowing why the person who built the fence built it is different from knowing why the fence was built (and allowed to stand). But, as you say, de gustibus (an expression I will steal from now on).
