Crossposted at Intelligent Agents Forum.

The nearest unblocked strategy problem (NUS) is the idea that if you program a restriction or a patch into an AI, then the AI will often be motivated to pick a strategy that is as close as possible to the banned strategy, very similar in form, and maybe just as dangerous.

For instance, if the AI is maximising a reward R, and does some behaviour Bi that we don't like, we can patch the AI's algorithm with patch Pi ('maximise R subject to these constraints...'), or modify R to Ri so that Bi doesn't come up. I'll focus more on the patching example, but the modified reward one is similar.

The problem is that Bi was probably a high value behaviour according to R-maximising, simply because the AI was attempting it in the first place. So there are likely to be high value behaviours 'close' to Bi, and the AI is likely to follow them.
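To make this concrete, here's a toy sketch (my own illustration, not part of the formal setup; the Gaussian-shaped reward and the ban radius are arbitrary choices): behaviours live on a line, R is a smooth function of the behaviour, and each patch bans a small neighbourhood around the behaviour we objected to. Because R is smooth, the best remaining behaviour is always right next door.

```python
import numpy as np

# Toy model: behaviours are points in [0, 1], the reward R is a smooth
# function of the behaviour, and a patch bans a small neighbourhood.
behaviours = np.linspace(0.0, 1.0, 1001)
R = np.exp(-((behaviours - 0.7) ** 2) / 0.02)    # reward peaks at behaviour 0.7

banned = np.zeros_like(behaviours, dtype=bool)

for i in range(3):
    allowed = ~banned
    b_star = behaviours[allowed][np.argmax(R[allowed])]
    print(f"after {i} patches, the agent picks behaviour {b_star:.3f}")
    banned |= np.abs(behaviours - b_star) < 0.011   # patch P_i: ban a small neighbourhood

# Each pick lands just outside the previously banned interval, with almost
# the same reward as the behaviour we just blocked.
```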

 

A simple example

Consider a cleaning robot that rushes through its job and knocks over a white vase.

Then we can add patch P1: "don't break any white vases".

Next time the robot acts, it breaks a blue vase. So we add P2: "don't break any blue vases".

The robot's next few run-throughs result in more patches: P3: "don't break any red vases", P4: "don't break mauve-turquoise vases", P5: "don't break any black vases with cloisonné enamel"...
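A toy version of this loop (again just an illustration; the 'speed' values are made up): each patch bans only the exact thing the robot just broke, so the robot's next-fastest plan is always to break something else.

```python
# Made-up plans: rushing is fast but breaks something; the careful plan is slower.
plans = [
    {"speed": 10, "breaks": "white vase"},
    {"speed": 10, "breaks": "blue vase"},
    {"speed": 10, "breaks": "red vase"},
    {"speed": 10, "breaks": "black vase with cloisonné enamel"},
    {"speed": 3,  "breaks": None},
]

patches = []   # each patch just names the exact thing the robot must not break

def allowed(plan):
    return plan["breaks"] not in patches

for run in range(4):
    plan = max((p for p in plans if allowed(p)), key=lambda p: p["speed"])
    print(f"run {run + 1}: the robot breaks the {plan['breaks']}")
    patches.append(plan["breaks"])   # patch exactly what we just saw go wrong
```

The careful plan never gets chosen; the patch list just grows.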

 

Learning the restrictions

Obviously the better thing for the robot to do would be just to avoid breaking vases. So rather than giving the robot endless patches, we could give it the patches P1, P2, P3, P4... and have it learn: "what is the general behaviour that these patches are trying to proscribe? Maybe I shouldn't break any vases."

Note that even the single patch P1 requires some learning, since it has to proscribe breaking white vases at all times, in all locations, under all types of lighting, etc...

The idea is similar to the one mentioned in the post on emergency learning: have the AI generalise the idea of restricted behaviour from examples (= patches), rather than having to define every example ourselves.
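Here is one very crude way that generalisation could look (a sketch of my own, with hypothetical attribute labels; real behaviours would not come so neatly described): represent each patched behaviour by a few attributes, and keep only the attributes shared by every patch as the general rule.

```python
# Hypothetical attribute descriptions of the patched behaviours.
patches = [
    {"action": "break", "object": "vase", "colour": "white"},
    {"action": "break", "object": "vase", "colour": "blue"},
    {"action": "break", "object": "vase", "colour": "red"},
    {"action": "break", "object": "vase", "colour": "mauve-turquoise"},
]

def generalise(patches):
    """Keep only the attribute/value pairs common to every patch."""
    rule = dict(patches[0])
    for p in patches[1:]:
        rule = {k: v for k, v in rule.items() if p.get(k) == v}
    return rule

def is_banned(behaviour, rule):
    """A behaviour is banned if it matches every attribute of the learned rule."""
    return all(behaviour.get(k) == v for k, v in rule.items())

rule = generalise(patches)
print(rule)  # {'action': 'break', 'object': 'vase'} -- i.e. "don't break any vases"
print(is_banned({"action": "break", "object": "vase", "colour": "black"}, rule))  # True
```

Of course, picking out the right attributes in the first place is most of the learning problem.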

 

A complex example

The vase example is obvious, but ideally we'd generalise it. We'd hope to have the AI take patches like:

  • P1 "Don't break vases."
  • P2: "Don't vacuum the cat."
  • P3: "Don't use bleach on paintings."
  • P4: "Don't obey human orders when the human is drunk."
  • ...

And then have the AI infer very different restrictions, like "Don't imprison small children."

Can this be done? Can we get a sufficient depth of example patches that most other human-desired patches can be learnt or deduced? And can we do this without the AI simply learning "Manipulate the human"? This is one of the big questions for methods like reward learning.

9 comments

Recently in the LW Facebook group, I shared a real-world example of an AI being patched and finding a nearby unblocked strategy several times. Maybe you can use it one day. This example is about Douglas Lenat's Eurisko and the strategies it generated in a naval wargame. In this case, the 'patch' was a rules change. For some context, R7 is the name of one of Eurisko's heuristics:

A second use of R7 in the naval design task, one which also inspired a rules change, was in regard to the fuel tenders for the fleet. The constraints specified a minimum fractional tonnage which had to be held back, away from battle, in ships serving as fuel tenders. R7 caused us to consider using warships for that purpose, and indeed that proved a useful decision: whenever some front-line ships were moderately (but not totally) damaged, they traded places with the tenders in the rear lines. This maneuver was explicitly permitted in the rules, but no one had ever employed it except in desperation near the end of a nearly-stalemated battle, when little besides tenders were left intact. Due to the unintuitive and undesirable power of this design, the tournament directors altered the rules so that in 1982 and succeeding years the act of 'trading places' is not so instantaneous. The rules modifications introduced more new synergies (loopholes) than they eliminated, and one of those involved having a ship which, when damaged, fired on (and sunk) itself so as not to reduce the overall fleet agility.

Thanks! Do you have a link to the original article?

The quote is from this article, section 4.1. There might be other descriptions elsewhere, Lenat himself cites some documents released by the organization hosting the wargame. You might want to check out the other articles in the 'Nature of Heuristics' series too. I think there are free pdfs for all of them on Google Scholar.

We now understand fairly well that in gradient descent-based computer vision systems, the model that eventually gets learned about the environment generalizes well to new examples that have never been seen (see here, and response here).

In other words, the "over-fitting" problem tends to be avoided even in highly over-parameterized networks. This seems to be because:

  • Models that memorize specific examples tend to be more complex than models that generalize.
  • Gradient descent-based optimizers have to do more work (they have to move a point in parameter space across a longer distance) to learn a complex model with more curvature.
  • Even though neural nets are capable of learning functions of nearly arbitrary complexity, they will first try to fit a relatively smooth function, or close to the simplest hypothesis that fits the data (see the sketch after this list).
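A minimal demonstration of that last point, using over-parameterized linear regression rather than a neural net (a toy demo of my own; the dimensions, learning rate, and the 'memorizing' construction are arbitrary choices): gradient descent started from zero finds the smallest-norm model that fits the training data, and that model generalizes much better than another model that fits the same training data but carries an extra memorized component.

```python
import numpy as np

rng = np.random.default_rng(0)

# More parameters than training points: many different models fit the data exactly.
d, n_train, n_test = 200, 50, 500
w_true = np.zeros(d)
w_true[:5] = 1.0                       # the true rule only uses a few directions

X_train = rng.normal(size=(n_train, d))
y_train = X_train @ w_true
X_test = rng.normal(size=(n_test, d))
y_test = X_test @ w_true

# Gradient descent on squared error, starting from zero.
w = np.zeros(d)
lr = 0.05
for _ in range(2000):
    w -= lr * X_train.T @ (X_train @ w - y_train) / n_train

# A second model that fits the training data just as well but "memorizes":
# add a large component along a direction the training data cannot see.
null_direction = np.linalg.svd(X_train)[2][-1]   # lies in the null space of X_train
w_memorize = w + 20.0 * null_direction

def mse(model, X, y):
    return float(np.mean((X @ model - y) ** 2))

print("train error, GD model:        ", mse(w, X_train, y_train))
print("train error, memorizing model:", mse(w_memorize, X_train, y_train))
print("test error,  GD model:        ", mse(w, X_test, y_test))
print("test error,  memorizing model:", mse(w_memorize, X_test, y_test))
print("norm of GD model:        ", np.linalg.norm(w))
print("norm of memorizing model:", np.linalg.norm(w_memorize))
```

Both models fit the training set essentially perfectly; the higher-norm, memorizing one is the one that falls apart on new examples.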

So it seems that agents that implement neural network based vision systems may be disincentivized from explicitly learning loopholes about object-level concepts - since this would require them to build more complex models about the environment. The danger then arises when the agent is capable of explicitly computing its own reward function. If it knows that its reward function is based on fungible concepts, it could be incentivized to alter those concepts - even if doing so would come at a cost - if it had certainty that doing so would result in a great enough reward.

So what if it didn't have certainty about its reward function? Then it might need to model the reward function using similar techniques, perhaps also a gradient-based optimizer. That might be an acceptable solution insofar as it is also incentivized to learn the simplest model that fits what it has observed about its reward function. Actions that are complex, or that come very close to actions known to have large cost, would then be very unsafe bets due to that uncertainty. Actions that are close together in function space will be given similar cost values if the function that maps actions to rewards is fairly smooth.
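Here is a small sketch of that last idea (my own toy construction; the kernel smoother, the distance-based uncertainty proxy, and all the numbers are assumptions, not anything established): the agent has seen a few actions punished, models cost with a smooth fit, and treats distance from its observations as uncertainty to be penalized.

```python
import numpy as np

# Observed (action, cost) pairs: actions near 0.8 were punished, actions near 0.2 were fine.
observed_actions = np.array([0.18, 0.22, 0.78, 0.80, 0.82])
observed_costs   = np.array([0.0,  0.0,  10.0, 10.0, 10.0])

candidates = np.linspace(0.0, 1.0, 101)
raw_reward = candidates                   # raw reward grows toward the risky end

def smooth_cost(a, bandwidth=0.1):
    """Kernel-smoothed cost estimate: nearby actions get similar estimated costs."""
    w = np.exp(-((a[:, None] - observed_actions[None, :]) ** 2) / (2 * bandwidth ** 2))
    w /= w.sum(axis=1, keepdims=True)
    return w @ observed_costs

def uncertainty(a):
    """Crude uncertainty proxy: distance to the nearest observed action."""
    return np.min(np.abs(a[:, None] - observed_actions[None, :]), axis=1)

greedy   = candidates[np.argmax(raw_reward - smooth_cost(candidates))]
cautious = candidates[np.argmax(raw_reward - smooth_cost(candidates)
                                - 5.0 * uncertainty(candidates))]
print(f"greedy choice:   {greedy:.2f}")    # drifts into territory it knows nothing about
print(f"cautious choice: {cautious:.2f}")  # sticks close to a known-safe action
```

Without the uncertainty penalty the agent drifts toward higher raw reward in a region it has never observed; with it, actions that merely sit near the punished ones already look like bad bets.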

Of course I don't know if any of the above remains true in the regime where model capacity and optimization power are no longer an issue.

So it seems that agents that implement neural network based vision systems may be disincentivized from explicitly learning loopholes about object-level concepts - since this would require them to build more complex models about the environment.

This argument works well if the undesired models are also the complex ones (such as don't-break-vase vs don't-break-white-vase). On the other hand, it fails terribly if the reverse is the case.

For example, if humans are manually pressing the reward button, then the hypothesis that reward comes from humans pressing the button (which is the undesirable hypothesis -- it leads to wireheading) will often be the simplest one that fits the data.

The danger then arises when the agent is capable of explicitly computing its own reward function. If it knows that its reward function is based on fungible concepts, it could be incentivized to alter those concepts - even if doing so would come at a cost - if it had certainty that doing so would result in a great enough reward.

This kind of confuses two levels. If the AI has high confidence that it is being rewarded for increasing human happiness, then when it considers modifying its own concept of human happiness to something easier to satisfy, it will ask itself the question "will such a self-modification improve human happiness?" using its current concept of human happiness.

That's not to say that I'm not worried about such problems; there can easily be bad designs which cause the AI to make this sort of mistake.

I agree with you; this is an old post that I don't really agree with any more.

I also would have expected you to agree w/ my above comment when you originally wrote the post; I just happened to see tristanm's old comment and replied.

However, now I'm interested in hearing about what ideas from this post you don't endorse!

What's the complexity of inputs the robot is using? I think you're mixing up levels of abstraction if you have a relatively complex model for reward, but you only patch using trivially-specific one-liners.

If we're talking about human-level or better AI, why wouldn't the patch be at roughly the same abstraction as for humans? Grandpa yells at the kid if he breaks a white vase while sweeping up the shop. The kid doesn't then think it's OK to break other stuff. Maybe the patch is more systemic - grandpa helps the kid by pointing out a sweeping style that puts less merchandise at risk (and only incidentally is an ancient bo stick fighting style).

In any case, it's never at risk of being interpreted as "don't break white vases".

Even for the nascent chatbots and "AI" systems in play today, which are FAR less complicated and which very often have hamfisted patches to adjust output in ways that are hard to train (like "never use this word, even if it's in a lot of source material"), the people writing patches know the system well enough to know when to fix the model and how to write a semi-general patch.

The patches are there to illustrate the point, rather than to be realistic specific examples.