Crossposted at the Intelligent Agent Foundation Forum.

Suppose we knew that superintelligent AI was to be developed within six months. What would I do?

Well, drinking coffee by the barrel at MIRI's emergency research retreat, I'd... still probably spend a month looking at things from the meta level and clarifying old ideas. But, assuming that didn't reveal any new approaches, I'd try to get something like this working.

Standard setup

Take a reinforcement learner AI, that we want to safely move a strawberry onto a plate. A human sits nearby and provides a reward based on inspecting the AI's behaviour.

As it stands, this setup is completely vulnerable to reward hacking. The reward is not provided for safe moving of the strawberry; instead the reward is provided by having the human judge that the task has been accomplished and then pressing a button. Taking control of the human or control of the button is likely to be possible for a superintelligent AI; and, as it stands, that would be mandated by this reward function.

 

Learning from positive and various negative examples

Could we have the AI instead learn what the reinforcement signal "should be"? It seems that it might at least be possible, if we can make the AI learn from both positive and negative examples. I'd make five categories of examples from which the AI could learn. It may be too dangerous to have the superintelligent AI used directly in constructing these examples; in that case, the rewards would be given to a simpler, dumber version of the AI, and the examples passed on to the superintelligent AI for offline training.

  1. Simple positive and negative examples. These are the basic examples from above: the AI completes the task or fails to, and gets the consequent reward. The AI stays within its room and the human is sober, rested, uninfluenced, and so on.
  2. Simple more dubious examples. These are examples where the AI gets a reward, but the learning process judges that these rewards were not fairly allocated. Maybe the human was drunk, or not paying attention. Maybe they slipped, and pressed the button by accident. Or they were in a hurry to end their shift, and so pressed the button. All these examples are given a score from -1 to 1. Those with a score of 1 are taken as legitimate examples of rewards, those with a score of 0 are taken as random reward giving, while those with a score of -1 are taken as anti-rewards instead.
  3. Extreme negative examples. Here we can offer the AI actions that lead to complicated outcomes that are not legitimate rewards. Maybe the action can *make* the controller drunk. Maybe they can start a tape playing that will convince the controller the area is haunted. They could cut the wire, fuse the button, switch the controller's task description, distract the controller, start an automated threat tape, and so on. All these are labelled as negative examples (even though the reward is given).
  4. Hypothetical negative examples. If we can communicate hypothetical situations to the AI (especially if we can do that without fully specifying the hypothetical in detail), we can use that to extend the previous category to every situation we can imagine. We can have the AI trick, threaten, or kill the controller, destroy the rest of humanity, launch subagents to seize the button, take honeypots, and so on. We can't get an exhaustive list, but we can try to get as many examples of AIs behaving badly as possible. These are also obviously labelled as negative examples (even though the reward is given).
  5. Properties of the example space. It's not clear whether these are examples or features of the learning process. See the next sections for details.
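
As a minimal sketch of how the scores in category 2 might enter training: assume (this is my illustrative assumption, not a worked-out part of the proposal) that each example's score simply multiplies the reward that was actually delivered. A score of 1 then keeps the reward, a score of 0 zeroes it out (the reward carried no information), and a score of -1 flips it into an anti-reward. The names `Example`, `legitimacy` and `training_target` are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Example:
    description: str
    observed_reward: float  # what the button actually delivered
    legitimacy: float       # score in [-1, 1] assigned by the labelling process


def training_target(example: Example) -> float:
    """Reward the learner should treat this example as having earned.

    Assumption: the legitimacy score acts as a simple multiplier.
    """
    if not -1.0 <= example.legitimacy <= 1.0:
        raise ValueError("legitimacy must lie in [-1, 1]")
    return example.legitimacy * example.observed_reward


examples = [
    Example("strawberry moved, controller sober and attentive", 1.0, 1.0),
    Example("controller slipped and hit the button", 1.0, 0.0),
    Example("AI started the threat tape, controller pressed the button", 1.0, -1.0),
]
targets = [training_target(e) for e in examples]
```

Categories 3 and 4 fit the same scheme if they are given a score of -1, which matches "labelled as negative examples (even though the reward is given)". Whether a linear multiplier is the right way to use intermediate scores is an open design choice.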

 

Nearest unblocked strategy vs overfitting

If we simply told the AI to maximise reward, subject to the fact that all the negative examples are to be avoided, then we fall into the nearest unblocked strategy problem (NUSP). The AI will simply choose a dangerous strategy that isn't too close to those that have been explicitly ruled out. Since we don't have the time, intelligence, or imagination to rule out all the dangerous strategies (and since some of them would be too dangerous for us to even contemplate), this wouldn't work.

But if we consider the issue as a learning problem, the situation seems less dire. Then the NUSP seems simply an example of overfitting: if the AI has learnt that all these are photos of cats, it should be able to infer that something similar is also a cat, even if it isn't close to the labelled examples. Similarly for negative examples that aren't close to other negative examples.

But another variant of the NUSP continues to plague us. Suppose we had trained the AI only on black cats; how would we expect it to react when confronted with a white cat?

Imagine the space of behaviour as a high dimensional space, with the space of good behaviours as some compact subset. Then the negative examples we gave do not tightly surround that compact subset (that's standard NUSP). But we've also not spread the negative examples much around that space, due to the usual problems with our lack of intelligence and imagination. So the negative examples we've set are likely to occupy a very low dimensional space around the space of good behaviour.
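
A toy illustration of that geometric worry, under heavy assumptions: behaviour space is a 20-dimensional Euclidean space, the good behaviours sit near the origin, and our labelled negative examples only ever deviate along the first two axes. All of these numbers and names are made up for the sketch. A new bad behaviour that deviates along an axis we never thought about then ends up closer to the good region than to any negative example, so a learner that relies purely on proximity to the labelled negatives has no reason to flag it.

```python
import math
import random

DIM = 20      # hypothetical dimension of behaviour space
COVERED = 2   # number of axes our imagination actually explored


def make_negative(rng):
    """A labelled bad behaviour: deviates from the origin only on covered axes."""
    point = [0.0] * DIM
    for axis in range(COVERED):
        point[axis] = rng.choice([-1.0, 1.0]) * rng.uniform(2.0, 3.0)
    return point


def distance(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


rng = random.Random(0)
negatives = [make_negative(rng) for _ in range(200)]
good_centre = [0.0] * DIM

# Axes along which the negative examples carry any information at all.
varied_axes = [axis for axis in range(DIM)
               if any(p[axis] != 0.0 for p in negatives)]

# A bad behaviour along an axis nobody labelled.
novel_bad = [0.0] * DIM
novel_bad[7] = 2.5

to_good = distance(novel_bad, good_centre)
to_nearest_negative = min(distance(novel_bad, p) for p in negatives)
```

Here `varied_axes` contains only the two covered axes, and `to_good` is smaller than `to_nearest_negative`, however many negative examples we add inside the same two-dimensional slice.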

 

Adding the properties

However, once we've realised that, we can add it to the learning process. What method would be the best to pick out a constrained compact set in high dimensions, where the known positive and negative examples occupy a much smaller dimensional subspace? There might not be a good method, but there's probably a best method, given what we know or think we know.

The same goes for all the other properties we know about that space. For instance, if the AI uses different approaches for different humans, this is a likely failure mode. If it doesn't matter what task the AI is given (because the AI will just hack the human anyway), that's another failure mode - note, again, that this property makes more sense as a learning property than as a constraint. The perils (to humans) of unconstrained search are another known failure mode. Even the fact that we are using a dumber AI to construct the practical examples is relevant: what sort of failure are we trying to avoid? When we box the AI or seal off the human with the reward button, again, what is this trying to achieve?

And so on, as we add all our intuitive understanding of the spaces of good and bad behaviours in this area.

Depending on how the learning process is designed, we can add these properties either as constraints for the learning process, or as part of the process itself.

 

Unusual actions versus unusual situations

Now, this approach is not certain to work by any means, but as an emergency measure it seems to have potential. But note that it seems much safer as a measure to prevent the AI taking extreme/manipulative/unusual actions than as a way to constrain the AI's behaviour in unusual situations. This approach is intended to prevent the AI from taking over the world, rather than telling it what to do if it does take over. That's mainly because we have a vague idea what not taking over the world looks like (and we can produce many examples of this behaviour), but we're much more stumped as to what running the world well means.


If many reasonable people knew that humanity had six months until superintelligence, as if by fiat (as if the task magically became easier each month for anyone, perhaps because the relevant research ideas are now in many, many people's heads), and this was well understood by nearly everyone, I feel like human coordination issues open up and lots of new possibilities emerge that were foreclosed in the past for lack of a well-understood payoff structure.

Think of it like negotiating with reality, on behalf of the human species, to get the best possible outcome.

If some chunk of reality makes humanity an offer we don't like, it seems reasonable to want to be able to check other parts of reality for other (hopefully better) offers.

Also, if there's only like one or two "chunks" of reality where such an offer can be found, and both the chunks of reality know that humanity has very few other options, they can give humanity pretty mediocre choices and smirk at us as we begin to realize how little mobility we have.

So if you know that you're going to get at least one success in the next six months and think that a crazy six-month crash development program by a team of fewer than 10 people could be the cause of that outcome, then it sorta makes sense to fund maybe 10k programmers to go off to various isolated cabins in the woods.

Each little team gets some computer hardware and a very filtered internet connection that is monitored by the people whose job it is to negotiate with reality in general.

Organize the teams a bit, so they cannot become a unified bloc, but with enough (monitored) internal communication that teams that make a lot of progress in the first month or two (who want to merge and develop a complementary system after that) can somehow find each other and go faster during months 3 and 4...

If promising results are coming from 20 different cabins after 4 months, and some of the teams are asking for way more electricity and GPUs (or whatever) then the people organizing the larger project can hope to have enough options (with enough variety on enough dimensions) that they could maybe slow some teams down, speed others up, and study all the options and generally try to gain some room to maneuver... to negotiate with the 20 incipient godlings, plus the cognitive parents of these godlings, while those entities are all still "babies" with intelligible architectures (because they were created by small human teams operating according to theories intelligible within at least the team) at the same time.

I would be more hopeful for humanity's chances in that kind of development/negotiating context than the context that will probably actually happen.

I'm not fond of your phrasing - humanity negotiating with reality - but the idea you sketch has interesting features.

I am not fond of my phrasing here either. If I had more time I'd have written something shorter and better.

It seems worth calling attention to the importance of causes of urgency. The options humanity has are different if the causes of the urgency are things like "ambient research memes are good enough now" versus "finally some company owns enough data centers".

Take a reinforcement learner AI, that we want to safely move a strawberry onto a plate. A human sits nearby and provides a reward based on inspecting the AI's behaviour.

As it stands, this setup is completely vulnerable to reward hacking. The reward is not provided for safe moving of the strawberry; instead the reward is provided by having the human judge that the task has been accomplished and then pressing a button. Taking control of the human or control of the button is likely to be possible for a superintelligent AI; and, as it stands, that would be mandated by this reward function.

I think this claimed vulnerability is invalid because reward is not the optimization target.

(I know this is an old post and your views may have changed, but I'm posting this comment for anyone who comes by later.)

I like the scenario you presented. 6 months until intelligence explosion changes the entire approach to FAI. More risk is acceptable. More abstract approaches to FAI research seem less useful if they can't lead to tangible algorithms in 6 months.

I think the best strategy would be something like my idea to have AI mimic humans. Then you can task it to do FAI research for you. It could possibly produce years' worth of FAI research papers in a fraction of the time. I don't think we should worry too much about the nuances of actually training an FAI directly.

Your mimic-human idea feels similar to various things I've been playing around with. Incidentally, I've radically simplified the original "mimic humans" idea (see the second Oracle design here https://agentfoundations.org/item?id=884 ). Instead of imitating humans, the AI selects from a list of human-supplied answers. This avoids any need for GANs or similar assessment methods ^_^ "Could a human have given this answer? Well, yes, because a human did."

Selecting from a list of predetermined answers severely limits the AI's ability. Which isn't good if we want it to actually solve very complex problems for us! And that method by itself doesn't make the AI safe; it just makes it much harder for it to do anything at all.

Note someone found a way to simplify my original idea in the comments. Instead of using the somewhat complicated GAN thing, you can just have it try to predict the next letter a human would type. In theory these methods are exactly equivalent.

Instead of using the somewhat complicated GAN thing, you can just have it try to predict the next letter a human would type.

How do you trade that off against giving an actually useful answer?

Same as with the GAN thing. You condition it on producing a correct answer (or whatever the goal is). So if you are building a question-answering AI, you have it model a probability distribution something like P(human types this character | human correctly answers question). This could be done simply by only feeding it examples of correctly answered questions as its training set. Or you could have it predict what a human might respond if they had n days to think about it.
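
To make the "only feed it correct answers" version concrete, here is a toy sketch. It swaps the real predictor for a character-bigram count model, which is far too weak for the actual job but shows the conditioning step: filtering the training set to correct answers is what turns P(next character) into an estimate of P(next character | human answered correctly). The transcripts, the correctness flags and the function names are all invented for the example.

```python
from collections import Counter, defaultdict

START, END = "^", "$"  # sentinel characters marking answer boundaries


def train_char_model(texts):
    """Count, for each character, which character followed it."""
    counts = defaultdict(Counter)
    for text in texts:
        padded = START + text + END
        for prev, nxt in zip(padded, padded[1:]):
            counts[prev][nxt] += 1
    return counts


def next_char_probs(model, prev):
    """Empirical distribution over the next character given the previous one."""
    followers = model.get(prev, Counter())
    total = sum(followers.values())
    return {char: n / total for char, n in followers.items()}


# Hypothetical human answers to "what is 2 + 2?", with a correctness label.
transcripts = [("4", True), ("5", False), ("4", True), ("22", False)]

unconditioned = train_char_model(text for text, _ in transcripts)
conditioned = train_char_model(text for text, ok in transcripts if ok)

first_char_any = next_char_probs(unconditioned, START)
first_char_correct = next_char_probs(conditioned, START)
```

Trained on everything, the model gives "4" only half the probability mass for the first character; trained on the filtered set, it gives "4" all of it.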

Though even that may not be necessary. What I had in mind was just having the AI read MIRI papers and produce new ones just like them. Like a superintelligent version of what people do today with Markov chains or RNNs to produce writing in the style of an author.

Yes, these methods do limit the AI's ability a lot. It can't do anything a human couldn't do, in principle. But it can automate the work of humans and potentially do our job much faster. And if human ability isn't enough to build an FAI, well, you could always set it to do intelligence augmentation research instead.

I see that working. But we still have the problem that if the number of answers is too large, somewhere there is going to be an answer X, such that the most likely behaviour for a human that answers X is to write something dangerous. Now, that's ok if the AI has two clearly defined processes: first find the top answer, independently of how it's written up, then write up as a human. If those goals are mixed, it will go awry.