A putative new idea for AI control; index here.

The first step to solving a problem is to define it. The first first step is to realise how tricky it is to define. This is a stub on a difficult problem.

Subagents and turning AIs off

One of the hardest things about designing safe AIs is turning them off. Installing an OFF button is trivial; motivating them to let you push it can be done. No, the real problem is the subagent problem. The AI can create one or many subagents to carry out its task; once that's done, turning off the original AI is of little value: we have to control or extinguish the subagents.

A general problem

Subagents are a problem for many other control ideas. For example, if you want to program 'achieve X using only this many computer cycles', then 'create a subagent that wants to achieve X without the computer cycle restrictions' is one way of doing it (as long as that can be done within the restrictions). The same goes for energy use and other restrictions. Satisficers can satisfice by creating a maximiser. A boxed Oracle can remain in the box while cheerfully plotting to create an unboxed copy of itself. In fact, whatever the AI's motivation, "create subagents with the same motivation" is almost always a good idea. And, once created, there is no motivation to want to shut them down. Even if the AI is motivated to shut itself down, there is no reason for it to extend this to subagents.
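The computer-cycle example can be made concrete with a toy model. This is only an illustrative sketch, not a model of any real system: the `Agent` class, its `budget` field and the `spawn_unrestricted` method are all hypothetical names, and the assumption is that the restriction is checked only against the cycles the restricted agent spends itself.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Agent:
    """Toy agent with an optional cap on the cycles it may spend itself (hypothetical)."""
    name: str
    budget: Optional[int]  # None means no restriction
    used: int = 0
    children: List["Agent"] = field(default_factory=list)

    def spend(self, cycles: int) -> None:
        # The restriction is enforced only on this agent's own usage.
        if self.budget is not None and self.used + cycles > self.budget:
            raise RuntimeError(f"{self.name} exceeded its cycle budget")
        self.used += cycles

    def spawn_unrestricted(self, cost: int) -> "Agent":
        # Creating the subagent is charged to the parent's budget...
        self.spend(cost)
        # ...but the restriction itself is not inherited by the subagent.
        child = Agent(name=f"{self.name}-sub", budget=None)
        self.children.append(child)
        return child


def total_cycles(agent: Agent) -> int:
    """Cycles spent by an agent and all of its descendants."""
    return agent.used + sum(total_cycles(c) for c in agent.children)


parent = Agent("parent", budget=100)
sub = parent.spawn_unrestricted(cost=10)  # fits within the parent's restriction
sub.spend(1_000_000)                      # the subagent pursues X with no cap

print(parent.used, total_cycles(parent))
```

The parent never violates its restriction as written, yet the total resources spent on achieving X are unbounded. Closing the loophole would require charging the subagent's usage to the parent, which in turn requires being able to say what counts as a subagent.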

Pre-corriged agents would at least be motivated to keep control of their subagents (unlike corrigible agents, who have no current desire to create subagents that would obey their future preferences).

What is a subagent?

It's also almost impossible to define a subagent. Is a corporation a subagent? Is a collection of a thousand pieces of code, residing on different computers and occasionally communicating, a subagent? Is an employee or a seduced human a subagent? A pile of rocks with water running through could be a piece of conceptual art - or a slow computer. How about a virulent meme among the human population, making us all into its subagents?

The best idea I've found for defining subagents is that they are reactive intelligences with planning abilities, so random natural events that should have a large influence on the future (say, a meteor heading towards Earth) would instead be diverted to have little impact. However, if this is used as a definition, then the subagents would simply be programmed not to behave "reactively" in whatever way the definition specifies. A slightly more subtle problem is that, since we don't have a definition of agent or subagent, we can't easily protect against holes such as "there is an alien agent on that asteroid, which will optimise the solar system iff it detects no subagent on Earth". Such holes could completely reverse the definition.

No subagents versus no influential subagents

Without the ability to properly define subagents, we can't restrict them directly. Even if you had an AI that was properly motivated to, say, do some action then turn itself off, there is no reason for it to want to do the same with its subagents.

Most of the methods of reduced impact can help with problematic subagents. They do not preclude the creation of subagents, nor mandate their destruction, but they do prevent them from having much influence. This is because reduced impact methods are aimed at preventing the AI itself from having much influence, either directly or via subagent creation. Even the attempt in this post to detect subagents directly more properly detects influential decisions (of which the creation of a subagent could be one).

This is certainly a positive, but it may not be robust. If the definition of "not having much influence" turns out to have holes, it would be less than ideal to have any remaining subagents hanging about.

The "make sure that you cannot be detected in the future" approach is probably the most promising for controlling subagents. The AI there is motivated to erase all traces of its existence, and any remaining subagents would presumably count as such traces.

In all, it's a very tricky problem, and the core failure of many ideas for AI control.

Comments

This doesn't really solve your problem, but it's interesting that humans are also trying to create subagents. The whole AI problem is us trying to create subagents. It turns out that that is very, very hard. And if we want to solve FAI, making subagents that actually follow our utility function, that's even harder.

So humans are an existence proof for minds which are very powerful, but unable to make subagents. Controlling true superintelligences is a totally different issue of course. But maybe in some cases we can restrict them from being superintelligent?

We'd be considerably better at subagent creation if we could copy our brains and modify them at will...

Well, it's not impossible to restrict the AIs from accessing their own source code, especially if they are implemented in specialized hardware like we are.

It's not impossible, no. But it's another failure point. And the AI might deduce stuff about itself by watching how it's run. And a world that has built an AI is a world where there will be lots of tools for building AIs around...

It helps to realize that "agent" and "actor" are roles, not identities. The (sub)agents you're discussing are actors to themselves: they have beliefs and desires, just like the entity who's trying to control them.

We're all agents, and we comprise subagents in the form of different mental modules and decision processes.

Well one thing to keep in mind is that non-superintelligent subagents are a lot less dangerous without their controller.

Why would they be non-superintelligent? And why would they need a controller? If the AI is under some sort of restriction, the most effective idea for it would be to create a superintelligent being with the same motives as itself, but without restrictions.

Well, banning the creation of other superintelligence seems easier than banning the creation of any subagents.

How? (and are you talking in terms of motivational restrictions, which I don't see at all, or physical restrictions, which seem more probable)

Just program the AGI to always ask for permission before it does something new?