A putative new idea for AI control; index here.

An idea for how we might successfully get useful work out of a powerful AI.

 

The ultimate box

Assume that we have an extremely detailed model of a sealed room, with a human in it and enough food, drink, air, entertainment, energy, etc... for the human to survive for a month. We have some medical equipment in the room - maybe a programmable set of surgical tools, some equipment for mixing chemicals, a loud-speaker for communication, and anything else we think might be necessary. All these objects are specified within the model.

We also have some defined input channels into this abstract room, and output channels from this room.

The AI's preferences will be defined entirely with respect to what happens in this abstract room. In a sense, this is the ultimate AI box: instead of taking a physical box and attempting to cut it out from the rest of the universe via hardware or motivational restrictions, we define an abstract box where there is no "rest of the universe" at all.
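To make the structure concrete, here is a minimal toy sketch in Python (the class, the dynamics and the "treat"/"harm" commands are all made-up illustrations, not a claim about how the real model would be built): a fully specified model, defined input and output channels, and a utility function that only ever looks at the model's history.

```python
# Toy sketch only: the AI's entire "world" is an explicitly specified model,
# with defined input/output channels, and its preferences are a function of
# the model's history alone. All names and dynamics here are placeholders.

class RoomModel:
    """Fully specified abstract room: state, input channel, output channel."""

    def __init__(self):
        # Toy state: the modelled human's survival and a tumour size.
        self.state = {"tumour": 5.0, "alive": True}
        self.history = [dict(self.state)]

    def step(self, command):
        """Apply one command from the input channel; return the output channel."""
        if command == "treat":
            self.state["tumour"] = max(0.0, self.state["tumour"] - 1.0)
        elif command == "harm":
            self.state["alive"] = False
        self.history.append(dict(self.state))
        return dict(self.state)          # observation sent on the output channel


def utility(history):
    """Preferences defined ONLY over the model's history: 'alive and cancer-free'."""
    final = history[-1]
    return 1.0 if final["alive"] and final["tumour"] == 0.0 else 0.0
```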

 

Cure cancer! Now! And again!

What can we do with such a setup? Well, one thing we could do is to define the human in such a way that they have some form of advanced cancer. We define what "alive and not having cancer" counts as, as well as we can (the definition need not be fully rigorous). Then the AI is motivated to output some series of commands to the abstract room that results in the abstract human inside not having cancer. And, as a secondary part of its goal, it outputs the results of its process.

Initially, the AI is likely to simply kill the human, using some hole in the definition of "alive and not having cancer" that we haven't seen. We can then tighten up the definition, and possibly give the AI more tools inside the abstract room. After a few iterations of this, the AI is likely to be able to "cure cancer" in a way that meets our approximate requirements.
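Schematically, the loop looks something like this (a sketch only, reusing the toy RoomModel above; find_best_plan, approve and the list of predicates are stand-ins for the AI's optimiser, our offline judgement, and our successive attempts at defining "alive and not having cancer"):

```python
# Hypothetical sketch of the iterate-and-tighten loop described above.

def refine_until_acceptable(make_room, find_best_plan, predicates, approve):
    """
    make_room:      constructs a fresh RoomModel (reset between runs)
    find_best_plan: the AI's planner; returns a command sequence maximising
                    the given predicate over the room's history
    predicates:     increasingly strict versions of 'alive and not having cancer'
    approve:        offline human judgement on the recorded history
    """
    for predicate in predicates:
        room = make_room()
        plan = find_best_plan(room, predicate)   # the AI optimises against the model only
        replay = make_room()
        for command in plan:                     # re-run the plan so we can inspect it
            replay.step(command)
        if approve(replay.history):              # analyse the process used, and its outcome
            return plan, replay.history
        # else: the AI exploited a hole in the predicate; try the tighter one next
    return None, None
```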

And then, it suffices for us to analyse the process used, and its outcome, to establish how the AI went about its task. If the room was defined in sufficient detail, it is likely that the process can be adapted to curing cancer in the real world. If not, we can vary the input - defining agents with different types of cancer, comparing the AI's approaches, maybe requiring the AI's solution to work across the different situations. Thus it seems that we have extracted a cure for cancer from a superintelligent AI.

 

Why it works, if it works

This idea is in some sense a reversal of Paul Christiano's idea. Instead of having an AI whose motivation or decisions are constrained by some abstract human, the AI's decisions are restricted to having an impact in some abstract world. In another sense, the idea is an inverse of AIXI's "Cartesian boundary": AIXI operates on the real world while pretending that it exists in an abstract world, whereas here the AI exists in the real world while operating on an abstract world.

The important thing is to ensure that the AI does not actually have any preferences over the real world (including over future versions of itself, running on the same or similar problems). This can be achieved via high discount rates or resetting, appropriate indifference, variants of acausal trade barriers (to make it indifferent to achievements of future versions of itself), reduced impact, or more simply through rigorous programming of the problem definition.
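As a sketch of just the "resetting" option (the other options would look quite different), each query could be run as an isolated episode with nothing persisting afterwards; the names here are hypothetical:

```python
# Toy sketch of "resetting": the optimiser is constructed fresh for each
# episode, handed only the abstract model, and discarded afterwards, so it
# has no state or stake carrying over to later runs or to the real world.

def run_episode(make_room, make_planner, predicate):
    room = make_room()                 # fresh copy of the abstract world
    planner = make_planner()           # fresh optimiser: no memory of past episodes
    plan = planner.solve(room, predicate)
    del planner                        # reset: nothing persists to influence future runs
    return plan
```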

Thus it seems that the AI would not be motivated to include tricks within its output, things that could trap or seduce us outsiders if we were to peer too deeply into the outcome of the AI's interaction with the room. There remains a risk that the AI might create a general intelligence within the room itself. However, the risk should be low, as the general intelligence should still be motivated to have impacts only within the room. And we can ask the AI if there is a general intelligence (specified according to some imperfect criteria) in the room's history - and then change elements inside the room until there is no such intelligence. Note that it's important that we ask the AI, rather than specifying "have no general intelligence in the room". For the AI can hack around our imperfect specification, but has no reason to do so if we only ask.

To prevent any social hacking of the virtual human that might accidentally rebound on us, it would be best if the virtual human was obedient and dumb. Obedient so the AI didn't need to do any social hacking, and dumb so that it couldn't use the human for detailed complicated tasks (such as creating a general intelligence in the room).

These thoughts give a hint as to how we want the AI to interact with the room. Unrestricted access to the output and input is one possibility. Creating a subagent of low capacity (defined by hard drive space and processing speed, for instance) to interact with the room would be safer. Note that "hard drive space" and "processing speed" can be defined rigorously by defining them in (an extension to) the virtual room, and having the subagent run there. The extension need not run according to the laws of physics; it could simply be an abstract model of computation. In fact, the more of the AI we can move into the room, where operations can be rigorously defined, the better. Ultimately, we might be able to put the whole AI within the (extended) room.
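A toy sketch of what such a bounded subagent might look like when its "hard drive space" and "processing speed" are defined inside the extended model rather than on physical hardware (the class, the budgets and the room interface are illustrative assumptions):

```python
# Hypothetical sketch: the subagent is executed under bounds that are part of
# the (extended) room model itself, so its compute and storage are defined by
# the model, not by real-world hardware.

class BoundedSubagent:
    def __init__(self, policy, max_steps=10_000, max_memory_cells=1_000):
        self.policy = policy                    # subagent's decision function
        self.max_steps = max_steps              # "processing speed" budget per run
        self.max_memory_cells = max_memory_cells
        self.memory = {}                        # "hard drive": bounded key/value store

    def act(self, observation):
        if len(self.memory) > self.max_memory_cells:
            raise RuntimeError("subagent exceeded its modelled storage budget")
        return self.policy(observation, self.memory)

    def run(self, room):
        observation = room.state                # initial observation from the model
        for _ in range(self.max_steps):         # hard cap on modelled compute
            observation = room.step(self.act(observation))
        return room.history
```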

The key features that would make this idea work are that (a) AI over-optimisation is constrained to the virtual world, (b) it is safely detectable, and (c) the environment can be modified perfectly in any way we want, and the experiment rerun. By having such perfect control over the virtual world, we aim to perfectly control what the features of a correct solution are.

 

As the model gets worse...

To get the above, we've assumed a quasi-perfect model of the human and the room. At the very least, this would require whole-brain emulations. An interesting question is what happens as the quality of the model degrades.

One could imagine that, if the AI produced and ran a limited subagent, it would be possible to run that subagent in an environment approximately equivalent to the ideal modelled room (stopping the experiment if anything seemed to get dangerous). Or maybe we could get the AI to return not the ideal strategy, but a broad category of strategies within which the ideal strategy could be found.

But in general, I'd say the issue of running it in an inferior model is unsolved.

Of course, for some problems (those that are more like abstract mathematical models), we might be able to fully specify the model, even today. This is roughly equivalent to various ideas that have already been proposed about running AIs in virtual worlds/running proof finding AIs with proof checkers.

 

Evolution, search, overfitting

One of the greatest advantages of this approach is that it allows for human testing, error correction, and subsequent trials. This plays to humanity's strengths (adaptation and understanding of things after the fact) rather than to its weaknesses (getting everything perfect ahead of time).

But this opens up some weaknesses. If we do too much testing, changing, etc... we are opening the whole setup to an optimisation process that acts in unpredictable ways. This optimisation process is, broadly speaking, human approval of AI actions - and that is not a safe optimisation process to follow. Similar issues arise if we allow too much search, or if we evolve the AI using some fitness function appropriate to our approval. Large-scale searching and optimisation are still not safe.

Nor is it safe to use these virtual worlds to test the AI, check that it "behaves", and then let it have real-world impacts. The real world and our stripped-down and controlled virtual world are very different, and we cannot extrapolate from one to the other.

However, we could use the virtual world as a test bed for AI failure modes; the various ways AIs could misbehave might suggest things about controlling AIs in the real world. Though beware that the more similar we make the virtual world to the real world, the higher the probability that the AI could come up with something dangerous in the virtual world that might also be dangerous to have in the real world - or even dangerous to observe from the real world.

With all these caveats, it does seem that there are many uses to which we could put this kind of approach, apart from curing diseases. Manufacturing, virtual experiments, and similar are immediate ideas. I'm sure others will easily suggest themselves.

Comments

If the excellent simulation of a human with cancer is conscious, you've created a very good torture chamber, complete with mad vivisectionist AI.

I'm unsettled by the tags he gave the article. You could say the person with cancer was just an example, and we could make them brain dead, etc. But the article has the tags "emulation", "upload", "whole_brain_emulation", and "wbe".

It's very disturbing that anyone would even consider feeding a simulated human to an unfriendly AI. Let alone in this horrifying torture chamber scenario.

I have to be honest: I hadn't considered that angle yet (I tend to create ideas first, then hone them and remove issues).

The first point is that this was just an example, the first one to occur to me, and we can certainly find safer examples or improve this one.

The second is that torture is very unlikely - death, maybe painful death, but not deliberate torture.

The third is that I know some people who might be willing to go through with this, if it cured cancer throughout the world.

But I will have to be more careful about these issues in future, thanks.

I admit I was using the word 'torture' rather loosely. However, unless the AI is explicitly instructed to use anesthesia before any cutting is done, I think we can safely replace it with "extended periods of very intense pain".

As a first pass at a way of safely boxing an AI, though, it's not bad at all. Please continue to develop the idea.

The AI figures out that the only way to truly cure its subjects' cancers is to take over that meddlesome "real world" to stop people from giving its virtual subjects cancers.

That is, the fact that the real world is tampering with the virtual world gives the real world a relevant connection to that virtual world, and thus a relevant connection to an AI whose motivations are based solely on the virtual world.

That's why you want to specify the AI's motivation entirely in terms of the model, and reset it from one episode to the next.

As long as reality has a causal influence on the model, however, the AI's motivation includes reality.

Imagine your AI is playing an FPS, and its goal is to win the game. We have a motivation which exists entirely in terms of a virtual model. Yet if it realizes it can, its motivation includes getting the players' IP addresses, researching their real-world identities, entering theoretically secure systems as necessary, and using psychological warfare over the in-game chat system ("Hey Samuel Smith, did you know your aunt Mildred Smith has pictures of you sitting naked in the bowl of a toilet as a child on her hard drive?") to gain a major advantage. Maybe it goes so far as to SWAT opposing enemy teams.

This kind of box only requires the AI to, well, think outside the box, for its solutions.

This depends entirely on how the AI is set up. The AI is defined as only caring about the abstract environment. Not just that it doesn't care about the real world, but that it literally doesn't believe in the real world. Not like some video game we set up and plug the AI into: the abstract room exists entirely in the AI's head.

Normally we give an AI real data, and it tries to infer a world model. Then it tries to take actions that maximize its chance of achieving its goal, according to its world model.

In this case we are giving the AI its world model. The model is set in stone. The AI acts as if the model is true, and doesn't care about anything that is not contained in the model.

The AI is purely a planning agent: one which is given a well-defined model and tries to find a series of actions that achieve some goal. You could just as easily give such an AI tasks like finding the factors of a large semiprime, or finding the shortest path through a maze, or proving a theorem, etc. The simulated room is just a really really big formal math problem.

Now, whether you can actually make an AI which works like this, and doesn't try to infer its own world model at all, is another problem. But there is quite a large body of work on planning algorithms, which do exactly this. Godel machines also behave this way: you give them a formal specification of their "world", and they take actions that are provably optimal according to the world model.
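For illustration, here is a minimal planner in that spirit: a fixed, fully specified transition model, searched breadth-first for a goal-satisfying plan, with no model updating anywhere. The toy problem at the bottom is arbitrary.

```python
# Toy sketch of a pure planning agent: the world model (transition function)
# is given and never revised; the agent only searches for a plan.

from collections import deque

def plan(initial_state, transition, goal, actions, max_depth=8):
    """Return a sequence of actions reaching a goal state, or None."""
    frontier = deque([(initial_state, [])])
    seen = {initial_state}
    while frontier:
        state, path = frontier.popleft()
        if goal(state):
            return path
        if len(path) >= max_depth:
            continue
        for action in actions:
            nxt = transition(state, action)     # the fixed, given world model
            if nxt not in seen:
                seen.add(nxt)
                frontier.append((nxt, path + [action]))
    return None

# Arbitrary toy usage: states are integers, actions add 1, 2 or 3, goal is 10.
if __name__ == "__main__":
    print(plan(0, lambda s, a: s + a, lambda s: s == 10, actions=(1, 2, 3)))
```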

An AI that can't update an inaccurate world model isn't a very good AI.

I could write code to simulate the bloodflow and whatnot of the body, with a cancerous growth somewhere. I could write code to simulate a scalpel slicing something open, and the first simulation would react accordingly. And if I plugged that into a Godel machine, with the first being a model, and the second being a mathematical operation, I'd get an amazingly good surgical machine.

But that's not what I was looking for. I was looking for something innovative. I was looking for the AI to surprise me.

Then you give the AI more options than just surgery. It has an entire simulated room of tools to work with.

And you have an amazingly good tool-user, that still doesn't innovate or surprise you.

That's not entirely true. It might surprise us by, say, showing us the precise way to use an endoscopic cauterizer to cut off blood flow to a tumor without any collateral damage. But it can't, by definition, invent a new tool entirely.

I'm not sure the solution to the AI friendliness problem is "Creating AI that is too narrow-minded to be dangerous". You throw out most of what is intended to be achieved by AI in the first place, and achieve little more than evolutionary algorithms are already capable of. (If you're capable of modeling the problem to that extent, you can just toss it, along with the toolset, into an evolutionary algorithm and get something pretty close to just as good.)

But it can't, by definition, invent a new tool entirely.

Why not? The AI can do anything that is allowed by the laws of physics, and maybe a bit more if we let it. It could invent a molecule which acts as a drug which kills the cancer. It could use the tools in the room to build different tools. It could give us plans for tiny nanobots which enter the bloodstream and target cancer cells. Etc.

Just because an environment is well defined, does not mean you can't invent anything new.

The AI can do anything that is allowed by the laws of physics

No, it can do anything that its worldview includes, and any operations defined internally. You're no longer talking about a Godel Machine, and you've lost all your safety constraints.

You can give, in theory of course, a formal description of the laws of physics. Then you can ask it to produce a plan or machine which fulfills any constraints you ask. You don't need to worry about it escaping from the box. Now, its solution might be terrible without tons of constraints, but it's at least not optimized to escape from the box or to trick you.

it can't, by definition, invent a new tool entirely.

Can humans "invent a new tool entirely", when all we have to work with are a handful of pre-defined quarks, leptons and bosons? AIXI is hard-coded to just use one tool, a Turing Machine; yet the open-endedness of that tool make it infinitely inventive.

We can easily put a machine shop, or any other manufacturing capabilities, into the abstract room. We could ignore the tedious business of manufacturing and just include a Star-Trek-style replicator, which allows the AI to use anything for which it can provide blueprints.

Also, we can easily be surprised by actions taken in the room. For example, we might simulate the room according to known scientific laws, and have it automatically suspend if anything strays too far into uncertain territory. We can then either abort the simulation, if something dangerous or undesirable is happening within, or else perform an experiment to see what would happen in that situation, then feed the result back in and resume. That would be a good way to implement an artificial scientist. Similar ideas are explored in http://lambda-the-ultimate.org/node/4392
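A rough sketch of that suspend-and-resume loop, assuming a room object with the same toy interface as earlier (state, step, history); the confidence and resolution functions stand in for whatever the simulator and the experimenters would actually provide:

```python
# Hypothetical sketch: simulate the room step by step, and pause whenever the
# simulator's confidence in its own physics drops below a threshold, so a
# human can abort or feed a measured result back in and resume.

def run_with_suspension(room, commands, confidence, resolve_experimentally,
                        threshold=0.99):
    """
    confidence(state, command):             simulator's estimated reliability for this step
    resolve_experimentally(state, command): human-mediated outcome, or None to abort
    """
    for command in commands:
        if confidence(room.state, command) < threshold:
            result = resolve_experimentally(room.state, command)   # suspend here
            if result is None:
                return "aborted", room.history
            room.state = dict(result)        # feed the measured outcome back in and resume
            room.history.append(dict(result))
        else:
            room.step(command)
    return "completed", room.history
```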

Your response ignores the constraints this line of conversation has already engendered. I'm happy to reply to you, but your response doesn't have anything to do with the conversation that has already taken place.

Let's suppose the constraint on the AI being unable to update its world model applies. How can it use a tool it has just invented? It can't update its world model to include that tool.

Supposing it -can- update its world model, but only in reference to new tools it has developed: How do you prevent it from inventing a tool like the psychological manipulation of the experimenters running the simulation?

There are many models; the model of the box which we simulate and the AI's models of the model of the box. For this ultimate box to work there would have to be a proof that every possible model the AI could form contains at most a representation of the ultimate box model. This seems at least as hard as any of the AI boxing methods, if not harder because it requires the AI to be absolutely blinded to its own reasoning process despite having a human subject to learn about naturalized induction/embodiment from.

It's tempting to say that we could "define the AI's preferences only over the model", but that implies a static AI model of the box-model that can't benefit from learning, or else a proof that all AI models are restricted as above. In short, it's perfectly fine to run a SAT-solver over possible permutations of the ultimate box model trying to maximize some utility function, but that's not self-improving AI.