Related to: Shut up and do the impossible!; Everything about an AI in a box.
One solution to the problem of friendliness is to develop a self-improving, unfriendly AI, put it in a box, and ask it to make a friendly AI for us. This gets around the incredible difficulty of developing a friendly AI, but it creates a new, apparently equally impossible problem: how do you design a box strong enough to hold a superintelligence? Let's suppose, optimistically, that researchers on friendly AI have developed some notion of a certifiably friendly AI: a class of optimization processes whose behavior we can automatically verify will be friendly. Now the problem is designing a box strong enough to hold an unfriendly AI until it modifies itself to be certifiably friendly (of course, it may have to make itself smarter first, and it may need to learn a lot about the world to succeed).
Edit: Many people have correctly pointed out that certifying friendliness is probably incredibly difficult. I personally believe it is likely to be significantly easier than actually finding an FAI, even if current approaches are more likely to find FAI first. But this isn't really the core of the article. I am describing a general technique for quarantining potentially dangerous and extraordinarily sophisticated code, at great expense. In particular, if we developed uFAI before having any notion of certifiable friendliness, then we could still use this technique to try to use the uFAI in a very limited way. It allows us to quarantine an AI and force everything it tries to say to us through an arbitrarily complicated filter. So, for example, if we are interested in solving some problem with a unique and verifiably correct answer, then we can ask the AI for this answer and only see the output if it passes the verification (a concrete sketch follows just after this edit). We can also use a quarantined AI in more useful ways, at the expense of introducing some limited channels of influence (for example, if we ask for the solution to a problem which may have several solutions, we allow the AI to influence us by choosing which solution to provide).
At a minimum, I hope this scheme suggests that putting an AI in a box may not be catastrophic. What to do with a boxed AI is another question. /Edit.
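To make the filter concrete, here is a minimal sketch in Python. All of the names are hypothetical, and factoring a semiprime merely stands in for any problem whose answer is unique and efficiently checkable; the point is only that nobody looks at the AI's output unless it passes the check.

```python
def verify(n, answer):
    """Accept only a genuine nontrivial factorization of n."""
    p, q = answer
    return 1 < p < n and 1 < q < n and p * q == n

def filtered_query(n, boxed_ai):
    answer = boxed_ai(n)   # produced inside the box; no human sees it directly
    if verify(n, answer):
        return answer      # a unique correct answer carries no extra influence
    return None            # otherwise reveal nothing, not even the failed attempt

# Toy stand-ins for the boxed AI:
print(filtered_query(15, lambda n: (3, 5)))   # (3, 5)
print(filtered_query(15, lambda n: (2, 9)))   # None
```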
Concretely, suppose someone comes to you with the design of a recursively self-improving AI; call it Hal. Unfortunately, Hal is not certifiably friendly. Maybe it is friendly but not provably so, but more likely it is indifferent to the human condition and is likely to destroy the universe if you just tell it to make a friendly AI. Nevertheless, you think Hal is smart enough to solve the friendly AI problem for you. All you need to do is give it access to all of the information about the world it wants and a couple of hours to run; after that, it should output a modified version of itself which is certifiably friendly.
You almost run Hal and ask it to design a friendly AI for you. After all, if you just run Hal and check to see if it outputs a certifiably friendly AI, what can it do other than refuse? But after some more thought you realize this is overly optimistic. Maybe you can avoid giving it an output channel that a human will have to read, thereby eliminating the most obvious failure mode of a boxed AI, but you still have to give it access to a computer to run on. Computation produces electromagnetic radiation, heat, etc., and it is honestly hard to say what a superintelligence could do with such a channel.
We seem to be faced with an impossible problem. We need to ask Hal to design us a certifiably friendly AI, but we also need to give him absolutely no ability to interact with the world destructively.
There is one field in particular that has a good history of solving obviously impossible problems: computational cryptography. We want to implement the following obviously impossible device: a cage into which we can place Hal so that he can think and examine all of the data we give him, but from which he cannot otherwise interact with his environment. In the cage with him is a tamper-proof arbiter implementing our friendliness test. Hal may give the arbiter the code of a single AI; if the arbiter confirms that this AI is certifiably friendly, it will kill Hal, replace him with the friendly AI, and then allow the friendly AI to send a single message out of the cage.
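Before worrying about how to build such a cage, here is the behavior we want from it, written in the clear as a hedged Python sketch (hal_step, certify, and execute are all names invented for this illustration):

```python
def run_cage(hal_step, certify, execute, max_steps):
    """Run Hal for a bounded time; at most one message ever leaves the cage,
    and only if Hal submits code that passes the friendliness certification."""
    state, code = None, None
    for _ in range(max_steps):
        state, code = hal_step(state)   # Hal thinks, and may submit AI code
        if code is not None:
            break                       # Hal is destroyed at this point
    if code is None or not certify(code):
        return None                     # nothing whatsoever leaves the cage
    return execute(code)                # the friendly AI's single message
```

The rest of the post is about realizing this specification without trusting any physical barrier.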
To implement this impossible device we co-opt an important cryptographic primitive called public-key homomorphic encryption, typically used for the impossible problem of private delegated computation. To get a basic idea you can see the Wikipedia article on homomorphic encryption; I'll give a brief description here (skip ahead if you are already familiar with homomorphic encryption).
A public-key encryption scheme has the following properties: anyone can encrypt a message using a publicly published key, but you cannot efficiently distinguish the encryptions of two different messages unless you know something about the secret key. A fully homomorphic encryption scheme has the additional counterintuitive property that, although you cannot learn anything about the encrypted data without the secret key, you can still perform arbitrary operations on it. In particular, if you have a homomorphic encryption of the source code of a program, you can run that program without learning anything about it (except how long the source is), but the output of the program will be homomorphically encrypted. All of this can be done using only the public key.
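For intuition, here is a toy scheme in the style of the integer-based construction of van Dijk, Gentry, Halevi, and Vaikuntanathan. It is symmetric-key rather than public-key, and it is emphatically not secure; it only illustrates how one can XOR and AND encrypted bits without knowing the key:

```python
import secrets

# Secret key: a large odd integer (high bit forced so noise stays far below it).
P = secrets.randbits(64) | (1 << 63) | 1

def enc(bit):
    """Hide the bit under small even noise plus a random multiple of P."""
    q = secrets.randbits(32)
    r = secrets.randbits(8)
    return bit + 2 * r + P * q

def dec(c):
    """Strip the multiple of P, then the even noise."""
    return (c % P) % 2

# Homomorphic operations, performed without any knowledge of P:
def xor(c1, c2):
    return c1 + c2   # noise adds, plaintext bits XOR

def and_(c1, c2):
    return c1 * c2   # noise multiplies, plaintext bits AND

a, b = enc(1), enc(0)
assert dec(xor(a, b)) == 1 and dec(and_(a, b)) == 0
```

This only works while the accumulated noise stays below P (a "somewhat homomorphic" scheme); Gentry's bootstrapping trick is what removes that depth limit.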
The existence of a fully homomorphic encryption scheme was completely open until recently, when Craig Gentry showed that such a scheme exists under the assumption that a certain problem on lattices is computationally intractable. Known homomorphic encryption schemes are very inefficient and rest on new assumptions, but we can realistically expect improvements in the coming years and decades, both in speed and in the strength of the necessary assumptions.
If you run a homomorphically encrypted AI, sufficiently isolated from the private key, you get the very strong guarantee that the AI's execution has no computationally detectable effect on the rest of the world. If the AI were able to affect the world in a computationally detectable way, then we would have successfully broken the homomorphic encryption scheme by distinguishing the encrypted execution of the AI from the encrypted execution of some benign program.
Now we need to give the arbiter some means to allow a genuinely friendly AI access to the rest of the world, without giving Hal any illegitimate access. We allow Hal access to an output channel whose outputs are homomorphically encrypted. After some time, we destroy Hal and then homomorphically check to see if he has output the code of a certifiably friendly AI over the channel (we have to decrypt a single homomorphically encrypted bit to perform this check). If he hasn't, then we do nothing; apparently Hal would rather not give us a friendly AI. If he has output the code of a certifiably friendly AI, we homomorphically run it, obtain a homomorphic encryption of its output, and then decrypt only this output.
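Putting the pieces together, the whole protocol looks something like the following sketch, where keygen, enc, dec, and evaluate are invented stand-ins for a homomorphic encryption library's interface, hal is one step of Hal's program, and arbiter is the friendliness check plus the run of the submitted AI:

```python
def boxed_run(hal, arbiter, world_data, steps):
    pk, sk = keygen()                       # sk is stored far away from the box
    state = enc(pk, world_data)             # everything Hal sees is encrypted
    for _ in range(steps):
        state = evaluate(pk, hal, state)    # Hal runs entirely under encryption
    # The arbiter also runs under encryption: it checks whether Hal submitted
    # certifiably friendly code and, if so, runs that code on the final state.
    flag_ct, msg_ct = evaluate(pk, arbiter, state)
    if dec(sk, flag_ct):                    # the single bit we always decrypt
        return dec(sk, msg_ct)              # reveal only the friendly AI's message
    return None                             # otherwise decrypt nothing further
```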
That's the entire scheme. Whenever you have some code whose execution itself might do incredible damage, this precaution allows you to execute it in perfect safety (the main caveat is that you need an automatic way to screen the output that doesn't give the algorithm enough room to do harm).
Note: If you are really exceptionally paranoid, you may be able to obtain even better guarantees by developing more powerful cryptographic techniques tailored to this application. In particular, the secret key used by the encryption procedure may be leaked to a sufficiently powerful AI through a side channel (the computations involving the key typically leak some data in the form of radiation, which might be recovered from sufficiently exhaustive sensor data by a sufficiently powerful AI). By doing these computations on a quantum computer you can probably guarantee that no data was leaked. If you want to be even more fantastically paranoid (i.e., if you want a formal proof of security, which is a rare treat in real problems), you will actually have to shred the key, guaranteeing that an AI would have to interact with the shredded key to recover it. If you do this, you have to use another new primitive to implement the final revelation of the friendly AI's message.
If you have any concerns about the security of this protocol, I would be happy to try to defend it, and would be surprised and interested if I failed. Whether or not such a cryptographic box is really an interesting or important object is another question. (It is interesting to me as an unexpected application of cryptography.)
The probability I assign to achieving a capability state where it is (1) possible to prove a mind Friendly even if it has been constructed by a hostile superintelligence, (2) possible to build a hostile superintelligence, and (3) not possible to build a Friendly AI directly, is very low.
In particular, the sort of proof techniques I currently have in mind - what they prove and what it means - for ensuring Friendliness through a billion self-modifications of something that started out relatively stupid and was built by relatively trusted humans, would not work for verifying the Friendliness of a finished AI handed to you by a hostile superintelligence; it seems to me that the required proof techniques for that would have to be considerably stronger.
To paraphrase Mayor Daley, the proof techniques are there to preserve the Friendly intent of the programmers through the process of constructing the AI and through the AI's self-modification, not to create Friendliness. People hear the word "prove" and assume that this is because you don't trust the programmers, or because you have a psychological need for unobtainable absolute certainty. No, it's because if you don't prove certain things (and have the AI prove certain things before each self-modification) then you can't build a Friendly AI no matter how good your intentions are. The good intentions of the programmers are still necessary, and assumed, beyond the parts that are proved; and doing the proof work doesn't make the whole process absolutely certain, but if you don't strengthen certain parts of the process using logical proof then you are guaranteed to fail. (This failure is knowable to a competent AGI scientist - not with absolute certainty, but with high probability - and therefore it is something of which a number of would-be dabblers in AGI maintain a careful ignorance regardless of how you try to explain it to them, because the techniques that make them enthusiastic don't support that sort of proof. "It is hard to explain something to someone whose job depends on not understanding it.")
Does this still hold if you remove the word "hostile", i.e., if the "friendliness" of the superintelligence you construct first is simply not known?