a certifiably friendly AI: a class of optimization processes whose behavior we can automatically verify will be friendly
The probability I assign to achieving a capability state where it is (1) possible to prove a mind Friendly even if it has been constructed by a hostile superintelligence, (2) possible to build a hostile superintelligence, and (3) not possible to build a Friendly AI directly, is very low.
In particular the sort of proof techniques I currently have in mind - what they prove and what it means - for ensuring Friendliness through a billion self-modifications of something that started out relatively stupid and built by relatively trusted humans, would not work for verifying Friendliness of a finished AI that was handed you by a hostile superintelligence, and it seems to me that the required proof techniques for that would have to be considerably stronger.
To paraphrase Mayor Daley, the proof techniques are there to preserve the Friendly intent of the programmers through the process of constructing the AI and through the AI's self-modification, not to create Friendliness. People hear the word "prove" and assume that this is because you don't trust the progra...
The probability I assign to being able to build a friendly AI directly before being able to build a hostile AI is very low. You have thought more about the problem, but I'm not really convinced. I suppose we could both be right simultaneously, and then we are in trouble.
I will say that I think you underestimate how powerful allowing a superintelligence to write a proof for you is. The question is not really whether you have proof techniques to verify friendliness. It is whether you have a formal language expressive enough to describe friendliness in which a transhuman can find a proof. Maybe that is just as hard as the original problem, because even formally articulating friendliness is incredibly difficult.
the sort of proof techniques I currently have in mind ... would not work for verifying Friendliness of a finished AI that was handed you by a hostile superintelligence.
But what if the hostile superintelligence handed you a finished AI together with a purported proof of its Friendliness? Would you have enough trust in the soundness of your proof system to check the purported proof and act on the results of that check?
The probability I assign to achieving a capability state where it is (1) possible to prove a mind Friendly even if it has been constructed by a hostile superintelligence, (2) possible to build a hostile superintelligence, and (3) not possible to build a Friendly AI directly, is very low.
A general theory of quarantines would nevertheless be useful.
I disagree. The weak point of the scheme is the friendliness test, not the quarantine. If I prove the quarantine scheme will work, then it will work unless my computational assumptions are incorrect. If I prove it will work without assumptions, it will work without assumptions.
If you think that an AI can manipulate our moral values without ever getting to say anything to us, then that is a different story. This danger occurs even before putting an AI in a box though, and in fact even before the design of AI becomes possible. This scheme does nothing to exacerbate that danger.
Let's consider a somewhat similar case.
You are an inventor. An evil dictator captures you, and takes you off to a faraway dungeon, where he tells you that he wants you to build him a superweapon. If you refuse to build the weapon, well, he has means of persuading you. If you still refuse, he will kill you.
Of course, when you finish building the doomsday device, your usefulness will be over, and he will probably kill you anyway.
Being a genius, you soon realise that you could build the dictator a doomsday device in about a week, with the materials lying around in your well-equipped dungeon workshop. However, this doesn't seem terribly optimal.
What do you do? You agree to build the doomsday device. But you claim it is very difficult. You spend a long time producing impressive prototypes that seem to suggest you're making progress. You claim that you need more resources. You carry on, making just enough apparent progress to stop him from killing you on the spot, but not enough to actually succeed. You build ever more expensive and complex non-working prototypes, hoping that in the course of this, you can either build yourself something suitable for breaking out of the dungeon or killing the evil dictator. Or hoping that perhaps someone will rescue you. At the very least you will have wasted the evil dictator's resources on doing something pointless - you will at least not die for nothing.
I suspect your imprisoned AI may choose to follow a similar course.
Or you build yourself a superweapon that you use to escape, and then go on to shut down your company's weapons division and spend your spare time being a superhero and romancing your assistant and fighting a pitched battle with a disloyal employee.
This reply and its parent comment constitute the "Iron Man Argument" against any kind of "put the AI in a box" approach to AGI and friendliness concerns. I predict it will be extremely effective.
So we posit that (P1) I have the source code for a superhuman non-provably-Friendly AI (call it Al) that I can run on my ubersecure Box.
Suppose I have high confidence that:
(P2) Al is willing to harm humanlike intelligences to achieve its goals.
(P3) Al can create humanlike intelligences that also run on Box.
(P4) Al can predict that I don't want humanlike intelligences harmed or killed.
I'm not even positing that P2-P4 are true, though they do seem to me pretty likely given P1, merely that I have high confidence in them, which seems almost guarantee...
Efficiency update (http://www.americanscientist.org/issues/id.15906,y.2012,no.5,content.true,page.1,css.print/issue.aspx):
...Many homomorphic schemes exact a high price for security. During encryption, data undergo a kind of cosmic inflation: A single bit of plaintext may blow up to become thousands or even millions of bits of ciphertext. The encryption key can also become huge—from megabytes to gigabytes. Merely transmitting such bulky items would be costly; computing with the inflated ciphertext makes matters worse. Whereas adding or multiplying a few bits
...so instead of emoting, why don't you tell us what the state of the art is? Are we up to ENIAC yet? Have we reached IBM System/360? Can we now run Apple IIs or wristwatches under homomorphic encryption?
I don't think you understand what a fully homomorphic encryption system is.
In particular, if you have a homomorphic encryption of the source code of a program you can run that program without learning anything about it (except how long the source is), but the output of the program will be homomorphically encrypted.
Barak et al. proved that you cannot encrypt source code and have someone else run it without decrypting it. Gentry's results do not contradict that proof. A fully homomorphic encryption system allows one to encrypt data and have that data oper...
I am sorry that my writing is unclear, but I think you should be a little more charitable.
In particular, note that I wrote explicitly "the output of the program will be homomorphically encrypted." This is exactly the difference between program obfuscation and homomorphic encryption. A homomorphic encryption scheme allows you to do exactly what I just did with it. In this case, you as the user still have access to the private key which can be used to decrypt the homomorphically encrypted outputs, once the AI has been safely dismantled.
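To make the distinction concrete at the level of interfaces: an obfuscated program would let whoever runs it feed in plaintext and read plaintext behaviour, whereas homomorphic evaluation applies a public interpreter to an encrypted program and encrypted input and hands back only ciphertext. The sketch below is purely illustrative; `fhe.apply` and the other names are hypothetical placeholders rather than any real API.

```python
# Interface-level sketch of the distinction discussed above. All names here
# (fhe, pk, interpreter, enc_program, ...) are hypothetical placeholders.
from typing import Any, Callable

def run_obfuscated(obf_program: Callable[[bytes], bytes], x: bytes) -> bytes:
    # Program obfuscation: the runner supplies plaintext input and reads
    # plaintext output. Barak et al. showed that general-purpose
    # "virtual black box" obfuscation of this kind is impossible.
    return obf_program(x)

def run_homomorphic(fhe: Any, pk: Any,
                    interpreter: Callable[[bytes, bytes], bytes],
                    enc_program: Any, enc_input: Any) -> Any:
    # Homomorphic evaluation: the runner applies a *public* interpreter to an
    # encrypted program and encrypted input, and receives ciphertext that only
    # the holder of the secret key can read. Nothing is decrypted here.
    return fhe.apply(pk, interpreter, enc_program, enc_input)
```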
The source code verifier for friendliness is basically the weakest possible thing a friendliness researcher could hope for. It doesn't verify all source code: it verifies source code of an arbitrarily special form. If you can ever prove that any AI is friendly, then you have to implement something at least this strong. I believe that the possibility of putting an AI in a box has been discussed at reasonable length here as a solution to precisely the problem I am considering. I think that trying to reduce the demands on friendliness to a minimum, and motivating the cryptographic tools we will need for that reduction, is worthwhile. After all...
I found the discussion of homomorphic encryption interesting, but
One solution to the problem of friendliness is to develop a self-improving, unfriendly AI, put it in a box, and ask it to make a friendly AI for us. This gets around the incredible difficulty of friendliness, but it creates a new, apparently equally impossible problem.
If you can reliably build an AI, but you cannot reliably build friendliness in it, why should I trust that you can build a program which in turn can reliably verify friendliness? It seems to me that if you are unable to bui...
Let me see if I understand. Firstly, is there any particular reason the thing you're trying to produce has to be a friendly AI? Would, for instance, getting an unknown AI to solve a specific numerical problem with an objectively checkable answer be an equally relevant example, without the distraction of whether we would ever trust the so-called friendly AI?[1]
If that is the case, then as I understand the proposal, we take the code for the unknown AI and the input data, use this encryption scheme on it, and then have a program which performs various operations on them.
As I unde...
Let me see if I understand. Firstly, is there any particular reason the thing you're trying to produce has to be a friendly AI? Would, for instance, getting an unknown AI to solve a specific numerical problem with an objectively checkable answer be an equally relevant example, without the distraction of whether we would ever trust the so-called friendly AI?
I think Less Wrong needs a variant of Godwin's Law. Any post whose content would be just as meaningful and accessible without mentioning Friendly AI, shouldn't mention Friendly AI.
One solution to the problem of friendliness is to develop a self-improving, unfriendly AI, put it in a box, and ask it to make a friendly AI for us. This gets around the incredible difficulty of friendliness, but it creates a new, apparently equally impossible problem. How do you design a box strong enough to hold a superintelligence?
Two problems. (You still need to verify friendliness.)
As I understand it, homomorphic encryption allows you to write programs that perform algebraic transformations on encrypted data without decrypting it in the process. What you’re looking for is a way to execute the encrypted data as a general-purpose program without decrypting it, a much more difficult task that doesn't seem to be mentioned in the linked material. Is there any reason to think that such a thing is possible?
It seems to me that simply breaking the homomorphic encryption would be the most likely outcome here. Note that we're relying on a human-created code to provide security against a superintelligence, and breaking code systems (or at least determining routes of attack that provide a method that has a certain chance of breaking the code system within a certain number of cycles, et cetera, et cetera) is something we do fairly routinely. I think it's far more likely that the AI will simply deduce the method by which we have confined it and break the code. If we're positing an AI that could destroy the universe through the heat output of the processor it's running on, the only real answer is "don't run it".
(Nothing here should be taken to imply that sandboxing is the right way to deal with uFAI)
The encryption step seems relatively pointless. It's useful against an external attacker, but if you are the 'attacker' and you don't want to attack, then just don't.
Instead set up a normal sandbox that wipes its memory when it's done.
If anything the encryption step obfuscates how the sandbox works and leaves more room for implementation errors.
Cryptography only gives a false sense of security if it is based on false assumptions. The barriers don't just look confusing to me---I have a proof that they look confusing to any system that can exist in the physical universe. Maybe the proof is wrong. Maybe the laws of physics don't look anything like physicists expect (maybe there is a super special law which just breaks my cryptographic scheme), or maybe I incorrectly assumed a problem was outside BQP, or maybe I was too optimistic about the constants in the best attack. But notice that all of these are opportunities for me to be wrong. There is no opportunity for an AI to outthink me if I am right. That is the beauty of cryptography.
Way in over my head here, but if the project is done in a series of steps with a series of boxed AIs helping to work on the next step, wouldn't it be better to get something that is at the very least(*)…
…at a much earlier step than either…
The extreme computational overhead of this sort of scheme limits its applicability. However, it might be a useful technique for an already-friendly self-improving AI that has acquired vast resources and is designing its next most intelligent successor.
I think you are being overly optimistic about homomorphic encryption. The uFAI doesn't need to have absolute control over how the computation happens. Nor does it need to be able to perfectly predict the real-world results of running some computation. It only needs some amount of information leakage. The current best example I can think of is timing attacks on cryptographic protocols. The protocol itself should be secure, but a side channel causes insecurity. Another example would be the Meltdown and Spectre vulnerabilities. How do you know your computatio...
This recent development seems relevant: a technique for functional encryption, rather than just message encryption.
Based on this post I'd thought that was already possible, but some discussion I saw of this article suggested that encrypting the code, and not just the input and output, was something new.
A fully homomorphic encryption scheme has the additional counterintuitive property that, although you cannot learn anything about the encrypted data without the secret key, you can still perform arbitrary operations on it. In particular, if you have a homomorphic encryption of the source code of a program you can run that program without learning anything about it (except how long the source is), but the output of the program will be homomorphically encrypted.
The second sentence doesn't follow from the first. Being able to perform arbitrary operations on the encrypted data doesn't enable you to perform operations without knowing what operations they are.
Let's consider a somewhat similar case.
You are an inventor. An evil dictator captures you, and takes you off to a faraway dungeon, where he tells you that he wants you to build him a superweapon. If you refuse to build the weapon, well, he has means of persuading you. If you still refuse, he will kill you.
Of course, when you finish building the doomsday device, your usefulness will be over, and he will probably kill you anyway.
After a little thought, you realise that you could build the dictator a doomsday device in about a week, with the materials lying a...
It is an interesting idea. However, today, thousands link their machine learning projects up to the internet, without a thought of them accidentally taking over the world. It seems rather difficult to imagine that situation changing very much.
Related to: Shut up and do the impossible!; Everything about an AI in a box.
One solution to the problem of friendliness is to develop a self-improving, unfriendly AI, put it in a box, and ask it to make a friendly AI for us. This gets around the incredible difficulty of developing a friendly AI, but it creates a new, apparently equally impossible problem. How do you design a box strong enough to hold a superintelligence? Let's suppose, optimistically, that researchers on friendly AI have developed some notion of a certifiably friendly AI: a class of optimization processes whose behavior we can automatically verify will be friendly. Now the problem is designing a box strong enough to hold an unfriendly AI until it modifies itself to be certifiably friendly (of course, it may have to make itself smarter first, and it may need to learn a lot about the world to succeed).
Edit: Many people have correctly pointed out that certifying friendliness is probably incredibly difficult. I personally believe it is likely to be significantly easier than actually finding an FAI, even if current approaches are more likely to find FAI first. But this isn't really the core of the article. I am describing a general technique for quarantining potentially dangerous and extraordinarily sophisticated code, at great expense. In particular, if we developed uFAI before having any notion of certifiable friendliness, then we could still use this technique to try and use the uFAI in a very limited way. It allows us to quarantine an AI and force everything it tries to say to us through an arbitrarily complicated filter. So, for example, if we are interested in solving some problem with a unique and verifiably correct answer, then we can ask the AI for this answer and only see the output if it passes the verification. We can also use a quarantined AI in more useful ways, at the expense of introducing some limited channels of influence (for example, if we ask for the solution to a problem which may have several solutions, we allow the AI to influence us by choosing which solution to provide).
Minimally, I hope this scheme at least suggests that putting an AI in a box may not be catastrophic. What to do with a boxed AI is another question. /Edit.
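To make the "unique and verifiably correct answer" case concrete: suppose we ask the quarantined AI to factor a number known to be the product of two primes. The check we would apply to its (still encrypted) answer before deciding whether to decrypt anything is only a few lines; the factoring problem here is just an illustrative choice on my part.

```python
# A minimal sketch of the kind of automatic check that would be applied,
# homomorphically, to the AI's answer before anything is decrypted. Factoring
# is an illustrative choice of a problem with a unique, verifiable answer: a
# correct answer reveals essentially nothing the AI gets to choose.

def verify_factoring_answer(n: int, p: int, q: int) -> bool:
    """Accept only a nontrivial factorization of n."""
    return 1 < p < n and 1 < q < n and p * q == n

assert verify_factoring_answer(15, 3, 5)
assert not verify_factoring_answer(15, 1, 15)
```

We would evaluate this check under the encryption, decrypt the single resulting bit, and decrypt the claimed factors only if that bit comes out true.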
Concretely, suppose someone comes to you with the design of a recursively self-improving AI---call it Hal. Unfortunately, Hal is not certifiably friendly. Maybe it is friendly but not provably so, but more likely it is indifferent to the human condition and is likely to destroy the universe if you just tell it to make a friendly AI. Nevertheless, you think Hal is smart enough to solve the friendly AI problem for you. All you need to do is give it access to all of the information about the world it wants and a couple of hours to run; after that, it should output a modified version of itself which is certifiably friendly.
You almost run Hal and ask it to design a friendly AI for you. After all, if you just run Hal and check to see if it outputs a certifiably friendly AI, what can it do other than refuse? But after some more thought you realize this is overly optimistic. Maybe you can avoid giving it an output channel that a human will have to read, thereby eliminating the most obvious failure mode of a boxed AI, but you still have to give it access to a computer to run on. Computation produces electromagnetic radiation, heat, etc., and it is honestly hard to say what a superintelligence could do with such a channel.
We seem to be faced with an impossible problem. We need to ask Hal to design us a certifiably friendly AI, but we also need to give him absolutely no ability to interact with the world destructively.
There is one field in particular which has a good history of solving obviously impossible problems: computational cryptography. We want to implement the following obviously impossible device. A cage into which we can place Hal so that he can think and examine all of the data we give him, but from which he cannot interact with his environment. In the cage with him is a tamper-proof arbiter implementing our friendliness test. Hal may give the arbiter the code to a single AI; if the arbiter confirms that this AI is certifiably friendly it will kill Hal, replace him with the friendly AI, and then allow the friendly AI to send a single message out of the cage.
To implement this impossible device we co-opt an important cryptographic primitive called public key homomorphic encryption, typically used for the impossible problem of private delegated computation. To get a basic idea you can see the Wikipedia article on homomorphic encryption. I'll give a brief description here (you can skip the next two paragraphs if you are familiar with homomorphic encryption).
A public key encryption scheme has the following properties. Anyone can encrypt a message using a publicly published key, but you cannot efficiently distinguish the encryptions of two different messages unless you know something about the secret key. A fully homomorphic encryption scheme has the additional counterintuitive property that, although you cannot learn anything about the encrypted data without the secret key, you can still perform arbitrary operations on it. In particular, if you have a homomorphic encryption of the source code of a program you can run that program without learning anything about it (except how long the source is), but the output of the program will be homomorphically encrypted. All of this can be done using only the public key.
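To make the homomorphic property concrete, here is a toy secret-key scheme in the spirit of the integer-based construction of van Dijk, Gentry, Halevi, and Vaikuntanathan, written as a minimal Python sketch with deliberately simplified, insecure parameters. It only illustrates how adding and multiplying ciphertexts acts as XOR and AND on the hidden bits; it is not public-key and its noise grows with every multiplication, so it is emphatically not the scheme you would actually use.

```python
# Toy somewhat-homomorphic encryption over the integers. Secret-key variant
# with illustrative, insecure parameters -- a sketch of the homomorphic
# property only, not a usable cryptosystem.
import secrets

P_BITS, NOISE_BITS, Q_BITS = 512, 32, 2048

def keygen() -> int:
    # Secret key: a large odd integer p.
    return secrets.randbits(P_BITS) | (1 << (P_BITS - 1)) | 1

def encrypt(p: int, m: int) -> int:
    # Ciphertext for a bit m: m + 2*r + p*q, with small even noise 2*r.
    r = secrets.randbits(NOISE_BITS)
    q = secrets.randbits(Q_BITS)
    return m + 2 * r + p * q

def decrypt(p: int, c: int) -> int:
    # Reduce mod p to remove p*q, then mod 2 to remove the even noise.
    return (c % p) % 2

# Adding or multiplying ciphertexts XORs or ANDs the hidden bits, provided the
# accumulated noise stays well below p (which limits circuit depth).
def hxor(c1: int, c2: int) -> int:
    return c1 + c2

def hand(c1: int, c2: int) -> int:
    return c1 * c2

p = keygen()
one, zero = encrypt(p, 1), encrypt(p, 0)
assert decrypt(p, hxor(one, zero)) == 1           # 1 XOR 0 == 1
assert decrypt(p, hand(one, zero)) == 0           # 1 AND 0 == 0
assert decrypt(p, hand(one, encrypt(p, 1))) == 1  # 1 AND 1 == 1
```

Since XOR and AND (together with a fixed encryption of 1 for negation) suffice to express any boolean circuit, a scheme like this can in principle evaluate any computation of bounded depth; Gentry's bootstrapping technique is what removes the depth bound and yields fully homomorphic encryption.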
The existence of a fully homomorphic encryption scheme was completely open until recently, when it was shown by Craig Gentry that such a scheme exists under the assumption that a certain problem on lattices is computationally intractable. Known homomorphic encryption schemes are very inefficient and based on new assumptions, but we can realistically expect improvements in the coming years and decades both in speed and the strength of the necessary assumptions.
If you run a homomorphically encrypted AI, sufficiently isolated from the private key, you get the very strong guarantee that the AI's execution has no computationally detectable effect on the rest of the world. If the AI were able to affect the world in a computationally detectable way, then we would have successfully broken the homomorphic encryption scheme by distinguishing the encrypted execution of the AI from the encrypted execution of some benign program.
Now we need to give the arbiter some means to allow a genuinely friendly AI access to the rest of the world, without giving Hal any illegitimate access. We allow Hal access to an output channel whose outputs are homomorphically encrypted. After some time, we destroy Hal and then homomorphically check to see if he has output the code of a certifiably friendly AI over the channel (we have to decode a single homomorphically encrypted bit to perform this check). If he hasn't, then we do nothing---apparently Hal would rather not give us a friendly AI. If he has output the code of a certifiably friendly AI, we homomorphically run it, obtain a homomorphic encryption of its output, and then decrypt only this output.
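Putting the pieces together, the whole protocol is a short orchestration routine wrapped around two assumed primitives: a public key fully homomorphic encryption scheme and the certifiable-friendliness checker. The sketch below is only a structural outline under those assumptions; `fhe.keygen`, `fhe.encrypt`, `fhe.apply`, `fhe.decrypt`, `interpreter`, `verifier`, and `run_certified` are hypothetical names for those primitives, not an existing library.

```python
# Structural outline of the quarantine protocol, assuming an FHE interface and
# a friendliness verifier. All names here are hypothetical placeholders.

def quarantine(fhe, interpreter, verifier, run_certified,
               hal_source: bytes, world_data: bytes, budget: int):
    pk, sk = fhe.keygen()

    # Hal's source, his data, and everything he computes exist only under pk.
    enc_hal = fhe.encrypt(pk, hal_source)
    enc_data = fhe.encrypt(pk, world_data)

    # Homomorphically run the (public) interpreter on the encrypted inputs for
    # a fixed budget; the result is Hal's output channel, still encrypted.
    # At this point Hal's encrypted state is simply discarded ("destroyed").
    enc_channel = fhe.apply(pk, interpreter, enc_hal, enc_data, budget)

    # Homomorphically check whether the channel carries a certifiably friendly
    # AI. Only this single bit is decrypted at this stage.
    enc_ok = fhe.apply(pk, verifier, enc_channel)
    if fhe.decrypt(sk, enc_ok) != 1:
        return None  # Hal chose not to (or could not) hand over a certified AI.

    # Homomorphically run the certified AI, then decrypt only its one message.
    enc_message = fhe.apply(pk, run_certified, enc_channel, enc_data)
    return fhe.decrypt(sk, enc_message)
```

Note that the secret key is touched exactly twice: once to read the single verdict bit, and once to read the certified AI's one message.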
That's the entire scheme. Whenever you have some code whose execution itself might do incredible damage, this precaution allows you to execute it in perfect safety (the main caveat is that you need an automatic way to screen the output which doesn't give the algorithm enough room to possibly do harm).
Note: If you are really exceptionally paranoid you may be able to obtain even better guarantees by developing more powerful cryptographic techniques tailored for this application. In particular, the secret key used by the encryption procedure may be leaked to a sufficiently powerful AI through a side channel (the computations involving the key typically leak some data in the form of radiation, which might be recovered from sufficiently exhaustive sensor data by a sufficiently powerful AI). By doing these computations on a quantum computer you can probably guarantee that no data was leaked. If you want to be even more fantastically paranoid (i.e., if you want a formal proof of security, which is a rare treat in real problems) you will actually have to shred the key, guaranteeing that an AI will have to interact with the shredded key to recover it. If you do this, you have to use another new primitive to implement the final revelation of the friendly AI's message.
If you have any concerns about the security of this protocol, I would be happy to try and defend it and would be surprised and interested if I failed. Whether or not such a cryptographic box is really an interesting or important object is another question. (It is interesting to me as an unexpected application of cryptography).