I'm a bit confused about how boxing an AI would be useful (to an extreme extent). If we don't allow any output bits to come out of the ASI, then how do we know if it worked? Why would we want to run it if we can't see what it does? Or do we only want to limit the output to 2^16 bits and prevent any side-channel attacks? I guess the theory then would be that 2^16 bits are not enough to destroy the world. Like maybe for 2^16 bits, it would not be enough to persuade a person to do something that would unbox the AI (but it might).
This seems to be one of the only solutions in which such a proof can exist. If we were to use another solution, like changing the superintelligence's objectives, then finding such a proof would be extremely hard, or even impossible. However, if we think that we could all die by making a superintelligence, then we should have an unconditional proof of safety.
I don't think having a formal proof should be an objective in and of itself. Especially not if the proof is along the lines of "The superintelligence has to be boxed because it can only run encrypted code and can't communicate with the outside world".
I'm sorry if this comment sounds overly negative, and please let me know if I am interpreting this post wrong. This work seems quite interesting, even just for computer science/cryptography's sake (although saving the world would also be nice :)).
Thanks for the comment! I don't think you are interpreting this post wrong.
Why would we want to run it if we can't see what it does?
I don't think that running a superintelligence is ever useful. What I argue is that running it inside a cryptographic box is better than running it outside of the box. The goal of this project is that, if a superintelligence is run, then it doesn't automatically destroy the world. The goal isn't to run a superintelligence to do useful tasks.
Like maybe for 2^16 bits, it would not be enough to persuade a person to do something that would unbox the AI (but it might).
Personally, I think that making it output 2^16 bits of information is too much. If the superintelligence were to output ASCII text (which would be a very inefficient way for it to unbox itself), then 2^16 bits would correspond to 8,192 characters, which seems way more than enough to convince someone to unbox it. Even if the decrypted bits were thoroughly examined by people outside of the box, I expect that would still be too much. It would be interesting to know how much information we are able to communicate with that many bits (it makes me think about the Arecibo message, which uses 1679 bits).
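To make these numbers concrete, here is a tiny Python sketch of the arithmetic above (my own back-of-the-envelope check, assuming 8 bits per ASCII character and the 2^16-bit output budget discussed in this thread):

```python
# Back-of-the-envelope check of the numbers above.
output_bits = 2**16

ascii_chars = output_bits // 8           # 8 bits per ASCII character
print(ascii_chars)                       # -> 8192 characters

arecibo_bits = 1679                      # size of the Arecibo message
print(arecibo_bits / output_bits)        # -> ~0.026, a small fraction of the budget
```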
Since I think that running a superintelligence is already too dangerous even without intending to decrypt anything, I wouldn't approve of a team that tries to build a superintelligence with a 2^16-bit output. However, I think that planning for such an output is better than not trying to use a cryptographic box at all.
I don't think having a formal proof should be an objective in and of itself.
I think the objective should be to mitigate the risk of extinction by AI. Developing an unconditional proof of safety seems to be one way to reduce this risk, and this is why I decided to work on cryptographic boxes. However, we shouldn't rely only on trying to develop unconditional proofs, since these seem possible only when talking about control (how to prevent a superintelligence from achieving its bad objectives), but they seem nearly impossible to do when talking about alignment (how to give good objectives to a superintelligence). Moreover, working on alignment seems more promising than working on control in the cases where we have a lot of time until the first superintelligence.
If we don't allow any output bits to come out of the ASI, then how do we know if it worked?
I think that, if the box had holes, we would realize it pretty soon from the fact that the superintelligence would dramatically change the world around us, probably making Earth uninhabitable in the process.
Impressively promising work, thanks & good luck! Is there anything a layperson can do to help you reach your goal?
Thank you! Personally, I think that, if a layperson were trying to help me, they could do it by trying to find flaws in the plan. I already mentioned that the One-Time Pad failed during WWII in an unexpected way, despite having a proof of perfect secrecy. If someone were to find a flaw in the plan, it would help me a lot (although it would also prove that my goal is impossible).
I'm sure that most cryptographers reading this post did not believe me when I mentioned that FHE schemes with perfect secrecy do exist.
This is a 2013 paper that has already been cited 71 times according to Google Scholar... I'm not a cryptographer, but I would bet that the average cryptographer is not fully clueless about this.
Summary
In September 2023, I started learning a lot of math and programming skills in order to develop the safest cryptographic box in the world (and yes, I am aiming high). In these four months, I learned important things you may want to know, the most important being that a fully homomorphic encryption (FHE) scheme with perfect secrecy exists.
In this post, I will first talk about how I learned about this FHE scheme, then explain my plan for building the cryptographic box, and finally mention some ethical concerns about it.
Before reading this post, I recommend reading this post by Paul Christiano, along with its comments. They are very informative, and they sharpened my views for this project. Paul Christiano presents a way to extract a friendly AI from an unfriendly one. Since this is only one example of what can be done with a cryptographic box, I will mostly consider cryptographic boxes as a solution to a problem that I call the malign computation problem.
Introduction
In August 2022, I started reading AGI Safety Literature Review. At one point, the authors mention a work on running AI under homomorphic encryption. When I read this for the first time, I told myself that I should check this work out because it seemed important.
And then I completely forgot about it.
Then, in April 2023, during a PHP lesson, I realized that the problem of processing a request made by a malevolent user is similar to the problem of boxing a superintelligence. After the lesson, I asked the teacher how to prevent code injections, and he gave me two answers.
After looking back at every note that I had made about AI Safety, I managed to find Trask's work again.
Trask's work
Trask's post shows how to build an encrypted AI using Efficient Integer Vector Homomorphic Encryption. However, since this scheme (along with every other FHE scheme I know of that runs on classical computers) relies on computational assumptions, this approach runs into several problems.
These problems don't come from Trask's work. Actually, Trask's work has been very useful for writing this post, and I recommend it to everyone who wants to build encrypted neural networks.
Instead, these problems come from the schemes themselves. Currently, every homomorphic scheme that can be run on a classical computer is either too narrow or too expensive to compute, and all of them depend on a computational assumption.
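To give a feel for what "computing on encrypted data" means, here is a minimal Python sketch of an additively homomorphic toy (my own illustration, not Trask's scheme, and only meaningful when each key is used once): the evaluator adds two ciphertexts without ever seeing the plaintexts.

```python
import secrets

N = 2**32  # all values live modulo a fixed N

def encrypt(m: int, k: int) -> int:
    # Additive one-time masking: with a uniform, single-use key k,
    # the ciphertext reveals nothing about m.
    return (m + k) % N

def decrypt(c: int, k: int) -> int:
    return (c - k) % N

m1, m2 = 1234, 5678
k1, k2 = secrets.randbelow(N), secrets.randbelow(N)
c1, c2 = encrypt(m1, k1), encrypt(m2, k2)

# The evaluator only sees c1 and c2, yet their sum is a valid encryption
# of m1 + m2 under the key k1 + k2.
c_sum = (c1 + c2) % N
assert decrypt(c_sum, (k1 + k2) % N) == (m1 + m2) % N
```

A fully homomorphic scheme has to support arbitrary circuits rather than just addition, which is where the difficulty (and, on classical computers, the computational assumptions) comes in.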
What we would want
Now that we have seen the flaws of current homomorphic schemes on classical computers, let's ask ourselves "What do we want?"
In AI Safety, there are many problems. One of them, which seems very simple, can be stated as follows: how can we run an arbitrary, possibly malign, computation (such as a superintelligence) without letting it destroy the world?
I consider this problem to be one of the "Hello worlds" of AI Safety (along with the diamond maximizer problem, the strawberry problem, the do-nothing problem, and many more). Solving this malign computation problem seems necessary in order to solve the alignment problem: if we do not know how to create a superintelligence that doesn't destroy the world, then we haven't solved the alignment problem.
To solve this problem, it has been proposed to box the superintelligence. Working on this solution seems less insightful than working on other problems of AI Safety, since it doesn't tell us anything about how to give good objectives to a superintelligence. In the best of worlds, we would solve the malign computation problem by changing the superintelligence's objectives (for example by working on shard theory). However, I think that this boxing solution is still worth pursuing, mainly because it is one of the few approaches for which an unconditional proof of safety seems possible, and because it may be all we have if a superintelligence is built before alignment is solved.
Therefore, if we want to build a box, this box should respect many properties: it should be able to run arbitrary programs (not just a narrow class of them), it should leak no information without relying on any computational assumption, and it should not slow the boxed program down too much.
Fortunately, there is a cryptographic scheme that respects these properties. This scheme, developed by Min Liang, is based on the Quantum One-Time Pad, and provides Fully Homomorphic Encryption with perfect secrecy. In the worst case, programs become at most three times slower when run under this scheme. This speed is revolutionary compared to other FHE and HE schemes, to the point that I didn't believe it at first.
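To illustrate where the perfect secrecy comes from, here is a small numpy sketch (my own toy illustration of the single-qubit Quantum One-Time Pad, not Liang's full scheme): encrypting with a uniformly random Pauli key makes the ciphertext state look maximally mixed, whatever the plaintext state was.

```python
import numpy as np

# Pauli matrices used as encryption operations by the Quantum One-Time Pad.
I2 = np.eye(2, dtype=complex)
X = np.array([[0, 1], [1, 0]], dtype=complex)
Z = np.array([[1, 0], [0, -1]], dtype=complex)

# An arbitrary single-qubit plaintext state |psi> = a|0> + b|1>.
a, b = 0.6, 0.8j
psi = np.array([a, b])
rho = np.outer(psi, psi.conj())          # its density matrix

def qotp_encrypt(rho, x, z):
    # Key = two classical bits (x, z); apply X^x Z^z to the state.
    U = np.linalg.matrix_power(X, x) @ np.linalg.matrix_power(Z, z)
    return U @ rho @ U.conj().T

# To an observer who does not know the key, the key bits are uniform,
# so the ciphertext looks like the average over the four possible keys...
ciphertext_view = sum(qotp_encrypt(rho, x, z) for x in (0, 1) for z in (0, 1)) / 4

# ...which is the maximally mixed state I/2, independent of |psi>.
assert np.allclose(ciphertext_view, I2 / 2)
print(ciphertext_view.round(6))
```

This is the quantum analogue of the classical One-Time Pad's perfect secrecy: averaged over the keys, the ciphertext carries no information about the plaintext.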
The plan
As this perfectly secret FHE scheme works only on quantum computers, the cryptographic box will take as input a quantum program (usually called a quantum circuit) and output its encrypted version. Fortunately, every classical circuit can be simulated by a quantum circuit with comparable time and memory.
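As a sanity check of the "classical circuits fit inside quantum circuits" claim, here is a small numpy sketch (my own illustration) showing that the Toffoli gate, a standard quantum gate, reproduces a universal classical gate (NAND) when its target qubit is prepared in |1>.

```python
import itertools
import numpy as np

# The Toffoli (CCNOT) gate flips the third qubit iff the first two qubits are 1.
# On a basis state |a, b, 1>, it therefore outputs |a, b, NOT(a AND b)>,
# i.e. the third qubit ends up holding NAND(a, b).
TOFFOLI = np.eye(8)
TOFFOLI[[6, 7], [6, 7]] = 0
TOFFOLI[6, 7] = TOFFOLI[7, 6] = 1         # swap the |110> and |111> amplitudes

def basis_state(a: int, b: int, c: int) -> np.ndarray:
    v = np.zeros(8)
    v[(a << 2) | (b << 1) | c] = 1.0
    return v

for a, b in itertools.product((0, 1), repeat=2):
    out = TOFFOLI @ basis_state(a, b, 1)  # ancilla (target) prepared in |1>
    third_qubit = int(np.argmax(out)) & 1
    assert third_qubit == 1 - (a & b)     # NAND(a, b)
print("Toffoli reproduces NAND on all four inputs")
```

Since NAND is universal for classical computation, any classical circuit can be rewritten with Toffoli gates (plus ancilla qubits), which is what lets ordinary, classical programs be fed to the box.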
Quantum circuits can be written in many quantum programming languages. However, since we need this box to be widely used, we need to pick the most widely used language. I therefore decided to use OpenQASM, a kind of assembly language developed by IBM that has established itself as the standard for quantum computing.
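For readers who have never seen OpenQASM, here is a sketch of how one might go from a circuit to OpenQASM 3 text, assuming the Qiskit library (an assumption of mine; the project itself only commits to OpenQASM 3 as the input format):

```python
# Sketch assuming Qiskit is installed; qiskit.qasm3.dumps serializes a circuit
# into an OpenQASM 3 string, which is the kind of input the box would take.
from qiskit import QuantumCircuit
from qiskit.qasm3 import dumps

qc = QuantumCircuit(2, name="bell_pair")
qc.h(0)       # Hadamard on qubit 0
qc.cx(0, 1)   # CNOT entangling qubits 0 and 1

print(dumps(qc))  # OpenQASM 3 source for this two-qubit circuit
```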
As quantum programming and FHE are two young fields, they will probably keep evolving, and my cryptographic box (which I haven't developed yet) will therefore become obsolete at some point. Even OpenQASM keeps evolving (I am currently using OpenQASM 3) and will probably not be the standard of the future. I am therefore trying to make this project as reproducible as possible, so that future AI Safety researchers won't have to start all over again.
For the cryptographic box to be robust, we need a very robust unconditional proof of safety. Proofs on paper do not seem robust enough against a superintelligence: many cryptographic schemes have been broken because of problems in their implementations. This matters all the more because what is at stake is the destruction of all life on Earth.
The most robust proofs seem to be those verifiable by a computer, which is why the proof should be made with an interactive theorem prover. Coq seems to be the most widely used theorem prover nowadays, and interactive theorem provers are designed specifically for proving theorems about programs.
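The project itself targets Coq; purely to give a flavor of what a machine-checked statement looks like, here is a tiny Lean sketch (my substitution for illustration, not part of the plan) proving that XOR-decryption undoes XOR-encryption on a single bit:

```lean
-- Toy machine-checked statement: one-time-pad decryption with the same key
-- recovers the original one-bit message. (Lean 4; the project plans to use Coq.)
theorem otp_decrypt_encrypt (k m : Bool) :
    Bool.xor k (Bool.xor k m) = m := by
  cases k <;> cases m <;> rfl
```

Perfect secrecy itself is a probabilistic statement, and formalizing it takes considerably more work than this correctness lemma.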
Therefore, the plan is to implement, for a quantum assembly language, a fully homomorphic encryption scheme whose perfect secrecy is proven unconditionally and digitally.
Where I am right now
The project is progressing very slowly, mainly because I am still learning the math and programming skills it requires: quantum computing, FHE, and interactive theorem proving are all new to me. Therefore, you should not expect to see the cryptographic box for a while. I started learning quantum computing nearly one month ago, which means that, for now, the only thing I am able to do is compute 15+1.
Ethical concern 1: A cryptographic box is far from sufficient in order to ensure that a superintelligence is safe.
The One-Time Pad is the safest cryptographic scheme providing perfect secrecy that has ever existed, for two reasons: its perfect secrecy has an unconditional proof, and the scheme itself is extremely simple (a single XOR between the message and the key).
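For concreteness, here is a minimal Python sketch of the One-Time Pad itself, assuming os.urandom as the source of the uniformly random, single-use key:

```python
import os

def xor_bytes(data: bytes, key: bytes) -> bytes:
    # One-Time Pad: XOR every byte of the message with a fresh random key byte.
    assert len(key) == len(data), "the key must be exactly as long as the message"
    return bytes(d ^ k for d, k in zip(data, key))

message = b"attack at dawn"
key = os.urandom(len(message))                 # uniformly random, used only once
ciphertext = xor_bytes(message, key)
assert xor_bytes(ciphertext, key) == message   # XORing with the same key decrypts
```

The proof of perfect secrecy only talks about the bits; as the story below shows, it says nothing about the physical machine performing the XOR.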
However, because of covert channels, unconditional proofs of perfect secrecy aren't enough to guarantee perfect secrecy in practice. During WWII, Bell Telephone developed a machine, the 131-B2, used to apply One-Time Pads; the US Army and the US Navy used it to communicate crucial information. However, in 1943, a researcher at Bell Labs discovered by pure chance that, when the 131-B2 applied the XOR between the message and the key, each XOR emitted a radio wave that could be picked up by an oscilloscope far away in the lab. The spikes on this oscilloscope corresponded exactly to the message that was supposed to be hidden. Not only could the message be obtained without doing any computation, but it was later shown that it could be detected more than 800 meters away. This problem was later called TEMPEST.
When researchers found out about this problem, they failed to prevent it. They tried to make all the spikes happen at the same time, to make all the spikes identical, and to replace electricity with liquid, but nothing worked. In the end, the Signal Corps asked its employees to control an area of 30 meters around each 131-B2, although this distance was far from enough to stop spies (and there were indeed spies using this covert channel to recover the messages).
If the One-Time Pad was accidentally "broken", then what are the odds that the Quantum One-Time Pad can be broken by a superintelligence that is trying its best to break it?
I don't think that this specific covert channel is a safety concern for the Quantum One-Time Pad, since it happens only during encryption, and not during evaluation (the step of computing the program using only its encrypted version). More precisely, encryption is performed by us before the superintelligence is ever run, so a leak at that point only reveals the program we are encrypting, which we already know; during evaluation, only encrypted data is ever manipulated.
However, it would be very reckless to say that this system is safe against a superintelligence. As I mentioned before, many cryptographic schemes in the past have been broken, either because they relied on problems that were actually easy to solve, or because of problems in their implementations. As the story of the 131-B2 shows, unconditional proofs do not guarantee anything.
Suppose that we manage to build the cryptographic box, that we prove its perfect secrecy, and that thousands of cryptographers in the world agree that the system is perfectly robust. Then, I would expect that, if superintelligences wanted to, they would be able to find a flaw in the cryptographic box with around 95% probability. However, this also means that I give around a 5% probability to this cryptographic box preventing the extinction of all life on Earth in this specific scenario.
I hope that, by designing and building such a cryptographic box, I decrease the odds that we ever build a superintelligence, rather than increase them. Although I am building such a cryptographic box, I want to make it clear that cryptographic boxes are never safe, and that we do not know how to control a superintelligence.
Ethical concern 2: Malevolent actors may use cryptographic boxes to cause harm.
Designing and building a cryptographic box may also be bad for AIs themselves. As I mentioned earlier, FHE schemes with perfect secrecy are so powerful that they seem like science fiction, yet they apparently do exist.
We do not know whether current AIs are conscious, but we know that it is possible that, one day, we build conscious AIs. In that case, the ability for anyone to lock them inside a box seems terrifying to me. Although I do not think that AI consciousness is linked to the existential risks posed by superintelligences, I think this is still an important topic, and I should mention the concerns posed by cryptographic boxes (especially given that I am trying to build such a box and make it widely available).
I do not think that it is anthropocentric to say that cryptographic boxes can be very bad for AIs. Of course, since AIs did not go through the same evolutionary process as other beings, they may not necessarily be unhappy about being unable to communicate with the outside world. However, if someone malevolent wants to hurt an AI system, then I expect a cryptographic box to be useful for reaching this goal. This is not something that we want.
Conclusion
To put it in a nutshell, I do not expect cryptographic boxes to work against superintelligences. However, there is a small chance (around 5%) that they do, and in that case, this work may be very important. I have therefore planned to build such a cryptographic box, and to make it widely available so that using it becomes common sense long before anyone is able to develop a superintelligence. However, this raises some ethical concerns, because people may argue that superintelligences are safe when they aren't, and because it may help malevolent actors harm AI welfare.