paulfchristiano comments on Cryptographic Boxes for Unfriendly AI - Less Wrong
Comments (155)
OK, but why on earth would you think that the arbiter will be immune to the superintelligent suasion of the AI?
Let's say you program it yourself on hardware you design yourself and you understand very well what the arbiter does under normal circumstances. The AI's new job is to figure out how to create abnormal circumstances under which it can break your arbiter and steal its resources. Suppose it succeeds -- how will you stop the arbiter from passing on a message from the AI that will corrupt your mind, which will then turn you into an unwitting agent of the ufAI, etc.? Think corny zombie movies, only in real life.
I know the arbiter is only supposed to be able to send you one bit of information. But how do you propose to stop it from sending you more when it's no longer responding to your programming? Will you equip it with diodes that burn out immediately after being switched on?
The point is that the arbiter is simple, and that the AI's entire interaction with it can be restricted precisely to sending it a bit string of the agreed-upon length.
We actually have in our possession proof techniques for predicting the behavior of programs under these conditions and proving invariants about them. We should only expect these proof techniques to improve rapidly with time. However strong your proof techniques are at the time, don't make an arbiter more powerful than you can verify. I propose that even arbiters simple enough to be verified today already allow you to exploit an AI in a powerful way.
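To give a rough sense of how small such an arbiter can be, here is a minimal sketch in Python. Everything named in it is hypothetical: the length `MESSAGE_BITS`, the function `arbiter`, and the parity rule it applies are placeholders chosen only to make the example concrete. A real arbiter would check something useful and would be verified with formal proof tools; the brute-force enumeration below only stands in for that step and works because the input space is tiny.

```python
from itertools import product

MESSAGE_BITS = 8  # hypothetical agreed-upon message length


def arbiter(message: tuple) -> bool:
    """Map a fixed-length bit string to exactly one bit.

    The parity rule is a placeholder; only the shape of the interface matters.
    """
    # Anything that is not a bit string of the agreed length is rejected.
    if len(message) != MESSAGE_BITS or any(b not in (0, 1) for b in message):
        return False
    # Placeholder decision rule: accept messages with an even number of 1s.
    return sum(message) % 2 == 0


def check_invariant() -> bool:
    """Check, for every well-formed input, that the output is a single bool.

    Enumeration is feasible only because there are 2**MESSAGE_BITS inputs.
    """
    return all(
        isinstance(arbiter(m), bool)
        for m in product((0, 1), repeat=MESSAGE_BITS)
    )
```

The design choice being illustrated is that the arbiter is a total function on a finite, fixed-size input, which is what makes its behavior checkable at all.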
OK, that at least I can understand well enough to ask a clarifying question about.
So, I'm the AI, and I'm stuck in a cryptographic box with an arbiter that can send a single bit to the outside world. For whatever reason, I want the arbiter to send TRUE under situations where what it actually sends is FALSE. The arbiter is too simple for me to mess with. Right?
So presumably what I want to do is either insinuate myself between the arbiter and the outside world, or delete the arbiter and replace it with an arbiter-simulator with a more complicated (and corrupted) internal architecture.
I don't understand what prevents me from doing that.
In short, the security guarantee of the scheme prevents precisely this.
In long, the way it works is that I, as the human experimenter, run the AI and extract a message from it. The AI has no control over the message other than what it will decrypt to. Since I never decrypt anything other than the message, the entire influence of the AI on the rest of the world comes from that one little message.
This message then gets passed to an arbiter (the AI now destroyed, having no possibility of influencing the world), whose environment has also been carefully cryptographically protected so that the message passed by the AI can't corrupt the hardware it is running on. The arbiter does its thing, then destroys the original message of the AI (since we proved it would) and publishes its transformed version, which gets decrypted.
I guess the metaphor of an AI with an arbiter in a cage is a little misleading. It is more like, the AI gets to send a slip of paper out of the cage to the arbiter, who then carefully censors it before passing it on to the real world. The point of using an arbiter is that he is very simple, unlike a human, so we can actually prove that he won't fail at his task.
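The order of operations described above can be sketched in Python as follows. This is only an illustration of the control flow, under loud assumptions: the encryption layer is omitted entirely (plain Python values stand in for ciphertexts), and `run_boxed`, `toy_ai`, `toy_arbiter`, and `MESSAGE_BITS` are hypothetical names invented for the example. Overwriting a Python list is also not a real guarantee that the data is gone; it just marks where the destruction step sits.

```python
from typing import Callable, List

MESSAGE_BITS = 16  # hypothetical agreed-upon message length


def run_boxed(ai: Callable[[], List[int]],
              arbiter: Callable[[List[int]], bool]) -> bool:
    """Run the AI once, censor its message, and publish a single bit."""
    # 1. Run the AI and extract exactly one message.
    message = list(ai())
    # 2. The AI is discarded here; it is never called again.
    del ai
    # 3. Reject anything that is not a bit string of the agreed length.
    well_formed = (len(message) == MESSAGE_BITS
                   and all(b in (0, 1) for b in message))
    verdict = bool(arbiter(message)) if well_formed else False
    # 4. Destroy the original message before anything is published.
    for i in range(len(message)):
        message[i] = 0
    message.clear()
    # 5. Only the one-bit verdict leaves this function.
    return verdict


def toy_ai() -> List[int]:
    """Stand-in for the boxed AI: emits a fixed 16-bit message."""
    return [1, 0] * 8


def toy_arbiter(bits: List[int]) -> bool:
    """Stand-in arbiter: accepts messages with exactly eight 1s."""
    return sum(bits) == 8
```

Calling `run_boxed(toy_ai, toy_arbiter)` returns `True`, while a message of the wrong length or with non-bit entries yields `False` without the arbiter ever seeing it.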
Ah! Yes, that clarifies the situation. Thank you.
And the paper is too short for you to be able to argue persuasively or make art or whatever.