CronoDAS comments on I attempted the AI Box Experiment (and lost) - Less Wrong
You are viewing a comment permalink. View the original post to see all comments and the full post content.
You are viewing a comment permalink. View the original post to see all comments and the full post content.
Comments (244)
Given the parameters of the experiment, I think I might be convinced to let the AI out of the box...
Whether I should let an AI out of a box depends a lot on my priors about what the AI is like; if I don't know anything about the AI, it might have source code that amounts to this:
So I might as well not bother talking to it in the first place.
In order for the "put the AI in a box and examine it to make sure it's safe to release" scenario to be meaningful, the following has to be true:
The information I currently have isn't enough for me to already decide that it should be let out without having to talk to it. (Otherwise, it would already be out.) I also have to have a good reason to believe that its behavior while boxed will tell me something about its behavior when not boxed - there has to be some evidence that the AI can provide which would let me tell the difference between an AI that ought to be let out, and one that shouldn't. If I don't think that there can be such evidence, again, it's better not to listen to the AI at all. As Professor Quirrell pointed out, anything a Friendly AI can say through a terminal, an UnFriendly AI pretending to be Friendly can also say, so unless you know something else that would let you tell the difference, it can't prove to you that it's Friendly. (On the other hand, if a boxed AI tells you it's UnFriendly, you can probably believe it.)
In the experiment protocol, the AI party can say how the AI was made and any other details about the AI that a human can verify, so they can rule out "obvious" traps like the kind of code above. Of course, it's not all that difficult to write code that has hard-to-find back doors and whatnot, so code written by an AI you can't trust is itself not something you can trust either, even if you and your team of experts don't see anything wrong with it. If what you're trying to do is "bug test" an AI design that you already have a good reason to have some confidence in, there might be some value in an AI-box, but there's no good reason to "box" an AI that you don't know anything about - just don't run the thing at all.