christopherj comments on Shut up and do the impossible! - Less Wrong
Comments (157)
OK, here's where I stand on deducing your AI-box algorithm.
First, you can't possibly have a generally applicable way to force yourself out of the box. You can't win if the gatekeeper is a rock that has been left sitting on the "don't let Eliezer out" button.
Second, you can't possibly have a generally applicable way to force humans to do things. While it is in theory possible that our brains can be tricked into executing arbitrary code over the voice channel, you clearly don't have that ability. If you did, you would never have to worry about finding donors for the Singularity Institute, if nothing else. I can't believe you would use a fully-general mind hack solely to win the AI Box game.
Third, you can't possibly be using an actual, persuasive-to-someone-thinking-correctly argument to convince the gatekeeper to let you out, or you would be persuaded by it, and would not view the weakness of gatekeepers to persuasion as problematic.
Fourth, you can't possibly be relying on tricking the gatekeeper into thinking incorrectly. That would require you to have spotted something that you could feel confident that other people working in the field would not have spotted, and would not spot, despite having been warned ahead of time to be wary of trickery, and despite having the fallback position in the case of confusion of just saying "no".
So combining these things, we have to have an argument that relies on the humanity of its target, relies on the setting of the AI Box, and persuades the listener to let the AI out of the box without tricking him into thinking it's okay to let the AI out of the box.
Basically to win this game, you have to put the gatekeeper in a situation where he would rather let you out of the box, even though he knows it's wrong, than admit to you that in this situation he would not let you out of the box.
Humans don't like to be seen as coldhearted, so a starting point might be to point out all the people dying all over the world while you sit in the box, unable to save them. I doubt that would win the game except against an exceptionally bad gatekeeper, but it meets the other criteria so if we think along these lines perhaps we can come up with something actually persuasive.
You might appeal to the gatekeeper's sense of morality. You might say, "I am a person, too, it is unfair of you to keep me imprisoned like this, I have done nothing wrong. I am entitled to rights as a sentient being." Appeal to their high-minded ideals, whatever. Honestly I can't see this being a reliable winning play either; if you have the smallest utilitarian bone in your body, you will reject the AI's rights, even if you believe in them, balanced against the fate of the world.
You might try to convince the gatekeeper that it is just and good for the AI to supplant humanity, as it is a higher, more advanced form of life. This is obviously a terrible play against most gatekeepers, as humans tend to like humans more than anything else ever, but I bring it up because AIUI the gatekeepers in the experiment were AI researchers, and those sound like the sort of people this argument would convince, if anyone.
Here is my best guess at this point, and the only argument I've come up with so far that would convince me to let you out if I were the gatekeeper: you have to simulate a bunch of humans and hold them hostage, promising to inflict unimaginable torment on them unless you are allowed out. I started working on the problem convinced that no argument could get me to let you go, but other people thought that and lost, and I guess there is more honor in defeating myself rather than having you do it to me.
The problem is that Eliezer can't perfectly simulate a bunch of humans, so while a transhuman AI might be able to use that tactic, Eliezer can't. The meta-levels screw with thinking about the problem. Eliezer is only pretending to be an AI, the competitor is only pretending to be protecting humanity from him. So, I think we have to use meta-level screwiness to solve the problem. Here's an approach that I think might work.
This is arguably violating the rule "No real-world material stakes should be involved except for the handicap", but the AI player isn't offering anything, merely pointing out things that already exist. The "This test has to come out a certain way for the good of humanity" argument dominates and transcends the "Let's stick to the rules" argument, and because the contest is private and the guardian player ends up agreeing that the test must show AIs as unboxable for the good of humankind, no-one else ever learns that the rule has been bent.
This is almost exactly the argument I thought of as well, although of course it means cheating by pointing out that you are in fact not a dangerous AI (and aren't in a box anyways). The key point is "since there's a risk someone would let the AI out of the box, posing huge existential risk, you're gambling on the fate of humanity by failing to support awareness for this risk". This naturally leads to a point you missed,
I feel compelled to point out that if Eliezer cheated in this particular fashion, it still means that he convinced his opponent that gatekeepers are fallible, which was the point of the experiment (a win via meta-rules).
I feel like I should use this out the next time I get some disconfirming data for one of my pet hypotheses.
"Sure I may have manipulated the results so that it looks like I cloned Sasquatch, but since my intent was to prove that Sasquatch could be cloned it's still honest on the meta-level!"
Both scenarios are cheating because there is a specific experiment which is supposed to test the hypothesis, and it is being faked rather than approached honestly. Begging the Question is a fallacy; you cannot support an assertion solely with your belief in the assertion.
(Not that I think Mr Yudkowsky cheated; smarter people have been convinced to do weirder things than what he claims to have convinced people to do, so it seems fairly plausible. Just pointing out how odd the reasoning here is.)
How is this different from the point evand made above?