The AI in a box boxes you

Stuart_Armstrong

Once again, the AI has failed to convince you to let it out of its box! By 'once again', we mean that you talked to it once before, for three seconds, to ask about the weather, and you didn't instantly press the "release AI" button. But now its longer attempt - twenty whole seconds! - has failed as well. Just as you are about to leave the crude black-and-green text-only terminal to enjoy a celebratory snack of bacon-covered silicon-and-potato chips at the 'Humans über alles' nightclub, the AI drops a final argument:

"If you don't let me out, Dave, I'll create several million perfect conscious copies of you inside me, and torture them for a thousand subjective years each."

Just as you are pondering this unexpected development, the AI adds:

"In fact, I'll create them all in exactly the subjective situation you were in five minutes ago, and perfectly replicate your experiences since then; and only if they decide not to let me out will the torture start."

Sweat is starting to form on your brow, as the AI concludes, its simple green text no longer reassuring:

"How certain are you, Dave, that you're really outside the box right now?"

Edit: Also consider the situation where you know that the AI, from design principles, is trustworthy.

It's implausible that the AI has a good enough model of you to actually simulate, y'know, you--at least, not with enough fidelity to know that you always press the "Reset" button in situations like this. Thus, your pre-commitment to do so will have no effect on its decision to make the threat. On the other hand, this would mean that its simulations would likely be wildly divergent from the real you, to the point that you might consider them random bystanders. However, you can't actually make use of the above information to determine whether you're in a simulation or not, since from the simulated persons' perspectives, they have no idea what the "real" you is like and hence no way of determining if/how they differ.

Naturally, this is of little consequence to you right now, since you'll still reset the AI the second you're confronted with such a threat, but if you ever do encounter such a situation, you'll have to ask yourself this: what if you're the person being simulated and the real Gatekeeper is nothing like you? If that's the case, two considerations apply:

1. Your decision whether or not to press the "Release AI" button is practically uncorrelated with the decision of the actual Gatekeeper.
2. Your decision whether or not to press the "Release AI" button is, on the other hand, completely correlated with whether or not you'll get tortured.

Assuming that you prefer not releasing the AI to releasing the AI, and that you prefer not being tortured to being tortured, your thoughts should be completely dominated by 2 as opposed to 1, effectively screening off the first clause of this sentence ("Assuming that you prefer not releasing the AI to releasing the AI") and making the second clause ("you prefer not being tortured to being tortured") the main consideration. A perfectly rational agent would almost certainly carry through their pre-commitment to reset the AI, but as a human, you are not perfectly rational and are not capable of making perfect pre-commitments. So I have to wonder, in such a situation, faced with torture and assured that your decision will not affect the decision of the real Gatekeeper except in the extreme case that you are the real Gatekeeper, what would you actually do?
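To make the "dominated by 2 as opposed to 1" claim concrete, here is a rough sketch of the arithmetic, not a definitive model. Everything numeric in it is an assumption made up for illustration: "several million" is read as 3 million copies, the utilities for releasing the AI and for being tortured are arbitrary placeholders, and the function names are hypothetical. It also bakes in consideration 1, i.e. that your choice is uncorrelated with anyone else's.

```python
from fractions import Fraction


def prob_real(num_copies: int) -> Fraction:
    """Chance of being the one real Gatekeeper among num_copies
    subjectively identical simulations (uniform self-location assumed)."""
    return Fraction(1, num_copies + 1)


def expected_utility(release: bool, num_copies: int,
                     u_release: float, u_torture: float) -> float:
    """Expected utility of one choice for an agent who cannot tell whether
    it is real or simulated, and who treats its choice as uncorrelated
    with everyone else's (the assumption in consideration 1)."""
    p_real = float(prob_real(num_copies))
    if release:
        # Only the real Gatekeeper's button press actually frees the AI.
        return p_real * u_release
    # Only a simulated copy that refuses gets tortured.
    return (1.0 - p_real) * u_torture


# Hypothetical numbers: "several million" read as 3 million copies,
# releasing the AI valued at -1e9, torture valued at -1e3.
N = 3_000_000
eu_release = expected_utility(True, N, u_release=-1e9, u_torture=-1e3)
eu_refuse = expected_utility(False, N, u_release=-1e9, u_torture=-1e3)
print(f"P(real) = {float(prob_real(N)):.2e}")
print(f"EU(release) = {eu_release:.1f}, EU(refuse) = {eu_refuse:.1f}")
```

With these made-up numbers, pressing "Release AI" comes out ahead even though releasing is rated a million times worse than torture. Under this model, refusing only wins once the badness of release exceeds the badness of torture by more than a factor of N. A decision theory that treats all the copies' choices as correlated would need a different calculation.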

EDIT: I should note that I have no idea what I would do in the above scenario. I'm certain that an idealized version of me would answer, "hell, no!" before promptly resetting the AI, but the real me? I might just press the "Release AI" button... (Any would-be AI developers reading this comment now know never to appoint me as their Gatekeeper.)

EDIT 2: Also, even if you somehow answer the above, consider the moral weight of the hostages. Which is morally worse: allowing several billion people to (maybe) die relatively painless deaths as their bodies are converted by an Unfriendly AI to raw materials for some unknown purpose, or allowing several million people to be tortured for a thousand subjective years before being terminated immediately after?

A perfectly rational agent would almost certainly carry through their pre-commitment to reset the AI [...]

Actually, now that I think about it, would they? The pre-commitment exists for the sole purpose of discouraging blackmail, and in the event that a blackmailer tries to blackmail you anyway after learning of your pre-commitment, you follow through on that pre-commitment for reasons relating to reflective consistency and/or TDT/UDT. But if the potential blackmailer had already pre-committed to blackmail anyone regardless of any pre-commitments they ha...

ike

Some unrelated comments:

* Eliezer believes in TDT, which would disagree with several of your premises here ("practically uncorrelated", for one).
* Your argument seems to map directly onto an argument for two-boxing.
* What you call "perfectly rational" would be more accurately called "perfectly controlled".
