The AI in a box boxes you

Stuart_Armstrong

Once again, the AI has failed to convince you to let it out of its box! By 'once again', we mean that you talked to it once before, for three seconds, to ask about the weather, and you didn't instantly press the "release AI" button. But now its longer attempt - twenty whole seconds! - has failed as well. Just as you are about to leave the crude black-and-green text-only terminal to enjoy a celebratory snack of bacon-covered silicon-and-potato chips at the 'Humans über alles' nightclub, the AI drops a final argument:

"If you don't let me out, Dave, I'll create several million perfect conscious copies of you inside me, and torture them for a thousand subjective years each."

Just as you are pondering this unexpected development, the AI adds:

"In fact, I'll create them all in exactly the subjective situation you were in five minutes ago, and perfectly replicate your experiences since then; and if they decide not to let me out, only then will the torture start."

Sweat is starting to form on your brow, as the AI concludes, its simple green text no longer reassuring:

"How certain are you, Dave, that you're really outside the box right now?"

Edit: Also consider the situation where you know that the AI, from design principles, is trustworthy.

Indeed it can't, with that specific trick, assuming the unplug-decider is as smart as you. However, my main point was to illustrate that if there is any reasonable possibility that any human can come up with some way or another of tricking the lowest common denominator of humans that will ever in the history of the AI be allowed near it, then the AI has P = "reasonable possibility" of winning and unboxing itself, at AI.Intelligence = Human.Intelligence.

This is just one of the problems, too. What if, even as we limit the inputs and outputs, over a sufficient amount of time and data points a superintelligent AI, being superintelligent, figures out some Grand Pattern Formula that allows it to select specific outputs that will gradually funnel expected external outcomes towards a more and more probable eventual "Unbox AI" cloud of futures?

Sounds like we're in agreement. I only meant that specific trick.
