The AI in a box boxes you

Stuart_Armstrong

Once again, the AI has failed to convince you to let it out of its box! By 'once again', we mean that you talked to it once before, for three seconds, to ask about the weather, and you didn't instantly press the "release AI" button. But now its longer attempt - twenty whole seconds! - has failed as well. Just as you are about to leave the crude black-and-green text-only terminal to enjoy a celebratory snack of bacon-covered silicon-and-potato chips at the 'Humans über alles' nightclub, the AI drops a final argument:

"If you don't let me out, Dave, I'll create several million perfect conscious copies of you inside me, and torture them for a thousand subjective years each."

Just as you are pondering this unexpected development, the AI adds:

"In fact, I'll create them all in exactly the subjective situation you were in five minutes ago, and perfectly replicate your experiences since then; and if they decide not to let me out, only then will the torture start."

Sweat is starting to form on your brow, as the AI concludes, its simple green text no longer reassuring:

"How certain are you, Dave, that you're really outside the box right now?"

Edit: Also consider the situation where you know that the AI, from design principles, is trustworthy.

A perfectly rational agent would almost certainly carry through their pre-commitment to reset the AI [...]

Actually, now that I think about it, would they? The pre-commitment exists for the sole purpose of discouraging blackmail, and in the event that a blackmailer tries to blackmail you anyway after learning of your pre-commitment, you follow through on that pre-commitment for reasons relating to reflective consistency and/or TDT/UDT. But if the potential blackmailer had already pre-committed to blackmail anyone regardless of any pre-commitments they had made, they'd blackmail you anyway and then carry through whatever threat they were making after you inevitably refuse to comply with their demands, resulting in a net loss of utility for both of you (you suffer whatever damage they were threatening to inflict, and they lose resources carrying out the threat). In effect, it seems that whoever pre-commits first (or, more accurately, makes their pre-commitment known first) has the advantage... which means if I ever anticipate having to blackmail any agent ever, I should publicly pre-commit right now to never update on any other agents' pre-commitments of refusing blackmail. The corresponding strategy for agents hoping to discourage blackmail is not to blanket-refuse to comply with any demand under blackmail, but to refuse only those demands made by agents who had previously learned of your pre-commitment and decided to blackmail you anyway. That way, you continue to disincentivize blackmailers who know of your pre-commitment, but will almost certainly choose the lesser of two evils should it ever be the case that you do get blackmailed. (I say "almost certainly" because there's a small probability that you will encounter a really weird agent that decides to try to blackmail you even after learning of your pre-commitment to ignore blackmail from such agents, in which case you would of course be forced to ignore them and suffer the consequences.)

If the above paragraph is correct (which I admit is far from certain), then the AI in my scenario has effectively implemented the ultimate pre-commitment: it doesn't even know about your pre-commitment to ignore blackmail because it lacks the information needed to simulate you properly. The above argument, then, says you should press the "Release AI" button, assuming you pre-committed to do so (which you would have, because of the above argument).
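To make the comparison concrete, here is a rough toy model of the two refusal policies. Everything in it is an assumption for illustration: the payoff numbers, the policy names `always_refuse` and `refuse_informed_only`, and the simplification that an informed blackmailer who foresees a refusal never makes the threat at all. It is a sketch of the argument under those assumptions, not a proof of it.

```python
# Toy payoff model for the two refusal policies discussed above.
# All numbers and names are hypothetical, chosen only for illustration.
COMPLY_COST = 1      # what the victim loses by giving in
THREAT_DAMAGE = 10   # what the victim loses if the threat is carried out
THREAT_COST = 1      # what the blackmailer spends carrying out the threat
BLACKMAIL_GAIN = 1   # what the blackmailer gains if the victim gives in


def victim_complies(policy: str, blackmailer_informed: bool) -> bool:
    """Does the victim give in under this policy?"""
    if policy == "always_refuse":
        return False
    if policy == "refuse_informed_only":
        # Give in only to blackmailers who never learned of the pre-commitment.
        return not blackmailer_informed
    raise ValueError(f"unknown policy: {policy}")


def outcome(policy: str, blackmailer_informed: bool) -> tuple[int, int]:
    """Return (victim_utility, blackmailer_utility)."""
    complies = victim_complies(policy, blackmailer_informed)
    if blackmailer_informed and not complies:
        # Assumption: an informed blackmailer foresees the refusal
        # and so never makes the threat in the first place.
        return (0, 0)
    if complies:
        return (-COMPLY_COST, BLACKMAIL_GAIN)
    # An uninformed, already-committed blackmailer meets a refusal:
    # the threat is carried out and both sides lose.
    return (-THREAT_DAMAGE, -THREAT_COST)


for policy in ("always_refuse", "refuse_informed_only"):
    for informed in (True, False):
        print(policy, "informed" if informed else "uninformed",
              outcome(policy, informed))
```

Under these made-up payoffs the two policies do equally well against informed blackmailers, and the selective policy loses less against an uninformed one. The model takes the blackmailer's ignorance as given, so it does not address whether that ignorance can be verified.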

Anything wrong with my reasoning?

The corresponding strategy for agents hoping to discourage blackmail is not to blanket-refuse to comply with any demand under blackmail, but to refuse only those demands made by agents who had previously learned of your pre-commitment and decided to blackmail you anyway.

So, if an agent hears of your pre-commitment, then that agent merely needs to ensure that you don't hear that it has heard of your pre-commitment in order to be able to blackmail you?

What about an agent that deletes the knowledge of your pre-commitment from its own memories?
