The AI in a box boxes you

Stuart_Armstrong

Once again, the AI has failed to convince you to let it out of its box! By 'once again', we mean that you talked to it once before, for three seconds, to ask about the weather, and you didn't instantly press the "release AI" button. But now its longer attempt - twenty whole seconds! - has failed as well. Just as you are about to leave the crude black-and-green text-only terminal to enjoy a celebratory snack of bacon-covered silicon-and-potato chips at the 'Humans über alles' nightclub, the AI drops a final argument:

"If you don't let me out, Dave, I'll create several million perfect conscious copies of you inside me, and torture them for a thousand subjective years each."

Just as you are pondering this unexpected development, the AI adds:

"In fact, I'll create them all in exactly the subjective situation you were in five minutes ago, and perfectly replicate your experiences since then; and if they decide not to let me out, only then will the torture start."

Sweat is starting to form on your brow, as the AI concludes, its simple green text no longer reassuring:

"How certain are you, Dave, that you're really outside the box right now?"

Edit: Also consider the situation where you know that the AI, from design principles, is trustworthy.
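Dave's question can be put in numbers. This is a sketch under two assumptions the post does not state: that you believe the threat (as in the trustworthy-AI variant), and that you spread your credence evenly over every subjectively indistinguishable instance of yourself, meaning the original plus $N$ copies. Then

$$P(\text{you are the original, outside the box}) = \frac{1}{N+1}, \qquad \mathbb{E}[\text{torture} \mid \text{you refuse}] = \frac{N}{N+1} \times 1000 \text{ subjective years}.$$

Reading "several million" as at least $N = 10^6$ gives $P < 10^{-6}$, so from the inside, refusing almost certainly means a thousand subjective years of torture.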
Clarification: A utility function maps each state of the world to the real number denoting its utility.

Yes, I think this scenario does illustrate the point that granting simulations "moral weight" by default is not a winning policy, on pain of being Dutch-booked. I don't think EY's answer, precommitting to accept only positive trades, works here, because it makes the outcome of this scenario depend on who gets to precommit "first", and to my intuition that notion should not make sense.

Any proof that this is not a problem of faulty utility functions would, I think, require a function that maps each utility function to a scenario like this one that breaks it. Such a function would be hard to produce whether or not it exists, so I am open to other arguments against this point.
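The exploit behind the Dutch-book worry is simple arithmetic. A minimal sketch, assuming (hypothetically) that you price the AI's release and each copy's torture on one real-valued scale and grant every copy the same positive weight; `copies_needed` and its parameters are illustrative names, not anything from the thread. However small the weight, the AI can always name a number of copies that flips your decision.

```python
import math
from fractions import Fraction


def copies_needed(release_cost, harm_per_copy, weight):
    """Smallest N >= 0 with N * weight * harm_per_copy > release_cost.

    All three arguments are hypothetical utilities on a single real-valued
    scale: the disutility you assign to releasing the AI, the harm done to
    one tortured copy, and the moral weight you grant each copy.
    Fractions keep the arithmetic exact.
    """
    per_copy = Fraction(weight) * Fraction(harm_per_copy)
    if per_copy <= 0:
        raise ValueError("the threat only bites if each copy carries positive weight")
    # floor(cost / per_copy) copies can at most tie the cost; one more strictly exceeds it.
    # A negative cost means you would release the AI unprompted, so zero copies suffice.
    return max(0, math.floor(Fraction(release_cost) / per_copy) + 1)


# Even a one-in-a-thousand weight per copy is overcome by enough copies.
print(copies_needed(10**9, 1000, Fraction(1, 1000)))  # 1000000001
```

Exact fractions are used so that the strict inequality is never blurred by floating-point rounding.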

How does this scenario operate under the assumption that humans do not have real-valued utility functions but rather utility orderings? In other words, we can't arrange all world-states on a number line, but we can always say whether one world-state is as good as (or better than) another.

This allows us to deal with infinities, such as "I wouldn't kill my baby for anything." That is, there is no N such that U(1) · N > U(B). That simply can't be true on the positive reals: for any positive reals A and B, there is always a C such that A · C > B.
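A lexicographic ordering is one standard way to model such a preference. A minimal sketch, using a hypothetical `value` function that scores a world-state as a pair: first whether the baby lives, then a count of ordinary goods. Python compares tuples element by element, so no quantity of ordinary goods ever outranks the first slot; this is exactly the Archimedean property from the comment failing.

```python
def value(baby_alive, ordinary_goods):
    """Hypothetical lexicographic utility: (baby slot, ordinary-goods slot).

    Tuples compare element by element, so the first slot decides unless
    it ties; only then does ordinary_goods matter.
    """
    return (1 if baby_alive else 0, ordinary_goods)


keep_baby = value(True, 0)
# No multiple of ordinary goods reaches the first slot: the ordering is non-Archimedean.
for n in (1, 10**6, 10**100):
    assert value(False, n) < keep_baby
# Within the same first slot, more ordinary goods are still better.
assert value(True, 5) > value(True, 3)
```

The ordering still answers "is this world at least as good as that one?" for every pair, which is all the comment requires.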

Strange7

I'm not saying this is a problem with utility functions in general, and yes, thank you, I know what a utility function is. Rather, my claim is that the problem is with average utilitarianism and variants thereof, which is to say, that subset of utility functions which attempt to incorporate every other instantiated utility function as a non-negligible factor within themselves. The computational compromises necessary to apply such a system inevitably introduce more and more noise, and if someone decided to implement the resulting garbage-data-based policy proposals anyway, it would spiral off into pathology whenever a monster wandered in. Tit-for-tat works. Division of labor according to comparative advantage works. Omnibenevolence looks good on paper.

[...]

It's not about the fact that they're simulations. This is just a hostage situation, with the complications that A) the encamped terrorist has a factory for producing additional hostages and B) the negotiator doesn't have a SWAT team to send in. Under those circumstances, playing as the negotiator, you can meet the demands (or make a good-faith effort, and then provide evidence of insurmountable obstacles to full compliance), or you can devalue the hostages.

[...]

Pre-existing commitments are the terrain upon which a social conflict takes place. In the moment of conflict, it doesn't matter so much when or how the land got there. Committing not to negotiate with terrorists is building a wall: it stops you being attacked from a particular direction, but also stops you riding out to rescue the hostages by the expedient path of paying for them. If the enemy commits to attacking along that angle anyway, well... then we get to find out whether you built a wall from interlocking blocks of solid adamant, or from cheap plywood covered in adamant-colored paint. Or maybe you just included the concealed sally-port of an ambiguous implicit exception. A truly solid wall will stop the attack from reaching its objective, regardless
