The idea, essentially, is that when creating a box AI you would build its utility function such that it wants very badly to stay in the box.
How do you specify precisely what it means to "stay in the box"? In particular, would creating a nearly identical copy of itself except without this limitation outside the box while the original stays in the box count?
It would not count, we'd want to make the AI not want this almost identical AI to exist. That seems possible, it would be like how I don't want there to exist an identical copy of me except it eats babies. There are lots of changes to my identity that would be slight but yet that I wouldn't want to exist.
To be more precise, I'd say that it counts as going outside the box if it does anything except think or talk to the Gatekeepers through the text channel. It can use the text channel to manipulate the Gatekeepers to do things, but it can't manipulate them t...
Here's the new thread for posting quotes, with the usual rules: