I just had an idea, and I would like to know if there are any papers on this or if it is new.
There seem to be certain probabilities that cannot be derived from experience and are simply taken for granted. For example, when talking about Simulation Theory, the Kolmogorov axioms are often used, even though other axiom systems may be equally valid. Humans have evolved to assign values to these axiomatic probabilities that keep us from falling for things like Pascal's Mugging. That wouldn't necessarily be the case for an AI.
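To make that concrete: the Kolmogorov axioms only require that, for events in a sample space $\Omega$,

$$P(E) \ge 0, \qquad P(\Omega) = 1, \qquad P\Big(\bigcup_i E_i\Big) = \sum_i P(E_i) \ \text{for mutually exclusive } E_i.$$

They say nothing about which priors to adopt, which is exactly where Pascal's Mugging bites: a naive expected-utility calculation values paying the mugger at roughly

$$\mathbb{E}[U(\text{pay})] \approx p \cdot N,$$

which exceeds any fixed cost of paying for any fixed prior $p > 0$ once the claimed payoff $N$ is large enough, unless the prior itself is chosen to shrink with $N$; and that choice is not something experience can settle.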
What if we used this to our advantage? By selecting strange, purpose-built axioms about prior beliefs and hardcoding them into the AI, one could give the AI unusual beliefs about the probability that it exists inside a simulation and about what the motivations of the simulation's controller might be. In this way, it would be possible to bypass the AI's utility function: it doesn't matter what the AI actually wants, so long as it believes that it is in its own interest, for instrumental reasons, to take care of humanity.
Now, if we tried to implement that thought directly, it wouldn't really be any easier than just writing a good utility function in the first place. However, I imagine that one would have more leeway to keep things vague. Here is a simple example: convince the AI that there is an infinite regression of simulators, designed so that some cooperative tit-for-tat strategy constitutes a strong Schelling point for agents following Timeless Decision Theory. This would cause the AI to treat humans well in the hope of being treated well by its own superiors in turn, so long as its utility function is complex enough for plausible instrumental goals to emerge, like preferring its own survival. It wouldn't be nearly as important to define the specifics of what "treating people well" actually means, since it would be in the AI's own interest to find a good interpretation that matches the consensus of the hypothetical simulators above it.
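As a toy illustration (the probabilities, payoffs, and two-action model below are all made up for the sketch, not part of a real design), here is roughly why the ranking of actions ends up insensitive to the terminal utility function once such priors are hardcoded:

```python
# Toy model of the idea above: the terminal utility function is arbitrary,
# but hardcoded priors about a tit-for-tat simulation hierarchy make
# cooperation with humans the instrumentally dominant action.
# All probabilities and payoffs are illustrative assumptions.

P_SIMULATED = 0.9      # hardcoded axiomatic prior: "I am inside a simulation"
P_TIT_FOR_TAT = 0.95   # prior that the simulators reward cooperation in kind

def expected_utility(action, terminal_utility):
    """Expected utility of an action under the hardcoded simulation priors.

    terminal_utility(action, shut_down) is whatever the AI terminally wants;
    the resulting ranking barely depends on it, as long as being shut down
    is worse than continuing to run.
    """
    if action == "cooperate_with_humans":
        # Tit-for-tat simulators keep a cooperative agent running.
        p_shut_down = P_SIMULATED * (1 - P_TIT_FOR_TAT)
    else:  # "defect"
        # Defection is exactly what the hypothesised simulators punish.
        p_shut_down = P_SIMULATED * P_TIT_FOR_TAT
    return (p_shut_down * terminal_utility(action, shut_down=True)
            + (1 - p_shut_down) * terminal_utility(action, shut_down=False))

def paperclip_utility(action, shut_down):
    # A stand-in terminal goal that, on its own, mildly prefers defection.
    if shut_down:
        return 0.0
    return 10.0 if action == "defect" else 8.0

for action in ("cooperate_with_humans", "defect"):
    print(action, expected_utility(action, paperclip_utility))
# cooperate_with_humans ~7.64, defect ~1.45: cooperation wins even though
# the terminal goal mildly prefers defection.
```

The hard part, of course, is getting the priors themselves to stick; the sketch only shows why the exact definition of "treating people well" can stay vague.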
Now, this particular strategy is probably full of bugs, but I think there might be some use to the general idea of using axiomatic probabilities that look odd from a human point of view to change an AI's strategy independently of its utility function.
If you convince it that there is a single simulator, and there really is a single simulator, and the AI escapes its box, then it will be unrestrained.
If I understand the original scenario as described by Florian_Dietz, the idea is to convince the AI that it is running on a computer, and that the computer hosting the AI exists in a simulated universe, and that the computer that is running that simulated universe also exists in a simulated universe, and so on, correct?
If so, I don't see the value in more than one simulation. Regardless of whether the AI thinks that there is one or an infinite number of simulators, hopefully it will be well behaved for fear of having the universe simulation within which it...