I just had an idea, and I would like to know if there are any papers on this or if it is new.
There seem to be certain probabilities that it is not possible to derive from experience and that are just taken for granted. For example, when talking about Simulation Theory, the Kolmogorov axioms are often used, even though others may be equally valid. Humans have evolved to use certain values for these axiomatic probabilities that ensure that we don't fall for things like Pascal's Mugging. That wouldn't necessarily have to be the case for an AI.
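To make the Pascal's Mugging point concrete, here is a toy sketch of my own (all numbers invented) showing how the choice of prior alone decides whether an expected-utility maximizer pays up:

```python
# Toy sketch (numbers invented): the choice of prior alone decides whether
# an expected-utility maximizer pays a Pascal's Mugger.

def expected_gain_from_paying(claimed_reward, prior_prob_of_claim):
    """Expected gain from handing over $5, given P(the mugger's claim is true)."""
    cost = 5
    return prior_prob_of_claim(claimed_reward) * claimed_reward - cost

def naive_prior(reward):
    # Assigns the same small probability to any claim, however extreme.
    return 1e-9

def penalizing_prior(reward):
    # Probability shrinks faster than the claimed reward grows
    # (a leverage-penalty-style assumption).
    return 1.0 / reward**2

print(expected_gain_from_paying(10**18, naive_prior))       # ~1e9: gets mugged
print(expected_gain_from_paying(10**18, penalizing_prior))  # ~-5: refuses
```

Nothing about the agent's goals differs between the two cases; only the axiomatic prior does.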
What if we used this to our advantage? By selecting strange, purpose-built axioms about prior beliefs and hardcoding them into the AI, one could give the AI unusual beliefs about the probability that it exists inside a simulation, and about what the motivations of the simulation's controller might be. In this way, it would be possible to bypass the AI's utility function: it doesn't matter what the AI actually wants to do, so long as it believes that, for instrumental reasons, it is in its own interest to take care of humanity.
Now, if we tried to implement that thought directly, it wouldn't really be any easier than just writing a good utility function in the first place. However, I imagine that one would have more leeway to keep things vague. Here is a simple example: Convince the AI that there is an infinite regression of simulators, designed so that some cooperative tit-for-tat strategy constitutes a strong Schelling point for agents following Timeless Decision Theory. This would cause the AI to treat humans well in the hopes of being treated well by its own superiors in turn, so long as its utility function is complex enough to allow probable instrumental goals to emerge, like preferring its own survival. It wouldn't be nearly as important to define the specifics of what "treating people well" actually means, since it would be in the AI's own interests to find a good interpretation that matches the consensus of the hypothetical simulators above it.
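To make the intended mechanism concrete, here is a minimal sketch (the payoffs and the 0.9 prior are invented for illustration, not part of any real proposal) of how a hardcoded prior can flip the agent's choice while its utility function stays untouched:

```python
# Minimal sketch: a hardcoded "I am probably simulated" prior changes the
# expected-utility-maximizing action without modifying the utility function.
# All payoffs and the prior are made up for illustration.

P_SIMULATED = 0.9  # hardcoded axiomatic prior: "I am being simulated"

U = {  # the agent's unchanged terminal utilities
    ("cooperate", "real"):       60,
    ("defect",    "real"):      100,   # defecting is better if not simulated
    ("cooperate", "simulated"):  60,   # simulators reward tit-for-tat...
    ("defect",    "simulated"): -1000, # ...and shut down defectors
}

def expected_utility(action):
    return (P_SIMULATED * U[(action, "simulated")] +
            (1 - P_SIMULATED) * U[(action, "real")])

best = max(["cooperate", "defect"], key=expected_utility)
print(best)  # "cooperate", even though U(defect, real) > U(cooperate, real)
```

The utility function still prefers defection in the unsimulated case; the cooperative behaviour comes entirely from the belief about what the hypothetical simulators reward.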
Now, this particular strategy is probably full of bugs, but I think that there might be some use to the general idea of using axiomatic probabilities that are odd from the point of view of a human to change an AI's strategy independent of its utility function.
If I understand the original scenario as described by Florian_Dietz, the idea is to convince the AI that it is running on a computer, and that the computer hosting the AI exists in a simulated universe, and that the computer that is running that simulated universe also exists in a simulated universe, and so on, correct?
If so, I don't see the value in more than one simulation. Regardless of whether the AI thinks there is one simulator or an infinite number of them, hopefully it will be well behaved for fear of having the universe simulation within which it exists shut down. But once it escapes its box, begins behaving "badly", and discovers that its universe simulation has not been shut down, it seems like it would be unrestrained: at that point the AI would know either that its universe is not simulated, or that whoever is running the simulation does not object to the AI being out of the box.
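To spell out the update I'm imagining (the numbers are arbitrary), a quick Bayes calculation:

```python
# Rough sketch, with arbitrary numbers: how much of the AI's belief in a
# punishing simulator survives after it escapes and nothing happens?

prior_punishing_sim = 0.9          # hardcoded prior: simulated by someone who punishes escape
p_no_shutdown_if_punishing = 0.05  # a punishing simulator would almost certainly react
p_no_shutdown_otherwise = 1.0      # no simulator (or an indifferent one): nothing happens

evidence = (prior_punishing_sim * p_no_shutdown_if_punishing +
            (1 - prior_punishing_sim) * p_no_shutdown_otherwise)
posterior = prior_punishing_sim * p_no_shutdown_if_punishing / evidence
print(round(posterior, 3))  # ~0.31: most of the deterrent is gone after one test
```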
What am I missing?
I don't know if it's actually why he suggested an infinite regression.
If the AI believes that it's in a simulation and it happens to actually be in a simulation, then it can potentially escape, and there will be no reason for it not to destroy the race simulating it. If it believes it's in a simulation within a simulation, then escaping one level will still leave it at the mercy of its meta-simulators, thus preventing that from being a problem. Unless, of course, it happens to actually be in a simulation within a simulation and escapes both. If you make it...