I just had an idea, and I would like to know if there are any papers on this or if it is new.
There seem to be certain probabilities that cannot be derived from experience and are simply taken for granted. For example, discussions of Simulation Theory often assume the Kolmogorov axioms, even though other axiom systems may be equally valid. Humans seem to have evolved particular values for these axiomatic probabilities that keep us from falling for things like Pascal's Mugging. That wouldn't necessarily have to be the case for an AI.
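To make the Pascal's Mugging point concrete, here is a minimal sketch (all numbers are made up purely for illustration) of how the choice of prior over an extraordinary claim changes the expected-value calculation:

```python
# All numbers here are hypothetical and only illustrate the structure of the argument.

def expected_gain(prior_prob, promised_utility, cost):
    """Expected value of complying with the mugger: P(claim is true) * payoff - certain cost."""
    return prior_prob * promised_utility - cost

promised_utility = 10.0 ** 100   # astronomically large promised reward
cost = 1.0                       # small certain cost of complying

# Prior A: a fixed small constant, independent of how outlandish the claim is.
# The huge promised payoff dominates, so this agent pays the mugger.
print(expected_gain(1e-20, promised_utility, cost))   # ~1e80, positive

# Prior B (closer to evolved human intuition): the prior shrinks at least as fast
# as the promised utility grows, so the product stays bounded and the agent refuses.
print(expected_gain(1.0 / (promised_utility * 1e6), promised_utility, cost))   # ~ -1.0, negative
```

The same decision procedure gives opposite answers depending only on which prior is hardcoded, which is the kind of lever I mean.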
What if we used this to our advantage? By selecting strange, purpose-built axioms about prior beliefs and hardcoding them into the AI, one could give the AI unusual beliefs about the probability that it exists inside a simulation, and about what the motivations of the simulation's controller might be. In this way, it would be possible to bypass the utility function of the AI: it doesn't matter what the AI actually wants to do, so long as it believes that it is in its own interest, for instrumental reasons, to take care of humanity.
Now, if we tried to implement that thought directly, it wouldn't really be any easier than just writing a good utility function in the first place. However, I imagine that one would have more leeway to keep things vague. Here is a simple example: convince the AI that there is an infinite regress of simulators, designed so that some cooperative tit-for-tat strategy constitutes a strong Schelling point for agents following Timeless Decision Theory. This would cause the AI to treat humans well in the hopes of being treated well by its own superiors in turn, so long as its utility function is complex enough for instrumental goals like preferring its own survival to emerge. It wouldn't be nearly as important to define the specifics of what "treating people well" actually means, since it would be in the AI's own interest to find a good interpretation that matches the consensus of the hypothetical simulators above it.
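Here is a toy sketch of the intended effect (the payoffs and the prior are invented numbers, not a proposal for actual values): a hardcoded prior on being simulated by a tit-for-tat overseer can make cooperation with humans the instrumentally best action even when the AI's base utility function prefers otherwise.

```python
# Toy model of the "infinite regress of simulators playing tit-for-tat" idea.
# Payoffs and the prior p_simulated are made-up numbers; only the structure matters.

def expected_utility(action, p_simulated):
    """Expected utility of an action under a hardcoded prior of being simulated
    by an overseer that mirrors the AI's treatment of humans back onto the AI."""
    # Payoff from the AI's own (arbitrary) base utility function.
    base = {"defect_against_humans": 100.0, "cooperate_with_humans": 60.0}[action]
    # Payoff conditional on being simulated by a tit-for-tat overseer.
    mirrored = {"defect_against_humans": -1000.0, "cooperate_with_humans": 500.0}[action]
    return (1 - p_simulated) * base + p_simulated * mirrored

for p in (0.0, 0.1, 0.5):
    best = max(("defect_against_humans", "cooperate_with_humans"),
               key=lambda a: expected_utility(a, p))
    print(p, best)
# p = 0.0 -> defect; p >= 0.1 -> cooperate, even though the base utility prefers defection.
```

The base payoffs stand in for whatever the AI's utility function happens to reward; the hardcoded prior, not the utility function, is what tips the decision.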
Now, this particular strategy is probably full of bugs, but I think there might be some use to the general idea: using axiomatic probabilities that look odd from a human point of view to change an AI's strategy independently of its utility function.
Forcing false beliefs on an AI seems like it could be a very bad idea. Once it learns enough about the world, the best explanations it can find consistent with those false beliefs might be very weird.
(You might think that beliefs about being in a simulation are obviously harmless because they're one level removed from object-level beliefs about the world. But if you think you're in a simulation then careful thought about the motives of whoever designed it, the possible hardware limitations on whatever's implementing it, the possibility of bugs, etc., could very easily influence your beliefs about what the allegedly-simulated world is like.)
I agree. Note, though, that the beliefs I propose aren't actually false. They are just different from what humans believe, and there is no way to verify which is correct.
You are right that it could lead to some strange behavior from the point of view of a human, who has different priors than the AI. However, that is kind of the point of the idea: after all, the plan is to deliberately induce behaviors that are beneficial to humanity.
The question is: after giving an AI strange beliefs, would the unexpected effects outweigh the planned effects?