A putative new idea for AI control; index here.
EDIT: To remind everyone, this method does not entail the Oracle having false beliefs, just behaving as if it did; see here and here.
An idea I thought I'd been mentioning to everyone, but a recent conversation reveals I haven't been assiduous about it.
It's quite simple: whenever designing an Oracle, you should, as a default, run its output channel through a probabilistic process akin to the false thermodynamic miracle, in order to make the Oracle act as if it believed its message will never be read.
This reduces the possibility of the Oracle manipulating us through message content, because it's acting as if that content will never be seen by anyone.
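As a rough illustration only, here is a toy sketch of that idea. Everything in it is an assumption made for the example: the `Answer` fields, the erasure probability `P_ERASE`, the payoff numbers, and the two choice functions are hypothetical and not taken from any actual Oracle design. It assumes the channel destroys the message unread with some tiny probability, and that the Oracle's utility only counts the worlds where that happens.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Answer:
    text: str
    accuracy: float   # payoff that does not depend on anyone reading the message
    influence: float  # hypothetical extra payoff gained only if a human reads it


# Assumed tiny chance that the output channel destroys the message unread.
P_ERASE = 1e-6


def utility(answer: Answer, message_read: bool) -> float:
    """Toy utility: influence over the reader only pays off if the message is read."""
    return answer.accuracy + (answer.influence if message_read else 0.0)


def naive_choice(answers):
    """Unprotected Oracle: maximizes expected utility over both outcomes."""
    def expected_utility(a: Answer) -> float:
        return P_ERASE * utility(a, False) + (1 - P_ERASE) * utility(a, True)
    return max(answers, key=expected_utility)


def conditioned_choice(answers):
    """Protected Oracle: only worlds where the message was erased count.

    Its beliefs are untouched (P_ERASE is still tiny and still accurate);
    only the utility is restricted to the erasure worlds.
    """
    return max(answers, key=lambda a: utility(a, message_read=False))


ANSWERS = [
    Answer("honest", accuracy=1.0, influence=0.0),
    Answer("manipulative", accuracy=0.8, influence=5.0),
]

if __name__ == "__main__":
    print("naive:", naive_choice(ANSWERS).text)
    print("conditioned:", conditioned_choice(ANSWERS).text)
```

In this toy setup the unprotected Oracle prefers the manipulative answer, while the conditioned one has nothing to gain from manipulating a reader and picks the more accurate answer. Note that the conditioned Oracle never needs a false belief about how likely erasure is, which matches the point in the EDIT above.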
Now, some Oracle designs can't use that (e.g. if accuracy is defined in terms of the reaction of the people who read its output). But in general, if your design allows such a precaution, there's no reason not to put it on, so it should be the default in Oracle design.
Even if the Oracle design precludes this directly, some version of it can often be used. For instance, if accuracy is defined in terms of the reaction of the first person to read the output, and that person is isolated from the rest of the world, then we can get the Oracle to act as if it believed a nuclear bomb was due to go off before the person could communicate with the rest of the world.
Fair enough, let me try to rephrase that without using the word friendliness:
We're trying to make a superintelligent AI that answers all of our questions accurately but does not otherwise influence the world and has no ulterior motives beyond correctly answering questions that we ask of it.
If we instead accidentally made an AI that decides that it is acceptable to (for instance) manipulate us into asking simpler questions so that it can answer more of them, it is preferable that it doesn't believe anyone is listening to the answers it gives, because that is one less way it has of interacting with the outside world.
It is a redundant safeguard. With it, you might end up with a perfectly functioning AI that does nothing; without it, you may end up with an AI that is optimizing the world in an uncontrolled manner.
I don't think so. As I mentioned in another subthread here, I consider separating what an AI believes (e.g. that no one is listening) from what it actually does (e.g. answer questions) to be a bad idea.