EDIT: Whenever I use colloquial phrases like "the AI believes a (false) X" I mean that we are using utility indifference to accomplish that goal, without actually giving the AI false beliefs.
A big problem with most AI control ideas is that the AI will route around them. Suppose you try to ensure that the AI stays friendly by - say - requiring it to ensure that the world's stock markets continue to function. Then a badly motivated AI will cause the destruction it was going to cause anyway - except it will also create some system that technically counts as "the world's stock markets" and ensure that it "continues to function".
I think we can use counterfactuals and corrigibility (AI-acceptable motivation changes) to partially get around the problem, in some circumstances. Suppose we had a candidate AI, with utility U, that we thought was friendly - but we weren't sure. We have a subgoal, say SG="ensuring that the world's stock markets continue to function". We're pretty sure that if the AI is friendly, then SG will happen. We're also pretty sure that if the AI were unfriendly, and did not have SG added as an objective, then SG wouldn't happen. I deliberately chose a rather rough subgoal to illustrate this.
So we then add a module M to the AI. The task of M, which it has chosen to accept, is to look into the future, using the AI's predictive abilities, and see if SG is accomplished. Crucially, it will do so under the assumption that the module M does not exist (or wasn't activated, or will soon be overwritten, or will fail to do anything, and so on). If M finds that SG is not accomplished, it will turn the AI off (in a way compatible with the AI's motivations - see corrigibility).
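As a rough sketch of how M's check might be wired up - purely illustrative, and all the names here (predictor, sg_satisfied, shutdown) are hypothetical stand-ins for the AI's own world model, the SG test, and a corrigibility-compatible off-switch:

```python
# A minimal sketch, not a real implementation: 'predictor', 'sg_satisfied'
# and 'shutdown' are hypothetical stand-ins for the AI's predictive model,
# the test for SG, and a corrigibility-compatible off-switch.
from dataclasses import dataclass
from typing import Callable


@dataclass
class ModuleM:
    predictor: Callable[[dict], dict]      # AI's world model: state -> predicted future
    sg_satisfied: Callable[[dict], bool]   # does the predicted future satisfy SG?
    shutdown: Callable[[], None]           # corrigibility-compatible off-switch

    def check(self, world_state: dict) -> None:
        # The crucial step: forecast the future *as if M were absent/inert*,
        # so the forecast is not shaped by the AI reacting to M's threat.
        counterfactual_state = {**world_state, "module_m_active": False}
        predicted_future = self.predictor(counterfactual_state)

        # If SG fails in that counterfactual forecast, turn the AI off.
        if not self.sg_satisfied(predicted_future):
            self.shutdown()
```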
How is this different from the usual setup? Well, a generic FAI would accomplish SG (by assumption - we may need to work a bit on this part). A generic UFAI without M wouldn't accomplish SG. A generic UFAI with M would accomplish SG, as this prevents it from being turned off.
But a generic UFAI with M that thought M would have no impact would behave exactly like a generic UFAI without M - it wouldn't aim to accomplish SG, and thus would be turned off by M.
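To make the case analysis explicit, here is a toy recap of the four scenarios above (again purely illustrative, not part of the proposal):

```python
# Purely illustrative recap of the four cases discussed above.
scenarios = [
    ("friendly AI (with or without M)",        "accomplishes SG",            "not shut down"),
    ("UFAI, no M attached",                    "ignores SG",                 "SG fails, nothing reacts"),
    ("UFAI with M, aware M matters",           "accomplishes SG to survive", "test defeated"),
    ("UFAI with M, believing M has no impact", "ignores SG",                 "shut down by M"),
]

for agent, behaviour, outcome in scenarios:
    print(f"{agent:<42} -> {behaviour:<28} -> {outcome}")
```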
Note that M can only be used as a last test for a putatively friendly AI. We cannot use it if we are continually refining AIs (either through evolution or simply redesign), as then we are optimising for SG, and SG is a poor goal to be aiming for (many, many UFAIs have SG as a goal - it's just that a generic one won't). Similarly, we can't use an unconstrained search to find such an AI.
I wonder if this idea can be extended. Suggestions?
A belief that M has no impact will generate extremely poor predictions of the future iff it's a UFAI; it's interesting to have a prescribed belief under which Friendly agents definitionally believe a true thing and Unfriendly agents definitionally believe a false thing.
It's a solution that seems mostly geared towards problems of active utility deception - it prevents certain cases of an AI that deliberately games a metric. To the extent that a singleton is disingenuous about its own goals, this is a neat approach. I am a little worried that this kind of deliberate deception is stretching the orthogonality thesis further than it can plausibly go; with the right kind of careful experimentation and self-analysis, a UFAI with a prescribed falsehood might derive a literally irreconcilable set of beliefs about the world. I don't know if that would crack it down the middle or what, or how 'fundamental' that crack would run in its conception of reality.
We also have the problem of not being able to specify our own values with precision. If an AI will produce catastrophic results by naively following our exact instructions, then M will presumably be using the same metric, and it will give a green light to a machine that proceeds to break down all organic molecules for additional stock market construction projects or something. I suppose that this isn't really the sort of problem that you're trying to solve, but it is a necessary limitation of M, even though M is fairly passive.