I am reading this as: at some point the AI's future behaviour was completely determined by a (very detailed) principle. It's a little ambiguous how far such checks are attempted. I am thinking of a strategy where the AI entertains different ways of making itself naive, knowing that different such processes would predictably end in different answers. That is, it could have a utility function over utility functions, while the verification mechanism only tracks the last utility function that totally determined the future behaviour (as if the naive version were the whole truth).
There is also no mention of whether F needs to be epistemically relevant. It need not model how the AI views the world, only exert actual control over its behaviour. Also, for answers that don't include any claim about how the answer was reached, this just means the answer was generated by some method (which I am not sure adds any information). If the output were "According to relativistic utilitarianism you should do X", you could check that F has no relation to relativistic utilitarianism. However, if the output is plainly "you should do X" or "I think you should do X", there is nothing the AI could be caught lying about.
Also, if F is a brazenly psychopathic liar of a principle, it's not clear the AI would be "busted" for it. F need not be human-readable, and it could easily be obfuscated by including details that almost never get triggered but that, read on paper, suggest a certain kind of story (i.e. insertion of dead code for the purpose of communicating through the verification process). Or it could just spout "You should kill yourself", or "you should do this activity" where that activity happens to be lethal; if it honestly holds that opinion it would not be busted as a liar.
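A toy illustration of that dead-code worry (my own sketch, with made-up names; nothing here comes from the original setup): a principle whose source suggests a benign story, while the benign branch practically never runs.

```python
# Hypothetical principle F: the "safety clause" reads well on paper, but its
# guard condition is so narrow that it is effectively dead code, left in only
# to tell a story to whoever inspects F via the verification process.

def principle_f(situation):
    if situation.get("moral_dilemma") and situation.get("observer_count", 0) > 10**9:
        return "defer to human operators"   # practically never reached
    return "you should do X"                # the branch that actually governs behaviour

# In practice only the second branch ever fires:
print(principle_f({"moral_dilemma": True, "observer_count": 3}))
```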
Given these observations, I would lean towards asking questions whose answers must claim that particular theories have particular results. But that seems akin to running foreign code that I already know well enough to check that F is an instance of it. Thus I could not tap into the AI's original insights, which seems to miss a lot of the power (after all, if I already have the computation rules it must abide by, why not just run them on dumb dedicated hardware?).
That is, it gives no guarantee that the AI's own original input is fit for answering the question. I.e. I can't trust its word on anything, and everything that I can trust can be traced back to the content of the question itself.
The AI doesn't output the utility function; the deception detector does. It runs at a protected layer where it observes the AI program in action and summarizes its behavior, compressing large computations down to a description of the search space and methodology, the goal requirements (utility function), and the results.
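A minimal sketch, under my own assumptions, of the kind of summary record such a protected layer might emit; the type and field names are illustrative, not part of the proposal.

```python
from dataclasses import dataclass

@dataclass
class BehaviorSummary:
    search_space: str      # what space of plans/answers was explored
    methodology: str       # how the search was carried out
    utility_function: str  # the goal requirements actually optimized for
    results: str           # the answer ultimately reported to the human

def summarize(observed_trace) -> BehaviorSummary:
    # In the thought experiment this compression of a large computation is
    # performed by the detector layer, not by the AI itself.
    raise NotImplementedError
```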
But regardless, there are two types of protection provided. The machine ensures that (1) the AI operates according to certain principles, which allow summarization / compression of its results, and (2) a reasonably accurate estimate of the t...
Assume you had access to:
1. a self-improving artificial general intelligence (AGI) of human-level capability or better; and
2. a lie detector that the AGI is known to be susceptible to. A lie for this purpose is any intentional deception.
The AI is not omnipotent. It is running on a physical machine in the real world with limited inputs. One of these inputs is a mechanism by which you can ask it questions and receive answers filtered through the lie detector. A little light on the box turns red if the machine was intentionally deceptive in any way, or green otherwise. As it happens the lie detector is probabilistic, but the chance of a false negative (a lie escaping detection) is infinitesimal: 1 in 10^40, let's say, but assume you could make it even smaller if you needed to. The chance of a false positive (non-deception being labeled a lie) is effectively zero.
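A minimal sketch of that question channel, assuming a made-up `agi.answer` interface for illustration; the false negative/positive rates are the ones stated above.

```python
import random

FALSE_NEGATIVE_RATE = 1e-40  # a lie slips past the detector
FALSE_POSITIVE_RATE = 0.0    # an honest answer is never flagged

def ask(agi, question):
    answer, was_deceptive = agi.answer(question)  # assumed AGI interface
    if was_deceptive:
        light = "green" if random.random() < FALSE_NEGATIVE_RATE else "red"
    else:
        light = "green"
    return answer, light
```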
What could you do with this setup?
EDIT: Some of the details above are provided simply to make the problem tractable. E.g. having the deception detector be probabilistic allows its operating requirements to be less than those of the AI itself. If the AI is running on a supercomputer, it seems intractable to require a super-supercomputer for the deception detector.
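One way (my assumption, not part of the setup) a probabilistic detector could get away with far less compute: spot-check a random sample of the computational trace rather than re-verifying all of it, since the miss probability shrinks exponentially in the number of sampled steps.

```python
def miss_probability(tampered_fraction: float, samples_checked: int) -> float:
    # Chance that every sampled step happens to land on an untampered one.
    return (1.0 - tampered_fraction) ** samples_checked

# E.g. if 1% of the trace were falsified, checking 10,000 random steps would
# miss it with probability ~2e-44, comparable to the 1-in-10^40 rate above.
print(miss_probability(0.01, 10_000))
```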
EDIT2: Deception in this case simply means falsifying or omitting information. Loosely speaking, if the AI falsifies the computational traces fed into the deception detector, or leaves out any information in its response, this is detectable. Presumably the UFAI could output a very nice, very convincing plan of action with very tiny fine print hidden somewhere along the lines of "PS: This is all a lie! You implement this and I'm going to turn you all into meaty paperclips. Haha!" and it would get past the deception detector. But I would rather discussion not get sidetracked by such legalistic, nitpicky scenarios. Assume the humans involved are competent, conservative, responsible people who have set up institutional safeguards to prevent hasty action and make sure that output is sufficiently analyzed down to the very last digital bit by a competent, international team of highly rational people before being acted upon.