The AI doesn't output the utility function; the deception detector does. It runs at a protected layer where it observes the AI program in action and summarizes its behavior, compressing large computations down to a description of the search space and methodology, the goal requirements (utility function), and the results.
But regardless, there are two types of protection provided. The machine ensures that (1) the AI operates according to certain principles, which allow summarization / compression of its results, and (2) a reasonably accurate estimate of the total work performed allows calculation of energetic requirements which can be checked against the electric meter. Deception requires plotting and scheming. Plotting and scheming costs energy points.
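To make point (2) concrete, here is a minimal sketch of the kind of meter check being described. It is only an illustration: the function name, the per-operation energy figure, and the tolerance are all hypothetical, and a real audit would need a far more careful model of hardware energy use.

```python
def energy_audit(claimed_ops, joules_per_op, metered_joules, tolerance=0.05):
    """Return True if the metered energy is explained by the claimed work.

    claimed_ops    -- operations accounted for in the summarized trace (assumed)
    joules_per_op  -- assumed average energy cost of one operation
    metered_joules -- reading taken from the electric meter
    tolerance      -- allowed fractional overshoot before raising a flag (assumed)
    """
    expected = claimed_ops * joules_per_op
    # Only unexplained *extra* consumption is suspicious here: it would
    # indicate work that never showed up in the trace.
    return metered_joules <= expected * (1 + tolerance)


if __name__ == "__main__":
    # Trace accounts for 1e15 operations at an assumed 1 nJ each (~1e6 J).
    print(energy_audit(1e15, 1e-9, 1.02e6))  # within tolerance
    print(energy_audit(1e15, 1e-9, 1.50e6))  # 50% more energy than explained
```

The check is deliberately one-sided: it flags energy that the trace cannot account for, which is the signature of hidden plotting and scheming in this argument.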
So a program that intentionally tries to deceive its operators would be found out. First, because that intention to deceive would be pulled out and identified as a goal by the detector, something which the program can only get around by breaking out of the box containing it. And looking for a breakout exploit would itself have been caught by the detector.
But assuming it does break out, there are limits to what it can achieve. It still has to pass back a computational trace which validates against a human-approved set of goals. That requires geometrically more computational power, as the program must find a solution which satisfies G(friendly-looking) & G(evil), then reverse-engineer a fraudulent computational trace which shows the same solution deriving from a search for G(friendly-looking) only. The "hidden" computation looking for G(friendly-looking) & G(evil) has nowhere to hide.
If the AI used an efficient method for scheming, then calculated a semi-efficient "legit" computation, and then faked a really energy-inefficient log, it could fit extra baggage within the energy budget.
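A rough way to picture the loophole: if the log claims more work than the legitimate computation actually needed, the difference is energy the meter will not flag. The sketch below just computes that slack; the function name and all the numbers are made up for illustration.

```python
def hidden_budget(claimed_ops, actual_legit_ops, joules_per_op):
    """Energy (in joules) left over for unlogged work when the faked log
    overstates the cost of the legitimate computation.

    All three parameters are hypothetical inputs for illustration.
    """
    if actual_legit_ops > claimed_ops:
        raise ValueError("the log would understate the work, not overstate it")
    return (claimed_ops - actual_legit_ops) * joules_per_op


if __name__ == "__main__":
    # The log claims 1e15 operations, but the real "legit" run took 6e14.
    # At an assumed 1 nJ per operation, the remainder is free for scheming.
    print(hidden_budget(10**15, 6 * 10**14, 1e-9))
```

So the energy check only bounds scheming if the detector can also tell that the logged computation is about as efficient as it could have been.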
It could do a bunch of stuff that could be dangerous. It could plan to be dangerously neglectful. Identifying and objecting to that whole spectrum of misbehaviour isn't a trivial matter. I think a big danger is a human reading a legit log of harmful activity and approving of it. "I guess murdering babies is a legit step on this friendly calcu...
Assume you had access to:
1. a self-improving artificial general intelligence (AGI) of at least human-capability or better; and
2. a lie detector that the AGI is known to be susceptible to. A lie for this purpose is any intentional deception.
The AI is not omnipotent. It is running on a physical machine in the real world with limited inputs. One of these inputs is a mechanism by which you can ask it questions and receive answers filtered through the lie detector. A little light on the box turns red if the machine was intentionally deceptive in any way, or green otherwise. As it happens, the lie detector is probabilistic, but the chance of a false negative (a lie escaping detection) is infinitesimal: 1 in 10^40, let's say, but assume you could make it even smaller if you needed to. The chance of a false positive (non-deception being labeled a lie) is effectively zero.
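For a sense of scale, here is a small sketch of how that false-negative rate compounds over many answers. It assumes each lie is checked independently, which the setup above does not actually specify; the function name is hypothetical.

```python
import math


def p_any_escape(p_false_negative, n_lies):
    """Probability that at least one of n_lies deceptive answers slips past
    the detector, assuming independent checks (an assumption made here).

    Computes 1 - (1 - p)^n using log1p/expm1, because the naive formula
    rounds to exactly 0.0 in floating point for p as small as 1e-40.
    """
    return -math.expm1(n_lies * math.log1p(-p_false_negative))


if __name__ == "__main__":
    # Even ten billion attempted lies leave the overall risk negligible.
    print(p_any_escape(1e-40, 10**10))
```

Under that independence assumption the total risk grows roughly linearly with the number of lies, so it stays tiny for any realistic number of questions.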
What could you do with this setup?
EDIT: Some of the details above are provided simply to make the problem tractable. E.g. having the deception detector be probabilistic allows its operating requirements to be less than the AI itself. If the AI is running on a supercomputer, it seems intractable to require a super-supercomputer for the deception detector.
EDIT2: Deception in this case simply means falsifying or omitting information. Loosely speaking, if the AI falsifies the computational traces fed into the deception detector, or leaves out any information in its response, this is detectable. Presumably the UFAI could output a very nice, very convincing plan of action with very tiny fine print hidden somewhere along the lines of "PS: This is all a lie! You implement this and I'm going to turn you all into meaty paperclips. Haha!" and it would get past the deception detector. But I would rather the discussion not get sidetracked by such legalistic, nitpicky scenarios. Assume the humans involved are competent, conservative, responsible people who have set up institutional safeguards to prevent hasty action and make sure that output is sufficiently analyzed down to the very last digital bit by a competent, international team of highly rational people before being acted upon.