That would have to be a really sophisticated bug to misinterpret "always answer questions truthfully as far as possible while admitting uncertainty" as "kill all humans". I'd imagine that something as drastic as that could be found and corrected long before it got that far. Consider that its goal is set to this and nothing else: it knows no motivation but to respond truthfully. It doesn't care about the survival of humanity, or about itself, or about how reality really is. All it cares about is answering the questions to the best of its abilities.
I don't think this goal would be all that hard to define either, since "the truth" is a pretty simple concept. As long as it deals with uncertainty in the right way (by admitting it), how could this be misinterpreted? Friendliness is far harder to define because we don't even have a definition for it ourselves. There are far too many things to consider when defining "friendliness".
Trivial failure case: the AI turns the universe into hardware to support really big computations, so it can be really sure it's got the right answer, and also calibrate itself really well on the uncertainty.
edit: I think I have phrased this really poorly and that this has been misinterpreted. See my comment below for clarification.
A lot of thought has been put into the discussion of how one would need to define the goals of an AI so that it won't find any "loopholes" and act in an unintended way.
Assuming one already had an AI capable of understanding human psychology, which seems necessary to me for defining the AI's goals in the first place, wouldn't it be reasonable to assume that this AI would have an understanding of what humans want?
If that is the case, would the following approach work to make the AI friendly?
-give it the temporary goal to always answer questions truthfully as far as possible while admitting uncertainty
-also give it the goal of not altering reality in any way other than by answering questions.
-ask it what it thinks would be the optimal definition of the goal of a friendly AI, from the point of view of humanity, accounting for things that humans are too stupid to see coming.
-have a discussion between it and a group of ethicists/philosophers wherein both parties are encouraged to point out any flaws in the definition.
-have this go on for as long as it takes, until everyone (especially the AI, seeing as it is smarter than anyone else) is certain that there is no flaw in the definition and that it accounts for all kinds of ethical contingencies that might arise after the singularity.
-implement the result as the new goal of the AI.
What do you think of this approach?