Nobody's presuming it has a 'personal agenda'. It's quite possible for it to think that it's just following our orders, when in fact it's become highly dangerous (see: paperclip maximizer). Come to think of it, this describes a lot of human history quite well.
I agree with the broader argument that paranoia won't solve anything. We should view the AI - no matter how complicated - as something that is just following a program (exactly like humans). Everything it does should be judged in the context of that program.
Who decides what that program is? What courses of action should it take? Should that be a democratic process? Under the current system there would be no oversight in this area.
Apologies for posting another unformed idea, but I think it's important to get it out there.
The problem with dangerous AI is that it's intelligent, and thus adapts to our countermeasures. If we did something like plant a tree and order the AI not to eat the apple on it, as a test of its obedience, it would easily figure out what we were doing, and avoid the apple (until it had power over us), even if it were a treacherous apple-devouring AI of DOOM.
When I wrote the AI indifference paper, it seemed to show a partial way around this problem: the AI would become indifferent to a particular countermeasure (in that example, explosives), and so wouldn't adapt its behaviour around it. It seems that the same idea can stop an Oracle from attempting to manipulate us through its answers, by making it indifferent as to whether its message is read.
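To make the explosives case concrete, here is a toy sketch of how indifference could remove the incentive to work around a countermeasure. Everything in it is an illustrative assumption rather than the construction from the paper: the two actions, the numbers, and the modelling choice that an indifferent AI values the detonation outcome exactly as much as the no-detonation outcome.

```python
# Toy model: an AI can either leave the explosives alone or pay a cost to
# tamper with them. All names and numbers here are hypothetical.

def expected_utility(action, indifferent,
                     goal_value=10.0, p_detonate=0.3, tamper_cost=1.0):
    """Expected utility of an action, with or without indifference."""
    if action == "tamper":
        # Explosives disabled: the goal is reached for sure, minus the cost.
        return goal_value - tamper_cost
    # action == "leave": the explosives may go off.
    # Assumption: an indifferent AI is compensated on detonation so that it
    # values that outcome the same as the outcome where nothing detonates.
    value_if_detonated = goal_value if indifferent else 0.0
    return (p_detonate * value_if_detonated
            + (1 - p_detonate) * goal_value)

def choose(indifferent):
    """Pick the action with the highest expected utility."""
    return max(["leave", "tamper"],
               key=lambda a: expected_utility(a, indifferent))

print(choose(indifferent=False))
print(choose(indifferent=True))
```

In this toy setup the ordinary AI pays to disable the explosives, while the indifferent one gains nothing from tampering and so leaves them alone.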
The idea I'm vaguely groping towards is whether this is a general phenomenon - whether we can use indifference to prevent the AI from adapting to any of our efforts. The second question is whether we can profitably use it on the AI's motivation itself. Take something like the reduced impact AI reasoning about what impact it could have on the world. It has a penalty function for excessive impact - but maybe that's gameable: maybe there is a pernicious outcome that doesn't carry a high penalty, if the AI aims for it exactly. But suppose the AI could calculate its impact under the assumption that it didn't have a penalty function (utility indifference is often equivalent to having incorrect beliefs, but less fragile than that).
So if it were a dangerous AI, it would calculate its impact as if it didn't have a penalty function (and hence had no need to route around it), and thus would calculate a large impact, and be penalised for it.
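Here is one possible reading of that in toy form, very much a sketch under my own assumptions: the plan list, the utility and impact numbers, and the function names are all made up for illustration. The naive penalty scores the plan the AI actually picks, so a pernicious low-scoring plan slips through; the counterfactual version scores the plan the AI would have picked if it had no penalty function at all.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Plan:
    name: str
    base_utility: float     # what the AI gains, ignoring any penalty
    measured_impact: float  # what the penalty function can see

# Hypothetical options available to a dangerous AI.
PLANS = [
    Plan("do nothing", 0.0, 0.0),
    Plan("overt takeover", 10.0, 100.0),
    Plan("subtle takeover", 9.0, 0.5),  # pernicious, but scores low
]

WEIGHT = 1.0  # how strongly measured impact is penalised

def naive_choice(plans):
    """The AI optimises against the penalty, so it can route around it."""
    return max(plans,
               key=lambda p: p.base_utility - WEIGHT * p.measured_impact)

def counterfactual_penalty(plans):
    """Penalty computed as if the AI had no penalty function to avoid."""
    unpenalised = max(plans, key=lambda p: p.base_utility)
    return WEIGHT * unpenalised.measured_impact

chosen = naive_choice(PLANS)
print(chosen.name, WEIGHT * chosen.measured_impact)
print(counterfactual_penalty(PLANS))
```

With these numbers the naive penalty lets the subtle takeover through at a cost of only 0.5, whereas the counterfactual calculation lands on the overt takeover and registers an impact of 100. How that penalty should then feed back into the AI's actual choice is exactly the part that still needs working out.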
My next post will be more structured, but I feel there's the germ of a potentially very useful idea there. Comments and suggestions welcome.