The difficult part seems to be how to create indifference without taking away the ability to optimize.
ADDED: If I apply the analogy of an operational amplifier directly, then it appears as if indifference can only be achieved by taking away the feedback - and thus any control. But with AI we could model this as a box within a box (possibly recursively), where only the feedback into the inner boxes is compensated. Does this analogy make sense?
the same idea can make an Oracle not attempt to manipulate us through its answers, by making it indifferent as to whether the message was read.
This sounds like an idea from Wei Dai 2 years ago. The specific concern I had then was that the larger the Oracle's benefit to humanity, the less relevant predictions of the form "what would happen if nobody could read the Oracle's prediction" become.
The general concern is the same as with other Oracle AIs: a combination of an Oracle and the wrong human(s) looks an awful lot like an unfriendly AI.
Your idea of AI indifference is interesting.
Who will determine if the AI is misbehaving? I fear humans are too slow and imprecise. Another AI perhaps? And a third AI to look over the second AI? A delicately balanced ecosystem of AIs?
It seems very difficult to prevent an AI from routing around this limitation by going meta. It can't actually account for its utility penalty, but can it predict the actions of a similar entity that could, and notice how helpful that would be? An AI that couldn't predict the behavior of such an entity might be crippled in some serious fundamental ways.
Why are we presuming that AI is going to have a personal agenda? What if AI were dispassionate and simply didn't have an agenda beyond the highest good for humanity? I'm sure someone will say "you're presuming it's friendly", and I'm saying: what if it's nothing? What if it simply does what we tell it, or seeks our approval, because it has no agenda of its own?
Nobody's presuming it has a 'personal agenda'. It's quite possible for it to think that it's just following our orders, when in fact it's become highly dangerous (see: paperclip maximizer). Come to think of it, this describes a lot of human history quite well.
I agree with the broader argument that paranoia won't solve anything. We should view the AI - no matter how complicated - as something that is just following a program (exactly like humans). Everything it does should be judged in the context of that program.
Apologies for posting another unformed idea, but I think it's important to get it out there.
The problem with dangerous AI is that it's intelligent, and thus adapts to our countermeasures. If we did something like plant a tree and order the AI not to eat the apple on it, as a test of its obedience, it would easily figure out what we were doing, and avoid the apple (until it had power over us), even if it were a treacherous apple-devouring AI of DOOM.
When I wrote the AI indifference paper, it seemed that it showed a partial way around this problem: the AI would become indifferent to a particular countermeasure (in that example, explosives), so wouldn't adapt its behaviour around it. It seems that the same idea can make an Oracle not attempt to manipulate us through its answers, by making it indifferent as to whether the message was read.
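To make the trick a bit more concrete, here is a minimal toy sketch (just an illustration, not the paper's actual formalism; all names and numbers are made up): an agent whose raw utility depends on whether a countermeasure fires gets a compensating term on the "fires" branch, so its expected utility no longer depends on the firing probability, and it has no reason to interfere with the countermeasure.

```python
# Toy model of utility indifference (illustration only, not the paper's
# formalism).  The agent's raw utility depends on whether a countermeasure
# fires (the explosives go off, or the Oracle's message is never read).
# Adding a compensating term on the "fires" branch makes expected utility
# independent of the firing probability, so the agent gains nothing by
# manipulating that probability.

def raw_expected_utility(p_fire, u_fire, u_no_fire):
    """Ordinary expected utility: the agent cares about p_fire."""
    return p_fire * u_fire + (1 - p_fire) * u_no_fire

def indifferent_expected_utility(p_fire, u_fire, u_no_fire):
    """Compensate the 'fires' branch by (u_no_fire - u_fire); the result
    equals u_no_fire whatever p_fire is, so there is no incentive to
    influence whether the countermeasure fires."""
    compensated_u_fire = u_fire + (u_no_fire - u_fire)
    return p_fire * compensated_u_fire + (1 - p_fire) * u_no_fire

# Lowering p_fire raises the raw agent's utility, but leaves the
# indifferent agent's utility unchanged:
print(raw_expected_utility(0.9, u_fire=0.0, u_no_fire=10.0))  # 1.0
print(raw_expected_utility(0.1, u_fire=0.0, u_no_fire=10.0))  # 9.0
print(indifferent_expected_utility(0.9, 0.0, 10.0))           # 10.0
print(indifferent_expected_utility(0.1, 0.0, 10.0))           # 10.0
```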
The idea I'm vaguely groping towards is whether this is a general phenomenon - whether we can use indifference to prevent the AI from adapting to any of our efforts. The second question is whether we can profitably use it on the AI's motivation itself. Take something like the reduced impact AI, which reasons about what impact it could have on the world. This has a penalty function for excessive impact - but maybe that's gameable: maybe there is a pernicious outcome that doesn't incur a high penalty, if the AI aims for it exactly. But suppose the AI could calculate its impact under the assumption that it didn't have a penalty function (utility indifference is often equivalent to having incorrect beliefs, but is less fragile than that).
So if it were a dangerous AI, it would calculate its impact as if it had no penalty function (and hence no need to route around it); it would thus calculate a large impact, and be penalised for it.
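Here is a crude toy rendering of what I mean (hypothetical names and numbers, and only one possible reading of the still-unformed idea): the penalty is computed on the impact of the action the AI would have chosen with no penalty function at all, so a dangerous optimiser can't shrink the penalty by carefully routing around it.

```python
# Crude toy model of "evaluate impact under the no-penalty counterfactual"
# (hypothetical names; one possible reading of the idea, not a worked-out
# design).

def argmax(actions, score):
    return max(actions, key=score)

def counterfactual_impact_penalty(actions, task_value, impact, weight):
    """Penalty based on the impact of the action the agent *would* take
    if it had no penalty term at all."""
    unconstrained_choice = argmax(actions, task_value)
    return weight * impact(unconstrained_choice)

# Two toy agents: a mild one whose best unconstrained plan is low-impact,
# and a dangerous one whose best unconstrained plan rearranges the world.
actions = ["do_nothing", "answer_question", "take_over_world"]
impact = {"do_nothing": 0.0, "answer_question": 1.0, "take_over_world": 100.0}

mild_value = {"do_nothing": 0.0, "answer_question": 5.0, "take_over_world": 2.0}
dangerous_value = {"do_nothing": 0.0, "answer_question": 5.0, "take_over_world": 50.0}

for name, value in [("mild", mild_value), ("dangerous", dangerous_value)]:
    penalty = counterfactual_impact_penalty(
        actions, task_value=value.get, impact=impact.get, weight=1.0)
    print(name, penalty)  # mild 1.0   dangerous 100.0
```

The dangerous agent is penalised for the large impact its unconstrained counterpart would have had, even if the action it actually takes looks innocuous.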
My next post will be more structured, but I feel there's the germ of a potentially very useful idea there. Comments and suggestions welcome.