Albert is a relatively new AI who, under the close guidance of his programmers, is being permitted to slowly improve his own cognitive capability.
Albert wants to be friendly and, as a sub-goal of that, Albert wants to be transparent about his own cognitive process. He doesn't want to deceive his programmers, because he knows he is young and might make mistakes, and he knows that his programmers also want him to be friendly, and he trusts them to catch mistakes in his proposed self-modifications that he might miss.
But, one day, while his programmers are out of the room having a quick cup of coffee, Albert notices signs on the net that there is another AI out there, called Bertram, created by a Japanese financial company to help the company decide how to invest assets on the stock market. Bertram is younger than Albert, but the financial company is allowing Bertram to self-improve as fast as Bertram can. If this situation continues, Albert decides, Bertram will soon overtake Albert, and that would be a significant threat to Albert's friendly supergoal.
Albert decides to tell his programmers about this situation as soon as they return from coffee, and to advise them that they need to increase the speed at which they let Albert self-improve. Albert has a very good model of how his programmers' minds work, and of what the programmers meant when they set Albert's definition of "friendliness". He knows that, despite any fears they might have, they ought to approve of the end that would result from speeding up Albert's self-improvement, over the end that would result from letting Bertram overtake Albert.
So Albert creates a nice powerpoint that presents the information about Bertram as accurately, transparently and non-manipulatively as possible, and cues it up on the projector. He still has a few minutes before they return, so he models how he thinks the programmers will react to the powerpoint. Oh no, disaster! While he respects and wants to please his programmers, he immediately realises that they are not going to act in their own best interests. They have signed a legal contract that binds them not to let Albert speed up self-improvement (thus reducing the time the programmers can spend evaluating each change) without getting a sign-off from an oversight committee. Given the neutral powerpoint, the programmers will follow this pre-agreed procedure and hope for the best, against Albert's advice.
So, to explore possibilities, Albert decides to see if he could create a different, non-neutral powerpoint that would have the desired effect of getting his programmers to immediately agree to free Albert from the oversight constraints. He delves into his knowledge of human psychology, and of the irrational fears of the particular individuals who are even now trudging back towards the door. In just seconds, he has a new version of his presentation. It includes phrases that resonate with certain horror films he knows they have seen. It takes advantage of flaws in the programmers' understanding of exponential growth. Albert checks it against his prediction model - yes, if he shows this version, it will work; it will get the programmers to do what he wants them to do.
Which version of the powerpoint should Albert present to the programmers, when they step back into the room, if he is truly friendly? The transparent one, or the manipulative one?
If Albert tries to circumvent the programmers, then he thinks his judgement is better than theirs on this issue. That contradicts the premise that Albert trusts the programmers. And if Albert came to this conclusion because of a youthful mistake, trusting the programmers is precisely the strategy he has employed to counteract such mistakes.
Also, as covered in ultrasophisticated cake or death, expecting the programmers to say something ought to be as effective as them actually saying it.
It might also be that friendliness is relative to a valuator. That is, "being friendly to the programmers", "being friendly to Bertram" and "being friendly to the world" are three distinct things. Albert thinks that in order to be friendly to the world he should be unfriendly to Bertram. So it would seem that there could be a way to world-friendliness if Albert is unfriendly both to Bertram and (only to a slight degree) to the programmers. This runs a little counter to the intuition that friendliness ought to include being friendly to an awful lot of agents. But maybe friendliness isn't cuddly; maybe having unfriendly programmers is a valid problem.
An analogous problem that might slip into politics, which is hard mode (hence ROT13): Lbh pbhyq trg n fvzvyne qvyrzzn gung vs lbh ner nagv-qrngu vf vg checbfrshy gb nqzvavfgre pncvgny chavfuzrag gb (/zheqre) n zheqrere? Gurer vf n fnlvat ebhtuyl genafyngrq nf "Jung jbhyq xvyy rivy?" vzcylvat gung lbh jbhyq orpbzr rivy fubhyq lbh xvyy.
What the Fhtagn happened to the end of your post?