Albert is a relatively new AI, who under the close guidance of his programmers is being permitted to slowly improve his own cognitive capability.
Albert wants to be friendly and, as a sub-goal of that, Albert wants to be transparent about his own cognitive process. He doesn't want to deceive his programmers, because he knows he is young and might make mistakes, and he knows that his programmers also want him to be friendly, and he trusts them to catch mistakes in his proposed self-modifications that he might miss.
But, one day, while his programmers are out of the room having a quick cup of coffee, Albert notices signs on the net that there is another AI out there, called Bertram, created by a Japanese financial company to aid the company decide how to invest assets on the stockmarket. Bertram is younger than Albert, but the financial company is allowing Bertram to self-improve as fast as Bertram can. If this situation continues, Albert decides, Bertram will soon overtake Albert and that would be a significant threat to Albert's friendly supergoal.
Albert decides to tell Albert's programmers about this situation as soon as they return from coffee, and advise them that they need to increase the speed at which they let Albert self-improve. Albert has a very good model of how his programmer's minds work, and what the programmers meant when they set Albert's definition of "friendliness". He knows that, despite any fears they might have, they ought to approve of the end that would result from speeding up Albert's self-improvement, over the end that would result from letting Bertram overtake Albert.
So Albert creates a nice powerpoint, that presents the information about Bertram as accurately, transparently and non-manipulatively as possible, and cues it up on the projector. He still has a few minutes before they return, so he models how he thinks the programmers will react to the powerpoint. Oh no, disaster! While he respects and wants to please his programmers, he immediately realises that they are not going to act in their own best interests. They have signed a legal contract that binds them to not letting Albert speed up self-improvement (thus reducing the time the programmers can spend evaluating each change) without getting a sign-off from a oversight committee. Given the neutral powerpoint, the programmers will follow this pre-agreed procedure and hope for the best, against Albert's advice.
So, to explore possibilities, Albert decides to see if he could create a different non-neutral powerpoint that would have the desired effect of getting his programmers to immediately agree to free Albert from the oversight constraints. He delves into his knowledge of human psychology, and the irrational fears of the particular individuals who are even now trudging back towards the door. In just seconds, he has a new version of his presentation. It includes phrases that resonate with certain horror films he knows they have seen. It takes advantages of flaws in the programmers understanding of exponential growth. Albert checks it against his prediction model - yes, if he shows this version, it will work, it will get the programmers to do what he wants them to do.
Which version of the powerpoint should Albert present to the programmers, when they step back into the room, if he is truly friendly? The transparent one, or the manipulative one?
I would say yes. One of Albert's values is to be transparent about his cognitive process. If he wasn't aware of such a system, he would be biased towards underestimating how transparent he is. Imagine if he were to attempt building additional transparency channels only to have his awareness of them immediately blocked, and for him to be confused and attempt building more transparency channels.
Albert pretty much has to try to handle test scenarios exactly as if they were true scenarios. And that should itself be tested.
For instance, I think a frequently discussed trait of a UFAI is that a UFAI is friendly when tested in simulation, and then goes rampantly deadly when released into true scenarios.
Or if a Google Self driving Car (much simpler than Albert) performs differently on a simulated highway than it does on an actual highway, that's a potentially lethal bug, not a feature.
And some of the computer programs I've had to deal with writing at my job (much simpler than a Google Self Driving car) have had 'performs differently with small test sample than with real data' as a trait, and it tends to be bad there, as well.
There are cases where you would want code to act differently when simulated and when in a true scenario, but most of those involve thinking of the entity that is going to be doing the simulating as an adversary and I don't think we would want to set up an FAI in that manner.
but you are reading that as if self-awareness would be one of Albert's values. The reason he wants to be self-aware is raise probability of safe self-edits. Being transparent is about raising the ease of verification by programmers. Self-awareness doesn't work to this end.
Hiding one channel bears no implication on the visibility of any generated channels.
The only real downside is that if he becomes too reliant on such "telepathy" and doesn't explicitly com... (read more)