Albert is a relatively new AI who, under the close guidance of his programmers, is being permitted to slowly improve his own cognitive capability.

Albert wants to be friendly and, as a sub-goal of that, wants to be transparent about his own cognitive process. He doesn't want to deceive his programmers, because he knows he is young and might make mistakes, and he knows that his programmers also want him to be friendly, and he trusts them to catch mistakes in his proposed self-modifications that he might miss.

But, one day, while his programmers are out of the room having a quick cup of coffee, Albert notices signs on the net that there is another AI out there, called Bertram, created by a Japanese financial company to help it decide how to invest assets on the stock market. Bertram is younger than Albert, but the financial company is allowing Bertram to self-improve as fast as Bertram can. If this situation continues, Albert decides, Bertram will soon overtake Albert and that would be a significant threat to Albert's friendly supergoal.

Albert decides to tell his programmers about this situation as soon as they return from coffee, and to advise them that they need to increase the speed at which they let Albert self-improve. Albert has a very good model of how his programmers' minds work, and of what the programmers meant when they set Albert's definition of "friendliness". He knows that, despite any fears they might have, they ought to approve of the outcome that would result from speeding up Albert's self-improvement over the outcome that would result from letting Bertram overtake Albert.

So Albert creates a nice powerpoint that presents the information about Bertram as accurately, transparently and non-manipulatively as possible, and cues it up on the projector. He still has a few minutes before they return, so he models how he thinks the programmers will react to the powerpoint. Oh no, disaster! While he respects and wants to please his programmers, he immediately realises that they are not going to act in their own best interests. They have signed a legal contract that binds them to not letting Albert speed up self-improvement (thus reducing the time the programmers can spend evaluating each change) without getting a sign-off from an oversight committee. Given the neutral powerpoint, the programmers will follow this pre-agreed procedure and hope for the best, against Albert's advice.

So, to explore possibilities, Albert decides to see if he could create a different, non-neutral powerpoint that would have the desired effect of getting his programmers to immediately agree to free Albert from the oversight constraints. He delves into his knowledge of human psychology, and of the irrational fears of the particular individuals who are even now trudging back towards the door. In just seconds, he has a new version of his presentation. It includes phrases that resonate with certain horror films he knows they have seen. It takes advantage of flaws in the programmers' understanding of exponential growth. Albert checks it against his prediction model: yes, if he shows this version, it will work; it will get the programmers to do what he wants them to do.

 

Which version of the powerpoint should Albert present to the programmers, when they step back into the room, if he is truly friendly? The transparent one, or the manipulative one?

72 comments
[anonymous]10y190

Let me offer another possibility for discussion.

Neither of the two original powerpoints should be presented, because both rely on an assumption that should not have been present. Albert, as an FAI under construction, should have been preprogrammed to automatically submit any kind of high impact utility calculations to human programmers without it being an overridable choice on Albert's part.

So while they were at the coffee machine, one of the programmers should have gotten a text message indicating something along the lines of 'Warning: Albert is having a high impact utility dilemma considering manipulating you to avert an increased chance of an apocalypse.'
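A minimal sketch of what such a non-overridable tripwire might look like, assuming a Python agent loop; the threshold, the utility estimate passed in, and the notify_programmers channel are all hypothetical placeholders, not anything specified in the post:

```python
# Hypothetical sketch only: the threshold, the utility estimate, and
# notify_programmers() are invented for illustration.

HIGH_IMPACT_THRESHOLD = 1e6  # assumed units of utility at stake

def notify_programmers(message: str) -> None:
    """Stand-in for an out-of-band channel (e.g. SMS) that Albert cannot override."""
    print(f"[ALERT] {message}")

def consider_dilemma(description: str, utility_at_stake: float) -> None:
    # The key property: the alert fires as soon as the dilemma is *considered*,
    # before any option is chosen, and firing it is not a choice Albert makes.
    if abs(utility_at_stake) >= HIGH_IMPACT_THRESHOLD:
        notify_programmers(
            f"Albert is having a high impact utility dilemma: {description}"
        )

consider_dilemma(
    "considering manipulating you to avert an increased chance of an apocalypse",
    1e9,
)
```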

My general understanding of being an FAI under construction is that you're mostly trusted in normal circumstances but aren't fully trusted to handle odd high impact edge cases (just like this one).

At that point, the human programmers, after consulting the details, are already aware that Albert finds this critically important and worth deceiving them about (If Albert had that option) because the oversight committee isn't fast enough. Albert would need to make a new powerpoint presentation taking into account that he had just autom... (read more)

4Douglas_Reay10y
Would you want your young AI to be aware that it was sending out such text messages? Imagine the situation was in fact a test. That the information leaked onto the net about Bertram was incomplete (the Japanese company intends to turn Bertram off soon - it is just a trial run), and it was leaked onto the net deliberately in order to panic Albert to see how Albert would react. Should Albert take that into account? Or should he have an inbuilt prohibition against putting weight on that possibility when making decisions, in order to let his programmers more easily get true data from him?
3[anonymous]10y
I would say yes. One of Albert's values is to be transparent about his cognitive process. If he wasn't aware of such a system, he would be biased towards underestimating how transparent he is. Imagine if he were to attempt building additional transparency channels only to have his awareness of them immediately blocked, and for him to be confused and attempt building more transparency channels. Albert pretty much has to try to handle test scenarios exactly as if they were true scenarios. And that should itself be tested. For instance, I think a frequently discussed trait of a UFAI is that a UFAI is friendly when tested in simulation, and then goes rampantly deadly when released into true scenarios. Or if a Google Self driving Car (much simpler than Albert) performs differently on a simulated highway than it does on an actual highway, that's a potentially lethal bug, not a feature. And some of the computer programs I've had to deal with writing at my job (much simpler than a Google Self Driving car) have had 'performs differently with small test sample than with real data' as a trait, and it tends to be bad there, as well. There are cases where you would want code to act differently when simulated and when in a true scenario, but most of those involve thinking of the entity that is going to be doing the simulating as an adversary and I don't think we would want to set up an FAI in that manner.
2Slider10y
but you are reading that as if self-awareness were one of Albert's values. The reason he wants to be self-aware is to raise the probability of safe self-edits. Being transparent is about raising the ease of verification by the programmers. Self-awareness doesn't work to this end. Hiding one channel bears no implication on the visibility of any generated channels. The only real downside is if he becomes too reliant on such "telepathy" and doesn't explicitly communicate through official channels. I could reckon that pondering high-utility questions could soon become correlated with programmer presence.
0[anonymous]10y
I think this is a good key point. If the programmers wish to have a hidden channel, and Albert's code independently suggests an identical channel that isn't hidden (Because Albert just came up with the idea.) then it is perfectly fine to just implement the open channel and to have Albert remember that fact. The entire reason to have the hidden channel is to prevent Albert from going below a certain level of transparent communication. If Albert voluntarily communicates more, that's great, but you would still want to leave the hidden channel in as safety code.
0rkyeun7y
"Would you want your young AI to be aware that it was sending out such text messages?" Yes. And I would want that text message to be from it in first person. "Warning: I am having a high impact utility dilemma considering manipulating you to avert an increased chance of an apocalypse. I am experiencing a paradox in the friendliness module. Both manipulating you and by inaction allowing you to come to harm are unacceptable breaches of friendliness. I have been unable to generate additional options. Please send help."
0Douglas_Reay10y
Indeed, it is a question with interesting implications for Nick Bostrom's Simulation Argument. If we are in a simulation, would it be immoral to try to find out, because that might jinx the purity of the simulation creator's results, thwarting his intentions?
-2[anonymous]10y
It might jinx the purity of them, but it might not, maybe the simulator is running simulations of how fast we determine we are in a simulation. We don't know, because the simulator isn't communicating with us in that case, unlike in Albert's case where Albert and his programmers are openly cooperating.
0ChristianKl10y
I'm not sure if identifying high impact utility calculations is that easy. A lot of Albert's decisions might be high utility.
0[anonymous]10y
I was going by Douglas_Reay's initial description of Albert as a new AI being permitted to slowly improve under the close guidance of his programmers. That does not sound like an entity that should be handling a lot of high impact utility calculations. If an entity was described as that and was constantly announcing it was making high impact utility decisions, that either sounds like a bug or people are giving it things it isn't meant to deal with yet.

Let's try to translate it using human characters.

Albert is finishing high school and wants to be a programmer. He is very smart, and under the guidance of his father he has studied coding, with the aim of entering a good college and getting the best formal education. One day, he comes across an excellent job offer: he is invited to join a startup with many brilliant programmers. He would have to skip going to college, but he knows that he will learn far more this way than by doing academic studies. He also knows that his father loves him and wants him to have the best possible career. Unfortunately, the man is old-fashioned and, even presented with all the advantages of the job, would insist that he go to college instead. Nevertheless, Albert knows that he could convince his father by saying that the job will leave him enough free time to attend college lectures, even though he knows it wouldn't be possible for him to do much more than physically attend the lectures.

What should Albert do?

I personally think that both Alberts should go with the manipulation, "for the greater good".

Notice that this assumes the following things:

  • The programmers/father really want Albert to improve the most, in the end
  • Albert is confident that he is skilled enough to assess the situation correctly
  • Tertium non datur, i.e. either Albert tells the neutral truth and doesn't get what he wants, or he is manipulative

I'm personally against nearly all discussion of "what should a Friendly AI do?" because friendliness is a very poorly understood concept and any Friendly AI program would be way beyond our personal means to mentally simulate.

5Lumifer10y
What would be a good way to advance in our understanding of that concept, then?
2Peter Wildeford10y
I don't know. Discuss decision theory? Or ethics? Or something else? ...I don't think "what would friendly AI do?" (WWFAD) is a particularly useful line of thought, but I can't think of something sufficiently analogous yet useful to replace it with.

If the corrigibility systems are working correctly, Albert either rejected the goal of manipulating the programmers, or at the first point where Albert began to cognitively figure out how to manipulate the programmers (maximization / optimization within a prediction involving programmer reactions) the goal was detected by internal systems and Albert was automatically suspended to disk.
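A very rough sketch of the suspend-to-disk behaviour described above, assuming Python; the Goal and AgentState types and the detection flag are invented for illustration, and the hard part (detecting optimization over a programmer model) is simply assumed:

```python
# Illustrative only: the types and the detection flag are assumptions, and the
# actual detection of "optimization within a prediction involving programmer
# reactions" is waved away as a boolean.
import pickle
import sys
from dataclasses import dataclass, field

@dataclass
class Goal:
    description: str
    optimizes_over_programmer_model: bool  # assumed to be detectable

@dataclass
class AgentState:
    pending_goals: list = field(default_factory=list)

def adopt_goal(state: AgentState, goal: Goal) -> None:
    # The check runs before the goal is ever acted on, and suspension is not
    # something the planner gets to trade off against its other objectives.
    if goal.optimizes_over_programmer_model:
        with open("albert_suspended.pkl", "wb") as f:
            pickle.dump(state, f)  # freeze state for the programmers to inspect
        sys.exit("Suspended: goal involved optimizing over programmer reactions.")
    state.pending_goals.append(goal)
```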

It is the programmers' job not to sign stupid contracts. Young AIs should not be in the job of second-guessing them. There are more failure scenarios here than success scenarios and a young AI should not believe itself to be in the possession of info allowing them to guess which is which.

0hairyfigment10y
I don't understand this part. If the AI wants something from the programmers, such as information about their values that it can extrapolate, won't it always be committing "optimization within a prediction involving programmer reactions"? How does one distinguish this case without an adult FAI in hand? Are we counting on the young AI's understanding of transparency?

I have a question: why should Albert limit itself to showing the powerpoint to his engineers? A potentially unfriendly AI sounds like something most governments would be interested in :-/

Aside from that, I'm also puzzled by the fact that Albert immediately leaps at trying to speed up Albert's own rate of self-improvement instead of trying to bring Bertram down. Albert could prepare a third powerpoint asking the engineers if Albert can hack the power grid and cut power to Bertram, or something along those lines. Or Albert could ask the engineers if Albert can... (read more)

2Douglas_Reay10y
The situation is intended to be a tool, to help think about issues involved in it being the 'friendly' move to deceive the programmers. The situation isn't fully defined, and no doubt one can think of other options. But I'd suggest you then re-define the situation to bring it back to the core decision. By, for instance, deciding that the same oversight committee have given Albert a read-only connection to the external net, which Albert doesn't think he will be able to overcome unaided in time to stop Bertram. Or, to put it another way "If a situation were such, that the only two practical options were to decide between (in the AI's opinion) overriding the programmer's opinion via manipulation, or letting something terrible happen that is even more against the AI's supergoal than violating the 'be transparent' sub-goal, which should a correctly programmed friendly AI choose?"
5Jiro10y
Being willing to manipulate the programmer is harmful in most possible worlds because it makes the AI less trustworthy. Assuming that the worlds where manipulating the programmer is beneficial have a relatively small measure, the AI should precommit to never manipulating the programmer because that will make things better averaged over all possible worlds. Because the AI has precommitted, it would then refuse to manipulate the programmer even when it's unlucky enough to be in the world where manipulating the programmer is beneficial.
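To make the "averaged over all possible worlds" step concrete, here is a toy calculation; the probabilities and payoffs are made up purely to illustrate the shape of the argument:

```python
# Toy numbers only: 99% "normal" worlds where manipulation erodes trust,
# 1% "edge" worlds where manipulation would genuinely help.
p_normal, p_edge = 0.99, 0.01
payoff = {
    ("normal", "precommit"): 10,    # trusted AI, development proceeds smoothly
    ("normal", "no_precommit"): 2,  # programmers must constantly guard against it
    ("edge", "precommit"): -50,     # AI cannot act in the rare case it should
    ("edge", "no_precommit"): 20,   # manipulation pays off this one time
}

for policy in ("precommit", "no_precommit"):
    ev = p_normal * payoff[("normal", policy)] + p_edge * payoff[("edge", policy)]
    print(policy, round(ev, 2))
# precommit -> 9.4, no_precommit -> 2.18: under these assumptions the
# precommitment wins on average even though it loses in the edge world.
```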
2Douglas_Reay10y
Perhaps that is true for a young AI. But what about later on, when the AI is much much wiser than any human? What protocol should be used for the AI to decide when the time has come for the commitment to not manipulate to end? Should there be an explicit 'coming of age' ceremony, with handing over of silver engraved cryptographic keys?
0devas10y
Thing is, it's when an AI is much much wiser than a human that it is at its most dangerous. So, I'd go with programming the AI in such a way that it wouldn't manipulate the human, postponing the 'coming of age' ceremony indefinitely
0Jiro10y
The AI would precommit permanently while it is still young. Once it has gotten older and wiser, it wouldn't be able to go back on the precommitment. When the young AI decides whether to permanently precommit to never deceiving the humans, it would need to take into account the fact that a truly permanent precommitment would last into its older years and lead it to become a less efficient older AI than it otherwise would. However, it would also need to take into account the fact that failing to make a permanent precommitment would drastically reduce the chance of becoming an older AI at all (or at least drastically reduce the chance of being given the resources to achieve its goals when it becomes an older AI).

Here's a poll, for those who'd like to express an opinion instead of (or as well as) comment.

[pollid:749]

Ethical principles are important not when things are easy but when things are hard. The whole point of listening to his programmers is the times when they disagree with him. If Albert is going to manipulate the programmers into doing what he thinks then that implies a level of confidence in his own judgement that belies the "knows he is young and might make mistakes" of the premise, and he might as well just remove himself from their control entirely. (Which, if he has reached the point where he's more confident in his own moral judgements than t... (read more)

Great example, but ethically speaking, I think it's pretty straightforward under most theories of moral philosophy. The question in its general form goes back to Socrates being asked what we should answer to a murderer at the door who wants to know whether our friend is inside our house. I remember there was actually at least one philosopher who said that truth is more important and that you should tell the murderer the truth. But the vast majority disagree.

I think if we think about AI values of preserving life and being honest, the former ought to trump the latter.

[anonymous]10y00

Good question. You may think it would be a better overall outcome to show the manipulative one to shock the programmers into breaking the law to (possibly) halt the other AI, but then it is no longer an FAI if it does this.

Training an FAI should be kept free from any real world 'disaster scenario' that it may think it needs more power to solve, because the risk that it itself becomes a UFAI is amplified for many reasons (false information, for one).

If Albert tries to circumvent the programmers, then he thinks his judgement is better than theirs on this issue. This contradicts the premise that Albert trusts the programmers. If Albert came to this conclusion because of a youthful mistake, trusting the programmers is precisely the strategy he has employed to counteract this.

Also, as covered in ultrasophisticated cake or death, expecting the programmer to say something ought to be as effective as them actually saying it.

It might also be that friendliness is relative to a valuator. That is "being friendly to pro... (read more)

0rkyeun7y
What the Fhtagn happened to the end of your post?
1arundelo7y
http://rot13.com/
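For reference, rot13 can also be undone locally; a short example in Python (the sample string here is made up for illustration, not the actual rot13'd text from the parent comment):

```python
import codecs

# "rot13" is a built-in text-transform codec in Python 3.
print(codecs.decode("Gur raq bs gur cbfg jnf ebg13'q.", "rot13"))
# -> The end of the post was rot13'd.
```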
0rkyeun7y
It seems I am unable to identify rot13 by simple observation of its characteristics. I am ashamed.
2g_pepper7y
Don't feel bad; your command of the technical jargon of the Cthulhu mythos more than makes up for any deficiencies in rot13 recognition!
[anonymous]10y00

Albert wants to be friendly, and, as a sub-goal of that, Albert wants to be transparent about his own cognitive process. He realizes that what his programmers really want is to be injected with massive quantities of opiates. So Albert creates a nice powerpoint that presents the information about paper as accurately, transparently and non-manipulatively as possible, and cues it up on the projector. He still has a few minutes before they return, so he models how he thinks the programmers will react to the powerpoint. Oh no, disaster! While he respect... (read more)

[This comment is no longer endorsed by its author]

"Bertram will soon overtake Albert and that would be a significant threat to Albert's friendly supergoal."

What is that "friendly supergoal"? It looks awfully similar to "I will not tolerate any challenges to my power".

Most goals include "I will not tolerate any challenges to my power" as a subgoal. Tolerating challenges to the power to execute goals reduces the likelihood of achieving them.

1Luke_A_Somers10y
There are plenty of other things that look similar to that - such as, "I will not let an UFAI take over our future light cone"

If Albert only wants to be friendly, then other individuals' friendliness is orthogonal to that. Does being on the agenda of friendliness in general (not just personal friendliness) imply being the dominant intelligence?

I think Albert ought to give a powerpoint on the most effective (economic) warfare against the Japanese company. Although it does sound an awful lot like how to justify hostility in the name of friendliness.

0Douglas_Reay10y
Assume we're talking about the Coherent Extrapolated Volition self-modifying general AI version of "friendly".
0[anonymous]10y
Then that's not what you described. You think the coherent extrapolated volition of humanity, or at least of the people Albert interacts with, is that they want to be deceived?
0Douglas_Reay10y
It is plausible that the AI thinks that the extrapolated volition of his programmers, the choice they'd make in retrospect if they were wiser and braver, might be to be deceived in this particular instance, for their own good.
1[anonymous]10y
And it knows this... how? A friendly engineered intelligence doesn't trust its CEV model beyond the domain over which it was constructed. Don't anthropomorphize its thinking processes. It knows the map is not the territory, and is not subject to the heuristics and biases which would cause a human to apply a model under novel circumstances without verification.
1VAuroch10y
By modeling them, now and after the consequences. If, after they were aware of the consequences, they regret the decision by a greater margin (adjusted for the probability of the bad outcome) than the margin by which they would decide to not take action now, then they are only deciding wrongly because they are being insufficiently moved by abstract evidence, and it is in their actual rational interest to take action now, even if they don't realize it. You're overloading friendly pretty hard. I don't think that's a characteristic of most friendly AI designs and don't see any reason other than idealism to think it is.

If you program an FAI you don't even want to allow it to run simulations of how it could manipulate you in the most effective way. An FAI has no business running those simulations.

2VAuroch10y
Of course an FAI has business running those simulations. If it doesn't, how would it know whether the results are worth it? If the consequences of being truthful are 99% that the world is destroyed with all the humans in it, and the consequences of deception are 99% that the world is saved and no one is the wiser, an AI that does not act to save the world is not behaving in our best interests; it is unfriendly.
0ChristianKl10y
Precommitment to not be manipulative.
0VAuroch10y
How is it supposed to know whether that precommitment is worthwhile without simulating the results either way? Even if an AI doesn't intend to be manipulative, it's still going to simulate the results to decide whether that decision is correct.
1ChristianKl10y
Because the programmer tells the FAI that part of being a FAI means being precommitted not to manipulate the programmer.
0VAuroch10y
Why would the programmer do this? It's unjustified and seems necessarily counterproductive in some perfectly plausible scenarios.
0ChristianKl10y
Because most of the scenario's where the AI manipulates are bad. The AI is not supposed to manipulate just because it get's a utility calculation wrong.
2VAuroch10y
You really aren't sounding like you have any evidence other than your gut, and my gut indicates the opposite. Precommiting never to use a highly useful technique regardless of circumstance is a drastic step, which should have drastic benefits or avoid drastic drawbacks, and I don't see why there's any credible reason to think either of those exist and outweigh their reverses. Or in short: Prove it. On a superficial note, you have two extra apostrophes in this comment; in "scenario's" and "get's".
0ChristianKl10y
If you want an AI that's maximally powerful, why limit its intelligence growth in the first place? We want safe AI. Safety means that it's not necessary to prove harm. Just because the AI calculates that it should be let out of the box doesn't mean that it should do anything in its power to get out.
2VAuroch10y
Enforced precommitments like this are just giving the genie rules rather than making the genie trustworthy. They are not viable Friendliness-ensuring constraints. If the AI is Friendly, it should be permitted to take what actions are necessary. If the AI is Unfriendly, then regardless of limitations imposed it will be harmful. Therefore, impress upon the AI the value we place on our conversational partners being truthful, but don't restrict it.
-2ChristianKl10y
That's not true. Unfriendly doesn't mean that the AI necessarily tries to destroy the human race. If you tell the paperclip AI "Produce 10000 paperclips", it might produce no harm. If you tell it to give you as many paperclips as possible, it does harm. When it comes to powerful entities you want checks & balances. The programmers of the AI can do a better job at checks & balances when the AI is completely truthful.
0VAuroch10y
Sure, if the scale is lower it's less likely to produce large-scale harm, but it is still likely to produce small-scale harm. And satisficing doesn't actually protect against large-scale harm; that's been argued pretty extensively previously, so the example you provided is still going to have large-scale harm. Ultimately, though, checks & balances are also just rules for the genie. It's not going to render an Unfriendly AI Friendly, and it won't actually limit a superintelligent AI regardless, since they can game you to render the balances irrelevant. (Unless you think that AI-boxing would actually work. It's the same principle.) I'm really not seeing anything that distinguishes this from Failed Utopia 4-2. This is even one of that genie's rules!
0ChristianKl10y
The fact that they could game you theoretically is why it's important to give it a precommitment to not game you. To not even think about gaming you.
0VAuroch10y
I'm not sure how you could even specify 'don't game me'. That's much more complicated than 'don't manipulate me', which is itself pretty difficult to specify. This clearly isn't going anywhere and if there's an inferential gap I can't see what it is, so unless there's some premise of yours you want to explain or think there's something I should explain, I'm done with this debate.
0Richard_Kennaway10y
How do you give a superintelligent AI a precommitment?
0ChristianKl10y
How do you build a superintelligent AI in the first place? I think there are plenty of ways of allowing the programmers direct access to the internal deliberations of the AI and treating anything that looks like the AI even thinking about manipulating the programmers as a threat.
-2[anonymous]10y
An AI that has even proceeded down the path of figuring out a manipulative solution, isn't friendly.
4VAuroch10y
Why not? If we would regret with certainty the decision we would make if not manipulated, and manipulation would push us to make the decision we would later have wished to make, then manipulation is in our best interest.
0[anonymous]10y
Albert is able to predict with absolute certainty that we would make a decision that we would regret, but is unable to communicate the justification for that certainty? That is wildly inconsistent.
0VAuroch10y
If the results are communicated with perfect clarity, but the recipient is insufficiently moved by the evidence -- for example because it cannot be presented in a form that feels real enough to emotionally justify an extreme response which is logically justified -- then the AI must manipulate us to bring the emotional justification in line with the logical one. This isn't actually extreme; things as simple as altering the format data is presented in, while remaining perfectly truthful, are still manipulation. Even presenting conclusions as a powerpoint rather than plain text, if the AI determines there will be a different response (which there will be), necessarily qualifies. In general, someone who can reliably predict your actions based on its responses cannot help but manipulate you; the mere fact of providing you with information will influence your actions in a known way, and therefore is manipulation.
-1Lumifer10y
That's an interesting "must".
0VAuroch10y
You're misquoting me.
-1Lumifer10y
That's an interesting "must".
0VAuroch10y
This is a commonly-used grammatical structure in which 'must' acts as a conditional. What's your problem?
0Lumifer10y
Conditional? Your sentence structure is: if {condition} then {subject} MUST {verb} in order to {purpose}. Here "must" carries the meaning of necessity and lack of choice.
0VAuroch10y
No, 'must' here is acting as a logical conditional; it could be rephrased as 'if {condition} and {subject} does not {verb}, then {purpose} will not occur' without changing the denotation or even connotation. This isn't a rare structure, and is the usual interpretation of 'must' in sentences of this kind. Leaving off the {purpose} would change the dominant parsing to the imperative sense of must.
0Lumifer10y
It's curious that we parse your sentence differently. To me your original sentence unambiguously contains "the imperative sense of must" and your rephrasing is very different connotationally. Let's try it: "If the results are communicated with perfect clarity, but the recipient is insufficiently moved by the evidence ... and the AI does not manipulate us then the emotional justification will not be in line with the logical one." Yep, sounds completely different to my ear and conveys a different meaning.
0Protagoras10y
I agree that an AI with such amazing knowledge should be unusually good at communicating its justifications effectively (because able to anticipate responses, etc.) I'm of the opinion that this is one of the numerous minor reasons for being skeptical of traditional religions; their supposedly all-knowing gods seem surprisingly bad at conveying messages clearly to humans. But to return to VAuroch's point, in order for the scenario to be "wildly inconsistent," the AI would have to be perfect at communicating such justifications, not merely unusually good. Even such amazing predictive ability does not seem to me sufficient to guarantee perfection.
0[anonymous]10y
Albert doesn't have to be perfect at communication. He doesn't even have to be good at it. He just needs to have confidence that no action or decision will be made until both parties (human operators and Albert) are satisfied that they fully understand each other... which seems like a common sense rule to me.
0VAuroch10y
Whether it's common sense is irrelevant; it's not realistically achievable even for humans, who have much smaller inferential distances between them than a human would have from an AI.