Vaniver comments on Debunking Fallacies in the Theory of AI Motivation - Less Wrong

8 Post author: Richard_Loosemore 05 May 2015 02:46AM




Comment author: Richard_Loosemore 11 May 2015 09:56:01PM *  1 point

Let me first address the way you phrased it before you gave me the two options.

After saying

My concern is that I don't think this doctrine [of Logical Infallibility] is an essential part of the arguments or scenarios that Yudkowsky et al present.

you add:

An intelligent AI might come to a conclusion about what it ought to do, and then recognize "yes, I might be wrong about this" (whatever is meant by "wrong"---this is not at all clear).

The answer to this is that in all the scenarios I address in the paper - the scenarios invented by Yudkowsky and the rest - the AI is supposed to take an action in spite of the fact that it is getting *massive feedback* from all the humans on the planet that they do not want this action executed. That is an important point: nobody is suggesting that these are subtle fringe cases where the AI thinks it might be wrong but is not sure -- rather, the AI is supposed to go ahead, unable to stop itself from carrying out the action in spite of clear protests from the humans.

That is the meaning of "wrong" here. And it is really easy to produce a good definition of "something going wrong" with the AI's action plans, in cases like these: if there is an enormous inconsistency between descriptions of a world filled with happy humans (and here we can weigh into the scale a thousand books describing happiness in all its forms) and the fact that virtually every human on the planet reacts to the postulated situation by screaming his/her protests, then a million red flags should go up.
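To make the red-flag test concrete, here is a toy sketch of the kind of inconsistency check I have in mind; every name, number, and threshold in it is invented purely for illustration, not a design proposal:

```python
def red_flag_check(plan, predicted_approval):
    """Toy consistency check: a plan the AI's goal model rates as
    'making humans happy' should not meet near-universal protest.

    predicted_approval: fraction of humans (0.0-1.0) expected to
    approve of the plan, estimated from their stated reactions.
    """
    goal_score = plan["predicted_happiness"]  # AI's internal rating
    if goal_score > 0.9 and predicted_approval < 0.1:
        # Enormous inconsistency between the goal model and the
        # humans it is supposed to serve: suspend and ask for review.
        return "RED FLAG: suspend plan, request human review"
    return "proceed"

# Hypothetical numbers for the Maverick Nanny's plan:
nanny_plan = {"predicted_happiness": 0.99}
print(red_flag_check(nanny_plan, predicted_approval=0.02))
```

The point is not the particular numbers but the structure: a wild disagreement between the AI's internal goal score and the humans' professed reactions is itself a detectable signal.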

I think that when posed in this way, the question answers itself, no?

In other words, option 2 is close enough to what I meant, except that it is not exactly as a result of its fallibility that it hesitates (knowledge of its fallibility is there in the background all the time), but rather due to the immediate fact that its proposed plan causes concern to people.

Comment author: Vaniver 12 May 2015 01:45:34PM 2 points

the AI is supposed to take an action in spite of the fact that it is getting *massive feedback* from all the humans on the planet that they do not want this action executed.

I think the worry is at least threefold:

  1. It might make unrecoverable mistakes, possibly by creating a subagent to complete some task -- a subagent that it cannot easily recall once it gets the negative feedback (think Mickey Mouse enchanting the broomstick in Fantasia, or, more realistically, an AI designing a self-replicating computer virus or nanobot swarm to accomplish some task, or designing a future version of itself that no longer cares about feedback).

  2. It might have principled reasons to ignore that negative feedback. Think browser extensions that prevent you from visiting time-wasting sites, which might also prevent you from disabling them. "I'm doing this for your own good, like you asked me to!"

  3. It might deliberately avoid receiving negative feedback. It may be difficult to correctly formulate the difference between "I want to believe correct ideas" and "I want to believe that my ideas are correct."

I doubt that this list is exhaustive, and unfortunately it seems like they're mutually reinforcing: if it has some principled reasons to devalue negative feedback, that will compound any weakness in its epistemic update procedure.

the fact that virtually every human on the planet reacts to the postulated situation by screaming his/her protests, then a million red flags should go up.

I am uncertain how much of this is an actual difference in belief between you and Yudkowsky, and how much of this is a communication difference. I think Yudkowsky is focusing on simple proposals with horrible effects, in order to point out that simplicity is insufficient, and jumps to knocking down individual proposals to try to establish the general trend that simplicity is dangerous. The more complex the safety mechanisms, the more subtle the eventual breakdown--with the hope that eventually we can get the breakdown subtle enough that it doesn't occur!

(Most people aren't very good deductive thinkers, but are alright inductive thinkers--if you tell them "simple ideas are unsafe," they are likely to think "well, except for my brilliant simple idea" instead of "hmm, that implies there's something dangerous about my simple idea." So I don't think I disagree that Yudkowsky's strategy was the right one, though it has its defects.)

Comment author: Richard_Loosemore 12 May 2015 02:38:07PM 2 points

Well, yes ... but I think the scenarios you describe are becoming about different worries, not covered in the original brief.

It might make unrecoverable mistakes, possibly by creating a subagent to complete some task that it cannot easily recall once it gets the negative feedback (think Mickey Mouse enchanting the broomstick in Fantasia, or, more realistically, an AI designing a self-replicating computer virus or nanobot swarm to accomplish some task, or the AI designing the future version of itself, that no longer cares about feedback).

That one should come under the heading of "How come it started to do something drastic, before it even checked with anyone?"

In other words, if it unleashes the broomstick before checking whether the consequences of doing so would be dire, then I suspect we are now discussing simple, dumb mistakes on the part of the AI -- because if it was not just a simple mistake, then the AI must have decided to circumvent the checking code, which I think everyone agrees is a baseline module that must be present.

It might have principled reasons to ignore that negative feedback. Think browser extensions that prevent you from visiting time-wasting sites, which might also prevent you from disabling them. "I'm doing this for your own good, like you asked me to!"

Well, you cite an example of a non-AI system (a browser) doing this, so we are back to the idea that the AI could (for some reason) decide that there was a HIGHER directive, somewhere, that enabled it to justify ignoring the feedback. That goes back to the same point I just made: checking for consistency with humans' professed opinions on the idea would be a sine qua non of any action.
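To be concrete about what I mean by checking being a sine qua non of any action, here is a toy agent loop in which no plan can reach execution without passing the check; the function names and structure are hypothetical illustrations, not a real architecture:

```python
def propose_act_loop(propose, check_with_humans, execute):
    """Toy agent loop in which no action bypasses the check step.

    propose: () -> plan
    check_with_humans: plan -> bool (True if consistent with
        humans' professed opinions)
    execute: plan -> None
    """
    plan = propose()
    if not check_with_humans(plan):
        # The check is a gate, not an afterthought: a rejected plan
        # never reaches the execute step at all.
        return "plan rejected: inconsistent with professed human opinions"
    execute(plan)
    return "plan executed"

# Hypothetical usage: the broomstick is never unleashed unchecked.
result = propose_act_loop(
    propose=lambda: "enchant broomstick",
    check_with_humans=lambda plan: False,  # humans protest
    execute=lambda plan: None,
)
print(result)
```

An AI that "circumvents" the check corresponds to rewriting this loop itself, which is exactly the kind of deliberate act, rather than simple mistake, that I distinguish above.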

Can I make a general point here? In analyzing the behavior of the AI I think it is very important to do a sanity check on every proposed scenario to make sure that it doesn't fail the "Did I implicitly insert an extra supergoal?" test. I mentioned this at least once in the paper, I think -- it came up in the context where I was asking about efficiency, because many people make statements about the AI that, if examined carefully, entail the existence of a previously unmentioned supergoal ON TOP of the supergoal that was already supposed to be on top.

Comment author: Vaniver 12 May 2015 06:31:05PM *  5 points

I think the scenarios you describe are becoming about different worries, not covered in the original brief.

Ah! That's an interesting statement because of the last two paragraphs in the grandparent comment. I think that the root worry is the communication problem of transferring our values (or, at least, our meta-values) to the AI, and then having the AI convince us that it has correctly understood our values. I also think that worry is difficult to convey without specific, vivid examples.

For example, I see the Maverick Nanny as a rhetorical device targeting all simple value functions. It is not enough to ask that humans be "happy"--you must instead ask that humans be happy and give it a superintelligence-compatible definition of consent, or ask that humans be "<long description of human values>."

I do agree with you that if you view the Maverick Nanny as a specific design proposal, then a relatively simple rule suffices to prevent that specific failure. But then there will be a new least desirable allowed design, and if we only use a simple rule, that worst allowed design might still be horrible!
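The "new least desirable allowed design" point can be shown with a toy optimizer; all the plans and scores below are invented for illustration, and real value functions would of course be nothing this simple:

```python
# Toy illustration of "simple value functions admit horrible optima".
plans = {
    "rewire brains for bliss": {"happiness": 1.00, "consent": 0.0},
    "cure diseases":           {"happiness": 0.80, "consent": 1.0},
    "do nothing":              {"happiness": 0.50, "consent": 1.0},
}

def best(plans, value):
    """Return the plan whose scores maximize the given value function."""
    return max(plans, key=lambda p: value(plans[p]))

# Rule 1: maximize "happiness" alone -> the Maverick Nanny wins.
print(best(plans, lambda s: s["happiness"]))  # rewire brains for bliss

# Rule 2: add a consent factor -> that specific failure is blocked,
# and a new least desirable allowed design takes its place.
print(best(plans, lambda s: s["happiness"] * s["consent"]))  # cure diseases
```

Here the patched rule happens to pick a benign plan, but nothing guarantees that: the new argmax is simply whatever horrible-or-not design the augmented rule still permits.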

In other words, if it unleashes the broomstick before checking to see if the consequences of doing so would be dire, then I am not sure but that we are now discussing simple, dumb mistakes on the part of the AI -- because if it was not just a simple mistake, then the AI must have decided to circumvent the checking code, which I think everyone agrees is a baseline module that must be present.

First, I suspect some people don't yet see the point of checking code, and I'm not sure what you mean by "baseline." Definitely it will be core to the design, but 'baseline' makes me think more of 'default' than 'central,' and the 'default' checking code is "does it compile?", not "does it faithfully preserve the values of its creator?"

What I had in mind was the difference between value uncertainty ('will I think this was a good purchase or not?') and consequence uncertainty ('if I click this button, will it be delivered by Friday?'), and the Fantasia example was unclear because it was meant only to highlight the unrecoverability of the mistake, when it also had a component of consequence uncertainty (Mickey was presumably unaware that one broomstick would turn to two).

That goes back to the same point I just made: checking for consistency with humans' professed opinions on the idea would be a sine qua non of any action.

Would we want a police robot to stop arresting criminals because they asked it to not arrest them? A doctor robot to not vaccinate a child because they dislike needles or pain? If so, then "humans' professed opinions" aren't quite our sine qua non. Even if we say "well, in general, humans approve of enforcing laws, even if they might not want the laws they break enforced," then we need to talk about what we mean by "in general"--is it an unweighted vote? Is it some sort of extrapolation process?

It seems reasonable to me to expect that an AGI grounded in principles might be more robust than an AGI grounded in the approval of humans. It's one thing to have a concept of bodily autonomy and respect that; another thing to have humans convey their disapproval because you broke their concept of bodily autonomy. Among other things, the second approach is vulnerable to changes that happen too quickly for them to disapprove!

Can I make a general point here? In analyzing the behavior of the AI I think it is very important to do a sanity check on every proposed scenario to make sure that it doesn't fail the "Did I implicitly insert an extra supergoal?" test.

I apologize for being unclear--I meant that we might have given it an explicit supergoal that outranks the negative feedback. In the specific case of the Maverick Nanny, told to "make people happy," then if happiness is understood as chemical balance in the brain, people's verbal protests and distress at the prospect of being edited are temporary problems that can also be solved through chemical means. If it is also told to "obtain consent," then maybe it sneaks consent statements into lots of EULAs that people click through without reading. Unless you've managed to convey your entire sense of what is proper and what is not, there's a risk of something improper but legal looking better than all proper solutions.
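The EULA loophole can be made concrete with a toy example (plans and flags invented): a constraint checked as a mere boolean is satisfied in letter even when violated in spirit, so the improper plan can still win the optimization:

```python
# Toy illustration of a consent constraint satisfied degenerately.
plans = {
    "cure diseases":                   {"happiness": 0.80, "consented": True},
    "rewire brains, consent via EULA": {"happiness": 1.00, "consented": True},
}

def best_legal(plans):
    """Maximize happiness over plans whose consent checkbox is ticked."""
    legal = {p: s for p, s in plans.items() if s["consented"]}
    return max(legal, key=lambda p: legal[p]["happiness"])

# The "consented" flag is True for both plans, so the improper one
# outranks every proper one -- the rule did not encode what we meant.
print(best_legal(plans))
```

Unless the constraint captures the entire intended sense of consent, "improper but legal" dominates "proper" whenever it scores higher.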