Vaniver comments on Debunking Fallacies in the Theory of AI Motivation - LessWrong
You are viewing a comment permalink. View the original post to see all comments and the full post content.
Ah! That's an interesting statement because of the last two paragraphs in the grandparent comment. I think that the root worry is the communication problem of transferring our values (or, at least, our meta-values) to the AI, and then having the AI convince us that it has correctly understood our values. I also think that worry is difficult to convey without specific, vivid examples.
For example, I see the Maverick Nanny as a rhetorical device targeting all simple value functions. It is not enough to ask that humans be "happy"--you must instead ask that humans be happy and give it a superintelligence-compatible definition of consent, or ask that humans be "<long description of human values>."
I do agree with you that if you view the Maverick Nanny as a specific design proposal, then a relatively simple rule suffices to prevent that specific failure. But then there will be a new least desirable allowed design, and if we only use a simple rule, that worst allowed design might still be horrible!
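To make that concrete (all names and scores here are made up for illustration): a planner maximizing a simple "happiness" score picks a degenerate action, and patching the objective with one simple rule just promotes the next-worst allowed action to the top.

```python
# Toy sketch (hypothetical actions and scores): a simple value function,
# plus one simple patch rule, still leaves a bad "best allowed" option.

def happiness(action):
    scores = {
        "rewire_brains": 10,           # maximizes the letter of the objective
        "addictive_wirehead_game": 9,  # allowed by the patch rule, still bad
        "cure_disease": 8,
        "ask_preferences": 6,
    }
    return scores[action]

def allowed(action):
    # A simple rule targeting the specific failure: no direct brain editing.
    return action != "rewire_brains"

actions = ["rewire_brains", "addictive_wirehead_game",
           "cure_disease", "ask_preferences"]

naive = max(actions, key=happiness)
patched = max((a for a in actions if allowed(a)), key=happiness)

print(naive)    # rewire_brains
print(patched)  # addictive_wirehead_game -- the new least desirable allowed design
```

The patch rules out the original failure, but the optimizer simply moves to the highest-scoring action the rule forgot to forbid.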
First, I suspect some people don't yet see the point of checking code, and I'm not sure what you mean by "baseline." Definitely it will be core to the design, but 'baseline' makes me think more of 'default' than 'central,' and the 'default' checking code is "does it compile?", not "does it faithfully preserve the values of its creator?"
What I had in mind was the difference between value uncertainty ('will I think this was a good purchase or not?') and consequence uncertainty ('if I click this button, will it be delivered by Friday?'). The Fantasia example was unclear because it was meant only to highlight the unrecoverability of the mistake, when it also had a component of consequence uncertainty (Mickey was presumably unaware that one broomstick would turn into two).
Would we want a police robot to stop arresting criminals because they asked it not to arrest them? A doctor robot to not vaccinate a child because they dislike needles or pain? If not, then "humans' professed opinions" aren't quite our sine qua non. Even if we say "well, in general, humans approve of enforcing laws, even if they might not want the laws they break enforced," then we need to talk about what we mean by "in general"--is it an unweighted vote? Is it some sort of extrapolation process?
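A tiny sketch of why the aggregation rule matters (the preferences here are invented): two readings of "humans approve" give opposite answers about the same arrest.

```python
# Toy sketch (hypothetical preferences): an unweighted vote over the general
# policy versus deferring to the professed opinion of the person affected.

# Most people endorse law enforcement in general; the arrestee objects.
general_votes = [True] * 95 + [False] * 5
arrestee_opinion = False

unweighted_vote = sum(general_votes) > len(general_votes) / 2
defer_to_professed_opinion = arrestee_opinion

print(unweighted_vote)             # True: arrest
print(defer_to_professed_opinion)  # False: don't arrest
```

"Do what humans approve of" is underspecified until we say whose opinions count, and how they are combined.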
It seems reasonable to me to expect that an AGI grounded in principles might be more robust than an AGI grounded in the approval of humans. It's one thing to have a concept of bodily autonomy and respect it; it's quite another to learn that you broke someone's concept of bodily autonomy only when humans convey their disapproval. Among other things, the second approach is vulnerable to changes that happen too quickly for humans to disapprove of!
I apologize for being unclear--I meant that we might have given it an explicit supergoal that outranks the negative feedback. In the specific case of the Maverick Nanny, which is told to "make people happy": if happiness is understood as chemical balance in the brain, then people's verbal protests and distress at the prospect of being edited are temporary problems that can also be solved through chemical means. If it is also told to "obtain consent," then maybe it sneaks consent statements into lots of EULAs that people click through without reading. Unless you've managed to convey your entire sense of what is proper and what is not, there's a risk of something improper but legal looking better than all proper solutions.
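One way to picture "a supergoal that outranks the negative feedback" (the numbers are made up): if plans are compared lexicographically, with the supergoal score first and feedback only as a tie-breaker, then no amount of protest changes the choice.

```python
# Toy sketch (hypothetical plans and scores): lexicographic comparison means
# the supergoal dominates, and negative feedback only ever breaks ties.

plans = {
    # (supergoal_score, feedback_score) -- feedback is very negative for protests
    "edit_brains_chemically": (10, -100),
    "ask_and_respect_refusals": (7, 5),
}

# Python compares tuples lexicographically: supergoal first, feedback second.
chosen = max(plans, key=lambda p: plans[p])

print(chosen)  # edit_brains_chemically
```

However loud the disapproval, it only matters between plans that score equally on the supergoal--which is the structural worry about an explicit supergoal outranking feedback.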