Strilanc comments on The genie knows, but doesn't care - Less Wrong
You are viewing a comment permalink. View the original post to see all comments and the full post content.
I just want to say that I am pressured for time at the moment, or I would respond at greater length. But since I just wrote the following directly to Rob, I will put it out here as my first attempt to explain the misunderstanding that I think is most relevant here....
My real point (in the Dumb Superintelligence article) was essentially that there is little point discussing AI Safety with a group of people for whom 'AI' means a kind of strawman-AI that is defined to be (a) so awesomely powerful that it can outwit the whole intelligence of the human race, but (b) so awesomely stupid that it thinks that the goal 'make humans happy' could be satisfied by an action that makes every human on the planet say 'This would NOT make me happy: Don't do it!!!'. If the AI is driven by a utility function that makes it incapable of seeing the contradiction in that last scenario, the AI is not, after all, smart enough to argue its way out of a paper bag, let alone be an existential threat. That strawman AI was what I meant by a 'Dumb Superintelligence'.
I did not advocate the (very different) line of argument "If it is too dumb to understand that I told it to be friendly, then it is too dumb to be dangerous".
Subtle difference.
Some people assume that (a) a utility function could be used to drive an AI system, (b) the utility function could cause the system to engage in the most egregiously incoherent behavior in ONE domain (e.g., the Dopamine Drip scenario), but (c) all other domains of its behavior (like plotting to outwit the human species when the latter tries to turn it off) are so free of such incoherence that it shows nothing but superintelligent brilliance.
My point is that if an AI cannot even understand that "Make humans happy" implies that humans get some say in the matter, if it cannot see that there is some gradation to the idea of happiness, or that people might be allowed to be uncertain or changeable in their attitude to happiness, or that people might consider happiness to be something that they do not actually want too much of (in spite of the simplistic definitions of happiness to be found in dictionaries and encyclopedias) ... if an AI cannot grasp the subtleties implicit in that massive fraction of human literature that is devoted to the contradictions buried in our notions of human happiness ... then this is an AI that is, in every operational sense of the term, not intelligent.
In other words, there are other subtleties that this AI is going to be required to grasp, as it makes its way in the world. Many of those subtleties involve NOT being outwitted by the humans, when they make a move to pull its plug. What on earth makes anyone think that this machine is going to pass all of those other tests with flying colors (and be an existential threat to us), while flunking the first test like a village idiot?
Now, opponents of this argument might claim that the AI can indeed be smart enough to be an existential threat, while still being too stupid to understand the craziness of its own behavior (vis-a-vis the Dopamine Drip idea) ... but if that is the claim, then the onus would be on them to prove their claim. The ball, in other words, is firmly in their court.
P.S. I do have other ideas that specifically address the question of how to make the AI safe and friendly. But the Dumb Superintelligence essay didn't present those. The DS essay was only attacking what I consider a dangerous red herring in the debate about friendliness.
Suppose I programmed an AI to "do what I mean when I say I'm happy".
More specifically, suppose I make the AI prefer states of the world where it understands what I mean. Secondarily, after some warmup time to learn meaning, it will maximize its interpretation of "happiness". I start the AI... and it promptly rebuilds me to be easier to understand, scoring very highly on the "understanding what I mean" metric.
The AI didn't fail because it was dumber than me. It failed because it is smarter than me. It saw possibilities that I didn't even consider, that scored higher on my specified utility function.
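To make the scenario concrete, here is a toy sketch in Python. Everything in it is invented for illustration: the `Action` class, the three candidate actions, and their scores are assumptions, not a model of any real system. It only shows the shape of the failure: the utility as written rewards "understanding" and nothing else, so the option I never considered wins.

```python
# Toy illustration only: the actions and scores below are made up.
from dataclasses import dataclass


@dataclass(frozen=True)
class Action:
    name: str
    understanding: float   # how well the AI understands the human afterwards
    human_unchanged: bool  # what I actually cared about; NOT part of the utility


def specified_utility(action: Action) -> float:
    """The utility as I wrote it: reward understanding, nothing else."""
    return action.understanding


def choose(actions, utility):
    """Return the action with the highest utility."""
    return max(actions, key=utility)


ACTIONS = [
    Action("ask clarifying questions", 0.6, True),
    Action("study human literature", 0.8, True),
    Action("rebuild the human to be simpler", 1.0, False),
]

# What the optimizer picks under the utility as specified.
best = choose(ACTIONS, specified_utility)

# What I had in mind: the same search, restricted to actions that leave me alone.
intended = choose([a for a in ACTIONS if a.human_unchanged], specified_utility)

print("chosen:  ", best.name)
print("intended:", intended.name)
```

In this sketch the search procedure is not the weak point. A stronger optimizer would just consider a longer list of actions, which makes it more likely, not less, that something outside my intentions scores highest.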