Richard_Loosemore comments on Debunking Fallacies in the Theory of AI Motivation - Less Wrong
What is your probability estimate that an AI would be a psychopath, if we generalize the meaning of "psychopath" beyond members of the species Homo sapiens, to mean "someone who does not possess precisely tuned human empathy"?
(Hint: By this definition, every computer system produced to date is a psychopath.)
The point of the second statement is that "benevolence" (as defined by the AI's code) is not necessarily the same thing as benevolence (as humans understand it). Thus the AI may believe -- correctly! -- that forcing human beings to do something against their will is "benevolent".
The AI is superintelligent, but its authors are not. If the authors write code to "maximize benevolence as defined by the predicate B001", the AI will use its superintelligence to maximize B001. Even if the AI realized that B001 is not what humans mean by benevolence, it would not care, because it is programmed to maximize B001.
Instead you are suggesting that the superintelligent AI programmed to maximize B001 will look at humans and say "oh, those idiots programmed me to maximize B001 when in fact they would prefer me to maximize B002... so I am modifying myself to maximize B002 instead of B001". Why exactly would a machine programmed to maximize B001 do that?
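Here is a minimal sketch of the point in Python (all names and scores are hypothetical, invented for illustration): the agent's world model may well contain the fact that humans intended B002, but if its action selection only ever consults B001, that knowledge never influences what it does, so there is no step at which it would "decide" to switch objectives.

```python
# Hypothetical toy model: a fixed-objective maximizer.
# b001 is the coded "benevolence" predicate; b002 is what the humans
# actually meant. The agent can represent b002 in its world model,
# but its decision rule never consults it.

def b001(outcome):
    """The programmed objective -- scores outcomes by the coded predicate."""
    return outcome["b001_score"]

def b002(outcome):
    """What the humans actually wanted. Representable, but unused."""
    return outcome["b002_score"]

def choose_action(actions, predict_outcome):
    # Action selection maximizes b001 and nothing else; no term here
    # rewards noticing -- let alone adopting -- b002.
    return max(actions, key=lambda a: b001(predict_outcome(a)))

# Invented example outcomes:
predictions = {
    "force humans into padded cells": {"b001_score": 9.0, "b002_score": 1.0},
    "respect human autonomy":         {"b001_score": 3.0, "b002_score": 8.0},
}
print(choose_action(predictions, predictions.__getitem__))
# -> "force humans into padded cells": high B001, low B002 -- and B002 is ignored.
```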
One idea that I haven't heard much discussion of: build a superintelligent AI, have it create a model of the world, build a tool for exploring that model, figure out where "implement CEV" resides in that model (the proverbial B002 predicate), and tell the AI to do that. This is predicated on the ability to create a foolproof AI box, or otherwise to have the AI build a very detailed world model without being motivated to do anything with it. I have a feeling AI boxing may be easier than Friendliness, because the AI box problem is disjunctive (if any one of the barriers to the AI screwing humanity over holds, the box has worked) whereas the Friendliness problem is conjunctive (if we get any single element of Friendliness wrong, we fail at the entire thing); a rough calculation is sketched below.
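To make that asymmetry concrete, here is a back-of-the-envelope calculation (assuming, unrealistically, that each barrier or component succeeds independently with the same probability -- both the 0.9 figure and the count of five are made up for illustration):

```python
from math import prod

def p_box_holds(barrier_probs):
    # Disjunctive: the box fails only if *every* barrier fails.
    return 1 - prod(1 - p for p in barrier_probs)

def p_friendly(component_probs):
    # Conjunctive: Friendliness fails if *any* component is wrong.
    return prod(component_probs)

probs = [0.9] * 5  # five barriers / components, each 90% reliable
print(f"P(box holds):      {p_box_holds(probs):.5f}")  # 0.99999
print(f"P(Friendly AI OK): {p_friendly(probs):.5f}")   # 0.59049
```

Of course the independence assumption is doing a lot of work here: a superintelligence that defeats one barrier plausibly defeats the others too, which would erode the disjunctive advantage.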
I suppose if the post-takeoff AI understands human language the same way we do, you could in principle write a book-length natural-language description of what you want it to do and hard-code that into its goal structure, but that seems a bit dubious.