Viliam comments on Debunking Fallacies in the Theory of AI Motivation - Less Wrong
You are viewing a comment permalink. View the original post to see all comments and the full post content.
You ask
and you give me a helpful hint:
Well, first please note that ALL artifacts at the present time, including computer systems, cans of beans, and screwdrivers, are psychopaths because none of them are DESIGNED to possess empathy. So your hint contains zero information. :-)
What is the probability that an AI would be a psychopath if someone took the elementary step of designing it to have empathy? Close to zero, assuming the designers knew what empathy was and knew how to design it.
But your question was probably meant to target the situation where someone built an AI and did not bother to give it empathy. I am afraid that is outside the context we are examining here, because all of the scenarios talk about some kind of inevitable slide toward psychopathic behavior, even under the assumption that someone does their best to give the AI an empathic motivation.
But I will answer this: if someone did not even try to give it empathy, that would be like designing a bridge and not even trying to use materials that could hold up a person's weight. In both cases the hypothetical is not interesting, since designing failure into a system is something any old fool could do.
Your second remark is a classic mistake that everyone makes in the context of this kind of discussion. You mention that the phrase "benevolence toward humanity" means "benevolence" as defined by the computer code.
That is incorrect. Let's try, now, to be really clear about that, because if you don't get why it is incorrect we might waste a lot of time running around in circles. It is incorrect for two reasons. First, because I was consciously using the word to refer to the normal human usage, not the implementation inside the AI. Second, it is incorrect because the entire issue in the paper is that there is a discrepancy between the implementation inside the AI and normal usage, and that discrepancy is then examined in the rest of the paper. By simply asserting that the AI may believe, "correctly," that benevolence is the same as violence toward people, you are pre-empting the discussion.
In the remarks you make after that, you are reciting the standard line contained in all the scenarios that the paper is addressing. That standard line is analyzed in the rest of the paper, and a careful explanation is given for why it is incoherent. So when you simply repeat the standard line, you are speaking as if the paper did not actually exist.
I can address questions that refer to the arguments in the paper, but I cannot say anything if you only recite the standard line that is demolished in the course of the paper's argument. So if you could say something about the argument itself.....
Sorry, I admit I do not understand what exactly the argument is. It seems to me it is something like "if we succeed in making the Friendly AI perfect on the first attempt, then we do not have to worry about what could go wrong, because a perfect Friendly AI would not do anything stupid". Which I agree with.
Now the question is: (1) what is the probability that we will not get the Friendly AI perfect on the first attempt, and (2) what happens then? Suppose we got the "superintelligent" and "self-improving" parts right, and the "Friendly" part 90% right...
As to not understanding the argument: that's understandable, because this is a long and dense paper.
If you are trying to summarize the whole paper when you say "if we succeed in making the Friendly AI perfect on the first attempt, then we do not have to worry about what could go wrong, because a perfect Friendly AI would not do anything stupid", then that would not be right. The argument includes a statement resembling that, but only as an aside.
As to your question about what happens next, or what happens if we only get the "Friendly" part 90% correct .... well, you are dragging me off into new territory, because that was not really within the scope of the paper. Don't get me wrong: I like being dragged off into that territory! But there just isn't time to write down and argue the whole domain of AI friendliness all in one sitting.
The preliminary answer to that question is that everything depends on the details of the motivation system design and my feeling (as a designer of AGI motivation systems) is that beyond a certain point the system is self-stabilizing. That is, it will understand its own limitations and try to correct them.
But that last statement tends to get (some other) people inflamed, because they do not realize that it comes within the "swarm relaxation" context, and they misunderstand the manner in which a system would self correct. Although I said a few things about swarm relaxation in the paper, I did not give enough detail to be able to address this whole topic here.
I understand your desire to stick to an exegesis of your own essay, but part of a critical examination of your essay is seeing whether or not it is on point, so these sorts of questions really are "about" your essay.
Regarding your preliminary answer: by "correct" I assume you mean "correctly reflecting the desires of the human supervisors"? (In which case, this discussion feeds into our other thread.)
With the best will in the world, I have to focus on one topic at a time: I do not have the bandwidth to wander across the whole of this enormous landscape.
As to your question: I was using "correct" as a verb, and the meaning was "self-correct" in the sense of bringing the system back to its previously specified course.
In this case it would be about the AI perceiving some aspects of its design that it noticed might cause it to depart from what its goal was nominally supposed to be. In that case it would suggest modifications to correct the problem.