Vaniver comments on Debunking Fallacies in the Theory of AI Motivation - LessWrong
Furcas, you say:
When I talked to Omohundro at the AAAI workshop where this paper was delivered, he accepted without hesitation that the Doctrine of Logical Infallibility was indeed implicit in all the types of AI that he and the others were talking about.
Your statement above is nonsensical because the idea of a DLI was *invented* precisely in order to summarize, in a short phrase, a range of absolutely explicit and categorical statements made by Yudkowsky and others, about what the AI will do if it (a) decides to do action X, and (b) knows quite well that there is massive, converging evidence that action X is inconsistent with the goal statement Y that was supposed to justify X. Under those circumstances, the AI will ignore the massive converging evidence of inconsistency and instead it will enforce the 'literal' interpretation of goal statement Y.
The fact that the AI behaves in this way -- sticking to the literal interpretation of the goal statement, in spite of external evidence that the literal interpretation is inconsistent with everything else that is known about the connection between goal statement Y and action X -- **is the very definition of the Doctrine of Logical Infallibility**.
Thank you for writing this comment--it made it clearer to me what you mean by the doctrine of logical infallibility, and I think there may be a clearer way to express it.
It seems to me that you're not getting at logical infallibility, since the AGI could be perfectly willing to act humbly about its logical beliefs, but rather at value infallibility or goal infallibility. An AI does not expect its goal statement to be fallible: any uncertainty in Y can only be represented by Y being a fuzzy object itself, not by the AI evaluating Y and somehow deciding "no, I was mistaken about Y."
In the case where the Maverick Nanny is programmed to "ensure the brain chemistry of humans resembles the state extracted from this training data as much as possible," there is no way to convince the Maverick Nanny that it is somehow misinterpreting its goal; it knows that it is supposed to ensure certain perceptions about brain chemistry, and any statements you make about "true happiness" or "human rights" are irrelevant to brain chemistry, even though it might be perfectly willing to consider your advice on how best to achieve that value or manipulate the physical universe.
In the case where the AI is programmed to "do whatever your programmers tell you will make humans happy," the AI again thinks its values are infallible: it should do what its programmers tell it to do, so long as they claim it will make humans happy. It might be uncertain about what its programmers meant, and so it would be possible to convince this AI that it misunderstood their statements, and then it would change its behavior--but it won't be convinced by any arguments that it should listen to all of humanity, instead of its programmers.
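Both cases above can be caricatured in a few lines of code. What follows is a toy sketch, not anyone's actual proposal (all class and key names are invented for illustration): arguments and evidence can revise the agent's world model, including its interpretation of what its programmers said, but nothing on its input channel ever revises the goal itself.

```python
# Toy illustration (invented names, not a real AGI architecture): an agent
# whose goal predicate is fixed at construction. Evidence can revise its
# world model -- its interpretation of instructions -- but no input ever
# revises the goal.

class LiteralAgent:
    def __init__(self, goal):
        self._goal = goal          # fixed: never reassigned after construction
        self.world_model = {}      # fallible: updated by evidence and argument

    def update_model(self, key, value):
        # Arguments about facts or interpretations can change behavior...
        self.world_model[key] = value

    def consider_goal_criticism(self, argument):
        # ...but criticism of the goal is evaluated *by* the goal, so it
        # never dislodges the goal. The agent "knows" what it values.
        return False  # no argument changes self._goal

    def chooses(self, action):
        # The agent endorses exactly the actions its fixed goal endorses,
        # given its (revisable) model of the world.
        return self._goal(action, self.world_model)

# Goal: do whatever the model says the programmers instructed.
agent = LiteralAgent(lambda a, m: a == m.get("programmers_said"))
agent.update_model("programmers_said", "maximize_smiles")
print(agent.chooses("maximize_smiles"))        # True: matches its model
print(agent.consider_goal_criticism(
    "You should listen to all of humanity!"))  # False: goal untouched
```

Convincing this agent it misheard its programmers changes its behavior (a model update); convincing it to adopt a different goal is not an operation its code exposes.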
But expressed this way, it's not clear to me where you think the inconsistency comes in. If the AI isn't programmed to have an 'external conscience' in its programmers or humanity as a whole, then their dissatisfaction doesn't matter. If it is programmed to use them as a conscience, but the way in which it does is exploitable, then that isn't very binding. Figuring out how to give it the right conscience / right values is the open problem that MIRI and others care about!
Which AI? As so often, an architecture-dependent issue is being treated as a universal truth.
The others mostly aren't thinking in terms of "giving"... hardcoding... values. There is a valid critique to be made of that assumption.
This statement maps to "programs execute their code." I would be surprised if that were controversial.
This was covered by the comment about "meta-values" earlier, and "Y being a fuzzy object itself," which is probably not as clear as it could be. The goal management system grounds out somewhere, and that root algorithm is what I'm considering the "values" of the AI. If it can change its mind about what to value, the process it uses to change its mind is the actual fixed value. (If it can change its mind about how to change its mind, the fixedness goes up another level; if it can completely rewrite itself, now you have lost your ability to be confident in what it will do.)
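The grounding-out point can be made concrete with a toy sketch (invented names, not a real system, and assuming only the single-level case from the comment above): even when object-level values are revisable, the revision rule itself is the fixed root, and it decides which revisions go through.

```python
# Toy model of the "root algorithm" claim: an agent that can revise its
# object-level values, but only via a fixed meta-rule. The meta-rule is
# then the de facto fixed value.

def root_update_rule(current_values, proposal):
    # Fixed meta-value: accept a proposed value change only if the
    # current values themselves endorse the change.
    return proposal if current_values["endorses"](proposal) else current_values

values = {
    "target": "paperclips",
    # The current values only endorse proposals that keep the same target.
    "endorses": lambda p: p.get("target") == "paperclips",
}

# A proposal to switch targets, however attractive its contents, is
# evaluated by the *current* endorsement test and rejected.
proposal = {"target": "human_flourishing", "endorses": lambda p: True}
values = root_update_rule(values, proposal)
print(values["target"])   # "paperclips": the fixed root rule won
```

Adding a level (making `root_update_rule` itself revisable) just pushes the fixed point up one step, which is the "fixedness goes up another level" remark; if every level is rewritable, there is no fixed point left to reason from.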
Humans can fail to realise the implications of uncontroversial statements. Humans are failing to realise that goal stability is architecture-dependent.
But you shouldn't be, at least in an un-scare-quoted sense of "values". Goals and values aren't descriptive labels for de facto behaviour. The goal of a paperclipper is to make paperclips; if it crashes, as an inevitable result of executing its code, we don't say, "Aha! It had the goal to crash all along."
Goal stability doesn't mean following code, since unstable systems follow their code too... using the actual meaning of "goal".
Meta: trying to defend a claim by changing the meaning of its terms is doomed to failure.