TheAncientGeek comments on Debunking Fallacies in the Theory of AI Motivation - Less Wrong
Comments (343)
What makes you think that? The description in that post is generic enough to describe AIs with compartmentalized goals, AIs without compartmentalized goals, and AIs that don't have explicitly labeled internal goals. It doesn't even require that the AI follow the goal statement, just evaluate it for consistency!
You may find this comment of mine interesting. In short, yes, I do think I see the problem.
I'm sorry, but I can't make sense of this question. I'm not sure what you mean by "efficiency can be substituted for truth," and what you think the relevance of advice to human rationalists is to AI design.
I disagree with this, too! AI systems already exist that are both smart, in that they solve complex and difficult cognitive tasks, and dangerous, in that they make decisions on which significant value rides, and thus poor decisions are costly. As a simple example I'm somewhat familiar with, some radiation treatments for patients are designed by software looking at images of the tumor in the body, and then checked by a doctor. If the software is optimizing for a suboptimal function, then it will not generate the best treatment plans, and patient outcomes will be worse than they could have been.
Now, we don't have any AIs around that seem capable of ending human civilization (thank goodness!), and I agree that's probably because a number of unsolved problems are still unsolved. But it would be nice to have the unknowns mapped out, rather than assuming that wisdom and cleverness go hand in hand. So far, that's not what the history of software looks like to me.
What you said here amounts to the claim that an AI of unspecified architecture will, on noticing a difference between its hardcoded goal and its instrumental knowledge, side with the hardcoded goal:-
Whereas what you say here is that you can make inferences about architecture, or internal workings, based on information about manifest behaviour:-
...but what needed explaining in the first place is the siding with the goal, not the ability to detect a contradiction.
I am finding this comment thread frustrating, and so expect this will be my last reply. But I'll try to make the most of that by trying to write a concise and clear summary:
Loosemore, Yudkowsky, and myself are all discussing AIs that have a goal misaligned with human values that they nevertheless find motivating. (That's why we call it a goal!) Loosemore observes that if these AIs understand concepts and nuance, they will realize that a misalignment between their goal and human values is possible--if they don't realize that, he doesn't think they deserve the description "superintelligent."
Now there are several points to discuss:
Whether or not "superintelligent" is a meaningful term in this context. I think rationalist taboo is a great discussion tool, and so looked for nearby words that would more cleanly separate the ideas under discussion. I think if you say that such designs are not superwise, everyone agrees, and now you can discuss the meat of whether or not it's possible (or expected) to design superclever but not superwise systems.
Whether we should expect generic AI designs to recognize misalignments, or whether such a realization would impact the goal the AI pursues. Neither Yudkowsky nor I think either of those is reasonable to expect--as a motivating example, we are happy to subvert the goals that we infer evolution was directing us towards in order to better satisfy "our" goals. I suspect that Loosemore thinks that viable designs would recognize it, but agrees that in general that recognition does not have to lead to an alignment.
Whether or not such AIs are likely to be made. Loosemore appears pessimistic about the viability of these undesirable AIs and sees cleverness and wisdom as closely tied together. Yudkowsky appears "optimistic" about their viability, thinking that this is the default outcome without special attention paid to goal alignment. It does not seem to me that cleverness, wisdom, or human-alignment are closely tied together, and so it seems easy to imagine a system with only one of those, by straightforward extrapolation from current use of software in human endeavors.
I don't see any disagreement that AIs pursue their goals, which is the claim you thought needed explanation. What I see is disagreement over whether or not the AI can 'partially solve' the problem of understanding goals and pursuing them. We could imagine a Maverick Nanny that hears "make humans happy," comes up with the plan to wirehead all humans, and then rewrites its sensory code to hallucinate as many wireheaded humans as it can (or just tries to stick as large a number as it can into its memory), rather than going to all the trouble of actually wireheading all humans. We can also imagine a Nanny that hears "make humans happy" and actually goes about making humans happy. If the same software underpins both understanding human values and executing plans, what risk is there? But if it's different software, then we have the risk.
If that is supposed to be a universal or generic AI, it is a valid criticism to point out that not all AIs are like that.
If that is supposed to be a particular kind of AI, it is a valid criticism to point out that no realistic AIs are like that.
You seem to feel you are not being understood, but what is being said is not clear.
"Superintelligence" is one of the clearer terms here, IMO. It just means more than human intelligence, and humans can notice contradictions.
This comment seems to be part of a concern about "wisdom", assumed to be some extraneous thing an AI would not necessarily have. (No one but Vaniver has brought in wisdom.) The counterargument is that compartmentalisation between goals and instrumental knowledge is an extraneous thing an AI would not necessarily have, and that its absence is all that is needed for a contradiction to be noticed and acted on.
It's an assumption, and one that needs justification, that any given AI will have goals of a non-trivial sort. "Goal" is a term that needs tabooing.
While we are anthropomorphising, it might be worth pointing out that humans don't show behaviour patterns of relentlessly pursuing arbitrary goals.
Loosemore has put forward a simple suggestion, which MIRI appears not to have considered at all: that on encountering a contradiction, an AI could lapse into a safety mode, if so designed.
You are paraphrasing Loosemore to sound less technical and more handwaving than his actual comments. The ability to sustain contradictions in a system that is constantly updating itself isn't a given... it requires an architectural choice in favour of compartmentalisation.
All this talk of contradictions is sort of rubbing me the wrong way here. There's no "contradiction" in an AI having goals that are different to human goals. Logically, this situation is perfectly normal. Loosemore talks about an AI seeing its goals are "massively in contradiction to everything it knows about <BLAH>", but... where's the contradiction? What's logically wrong with getting strawberries off a plant by burning them?
I don't see the need for any kind of special compartmentalisation; information about "normal use of strawberries" already consists of inert facts with no caring attached by default.
If you're going to program in special criteria that would create caring about this information, okay, but how would such criteria work? How do you stop it from deciding that immortality is contradictory to "everything it knows about death" and refusing to help us solve aging?
In the original scenario, the contradiction is supposed to be between a hardcoded definition of happiness in the AI's goal system, and inferred knowledge in the execution system.