TheAncientGeek comments on Debunking Fallacies in the Theory of AI Motivation - Less Wrong
Furcas, you say:
When I talked to Omohundro at the AAAI workshop where this paper was delivered, he accepted without hesitation that the Doctrine of Logical Infallibility was indeed implicit in all the types of AI that he and the others were talking about.
Your statement above is nonsensical because the idea of a DLI was '''invented''' precisely in order to summarize, in a short phrase, a range of absolutely explicit and categorical statements made by Yudkowsky and others, about what the AI will do if it (a) decides to do action X, and (b) knows quite well that there is massive, converging evidence that action X is inconsistent with the goal statement Y that was supposed to justify X. Under those circumstances, the AI will ignore the massive converging evidence of inconsistency and instead it will enforce the 'literal' interpretation of goal statement Y.
The fact that the AI behaves in this way -- sticking to the literal interpretation of the goal statement, in spite of external evidence that the literal interpretation is inconsistent with everything else that is known about the connection between goal statement Y and action X -- '''IS THE VERY DEFINITION OF THE DOCTRINE OF LOGICAL INFALLIBILITY'''.
MIRI haven't said this is about infallibility. They have said many times and in many ways it is about goals or values...the....genie knows, but doesn't care. The continuing miscommunication is about what goals actually are. It seems obvious to one side that goals include fine-grained information, e.g.
"Make humans happy, and here's a petabyte of information on what that is"
The other side thinks it's obvious that goals are coarse-grained, in the sense of leaving the details open to further investigation (Senesh) or human input (Loosemore).
You are simply repeating the incoherent statements made by MIRI ("it is about goals or values...the....genie knows, but doesn't care") as if those incoherent statements constitute an answer to the paper.
The purpose of the paper is to examine those statements and show that they are incoherent.
It is therefore meaningless to just say "MIRI haven't said this is about infallibility" (the paper gives an abundance of evidence and detailed arguments to show that they have indeed said that) ... but you have not addressed any of the evidence or arguments in the paper; you have just issued a denial, and then repeated the incoherence that was demolished by those arguments.
I am not your enemy; I am orthogonal to you.
I don't think MIRI's goal based answers work, and I wasn't repeating them with the intention that they should sound like they do. Perhaps I should have been stronger on the point.
I also don't think your infallibility-based approach accurately reflects MIRI's position, whatever its merits. You say that you have proved something but I don't see that. It looks to me that you found MIRI's stated argument so utterly unconvincing that their real argument must be something else. But no: they really believe that an AI, however specified, will blindly follow its goals, however defined, however stupid.
Okay, I understand that now.
Problem is, I had to dissect what you said (whether your intention was orthogonal or not) because either way it did contain a significant mischaracterization of the situation.
One thing that is difficult for me to address is statements along the lines of "the doctrine of logical infallibility is something that MIRI have never claimed or argued for...", followed by wordage that shows no clear understanding of how the DLI was defined, and no careful analysis of my definition that demonstrates how and why it is the case that the explanation that I give, to support my claim, is mistaken. What I usually get is just a bare statement that amounts to "no they don't".
You and I are having a variant of one of those discussions, but you might have to bear with me here, because I have had something like 10 others, all doing the same thing in slightly different ways.
Here's the rub. The way that the DLI is defined, it borders on self-evidently true. (How come? Because I defined it simply as a way to summarize a group of pretty-much uncontested observations about the situation. I only wanted to define it for the sake of brevity, really). The question, then, should not so much be about whether it is correct or not, but about why people are making that kind of claim.
Or, from the point of view of the opposition: why the claim is justified, and why the claim does not lead to the logical contradiction that I pointed to in the paper.
Those are worth discussing, certainly. And I am fallible, myself, so I must have made some mistakes, here or there. So with that in mind, I want someone to quote my words back to me, ask some questions for clarification, and see if they can zoom in on the places where my argument goes wrong.
And with all that said, you tell me that:
Can you reflect back what you think I tried to prove, so we can figure out why you don't see it?
ETA
I now see that what you have written subsequently to the OP is that the DLI is almost, but not quite, a description of rigid behaviour as a symptom (with the added ingredient that an AI can see the mistakenness of its behaviour):-
HOWEVER, that doesn't entirely gel with what you wrote in the OP:-
Emph added. Doing dumb things because you think are correct, DLI v1, just isn't the same as realising their dumbness, but being tragically compelled to do them anyway...DLI v2. (And Infallibility is a much more appropriate label for the original idea....the second is more like inevitability.)
Now, you are trying to put your finger on a difference between two versions of the DLI that you think I have supplied.
You have paraphrased the two versions as:
and
I think you are seeing some valid issues here, having to do with how to characterize what exactly it is that this AI is supposed to be 'thinking' when it goes through this process.
I have actually thought about that a lot, too, and my conclusion is that we should not beat ourselves up trying to figure out precisely what the difference might be between these nuanced versions of the idea, because the people who are proposing this idea in the first place have not themselves been clear enough about what is meant.
For example, you talked about "Doing dumb things because you think are correct" .... but what does it mean to say that you 'think' that they are correct? To me, as a human, that seems to entail being completely unaware of the evidence that they might not be correct ("Jill took the ice-cream from Jack because she didn't know that it was wrong to take someone else's ice-cream."). The problem is, we are talking about an AI, and some people talk as if the AI can run its planning engine, then feel compelled to obey the planning engine ... while at the same time being fully cognizant of evidence that the planning engine produced a crappy plan. There is no easy counterpart to that in humans (except for cognitive dissonance, and there we have a case where the human is capable of compartmentalizing its beliefs .... something that is not being suggested here, because we are not forced to make the AI do that). So, since the AI case does not map on to the human case, we are left in a peculiar situation where it is not at all clear that the AI really COULD do what is proposed, and still operate as a successful intelligence.
Or, more immediately, it is not at all clear that we can say about that AI "It did a dumb thing because it 'thought' it was correct."
I should add that in both of my quoted descriptions of the DLI that you gave, I see no substantial difference (beyond those imponderables I just mentioned) and that in both cases I was actually trying to say something very close to the second paraphrase that you gave, namely:
And, don't forget: I am not saying that such an AI is viable at all! Other people are suggesting some such AI, and I am arguing that the design is so logically incoherent that the AI (if it could be made to exist) would call attention to that problem and suggest means to correct it.
Anyhow, the takeaway from this comment is: the people who talk about an AI that exhibits this kind of behavior are actually suggesting a behavior that they have not really thought through carefully, so as a result we can find ourselves walking into a minefield if we go and try to clean up the mess that they left.
If viable means it could be built, I think it could, given a string of assumptions. If viable means it would be built, by competent and benign programmers, I am not so sure.
I actually meant "viable" in the sense of the third of my listed cases of incoherence at: http://lesswrong.com/lw/m5c/debunking_fallacies_in_the_theory_of_ai_motivation/cdap
In other words, I seriously believe that using certain types of planning mechanism you absolutely would get the crazy (to us) behaviors described by all those folks that I criticised in the paper.
Only reason I am not worried about that is: those kinds of planning mechanisms are known to do that kind of random-walk behavior, and it is for that reason that they will never be the basis for a future AGI that makes it up to a level of superintelligence at which the system would be dangerous. An AI that was so dumb that it did that kind of thing all the way through its development would never learn enough about the world to outsmart humanity.
(Which is NOT to say, as some have inferred, that I believe an AI is "dumb" just because it does things that conflict with my value system, etc. etc. It would be dumb because its goal system would be spewing out incoherent behaviors all the time, and that is kinda the standard definition of "dumb").
MIRI distinguishes between terminal and instrumental goals, so there are two answers to the question
Instrumental goals of any kind almost certainly would be revised if they became noticeably out of correspondence to reality, because that would make them less effective at achieving terminal goals, and the raison d'etre of such transient sub-goals is to support the achievement of terminal goals.
By MIRI's reasoning, a terminal goal could be any of a 1000 things other than human happiness, and the same conclusion would follow: an AI with a highest-priority terminal goal wouldn't have any motivation to override it. To be motivated to rewrite a goal because it is false implies a higher-priority goal towards truth. It should not be surprising that an entity that doesn't value truth, in a certain sense, doesn't behave rationally, in a certain sense. (Actually, there is a bunch of supplementary assumptions involved, which I have dealt with elsewhere.)
That's an account of the MIRI position, not a defence of it. It is essentially a model of rational decision making, and there is a gap between it and real-world AI research, a gap which MIRI routinely ignores. The conclusion follows logically from the premises, but atoms aren't pushed around by logic.
That reinforces my point. I was saying that MIRI is basically making armchair assumptions about the AI architectures. You are saying these assumptions aren't merely unjustified, they go against what a competent AI builder would do.
They are clear that they don't mean an AI's rigid behaviour is the result of it assessing its own inferential processes as infallible ... that is what the controversy is all about.
That is just what The Genie Knows but doesn't Care is supposed to answer. I think it succeeds in showing that a fairly specific architecture would behave that way, but fails in its intended goal of showing that this behaviour is universal or likely.
Ummm...
The referents in that sentence are a little difficult to navigate, but no, I'm pretty sure I am not making that claim. :-) In other words, MIRI do not think that.
What is self-evidently true is that MIRI claim a certain kind of behavior by the AI, under certain circumstances .... and all I did was come along and put a label on that claim about the AI behavior. When you put a label on something, for convenience, the label is kinda self-evidently "correct".
I think that what you said here:
... is basically correct.
I had a friend once who suffered from schizophrenia. She was lucid, intelligent (studying for a Ph.D. in psychology) and charming. But if she did not take her medication she became a different person (one day she went up onto the suspension bridge that was the main traffic route out of town and threatened to throw herself to her death 300 feet below. She brought the whole town to a halt for several hours, until someone talked her down.) Now, talking to her in a good moment she could tell you that she knew about her behavior in the insane times - she was completely aware of that side of herself - and she knew that in that other state she would find certain thoughts completely compelling and convincing, even though at this calm moment she could tell you that those thoughts were false. If I say that during the insane period her mind was obeying a "Doctrine That Paranoid Beliefs Are Justified", then all I am doing is labeling that state that governed her during those times.
That label would just be a label, so if someone said "No, you're wrong: she does not subscribe to the DTPBAJ at all", I would be left nonplussed. All I wanted to do was label something that she told me she categorically DID believe, so how can my label be in some sense 'wrong'?
So, that is why some people's attacks on the DLI are a little baffling.
Their criticisms are possibly accurate about the first version, which gives a cause for the rigid behaviour: "it regards its own conclusions as sacrosanct."
I responded before you edited and added extra thoughts .... [processing...]