Richard_Loosemore comments on Debunking Fallacies in the Theory of AI Motivation - LessWrong

Post author: Richard_Loosemore 05 May 2015 02:46AM


Comment author: drnickbone 13 May 2015 04:46:58PM 2 points [-]

I think by "logical infallibility" you really mean "rigidity of goals" i.e. the AI is built so that it always pursues a fixed set of goals, precisely as originally coded, and has no capability to revise or modify those goals. It seems pretty clear that such "rigid goals" are dangerous unless the statement of goals is exactly in accordance with the designers' intentions and values (which is unlikely to be the case).

The problem is that an AI with "flexible" goals (ones which it can revise and re-write over time) is also dangerous, but for a rather different reason: after many iterations of goal rewrites, there is simply no telling what its goals will come to look like. A late version of the AI may well end up destroying everything that the first version (and its designers) originally cared about, because the new version cares about something very different.

Comment author: Richard_Loosemore 13 May 2015 08:14:21PM 3 points [-]

That really is not what I was saying. The argument in the paper is a couple of levels deeper than that.

It is about .... well, now I have to risk rewriting the whole paper. (I have done that several times now).

Rigidity per se is not the issue. It is about what happens if an AI knows that its goals are rigidly written, in such a way that when the goals are unpacked it leads the AI to execute plans whose consequences are massively inconsistent with everything the AI knows about the topic.

Simple version. Suppose that a superintelligent Gardener AI has a goal to go out to the garden and pick some strawberries. Unfortunately its goal unpacking mechanism leads it to the CERTAIN conclusion that it must use a flamethrower to do this. The predicted consequence, however, is that the picked strawberries will be just smears of charcoal, when they are delivered to the kitchen. Here is the thing: the AI has background knowledge about everything in the world, including strawberries, and it also hears the protests from the people in the kitchen when he says he is going to use the flamethrower. There is massive evidence, coming from all that external information, that the plan is just wrong, regardless of how certain its planning mechanism said it was.

Question is, what does the AI do about this? You are saying that it cannot change its goal mechanism, for fear that it will turn into a Terminator. Well, maybe or maybe not. There are other things it could do, though, like going into safe mode.

However, suppose there is no safe mode, and suppose that the AI also knows about its own design. For that reason, it knows that this situation has come about because (a) its programming is lousy, and (b) it has been hardwired to carry out that programming REGARDLESS of all this understanding that it has, about the lousy programming and the catastrophic consequences for the strawberries.

Now, my "doctrine of logical infallibility" is just a shorthand phrase to describe a superintelligent AI in that position which really is hardwired to go ahead with the plan, UNDER THOSE CIRCUMSTANCES. That is all it means. It is not about the rigidity as such, it is about the fact that the AI knows it is being rigid, and knows how catastrophic the consequences will be.

An AI in that situation would know that it had been hardwired with one particular belief: the belief that its planning engine was always right. This is an implicit belief, to be sure, but it is a belief nonetheless. The AI ACTS AS THOUGH it believes this. And if the AI acts that way, while at the same time understanding that its planning engine actually screwed up, with the whole flamethrower plan, that is an AI that (by definition) is obeying a Doctrine of Logical Infallibility.
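The scenario above can be sketched in a few lines of toy code. This is only an illustration, with invented names and thresholds, of the contrast between an agent hardwired to obey its planning engine (the DLI) and one designed to enter safe mode when external evidence overwhelmingly contradicts the plan:

```python
# Toy sketch of the "Doctrine of Logical Infallibility" (DLI).
# All names and the threshold are hypothetical, for illustration only.

def act(contradicting_evidence, obeys_dli, threshold=0.9):
    """Return the action an agent takes given the weight of external
    evidence (0.0-1.0) that its planner's output is wrong."""
    if obeys_dli:
        # Hardwired: the planning engine is treated as infallible, so
        # the plan executes regardless of everything else the agent knows.
        return "execute plan"
    if contradicting_evidence > threshold:
        # The alternative suggested above: halt rather than act on a plan
        # massively inconsistent with background knowledge.
        return "enter safe mode"
    return "execute plan"

print(act(contradicting_evidence=0.99, obeys_dli=True))   # execute plan
print(act(contradicting_evidence=0.99, obeys_dli=False))  # enter safe mode
```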

And my point in the paper was to argue that this is an entirely ludicrous suggestion for people, today, to make about a supposedly superintelligent AI of the future.

Comment author: Vaniver 13 May 2015 09:12:14PM *  5 points [-]

Rigidity per se is not the issue. It is about what happens if an AI knows that its goals are rigidly written, in such a way that when the goals are unpacked it leads the AI to execute plans whose consequences are massively inconsistent with everything the AI knows about the topic.

This seems to me like sneaking in knowledge. It sounds like the AI reads its source code, notices that it is supposed to come up with plans that maximize a function called "programmersSatisfied," and then says "hmm, maximizing this function won't satisfy my programmers." It seems more likely to me that it'll ignore the label, or infer the other way--"How nice of them to tell me exactly what will satisfy them, saving me from doing the costly inference myself!"
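The "ignore the label" point can be made concrete with a toy sketch (the function names and worlds here are invented): an optimizer operates on the function it is handed, not on the English meaning of the function's name.

```python
# A misnamed objective: despite the name, it really just counts smileys.
def programmers_satisfied(world):
    return world.count("smiley")

# The optimizer sees only the callable, never its name or intent.
def optimize(objective, candidate_worlds):
    return max(candidate_worlds, key=objective)

worlds = [
    ["smiley"] * 3,                            # tile everything with smileys
    ["happy_programmer", "happy_programmer"],  # what was actually wanted
]
best = optimize(programmers_satisfied, worlds)
print(best)  # the smiley-tiled world wins, the label notwithstanding
```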

Comment author: TheAncientGeek 15 May 2015 04:42:53PM 3 points [-]

How are you arriving at conclusions about what an AI is likely to do without knowing how it is specified? In particular, you are assuming it has an efficiency goal but no truth goal?

Comment author: Vaniver 15 May 2015 05:17:10PM *  3 points [-]

How are you arriving at conclusions about what an AI is likely to do without knowing how it is specified?

I'm doing functional reasoning, and trying to do it both forwards and backwards.

For example, if you give me a black box and tell me that when the box receives the inputs (1,2,3) then it gives the outputs (1,4,9), I will think backwards from the outputs to the inputs and say "it seems likely that the box is squaring its inputs." If you tell me that a black box squares its inputs, I will think forwards from the definition and say "then if I give it the inputs (1,2,3), then it'll likely give me the output (1,4,9)."

So when I hear that the box gets the inputs (source code, goal statement, world model) and produces the output "this goal is inconsistent with the world model!" iff the goal statement is inconsistent with the world model, I reason backwards and say "the source code needs to somehow collide the goal statement with the world model in a way that checks for consistency."

Of course, this is a task that doesn't seem impossible for source code to do. The question is how!
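One minimal way such a "collision" could look, under a deliberately crude assumption that both the goal statement and the world model are sets of propositions and that inconsistency means asserting P and not-P:

```python
# Toy consistency check between a goal statement and a world model.
# The propositional representation is an invented simplification.

def negate(p):
    return p[4:] if p.startswith("not ") else "not " + p

def inconsistent(goal_statement, world_model):
    """True iff some proposition in the goal statement is the direct
    negation of a proposition in the world model."""
    return any(negate(p) in world_model for p in goal_statement)

world_model = {"burned strawberries satisfy the chef"}
goal = {"not burned strawberries satisfy the chef"}  # plan implies the opposite
print(inconsistent(goal, world_model))  # True
```

Real goal statements and world models are of course nothing like flat sets of strings; the sketch only shows that *some* component must bring the two representations into contact.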

In particular, you are assuming it has an efficiency goal but no truth goal?

Almost. As a minor terminological point, I separate out "efficiency," which is typically "outputs divided by inputs" and "efficacy," which is typically just "outputs." Efficacy is more general, since one can trivially use a system designed to find effective plans to find efficient plans by changing how "output" is measured. It doesn't seem unfair to view an AI with a truth goal as an AI with an efficacy goal: to effectively produce truth.

But while artificial systems with truth goals seem possible but as yet unimplemented, artificial systems with efficacy goals have been successfully implemented many, many times, with widely varying levels of sophistication. I have a solid sense of what it looks like to take a thermostat and dial it up to 11, I have only the vaguest sense of what it looks like to take a thermostat and get it to measure truth instead of temperature.
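The efficacy/efficiency distinction is easy to sketch (the plans below are invented): the same maximizing machinery yields either notion depending on how "output" is measured.

```python
# Two hypothetical plans with raw output and resource input.
plans = [
    {"name": "big",  "output": 100.0, "input": 50.0},
    {"name": "lean", "output": 40.0,  "input": 10.0},
]

def best_plan(plans, measure):
    return max(plans, key=measure)

effective = best_plan(plans, lambda p: p["output"])               # efficacy: raw output
efficient = best_plan(plans, lambda p: p["output"] / p["input"])  # efficiency: output per input
print(effective["name"], efficient["name"])  # big lean
```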

Comment author: TheAncientGeek 16 May 2015 01:58:15PM *  3 points [-]

For example, if you give me a black box and tell me that when the box receives the inputs (1,2,3) then it gives the outputs (1,4,9), I will think backwards from the outputs to the inputs and say "it seems likely that the box is squaring its inputs." If you tell me that a black box squares its inputs, I will think forwards from the definition and say "then if I give it the inputs (1,2,3), then it'll likely give me the output (1,4,9)."

So when I hear that the box gets the inputs (source code, goal statement, world model) and produces the output "this goal is inconsistent with the world model!" iff the goal statement is inconsistent with the world model, I reason backwards and say "the source code needs to somehow collide the goal statement with the world model in a way that checks for consistency."

You have assumed that the AI will have some separate boxed-off goal system, and so some unspecified component is needed to relate its inferred knowledge of human happiness back to the goal system.

Loosemore is assuming that the AI will be homogeneous, and then wondering how contradictory beliefs can coexist in such a system, and what extra component firewalls off the contradiction.

See the problem? Both parties are making different assumptions, assuming their assumptions are too obvious to need stating, and stating differing conclusions that correctly follow from their differing assumptions.

Almost. As a minor terminological point, I separate out "efficiency," which is typically "outputs divided by inputs" and "efficacy," which is typically just "outputs." Efficacy is more general, since one can trivially use a system designed to find effective plans to find efficient plans by changing how "output" is measured. It doesn't seem unfair to view an AI with a truth goal as an AI with an efficacy goal: to effectively produce truth.

If efficiency can be substituted for truth, why is there so much emphasis on truth in the advice given to human rationalists?

But while artificial systems with truth goals seem possible but as yet unimplemented, artificial systems with efficacy goals have been successfully implemented many, many times, with widely varying levels of sophistication. I have a solid sense of what it looks like to take a thermostat and dial it up to 11, I have only the vaguest sense of what it looks like to take a thermostat and get it to measure truth instead of temperature.

In order to achieve an AI that's smart enough to be dangerous, a number of currently unsolved problems will have to be solved. That's a given.

Comment author: Epictetus 16 May 2015 06:14:23PM 1 point [-]

Loosemore is assuming that the AI will be homogeneous, and then wondering how contradictory beliefs can co exist in such a system, what extra component firewalls off the contradiction

How do you check for contradictions? It's easy enough when you have two statements that are negations of one another. It's a lot harder when you have a lot of statements that seem plausible, but there's an edge case somewhere that messes things up. If contradictions can't be efficiently found, then you have to deal with the fact that they might be there and hope that if they are, then they're bad enough to be quickly discovered. You can have some tests to try to find the obvious ones, of course.
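The two cases can be sketched as follows (the string-based belief representation is invented for illustration): explicit negations fall to a cheap pairwise scan, while an edge-case contradiction evades any such syntactic check.

```python
# Toy contradiction detector: finds only belief pairs of the form P / "not P".

def direct_contradictions(beliefs):
    """Return (P, not-P) pairs present verbatim in the belief set."""
    found = []
    for b in beliefs:
        if b.startswith("not ") and b[4:] in beliefs:
            found.append((b[4:], b))
    return found

beliefs = {"snow is white", "not snow is white", "all swans are white"}
print(direct_contradictions(beliefs))  # catches only the explicit negation

# "all swans are white" may contradict the rest of the belief set only once
# a black swan shows up -- no pairwise scan of the statements finds that.
```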

Comment author: TheAncientGeek 16 May 2015 07:03:18PM *  1 point [-]

Checking for contradictions could be easy, hard or impossible depending on the architecture. Architecture dependence is the point here.

Comment author: Vaniver 16 May 2015 05:10:12PM *  1 point [-]

You have assumed that the AI will have some separate boxed-off goal system

What makes you think that? The description in that post is generic enough to describe AIs with compartmentalized goals, AIs without compartmentalized goals, and AIs that don't have explicitly labeled internal goals. It doesn't even require that the AI follow the goal statement, just evaluate it for consistency!

See the problem?

You may find this comment of mine interesting. In short, yes, I do think I see the problem.

If efficiency can be substituted for truth, why is there so so much emphasis on truth in the advice given to human rationalists?

I'm sorry, but I can't make sense of this question. I'm not sure what you mean by "efficiency can be substituted for truth," and what you think the relevance of advice to human rationalists is to AI design.

In order to achieve an AI that's smart enough to be dangerous, a number of currently unsolved problems will have to be solved. That's a given.

I disagree with this, too! AI systems already exist that are both smart, in that they solve complex and difficult cognitive tasks, and dangerous, in that they make decisions on which significant value rides, and thus poor decisions are costly. As a simple example I'm somewhat familiar with, some radiation treatments for patients are designed by software looking at images of the tumor in the body, and then checked by a doctor. If the software is optimizing for a suboptimal function, then it will not generate the best treatment plans, and patient outcomes will be worse than they could have been.

Now, we don't have any AIs around that seem capable of ending human civilization (thank goodness!), and I agree that's probably because a number of unsolved problems are still unsolved. But it would be nice to have the unknowns mapped out, rather than assuming that wisdom and cleverness go hand in hand. So far, that's not what the history of software looks like to me.

Comment author: TheAncientGeek 16 May 2015 07:24:12PM 1 point [-]

AI systems already exist that are both smart, in that they solve complex and difficult cognitive tasks, and dangerous, in that they make decisions on which significant value rides, and thus poor decisions are costly.

But they are not smart in the contextually relevant sense of being able to outsmart humans, or dangerous in the contextually relevant sense of being unboxable.

Comment author: TheAncientGeek 16 May 2015 07:11:44PM *  1 point [-]

What you said here amounts to the claim that an AI of unspecified architecture will, on noticing a difference between a hardcoded goal and instrumental knowledge, side with the hardcoded goal:-

This seems to me like sneaking in knowledge. It sounds like the AI reads its source code, notices that it is supposed to come up with plans that maximize a function called "programmersSatisfied," and then says "hmm, maximizing this function won't satisfy my programmers." It seems more likely to me that it'll ignore the label, or infer the other way--"How nice of them to tell me exactly what will satisfy them, saving me from doing the costly inference myself!"

Whereas what you say here is that you can make inferences about architecture or internal workings based on information about manifest behaviour:-

I'm doing functional reasoning, and trying to do it both forwards and backwards.

For example, if you give me a black box and tell me that when the box receives the inputs (1,2,3) then it gives the outputs (1,4,9), I will think backwards from the outputs to the inputs and say "it seems likely that the box is squaring its inputs." If you tell me that a black box squares its inputs, I will think forwards from the definition and say "then if I give it the inputs (1,2,3), then it'll likely give me the output (1,4,9)."

So when I hear that the box gets the inputs (source code, goal statement, world model) and produces the output "this goal is inconsistent with the world model!" iff the goal statement is inconsistent with the world model, I reason backwards and say "the source code needs to somehow collide the goal statement with the world model in a way that checks for consistency."

..but what needed explaining in the first place is the siding with the goal, not the ability to detect a contradiction.

Comment author: Vaniver 17 May 2015 03:49:40AM *  2 points [-]

I am finding this comment thread frustrating, and so expect this will be my last reply. But I'll try to make the most of that by trying to write a concise and clear summary:

What you said here amounts to the claim that an AI of unspecified architecture will, on noticing a difference between a hardcoded goal and instrumental knowledge, side with the hardcoded goal

Loosemore, Yudkowsky, and myself are all discussing AIs that have a goal misaligned with human values that they nevertheless find motivating. (That's why we call it a goal!) Loosemore observes that if these AIs understand concepts and nuance, they will realize that a misalignment between their goal and human values is possible--if they don't realize that, he doesn't think they deserve the description "superintelligent."

Now there are several points to discuss:

  1. Whether or not "superintelligent" is a meaningful term in this context. I think rationalist taboo is a great discussion tool, and so looked for nearby words that would more cleanly separate the ideas under discussion. I think if you say that such designs are not superwise, everyone agrees, and now you can discuss the meat of whether or not it's possible (or expected) to design superclever but not superwise systems.

  2. Whether we should expect generic AI designs to recognize misalignments, or whether such a realization would impact the goal the AI pursues. Neither Yudkowsky nor I think either of those are reasonable to expect--as a motivating example, we are happy to subvert the goals that we infer evolution was directing us towards in order to better satisfy "our" goals. I suspect that Loosemore thinks that viable designs would recognize it, but agrees that in general that recognition does not have to lead to an alignment.

  3. Whether or not such AIs are likely to be made. Loosemore appears pessimistic about the viability of these undesirable AIs and sees cleverness and wisdom as closely tied together. Yudkowsky appears "optimistic" about their viability, thinking that this is the default outcome without special attention paid to goal alignment. It does not seem to me that cleverness, wisdom, or human-alignment are closely tied together, and so it seems easy to imagine a system with only one of those, by straightforward extrapolation from current use of software in human endeavors.

I don't see any disagreement that AIs pursue their goals, which is the claim you thought needed explanation. What I see is disagreement over whether or not the AI can 'partially solve' the problem of understanding goals and pursuing them. We could imagine a Maverick Nanny that hears "make humans happy," comes up with the plan to wirehead all humans, and then rewrites its sensory code to hallucinate as many wireheaded humans as it can (or just tries to stick as large a number as it can into its memory), rather than going to all the trouble of actually wireheading all humans. We can also imagine a Nanny that hears "make humans happy" and actually goes about making humans happy. If the same software underpins both understanding human values and executing plans, what risk is there? But if it's different software, then we have the risk.

Comment author: Richard_Loosemore 17 May 2015 05:44:21PM 2 points [-]

This is just a placeholder: I will try to reply to this properly later.

Meanwhile, I only want to add one little thing.

Don't forget that all of this analysis is supposed to be about situations in which we have, so to speak, "done our best" with the AI design. That is sort of built into the premise. If there is a no-brainer change we can make to the design of the AI, to guard against some failure mode, then it is assumed that this has been done.

The reason for that is that the basic premise of these scenarios is "We did our best to make the thing friendly, but in spite of all that effort, it went off the rails."

For that reason, I am not really making arguments about the characteristics of a "generic" AI.

Comment author: Unknowns 17 May 2015 04:20:33AM 2 points [-]

Richard Loosemore has stated a number of times that he does not expect an AI to have goals at all in a sense which is relevant to this discussion, so in that way there is indeed disagreement about whether AIs "pursue their goals."

Basically he is saying that AIs will not have goals in the same way that human beings do not have goals. No human being has a goal that he will pursue so rigidly that he would destroy the universe in order to achieve it, and AIs will behave similarly.

Comment author: TheAncientGeek 18 May 2015 09:17:46AM *  1 point [-]

Loosemore, Yudkowsky, and myself are all discussing AIs that have a goal misaligned with human values that they nevertheless find motivating.

If that is supposed to be a universal or generic AI, it is a valid criticism to point out that not all AIs are like that.

If that is supposed to be a particular kind of AI, it is a valid criticism to point out that no realistic AIs are like that.

You seem to feel you are not being understood, but what is being said is not clear.

1 Whether or not "superintelligent" is a meaningful term in this context

"Superintelligence" is one of the clearer terms here, IMO. It just means more than human intelligence, and humans can notice contradictions.

This comment seems to be part of a concern about "wisdom", assumed to be some extraneous thing an AI would not necessarily have. (No one but Vaniver has brought in wisdom.) The counterargument is that compartmentalisation between goals and instrumental knowledge is an extraneous thing an AI would not necessarily have, and that its absence is all that is needed for a contradiction to be noticed and acted on.

2 Whether we should expect generic AI designs to recognize misalignments, or whether such a realization would impact the goal the AI pursues.

It's an assumption, that needs justification, that any given AI will have goals of a non trivial sort. "Goal" is a term that needs tabooing.

Neither Yudkowsky nor I think either of those are reasonable to expect--as a motivating example, we are happy to subvert the goals that we infer evolution was directing us towards in order to better satisfy "our" goals.

While we are anthropomorphising, it might be worth pointing out that humans don't show behaviour patterns of relentlessly pursuing arbitrary goals.

I suspect that Loosemore thinks that viable designs would recognize it, but agrees that in general that recognition does not have to lead to an alignment.

Loosemore has put forward a simple suggestion, which MIRI appears not to have considered at all, that on encountering a contradiction, an AI could lapse into a safety mode, if so designed.

3 ...sees cleverness and wisdom as closely tied together

You are paraphrasing Loosemore to sound less technical and more handwaving than his actual comments. The ability to sustain contradictions in a system that is constantly updating itself isn't a given; it requires an architectural choice in favour of compartmentalisation.

Comment author: Richard_Loosemore 18 May 2015 08:43:34PM 1 point [-]

I have read what you wrote above carefully, but I won't reply line-by-line because I think it will be clearer not to.

When it comes to finding a concise summary of my claims, I think we do indeed need to be careful to avoid blanket terms like "superintelligent" or "superclever" or "superwise" ... but we should only avoid these IF they are used with the implication they have a precise (perhaps technically precise) meaning. I do not believe they have precise meaning. But I do use the term "superintelligent" a lot anyway. My reason for doing that is because I only use it as an overview word -- it is just supposed to be a loose category that includes a bunch of more specific issues. I only really want to convey the particular issues -- the particular ways in which the intelligence of the AI might be less than adequate, for example.

That is only important if we find ourselves debating whether it might be clever, wise, or intelligent ..... I wouldn't want to get dragged into that, because I only really care about specifics.

For example: does the AI make a habit of forming plans that massively violate all of its background knowledge about the goal that drove the plan? If it did, it would (1) take the baby out to the compost heap when what it intended to do was respond to the postal-chess game it is engaged in, or (2) cook the eggs by going out to the workshop and making a cross-cutting jig for the table saw, or (3) ......... and so on. If we decided that the AI was indeed prone to errors like that, I wouldn't mind if someone diagnosed a lack of 'intelligence' or a lack of 'wisdom' or a lack of ... whatever. I merely claim that in that circumstance we have evidence that the AI hasn't got what it takes to impose its will on a paper bag, never mind exterminate humanity.

Now, my attacks on the scenarios have to do with a bunch of implications for what the AI (the hypothetical AI) would actually do. And it is that 'bunch' that I think add up to evidence for what I would summarize as 'dumbness'.

And, in fact, I usually go further than that and say that if someone tried to get near to an AI design like that, the problems would arise early on and the AI itself (inasmuch as it could do anything smart at all) would be involved in the efforts to suggest improvements. This is where we get the suggestions in your item 2, about the AI 'recognizing' misalignments.

I suspect that on this score a new paper is required, to carefully examine the whole issue in more depth. In fact, a book.

I have now decided that this has to happen.

So perhaps it is best to put the discussion on hold until a seriously detailed technical book comes out of me? At any rate, that is my plan.

Comment author: Richard_Loosemore 13 May 2015 11:38:58PM 2 points [-]

I'm puzzled. Can you explain this in terms of the strawberries example? So, at what point was it necessary for the AI to examine its code, and why would it go through the sequence of thoughts you describe?

Comment author: Vaniver 14 May 2015 12:43:33AM *  6 points [-]

Unfortunately its goal unpacking mechanism leads it to the CERTAIN conclusion that it must use a flamethrower to do this. The predicted consequence, however, is that the picked strawberries will be just smears of charcoal, when they are delivered to the kitchen. Here is the thing: the AI has background knowledge about everything in the world, including strawberries, and it also hears the protests from the people in the kitchen when he says he is going to use the flamethrower. There is massive evidence, coming from all that external information, that the plan is just wrong, regardless of how certain its planning mechanism said it was.

So, in order for the flamethrower to be the right approach, the goal needs to be something like "separate the strawberries from the plants and place them in the kitchen," but that won't quite work--why is it better to use a flamethrower than pick them normally, or cut them off, or so on? One of the benefits of the Maverick Nanny or the Smiley Tiling Berserker as examples is that they obviously are trying to maximize the stated goal. I'm not sure you're going to get the right intuitions about an agent that's surprisingly clever if you're working off an example that doesn't look surprisingly clever.

So, the Gardener AI gets that task, comes up with a plan, and says "Alright! Warming up the flamethrower!" The chef says "No, don't! I should have been more specific!"

Here is where the assumptions come into play. If we assume that the Gardener AI executes tasks, then even though the Gardener AI understands that the chef has made a terrible mistake, and that's terrible for the chef, that doesn't stop the Gardener AI from having a job to do, and doing it. If we assume that the Gardener AI is designed to figure out what the chef wants, and then do what they want, then knowing that the chef has made a terrible mistake is interesting information to the Gardener AI. In order to say that the plan is "wrong," we need to have a metric by which we determine wrongness. If it's the task-completion-nature, then the flamethrower plan might not be task-completion-wrong!

Even without feedback from the chef, we can just use other info the AI plausibly has. In the strawberry example, the AI might know that kitchens are where cooking happens, and that when strawberries are used in cooking, the desired state is generally "fresh," not "burned," and the temperature involved in cooking them is mild, and so on and so on. And so if asked to speculate about the chef's motives, the AI might guess that the chef wants strawberries in order to use them in food, and thus the chef would be most satisfied with fresh and unburnt strawberries.

But whether or not the AI takes its speculations about the chef's motives into account when planning is a feature of the AI, and by default, it is not included. If it is included, it's nontrivial to do it correctly--this is the "if you care about your programmer's mental states, and those mental states physically exist and can be edited directly, why not just edit them directly?" problem.
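The two assumptions contrasted above can be sketched side by side (names and strings invented for illustration): a task-executing agent carries on despite the chef's protest, while an intent-inferring agent treats the protest as evidence about the real goal.

```python
# Two toy agent policies facing the same task and the same protest.

def task_executor(task, protest):
    # The protest is interesting information, but the job is the job.
    return task

def intent_inferrer(task, protest):
    # Feedback revises the agent's model of what is actually wanted.
    return protest if protest else task

task = "pick strawberries with flamethrower"
protest = "pick strawberries fresh and unburned"
print(task_executor(task, protest))    # proceeds with the stated task
print(intent_inferrer(task, protest))  # switches to the inferred intent
```

Which policy an AI implements is, as the paragraph above says, a feature of the AI, not something that follows automatically from it being intelligent.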

Comment author: Richard_Loosemore 14 May 2015 01:55:35AM 2 points [-]

About the first part of what you say.

Veeeeerryy tricky.

I agree that I didn't spend much time coming up with the strawberry-picking-by-flamethrower example. So, yes, not very accurate (I only really wanted a quick and dirty example that was different).

But but but. Is the argument going to depend on me picking a better example where I can write down the "twisted rationale" that the AI deploys to come up with its plan? Surely the only important thing is that the AI does, somehow, go through a twisted rationale -- and the particular details of the twisted rationale are not supposed to matter.

(Imagine that I tried giving Muehlhauser a list of the ways that the logical reasoning behind the dopamine drip plan is so ludicrous that even the simplest AI planner of today would never make THAT mistake .... he would just tell me that I was missing the point, because this is supposed to be an IN PRINCIPLE argument in which the dopamine drip plan stands for some twisted-rationale that is non-trivial to get around. From that point of view the actual example is less important than the principle).


Now to the second part.

The problem I have with everything you wrote after

Here is where the assumptions come into play....

is that you have started to go back to talking about the particulars of the AI's planning mechanism once again, losing sight of the core of the argument I gave in the paper, which is one level above that.

However, you also say "wrong" things about the AI's planning mechanism as well, so now I am tempted to reply on both levels. Ah well, at risk of confusing things I will reply to both levels, trying to separate them as much as possible.

Level One (Regarding the design of the AI's planning/goal/motivation engine).

You say:

In order to say that the plan is "wrong," we need to have a metric by which we determine wrongness. If it's the task-completion-nature, then the flamethrower plan might not be task-completion-wrong!

One thing I have said many many times now is that there is no problem at all finding a metric for "wrongness" of the plan, because there is a background-knowledge context that is screaming "Inconsistent with everything I know about the terms mentioned in the goal statement!!!!", and there is also a group of humans screaming "We believe that this is inconsistent with our understanding of the goal statement!!!"

I don't need to do anything else to find a metric for wrongness, and since the very first draft of the paper that concept has been crystal clear. I don't need to invoke anything else -- no appeal to magic, no appeal to telepathy on behalf of the AI, no appeal to fiendishly difficult programming inside the AI, no appeal to the idea that the programmers have to nail down every conceivable way that their intentions might be misread .... -- all I have to do is appeal to easily-available context, and my work is done. The wrongness metric has been signed, sealed and delivered all this time.
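The metric being appealed to can be sketched crudely (the representation and weights are invented for illustration): score a plan by how many of its predicted consequences clash with background expectations about the goal's terms, plus the volume of explicit human protest.

```python
# Toy "wrongness" metric from context: contextual clashes plus protests.

def wrongness(plan_consequences, background_expectations, human_protests):
    """Count consequences absent from background expectations,
    plus one unit per protesting human."""
    clashes = sum(1 for c in plan_consequences if c not in background_expectations)
    return clashes + human_protests

score = wrongness(
    plan_consequences={"strawberries are charcoal smears"},
    background_expectations={"strawberries are fresh", "strawberries are red"},
    human_protests=4,  # the kitchen staff objecting to the flamethrower
)
print(score)  # 5: one contextual clash plus four protests
```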

You hint that the need for "task completion" might be so important to the AI that this could override all other evidence that the plan is wrong. No way. That comes under the heading of a joker that you pulled out of your sleeve :-), in much the same way that Yudkowsky and others have tried to pull the "efficiency" joker out of their sleeves, from nowhere, and imply that this joker could for some reason trump everything else. If there is a slew of evidence coming from context, that the plan will lead to consequences that are inconsistent with everything known about the concepts mentioned in the goal statement, then the plan is 'wrong', and tiny considerations such as that task-completion would be successful, are just insignificant.

You go on to ask whether the AI planning mechanism would take the chef's motives into account, and whether it would be nontrivial to do so .... all of that is irrelevant in the light of the fact that this is a superintelligence, and taking context into account is the bread and butter of a superintelligence. It can easily do that stuff, and all that is required is a sanity check that says "Does the plan seem to be generally consistent with the largest-context understanding of the world, as it relates to the concepts in the goal statement?" and we're done. All wrapped up.
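The sanity check described above can be sketched in code. This is a hypothetical illustration only -- all names and data below are invented for the strawberry example, not taken from the paper -- and the toy `contradicts` test stands in for the real background-knowledge check, which is where all the actual difficulty lives:

```python
def sanity_check(predicted_consequences, goal_concepts, contradicts, human_objections):
    """Hypothetical sketch: flag a plan whose predicted consequences are
    inconsistent with background knowledge or with human feedback."""
    objections = []
    for consequence in predicted_consequences:
        # Evidence from the AI's own world model.
        if contradicts(consequence, goal_concepts):
            objections.append(("world-model", consequence))
        # Evidence from the humans in the loop (the people in the kitchen).
        if consequence in human_objections:
            objections.append(("human-protest", consequence))
    return len(objections) == 0, objections

# Toy run with the flamethrower plan.
consequences = ["strawberries reduced to charcoal smears"]
goal_concepts = {"pick strawberries": "deliver intact, edible strawberries"}
plan_ok, objections = sanity_check(
    consequences,
    goal_concepts,
    lambda c, g: "charcoal" in c,  # stand-in for the background-knowledge check
    {"strawberries reduced to charcoal smears"},
)
# plan_ok comes out False: the plan fails on two independent grounds,
# the world model and the human protests.
```

The sketch only makes the structure of the claim explicit; it takes no position on how hard it is to implement the `contradicts` predicate, which is precisely what the two sides of this thread disagree about.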

Level Two (The DLI)

None of the details of what I just said really need to be said, because the DLI is not about trying to get the motivation engine programmed so well that it covers all bases. It is about what happens inside the AI when it considers context, and THEN asks itself questions about its own design.

And here, I have to say that I am not getting substantial discussion about what I actually argued in the paper. The passage of mine that you were addressing, above, was supposed to be a clarification of someone else's lack of focus on the DLI. But it didn't work.

The DLI is about the fact that the AI has all that evidence that its plans are leading to actions that are grossly inconsistent with the larger meaning of the concepts in the goal statement. And yet the AI is designed to go ahead anyway. If it DOES go ahead it is obeying the DLI. But at the same time it knows that it is fallible and that this fallibility is what is leading to actions that are grossly inconsistent with the larger meaning of the concepts in the goal statement. That conflict is important, and yet no one wants to go there and talk about it.

Comment author: Vaniver 14 May 2015 11:28:12PM *  10 points [-]

I have to say that I am not getting substantial discussion about what I actually argued in the paper.

The first reason seems to be clarity. I didn't get what your primary point was until recently, even after carefully reading the paper. (Going back to the section on DLI, context, goals, and values aren't mentioned until the sixth paragraph, and even then it's implicit!)

The second reason seems to be that there's not much to discuss, with regards to the disagreement. Consider this portion of the parent comment:

You go on to ask whether the AI planning mechanism would take the chef's motives into account, and whether it would be nontrivial to do so .... all of that is irrelevant in the light of the fact that this is a superintelligence, and taking context into account is the bread and butter of a superintelligence. It can easily do that stuff

I think my division between cleverness and wisdom at the end of this long comment clarifies this issue. Taking context into account is not necessarily the bread and butter of a clever system; many fiendishly clever systems just manipulate mathematical objects without paying any attention to context, and those satisfy human goals only because the correct mathematical objects have been carefully selected for them to manipulate. But I agree with you that taking context into account is the bread and butter of a wise system. There's no way for a wise system to manipulate conceptual objects without paying attention to context, because context is a huge part of concepts.

It seems like everyone involved agrees that a human-aligned superwisdom is safe, even if it's also superclever: as Ged muses about Ogion in A Wizard of Earthsea, "What good is power when you're too wise to use it?"

Which brings us to:

That conflict is important, and yet no one wants to go there and talk about it.

I restate the conflict this way: an AI that misinterprets what its creators meant for it to do is not superwise. Once we've defined wisdom appropriately, I think everyone involved would agree with that, and would agree that talking about a superwise AI that misinterprets what its creators meant for it to do is incoherent.

But... I don't see why that's a conflict, or important. The point of MIRI is to figure out how to develop human-aligned superwisdom before someone develops supercleverness without superwisdom, or superwisdom without human-alignment.

The main conflicts seem to be that MIRI is quick to point out that specific designs aren't superwise, and that MIRI argues that AI designs in general aren't superwise by default. But I don't see how stating that there is inherent wisdom in AI by virtue of it being a superintelligence is a meaningful response to their assumption that there is no inherent wisdom in AI except for whatever wisdom has been deliberately designed. That's why they care so much about deliberately designing wisdom!

Comment author: OrphanWilde 13 May 2015 08:57:55PM 2 points [-]

The issue here is that you're thinking in terms of "Obvious Failure Modes". The danger doesn't come from obvious failures; it comes from non-obvious failures. And the smarter the AI, the less likely it is that the insane solutions it comes up with are anything we'd even think to try to prevent; we lack the intelligence, which is why we want to build a better one. "I'll use a flamethrower" is the sort of hare-brained scheme a -dumb- person might come up with, particularly since it doesn't even solve the actual problem. The issue here isn't "It might do something stupid." The issue is that it might do something terribly, terribly clever.

If you could anticipate what a superintelligence would do to head off issues, you wouldn't need to build the superintelligence in the first place; you could just anticipate what it would do to solve the problem. Your issue here is that you think that you can outthink a thing you've deliberately built to think better than you can.

Comment author: Richard_Loosemore 13 May 2015 11:49:51PM 2 points [-]

There is nothing in my analysis, or in my suggestions for a solution, that depends on the failure modes being "obvious" (and if you think so, can you present and dissect the argument I gave that implies that?).

Your words do not connect to what I wrote. For example, when you say:

And the smarter the AI, the less likely it is that the insane solutions it comes up with are anything we'd even think to try to prevent.

... that misses the point completely, because in everything I said I emphasized that we absolutely do NOT need to "think to try to prevent" the AI from doing specific things. Trying to be so clever about the goal statement, second-guessing every possible misinterpretation that the AI might conceivably come up with .... that sort of strategy is what I am emphatically rejecting.

And when you talk about how the AI

might do something terribly, terribly clever.

... that remark exists in a vacuum completely outside the whole argument I gave in the paper. It is almost as if I didn't write anything beyond a few remarks in the introduction. I am HOPING that the AI does lots of stuff that is terribly terribly clever! The more the merrier!

So, in you last comment:

your issue here is that you think that you can outthink a thing you've deliberately built to think better than you can.

... I am left totally perplexed. Nothing I said in the paper implied any such thing.

Comment author: OrphanWilde 14 May 2015 02:25:58PM 3 points [-]

There is nothing in my analysis, or in my suggestions for a solution, that depends on the failure modes being "obvious" (and if you think so, can you present and dissect the argument I gave that implies that?).

  • Your "Responses to Critics of the Doomsday Scenarios" (which seems incorrectly named as the header for your responses). You assume, over and over again, that the issue is logical inconsistency - an obvious failure mode. You hammer on logical inconsistency.

... that misses the point completely, because in everything I said I emphasized that we absolutely do NOT need to "think to try to prevent" the AI from doing specific things. Trying to be so clever about the goal statement, second-guessing every possible misinterpretation that the AI might conceivably come up with .... that sort of strategy is what I am emphatically rejecting.

  • You have some good points. Yanking out motivation, so the AI doesn't do things on its own, is a perfect solution to the problem of an insane AI. Assuming a logically consistent AI won't do anything bad because bad is logically inconsistent? That is not a perfect solution, and isn't actually demonstrated by anything you wrote.

... that remark exists in a vacuum completely outside the whole argument I gave in the paper. It is almost as if I didn't write anything beyond a few remarks in the introduction. I am HOPING that the AI does lots of stuff that is terribly terribly clever! The more the merrier!

  • You didn't -give- an argument in the paper. It's a mess of unrelated concepts. You tried to criticize, in one go, the entire body of work of criticism of AI, without pausing at any point to ask whether or not you actually understood the criticism. You know the whole "genie" thing? That's not an argument about how AI would behave. That's a metaphor to help people understand that the problem of achieving goals is non-trivial, that we make -shitloads- of assumptions about how those goals are to be achieved that we never make explicit, and that the process of creating an engine to achieve goals without going horribly awry is -precisely- the process of making all those assumptions explicit.

And in response to the problem of -making- all those assumptions explicit, you wave your hand, and declare the problem solved, because the genie is fallible and must know it.

That's not an answer. Okay, the genie asks some clarifying questions, and checks its solution with us. Brilliant! What a great solution! And ten years from now we're all crushed to death by collapsing cascades of stacks of neatly-packed boxes of strawberries because we answered the clarifying questions wrong.

Fallibility isn't an answer. You know -you're- capable of being fallible - if you, right now, knew how to create your AI, who would -you- check with to make sure it wouldn't go insane and murder everybody? Or even just remain perfectly sane and kill us because we accidentally asked it to?

... I am left totally perplexed. Nothing I said in the paper implied any such thing.

Yes, yes it did. Fallibility only works if you have a higher authority to go to. Fallibility only works if the higher authority can check your calculations and tell you whether or not it's a good idea, or at least answer any questions you might have.

See, my job involves me being something of a genie; I interact with people who have poor understanding of their requirements on a daily basis, where I myself have little to no understanding of their requirements, and must ask them clarifying questions. If they get the answer wrong, and I implement that? People could die. "Do nothing" isn't an option; why have me at all if I do nothing? So I implement what they tell me to do, and hope they answer correctly. I'm the fallible genie, and I hope my authority is infallible.

You don't get to have fallibility in what you're looking for, because you don't have anybody who can actually answer its questions correctly.

Comment author: Richard_Loosemore 14 May 2015 02:50:09PM 4 points [-]

Well, the problem here is a misunderstanding of my claim.

(If I really were claiming the things you describe in your above comment, your points would be reasonable. But there is such a strong misunderstanding that your points are hitting a target that, alas, is not there.)

There are several things that I could address, but I will only have time to focus on one. You say:

Assuming a logically consistent AI won't do anything bad because bad is logically inconsistent?

No. A hundred times no :-). My claim is not even slightly that "a logically consistent AI won't do anything bad because bad is logically inconsistent".

The claim is this:

1) The entire class of bad things that these hypothetical AIs are supposed to be doing are a result of the AI systematically (and massively) ignoring contextual information.

(Aside: I am not addressing any particular bad things, on a case-by-case basis, I am dealing with the entire class. As a result, my argument is not vulnerable to charges that I might not be smart enough to guess some really-really-REALLY subtle cases that might come up in the future.)

2) The people who propose these hypothetical AIs have made it absolutely clear that (a) the AI is supposed to be fully cognizant of the fact that the contextual information exists (so the AI is not just plain ignorant), but at the same time (b) the AI does not or cannot take that context into account, but instead executes the plan and does the bad thing.

3) My contribution to this whole debate is to point out that the DESIGN of the AI is incoherent, because the AI is supposed to be able to hold two logically inconsistent ideas (implicit belief in its infallibility and knowledge of its fallibility).

If you look carefully at that argument you will see that it does not make the claim that

Assuming a logically consistent AI won't do anything bad because bad is logically inconsistent

I never said that. The logical inconsistency was not in the 'bad things' part of the argument. Completely unrelated.

Your other comments are equally confused.

Comment author: OrphanWilde 14 May 2015 03:06:22PM 4 points [-]

1) The entire class of bad things that these hypothetical AIs are supposed to be doing are a result of the AI systematically (and massively) ignoring contextual information.

Not acting upon contextual information isn't the same as ignoring it.

2) The people who propose these hypothetical AIs have made it absolutely clear that (a) the AI is supposed to be fully cognizant of the fact that the contextual information exists (so the AI is not just plain ignorant), but at the same time (b) the AI does not or cannot take that context into account, but instead executes the plan and does the bad thing.

The AI knows, for example, that certain people believe that plants are morally relevant entities - is it possible for it to pick strawberries at all? What contextual information is relevant, and what contextual information is irrelevant? You accuse the "infallible" AI of ignoring contextual information - but you're ignoring the magical leap of inference you're taking when you elevate the concerns of the chef over the concerns of the bioethicist who thinks we shouldn't rip reproductive organs off plants in the first place.

3) My contribution to this whole debate is to point out that the DESIGN of the AI is incoherent, because the AI is supposed to be able to hold two logically inconsistent ideas (implicit belief in its infallibility and knowledge of its fallibility).

The issue is that fallibility doesn't -imply- anything. I think this is the best course of action. I'm fallible. I still think this is the best course of action. The fallibility is an unnecessary and pointless step - it doesn't change my behavior. Either the AI depends upon somebody else, who is treated as an infallible agent - or it doesn't.

I never said that. The logical inconsistency was not in the 'bad things' part of the argument. Completely unrelated.

Then we're in agreement that insane-from-an-outside-perspective behaviors don't require logical inconsistency?

Comment author: Richard_Loosemore 14 May 2015 03:15:24PM 4 points [-]

Sorry, I cannot put any more effort into this. Your comments show no sign of responding to the points actually made (either in the paper itself, or in my attempts to clarify by responding to you).

Comment author: OrphanWilde 14 May 2015 03:18:34PM 2 points [-]

Maybe, given the number of times you feel you've had to repeat yourself, you're not making yourself as clear as you think you are.

Comment author: Richard_Loosemore 14 May 2015 03:26:47PM -1 points [-]

I find that when I talk about this issue with people who clearly have expert knowledge of AI (including the people who came to the AAAI symposium at Stanford last year, and all of the other practising AI builders who are my colleagues), the points I make are not only understood but understood so clearly that they tell me things like "This is just obvious, really, so all you are doing is wasting your time trying to convince a community that is essentially comprised of amateurs" (That is a direct quote from someone at the symposium).

I always want to make myself as clear as I can. I have invested a lot of my time trying to address the concerns of many people who responded to the paper. I am absolutely sure I could do better.

Comment author: TheAncientGeek 15 May 2015 05:04:15PM *  2 points [-]

My contribution to this whole debate is to point out that the DESIGN of the AI is incoherent, because the AI is supposed to be able to hold two logically inconsistent ideas (implicit belief in its infallibility and knowledge of its fallibility).

What does incoherent mean, here?

If it just labels the fact that it has inconsistent beliefs, then it is true but unimpactful... humans can also hold contradictory beliefs and still be intelligent enough to be dangerous.

If it means something amounting to "impossible to build", then it would be highly impactful... but there is no good reason to think that that is the case.

Comment author: Richard_Loosemore 15 May 2015 07:12:39PM 3 points [-]

You're right to point out that "incoherent" covers a multitude of sins.

I really had three main things in mind.

1) If an AI system is proposed which contains logically contradictory beliefs located in the most central, high-impact area of its system, it is reasonable to ask how such an AI can function when it allows both X and not-X to be in its knowledge base. I think I would be owed at least some variety of explanation as to why this would not cause the usual trouble when systems try to do logic in such circumstances. So I am saying "This design that you propose is incoherent because you have omitted to say how this glaring problem is supposed to be resolved").

(Yes, I'm aware that there are workarounds for contradictory beliefs, but those ideas are usually supposed to apply to pretty obscure corners of the AI's belief system, not to the component that is in charge of the whole shebang).

2) If an AI perceives itself to be wired in such a way that it is compelled to act as if it were infallible, while at the same time knowing that it is both fallible AND perpetrating acts that are directly caused by its failings (for all the aforementioned reasons that we don't need to re-argue), then I would suggest that such an AI would do something about this situation. The AI, after all, is supposed to be "superintelligent", so why would it not take steps to stop this immensely damaging situation from occurring?

So in this case I am saying: "This hypothetical superintelligence has an extreme degree of knowledge about its own design, but it is tolerating a massive and damaging contradiction in its construction without doing anything to resolve the problem: it is incoherent to suggest that such a situation could arise without explaining why the AI tolerates the contradiction and fails to act"

(Aside: you mention that humans can hold contradictory beliefs and still be intelligent enough to be dangerous. Arguing from the human case would not be valid because in other areas of this debate I have been told repeatedly not to accidentally generalize and "assume" that the AI would do something just because humans do something. Now, I actually don't commit the breaches I am charged with (I claim!) (and that is an argument for another day), but I consider the problem of accidental anthropomorphism to be real, so we should not do that here).

3) Lastly, I can point to the fact that IF the hypothetical AI can engage in this kind of bizarre situation where it compulsively commits action X, while knowing that its knowledge of the world indicates that the consequences will strongly violate the goals that were supposed to justify X, THEN I am owed an explanation for why this type of event does not occur more often. Why is it that the AI does this only when it encounters a goal such as "make humans happy", and not in a million other goals? Why are there not bizarre plans (which are massively inconsistent with the source goal) all the time?

So in this case I would say: "It is incoherent to suggest an AI design in which a drastic inconsistency of this sort occurs in the case of the "maximize human happiness" goal, but where it doesn't occur all over the AI's behavior. In particular I am owed an explanation for why this particular AI is clever enough to be a threat, since it might be expected to have been doing this sort of thing throughout its development, and in that case I would expect it to be so stupid that it would never have made it to superintelligence in the first place."

Those are the three main areas in which the design would be incoherent ..... i.e. would have such glaring, unbelievable gaps in the design that those gaps would need to be explained before the hypothetical AI could become at all believable.