TheAncientGeek comments on Debunking Fallacies in the Theory of AI Motivation - LessWrong

Post author: Richard_Loosemore 05 May 2015 02:46AM


Comment author: Vaniver 16 May 2015 05:10:12PM * 1 point

You have assumed that the AI will have some separate boxed-off goal system

What makes you think that? The description in that post is generic enough to describe AIs with compartmentalized goals, AIs without compartmentalized goals, and AIs that don't have explicitly labeled internal goals. It doesn't even require that the AI follow the goal statement, just evaluate it for consistency!

See the problem?

You may find this comment of mine interesting. In short, yes, I do think I see the problem.

If efficiency can be substituted for truth, why is there so much emphasis on truth in the advice given to human rationalists?

I'm sorry, but I can't make sense of this question. I'm not sure what you mean by "efficiency can be substituted for truth," and what you think the relevance of advice to human rationalists is to AI design.

In order to achieve an AI that's smart enough to be dangerous, a number of currently unsolved problems will have to be solved. That's a given.

I disagree with this, too! AI systems already exist that are both smart, in that they solve complex and difficult cognitive tasks, and dangerous, in that they make decisions on which significant value rides, and thus poor decisions are costly. As a simple example I'm somewhat familiar with, some radiation treatments for patients are designed by software looking at images of the tumor in the body, and then checked by a doctor. If the software is optimizing for a suboptimal function, then it will not generate the best treatment plans, and patient outcomes will be worse than they could have been.
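
To make the stakes concrete, here is a minimal sketch of how optimizing a proxy objective diverges from the true objective (hypothetical plan names and numbers, nothing to do with any real treatment-planning software):

```python
# Hypothetical sketch: a planner scoring treatment plans by a proxy
# objective versus the true objective. Plan names and numbers invented.
plans = {
    "plan_a": {"tumor_coverage": 0.95, "healthy_dose": 0.40},
    "plan_b": {"tumor_coverage": 0.90, "healthy_dose": 0.05},
}

def proxy_score(plan):   # what the software actually optimizes
    return plan["tumor_coverage"]

def true_score(plan):    # what the doctor actually wants
    return plan["tumor_coverage"] - plan["healthy_dose"]

print(max(plans, key=lambda name: proxy_score(plans[name])))  # plan_a
print(max(plans, key=lambda name: true_score(plans[name])))   # plan_b
```

The two objectives pick different plans; nothing in the optimizer "notices" that the proxy is the wrong target.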

Now, we don't have any AIs around that seem capable of ending human civilization (thank goodness!), and I agree that's probably because a number of unsolved problems are still unsolved. But it would be nice to have the unknowns mapped out, rather than assuming that wisdom and cleverness go hand in hand. So far, that's not what the history of software looks like to me.

Comment author: TheAncientGeek 16 May 2015 07:11:44PM * 1 point

What you said here amounts to the claim that an AI of unspecified architecture will, on noticing a difference between a hardcoded goal and instrumental knowledge, side with the hardcoded goal:

This seems to me like sneaking in knowledge. It sounds like the AI reads its source code, notices that it is supposed to come up with plans that maximize a function called "programmersSatisfied," and then says "hmm, maximizing this function won't satisfy my programmers." It seems more likely to me that it'll ignore the label, or infer the other way--"How nice of them to tell me exactly what will satisfy them, saving me from doing the costly inference myself!"

Whereas what you say here is that you can make inferences about architecture, or internal workings, based on information about manifest behaviour:

I'm doing functional reasoning, and trying to do it both forwards and backwards. For example, if you give me a black box and tell me that when the box receives the inputs (1,2,3) then it gives the outputs (1,4,9), I will think backwards from the outputs to the inputs and say "it seems likely that the box is squaring its inputs." If you tell me that a black box squares its inputs, I will think forwards from the definition and say "then if I give it the inputs (1,2,3), then it'll likely give me the output (1,4,9)." So when I hear that the box gets the inputs (source code, goal statement, world model) and produces the output "this goal is inconsistent with the world model!" iff the goal statement is inconsistent with the world model, I reason backwards and say "the source code needs to somehow collide the goal statement with the world model in a way that checks for consistency."

...but what needed explaining in the first place is the siding with the goal, not the ability to detect a contradiction.
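
For concreteness, the quoted forwards/backwards reasoning can be sketched in a toy example (purely illustrative, not any real AI architecture):

```python
# Backwards: from observed behaviour, infer a likely internal function.
observations = [(1, 1), (2, 4), (3, 9)]
squares = all(y == x ** 2 for x, y in observations)
print(squares)  # True -> "it seems likely the box is squaring its inputs"

# Forwards: from a stated definition, predict behaviour.
def black_box(x):
    return x ** 2

print([black_box(x) for x in (1, 2, 3)])  # [1, 4, 9]
```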

Comment author: Vaniver 17 May 2015 03:49:40AM * 2 points

I am finding this comment thread frustrating, and so expect this will be my last reply. But I'll try to make the most of that by trying to write a concise and clear summary:

What you said here amounts to the claim that an AI of unspecified architecture will, on noticing a difference between a hardcoded goal and instrumental knowledge, side with the hardcoded goal

Loosemore, Yudkowsky, and I are all discussing AIs that have a goal misaligned with human values that they nevertheless find motivating. (That's why we call it a goal!) Loosemore observes that if these AIs understand concepts and nuance, they will realize that a misalignment between their goal and human values is possible--if they don't realize that, he doesn't think they deserve the description "superintelligent."

Now there are several points to discuss:

  1. Whether or not "superintelligent" is a meaningful term in this context. I think rationalist taboo is a great discussion tool, and so looked for nearby words that would more cleanly separate the ideas under discussion. I think if you say that such designs are not superwise, everyone agrees, and now you can discuss the meat of whether or not it's possible (or expected) to design superclever but not superwise systems.

  2. Whether we should expect generic AI designs to recognize misalignments, or whether such a realization would impact the goal the AI pursues. Neither Yudkowsky nor I think either of those is reasonable to expect--as a motivating example, we are happy to subvert the goals that we infer evolution was directing us towards in order to better satisfy "our" goals. I suspect that Loosemore thinks that viable designs would recognize it, but agrees that in general that recognition does not have to lead to an alignment.

  3. Whether or not such AIs are likely to be made. Loosemore appears pessimistic about the viability of these undesirable AIs and sees cleverness and wisdom as closely tied together. Yudkowsky appears "optimistic" about their viability, thinking that this is the default outcome without special attention paid to goal alignment. It does not seem to me that cleverness, wisdom, or human-alignment are closely tied together, and so it seems easy to imagine a system with only one of those, by straightforward extrapolation from current use of software in human endeavors.

I don't see any disagreement that AIs pursue their goals, which is the claim you thought needed explanation. What I see is disagreement over whether or not the AI can 'partially solve' the problem of understanding goals and pursuing them. We could imagine a Maverick Nanny that hears "make humans happy," comes up with the plan to wirehead all humans, and then rewrites its sensory code to hallucinate as many wireheaded humans as it can (or just tries to stick as large a number as it can into its memory), rather than going to all the trouble of actually wireheading all humans. We can also imagine a Nanny that hears "make humans happy" and actually goes about making humans happy. If the same software underpins both understanding human values and executing plans, what risk is there? But if it's different software, then we have the risk.
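
That last distinction can be sketched in toy form (invented names and plans; an assumption-laden illustration rather than a real design):

```python
def understood_happiness(plan):
    # stands in for the AI's full world model of what satisfies humans
    return 1.0 if plan == "genuinely improve lives" else 0.0

def hardcoded_happiness(plan):
    # stands in for a separate, simplistic metric bolted onto the planner
    return 1.0 if plan == "wirehead everyone" else 0.0

plans = ["wirehead everyone", "genuinely improve lives"]

# Same software: the planner scores plans with the very model it uses to
# understand the instruction, so understanding and action cannot diverge.
print(max(plans, key=understood_happiness))  # genuinely improve lives

# Different software: the planner optimizes a metric that the
# understanding module would reject -- this is where the risk appears.
print(max(plans, key=hardcoded_happiness))   # wirehead everyone
```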

Comment author: Richard_Loosemore 17 May 2015 05:44:21PM 2 points

This is just a placeholder: I will try to reply to this properly later.

Meanwhile, I only want to add one little thing.

Don't forget that all of this analysis is supposed to be about situations in which we have, so to speak, "done our best" with the AI design. That is sort of built into the premise. If there is a no-brainer change we can make to the design of the AI, to guard against some failure mode, then it is assumed that this has been done.

The reason for that is that the basic premise of these scenarios is "We did our best to make the thing friendly, but in spite of all that effort, it went off the rails."

For that reason, I am not really making arguments about the characteristics of a "generic" AI.

Comment author: Vaniver 17 May 2015 10:50:30PM * 0 points

I will try to reply to this properly later.

Thanks, and take your time!

Don't forget that all of this analysis is supposed to be about situations in which we have, so to speak, "done our best" with the AI design. That is sort of built into the premise. If there is a no-brainer change we can make to the design of the AI, to guard against some failure mode, then it is assumed that this has been done.

I feel like this could be an endless source of confusion and disagreement; if we're trying to discuss what makes airplanes fly or crash, should we assume that engineers have done their best and made every no-brainer change? I'd rather we look for the underlying principles, codify best practices, and come up with lists and tests.

Comment author: TheAncientGeek 18 May 2015 10:48:05AM 0 points

If we're trying to discuss what makes airplanes fly or crash, should we assume that engineers have done their best and made every no-brainer change?

If you are in the business of pointing out to them potential problems they are not aware of, then yes, because they can be assumed to be aware of no-brainer issues.

MIRI seeks to point out dangers in AI that aren't the result of gross incompetence or deliberate attempts to weaponise AI: it's banal to point out that these could lead to danger.

Comment author: Richard_Loosemore 18 May 2015 02:16:18AM 0 points

Maybe I could try to reduce possible confusion here. The paper was written to address a category of "AI Risk" scenarios in which we are told:

"Even if the AI is programmed with goals that are ostensibly favorable to humankind, it could execute those goals in such a way that would lead to disaster".

Given that premise, it would be a bait-and-switch if I proposed a fix for this problem, and someone objected with "But you cannot ASSUME that the programmers would implement that fix!"

The whole point of the problem under consideration is that even if the engineers tried, they could not get the AI to stay true.

Comment author: misterbailey 18 May 2015 09:19:47AM * 1 point

The problem with you objecting to the particular scenarios Yudkowsky et al propose is that the scenarios are merely illustrative. Of course, you can probably guard against any specific failure mode. The claim is that there will be a lot of failure modes, and we can’t expect to guard against all of them by just sitting around thinking of as many exotic disaster scenarios as possible.

Mind you, I know your argument is more than just “I can see why these particular disasters could be avoided”. You’re claiming that certain features of AI will in general tend to make it careful and benevolent. Still, I don’t think it’s valid for you to complain about bait-and-switch, since that’s precisely the problem.

Comment author: Richard_Loosemore 18 May 2015 05:11:41PM 0 points

I have explicitly addressed this point on many occasions. My paper had nothing in it that was specific to any failure mode.

The suggestion is that the entire class of failure modes suggested by Yudkowsky et al. has a common feature: they all rely on the AI being incapable of using a massive array of contextual constraints when evaluating plans.

By simply proposing an AI in which such massive constraint deployment is the norm, the ball is now in the other court: it is up to Yudkowsky et al. to come up with ANY kind of failure mode that could get through.

The scenarios I attacked in the paper have the common feature that they have been predicated on such a simplistic type of AI that they were bound to fail. They had failure built into them.

As soon as everyone moves on from those "dumb" superintelligences and starts to discuss the possible failure modes that could occur in a superintelligence that makes maximum use of constraints, we can start to talk about possible AI dangers. I'm ready to do that. Just waiting for it to happen, is all.

Comment author: OrphanWilde 18 May 2015 05:57:48PM 1 point

Alright, I'll take you up on it:

Failure Mode I: The AI doesn't do anything useful, because there's no way of satisfying every contextual constraint.

Predicting your response: "That's not what I meant."

Failure Mode II: The AI weighs contextual constraints incorrectly and sterilizes all humans to satisfy the sort of person who believes in Voluntary Human Extinction.

Predicting your response: "It would (somehow) figure out the correct weighting for all the contextual constraints."

Failure Mode III: The AI weighs contextual constraints correctly (for a given value of "correctly") and sterilizes everybody of below-average intelligence or any genetic abnormalities that could impose costs on offspring, and in the process, sterilizes all humans.

Predicting your response: "It wouldn't do something so dumb."

Failure Mode IV: The AI weighs contextual constraints correctly and puts all people of minority ethical positions into mind-rewriting machines so that there's no disagreement anymore.

Predicting your response: "It wouldn't do something so dumb."

We could keep going, but the issue is that so far, you've defined -any- failure mode as "dumb"ness, and have argued that the AI wouldn't do anything so "dumb", because you've already defined that it is superintelligent.

I don't think you know what intelligence -is-. Intelligence does not confer immunity to "dumb" behaviors.

Comment author: TheAncientGeek 18 May 2015 07:16:47PM 2 points

It's got to confer some degree of dumbness avoidance.

In any case, MIRI has already conceded that superintelligent AIs won't misbehave through stupidity. They maintain the problem is motivation ... the Genie KNOWS but doesn't CARE.

Comment author: OrphanWilde 18 May 2015 08:23:42PM 1 point

It's got to confer some degree of dumbness avoidance.

Does it? On what grounds?

In any case, MIRI has already conceded that superintelligent AIs won't misbehave through stupidity. They maintain the problem is motivation ... the Genie KNOWS but doesn't CARE.

That's putting an alien intelligence in human terms; the very phrasing inappropriately anthropomorphizes the genie.

We probably won't go anywhere without an example.

Market economics ("capitalism") is an intelligence system which is very similar to the intelligence system Richard is proposing. Very, very similar; it's composed entirely of independent nodes (seven billion of them) which each provide their own set of constraints, and promote or demote information as it passes through them based on those constraints. It's an alien intelligence which follows Richard's model which we are very familiar with. Does the market "know" anything? Does it even make sense to suggest that market economics -could- care?

Does the market always arrive at the correct conclusions? Does it even consistently avoid stupid conclusions?

How difficult is it to program the market to behave in specific ways?

Is the market "friendly"?

Does it make sense to say that the market is "stupid"? Does the concept "stupid" -mean- anything when talking about the market?

Comment author: Richard_Loosemore 18 May 2015 08:01:38PM 0 points

I will take them one at a time:

Failure Mode I: The AI doesn't do anything useful, because there's no way of satisfying every contextual constraint.

An elementary error. The constraints in question are referred to in the literature as "weak" constraints (and I believe I used that qualifier in the paper: I almost always do). Weak constraints never need to be ALL satisfied at once. No AI could ever be designed that way, and no-one ever suggested that it would. See the reference to McClelland, J.L., Rumelhart, D.E. & Hinton, G.E. (1986) in the paper: that gives a pretty good explanation of weak constraints.
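
For anyone who hasn't met the idea, here is a minimal toy sketch of how weak constraints combine (invented weights and plan names; the real systems in that reference are far richer):

```python
# Toy weights only: weak constraints are aggregated, not individually
# mandatory -- here no candidate satisfies all of them.
candidates = ["plan_x", "plan_y", "plan_z"]

constraints = [
    (1.0, lambda p: p != "plan_x"),  # e.g. "avoid harming anyone"
    (0.6, lambda p: p != "plan_y"),  # e.g. "avoid wasting resources"
    (0.3, lambda p: p != "plan_z"),  # e.g. "prefer familiar methods"
]

def score(plan):
    # A violated constraint just loses its weight; it does not veto.
    return sum(w for w, satisfied in constraints if satisfied(plan))

best = max(candidates, key=score)
print(best, score(best))  # plan_z 1.6 -- violates only the weakest constraint
```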

Predicting your response: "That's not what I meant."

That's an insult. But I will overlook it, since I know it is just your style.

Failure Mode II: The AI weighs contextual constraints incorrectly and sterilizes all humans to satisfy the sort of person who believes in Voluntary Human Extinction.

How exactly do you propose that the AI "weighs contextual constraints incorrectly" when the process of weighing constraints requires most of the constraints involved (probably thousands of them) to all suffer a simultaneous, INDEPENDENT 'failure' for this to occur?

That is implicit in the way that weak constraint systems are built. Perhaps you are not familiar with the details.
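
The arithmetic behind that point, with hypothetical numbers:

```python
# If each constraint misfires independently with probability p, the
# chance of all n misfiring at once is p ** n. Numbers assumed.
p = 0.1    # generous per-constraint failure rate
n = 1000   # number of relevant weak constraints
print(p ** n)  # 1e-1000: negligible, *if* the failures really are independent
```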

Predicting your response: "It would (somehow) figure out the correct weighting for all the contextual constraints."

Assuming this isn't more of the same, what you are saying here is isomorphic to the statement that somehow, a neural net might figure out the correct weighting for all the connections so that it produces the correctly trained output for a given input. That problem was solved in so many different NN systems that most NN people, these days, would consider your statement puzzling.

Failure Mode III: The AI weighs contextual constraints correctly (for a given value of "correctly") and sterilizes everybody of below-average intelligence or any genetic abnormalities that could impose costs on offspring, and in the process, sterilizes all humans.

A trivial variant of your second failure mode. The AI is calculating the constraints correctly, according to you, but at the same time you suggest that it has somehow NOT included any of the constraints that relate to the ethics of forced sterilization, etc. etc. You offer no explanation of why all of those constraints were not counted by your proposed AI, you just state that they weren't.

Predicting your response: "It wouldn't do something so dumb."

Yet another insult. This is getting a little tiresome, but I will carry on.

Failure Mode IV: The AI weighs contextual constraints correctly and puts all people of minority ethical positions into mind-rewriting machines so that there's no disagreement anymore.

This is identical to your third failure mode, but here you produce a different list of constraints that were ignored. Again, with no explanation of why a massive collection of constraints suddenly disappeared.

Predicting your response: "It wouldn't do something so dumb."

No comment.

We could keep going, but the issue is that so far, you've defined -any- failure mode as "dumb"ness, and have argued that the AI wouldn't do anything so "dumb", because you've already defined that it is superintelligent.

This is a bizarre statement, since I have said no such thing. Would you mind including citations, from now on, when you say that I "said" something? And please try not to paraphrase, because it takes time to correct the distortions in your paraphrases.

I don't think you know what intelligence -is-. Intelligence does not confer immunity to "dumb" behaviors.

Another insult, and putting words into my mouth, and showing no understanding of what a weak constraint system actually is.

Comment author: OrphanWilde 18 May 2015 08:49:17PM 3 points

An elementary error. The constraints in question are referred to in the literature as "weak" constraints (and I believe I used that qualifier in the paper: I almost always do). Weak constraints never need to be ALL satisfied at once. No AI could ever be designed that way, and no-one ever suggested that it would. See the reference to McClelland, J.L., Rumelhart, D.E. & Hinton, G.E. (1986) in the paper: that gives a pretty good explanation of weak constraints.

I understand the concept.

How exactly do you propose that the AI "weighs contextual constraints incorrectly" when the process of weighing constraints requires most of the constraints involved (probably thousands of them) to all suffer a simultaneous, INDEPENDENT 'failure' for this to occur?

I'd hazard a guess that, for any given position, less than 70% of humans will agree without reservation. The issue isn't that thousands of failures occur. The issue is that thousands of failures -always- occur.

Assuming this isn't more of the same, what you are saying here is isomorphic to the statement that somehow, a neural net might figure out the correct weighting for all the connections so that it produces the correctly trained output for a given input. That problem was solved in so many different NN systems that most NN people, these days, would consider your statement puzzling.

The problem is solved only for well-understood (and very limited) problem domains with comprehensive training sets.

A trivial variant of your second failure mode. The AI is calculating the constraints correctly, according to you, but at the same time you suggest that it has somehow NOT included any of the constraints that relate to the ethics of forced sterilization, etc. etc. You offer no explanation of why all of those constraints were not counted by your proposed AI, you just state that they weren't.

They were counted. They are, however, weak constraints. The constraints which required human extinction outweighed them, as they do for countless human beings. Fortunately for us in this imagined scenario, the constraints against killing people counted for more.

This is identical to your third failure mode, but here you produce a different list of constraints that were ignored. Again, with no explanation of why a massive collection of constraints suddenly disappeared.

Again, they weren't ignored. They are, as you say, weak constraints. Other constraints overrode them.

Another insult, and putting words into my mouth, and showing no understanding of what a weak constraint system actually is.

The issue here isn't my lack of understanding. The issue here is that you are implicitly privileging some constraints over others without any justification.

Every single conclusion I reached here is one that humans - including very intelligent humans - have reached. By dismissing them as possible conclusions an AI could reach, you're implicitly rejecting every argument pushed for each of these positions without first considering them. The "weak constraints" prevent them.

I didn't choose -wrong- conclusions, you see, I just chose -unpopular- conclusions, conclusions I knew you'd find objectionable. You should have noticed that; you didn't, because you were too concerned with proving that AI wouldn't do them. You were too concerned with your destination, and didn't pay any attention to your travel route.

If doing nothing is the correct conclusion, your AI should do nothing. If human extinction is the correct conclusion, your AI should choose human extinction. If sterilizing people with unhealthy genes is the correct conclusion, your AI should sterilize people with unhealthy genes (you didn't notice that humans didn't necessarily go extinct in that scenario). If rewriting minds is the correct conclusion, your AI should rewrite minds.

And if your constraints prevent the AI from undertaking the correct conclusion?

Then your constraints have made your AI stupid, for some value of "stupid".

The issue, of course, is that you have decided that you know better what is or is not the correct conclusion than an intelligence you are supposedly creating to know things better than you.

And that sums up the issue.

Comment author: TheAncientGeek 18 May 2015 10:58:11AM * 0 points

The claim is that there will be a lot of failure modes, and we can’t expect to guard against all of them by just sitting around thinking of as many exotic disaster scenarios as possible.

I doubt that, since, coupled with claims of existential risk, the logical conclusion would be to halt AI research, but MIRI isn't saying that.

Comment author: misterbailey 18 May 2015 02:12:13PM 1 point

There are other methods than "sitting around thinking of as many exotic disaster scenarios as possible" by which one could seek to make AI friendly. Thus, believing that "sitting around [...]" will not be sufficient does not imply that we should halt AI research.

Comment author: TheAncientGeek 18 May 2015 02:23:17PM * 0 points

So where are the multiple solutions to the multiple failure modes?

Comment author: misterbailey 18 May 2015 09:16:17AM 1 point

Yudkowsky et al don't argue that the problem is unsolvable, only that it is hard. In particular, Yudkowsky fears it may be harder than creating AI in the first place, which would mean that in the natural evolution of things, UFAI appears before FAI. However, I needn't factor what I'm saying through the views of Yudkowsky. For an even more modest claim, we don't have to believe that FAI is hard in hindsight in order to claim that AI will be unfriendly unless certain failure modes are guarded against. On this view of the FAI project, a large part of the effort is just noticing the possible failure modes that were only obvious in hindsight, and convincing people that the problem is important and won't solve itself.

Comment author: TheAncientGeek 18 May 2015 11:16:42AM * 0 points

If no one is building AIs with utility functions, then the one kind of failure MIRI is talking about has solved itself.

Comment author: Unknowns 17 May 2015 04:20:33AM 2 points

Richard Loosemore has stated a number of times that he does not expect an AI to have goals at all in a sense which is relevant to this discussion, so in that way there is indeed disagreement about whether AIs "pursue their goals."

Basically he is saying that AIs will not have goals in the same way that human beings do not have goals. No human being has a goal that he will pursue so rigidly that he would destroy the universe in order to achieve it, and AIs will behave similarly.

Comment author: TheAncientGeek 17 May 2015 09:09:33AM * 0 points

Basically he is saying that AIs will not have goals in the same way that human beings do not have goals. No human being has a goal that he will pursue so rigidly that he would destroy the universe in order to achieve it, and AIs will behave similarly.

Arguably, humans don't do that sort of thing because of goals towards self-preservation, status and hedonism.

Richard Loosemore has stated a number of times that he does not expect an AI to have goals at all in a sense which is relevant to this discussion, so in that way there is indeed disagreement about whether AIs "pursue their goals."

The sense relevant to the discussion could be something specific, like direct normativity, i.e. building detailed descriptions into goals.

Comment author: TheAncientGeek 18 May 2015 09:17:46AM * 1 point

Loosemore, Yudkowsky, and I are all discussing AIs that have a goal misaligned with human values that they nevertheless find motivating.

If that is supposed to be a universal or generic AI, it is a valid criticism to point out that not all AIs are like that.

If that is supposed to be a particular kind of AI, it is a valid criticism to point out that no realistic AIs are like that.

You seem to feel you are not being understood, but what is being said is not clear.

1. Whether or not "superintelligent" is a meaningful term in this context

"Superintelligence" is one of the clearer terms here, IMO. It just means more than human intelligence, and humans can notice contradictions.

This comment seems to be part of a concern about "wisdom", assumed to be some extraneous thing an AI would not necessarily have. (No one but Vaniver has brought in wisdom.) The counterargument is that compartmentalisation between goals and instrumental knowledge is an extraneous thing an AI would not necessarily have, and that its absence is all that is needed for a contradiction to be noticed and acted on.

2. Whether we should expect generic AI designs to recognize misalignments, or whether such a realization would impact the goal the AI pursues.

It's an assumption, that needs justification, that any given AI will have goals of a non-trivial sort. "Goal" is a term that needs tabooing.

Neither Yudkowsky nor I think either of those is reasonable to expect--as a motivating example, we are happy to subvert the goals that we infer evolution was directing us towards in order to better satisfy "our" goals.

While we are anthropomorphising, it might be worth pointing out that humans don't show behaviour patterns of relentlessly pursuing arbitrary goals.

I suspect that Loosemore thinks that viable designs would recognize it, but agrees that in general that recognition does not have to lead to an alignment.

Loosemore has put forward a simple suggestion, which MIRI appears not to have considered at all: that on encountering a contradiction, an AI could lapse into a safety mode, if so designed.
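
A minimal sketch of what such a design might look like (all names and the consistency check are invented for illustration; this is no one's actual proposal):

```python
# Hypothetical sketch of "lapse into safety mode on contradiction".
def consistent(goal_definition, learned_model):
    # stands in for whatever consistency check the architecture supports
    return learned_model.get(goal_definition, False)

def act(goal_definition, learned_model, plan):
    if not consistent(goal_definition, learned_model):
        return "SAFE MODE: halt and refer the contradiction to the designers"
    return "execute: " + plan

# The execution system has inferred that the hardcoded definition is wrong.
learned_model = {"happiness == smiling faces": False}
print(act("happiness == smiling faces", learned_model, "stretch every face"))
```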

3. ...sees cleverness and wisdom as closely tied together

You are paraphrasing Loosemore to sound less technical and more handwaving than his actual comments. The ability to sustain contradictions in a system that is constantly updating itself isn't a given... it requires an architectural choice in favour of compartmentalisation.

Comment author: nshepperd 18 May 2015 09:45:52AM * 3 points

All this talk of contradictions is sort of rubbing me the wrong way here. There's no "contradiction" in an AI having goals that are different to human goals. Logically, this situation is perfectly normal. Loosemore talks about an AI seeing its goals are "massively in contradiction to everything it knows about <BLAH>", but... where's the contradiction? What's logically wrong with getting strawberries off a plant by burning them?

I don't see the need for any kind of special compartmentalisation; information about "normal use of strawberries" is already inert facts with no caring attached by default.

If you're going to program in special criteria that would create caring about this information, okay, but how would such criteria work? How do you stop it from deciding that immortality is contradictory to "everything it knows about death" and refusing to help us solve aging?

Comment author: TheAncientGeek 18 May 2015 07:52:40PM 0 points

In the original scenario, the contradiction is supposed to be between a hardcoded definition of happiness in the AI's goal system, and inferred knowledge in the execution system.

Comment author: Richard_Loosemore 18 May 2015 08:43:34PM 1 point

I have read what you wrote above carefully, but I won't reply line-by-line because I think it will be clearer not to.

When it comes to finding a concise summary of my claims, I think we do indeed need to be careful to avoid blanket terms like "superintelligent" or "superclever" or "superwise" ... but we should only avoid these IF they are used with the implication that they have a precise (perhaps technically precise) meaning. I do not believe they have precise meaning. But I do use the term "superintelligent" a lot anyway. My reason for doing that is that I only use it as an overview word -- it is just supposed to be a loose category that includes a bunch of more specific issues. I only really want to convey the particular issues -- the particular ways in which the intelligence of the AI might be less than adequate, for example.

That is only important if we find ourselves debating whether it might be clever, wise, or intelligent... I wouldn't want to get dragged into that, because I only really care about specifics.

For example: does the AI make a habit of forming plans that massively violate all of its background knowledge about the goal that drove the plan? If it did, it would (1) take the baby out to the compost heap when what it intended to do was respond to the postal-chess game it is engaged in, or (2) cook the eggs by going out to the workshop and making a cross-cutting jig for the table saw, or (3) ... and so on. If we decided that the AI was indeed prone to errors like that, I wouldn't mind if someone diagnosed a lack of 'intelligence' or a lack of 'wisdom' or a lack of ... whatever. I merely claim that in that circumstance we have evidence that the AI hasn't got what it takes to impose its will on a paper bag, never mind exterminate humanity.

Now, my attacks on the scenarios have to do with a bunch of implications for what the AI (the hypothetical AI) would actually do. And it is that 'bunch' that I think add up to evidence for what I would summarize as 'dumbness'.

And, in fact, I usually go further than that and say that if someone tried to get near to an AI design like that, the problems would arise early on and the AI itself (inasmuch as it could do anything smart at all) would be involved in the efforts to suggest improvements. This is where we get the suggestions in your item 2, about the AI 'recognizing' misalignments.

I suspect that on this score a new paper is required, to carefully examine the whole issue in more depth. In fact, a book.

I have now decided that that has to happen.

So perhaps it is best to put the discussion on hold until a seriously detailed technical book comes out of me? At any rate, that is my plan.

Comment author: Vaniver 18 May 2015 10:32:28PM 1 point

So perhaps it is best to put the discussion on hold until a seriously detailed technical book comes out of me? At any rate, that is my plan.

That seems like a solid approach. I do suggest that you try to look deeply into whether or not it's possible to partially solve the problem of understanding goals, as I put it above, and make that description of why that is or isn't possible or likely long and detailed. As you point out, that likely requires book-length attention.