[draft] Concepts are Difficult, and Unfriendliness is the Default: A Scary Idea Summary

Kaj_Sotala

31st Mar 2012

Here's my draft document, "Concepts are Difficult, and Unfriendliness is the Default" (Google Docs, commenting enabled). Despite the name, it's still informal and would need many more references, but it could be written up as a proper paper if people felt that the reasoning was solid.

Here's my introduction:

In the "Muehlhauser-Goertzel Dialogue, Part 1", Ben Goertzel writes:

[Anna Salamon] gave the familiar SIAI argument that, if one picks a mind at random from “mind space”, the odds that it will be Friendly to humans are effectively zero.

I made the familiar counter-argument that this is irrelevant, because nobody is advocating building a random mind. Rather, what some of us are suggesting is to build a mind with a Friendly-looking goal system, and a cognitive architecture that’s roughly human-like in nature but with a non-human-like propensity to choose its actions rationally based on its goals, and then raise this AGI mind in a caring way and integrate it into society. Arguments against the Friendliness of random minds are irrelevant as critiques of this sort of suggestion.

[...] Over all these years, the SIAI community maintains the Scary Idea in its collective mind, and also maintains a great devotion to the idea of rationality, but yet fails to produce anything resembling a rational argument for the Scary Idea -- instead repetitiously trotting out irrelevant statements about random minds!!

Ben has a valid complaint here. Therefore, I'll attempt to formalize the arguments for the following conclusion:

Even if an AGI is explicitly built to have a Friendly-looking goal system, and a cognitive architecture that’s roughly human-like in nature but with a non-human-like propensity to choose its actions rationally based on its goals, and this AGI mind is raised in a caring way in an attempt to integrate it into society, there is still a very large chance of creating a mind that is unFriendly.

First, I'll outline my argument, and then expand upon each specific piece in detail.

The premises in outline

0. There will eventually be a situation where the AGI's goals and behaviors are no longer under our control.

1. Whether or not the AGI will eventually come to understand what we wanted it to do is irrelevant, if that understanding does not guide its actions in “the right way”.

2. Providing an AGI with the kind of understanding that'd guide its actions in “the right way” requires some way of defining our intentions.

3. In addition to defining what counts as our intentions, we also need to define the concepts that make up those intentions.

4. Any difference between the way we understand concepts and the way that they are defined by the AGI is something that the AGI may exploit, with likely catastrophic results.

5. Common-sense concepts are complicated and allow for many degrees of freedom: fully satisfactory definitions for most concepts do not exist.

6. Even if an AGI seemed to learn our concepts, without human inductive biases it would most likely mislearn them.

7. AGI concepts are likely to be opaque and hard to understand, making proper verification impossible.
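
As a toy illustration of premises 5 and 6, here is a minimal sketch of how a finite set of examples can fail to pin down a concept. Everything in it is hypothetical and made up for illustration: the "concept" is just a predicate over small integers, and `observed`, `concept_a`, `concept_b`, `consistent` and `disagreements` are names invented here, not anything from the linked document. It is not a model of how an AGI would learn concepts, only of the underdetermination itself.

```python
# Hypothetical toy example: two candidate "concepts" that agree on every
# observed example but come apart on cases that were never observed.

# Labeled examples the learner has seen: (input, does the concept apply?)
observed = [(1, True), (2, True), (3, True), (10, False)]


def concept_a(x):
    """One candidate definition: 'x is less than 5'."""
    return x < 5


def concept_b(x):
    """Another candidate definition: 'x is less than 4, or x is exactly 7'."""
    return x < 4 or x == 7


def consistent(hypothesis, data):
    """True if the hypothesis reproduces every observed label."""
    return all(hypothesis(x) == label for x, label in data)


def disagreements(h1, h2, cases):
    """Inputs on which the two hypotheses give different answers."""
    return [x for x in cases if h1(x) != h2(x)]


if __name__ == "__main__":
    print(consistent(concept_a, observed))  # both fit the data equally well
    print(consistent(concept_b, observed))
    print(disagreements(concept_a, concept_b, range(12)))
```

Both definitions fit the observed data perfectly, so the data alone cannot tell the learner which one was intended; only an inductive bias (here, perhaps a preference for the simpler rule) picks one out, and a learner with a different bias could settle on the other.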

And here's my conclusion:

Above, I have argued that an AGI will only be Friendly if its goals are the kinds of goals that we would want it to have, and it will only have the kinds of goals that we would want it to have if the concepts that it bases its goals on are sufficiently similar to the concepts that we use. Even subtle differences in the concepts will quickly lead to drastic differences – even an AGI with most of its ontology basically correct, but with a differing definition regarding the concept of “time”, might end up destroying humanity. I have also argued that human behavioral data severely underconstrains the actual models that could be generated about human concepts, that humans do not understand the concepts they use themselves, and that an AGI developing concepts that are subtly different from those of humans is therefore unavoidable. Furthermore, AGI concepts are themselves likely to be opaque in that they cannot simply be read off the AGI, but have to be inferred in the same way that an AGI tries to infer human concepts, so humans cannot even reliably know whether an AGI that seems Friendly really is Friendly. The most likely scenario is that it is not, but there is no safe way for the humans to test this.

Presuming that one accepts this chain of reasoning, it seems like

Even if an AGI is explicitly built to have a Friendly-looking goal system, and a cognitive architecture that’s roughly human-like in nature but with a non-human-like propensity to choose its actions rationally based on its goals, and this AGI mind is raised in a caring way in an attempt to integrate it into society, there is still a very large chance of creating a mind that is unFriendly.

would be a safe conclusion to accept.

For the actual argumentation defending the various premises, see the linked document. I have a feeling that there are still several conceptual distinctions that I should be making but am not, but I figured that the easiest way to find the problems would be to have people tell me what points they find unclear or disagreeable.

39 Comments

XiXiDu (14y)

For what it's worth, here are some possible objections that certain people might raise.

(Note: I am doing this to help you refine a document that was probably meant to convince critics that they are wrong. It is not an attempt to troll. Everything below this line is written in critique mode.)

The most basic drive of any highly efficient AGI is, in my opinion, the drive to act correctly. You seem to assume that an AGI will likely be designed to judge any action with regard to a strict utility function. You are assuming a very special kind of AGI design with a rigid utility function that the AGI then cares to satisfy the way it was initially hardcoded. You assume that the AGI won't be able to, or won't want to, figure out what its true goals might be.

What makes you think that AGIs will be designed according to those criteria?

If an AGI acts according to a rigid utility function, then what makes you think that it won't try to interpret any vagueness in a way that most closely reflects the most probable way it was meant to be interpreted?

If the AGI's utility function consisted solely of the English sentence "Make people happy.", then what makes you think that it wouldn't be able to conclude what we actually meant by it and act accordingly? Why would it care to act in a way that does not reflect our true intentions?

My problem is that there seems to be a discontinuity between the superior intelligence of a possible AGI and its inability to tell relevant information from irrelevant information when it comes to correctly interpreting its utility function.

Kaj_Sotala (14y)

If an AGI acts according to a rigid utility function, then what makes you think that it won't try to interpret any vagueness in a way that most closely reflects the most probable way it was meant to be interpreted?

If the AGI's utility function consisted solely of the English sentence "Make people happy.", then what makes you think that it wouldn't be able to conclude what we actually meant by it and act accordingly? Why would it care to act in a way that does not reflect our true intentions?

Okay, I'm clearly not communicating the ess...

Kaj_Sotala (14y)

Hmm. Actually, I'm not making any assumptions about the AGI's decision-making process (or at least I'm trying not to): it could have a formal utility function, but it could also have e.g. a more human-like system with various instincts that pull it in different directions, or pretty much any decision-making system that might be reasonable. You make a good point that this probably needs to be clarified. Could you point out the main things that give the impression that I'm presuming utility-function-based decision making?

wedrifid (14y)

Your critique will help Kaj refine his document so as to better persuade critics.
