Gunnar_Zarncke comments on Debunking Fallacies in the Theory of AI Motivation - LessWrong

8 Post author: Richard_Loosemore 05 May 2015 02:46AM

Comment author: Richard_Loosemore 06 May 2015 04:08:48PM 7 points

With all of the above in mind, a quick survey of some of the things that you just said, with my explanation for why each one would not (or probably would not) be as much of an issue as you think:

As humans, we have a good idea of what "giving choices to people" vs. "forcing them to do something" looks like. This concept would need to resolve some edge cases, such as putting psychological manipulation in the "forceful" category (even though it can be done with only text).

For a massive-weak-constraint system, psychological manipulation would be automatically understood to be in the forceful category, because the concept of "psychological manipulation" is defined by a cluster of features that involve intentional deception, and since the "friendliness" concept would ALSO involve a cluster of weak constraints, it would include the extended idea of intentional deception. It would have to, because intentional deception is connected to doing harm, which is connected with unfriendly, etc.

Conclusion: that is not really an "edge" case in the sense that someone has to explicitly remember to deal with it.

Very likely, the concept space will be very complicated and difficult for humans to understand.

We will not need to 'understand' the AGI's concept space too much, if we are both using massive weak constraints, with convergent semantics. This point I addressed in more detail already.

This seems pretty similar to Paul's idea of a black-box human in the counterfactual loop. I think this is probably a good idea, but the two problems here are (1) setting up this (possibly counterfactual) interaction in a way that it approves a large class of good plans and rejects almost all bad plans (see the next section), and (2) having a good way to predict the outcome of this interaction usually without actually performing it. While we could say that (2) will be solved by virtue of the superintelligence being a superintelligence, in practice we'll probably get AGI before we get uploads, so we'll need some sort of semi-reliable way to predict humans without actually simulating them. Additionally, the AI might need to self-improve to be anywhere smart enough to consider this complex hypothetical, and so we'll need some kind of low-impact self-improvement system. Again, I think this is probably a good idea, but there are quite a lot of issues with it, and we might need to do something different in practice. Paul has written about problems with black-box approaches based on predicting counterfactual humans here and here. I think it's a good idea to develop both black-box solutions and white-box solutions, so we are not over-reliant on the assumptions involved in one or the other.

What you are talking about here is the idea of simulating a human to predict their response. Now, humans already do this in a massive way, and they do not do it by making gigantic simulations, but just by doing simple modeling. And, crucially, they rely on the massive-weak-constraints-with-convergent-semantics (you can see now why I need to coin the concise term "Swarm Relaxation") between the self and other minds to keep the problem manageable.

That particular idea - of predicting human response - was not critical to the argument that followed, however.

What language will people's questions about the plans be in? If it's a natural language, then the AI must be able to translate its concept space into the human concept space, and we have to solve a FAI-complete problem to do this.

No, we would not have to solve a FAI-complete problem to do it. We will be developing the AGI from a baby state up to adulthood, keeping its motivation system in sync all the way up, and looking for deviations. So, in other words, we would not need to FIRST build the AGI (with potentially dangerous alien semantics), THEN do a translation between the two semantic systems, THEN go back and use the translation to reconstruct the motivation system of the AGI to make sure it is safe.

Much more could be said about the process of "growing" and "monitoring" the AGI during the development period, but suffice it to say that this process is extremely different if you have a Swarm Relaxation system vs. a logical system of the sort your words imply.

We should also definitely be wary of a decision rule of the form "find a plan that, if explained to humans, would cause humans to say they understand it".

This hits the nail on the head. This comes under the heading of a strong constraint, or a point-source failure mode. The motivation system of a Swarm Relaxation system would not contain "decision rules" of that sort, precisely because they could have large, divergent effects on the behavior. If motivation is, instead, governed by large numbers of weak constraints, then your decision rule would be seen to be a type of deliberate deception, or manipulation, of the humans. And that contradicts a vast array of constraints that are consistent with friendliness.
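
To make the contrast concrete, here is a deliberately tiny sketch of a single strong decision rule versus a handful of weak constraints voting together. It is only a toy under stated assumptions: the feature names, the votes, and the averaging scheme are all hypothetical and are not taken from the paper, and a real Swarm Relaxation system would involve vastly more constraints and a relaxation process rather than a simple average.

```python
# Toy illustration only. Every feature name and vote below is hypothetical.

# A plan that games the "humans say they understand it" rule through manipulation.
PLAN = {
    "humans_say_they_understand": True,
    "involves_deception": True,
    "restricts_choices": True,
    "bypasses_consent": True,
    "causes_direct_harm": False,
}


def strong_rule(plan):
    """Single hard decision rule: approve whenever humans say they understand."""
    return plan["humans_say_they_understand"]


# Each weak constraint is (feature, vote cast when that feature is present).
WEAK_CONSTRAINTS = [
    ("humans_say_they_understand", +1.0),
    ("involves_deception", -1.0),
    ("restricts_choices", -1.0),
    ("bypasses_consent", -1.0),
    ("causes_direct_harm", -1.0),
]


def weak_score(plan, constraints):
    """Average the votes of all constraints; absent features contribute 0."""
    votes = [vote for feature, vote in constraints if plan[feature]]
    return sum(votes) / len(constraints)


def weak_approves(plan, constraints):
    """Approve only when the constraints, taken together, lean positive."""
    return weak_score(plan, constraints) > 0


if __name__ == "__main__":
    print("strong rule approves:", strong_rule(PLAN))
    print("weak constraints approve:", weak_approves(PLAN, WEAK_CONSTRAINTS))
    # Remove any single constraint and check whether the verdict changes.
    for i, (feature, _) in enumerate(WEAK_CONSTRAINTS):
        reduced = WEAK_CONSTRAINTS[:i] + WEAK_CONSTRAINTS[i + 1:]
        print("without", feature, "->", weak_approves(PLAN, reduced))
```

In this toy, the single rule approves the manipulative plan, while the combined constraints reject it, and the rejection survives the loss of any one constraint. That is the sense in which no individual constraint is a point source of failure here.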

Again, it's quite plausible that the AI's concept space will contain some kind of concept that distinguishes between these different types of optimization; however, humans will need to understand the AI's concept space in order to pinpoint this concept so it can be integrated into the AI's decision rule.

Same as previous: with a design that does not use decision rules that are prone to point-source failure modes, the issue evaporates.

To summarize: much depends on an understanding of the concept of a weak constraint system. There are no really good readings I can send you (I know I should write one), but you can take a look at the introductory chapter of McClelland and Rumelhart that I gave in the references to the paper.

Also, there is a more recent reference to this concept, from an unexpected source. Yann LeCun has been giving some lectures on Deep Learning in which he came up with a phrase that could have been used two decades ago to describe exactly the sort of behavior to be expected from SA systems. He titles his lecture "The Unreasonable Effectiveness of Deep Learning". That is a wonderful way to express it: swarm relaxation systems do not have to work (there really is no math that can tell you that they should be as good as they are), but they do. They are "unreasonably effective".

There is a very deep truth buried in that phrase, and a lot of what I have to say about SA is encapsulated in it.

Comment author: Gunnar_Zarncke 06 May 2015 10:28:03PM 0 points

That concept spaces can be matched without gotchas is reassuring and may point in a direction in which AGI can be made friendly. If the concepts are suitably matched in your proposed checking modules. If. And if no other errors are made.