Three Approaches to "Friendliness"

Wei Dai

I put "Friendliness" in quotes in the title, because I think what we really want, and what MIRI seems to be working towards, is closer to "optimality": create an AI that minimizes the expected amount of astronomical waste. In what follows I will continue to use "Friendly AI" to denote such an AI since that's the established convention.

I've often stated my objections to MIRI's plan to build an FAI directly (instead of after human intelligence has been substantially enhanced). But it's not because, as some have suggested while criticizing MIRI's FAI work, we can't foresee what problems need to be solved. Rather, it's because we can largely foresee what kinds of problems need to be solved to build an FAI, but they all look superhumanly difficult, either due to their inherent difficulty, or the lack of opportunity for "trial and error", or both.

When people say they don't know what problems need to be solved, they may be mostly talking about "AI safety" rather than "Friendly AI". If you think in terms of "AI safety" (i.e., making sure some particular AI doesn't cause a disaster) then that does look like a problem that depends on what kind of AI people will build. "Friendly AI" on the other hand is really a very different problem, where we're trying to figure out what kind of AI to build in order to minimize astronomical waste. I suspect this may explain the apparent disagreement, but I'm not sure. I'm hoping that explaining my own position more clearly will help figure out whether there is a real disagreement, and what's causing it.

The basic issue I see is that there is a large number of serious philosophical problems facing an AI that is meant to take over the universe in order to minimize astronomical waste. The AI needs a full solution to moral philosophy to know which configurations of particles/fields (or perhaps which dynamical processes) are most valuable and which are not. Moral philosophy in turn seems to have dependencies on the philosophy of mind, consciousness, metaphysics, aesthetics, and other areas. The FAI also needs solutions to many problems in decision theory, epistemology, and the philosophy of mathematics, in order to not be stuck with making wrong or suboptimal decisions for eternity. These essentially cover all the major areas of philosophy.

For an FAI builder, there are three ways to deal with the presence of these open philosophical problems, as far as I can see. (There may be other ways for the future to turn out well without the AI builders making any special effort, for example if being philosophical is just a natural attractor for any superintelligence, but I don't see any way to be confident of this ahead of time.) I'll name them for convenient reference, but keep in mind that an actual design may use a mixture of approaches.

Normative AI - Solve all of the philosophical problems ahead of time, and code the solutions into the AI.
Black-Box Metaphilosophical AI - Program the AI to use the minds of one or more human philosophers as a black box to help it solve philosophical problems, without the AI builders understanding what "doing philosophy" actually is.
White-Box Metaphilosophical AI - Understand the nature of philosophy well enough to specify "doing philosophy" as an algorithm and code it into the AI.

The problem with Normative AI, besides the obvious inherent difficulty (as evidenced by the slow progress of human philosophers after decades, sometimes centuries of work), is that it requires us to anticipate all of the philosophical problems the AI might encounter in the future, from now until the end of the universe. We can certainly foresee some of these, like the problems associated with agents being copyable, or the AI radically changing its ontology of the world, but what might we be missing?

Black-Box Metaphilosophical AI is also risky, because it's hard to test/debug something that you don't understand. Besides that general concern, designs in this category (such as Paul Christiano's take on indirect normativity) seem to require that the AI achieve superhuman levels of optimizing power before being able to solve its philosophical problems, which seems to mean that a) there's no way to test them in a safe manner, and b) it's unclear why such an AI won't cause disaster in the time period before it achieves philosophical competence.

White-Box Metaphilosophical AI may be the most promising approach. There is no strong empirical evidence that solving metaphilosophy is superhumanly difficult, simply because not many people have attempted to solve it. But I don't think that a reasonable prior combined with what evidence we do have (i.e., absence of visible progress or clear hints as to how to proceed) gives much reason for optimism either.

To recap, I think we can largely already see what kinds of problems must be solved in order to build a superintelligent AI that will minimize astronomical waste while colonizing the universe, and it looks like they probably can't be solved correctly with high confidence until humans become significantly smarter than we are now. I think I understand why some people disagree with me (e.g., Eliezer thinks these problems just aren't that hard, relative to his abilities), but I'm not sure why some others say that we don't yet know what the problems will be.

So after giving this issue some thought: I'm not sure to what extent a white-box metaphilosophical AI will actually be possible.

For instance, consider the Repugnant Conclusion. Derek Parfit considered some dilemmas in population ethics, put together possible solutions to them, and then noted that the solutions led to an outcome which again seemed unacceptable - but also unavoidable. Once his results had become known, a number of other thinkers started considering the problem and trying to find a way around those results.

Now, why was the Repugnant Conclusion considered unacceptable? For that matter, why were the dilemmas whose solutions led to the RC considered "dilemmas" in the first place? Not because any of them would have violated any logical rules of inference. Rather, we looked at them and thought "no, my morality says that that is wrong", and then (engaging in motivated cognition) began looking for a consistent way to avoid having to accept the result. In effect, our minds contained dynamics which rejected the RC as a valid result, but that rejection came from our subconscious values, not from any classical reasoning rule that you could implement in an algorithm. Or you could conceivably implement the rule in the algorithm if you had a thorough understanding of our values, but that's not of much help if the algorithm is supposed to figure out our values.

You can generalize this problem to all kinds of philosophy. In decision theory, we already have an intuitive value of what "winning" means, and are trying to find a way to formalize it in a way that fits our value. In epistemology, we have some standards about the kind of "truth" that we value, and are trying to come up with a system that obeys those standards. Etc.

The root problem is that classification and inference require values. As Watanabe (1974) writes:

According to the theorem of the Ugly Duckling, any pair of nonidentical objects share an equal number of predicates as any other pair of nonidentical objects, insofar as the number of predicates is finite [10], [12]. That is to say, from a logical point of view there is no such thing as a natural kind. In the case of pattern recognition, the new arrival shares the same number of predicates with any other paradigm of any class. This shows that pattern recognition is a logically indeterminate problem. The class-defining properties are generalizations of certain of the properties shared by the paradigms of the class. Which of the properties should be used for generalization is not logically defined. If it were logically determinable, then pattern recognition would have a definite answer in violation of the theorem of the Ugly Duckling.

This conclusion is somewhat disturbing because our empirical knowledge is based on natural kinds of objects. The source of the trouble lies in the fact that we were just counting the number of predicates in the foregoing, treating them as if they were all equally important. The fact is that some predicates are more important than some others. Objects are similar if they share a large number of important predicates.

Important in what scale? We have to conclude that a predicate is important if it leads to a classification that is useful for some purpose. From a logical point of view, a whale can be put together in the same box with a fish or with an elephant. However, for the purpose of building an elegant zoological theory, it is better to put it together with the elephant, and for classifying industries it is better to put it together with the fish. The property characterizing mammals is important for the purpose of theory building in biology, while the property of living in water is more important for the purpose of classification of industries.

The conclusion is that classification is a value-dependent task and pattern recognition is mechanically possible only if we smuggle into the machine the scale of importance of predicates. Alternatively, we can introduce into the machine the scale of distance or similarity between objects. This seems to be an innocuous set of auxiliary data, but in reality we are thereby telling the machine our value judgment, which is of an entirely extra-logical nature. The human mind has an innate scale of importance of predicates closely related to the sensory organs. This scale of importance seems to have been developed during the process of evolution in such a way as to help maintain and expand life [12], [14].
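The counting argument in the quoted passage can be made concrete with a small Python sketch. It is only an illustration under stated assumptions, not Watanabe's own formalism: a "predicate" is taken to be any subset of a finite set of objects (the objects it is true of), and the names `all_predicates`, `shared_weight`, `pairwise`, and the importance value 10 are hypothetical choices made for this example.

```python
from itertools import combinations

def all_predicates(objects):
    """Every extensional predicate over `objects`: one per subset (assumption)."""
    items = list(objects)
    return [frozenset(c)
            for r in range(len(items) + 1)
            for c in combinations(items, r)]

def shared_weight(a, b, predicates, importance=None):
    """Total importance of the predicates true of both a and b.

    With no importance scale, every predicate counts as 1.
    """
    importance = importance or {}
    return sum(importance.get(p, 1) for p in predicates if a in p and b in p)

def pairwise(objects, predicates, importance=None):
    """Similarity score for every unordered pair of objects."""
    return {pair: shared_weight(pair[0], pair[1], predicates, importance)
            for pair in combinations(objects, 2)}

animals = ["whale", "fish", "elephant"]
predicates = all_predicates(animals)

# Unweighted: every pair shares the same number of predicates, 2 ** (n - 2).
unweighted = pairwise(animals, predicates)

# An importance scale is the extra-logical input that breaks the tie.
is_mammal = frozenset({"whale", "elephant"})      # illustrative predicate
lives_in_water = frozenset({"whale", "fish"})     # illustrative predicate

for_biology = pairwise(animals, predicates, {is_mammal: 10})
for_industry = pairwise(animals, predicates, {lives_in_water: 10})

print(unweighted)
print(for_biology)
print(for_industry)
```

With equal weights all three pairs come out tied, so nothing in the logic groups the whale with either animal. Weighting `is_mammal` pulls the whale toward the elephant, and weighting `lives_in_water` pulls it toward the fish, which matches the quoted point that the grouping follows from the chosen scale of importance rather than from the predicates themselves.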

"Progress" in philosophy essentially means "finding out more about the kinds of things that we value, drawing such conclusions that our values say are correct and useful". I am not sure how one could make an AI make progress in philosophy if we didn't already have a clear understanding of what our values were, so "white-box metaphilosophy" seems to just reduce back to a combination of "normative AI" and "black-box metaphilosophy".

I always suspected that natural kinds depended on an underdetermined choice of properties, but I had no idea there was or could be a theorem saying so. Thanks for pointing this out.

Does a similar point apply to Solomonoff Induction? How does the minimum length of the program necessary to generate a proposition vary when we vary the properties our descriptive language uses?

Kaj_Sotala (13y):

Coincidentally, I ended up reading Evolutionary Psychology: Controversies, Questions, Prospects, and Limitations today, and noticed that it makes a number of points that could be interpreted in a similar light: humans do not really have a "domain-general rationality", and instead we have specialized learning and reasoning mechanisms, each of which is carrying out a specific evolutionary purpose and is specialized for extracting information that's valuable in light of the evolutionary pressures that (used to) prevail. In other words, each of them carries out inferences that are designed to further some specific evolutionary value that helped contribute to our inclusive fitness.

The paper doesn't spell out the obvious implication, since that isn't its topic, but it seems pretty clear to me: since our various learning and reasoning systems are based on furthering specific values, our philosophy has also been generated as a combination of such various value-laden systems, and we can't expect an AI reasoner to develop a philosophy that we'd approve of unless its reasoning mechanisms also embody the same values. That said, it does suggest a possible avenue of attack on the metaphilosophy issue... figure out exactly what various learning mechanisms we have and which evolutionary purposes they had, and then use that data to construct learning mechanisms that carry out similar inferences as humans do. Quotes: [...]
