Three Approaches to "Friendliness"

Wei Dai

I put "Friendliness" in quotes in the title, because I think what we really want, and what MIRI seems to be working towards, is closer to "optimality": create an AI that minimizes the expected amount of astronomical waste. In what follows I will continue to use "Friendly AI" to denote such an AI since that's the established convention.

I've often stated my objections to MIRI's plan to build an FAI directly (instead of after human intelligence has been substantially enhanced). But my objection is not, as some have suggested while criticizing MIRI's FAI work, that we can't foresee what problems need to be solved. Rather, I think we can largely foresee what kinds of problems need to be solved to build an FAI, but they all look superhumanly difficult, either due to their inherent difficulty, or the lack of opportunity for "trial and error", or both.

When people say they don't know what problems need to be solved, they may be mostly talking about "AI safety" rather than "Friendly AI". If you think in terms of "AI safety" (i.e., making sure some particular AI doesn't cause a disaster) then that does look like a problem that depends on what kind of AI people will build. "Friendly AI" on the other hand is really a very different problem, where we're trying to figure out what kind of AI to build in order to minimize astronomical waste. I suspect this may explain the apparent disagreement, but I'm not sure. I'm hoping that explaining my own position more clearly will help figure out whether there is a real disagreement, and what's causing it.

The basic issue I see is that there is a large number of serious philosophical problems facing an AI that is meant to take over the universe in order to minimize astronomical waste. The AI needs a full solution to moral philosophy to know which configurations of particles/fields (or perhaps which dynamical processes) are most valuable and which are not. Moral philosophy in turn seems to have dependencies on the philosophy of mind, consciousness, metaphysics, aesthetics, and other areas. The FAI also needs solutions to many problems in decision theory, epistemology, and the philosophy of mathematics, in order to not be stuck with making wrong or suboptimal decisions for eternity. These essentially cover all the major areas of philosophy.

For an FAI builder, there are three ways to deal with the presence of these open philosophical problems, as far as I can see. (There may be other ways for the future to turn out well without the AI builders making any special effort, for example if being philosophical is just a natural attractor for any superintelligence, but I don't see any way to be confident of this ahead of time.) I'll name them for convenient reference, but keep in mind that an actual design may use a mixture of approaches.

Normative AI - Solve all of the philosophical problems ahead of time, and code the solutions into the AI.
Black-Box Metaphilosophical AI - Program the AI to use the minds of one or more human philosophers as a black box to help it solve philosophical problems, without the AI builders understanding what "doing philosophy" actually is.
White-Box Metaphilosophical AI - Understand the nature of philosophy well enough to specify "doing philosophy" as an algorithm and code it into the AI.

The problem with Normative AI, besides the obvious inherent difficulty (as evidenced by the slow progress of human philosophers after decades, sometimes centuries of work), is that it requires us to anticipate all of the philosophical problems the AI might encounter in the future, from now until the end of the universe. We can certainly foresee some of these, like the problems associated with agents being copyable, or the AI radically changing its ontology of the world, but what might we be missing?

Black-Box Metaphilosophical AI is also risky, because it's hard to test/debug something that you don't understand. Besides that general concern, designs in this category (such as Paul Christiano's take on indirect normativity) seem to require that the AI achieve superhuman levels of optimizing power before being able to solve its philosophical problems, which seems to mean that a) there's no way to test them in a safe manner, and b) it's unclear why such an AI won't cause disaster in the time period before it achieves philosophical competence.

White-Box Metaphilosophical AI may be the most promising approach. There is no strong empirical evidence that solving metaphilosophy is superhumanly difficult, simply because not many people have attempted to solve it. But I don't think that a reasonable prior combined with what evidence we do have (i.e., absence of visible progress or clear hints as to how to proceed) gives much reason for optimism either.

To recap, I think we can largely already see what kinds of problems must be solved in order to build a superintelligent AI that will minimize astronomical waste while colonizing the universe, and it looks like they probably can't be solved correctly with high confidence until humans become significantly smarter than we are now. I think I understand why some people disagree with me (e.g., Eliezer thinks these problems just aren't that hard, relative to his abilities), but I'm not sure why some others say that we don't yet know what the problems will be.

Coincidentally, I ended up reading Evolutionary Psychology: Controversies, Questions, Prospects, and Limitations today, and noticed that it makes a number of points that could be interpreted in a similar light: humans do not really have a "domain-general rationality"; instead we have specialized learning and reasoning mechanisms, each of which carries out a specific evolutionary purpose and is specialized for extracting information that's valuable in light of the evolutionary pressures that (used to) prevail. In other words, each of them carries out inferences that are designed to further some specific evolutionary value that helped contribute to our inclusive fitness.

The paper doesn't spell out the obvious implication, since that isn't its topic, but it seems pretty clear to me: since our various learning and reasoning systems are based on furthering specific values, our philosophy has also been generated as a combination of such various value-laden systems, and we can't expect an AI reasoner to develop a philosophy that we'd approve of unless its reasoning mechanisms also embody the same values.

That said, it does suggest a possible avenue of attack on the metaphilosophy issue... figure out exactly what various learning mechanisms we have and which evolutionary purposes they had, and then use that data to construct learning mechanisms that carry out inferences similar to those humans make.

Quotes:

Hypotheses about motivational priorities are required to explain empirically discovered phenomena, yet they are not contained within domain-general rationality theories. A mechanism of domain-general rationality, in the case of jealousy, cannot explain why it should be “rational” for men to care about cues to paternity certainty or for women to care about emotional cues to resource diversion. Even assuming that men “rationally” figured out that other men having sex with their mates would lead to paternity uncertainty, why should men care about cuckoldry to begin with? In order to explain sex differences in motivational concerns, the “rationality” mechanism must be coupled with auxiliary hypotheses that specify the origins of the sex differences in motivational priorities. [...]

The problem of combinatorial explosion. Domain-general theories of rationality imply a deliberate calculation of ends and a sample space of means to achieve those ends. Performing the computations needed to sift through that sample space requires more time than is available for solving many adaptive problems, which must be solved in real time. Consider a man coming home from work early and discovering his wife in bed with another man. This circumstance typically leads to immediate jealousy, rage, violence, and sometimes murder (Buss, 2000; Daly & Wilson, 1988). Are men pausing to rationally deliberate over whether this act jeopardizes their paternity in future offspring and ultimate reproductive fitness, and then becoming enraged as a consequence of this rational deliberation? The predictability and rapidity of men’s jealousy in response to cues of threats to paternity points to a specialized psychological circuit rather than a response caused by deliberative domain-general rational thought. Dedicated psychological adaptations, because they are activated in response to cues to their corresponding adaptive problems, operate more efficiently and effectively for many adaptive problems. A domain-general mechanism “must evaluate all alternatives it can define. Permutations being what they are, alternatives increase exponentially as the problem complexity increases” (Cosmides & Tooby, 1994, p. 94). Consequently, combinatorial explosion paralyzes a truly domain-general mechanism (Frankenhuis & Ploeger, 2007). [...]

In sum, domain-general mechanisms such as “rationality” fail to provide plausible alternative explanations for psychological phenomena discovered by evolutionary psychologists. They are invoked post hoc, fail to generate novel empirical predictions, fail to specify underlying motivational priorities, suffer from paralyzing combinatorial explosion, and imply the detection of statistical regularities that cannot be, or are unlikely to be, learned or deduced ontogenetically. It is important to note that there is no single criterion for rationality that is independent of adaptive domain. [...]

The term learning is sometimes used as an explanation for an observed effect and is the simple claim that something in the organism changes as a consequence of environmental input. Invoking “learning” in this sense, without further specification, provides no additional explanatory value for the observed phenomenon but only regresses its cause back a level. Learning requires evolved psychological adaptations, housed in the brain, that enable learning to occur: “After all, 3-pound cauliflowers do not learn, but 3-pound brains do” (Tooby & Cosmides, 2005, p. 31). The key explanatory challenge is to identify the nature of the underlying learning adaptations that enable humans to change their behavior in functional ways as a consequence of particular forms of environmental input.

Although the field of psychology lacks a complete understanding of the nature of these learning adaptations, enough evidence exists to draw a few reasonable conclusions. Consider three concrete examples: (a) People learn to avoid having sex with their close genetic relatives (learned incest avoidance); (b) people learn to avoid eating foods that may contain toxins (learned food aversions); (c) people learn from their local peer group which actions lead to increases in status and prestige (learned prestige criteria). There are compelling theoretical arguments and empirical evidence that each of these forms of learning is best explained by evolved learning adaptations that have at least some specialized design features, rather than by a single all-purpose general learning adaptation (Johnston, 1996). Stated differently, evolved learning adaptations must have at least some content-specialized attributes, even if they share some components. [...]

These three forms of learning—incest avoidance, food aversion, and prestige criteria—require at least some content-specific specializations to function properly. Each operates on the basis of inputs from different sets of cues: coresidence during development, nausea paired with food ingestion, and group attention structure. Each has different functional output: avoidance of relatives as sexual partners, disgust at the sight and smell of specific foods, and emulation of those high in prestige. It is important to note that each form of learning solves a different adaptive problem.

There are four critical conclusions to draw from this admittedly brief and incomplete analysis. First, labeling something as “learned” does not, by itself, provide a satisfactory scientific explanation any more than labeling something as “evolved” does; it is simply the claim that environmental input is one component of the causal process by which change occurs in the organism in some way. Second, “learned” and “evolved” are not competing explanations; rather, learning requires evolved psychological mechanisms, without which learning could not occur. Third, evolved learning mechanisms are likely to be more numerous than traditional conceptions have held in psychology, which typically have been limited to a few highly general learning mechanisms such as classical and operant conditioning. Operant and classical conditioning are important, of course, but they contain many specialized adaptive design features rather than being domain general (Ohman & Mineka, 2003). And fourth, evolved learning mechanisms are at least somewhat specific in nature, containing particular design features that correspond to evolved solutions to qualitatively distinct adaptive problems.
