Three Approaches to "Friendliness"

Wei Dai

I put "Friendliness" in quotes in the title, because I think what we really want, and what MIRI seems to be working towards, is closer to "optimality": create an AI that minimizes the expected amount of astronomical waste. In what follows I will continue to use "Friendly AI" to denote such an AI since that's the established convention.

I've often stated my objections to MIRI's plan to build an FAI directly (instead of after human intelligence has been substantially enhanced). But it's not because, as some have suggested while criticizing MIRI's FAI work, we can't foresee what problems need to be solved. I think it's because we can largely foresee what kinds of problems need to be solved to build an FAI, but they all look superhumanly difficult, either due to their inherent difficulty, or the lack of opportunity for "trial and error", or both.

When people say they don't know what problems need to be solved, they may be mostly talking about "AI safety" rather than "Friendly AI". If you think in terms of "AI safety" (i.e., making sure some particular AI doesn't cause a disaster) then that does look like a problem that depends on what kind of AI people will build. "Friendly AI" on the other hand is really a very different problem, where we're trying to figure out what kind of AI to build in order to minimize astronomical waste. I suspect this may explain the apparent disagreement, but I'm not sure. I'm hoping that explaining my own position more clearly will help figure out whether there is a real disagreement, and what's causing it.

The basic issue I see is that there is a large number of serious philosophical problems facing an AI that is meant to take over the universe in order to minimize astronomical waste. The AI needs a full solution to moral philosophy to know which configurations of particles/fields (or perhaps which dynamical processes) are most valuable and which are not. Moral philosophy in turn seems to have dependencies on the philosophy of mind, consciousness, metaphysics, aesthetics, and other areas. The FAI also needs solutions to many problems in decision theory, epistemology, and the philosophy of mathematics, in order to not be stuck with making wrong or suboptimal decisions for eternity. These essentially cover all the major areas of philosophy.

For an FAI builder, there are three ways to deal with the presence of these open philosophical problems, as far as I can see. (There may be other ways for the future to turn out well without the AI builders making any special effort, for example if being philosophical is just a natural attractor for any superintelligence, but I don't see any way to be confident of this ahead of time.) I'll name them for convenient reference, but keep in mind that an actual design may use a mixture of approaches.

Normative AI - Solve all of the philosophical problems ahead of time, and code the solutions into the AI.
Black-Box Metaphilosophical AI - Program the AI to use the minds of one or more human philosophers as a black box to help it solve philosophical problems, without the AI builders understanding what "doing philosophy" actually is.
White-Box Metaphilosophical AI - Understand the nature of philosophy well enough to specify "doing philosophy" as an algorithm and code it into the AI.

The problem with Normative AI, besides the obvious inherent difficulty (as evidenced by the slow progress of human philosophers after decades, sometimes centuries of work), is that it requires us to anticipate all of the philosophical problems the AI might encounter in the future, from now until the end of the universe. We can certainly foresee some of these, like the problems associated with agents being copyable, or the AI radically changing its ontology of the world, but what might we be missing?

Black-Box Metaphilosophical AI is also risky, because it's hard to test/debug something that you don't understand. Besides that general concern, designs in this category (such as Paul Christiano's take on indirect normativity) seem to require that the AI achieve superhuman levels of optimizing power before being able to solve its philosophical problems, which seems to mean that a) there's no way to test them in a safe manner, and b) it's unclear why such an AI won't cause disaster in the time period before it achieves philosophical competence.

White-Box Metaphilosophical AI may be the most promising approach. There is no strong empirical evidence that solving metaphilosophy is superhumanly difficult, simply because not many people have attempted to solve it. But I don't think that a reasonable prior combined with what evidence we do have (i.e., absence of visible progress or clear hints as to how to proceed) gives much reason for optimism either.

To recap, I think we can largely already see what kinds of problems must be solved in order to build a superintelligent AI that will minimize astronomical waste while colonizing the universe, and it looks like they probably can't be solved correctly with high confidence until humans become significantly smarter than we are now. I think I understand why some people disagree with me (e.g., Eliezer thinks these problems just aren't that hard, relative to his abilities), but I'm not sure why some others say that we don't yet know what the problems will be.

When you say "fast takeoff" do you mean the speed of the takeoff (how long it takes from start to superintelligence) or the timing of it (how far away it is from now)?

I mean speed. It seems like you are relying on an assumption of a rapid transition from a world like ours to a world dominated by superhuman AI, whereas typically I imagine a transition that lasts at least years (which is still very fast!) during which we can experiment with things, develop new approaches, etc. In this regime many more approaches are on the table.

Superintelligent AIs controlled by human owners, even if it's possible, seem like a terrible idea, because humans aren't smart or wise enough to handle such power without hurting themselves. I wouldn't even trust myself to control such an AI, much less a more typical, less reflective human.

It seems like you are packing a wide variety of assumptions in here, particularly about the nature of control and about the nature of the human owners.

or by quickly bootstrapping to a better prepared society

Not sure what you mean by this. Can you expand?

Even given shaky solutions to the control problem, it's not obvious that you can't move quickly to a much better prepared society, via better solutions to the control problem, further AI work, brain emulations, significantly better coordination or human enhancement, etc.

Regarding your parenthetical "because of", I think the "need" to design such a singleton comes from the present opportunity to build such a singleton, which may not last. For example, suppose your scenario of superintelligent AIs controlled by human owners becomes reality (putting aside my previous objection). At that time we can no longer directly build a singleton, and those AI/human systems may not be able to, or want to, merge into a singleton. They may instead just spread out into the universe in an out-of-control manner, burning the cosmic commons as they go.

This is an interesting view (in that it isn't what I expected). I don't think that the AIs are doing any work in this scenario, i.e., if we just imagined normal humans going on their way without any prospect of building much smarter descendants, you would make similar predictions for similar reasons? If so, this seems unlikely given the great range of possible coordination mechanisms many of which look like they could avert this problem, the robust historical trends in increasing coordination ability and scale of organization, etc. Are there countervailing reasons to think it is likely, or even very plausible? If not, I'm curious about how the presence of AI changes the scenario.

There are all kinds of ways for this to go badly wrong, which have been extensively discussed by Eliezer and others on LW. To summarize, the basic problem is that human concepts are too fuzzy and semantically dependent on how human cognition works. Given the complexity and fragility of value and the likely alien nature of AI cognition, it's unlikely that AIs will share our concepts closely enough for them to obtain a sufficiently accurate model of "what I would want" through this method. (ETA: Here is a particularly relevant post by Eliezer.)

I don't find these arguments particularly compelling as a case for "there is very likely to be a problem," though they are more compelling as an indication of "there might be a problem."

Fragility and complexity of value doesn't seem very relevant. The argument is never that you can specify value directly. Instead we are saying that you can capture concepts about respecting intentions, offering further opportunities for reflection, etc. (or in the most extreme case, concepts about what we would want upon reflection). These concepts are also fragile, which is why there is something to discuss here.
There are many concepts that seem useful (and perhaps sufficient) which seem to be more robust and not obviously contingent on human cognition, such as deference, minimal influence, intentions, etc. In particular, we might expect that we can formulate concepts in such a way that they are unambiguous in our current environment, and then maintain them. Whether you can get access to those concepts, or use them in a useful enough way, is again not clear.
The arguments given there (and elsewhere) just don't consider most of the things you would actually do, even the ones we can currently foresee. This is a special case of the next point. For example, if an agent is relatively risk averse, and entertains uncertainty about what is "good," then it may tend to pick a central example from the concept of good instead of an extreme one (depending on details of the specification, but it is easy to come up with specifications that do this). So saying "you always get extreme examples of a concept when you use it as a value for a goal-seeking agent" is an interesting observation and a cause for concern, but it is so far from a tight argument that I don't even think of it as trying.
All of the arguments here are extremely vague (on both sides). Again, this is fine if we want to claim "there may be a problem." Indeed, I would even agree that any particular proposal is very unlikely to work, and any class of proposals is pretty unlikely to work, etc. (I would say the same thing about approaches to AI itself). But it seems like this doesn't entitle us to claim "there is definitely a problem," especially to the extent that we are relying on the conjunction of many claims of the form "This won't look robustly viable once we know more."

In general, it seems that the burden of proof is on someone who claims "Surely X" in an environment which is radically unlike any environment we have encountered before. I don't think that any very compelling arguments have been offered here, just vague gesturing. I think it's possible that we should focus on some of these pessimistic possibilities because we can have a larger impact there. But your (and Eliezer's) claims go further than this, suggesting that it isn't worth investing in interventions that would modestly improve our ability to cope with difficulties (respectively clarifying understanding of AI and human empowerment, both of which slightly speed up AI progress), because the probability is so low. I think this is a plausible view, but it doesn't look like the evidence supports it to me.

Since you seem to bring up ideas that others have already considered and rejected, I wonder if perhaps you're underestimating how much we've thought about this? (Or were you already aware of their rejection and just wanted to indicate your disagreement?)

I'm certainly aware of the points you've raised, and at least a reasonable fraction of the thinking that has been done in this community on these topics. Again, I'm happy with these arguments (and have made many of them myself) as a good indication that the issue is worth taking seriously. But I think you are taking this "rejection" much too seriously in this context. If someone said "maybe X will work" and someone else said "maybe X won't work," I won't then leave X off of (long) lists of reasons why things might work, even if I agreed with them.

This is getting a bit too long for a point-by-point response, so I'll pick what I think are the most productive points to make. Let me know if there's anything in particular you'd like a response on.

It seems like you are relying on an assumption of a rapid transition from a world like ours to a world dominated by superhuman AI.

I try not to assume this, but quite possibly I'm being unconsciously biased in that direction. If you see any place where I seem to be implicitly assuming this, please point it out, but I think my argument applies even if the tr...
