Wei_Dai comments on Three Approaches to "Friendliness" - Less Wrong
If I understand correctly, in order for your designs to work, you must first have a question-answerer or predictor that is much more powerful than a human (i.e., can answer much harder questions than a human can). For example, you are assuming that the AI would be able to build a very accurate model of an arbitrary human overseer from sense data and historical responses and predict their "considered judgements", which is a superhuman ability. My concern is that when you turn on such an AI in order to test it, it might either do nothing useful (i.e., output very low quality answers that give no insight into how safe it would eventually be) because it's not powerful enough to model the overseer, or FOOM out of control due to a bug in the design or implementation and the amount of computing power it has. (Also, how are you going to stop others from making use of such powerful answerers/predictors in a less safe, but more straightforward and "efficient" way?)
With a white-box metaphilosophical AI, if such a thing was possible, you could slowly increase its power and hopefully observe a corresponding increase in the quality of its philosophical output, while fixing any bugs that are detected and knowing that the overall computing power it has is not enough for it to vastly outsmart humans and FOOM out of control. It doesn't seem to require access to superhuman amounts of computing power just to start to test its safety.
I don’t think that the question-answerer or reinforcement learner needs to be superhuman. I describe them as using human-level abilities rather than superhuman abilities, and it seems like they could also work with subhuman abilities. Concretely, if we imagine applying those designs with a human-level intelligence acting in the interests of a superhuman overseer, they seem (to me) to work fine. I would be interested in problems you see with this use case.
Your objection to the question-answering system seemed to be that the AI may not recognize that human utterances are good evidence about what the overseer would ultimately do (even if they were), and that it might not be possible or easy to teach this. If I’m remembering right and this is still the problem you have in mind, I’m happy to disagree about it in more detail. But it seems that this objection couldn’t really apply to the reinforcement learning approach.
It seems like these systems could be within a small factor of optimal efficiency (certainly within a factor of 2, say, but hopefully much closer). I would consider a large efficiency loss to be failure.
The AI needs to predict what the human overseer "wants" from it, i.e., what answers the human would score highly. If I was playing the role of such an AI, I could use the fact that I am myself a human and think similarly to the overseer, and ask myself, "If I was in the overseer's position, what answers would I judge highly?" In particular, I could use the fact that I likely have philosophical abilities similar to the overseer's, and could just apply my native abilities to satisfy the overseer. I do not have to first build a detailed model of the overseer from scratch and then run that model to make predictions. It seems to me that the AI in your design would have to build such a model, and doing so seems a superhuman feat. In other words, if I did not already have native philosophical abilities on par with the overseer's, I couldn't give answers to any philosophical questions that the overseer would find helpful, unless I had the superhuman ability to create a model of the overseer, including his philosophical abilities, from scratch.
Suppose that you are the AI, and the overseer is a superintelligent alien with very different values and philosophical views. How well do you think that things will end up going for the alien? (Assuming you are actually trying to win at the RL / question-answering game.)
It seems to me like you can pursue the aliens' values nearly as well as if they were your own. So I'm not sure where we disagree (assuming you don't find this thought experiment convincing).
I think that while my intelligence is not greater than the alien's, I would probably do the thing that you suggested, "don't do anything the user would find terrible; acquire resources; make sure the user remains safe and retains effective control over those resources", but if the aliens were to start to trust me enough to upgrade my cognitive abilities to be above theirs, I could very well end up causing disaster (from their perspective) either by subtly misunderstanding some fine point of their values/philosophical views (*), or by subverting the system through some design or implementation flaw. The point is that my behavior while my abilities are less than super-alien is not a very good indication of how safe I will eventually be.
(*) To expand on this, suppose that as my cognitive abilities increase, I develop increasingly precise models of the alien, and at some point I decide that I can satisfy the alien's values better by using resources directly instead of letting the alien retain control (i.e., I could act more efficiently this way and I think that my model is as good as the actual alien), but it turns out that I'm wrong about how good my model is, and end up acting on a subtly-but-disastrously wrong version of the alien's values / philosophical views.
I discuss the most concerning-to-me instance of this in problem (1) here; it seems like that discussion applies equally well to anything that might work fine at first but then break when you become a sufficiently smart reasoner.
I think the basic question is whether you can identify and exploit such flaws at exactly the same time that you recognize their possibility, or whether you can notice them slightly before. By “before” I mean with a version of you that is less clever, has less time to think, has a weaker channel to influence the world, or is treated with more skepticism and caution.
If any of these versions of you can identify the looming problem in advance, and then explain it to the aliens, then they can correct the problem. I don’t know if I’ve ever encountered a possible flaw that wasn’t noticeable “before” it was exploitable in one of these senses. But I may just be overlooking them, and of course even if we can’t think of any it’s not such great reassurance.
Of course even if you can’t identify such flaws, you can preemptively improve the setup for the aliens, in advance of improving your own cognition. So it seems like we never really care about the case where you are radically smarter than the designer of the system, we care about the case where you are very slightly smarter. (Unless this system-improvement is a significant fraction of the difficulty of actually improving your cognition, which seems far-fetched.)
Other than the issue from the first part of this comment, I don't really see why the behavior changes (in a way that invalidates early testing) when you become super-alien in some respects. It seems like you are focusing on errors you may make that would cause you to receive a low payoff in the RL game. As you become smarter, I expect you to make fewer such errors. I certainly don't expect you to predictably make more of them.
(I understand that this is a bit subtle, because as you get smarter the problem also may get harder, since your plans will e.g. be subject to more intense scrutiny and to more clever counterproposals. But that doesn't seem prone to lead to the kinds of errors you discuss.)
Paraphrasing, I think you're saying that, if the reinforcement game setup continues to work, you expect to make fewer errors as you get smarter. And the only way getting smarter hurts you is if it breaks the game (by enabling you to fall into traps faster than you can notice and avoid them).
Is that right?
I don’t see how this could predictably happen without something else going wrong first. I agree and acknowledged that the question-learning solution is hard-to-test, so let’s focus on the RL approach. (Though I also don’t expect this to happen for the question-answering solution.) In this comment I’ll focus on the misunderstanding case.
So in the future, you expect to predictably make a decision which the aliens would consider catastrophically bad. It seems to me like:
If the solution would really be considered catastrophically bad, and it is chosen for evaluation, then it will receive a very low payoff, unless the scheme fails in some other way that we have not yet discussed.
So you would only make such mistakes if you thought that you would receive enough expected benefit from more aggressive decisions that it offsets this predictable possibility of a low payoff from catastrophic error.
But if you took more conservative actions, you could justify those actions (when they were evaluated) by explaining the predicted possibility of a catastrophic outcome. Unless something else has gone wrong, the aliens care more about averting this prospect of a bad outcome than saving time by you being more aggressive, so they shouldn’t penalize you for this.
So if you behave aggressively even at the risk of a catastrophic error, it seems like one of the following must have gone wrong:
1. In fact the aliens wouldn't be able to detect a catastrophic error during evaluation.
2. The conservative policy is actually worse than the aggressive policy in expectation, based on the considered judgment of the aliens.
3. The aliens wouldn't accept the justification for conservatism, based on a correct argument that its costs are outweighed by the possibility for error. (Or this argument is wrong, or else it's right but you wouldn't recognize this argument or something like it.)
Any of these could happen. 1 and 3 seem like they lead to more straightforward problems with the scheme, so would be worthwhile to explore on other grounds. 2 doesn’t seem likely to me, unless we are dealing with a very minor catastrophe. But I am open to arguing about it. The basic question seems to be how tough it is to ask the aliens enough questions to avoid doing anything terrible.
The examples you give in the parallel thread don't seem like they could present a big problem; you can ask the alien a modest number of questions like "how do you feel about the tradeoff between the world being destroyed and you controlling less of it?" And you can help to the maximum possible extent in answering them. Of course the alien won't have perfect answers, but their situation seems better than the situation prior to building such an AI, when they were also making such tradeoffs imperfectly (presumably even more imperfectly, unless you are completely unhelpful to the aliens for answering such questions). And there don't seem to be many plans where the cost of implementing the plan is greater than the cost of consulting the alien about how it feels about possible consequences of that plan.
Of course you can also get this information in other ways (e.g. look at writings and past behavior of the aliens) or ask more open-ended questions like "what are the most likely ways things could go wrong, given what I expect to do over the next week," or pursue compromise solutions that the aliens are unlikely to consider too objectionable.
ETA: actually it's fine if the catastrophic plan is not evaluated badly; all of the work can be done in the step where the aliens prefer conservative plans to aggressive ones in general, after you explain the possibility of a catastrophic error.
What if this is true, because other aliens (people) have similar AIs, so the aggressive policy is considered better, in a PD-like game theoretic sense, but it would have been better for everyone if nobody had built such AIs?
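To make the PD-like structure concrete, here is a toy payoff matrix (the numbers and the two-lab framing are purely illustrative assumptions of mine, not anything from the thread):

```python
# Two labs each choose "C" (conservative) or "A" (aggressive) deployment.
# Payoffs are (row lab, column lab); higher is better. Numbers are made up.
payoffs = {
    ("C", "C"): (3, 3),  # both forgo the risky speedup
    ("C", "A"): (0, 4),  # the aggressive lab outcompetes the conservative one
    ("A", "C"): (4, 0),
    ("A", "A"): (1, 1),  # race dynamics: worse for both than (C, C)
}
# "A" strictly dominates "C" for each lab, yet (A, A) is worse for everyone
# than (C, C): the sense in which it can be individually rational to build
# such AIs even though it would have been better if nobody had.
```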
With any of the black-box designs I've seen, I would be very reluctant to push the button that would potentially give it superhuman capabilities, even if we have theoretical reasons to think that it would be safe, and we've fixed all the problems we've detected while testing at lower levels of computing power. There are too many things that could go wrong with such theoretical reasoning, and easily many more flaws that won't become apparent until the system becomes smarter. Basically the only reason to do it would be time pressure, due to the AI race or something else. (With other kinds of FAI designs, i.e., normative and white-box metaphilosophical, it seems that we can eventually be more confident about their safety but they are harder to design and implement in the first place, so we should wait for them if we have the option to.) Do you agree with this?
In some sense I agree. If there were no time pressure, then we would want to proceed in only the very safest way possible, which would not involve AI at all. My best guess would be to do a lot of philosophical and strategic thinking as unmodified and essentially unaided humans, perhaps for a very very long time. After that you might decide on a single, maximally inoffensive computational aid, and then repeat. But this seems like quite an alien scenario!
I am not sold that in milder cases you would be much better off with e.g. a normative AI than black box designs. Why is it less error prone? It seems like normative AI must perform well across a wide range of unanticipated environments, to a much greater extent than with black box designs, and with clearer catastrophic consequences for failure. It seems like you would want to do something that remains under the control of something as close to a human as possible, for as long as possible.
In some sense the black box approach is clearly more dangerous (ignoring time limits), since it doesn't really get you closer to your goal. We will probably have to solve these other problems eventually. The black box metaphilosophical AI is really more like a form of cognitive enhancement. But it seems like enhancement is basically the right thing to do for now, even if we make the time crunch quite a bit milder.
I agree that you need to rely extensively on empirical evidence to gain confidence in your system (though I think this is probably true for other plausible approaches as well).
I don't fully understand the argument against using such empirical evidence, or rather I don't see how to make the argument go through without the prospect of a treacherous turn, which we haven't addressed at all.
I agree that there are kinds of problems which do not arise when testing with weak systems, and these problems cannot be detected empirically until you are working with stronger systems. The main question is whether you can debug these problems while working with stronger systems.
For concreteness, imagine a bug in your system for rendering answers; a malicious answer can compromise the computer and assign the answerer a high reward. Let's be pessimistic, and assume that the easiest way to compromise the computer is to create a devourer that will be released into the world, that is as smart as the answerer itself, and whose preferences are maximally destructive.
(This is going to repeat ideas from the great-uncle...)
It seems like this possibility is not catastrophic.
Before:
The attack can probably be noticed by the only slightly weaker agents which helped you design the current system. The only problem is if the attack cannot be noticed by any agent who isn't clever enough to realize that such an attack is possible.
You can sanitize the agent's answers using a similarly intelligent agent with more restricted outputs, e.g. one who just says "possible exploit" or "not a possible exploit" (see the sketch below).
The execution environment itself can make extensive use of similarly smart agents, and is therefore less likely to be vulnerable to an attack even if inexpertly designed.
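Here is a minimal sketch of that sanitization step (my own toy illustration; in the scheme described, the judge would itself be a similarly smart agent, which I stand in for with a trivial heuristic):

```python
def restricted_judge(answer: str) -> str:
    """Stand-in for the similarly intelligent agent with restricted outputs;
    it may only ever say 'possible exploit' or 'not a possible exploit'."""
    suspicious = any(tag in answer for tag in ("<script", "\x00", "%n"))
    return "possible exploit" if suspicious else "not a possible exploit"

def render_answer(answer: str) -> str:
    # Flagged answers never reach the (possibly buggy) rendering system,
    # so a malicious answer cannot use a rendering exploit to seize reward.
    if restricted_judge(answer) == "possible exploit":
        return "[answer withheld: flagged as a possible exploit]"
    return answer
```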
After:
The basic issue is that agent Z is preceded by nearly-as-smart agent Y, and so the prospects for danger are not that great if agent Z (and only agent Z) fails. The argument for catastrophic risks seems to come from the big disparity in power between the AI being controlled and the human overseer.
The concern seems to rest on the prospect of a treacherous turn: if agent Z is clever enough, then agent Y will also defect rather than helping you contain agent Z, even if in earlier tests agent Y seemed to be benign.
But I don't see why this would happen for the RL system, and it seems like that's the crux of the argument. I agree that it would happen for the question-answering system (I was the first to admit that the question-answering system was hard to test).
You may have other kinds of difficulties in mind, but all of the ones that I can think of seem to rest on a treacherous turn or something similar. Is there some other reason to expect failure to be catastrophic?
I'm not pointing out any specific reasons, but just expect that in general, failures when dealing with large amounts of computing power can easily be catastrophic. You have theoretical arguments for why they won't be, given a specific design, but again I am skeptical of such arguments in general.
I agree there is some risk that cannot be removed with either theoretical arguments or empirical evidence. But why is it greater for this kind of AI than any other, and in particular than white-box metaphilosophical or normative AI?
Normative AI seems like by far the worst, since (as I said above) it must perform well across a wide range of unanticipated environments, to a much greater extent than black box designs, and with clearer catastrophic consequences for failure.
So in that case we have particular concrete reasons to think that empirical testing won't be adequate, in addition to the general concern that empirical testing and theoretical argument is never sufficient. To me, white box metaphilosophical AI seems somewhere in between.
(One complaint is that I just haven't given an especially strong theoretical argument. I agree with that, and I hope that whatever systems people actually use, they are backed by something more convincing. But the current state of the argument seems like it can't point in any direction other than in favor of black box designs, since we don't yet have any arguments at all that any other kind of system could work.)
It seems like the question is: "How much more productive is the aggressive policy?"
It looks to me like the answer is "Maybe it's 1% cheaper or something, though probably less." In this case, it doesn't seem like the AI itself is introducing (much of) a PD situation, and the coordination problem can probably be solved.
I don't know whether you are disagreeing about the likely cost of the aggressive policy, or the consequences of slight productivity advantages for the aggressive policy. I discuss this issue a bit here, a post I wrote a few days ago but just got around to posting.
Of course there may be orthogonal reasons that the AI faces PD-like problems, e.g. it is possible to expand in an undesirably destructive way by building an unrelated and dangerous technology. Then either:
1. The alien user would want to coordinate in the prisoner's dilemma. In this case, the AI will coordinate as well (unless it makes an error leading to a lower reward).
2. The alien user doesn't want to coordinate in the prisoner's dilemma. But in this case, the problem isn't with the AI at all. If the users hadn't built AI they would have faced the same problem.
I don't know which of these you have in mind. My guess is you are thinking of (2) if anything, but this doesn't really seem like an issue to do with AI control. Yes, the AI may have a differential effect on e.g. the availability of destructive tech and our ability to coordinate, and yes, we should try encourage differential progress in AI capabilities just like we want to encourage differential progress in society's capabilities more broadly. But I don't see how any solution to the AI control problem is going to address that issue, nor does it seem especially concerning when compared to the AI control problem.
Maybe we have different things in mind for "aggressive policy". I was thinking of something like "give the AI enough computing power to achieve superhuman intelligence so it can hopefully build a full-fledged FAI for the user" vs. the "conservative policy" of "keep the AI at its current level where it seems safe, and find another way to build an FAI".
A separate but related issue is that it appears such an AI can be either a relatively safe or unsafe AI, depending on the disposition of the overseer (since an overseer less concerned with safety would be more likely to approve of potentially unsafe modifications to the AI). In a sidenote of the linked article, you wrote about why unsafe but more efficient AI projects won't overtake the safer AI projects in AI research.
But how will the safe projects exclude the unsafe projects from economies of scale and favorable terms of trade, if the unsafe projects are using the same basic design but just have overseers who care more about capability than safety?
Controlling the distribution of AI technology is one way to make someone's life harder, but it's not the only way. If we imagine a productivity gap as small as 1%, it seems like it doesn't take much to close it.
(Disclaimer: this is unusually wild speculation; nothing I say is likely to be true, but hopefully it gives the flavor.)
If unsafe projects perfectly pretend to be safe projects, then they aren't being more efficient. So it seems like we can assume that they are observably different from safe projects. (For example, there can't just be complexity-loving humans who oversee projects exactly as if they had normal values; they need to skimp on oversight in order to actually be more efficient. Or else they need to differ in some other way...) If they are observably different, then there are a range of possible measures to make their lives harder, such as controlling the distribution of AI technology (as mentioned above).
Of course unsafe projects can go to greater lengths in order to avoid these issues, for example by moving to friendlier jurisdictions or operating a black market in unsafe technology. But as these measures become more extreme they become increasingly easy to identify. If unsafe jurisdictions and black markets have only a few percent of the population of the world, then it's easy to see how they could be less efficient.
(I'd also expect e.g. unsafe jurisdictions to quickly cave under international pressure, if the rents they could extract were a fraction of a percent of total productivity. They could easily be paid off, and if they didn't want to be paid off, they would not be militarily competitive.)
All of these measures become increasingly implausible at large productivity differentials. And I doubt that any of these particular foreseeable measures will be important. But overall, given that there are economies of scale, I find it very likely that the majority can win. The main question is whether they care enough to.
Normally I am on the other side of a discussion similar to this one, but involving much larger posited productivity gaps and a more confident claim (things are so likely to be OK that it's not worth worrying about safety). Sorry if you were imagining a very much larger gap, so that this discussion isn't helpful. And I do agree that there is a real possibility that things won't be OK, even for small productivity gaps, but I feel like it's more likely than not to be OK.
Also note that at a 1% gap, we can basically wait it out. If 10% of the world starts out malicious, then by the time the economy has grown 1000x, about 11% of the world is malicious, and it seems implausible that the AI situation won't change during that time; certainly contemporary thinking about AI will be obsoleted, in an economic period as long as 0-2015 AD. (The discussion of social coordination is more important in the case where there are larger efficiency gaps, and hence probably larger differences in how the projects look and what technology they need.)
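A quick sanity check of that arithmetic, under an assumed toy model (mine, not from the comment) where the malicious sector simply compounds 1% faster than the safe sector:

```python
# Solve 0.9*g + 0.1*g**1.01 = 1000 for the safe sector's growth factor g
# by fixed-point iteration; the malicious sector then grows by g**1.01.
start_share = 0.10   # malicious fraction of the economy at the start
growth = 1000.0      # overall growth factor of the economy
edge = 1.01          # malicious sector compounds 1% faster

g = growth
for _ in range(50):
    g = (growth - start_share * g**edge) / (1 - start_share)

malicious = start_share * g**edge
print(f"final malicious share: {malicious / growth:.3f}")  # ~0.106, i.e. ~11%
```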
ETA: Really the situation is not so straightforward, since 1% more productivity leads to more than 1% more profit; overall this issue really seems too complicated for this kind of vague theoretical speculation to be meaningfully accurate, but I hope I've given the basic flavor of my thinking.
And finally, I intended 1% as a relatively conservative estimate. I don't see any particular reason you need to have so much waste, and I wouldn't be surprised if it ends up much lower, if future people end up pursuing some strategy along these lines.
Your examples of possible mistakes seemed to involve not knowing how the alien would feel about particular tradeoffs. This doesn't seem related to how much computational power you have, except insofar as having more power might lead you to believe that it is safe to try and figure out what the alien thinks from first principles. But that's not a necessary consequence of having more computing power, and I gave an argument that more computing power shouldn't predictably lead to trouble.
Why do you think that more computing power requires a strategy which is "aggressive" in the sense of having a higher probability of catastrophic failure?
You might expect that building "full-fledged FAI" requires knowing a lot about the alien, and you won't be able to figure all of that out in advance of building it. But again, I don't understand why you can't build an AI that implements a conservative strategy, in the sense of being quick to consult the user and unlikely to make a catastrophic error. So it seems like this just begs the question about the relative efficacy of conservative vs. aggressive strategies.
I don't quite understand the juxtaposition to the white box metaphilosophical algorithm. If we could make a simple algorithm which exhibited weak philosophical ability, can't the RL learner also use such a simple algorithm to find weak philosophical answers (which will in turn receive a reasonable payoff from us)?
Is the idea that by writing the white box algorithm we are providing key insights about what metaphilosophy is, that an AI can't extract from a discussion with us or inspection of our philosophical reasoning? At a minimum it seems like we could teach such an AI how to do philosophy, and this would be no harder than writing an algorithm (I grant that it may not be much easier).
It seems to me that we need to understand metaphilosophy well enough to be able to write down a white-box algorithm for it, before we can be reasonably confident that the AI will correctly solve every philosophical problem that it eventually comes across. If we just teach an AI how to do philosophy without an explicit understanding of it in the form of an algorithm, how do we know that the AI has fully learned it (and not some subtly wrong version of doing philosophy)?
Once we are able to write down a white-box algorithm, wouldn't it be safer to implement, test, and debug the algorithm directly as part of an AI designed from the start to take advantage of it, rather than indirectly having an AI learn it (and then presumably verifying that its internal representation of the algorithm is correct and that there aren't any potentially bad interactions with the rest of the AI)? And even the latter could reasonably be called white-box as well, since you are actually looking inside the AI and making sure that it has the right stuff inside. I was mainly arguing against a purely black box approach, where we start to build AIs while having little understanding of metaphilosophy, and therefore can't look inside the AI to see if it has learned the right thing.
I don't think this is core to our disagreement, but I don't understand why philosophical questions are especially relevant here.
For example, it seems like a relatively weak AI can recognize that "don't do anything the user would find terrible; acquire resources; make sure the user remains safe and retains effective control over those resources" is a praise-winning strategy, and then do it. (Especially in the reinforcement learning setting, where we can just tell it things and it can learn that doing things we tell it is a praise-winning strategy.) This strategy also seems close to maximally efficient: the costs of keeping humans around and retaining the ability to consult them are not very large, and the cost of eliciting the needed information is not very high.
So it seems to me that we should be thinking about the AI's ability to identify and execute strategies like this (and our ability to test that it is correctly executing such strategies).
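As a toy illustration of the kind of strategy-selection being discussed (my own sketch under made-up assumptions, not Christiano's actual proposal), the RL game rewards whichever plan the overseer scores most highly:

```python
def overseer_reward(plan: str) -> float:
    """Stand-in for the human overseer's evaluation (hypothetical)."""
    keeps_control = "retain user control" in plan
    consults_user = "consult the user" in plan
    return float(keeps_control) + float(consults_user)

def choose_plan(candidates: list[str]) -> str:
    # A praise-maximizing agent picks the plan with the highest predicted
    # reward; here "prediction" is just a direct query to the overseer.
    return max(candidates, key=overseer_reward)

plans = [
    "acquire resources unilaterally",
    "acquire resources, consult the user, retain user control",
]
print(choose_plan(plans))  # selects the conservative, praise-winning plan
```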
I discussed this issue a bit in problems #2 and #3 here. It seems like "answers to philosophical questions" can essentially be lumped under "values," in that discussion, since the techniques for coping with unknown values also seem to cope with unknown answers to philosophical questions.
ETA: my position looks superficially like a common argument that people give for why smart AI wouldn't be dangerous. But now the tables are turned: there is a strategy that the AI can follow which will cause it to earn high reward, and I am claiming that a very intelligent AI can find it, for example by understanding the intent of human language and using this as a clue about what humans will and won't approve of.
Acquiring resources has a lot of ethical implications. If you're inventing new technologies and selling them, you could be increasing existential risk. If you're trading with others, you would be enriching one group at the expense of another. If you're extracting natural resources, there are questions of fairness (how hard should you drive bargains or attempt to burn commons) and time preference (do you want to maximize short-term or long-term resource extraction). And how much do you care about animal suffering, or the world remaining "natural"? I guess the AI could present a plan that involves asking the overseer to answer these questions, but the overseer probably doesn't have the answers either (or at least should not be confident of his or her answers).
What we want is to develop an AI that can eventually do philosophy and answer these questions on its own, and correctly. It's the "doing philosophy correctly on its own" part that I do not see how to test for in a black-box design, without giving the AI so much power that it can escape human control if something goes wrong. The AI's behavior, while it's in the not-yet-superintelligent, "ask the overseer about every ethical question" phase, doesn't seem to tell us much about how good the design and implementation is, metaphilosophically.
Google Maps answers "how do I get from point A to point B" better than a human can. I don't think it does nothing useful just because it's not powerful enough to model the overseer.
I think you misunderstood. My comments were meant to apply to Paul Christiano's specific proposals, not to AIs in general.