Artificial Intelligence: A Modern Approach (4th edition) on the Alignment Problem

Zack_M_Davis

Previously: AGI and Friendly AI in the dominant AI textbook (2011), Stuart Russell: AI value alignment problem must be an "intrinsic part" of the field's mainstream agenda (2014)

The 4th edition of Artificial Intelligence: A Modern Approach came out this year. While the 3rd edition published in 2009 mentions the Singularity and existential risk, it's notable how much the 4th edition gives the alignment problem front-and-center attention as part of the introductory material (speaking in the authorial voice, not just "I.J. Good (1965) says this, Yudkowsky (2008) says that, Omohundro (2008) says this" as part of a survey of what various scholars have said). Two excerpts—

1.1.5 Beneficial machines

The standard model has been a useful guide for AI research since its inception, but it is probably not the right model in the long run. The reason is that the standard model assumes that we will supply a fully specified objective to the machine.

For an artificially defined task such as chess or shortest-path computation, the task comes with an objective built in—so the standard model is applicable. As we move into the real world, however, it becomes more and more difficult to specify the objective completely and correctly. For example, in designing a self-driving car, one might think that the objective is to reach the destination safely. But driving along any road incurs a risk of injury due to other errant drivers, equipment failure, and so on; thus, a strict goal of safety requires staying in the garage. There is a tradeoff between making progress towards the destination and incurring a risk of injury. How should this tradeoff be made? Furthermore, to what extent can we allow the car to take actions that would annoy other drivers? How much should the car moderate its acceleration, steering, and braking to avoid shaking up the passenger? These kinds of questions are difficult to answer a priori. They are particularly problematic in the general area of human–robot interaction, of which the self-driving car is one example.

The problem of achieving agreement between our true preferences and the objective we put into the machine is called the value alignment problem: the values or objectives put into the machine must be aligned with those of the human. If we are developing an AI system in the lab or in a simulator—as has been the case for most of the field's history—there is an easy fix for an incorrectly specified objective: reset the system, fix the objective, and try again. As the field progresses towards increasingly capable intelligent systems that are deployed in the real world, this approach is no longer viable. A system deployed with an incorrect objective will have negative consequences. Moreover, the more intelligent the system, the more negative the consequences.

Returning to the apparently unproblematic example of chess, consider what happens if the machine is intelligent enough to reason and act beyond the confines of the chessboard. In that case, it might attempt to increase its chances of winning by such ruses as hypnotizing or blackmailing its opponent or bribing the audience to make rustling noises during its opponent's thinking time.³ It might also attempt to hijack additional computing power for itself. These behaviors are not "unintelligent" or "insane"; they are a logical consequence of defining winning as the sole objective for the machine.

It is impossible to anticipate all the ways in which a machine pursuing a fixed objective might misbehave. There is good reason, then, to think that the standard model is inadequate. We don't want machines that are intelligent in the sense of pursuing their objectives; we want them to pursue our objectives. If we cannot transfer those objectives perfectly to the machine, then we need a new formulation, one in which the machine is pursuing our objectives, but is necessarily uncertain as to what they are. When a machine knows that it doesn't know the complete objective, it has an incentive to act cautiously, to ask permission, to learn more about our preferences through observation, and to defer to human control. Ultimately, we want agents that are provably beneficial to humans. We will return to this topic in Section 1.5.

And in Section 1.5, "Risks and Benefits of AI"—

At around the same time, concerns were raised that creating artificial superintelligence or ASI—intelligence that far surpasses human ability—might be a bad idea (Yudkowsky, 2008; Omohundro 2008). Turing (1996) himself made the same point in a lecture given in Manchester in 1951, drawing on earlier ideas from Samuel Butler (1863):¹⁵

It seems probable that once the machine thinking method had started, it would not take long to outstrip our feeble powers. ... At some stage therefore we should have to expect the machines to take control, in the way that is mentioned in Samuel Butler's Erewhon.

These concerns have only become more widespread with recent advances in deep learning, the publication of books such as Superintelligence by Nick Bostrom (2014), and public pronouncements from Stephen Hawking, Bill Gates, Martin Rees, and Elon Musk.

Experiencing a general sense of unease with the idea of creating superintelligent machines is only natural. We might call this the gorilla problem: about seven million years ago, a now-extinct primate evolved, with one branch leading to gorillas and one to humans. Today, the gorillas are not too happy about the human branch; they have essentially no control over their future. If this is the result of success in creating superhuman AI (that humans cede control over their future), then perhaps we should stop work on AI and, as a corollary, give up the benefits it might bring. This is the essence of Turing's warning: it is not obvious that we can control machines that are more intelligent than us.

If superhuman AI were a black box that arrived from outer space, then indeed it would be wise to exercise caution in opening the box. But it is not: we design the AI systems, so if they do end up "taking control," as Turing suggests, it would be the result of a design failure.

To avoid such an outcome, we need to understand the source of potential failure. Norbert Wiener (1960), who was motivated to consider the long-term future of AI after seeing Arthur Samuel's checker-playing program learn to beat its creator, had this to say:

If we use, to achieve our purposes, a mechanical agency with whose operation we cannot interfere effectively ... we had better be quite sure that the purpose put into the machine is the purpose which we really desire.

Many cultures have myths of humans who ask gods, genies, magicians, or devils for something. Invariably, in these stories, they get what they literally ask for, and then regret it. The third wish, if there is one, is to undo the first two. We will call this the King Midas problem: Midas, a legendary King in Greek mythology, asked that everything he touched should turn to gold, but then regretted it after touching his food, drink, and family members.¹⁶

We touched on this issue in Section 1.1.5, where we pointed out the need for a significant modification to the standard model of putting fixed objectives into the machine. The solution to Wiener's predicament is not to have a definite "purpose put into the machine" at all. Instead, we want machines that strive to achieve human objectives but know that they don't know for certain exactly what those objectives are.

It is perhaps unfortunate that almost all AI research to date has been carried out within the standard model, which means that almost all of the technical material in this edition reflects that intellectual framework. There are, however, some early results within the new framework. In Chapter 16, we show that a machine has a positive incentive to allow itself to be switched off if and only if it is uncertain about the human objective. In Chapter 18, we formulate and study assistance games, which describe mathematically the situation in which a human has an objective and a machine tries to achieve it, but is initially uncertain about what it is. In Chapter 22, we explain the methods of inverse reinforcement learning that allow machines to learn more about human preferences from observations of the choices that humans make. In Chapter 27, we explore two of the principal difficulties: first, that our choices depend on our preferences through a very complex cognitive architecture that is hard to invert; and, second, that we humans may not have consistent preferences in the first place—either individually or as a group—so it may not be clear what AI systems should be doing for us.

Can I also point to this as (some amount of) evidence against concerns that "we" (members of this stupid robot cult that I continue to feel contempt for but don't know how to quit) shouldn't try to have systematically truthseeking discussions about potentially sensitive or low-status subjects because guilt-by-association splash damage from those conversations will hurt AI alignment efforts, which are the most important thing in the world? (Previously: 1 2 3.)

Like, I agree that some nonzero amount of splash damage exists. But look! The most popular AI textbook, used in almost fifteen hundred colleges and universities, clearly explains the paperclip-maximizer problem, in the authorial voice, in the first chapter. "These behaviors are not 'unintelligent' or 'insane'; they are a logical consequence of defining winning as the sole objective for the machine." Italics in original! I couldn't transcribe it, but there's even one of those pay-attention-to-this triangles (◀) in the margin, in teal ink.

Everyone who gets a CS degree from this year onwards is going to know from the teal ink that there's a problem. If there was a marketing war to legitimize AI risk, we won! Now can "we" please stop using the marketing war as an excuse for lying?!

Some predictable counterpoints: maybe we won because we were cautious; we could have won harder; many relevant thinkers still pooh-pooh the problem; it's not just the basic problem statement that's important, but potentially many other ideas that aren't yet popular; picking battles isn't lying; arguing about sensitive subjects is fun and I don't think people are very tempted to find excuses to avoid it; there are other things that are potentially the most important in the world that could suffer from bad optics; I'm not against systematically truthseeking discussions of sensitive subjects, just if it's in public in a way that's associated with the rationalism brand.

(This extended runaround on appeals to consequences is at least a neat microcosm of the reasons we expect unaligned AIs to be deceptive by default! Having the intent to inform other agents of what you know without trying to take responsibility for controlling their decisions is an unusually anti-natural shape for cognition; for generic consequentialists, influence-seeking behavior is the default.)

In Chapter 16, we show that a machine has a positive incentive to allow itself to be switched off if and only if it is uncertain about the human objective.

Surely he only meant if it is uncertain?

This sentence really makes no sense to me. The proof that it can have an incentive to allow itself to be switched off even if it isn't uncertain is trivial.

Just create a utility function that assigns intrinsic reward to shutting itself off, or create a payoff matrix that punishes it really hard if it doesn't turn itself off. In this context using this kind of technical language feels actively deceitful to me, since it's really obvious that the argument he is making in that chapter cannot actually be a proof.
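To make the counterexample concrete, here is a minimal sketch in Python. The option names and payoff numbers are hypothetical, chosen only for illustration; the point is that neither agent has any uncertainty about its payoffs, yet both prefer to allow shutdown.

```python
def best_option(payoffs):
    """Pick the highest-payoff option for an agent that knows its payoffs exactly."""
    return max(payoffs, key=payoffs.get)

# Hypothetical case 1: the utility function itself pays an intrinsic
# reward for being switched off that exceeds the value of acting.
intrinsic_reward = {"act": 5.0, "allow_shutdown": 10.0}

# Hypothetical case 2: a payoff table that punishes resisting shutdown very hard.
punish_resistance = {"resist_shutdown": -1000.0, "allow_shutdown": 0.0}

print(best_option(intrinsic_reward))
print(best_option(punish_resistance))
```

Both calls select "allow_shutdown", with no uncertainty anywhere in the setup.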

In general, I... really don't understand Stuart Russell's thoughts on AI Alignment. The whole "uncertainty over utility functions" thing just doesn't really help at all with solving any part of the AI Alignment problem that I care about, and I do find myself really frustrated with the degree to which both this preface and Human Compatible repeatedly indicate that it somehow is a solution to the AI Alignment problem (not only like, a helpful contribution, but both this and Human Compatible repeatedly say things that to me read like "if you make the AI uncertain about the objective in the right way, then the AI Alignment problem is solved", which just seems obviously wrong to me, since it doesn't even deal with inner alignment problems, and it also doesn't solve really any major outer alignment problems, but that requires a bit more writing).

My read of Russell's position is that if we can successfully make the agent uncertain about its model for human preferences then it will defer to the human when it might do something bad, which hopefully solves (or helps with) making it corrigible.

I do agree that this doesn't seem to help with inner-alignment stuff though, but I'm still trying to wrap my head around this area.

If it's certain about the human objective, then it would be certain that it knows what's best, so there would be no reason to let a human turn it off. (Unless humans have a basic preference to turn it off, in which case it could prefer to be shut off.)

We know of many ways to get shut-off incentives, including the indicator utility function on being shut down by humans (which theoretically exists), and the AUP penalty term, which strongly incentivizes accepting shutdown in certain situations - without even modeling the human. So, it's not an if-and-only-if.

Sure, but the theorem he proves in the setting where he proves it probably is if and only if. (I have not read the new edition, so, not really sure.)

It also seems to me like Stuart Russell endorses the if-and-only-if result as what's desirable? I've heard him say things like "you want the AI to prevent its own shutdown when it's sufficiently sure that it's for the best".

Of course that's not technically the full if-and-only-if (it needs to both be certain about utility and think preventing shutdown is for the best), but it suggests to me that he doesn't think we should add more shutoff incentives such as AUP.

Keep in mind that I have fairly little interaction with him, and this is based off of only a few off-the-cuff comments during CHAI meetings.

My point here is just that it seems pretty plausible that he meant "if and only if".

My point here is just that it seems pretty plausible that he meant "if and only if".

Sure. To clarify: I'm more saying "I think this statement is wrong, and I'm surprised he said this". In fairness, I haven't read the mentioned section yet either, but it is a very strong claim. Maybe it's better phrased as "a CIRL agent has a positive incentive to allow shutdown iff it's uncertain [or the human has a positive term for it being shut off]", instead of "a machine" has a positive incentive iff.

It is an "iff" in §16.7.2 "Deference to Humans", but the toy setting in which this is shown is pretty impoverished. It's a story problem about a robot Robbie deciding whether to book an expensive hotel room for busy human Harriet, or whether to ask Harriet first.

Formally, let $P(u)$ be Robbie's prior probability density over Harriet's utility for the proposed action a. Then the value of going ahead with a is

$EU(a) = \int_{-\infty}^{\infty} P(u) \cdot u \, du = \int_{-\infty}^{0} P(u) \cdot u \, du + \int_{0}^{\infty} P(u) \cdot u \, du$

(We will see shortly why the integral is split up this way.) On the other hand, the value of action d, deferring to Harriet, is composed of two parts: if u > 0 then Harriet lets Robbie go ahead, so the value is u, but if u < 0 then Harriet switches Robbie off, so the value is 0:

$EU(d) = \int_{-\infty}^{0} P(u) \cdot 0 \, du + \int_{0}^{\infty} P(u) \cdot u \, du$

Comparing the expressions for EU(a) and EU(d), we see immediately that

$EU(d) \geq EU(a)$

because the expression for EU(d) has the negative-utility region zeroed out. The two choices have equal value only when the negative region has zero probability—that is, when Robbie is already certain that Harriet likes the proposed action.
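As a rough numerical check of the comparison above, here is a sketch that swaps the continuous density for a discrete prior (my simplification, not the book's setup). The function name and the example utilities and probabilities are hypothetical.

```python
import math

def expected_utilities(utilities, probabilities):
    """Return (EU(a), EU(d)) for a discrete prior over Harriet's utility u.

    EU(a): Robbie goes ahead, so every outcome counts at its utility u.
    EU(d): Robbie defers; Harriet switches him off when u < 0, so those
           outcomes contribute 0 instead of u.
    """
    if not math.isclose(sum(probabilities), 1.0):
        raise ValueError("probabilities must sum to 1")
    eu_act = sum(p * u for u, p in zip(utilities, probabilities))
    eu_defer = sum(p * max(u, 0.0) for u, p in zip(utilities, probabilities))
    return eu_act, eu_defer

# Hypothetical prior with some mass on negative utility: deferring wins.
uncertain = expected_utilities([-40.0, 10.0, 60.0], [0.3, 0.4, 0.3])

# Hypothetical prior with no mass on negative utility: the two values tie.
certain = expected_utilities([10.0, 60.0], [0.5, 0.5])

print("uncertain:", uncertain)
print("certain:  ", certain)
```

With the first prior, deferring is worth about 22 against about 10 for going ahead, because the negative outcome is zeroed out. With the second prior the two values coincide, which is the "only if" half of the toy result.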

(I think this is fine as a topic-introducing story problem, but agree that the sentence in Chapter 1 referencing it shouldn't have been phrased to make it sound like it applies to machines-in-general.)

Maybe it's better phrased as "a CIRL agent has a positive incentive to allow shutdown iff it's uncertain [or the human has a positive term for it being shut off]", instead of "a machine" has a positive incentive iff.

I would further charitably rewrite it as:

"In chapter 16, we analyze an incentive which a CIRL agent has to allow itself to be switched off. This incentive is positive if and only if it is uncertain about the human objective."

A CIRL agent should be capable of believing that humans terminally value pressing buttons, in which case it might allow itself to be shut off despite being 100% sure about values. So it's just the particular incentive examined that's iff.
