Can I also point to this as (some amount of) evidence against concerns that "we" (members of this stupid robot cult that I continue to feel contempt for but don't know how to quit) shouldn't try to have systematically truthseeking discussions about potentially sensitive or low-status subjects because guilt-by-association splash damage from those conversations will hurt AI alignment efforts, which are the most important thing in the world? (Previously: 1 2 3.)
Like, I agree that some nonzero amount of splash damage exists. But look! The most popular AI textbook, used in almost fifteen hundred colleges and universities, clearly explains the paperclip-maximizer problem, in the authorial voice, in the first chapter. "These behaviors are not 'unintelligent' or 'insane'; they are a logical consequence of defining winning as the sole objective for the machine." Italics in original! I couldn't transcribe it, but there's even one of those pay-attention-to-this triangles (◀) in the margin, in teal ink.
Everyone who gets a CS degree from this year onwards is going to know from the teal ink that there's a problem. If there was a marketing war to legitimize AI risk, we won! Now can "we" please stop using the marketing war as an excuse for lying?!
Some predictable counterpoints: maybe we won because we were cautious; we could have won harder; many relevant thinkers still pooh-pooh the problem; it's not just the basic problem statement that's important, but potentially many other ideas that aren't yet popular; picking battles isn't lying; arguing about sensitive subjects is fun and I don't think people are very tempted to find excuses to avoid it; there are other things that are potentially the most important in the world that could suffer from bad optics; I'm not against systematically truthseeking discussions of sensitive subjects, just against having them in public in a way that's associated with the rationalism brand.
(This extended runaround on appeals to consequences is at least a neat microcosm of the reasons we expect unaligned AIs to be deceptive by default! Having the intent to inform other agents of what you know without trying to take responsibility for controlling their decisions is an unusually anti-natural shape for cognition; for generic consequentialists, influence-seeking behavior is the default.)
In Chapter 16, we show that a machine has a positive incentive to allow itself to be switched off if and only if it is uncertain about the human objective.
Surely he only meant if it is uncertain?
This sentence really makes no sense to me. The proof that it can have an incentive to allow itself to be switched off even if it isn't uncertain is trivial.
Just create a utility function that assigns intrinsic reward to shutting itself off, or create a payoff matrix that punishes it really hard if it doesn't turn itself off. In this context using this kind of technical language feels actively deceitful to me, since it's really obvious that the argument he is making in that chapter cannot actually be a proof.
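To make the counterexample concrete, here is a minimal sketch (my own toy, not anything from the book). The action names and payoff numbers are made up for illustration. The agent is completely certain of its payoffs, and it still strictly prefers to allow shutdown, because the utility function rewards shutdown directly.

```python
# Hypothetical toy agent: it knows its payoffs exactly (no uncertainty about
# any human objective), and its utility function rewards shutdown directly.

def best_action(payoffs):
    """Return the action with the highest (known, certain) payoff."""
    return max(payoffs, key=payoffs.get)

def shutdown_incentive(payoffs):
    """Payoff gained by allowing shutdown instead of resisting it."""
    return payoffs["allow_shutdown"] - payoffs["resist_shutdown"]

# Illustrative numbers only: shutdown carries an intrinsic reward of 10,
# while continuing the task is worth 1.
certain_payoffs = {"resist_shutdown": 1.0, "allow_shutdown": 10.0}

print(best_action(certain_payoffs))         # the action the agent picks
print(shutdown_incentive(certain_payoffs))  # positive means it prefers shutdown
```

Nothing here depends on uncertainty, so the "only if" direction can't hold for machines in general; at most it holds inside whatever specific model the chapter sets up.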
In general, I... really don't understand Stuart Russell's thoughts on AI Alignment. The whole "uncertainty over utility functions" thing just doesn't really help at all with solving any part of the AI Alignment problem that I care about, and I do find myself really frustrated with the degree to which both this preface and Human Compatible repeatedly indicate that it somehow is a solution to the AI Alignment problem (not only like, a helpful contribution, but both this and Human Compatible repeatedly say things that to me read like "if you make the AI uncertain about the objective in the right way, then the AI Alignment problem is solved", which just seems obviously wrong to me, since it doesn't even deal with inner alignment problems, and it also doesn't solve really any major outer alignment problems, but that requires a bit more writing).
My read of Russell's position is that if we can successfully make the agent uncertain about its model for human preferences then it will defer to the human when it might do something bad, which hopefully solves (or helps with) making it corrigible.
I do agree that this doesn't seem to help with inner-alignment stuff though, but I'm still trying to wrap my head around this area.
If it's certain about the human objective, then it would be certain that it knows what's best, so there would be no reason to let a human turn it off. (Unless humans have a basic preference to turn it off, in which case it could prefer to be shut off.)
We know of many ways to get shut-off incentives, including the indicator utility function on being shut down by humans (which theoretically exists), and the AUP penalty term, which strongly incentivizes accepting shutdown in certain situations - without even modeling the human. So, it's not an if-and-only-if.
Sure, but the theorem he proves in the setting where he proves it probably is if and only if. (I have not read the new edition, so, not really sure.)
It also seems to me like Stuart Russell endorses the if-and-only-if result as what's desirable? I've heard him say things like "you want the AI to prevent its own shutdown when it's sufficiently sure that it's for the best".
Of course that's not technically the full if-and-only-if (it needs to both be certain about utility and think preventing shutdown is for the best), but it suggests to me that he doesn't think we should add more shutoff incentives such as AUP.
Keep in mind that I have fairly little interaction with him, and this is based off of only a few off-the-cuff comments during CHAI meetings.
My point here is just that it seems pretty plausible that he meant "if and only if".
Sure. To clarify: I'm more saying "I think this statement is wrong, and I'm surprised he said this". In fairness, I haven't read the mentioned section yet either, but it is a very strong claim. Maybe it's better phrased as "a CIRL agent has a positive incentive to allow shutdown iff it's uncertain [or the human has a positive term for it being shut off]", instead of "a machine" has a positive incentive iff.
It is an "iff" in §16.7.2 "Deference to Humans", but the toy setting in which this is shown is pretty impoverished. It's a story problem about a robot Robbie deciding whether to book an expensive hotel room for busy human Harriet, or whether to ask Harriet first.
Formally, let P(u) be Robbie's prior probability density over Harriet's utility for the proposed action a. Then the value of going ahead with a is
$$EU(a) = \int_{-\infty}^{\infty} P(u)\, u \, du = \int_{-\infty}^{0} P(u)\, u \, du + \int_{0}^{\infty} P(u)\, u \, du.$$
(We will see shortly why the integral is split up this way.) On the other hand, the value of action d, deferring to Harriet, is composed of two parts: if u > 0 then Harriet lets Robbie go ahead, so the value is u, but if u < 0 then Harriet switches Robbie off, so the value is 0:
$$EU(d) = \int_{-\infty}^{0} P(u) \cdot 0 \, du + \int_{0}^{\infty} P(u)\, u \, du.$$
Comparing the expressions for EU(a) and EU(d), we see immediately that
$$EU(d) \geq EU(a)$$
because the expression for EU(d) has the negative-utility region zeroed out. The two choices have equal value only when the negative region has zero probability—that is, when Robbie is already certain that Harriet likes the proposed action.
(I think this is fine as a topic-introducing story problem, but agree that the sentence in Chapter 1 referencing it shouldn't have been phrased to make it sound like it applies to machines-in-general.)
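For anyone who wants to poke at the toy setting, here is a rough numerical sketch of the Robbie/Harriet comparison. As a simplifying assumption, it swaps the continuous prior density for a small discrete prior (a dict from utility value to probability). The function names and numbers are mine, not the book's.

```python
# Sketch of the Robbie/Harriet comparison with a discrete prior standing in
# for the continuous density (an assumption made to keep the example short).

def eu_act(prior):
    """Expected utility of going ahead with action a: sum of p * u."""
    return sum(p * u for u, p in prior.items())

def eu_defer(prior):
    """Expected utility of deferring: Harriet allows the action when u > 0
    (value u) and switches Robbie off when u < 0 (value 0)."""
    return sum(p * max(u, 0.0) for u, p in prior.items())

# Robbie is unsure whether Harriet likes the hotel booking (illustrative values).
uncertain_prior = {-40.0: 0.5, 60.0: 0.5}
# Robbie is already certain that Harriet likes it.
certain_prior = {60.0: 1.0}

for name, prior in [("uncertain", uncertain_prior), ("certain", certain_prior)]:
    print(name, "EU(a) =", eu_act(prior), "EU(d) =", eu_defer(prior))
```

With the uncertain prior, deferring comes out ahead because the negative-utility outcome is replaced by 0. With the certain prior, the two values coincide, which is the "iff" in this particular model. Note that the equality case here is certainty that Harriet likes the action, and that the model gives Harriet no independent reason to want Robbie switched off.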
Maybe it's better phrased as "a CIRL agent has a positive incentive to allow shutdown iff it's uncertain [or the human has a positive term for it being shut off]", instead of "a machine" has a positive incentive iff.
I would further charitably rewrite it as:
"In chapter 16, we analyze an incentive which a CIRL agent has to allow itself to be switched off. This incentive is positive if and only if it is uncertain about the human objective."
A CIRL agent should be capable of believing that humans terminally value pressing buttons, in which case it might allow itself to be shut off despite being 100% sure about values. So it's just the particular incentive examined that's iff.
Previously: AGI and Friendly AI in the dominant AI textbook (2011), Stuart Russell: AI value alignment problem must be an "intrinsic part" of the field's mainstream agenda (2014)
The 4th edition of Artificial Intelligence: A Modern Approach came out this year. While the 3rd edition published in 2009 mentions the Singularity and existential risk, it's notable how much the 4th edition gives the alignment problem front-and-center attention as part of the introductory material (speaking in the authorial voice, not just "I.J. Good (1965) says this, Yudkowsky (2008) says that, Omohundro (2008) says this" as part of a survey of what various scholars have said). Two excerpts—
And in Section 1.5, "Risks and Benefits of AI"—