Skeptic: It seems to me that the distinction between "alignment" and "misalignment" has become something of a motte and bailey. Historical arguments that AIs would be misaligned used it in sense 1: "AIs having sufficiently general and large-scale motivations that they acquire the instrumental goal of killing all humans (or equivalently bad behaviour)". Now people are using the word in sense 2: "AIs not quite doing what we want them to do". But when our current AIs aren't doing quite what we want them to do, is that mainly evidence that future, more general systems will be misaligned₁ (which I agree is bad) or misaligned₂?
Advocate: Concepts like agency are continuous spectra. GPT-3 is a little bit agentic, and we'll eventually build AGIs that are much more agentic. Insofar as GPT-3 is trying to do something, it's trying to do the wrong thing. So we should expect future systems to be trying to do the wrong thing in a much more worrying way (aka be misaligned₁) for approximately the same reason: that we trained them on loss functions that incentivised the wrong thing.
Skeptic: I agree that this is possible. But what should our update be after observing large language models? You could look at the difficulties of making GPT-3 do exactly what we want, and see this as evidence that misalignment is a big deal. But actually, large language models seem like evidence against misalignment₁ being a big deal (because they seem to be quite intelligent without being very agentic, but the original arguments for worrying about misalignment₁ relied on the idea that intelligence and agency are tightly connected, making it very hard to build superintelligent systems which don't have large-scale goals).
Advocate: Even if that's true for the original arguments, it's not for more recent arguments.
Skeptic: These newer arguments rely on assumptions about economic competition and coordination failures which seem quite speculative to me, and which haven't been vetted very much.
Advocate: These assumptions seem like common sense to me - e.g. lots of people are already worried about the excesses of capitalism. But even if they're speculative, they're worth putting a lot of effort into understanding and preparing for.
In case it wasn't clear from inside the dialogue, I'm quite sympathetic to both sides of this conversation (indeed, it's roughly a transcript of a debate that I've had with myself a few times). I think more clarity on these topics would be very valuable.
It seems to me that the distinction between “alignment” and “misalignment” has become something of a motte and bailey. Historical arguments that AIs would be misaligned used it in sense 1: “AIs having sufficiently general and large-scale motivations that they acquire the instrumental goal of killing all humans (or equivalently bad behaviour)”. Now people are using the word in sense 2: “AIs not quite doing what we want them to do”.
There's an identical problem with "friendliness". Sometimes unfriendliness means we all die, sometimes it means we don't get utopia.
On 2, I would probably make a stronger additional claim, that even the parts of alignment research that are "just capabilities" don't seem to happen by default, and the vast majority of work done in this space (at least with large models) seems to have been driven by people motivated to work on alignment. Yes, in some ideal world, those working towards AI-based products would have done things like RL from human feedback, but empirically they don't seem to be doing that.
(I also agree with the response you've given in the post.)
Whether this is a point for the advocate or the skeptic depends on whether advances in RL from human feedback unlock other alignment work more than they unlock other capabilities work. I think there's room for reasonable disagreement on this question, although I favour the former.
Yeah, that's also a good point, though I don't want to read too much into it, since it might be a historical accident.
Thanks for writing this. I've been having a lot of similar conversations, and found your post clarifying in stating a lot of core arguments clearly.
Is there an even better critique that the Skeptic could make?
Focusing first on human preference learning as a subset of alignment research: I think most ML researchers "should" agree on the importance of simple human preference learning, both from a safety and capabilities perspective. If we take the narrower question "should we do human preference learning, or is pretraining + minimal prompt engineering enough?", I feel confident in the answer you give as Advocate: To the extent prompt engineering works, it's because it's preference learning in disguise, and leaning into preference learning (including supervised / RL finetuning) will work much better. Both the theoretical and empirical pictures to date agree with this.
(My sense is that not all ML researchers immediately agree with this / maybe just haven't considered the question in this frame, but that most researchers are pretty receptive to it and will agree in discussion.)
So I think a more challenging Skeptic might say: "Perhaps simple human preference learning is enough, and we can focus all alignment research there. Why do we need the other research directions in the alignment portfolio like handling inaccessible information, deceptive mesa-optimizers, or interpretability?" Here, "simple" human preference learning is referring to something like supervised (your step 1 for Question 1) + RL finetuning (step 2) + ad hoc ways of making it easier for humans to supervise models (limited versions of step 3).
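For concreteness, the core of that "simple" pipeline is just learning a scalar reward from pairwise human comparisons and then finetuning against it. Here's a minimal sketch of the reward-modelling part; the toy network, random tensors, and function names are purely illustrative, not anything from the post or an existing codebase:

```python
# Minimal sketch of reward-model training on pairwise human preferences
# (the preference-learning core of the "simple" pipeline above). The toy
# network and random tensors are illustrative stand-ins for a pretrained
# model and real comparison data.
import torch
import torch.nn as nn
import torch.nn.functional as F

class TinyRewardModel(nn.Module):
    """Maps a (pre-embedded) response to a scalar reward."""
    def __init__(self, dim: int = 128):
        super().__init__()
        self.head = nn.Sequential(nn.Linear(dim, dim), nn.ReLU(), nn.Linear(dim, 1))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.head(x).squeeze(-1)

def preference_loss(r_chosen: torch.Tensor, r_rejected: torch.Tensor) -> torch.Tensor:
    # Bradley-Terry style objective: push the preferred response's reward
    # above the rejected one's.
    return -F.logsigmoid(r_chosen - r_rejected).mean()

model = TinyRewardModel()
opt = torch.optim.Adam(model.parameters(), lr=1e-3)

# Stand-ins for embeddings of (chosen, rejected) response pairs labelled by humans.
chosen, rejected = torch.randn(32, 128), torch.randn(32, 128)

loss = preference_loss(model(chosen), model(rejected))
opt.zero_grad()
loss.backward()
opt.step()
```

In practice the reward model would be a finetuned copy of the pretrained language model, and the learned reward would then drive the RL finetuning step; the pairwise objective above is just the part that makes it "preference learning".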
I again side with Advocate here, but I think making the case is more difficult (and also perhaps requires different arguments for different research directions). I don't have a response for this as short or convincing as what you have here. My typical response would expand on your points that more capable models will be more dangerous and that alignment might turn out to be very hard, so it's important to consider these potential difficulties in advance. The hardness claim would probably involve failure stories (along these lines) or more abstract hardness arguments (along these lines).
I think this is a really useful post, thanks for making this! I maybe have a few things I'd add but broadly I agree with everything here.
You might want to reference Ajeya's post on "Aligning Narrowly Superhuman Models" where you're discussing alignment research that can be done with current models.
Planned summary for the Alignment Newsletter:
This post outlines three AI alignment skeptic positions and corresponding responses from an advocate. Note that while the author tends to agree with the advocate’s view, they also believe that the skeptic makes good points.
1. The alignment problem gets easier as models get smarter, since they start to learn the difference between, say, human smiles and human well-being. So all we need to do is to prompt them appropriately, e.g. by setting up a conversation with “a wise and benevolent AI advisor”.
_Response:_ We can do a lot better than prompting: in fact, <@a recent paper@>(@True Few-Shot Learning with Language Models@) showed that prompting is effectively (poor) finetuning, so we might as well finetune. Separately from prompting itself, alignment does get easier in some ways as models get smarter, but it also gets harder: for example, smarter models will game their reward functions in more unexpected and clever ways.
2. What’s the difference between alignment and capabilities anyway? Something like <@RL from human feedback for summarization@>(@Learning to Summarize with Human Feedback@) could equally well have been motivated through a focus on AI products.
_Response:_ While there’s certainly overlap, alignment research is usually not the lowest-hanging fruit for building products. So it’s useful to have alignment-focused teams that can champion the work even when it doesn’t provide the best near-term ROI.
3. We can’t make useful progress on aligning superhuman models until we actually have superhuman models to study. Why not wait until those are available?
_Response:_ If we don’t start now, then in the short term, companies will deploy products that optimize simple objectives like revenue and engagement, which could be improved by alignment work. In the long term, it is plausible that alignment is very hard, such that we need many conceptual advances, which we’d have to start on now in order to have them ready by the time we feel obligated to use powerful AI systems. In addition, empirically there seem to be many alignment approaches that aren’t bottlenecked by the capabilities of models -- see for example <@this post@>(@The case for aligning narrowly superhuman models@).
Planned opinion:
I generally agree with these positions and responses, and I’m especially happy about the arguments being specific to the actual models we use today, which grounds out the discussion a lot more and makes it easier to make progress. On the second point in particular, I’d also [say](https://www.alignmentforum.org/posts/6ccG9i5cTncebmhsH/frequent-arguments-about-alignment?commentId=cwgpCBfwHaYravLpo) that empirically, product-focused people don’t do e.g. RL from human feedback, even if it could be motivated that way.
Thanks for the post, it's a great idea to have both arguments.
My personal preference would be for both arguments to be roughly the same length, to make it easier to compare their strength (here the skeptic gets one paragraph while the advocate gets 3-6x more), and for them not to always appear in the same order (skeptic then advocate), but also advocate -> skeptic, or even skeptic -> advocate -> skeptic -> ..., so it doesn't look like one side is the "haven't thought about it much" view.
This post has two purposes. First, I want to cache good responses to these questions, so I don't have to think about them each time the topic comes up. Second, I think it's useful for people who work on safety and alignment to be ready for the kind of pushback they'll get when pitching their work to others.
Great idea, thanks for writing this!
For #2, not sure if this is a skeptic or an advocate point: why have a separate team at all? When designing a bridge you don't have one team of engineers making the bridge, and a separate team of engineers making sure the bridge doesn't fall down. Within OpenAI, isn't everyone committed to good things happening, and not just strictly picking the lowest-hanging fruit? If alignment-informed research is better long-term, why isn't the whole company the "safety team" out of simple desire to do their job?
We could make this more obviously skeptical by rephrasing it as a wisdom-of-the-crowds objection. You say we need people focused on alignment because it's not always the lowest-hanging fruit. But other people aren't dumb and want things to go well - are you saying they're making a mistake?
And then you have to either say yes, they're making a mistake because people are (e.g.) both internally and externally over-incentivized to do things that have flashy results now, or no, they're not making a mistake, in fact having a separate alignment group is a good idea even in a perfect world because of (e.g.) specialization of basic research, or some combination of the two.
In my experience, you need separate teams doing safety research because specialization is useful -- it's easiest to make progress when both individuals and teams specialize a bit and develop taste and mastery of a narrow range of topics.
Thanks for this post! I have to admit I took some time to read it because I expected it to be basic, but I really like the focus on more current techniques (which makes sense, since you cofounded OpenAI and work there).
Let's start with the wise AI advisor. Even if our model has internal knowledge about the truth and human wellbeing, that doesn't mean that it'll act on that knowledge the way we want. Rather, the model has been trained to imitate the training corpus, and therefore it'll repeat the misconceptions and flaws of typical authors, even if it knows that they're mistaken about something.
That doesn't feel as bad to me as you describe it. Sure, if you literally call a "wise old man" from the literature (or god forbid, reddit), that might end up pretty badly. But we might go for tighter control over the sort of "language producer" we're trying to instantiate. Or go the microscope AI route.
All of these do require more alignment-focused work, though. I'm particularly excited about the perspective of language models as simulators of many small models of things producing/influencing language, and about techniques related to that view, like meta-prompts or counterfactual parsing.
I also feel like this answer from the Advocate disparages a potentially very big deal for language models: the fact that they might pick up human abstractions because they learn to model language, and our use of language is littered with these abstractions. This is a potentially strong version of the natural abstraction hypothesis, which seems like it makes the problem easier in some ways. For example, we have a better chance of understanding what the model might do, because it's trying to predict a system (language) that we use constantly at that level of granularity, as opposed to images, which we never think of pixel by pixel.
Optimize the right objective, which is usually hard to measure and optimize, and is not the logprob of the human-provided answer. (We'll need to use reinforcement learning.)
I want to point out that from an alignment standpoint, this looks like a very dangerous step. One thing language models have going for them is that what they optimize for isn't exactly what we use them for, and so they avoid potential issues like Goodharting. This would be completely destroyed by adding an explicit optimization step at the end.
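(For reference, the explicit optimization step under discussion usually looks something like the sketch below in the RL-from-human-feedback setups the post mentions. `reward_model`, `logp_policy`, and `logp_pretrained` are hypothetical stand-ins rather than any particular implementation, and the log-probability penalty is the usual partial guard against over-optimizing the learned reward, not a full answer to the Goodharting worry.)

```python
# Sketch of the per-sample objective in KL-regularized RL finetuning, the
# kind of explicit optimization step discussed above. All names are
# hypothetical stand-ins rather than a real API.
def rl_objective(reward_model, logp_policy, logp_pretrained,
                 prompt: str, response: str, beta: float = 0.05) -> float:
    # Learned reward for the sampled response...
    r = reward_model(prompt, response)
    # ...minus a penalty on how far the finetuned policy's log-probability has
    # drifted from the pretrained model's (a single-sample estimate of their KL
    # divergence), which is what usually keeps the policy from wandering into
    # degenerate, reward-hacking outputs.
    kl_estimate = logp_policy(prompt, response) - logp_pretrained(prompt, response)
    return r - beta * kl_estimate
```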
Returning to the original question, there was the claim that alignment gets easier as the models get smarter. It does get easier in some ways, but it also gets harder in others. Smarter models will be better at gaming our reward functions in unexpected and clever ways -- for example, producing the convincing illusion of being insightful or helpful, while actually being the opposite. And eventually they'll be capable of intentionally deceiving us.
I think this is definitely an important point that goes beyond the special case of language models that you mostly discuss before.
While alignment and capabilities aren't distinct, they correspond to different directions that we can push the frontier of AI. Alignment advances make it easier to optimize hard-to-measure objectives like being helpful or truthful. Capabilities advances also sometimes make our models more helpful and more accurate, but they also make the models more potentially dangerous.
One thing I would want to point out is that another crucial difference lies in the sort of conceptual research that is done in alignment. Deconfusion of ideas like power-seeking, enlightened judgment, and goal-directedness is rarely that useful for capabilities, but I'm pretty convinced it is crucial for better understanding the alignment risks and how to deal with them.
Are there other forums for AI Alignment or AI Safety and Security besides this one where your article could be published for feedback from perspectives that haven't been shaped by Rationalist thinking or EA?
Here, I’ll review some arguments that frequently come up in discussions about alignment research, involving one person skeptical of the endeavor (called Skeptic) and one person advocating to do more of it (called Advocate). I mostly endorse the views of the Advocate, but the Skeptic isn't a strawman and makes some decent points. The dialog is mostly based on conversations I've had with people who work on machine learning but don't specialize in safety and alignment.
This post has two purposes. First, I want to cache good responses to these questions, so I don't have to think about them each time the topic comes up. Second, I think it's useful for people who work on safety and alignment to be ready for the kind of pushback they'll get when pitching their work to others.
Just to introduce myself, I'm a cofounder of OpenAI and lead a team that works on developing and applying reinforcement learning methods; we're working on improving truthfulness and reasoning abilities of language models.
1. Does alignment get solved automatically as our models get smarter?
Skeptic: I think the alignment problem gets easier as our models get smarter. When we train sufficiently powerful generative models, they'll learn the difference between human smiles and human wellbeing; the difference between the truth and common misconceptions; and various concepts they'll need for aligned behavior. Given all of this internal knowledge, we just have to prompt them appropriately to get the desired behavior. For example, to get wise advice from a powerful language model, I just have to set up a conversation between myself and "a wise and benevolent AI advisor."
Advocate: The wise AI advisor you described has some basic problems, and I'll get into those shortly. But more generally, prompting an internet-trained generative model (like raw GPT-3) is a very poor way of getting aligned behavior, and we can easily do much better. It'll occasionally do something reasonable, but that's not nearly good enough.
Let's start with the wise AI advisor. Even if our model has internal knowledge about the truth and human wellbeing, that doesn't mean that it'll act on that knowledge the way we want. Rather, the model has been trained to imitate the training corpus, and therefore it'll repeat the misconceptions and flaws of typical authors, even if it knows that they're mistaken about something.
Another problem with prompting is that it's an unreliable method. Coming up with the perfect prompt is hard, and it requires evaluating each candidate prompt on a dataset of possible inputs. But if we do that, we're effectively training the prompt on this dataset, so we're hardly "just prompting" the model -- we're training it (poorly). A nice recent paper studied the issue quantitatively.
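To make this concrete, here's a rough sketch of what prompt selection actually amounts to; `score_prompt` is a hypothetical stand-in for querying the model and grading its output, not a function from that paper:

```python
# Sketch of prompt selection as implicit training: picking the prompt that
# scores best on a labelled dataset is itself a (weak) form of tuning on
# that dataset. `score_prompt` is a hypothetical stand-in for querying a
# language model and checking its output against the desired answer.
from typing import Callable, Sequence, Tuple

def select_prompt(
    candidates: Sequence[str],
    dataset: Sequence[Tuple[str, str]],              # (input, desired output) pairs
    score_prompt: Callable[[str, str, str], float],  # score_prompt(prompt, input, output)
) -> str:
    def dataset_score(prompt: str) -> float:
        return sum(score_prompt(prompt, x, y) for x, y in dataset) / len(dataset)
    # The argmax below uses the labels in `dataset`, so the chosen prompt has
    # effectively been fit to those examples -- just with very few degrees of freedom.
    return max(candidates, key=dataset_score)
```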
So there's no getting around the fact that we need a final training step to get the model to do what we want (even if this training step just involves searching over prompts). And we can do much better than prompt design at selecting and reinforcing the correct behavior.
Honing these techniques -- supervised fine-tuning on demonstrations, reinforcement learning on the right objective, and methods that make it easier for humans to supervise models -- will require a lot of thought and practice, regardless of the performance improvements we get from making our models bigger.
So far, my main point has been that just prompting isn't enough -- there are better ways of doing the final alignment step that fine-tunes models for the right objective. Returning to the original question, there was the claim that alignment gets easier as the models get smarter. It does get easier in some ways, but it also gets harder in others. Smarter models will be better at gaming our reward functions in unexpected and clever ways -- for example, producing the convincing illusion of being insightful or helpful, while actually being the opposite. And eventually they'll be capable of intentionally deceiving us.
2. Is alignment research distinct from capabilities research?
Skeptic: A lot of research that's called "alignment" or "safety" could've easily been motivated by the goal of making an AI-based product work well. And there's a lot of overlap between safety research and other ML research that's not explicitly motivated by safety. For example, RL from human preferences can be used for many applications like email autocomplete; its biggest showcase so far is summarization, which is a long-standing problem that's not safety-specific. AI alignment is about "getting models to do what we really want", but isn't the rest of AI research about that too? Is it meaningful to make a distinction between alignment and other non-safety ML research?
Advocate: It's true that it's impossible to fully disentangle alignment advances from other capabilities advances. For example, fine-tuning GPT-3 to answer questions or follow instructions is a case study of alignment, but it's also useful for many commercial applications.
While alignment research is useful for building products, it's usually not the lowest-hanging fruit. That's especially true for the hard alignment problems, like aligning superhuman models. To ensure that we keep making progress on these problems, it's important to have a research community (like Alignment Forum) that values this kind of work. And it's useful for organizations like OpenAI to have alignment-focused teams that can champion this work even when it doesn't provide the best near-term ROI.
While alignment and capabilities aren't distinct, they correspond to different directions that we can push the frontier of AI. Alignment advances make it easier to optimize hard-to-measure objectives like being helpful or truthful. Capabilities advances also sometimes make our models more helpful and more accurate, but they also make the models more potentially dangerous. For example, if someone fine-tunes a powerful model to maximize an easy-to-measure objective like ad click-through rate, or maximize the time users spend talking to a chatbot, or persuade people to support a political agenda, it can do a lot more damage than a weak model.
3. Is it worth doing alignment research now?
Skeptic: It seems like alignment advances are bottlenecked by model power. We can't make useful progress on alignment until our models are powerful enough to exhibit the relevant phenomena. For example, one of the big questions is how to align super-human models that are hard to supervise. We can think about this problem now, but it's going to be pretty speculative. We might as well just wait until we have such models.
Advocate: It's true that we can make faster progress on alignment problems when we can observe the relevant phenomena. But we can also make progress preemptively, with a little effort and cleverness. And what happens if we don't? In the short term, companies will keep deploying products that optimize simple, easy-to-measure objectives like revenue and engagement, which alignment work could improve on. In the long term, alignment might turn out to be very hard, requiring conceptual advances that we need to start on now to have ready by the time we feel obligated to use powerful AI systems.
In my experience, the "tech tree" of alignment research isn't bottlenecked by the scale of training. For example, I think it's possible to do a lot of useful research on improving the RL from human preferences methodology using existing models. "The case for aligning narrowly superhuman models" proposes some directions that seem promising to me.
Is there an even better critique that the Skeptic could make? Are Advocate's responses convincing? Does the Advocate have a stronger response to any of these questions? Let me know.
Thanks to Avery Pan, Beth Barnes, Jacob Hilton, and Jan Leike for feedback on earlier drafts.