This is interesting and I'm glad Anthropic did it. I quoted interesting-to-me parts and added some unjustified commentary.


Tom Brown at 20:00 

the US treats the Constitution as like the holy document—which I think is just a big thing that strengthens the US, like we don't expect the US to go off the rails in part because just like every single person in the US is like The Constitution is a big deal, and if you tread on that, like, I'm mad. I think that the RSP, like, it holds that thing. It's like the holy document for Anthropic. So it's worth doing a lot of iterations getting it right.

Daniela Amodei at 20:26

Some of what I think has been so cool to watch about the RSP development at Anthropic is it feels like it has gone through so many different phases and there's so many different skills that are needed to make it work, right? There's the big ideas; Dario and Paul and Sam and Jared and so many others are like, what are the principles? Like what are we trying to say? How do we know if we're right? But there's also this very operational approach to just iterating where we're like, well, we thought that we were gonna see this at this safety level, and we didn't, so should we change it so that we're making sure that we're holding ourselves accountable? And then there's all kinds of organizational things, right? We just were like, let's change the structure of the RSP organization for clearer accountability. And I think my sense is that for a document that's as important as this, right, I love the Constitution analogy, it's like there's all of these bodies and systems that exist in the US to make sure that we follow the Constitution, right? There's the courts, there's the Supreme Court, there's the presidency, there's, you know, both houses of Congress and they do all kinds of other things, of course, but there's like all of this infrastructure that you need around this one document, and I feel like we're also learning that lesson here.

Does the new RSP promote "clearer accountability"? I guess a little; per the new RSP:

Internal and external accountability: We have made a number of changes to our previous “procedural commitments.” These include expanding the duties of the Responsible Scaling Officer; adding internal critique and external expert input on capability and safeguard assessments; new procedures related to internal governance; and maintaining a public page for overviews of past Capability and Safeguard Reports, RSP-related updates, and future plans.

But mostly the new RSP is just "more flexible and nuanced," I think.

Also, minor:

well, we thought that we were gonna see this at this safety level, and we didn't, so should we change it so that we're making sure that we're holding ourselves accountable?

I don't really understand (like, I can't imagine an example that would be well-described by this), but I'm slightly annoyed because it suggests a vibe of "we have made the RSP stronger at least once."

Sam McCandlish at 21:30

[The RSP] sort of reflects a view a lot of us have about safety, which is that it's a solvable problem. It's just a very, very hard problem that's gonna take tons and tons of work. All of these institutions that we need to build up, like there's all sorts of institutions built up around like automotive safety, built up over many, many years. But we're like, Do we have the time to do that? We've gotta go as fast as we can to figure out what the institutions we need for AI safety are, and build those and try to build them here first, but make it exportable.

Dario Amodei at 22:00 

[The RSP] forces unity because if any part of the org is not kind of in line with our safety values, it shows up through the RSP, right? The RSP is gonna block them from doing what they want to do, and so it's a way to remind everyone over and over again, basically, to make safety a product requirement, part of the product planning process. And so it's not just a bunch of bromides that we repeat; it's something that you actually, if you show up here and you're not aligned, you actually run into it. And you either have to learn to get with the program or it doesn't work out.

I would agree if the RSP were stronger.

Daniela Amodei at 23:25

In addition to driving alignment, [the RSP] also drives clarity. Because it's written down what we're trying to do and it's legible to everyone in the company, and it's legible externally what we think we're supposed to be aiming towards from a safety perspective. It's not perfect. We're iterating on it, we're making it better, but I think there's some value in saying like, this is what we're worried about, this thing over here. Like you can't just use this word to sort of derail something in either direction, right? To say, oh, because of safety, we can't do X, or because of safety, we have to do X. We're really trying to make it clearer what we mean.

Dario adds:

Yeah, it prevents you from worrying about every last little thing under the sun. Because it's actually the fire drills that damage the cause of safety in the long run. I've said, like, If there's a building and the fire alarm goes off every week, that's a really unsafe building. Because when there's actually a fire, you're just gonna be like oh, it just goes off all the time. So it's very important to be calibrated.

Chris Olah at 24:20

I think that RSP creates healthy incentives at a lot of levels. So I think internally it aligns the incentives of every team with safety because it means that if we don't make progress on safety, we're gonna block. I also think that externally it creates a lot of healthier incentives than other possibilities, at least that I see, because it means that if we at some point have to take some kind of dramatic action, like if at some point we have to say we've reached some point and we can't yet make a model safe, it aligns that with sort of the point where there's evidence that supports that decision and there's sort of a preexisting framework for thinking about it, and it's legible. And so I think there's a lot of levels at which the RSP, I think in ways that maybe I didn't initially understand when we were talking about the early versions of it, it creates a better framework than any of the other ones that I've thought about.

I would agree if the RSP were stronger.

Daniela Amodei at 29:04

the way that I think about the RSP the most is what it sounds like, just like the tone. I think we just did a big rewrite of the tone of the RSP because it felt overly technocratic and even a little bit adversarial. I spend a lot of time thinking about how do you build a system that people want to be a part of. It's so much better if [] everyone in the company can walk around and tell you [] what are the top goals of the RSP, how do we know if we're meeting them, what AI safety level are we at right now—are we at ASL-2, are we at ASL-3—that people know what to look for because that is how you're going to have good common knowledge of if something's going wrong. If it's overly technocratic and it's something that only particular people in the company feel is accessible to them, [that's bad], and it's been really cool to watch it transition into this document where I think most if not everyone at the company, regardless of their role, could read it and say this feels really reasonable; I want to make sure that we're building AI in the following ways and I see why I would be worried about these things and I also kind of know what to look for if I bump into something. It's almost like, make it simple enough that if you're working at a manufacturing plant and you're like huh it looks like the safety seatbelt on this should connect this way but it doesn't connect — that you can spot it and that there's healthy feedback flow between leadership and the board and the rest of the company and the people that are actually building it because I actually think the way this stuff goes wrong in most cases is just like the wires get crossed. And that would be a really sad way for things to go wrong. It's all about operationalizing it, making it easy for people to understand.

I feel bad about this.

Jared Kaplan at 25:11

pushes back a little: all of the above were reasons to be excited about the RSP ex ante, but it's been surprisingly hard and complicated to determine evals and thresholds; in AI there's a big range where you don't know whether a model is safe. (Kudos.)

Dario Amodei at 41:38

"Race to the top" works in practice:

a few months after we came out with our RSP, the 3 most prominent AI companies had one; interpretability research, that's another area we've done it; just the focus on safety overall, like collaboration with the AI Safety Institutes, other areas.

Jack Clark adds: "the Frontier Red Team got cloned almost immediately" by other labs.

My take: Anthropic gets a little credit for RSP adoption; focus on interp isn't clearly good; Anthropic doesn't get credit for collaboration with AISIs (did it do better than GDM and OpenAI?); on red-teaming I'm not familiar with Anthropic's timeline and am interested in takes, but, like, I think GDM wrote "Model evaluation for extreme risks" before any Anthropic racing-to-the-top on evals.

Daniela Amodei at 42:08

Says customers say they prefer Claude because it's safer (in terms of hallucinations and jailbreaks).

Is it true that Claude is safer? Would be news to me.

Dario Amodei at 48:30

He thinks about places where there's (apparent) consensus, what everyone wise thinks, and then the consensus breaks. He thinks that's about to happen in interp, among other places.

I think interpretability is both the key to steering and making safe AI systems . . . , and interpretability contains insights about intelligent optimization problems and about how the human brain works.

I'd bet against this, and I wish Anthropic's alignment bets were less interp-y. (But largely because of vibes about what everyone wise thinks.)


(I claim Anthropic's RSP is not very ambitious and is quite insufficient to prevent catastrophe from Anthropic models, especially because ASL-4 hasn't been defined yet, but also because I worry that the ASL-3 standard will not be sufficient for upper-ASL-3 models.)

10 comments

Jack Clark adds: "the Frontier Red Team got cloned almost immediately" by other labs.

Wait what? I didn't hear about this. What other companies have frontier red teams? Where can I learn about them?

I think he’s just referring to DC evals, and I think this is wrong: other companies doing evals wasn’t really caused by Anthropic (but I could be unaware of facts).

Thanks for posting this. Editing feedback: I think the post would look quite a bit better if you used headings and LW quotes. This would generate a timestamped and linkable table of contents, and also more clearly distinguish the quotes from your commentary. Example:

Tom Brown at 20:00:

the US treats the Constitution as like the holy document—which I think is just a big thing that strengthens the US, like we don't expect the US to go off the rails in part because just like every single person in the US is like The Constitution is a big deal, and if you tread on that, like, I'm mad. I think that the RSP, like, it holds that thing. It's like the holy document for Anthropic. So it's worth doing a lot of iterations getting it right.

<your commentary>

Tiny editing issue: "[] everyone in the company can walk around and tell you []" -> The brackets are empty. Maybe these should be for italicized formatting?

I use empty brackets similar to ellipses in this context; they denote removed unimportant text.

(Can you edit out all the "like"s, or give permission for an admin to edit them out? I think in written text it makes speakers sound, for lack of a better word, unflatteringly moronic.)

I already edited out most of the "like"s and similar. I intentionally left some in when they seemed like they might be hedging or otherwise communicating this isn't exact. You are free to post your own version but not to edit mine.

Edit: actually I did another pass and edited out several more; thanks for the nudge.

Okay, well, I'm not going to post "Anthropic leadership conversation [fewer likes]" 😂

I did something similar when I made this transcript: leaving in verbal hedging, particularly in the context of contentious statements etc., where omitting such verbal tics can give a quite misleading impression.