Jack Clark adds: "the Frontier Red Team got cloned almost immediately" by other labs.
Wait what? I didn't hear about this. What other companies have frontier red teams? Where can I learn about them?
I think he’s just referring to dangerous capability (DC) evals, and I think this is wrong because I think other companies doing evals wasn’t really caused by Anthropic (but I could be unaware of facts).
Thanks for posting this. Editing feedback: I think the post would look quite a bit better if you used headings and LW quotes. This would generate a timestamped and linkable table of contents, and also more clearly distinguish the quotes from your commentary. Example:
the US treats the Constitution as like the holy document—which I think is just a big thing that strengthens the US, like we don't expect the US to go off the rails in part because just like every single person in the US is like The Constitution is a big deal, and if you tread on that, like, I'm mad. I think that the RSP, like, it holds that thing. It's like the holy document for Anthropic. So it's worth doing a lot of iterations getting it right.
<your commentary>
(Can you edit out all the "like"s, or give permission for an admin to edit them out? I think in written text it makes speakers sound, for lack of a better word, unflatteringly moronic.)
I already edited out most of the "like"s and similar. I intentionally left some in when they seemed like they might be hedging or otherwise communicating this isn't exact. You are free to post your own version but not to edit mine.
Edit: actually I did another pass and edited out several more; thanks for the nudge.
I did something similar when I made this transcript: I left in verbal hedging, particularly in the context of contentious statements, where omitting such verbal tics can give a quite misleading impression.
This is interesting and I'm glad Anthropic did it. I quoted interesting-to-me parts and added some unjustified commentary.
Tom Brown at 20:00
Daniela Amodei at 20:26
Does the new RSP promote "clearer accountability"? I guess a little; per the new RSP:
But mostly the new RSP is just "more flexible and nuanced," I think.
Also, minor:
I don't really understand (like, I can't imagine an example that would be well-described by this) but I'm slightly annoyed because it suggests a vibe of we have made the RSP stronger at least once.
Sam McCandlish at 21:30
Dario Amodei at 22:00
I would agree if the RSP was stronger.
Daniela Amodei at 23:25
Dario adds:
Chris Olah at 24:20
I would agree if the RSP was stronger.
Daniela Amodei at 29:04
I feel bad about this.
Jared Kaplan at 25:11
pushes back a little: all of the above were reasons to be excited about the RSP ex ante but it's been surprisingly hard and complicated to determine evals and thresholds; in AI there's a big range where you don't know whether a model is safe. (Kudos.)
Dario Amodei at 41:38
"Race to the top" works in practice:
Jack Clark adds: "the Frontier Red Team got cloned almost immediately" by other labs.
My take: Anthropic gets a little credit for RSP adoption; focus on interp isn't clearly good; Anthropic doesn't get credit for collaboration with AISIs (did it do better than GDM and OpenAI?); on red-teaming, I'm not familiar with Anthropic's timeline and am interested in takes, but, like, I think GDM wrote "Model evaluation for extreme risks" before any Anthropic-racing-to-the-top on evals.
Daniela Amodei at 42:08
Says customers say they prefer Claude because it's safer (in terms of hallucinations and jailbreaks).
Is it true that Claude is safer? Would be news to me.
Dario Amodei at 48:30
He thinks about places where there's (apparent) consensus, what everyone wise thinks, and then it breaks. He thinks that's about to happen in interp, among other places.
I'd bet against + I wish Anthropic's alignment bets were less interp-y. (But largely because of vibes about what everyone wise thinks.)
(I claim Anthropic's RSP is not very ambitious and is quite insufficient to prevent catastrophe from Anthropic models, especially because ASL-4 hasn't been defined yet but also I worry that the ASL-3 standard will not be sufficient for upper-ASL-3 models.)