All views are my own, not Anthropic’s. This post assumes Anthropic’s announcement of RSP v3.0 as background. Today, Anthropic released its Responsible Scaling Policy 3.0. The official announcement discusses the high-level thinking behind it. This is a more detailed post giving my own takes on the update. First, the big...
This is a linkpost for a new research paper from the Alignment Evaluations team at Anthropic and other researchers, introducing a suite of evaluations of models' abilities to undermine measurement, oversight, and decision-making. Paper link. Abstract: > Sufficiently capable models could subvert human oversight and decision-making in important contexts....
Last year, I posted a call for case studies on social-welfare-based standards for companies and products (including standards imposed by regulation). The goal was to build general context on how such standards work, to inform possible standards and/or regulation for AI. This resulted[1] in several dozen case studies that I found...
Yes, this is my first post in almost a year. I’m no longer prioritizing this blog, but I will still occasionally post something. I wrote ~2 years ago that it was hard to point to concrete opportunities to help the most important century go well. That’s changing. There are a...
Views are my own, not Open Philanthropy’s. I am married to the President of Anthropic and have a financial interest in both Anthropic and OpenAI via my spouse. Over the last few months, I’ve spent a lot of my time trying to help out with efforts to get responsible scaling...
One of the biggest reasons alignment might be hard is what I’ll call threat obfuscation: various dynamics that might make it hard to measure/notice cases where an AI system has problematic misalignment (even when the AI system is in a controlled environment and one is looking for signs of misalignment)....
I sometimes hear people asking: “What is the plan for avoiding a catastrophe from misaligned AI?” This post gives my working answer to that question - sort of. Rather than a plan, I tend to think of a playbook.[1] * A plan connotes something like: “By default, we ~definitely fail....