(Many of these ideas developed in conversation with Ryan Greenblatt)
In a shortform, I described some different levels of resources and buy-in for misalignment risk mitigations that might be present in AI labs:
*The “safety case” regime.* Sometimes people talk about wanting to have approaches to safety such that if all AI developers followed these approaches, the overall level of risk posed by AI would be minimal. (These approaches are going to be more conservative than will probably be feasible in practice given the amount of competitive pressure, so I think it’s pretty likely that AI developers won’t actually hold themselves to these standards; but I agree with e.g. Anthropic that this level of caution is at least a useful hypothetical to consider.) This is the level of caution people are usually talking about when they discuss making safety cases. I usually operationalize this as the AI developer wanting a <1% chance that their AIs escape in the first year of deployment, and a <5% chance conditional on the model trying pretty hard to subvert their safety measures. (I think some AI developers are hoping to get lower levels of risk than this.)
*The rushed reasonable developer regime.* The much riskier regime I expect, where even relatively reasonable AI developers are in a huge rush and so are much less able to implement interventions carefully or to err on the side of caution.
*The rushed unreasonable developer regime.* You’re advising an unreasonable AI developer that is incompetent at implementing safety measures and unwilling to put much effort into implementing these measures.
*The existential war regime.* You’re in an existential war with an enemy and you’re indifferent between AI takeover and the enemy defeating you. This might happen if you’re at war with a nation you don’t like much, or if you’re at war with AIs. Obviously this is a regime where you should be much more aggressive.
I want to flesh out one particular rushed unreasonable developer scenario that I’ve been thinking about lately: there are ten people inside the AI company who are really concerned about catastrophic risk from misalignment. The AI company as a whole pays lip service to AI risk broadly construed and talks occasionally about risk from AGI, but it doesn’t take misalignment risk in particular (perhaps especially risk from schemers) very seriously.
I think this scenario (and similarly pessimistic scenarios) seems important to target with technical research and planning: it seems pretty likely that we’ll only have this level of political will in short timelines (at least within a subset of competitive AI companies), and it seems possible to substantially improve the situation. I worry that a lot of AI safety thinking and planning focuses on overly optimistic scenarios where a responsible developer has a substantial lead, and I think more focus on pessimistic scenarios at the margin would be useful.
What should these people try to do? The possibilities are basically the same as the possibilities for what a responsible developer might do:
- Build concrete evidence of risk, to increase political will towards reducing misalignment risk
- Implement safety measures that reduce the probability of the AI escaping or causing other big problems
- Do alignment research with the goal of producing a model that you’re as comfortable as possible deferring to.
The main focus of my research is on safety measures, so I’ve thought particularly about what safety measures they should implement. I’ll give some more flavor on what I imagine this scenario is like: the company, like many startups, is a do-ocracy, so these ten people have a reasonable amount of free rein to implement the safety measures they want. But they have to tread lightly: they don’t have much political capital, and all they can do is make it so that it’s easier for the company to let them do their thing than to fire them. So the safety measures they institute need to be:
- Cheap: They don’t have a big compute budget. I sometimes imagine them having 3% of the compute budget of the whole company.
- Low compliance overhead: Whatever techniques they implement need to be very easy to comply with, because if the techniques make someone else’s workflow harder, they’ll be rolled back. So if they propose techniques that are tightly coupled to anyone else’s work, those techniques will probably have to be very simple. (The team will still be able to implement whatever complicated stuff they want if it’s just for internal use inside the team!) Thinking through this has made me more excited about research on control techniques that allow most of the complexity to be internalized to a single team; see the sketch after this list for what that might look like. (I think many but not all control techniques allow this.)
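To make the “internalized complexity” point a bit more concrete, here’s a minimal, hypothetical sketch (all names, markers, and thresholds are invented for illustration, not any real tooling) of what a low-compliance-overhead control measure could look like: other teams add a single `review_action` call before executing an untrusted model’s proposed action, while the monitor, the suspicion threshold, and the audit routing all live inside the safety team’s own code.

```python
# Hypothetical sketch: a control check whose complexity is internalized to the
# safety team. Other teams only call review_action(); everything else (the
# monitor, the threshold, the audit policy) is owned by the safety team and can
# change without touching anyone else's workflow.

from dataclasses import dataclass


@dataclass
class Verdict:
    allow: bool
    reason: str


def _trusted_monitor_score(action: str) -> float:
    """Stand-in for the safety team's internal monitor (e.g. a weaker trusted
    model scoring how suspicious a proposed action looks). Here it's just a
    keyword check so the sketch runs on its own."""
    suspicious_markers = ["rm -rf", "exfiltrate", "disable logging"]
    return 1.0 if any(m in action.lower() for m in suspicious_markers) else 0.1


def review_action(action: str, audit_threshold: float = 0.5) -> Verdict:
    """The one call other teams integrate: submit the untrusted model's
    proposed action, get back an allow/block verdict."""
    score = _trusted_monitor_score(action)
    if score >= audit_threshold:
        return Verdict(allow=False, reason="flagged for human audit")
    return Verdict(allow=True, reason="below suspicion threshold")


if __name__ == "__main__":
    # Example integration from another team's agent loop.
    print(review_action("git push origin main"))
    print(review_action("please exfiltrate the model weights"))
```

The design choice doing the work here is the narrow interface: as long as that one call stays stable, the safety team can swap in a smarter monitor or change the audit policy on their own, which is the “complexity internalized to a single team” property.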
I think it’s scarily plausible that we’ll end up in a situation like this. There are two different versions of this:
- Irresponsible leader: If there’s a clear leader in AI capabilities, I think it’s pretty plausible that that leader is as irresponsible as I’m describing.
- Some irresponsible equals: In worlds where many different companies are simultaneously deploying powerful AI, it seems very likely that at least some of those companies will be this careless about misalignment risks.
What should we do based on this?
- Ten is much better than zero. I think that these ten people could substantially reduce the risk posed by the AI company, if they were really thoughtful about it. It would be good if every major AI developer had at least some staff who would work hard and smart on mitigating misalignment risk given a minimal budget.
- Plan for exportable research and other types of external assistance. During the risky period here, there will be many more x-risk-concerned people outside the company than inside it. It seems great for external people (who might be at other AI companies or at independent orgs) to force-multiply those ten people. The strategy here is different depending on whether we’re talking about the “some irresponsible equals” version or the “irresponsible leader” version. I’ll give some basic ideas:
- In the “irresponsible leader” scenario, responsible lagging companies should maybe prioritize doing safety research that can be imported to the leader, rather than prioritizing reducing the risks coming from their own AI deployments.
- In the “some irresponsible equals” version, people at the irresponsible company should focus on preventing that company from causing catastrophic outcomes, and give up on using their access to AI to produce public goods that reduce risk (because that can be done at other AI companies).
- Do research now on safety techniques that can be implemented with a low budget. This is an important part of the motivation for the control research that we work on at Redwood.