Thanks for posting this. I think these are valuable lessons, and I agree it would be worthwhile for someone to do a project looking into successful emergency response practices. One thing this framing also highlights is that, as Quintin Pope discussed in his recent post on alignment optimism, the “security mindset” is not appropriate for the default alignment problem. We are only being optimized against once we have already failed to align the AI; until then, we are mostly held to the lower bar of reliability, not security. There is also the problem of malicious human actors, where security problems could appear before the AI becomes misaligned, but this failure mode seems less likely and less x-risk-inducing than misalignment, and it calls for a pretty different set of measures (info-sharing policies and encrypting the weights, as opposed to training techniques or evals).
I recently read the first chapter of Building Secure & Reliable Systems (the book from the EA infosec book club). The chapter is titled “The intersection of security and reliability.”
I found it to be a helpful introduction to some security concepts. I’m including some of my notes below, along with some initial musings about how these concepts relate to AI risk.
The difference between reliability and security is the presence of an adversary
Implication for AI risk: AI labs will have both reliability needs and security needs. Many of the failure modes I’m most worried about are best modeled as security needs. A situationally aware AI is an adversary. Methods that merely prevent nonmalicious or nonadversarial failures are unlikely to be sufficient when we’re dealing with AIs that might actively try to exploit vulnerabilities.
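To make the reliability/security distinction concrete with a standard (non-AI) example: a checksum is a reliability mechanism, while a keyed MAC is a security mechanism. A plain checksum catches random corruption but not deliberate tampering, since an adversary can simply recompute it for the modified data. Here’s a minimal Python sketch (my own illustration, not from the book or the post):

```python
import hashlib
import hmac
import zlib

# Reliability: a plain checksum catches *accidental* corruption
# (bit flips, truncated writes), because random faults almost never
# produce a matching checksum by chance.
def looks_intact(data: bytes, expected_crc: int) -> bool:
    return zlib.crc32(data) == expected_crc

# Security: against an *adversary*, the checksum is useless; the
# attacker simply recomputes it for the tampered data. You need a
# secret the attacker doesn't hold, e.g. a keyed MAC.
def is_authentic(data: bytes, tag: bytes, key: bytes) -> bool:
    expected = hmac.new(key, data, hashlib.sha256).digest()
    return hmac.compare_digest(expected, tag)
```

The analogy I’d draw: techniques that keep a well-behaved model from failing by accident play the role of the checksum, and they don’t automatically buy you anything against a system that is actively searching for gaps.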
Simplicity
Implication for AI risk: It’s extremely worrying that our current systems are highly complex and we can’t understand them.
Malicious insiders
Implication for AI risk: Conjecture’s internal infohazard policy seems like a reasonable attempt to reduce risks from malicious (or negligent) insiders. The principle of least privilege (or a “need-to-know” policy) sounds valuable. I would be excited for labs like OpenAI, DeepMind, and Anthropic to adopt similar policies.
Additionally, it seems likely that threats from malicious and negligent insiders will grow over time, and some of these threats will be existential. Multi-party authorization for major decisions seems like a promising and intuitive intervention.
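To gesture at what multi-party authorization could look like mechanically, here is a hypothetical sketch (the names and the “weight export” scenario are made up for illustration; this isn’t a description of any lab’s actual process): a sensitive action requires sign-off from several distinct, pre-designated approvers, so no single insider can take it unilaterally.

```python
from dataclasses import dataclass, field

# Hypothetical example: a sensitive action (say, exporting model
# weights) requires sign-off from at least `threshold` distinct
# people drawn from a pre-approved set. No single insider, however
# privileged, can authorize it alone.
@dataclass
class MultiPartyGate:
    approvers: set[str]               # people allowed to sign off
    threshold: int                    # distinct approvals needed
    approvals: set[str] = field(default_factory=set)

    def approve(self, person: str) -> None:
        if person not in self.approvers:
            raise PermissionError(f"{person} is not an authorized approver")
        self.approvals.add(person)

    def is_authorized(self) -> bool:
        return len(self.approvals) >= self.threshold

# Usage: exporting weights needs any 2 of 3 designated approvers.
gate = MultiPartyGate(approvers={"alice", "bob", "carol"}, threshold=2)
gate.approve("alice")
assert not gate.is_authorized()   # one approval is not enough
gate.approve("bob")
assert gate.is_authorized()       # two distinct approvals clear the gate
```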
Emergency plans
Implication for AI risk: AI could strike extremely quickly. A superintelligent AI would cut through any of our defenses like butter. But I expect the first systems capable of overpowering humanity to be considerably weaker.
There might be an AI governance/security research project here: Examine case studies of emergency plans that have been employed in other industries, identify best practices, adapt those for AI takeover threat models, and propose recommendations (or an entire plan) to OpenAI.
My best guess is that labs already have emergency plans, but it wouldn’t surprise me if those plans could be improved by an intelligent person/team performing a project like the one described above. There are lots of things to do, and many dropped balls. (This project’s expected impact would be higher if the person/team performing it had a way to reach members of lab governance/security teams. That may be less difficult than you think.)
Recovering from security failures
Implication for AI risk: This makes me think about evals. Once an eval is failed, it would be great if labs responded slowly and cautiously. There might be pressure, however, to get the system back to passing the eval as quickly as possible. How a lab responds will “ultimately come down to a risk assessment and a business decision.”
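To illustrate the difference between “respond slowly and cautiously” and “race to pass,” here’s a toy sketch of a “freeze and escalate” policy I’m imagining (purely hypothetical, not how any lab’s eval pipeline works): a failed eval freezes deployment and opens an incident that a quick re-run can’t close.

```python
from dataclasses import dataclass, field

# Hypothetical sketch of a "freeze and escalate" policy: a failed
# dangerous-capability eval doesn't trigger a quick retry; it freezes
# the rollout and opens an incident that only a separate review
# process (not modeled here) could close.
@dataclass
class DeploymentGate:
    frozen: bool = False
    open_incidents: list[str] = field(default_factory=list)

    def record_eval(self, eval_name: str, passed: bool) -> None:
        if passed:
            return
        # Cautious response: freeze first, investigate second.
        self.frozen = True
        self.open_incidents.append(eval_name)

    def may_deploy(self) -> bool:
        # Rerunning the eval until it passes does NOT unfreeze anything.
        return not self.frozen and not self.open_incidents

gate = DeploymentGate()
gate.record_eval("autonomous-replication-eval", passed=False)
assert not gate.may_deploy()   # a later passing rerun wouldn't change this
```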
I empathize a bit with lab folks who say things like “current systems aren’t going to destroy humanity, so we’re moving quickly now, but we really do plan to be more cautious and slow down later.” At the same time, I think it’s reasonable for AI safety folks to respond: “Your current plans involve making lots of complicated decisions once things get more dangerous, your current behavior/culture seems inconsistent with a cautious approach, and it’s pretty hard to suddenly shift from a culture of acceleration/progress to a culture of caution/security. Maybe you should instill those norms right now. Also, you’re allowed to decide not to, but then you shouldn’t be surprised if we don’t trust you as much. Trust is earned by demonstrating a track record of making responsible risk assessments and showing that you balance those risk assessments against business interests in reasonable ways.”
I'm grateful to Jeffrey Ladish for feedback on this post.