Crossposted from https://stephencasper.com/reframing-ai-safety-as-a-neverending-institutional-challenge/

Stephen Casper

“They are wrong who think that politics is like an ocean voyage or a military campaign, something to be done with some particular end in view, something which leaves off as soon as that end is reached. It is not a public chore, to be got over with. It is a way of life.”
– Plutarch 

“Eternal vigilance is the price of liberty.”
– Wendell Phillips

“The unleashed power of the atom has changed everything except our modes of thinking, and we thus drift toward unparalleled catastrophe.” 
– Albert Einstein

“Technology is neither good nor bad; nor is it neutral.”
– Melvin Kranzberg 

“Don’t ask if artificial intelligence is good or fair, ask how it shifts power.”
– Pratyusha Kalluri

“Deliberation should be the goal of AI Safety, not just the procedure by which it is ensured.”
– Roel Dobbe, Thomas Gilbert, and Yonatan Mintz

 

As AI becomes increasingly transformative, we need to rethink how we approach safety – not as a technical alignment problem, but as an ongoing, unsexy struggle. 

“What are your timelines?” This question constantly reverberates around the AI safety community. It reflects the idea that there may be some critical point in time at which humanity will either succeed or fail in securing a safe future. It stems from an idea woven into the culture of the AI safety community: that the coming of artificial general intelligence will likely be either apocalyptic or messianic and that, at a certain point, our fate will no longer be in our hands (e.g., Chapter 6 of Bostrom, 2014). But what exactly are we planning for? Why should one specific moment matter more than others? History teaches us that transformative challenges tend to unfold not as pivotal moments, but as a series of developments that test our adaptability.

An uncertain future: AI has the potential to transform the world. If we build AI that is as good as or better than humans at most tasks, at a minimum, it will be highly disruptive; at a maximum, it could seriously threaten the survival of humanity (Hendrycks et al., 2023). And that’s not to mention the laundry list of non-catastrophic risks that advanced AI poses (Slattery et al., 2024). The core goal of the AI safety community is to make AI’s future go better, but it is very difficult to effectively predict and plan for what is next. Forecasting the future of AI has been fraught, to say the least. Past agendas for AI safety have centered on challenges that bear limited resemblance to the key ones we face today (e.g., Soares and Fallenstein, 2015; Amodei et al., 2016; Everitt et al., 2018; Critch and Krueger, 2020). Meanwhile, many of the community’s investments in scaling labs, alignment algorithms (e.g., Christiano et al., 2017), and safety-coded capabilities research have turned out to be, at best, of doubtful net benefit for safety (e.g., Ahmed et al., 2023; Ren et al., 2024).

The Pandora’s Box: If AI becomes highly transformative and humanity survives it, two outcomes are possible: either some coherent coalition will monopolize the technology globally, or it will proliferate. In other words, some small coalition will take control of the world—or one will not. While the prospect of global domination is concerning for obvious reasons, there are also strong reasons to believe it is unlikely. Power is deeply embedded in complex global structures that are resistant to being overthrown. Moreover, possessing the capacity for world domination does not guarantee that it will happen. For example, in the 1940s, nuclear weapons could have enabled a form of global control, yet no single nation managed to dominate the world. Much of the alarm within the AI safety community today resembles what a movement centered on nuclear utopianism or dystopianism might have looked like in the 1930s. Historically, transformative technologies have consistently spread once developed, and thus far, AI has shown every sign that it will continue to follow this pattern.

The risks we face: Imagine a future where AI causes significant harm. What might lead to such an outcome? One scenario involves complex systemic effects (Arnscheidt et al., 2024; Kulveit et al., 2025). Another, non-mutually-exclusive possibility is for harm to arise from the choices of some identifiable set of actors. In cases involving such perpetrators, the event would fall on a spectrum ranging from a pure accident to a clearly intentional act. We can taxonomize catastrophic AI risks accordingly.

Will technical solutions save us? Inside the AI safety community, it is common for agendas to revolve around solving the technical alignment challenge: getting AI systems’ actions to serve the goals and intentions of their operators (e.g., Amodei et al., 2016; Everitt et al., 2018; Critch and Krueger, 2020; Ji et al., 2023). This focus stems from a long history of worrying that rogue AI systems pursuing unintended goals could spell catastrophe (e.g., Bostrom, 2014). However, the predominance of these concerns seems puzzling in light of our risk profile (Khlaaf, 2023). Alignment can only be sufficient for safety if (1) no catastrophes result from systemic effects and (2) the people in control of the AI are reliably benevolent and responsible. Both are highly uncertain. And if/when transformative AI opens a Pandora’s box, it will become almost inevitable that some frontier developers will be malicious or reckless. On top of this, being better at technical alignment could also exacerbate risks by speeding up timelines and enabling more acutely harmful misuse.

If AI causes a catastrophe, what are the chances that it will be triggered by the choices of people who were exercising what would be considered to be “best safety practices” at the time? The history of human conflict and engineering disasters (Dekker, 2019) suggests that these chances are very small. For example, consider nuclear technology. The technical ‘nuclear alignment’ problem is virtually solved: nuclear tech reliably makes energy when its users want it to and reliably destroys cities when its users want it to. To be clear, it is very nice to live in a world in which catastrophic failures of nuclear power plants are very rare. Nonetheless, for the foreseeable future, we will be in perpetual danger from nuclear warfare—not for technical reasons, but institutional ones.

When all you have is a hammer… The AI safety community’s focus on technical alignment approaches can be explained, in part, by its cultural makeup. The community is diverse and discordant. Yet, throughout its existence, it has been overwhelmingly filled with highly technical people trained to solve well-posed technical problems. So it is natural that technical people gravitate toward technical alignment challenges as their way of improving safety. However, AI safety is not a model property (Narayanan and Kapoor, 2024), and conflating alignment with safety is a pervasive mistake (Khlaaf, 2023).

Who benefits from technosolutionism? The overemphasis on alignment can also be explained by how useful it is for technocrats. For example, leading AI companies like OpenAI, Anthropic, Google DeepMind, etc., acknowledge the potential for catastrophic AI risks but predominantly frame these challenges in terms of technical alignment. And isn’t it convenient that the alignment objectives that they say will be needed to prevent AI catastrophe would potentially give them trillions of dollars and enough power to rival the very democratic institutions meant to address misuse and systemic risks? Meanwhile, the AI research community is full of people who have much more wealth and privilege than the vast majority of others in the world. Technocrats also tend to benefit from new technology instead of being marginalized by it. In this light, it is unsurprising to see companies that gain power from AI push the notion that it’s the AI and not them that we should fear – that if they are just able to build and align superintelligence (before China, of course) we might be safe. Does this remind you of anything? 

The long road ahead: If we are serious about AI safety, we need to be serious about preparing for the Pandora’s box that will be opened once transformative AI proliferates. At that point, technical alignment solutions will sometimes be useful. However, they will also fail to protect us from most disaster scenarios, exacerbate misuse risks, concentrate power, and accelerate timelines. Unless we believe in an AI messiah, we can expect the fight for AI safety to be a neverending, unsexy struggle. This underscores a need to build institutions that place effective checks and balances on AI. The more pressing question is not what kind of AI systems to develop but how to shape the AI ecosystem in a way that better enables the ongoing work of identifying, studying, and deliberating about risks. In light of this, the core priorities of the AI safety community should be to build governmental capacity (e.g., Carpenter and Ezell, 2024; Novelli et al., 2024), increase transparency and accountability (e.g., Uuk et al., 2024; Casper et al., 2025), and improve disaster preparedness (e.g., Wasil et al., 2024; Bernardi et al., 2024). These are the defining challenges on which long-term AI safety will depend.

Comments
[-]Steven Byrnes

I disagree that people working on the technical alignment problem generally believe that solving that technical problem is sufficient to get to Safe & Beneficial AGI. I for one am primarily working on technical alignment but bring up non-technical challenges to Safe & Beneficial AGI frequently and publicly, and here’s Nate Soares doing the same thing, and practically every AGI technical alignment researcher I can think of talks about governance and competitive races-to-the-bottom and so on all the time these days…. Like, who specifically do you imagine that you’re arguing against here? Can you give an example? Dario Amodei maybe? (I am happy to throw Dario Amodei under the bus and no-true-Scotsman him out of the “AI safety community”.)

I also disagree with the claim (not sure whether you endorse it, see next paragraph) that solving the technical alignment problem is not necessary to get to Safe & Beneficial AGI. If we don’t solve the technical alignment problem, then we’ll eventually wind up with a recipe for summoning more and more powerful demons with callous lack of interest in whether humans live or die. And more and more people will get access to that demon-summoning recipe over time, and running that recipe will be highly profitable (just as using unwilling slave labor is very profitable until there’s a slave revolt). That’s clearly bad, right? Did you mean to imply that there’s a good future that looks like that? (Well, I guess “don’t ever build AGI” is an option in principle, though I’m skeptical in practice because forever is a very long time.)

Alternatively, if you agree with me that solving the technical alignment problem is necessary to get to Safe & Beneficial AGI, and that other things are also necessary to get to Safe & Beneficial AGI, then I think your OP is not clearly conveying that position. The tone is wrong. If you believed that, then you should be cheering on the people working on technical alignment, while also encouraging more people to work on non-technical challenges to Safe & Beneficial AGI. By contrast, this post strongly has a tone that we should be working on non-technical challenges instead of the technical alignment problem, as if they were zero-sum, when they’re obviously (IMO) not. (See related discussion of zero-sum-ness here.)

[-]scasper

Thx!

> I disagree that people working on the technical alignment problem generally believe that solving that technical problem is sufficient to get to Safe & Beneficial AGI.

I won't put words in people's mouths, but it's not my goal to talk about words. I think that large portions of the AI safety community act this way. This includes most people working on scalable alignment, interp, and deception.
 

> If we don’t solve the technical alignment problem, then we’ll eventually wind up with a recipe for summoning more and more powerful demons with callous lack of interest in whether humans live or die.

Yeah, I don't really agree with the idea that getting better at alignment is necessary for safety. I think that it's more likely than not that we're already sufficiently good at it. The paragraph beginning “If AI causes a catastrophe, what are the chances that it will be triggered by the choices of people who were exercising what would be considered to be ‘best safety practices’ at the time?” gives my thoughts on this.

[-]Steven Byrnes

> I think that large portions of the AI safety community act this way. This includes most people working on scalable alignment, interp, and deception.

Hmm. Sounds like “AI safety community” is a pretty different group of people from your perspective than from mine. Like, I would say that if there’s some belief that is rejected by Eliezer Yudkowsky and by Paul Christiano and by Holden Karnofsky, and widely rejected by employees of OpenPhil and 80,000 hours and ARC and UK-AISI, and widely rejected by self-described rationalists and by self-described EAs and by the people at Anthropic and DeepMind (and maybe even OpenAI) who have “alignment” in their job title … then that belief is not typical of the “AI safety community”.

If you want to talk about actions not words, MIRI exited technical alignment and pivoted to AI governance, OpenPhil is probably funding AI governance and outreach as much as they’re funding technical alignment (hmm, actually, I don’t know the ratio, do you?), 80,000 hours is pushing people into AI governance and outreach as much as into technical alignment (again I don’t know the exact ratio, but my guess would be 50-50), Paul Christiano’s ARC spawned METR, ARIA is funding work on the FlexHEG thing, Zvi writes way more content on governance and societal and legal challenges than on technical alignment, etc.

If you define “AI safety community” as “people working on scalable alignment, interp, and deception”, and say that their “actions not words” are that they’re working on technical alignment as opposed to governance or outreach or whatever, then that’s circular / tautological, right?

> I don't really agree with the idea that getting better at alignment is necessary for safety. I think that it's more likely than not that we're already sufficiently good at it.

If your opinion is that people shouldn’t work on technical alignment because technical alignment is already a solved problem, that’s at least a coherent position, even if I strongly disagree with it. (Well, I expect future AI to be different than current AI in a way that will make technical alignment much much harder. But let’s not get into that.)

But even in that case, I think you should have written two different posts:

  • one post would be entitled “good news: technical alignment is easy, egregious scheming is not a concern and never will be, and all the scalable oversight / interp / deception research is a waste of time” (or whatever your preferred wording is)
  • the other post would be entitled “bad news: we are not on track in regards to AI governance and institutions and competitive races-to-the-bottom and whatnot”.

That would be a big improvement! For my part, I would agree with the second and disagree with the first. I just think it’s misleading how this OP is lumping those two issues together.

> If AI causes a catastrophe, what are the chances that it will be triggered by the choices of people who were exercising what would be considered to be “best safety practices” at the time?

I think it’s pretty low, but then again, I also think ASI is probably going to cause human extinction. I think that, to avoid human extinction, we need to either (A) never ever build ASI, or both (B) come up with adequate best practices to avoid ASI extinction and (C) ensure that relevant parties actually follow those best practices. I think (A) is very hard, and so is (B), and so is (C).

If your position is: “people might not follow best practices even if they exist, so hey, why bother creating best practices in the first place”, then that’s crazy, right?

For example, Wuhan Institute of Virology is still, infuriatingly, researching potential pandemic viruses under inadequate BSL-2 precautions. Does that mean that inventing BSL-4 tech was a waste of time? No! We want one group of people to be inventing BSL-4 tech, and making that tech as inexpensive and user-friendly as possible, and another group of people in parallel to be advocating that people actually use BSL-4 tech when appropriate, and a third group of people in parallel advocating that this kind of research not be done in the first place given the present balance of costs and benefits. (…And a fourth group of people working to prevent bioterrorists who are actually trying to create pandemics, etc. etc.)

[-]scasper

I’m glad you think that the post has a small audience and may not be needed. I suppose that’s a good sign. 
 


In the post I said it’s good that nukes don’t blow up by accident and, similarly, it’s good that BSL-4 protocols and tech exist. I’m not saying that alignment solutions shouldn’t exist. I am speaking to a specific audience (e.g., the frontier companies and their allies) to say that their focus on alignment isn’t commensurate with its usefulness. Also, don’t forget the dual nature of alignment progress: as I mentioned in the post, frontier alignment progress hastens timelines and makes misuse risk more acute.

> I think that large portions of the AI safety community act this way. This includes most people working on scalable alignment, interp, and deception.

Are you sure? For example, I work on technical AI safety because it's my comparative advantage, but agree at a high level with your view of the AI safety problem, and almost all of my donations are directed at making AI governance go well. My (not very confident) impression is that most of the people working on technical AI safety (at least in Berkeley/SF) are in a similar place.

[-]habryka

This IMO doesn't really make any sense. If we get powerful AI, and we can either control it, or ideally align it, then the gameboard for both global coordination and building institutions completely changes (and of course if we fail to control or align it, the gameboard is also flipped, but in a way that removes us completely from the picture).

Does anyone really think that by the time you have systems vastly more competent than humans, that we will still face the same coordination problems and institutional difficulties as we have right now?

It does really look like there will be a highly pivotal period of at most a few decades. There is a small chance humanity decides to very drastically slow down AI development for centuries, but that seems pretty unlikely, and also not clearly beneficial. That means it's not a neverending institutional challenge; it's a challenge that lasts a few decades at most, during which humanity will be handing off control to some kind of cognitive successor which is very unlikely to face the same kinds of institutional challenges as we are facing today.

That handoff is not purely a technical problem, but a lot of it will be. At the end of the day, whether your successor AI systems/AI-augmented-civilization/uplifted-humanity/intelligence-enhanced-population will be aligned with our preferences over the future has a lot of highly technical components.

Yes, there will be a lot of social problems, but the size and complexity of the problems are finite, at least from our perspective. It does appear that humanity is at the cusp of unlocking vast intelligence, and after you do that, you really don't care very much about the weird institutional challenges that humanity is currently facing, most of which can clearly be overcome by being smarter and more competent.

[-]scasper

There's a crux here somewhere related to the idea that, with high probability, AI will be powerful enough and integrated into the world in such a way that it will be inevitable or desirable for normal human institutions to eventually lose control and for some small regime to take over the world. I don't think this is very likely for reasons discussed in the post, and it's also easy to use this kind of view to justify some pretty harmful types of actions.  

[-]habryka

I don't think I understand. It's not about human institutions losing control "to a small regime". It's just about most coordination problems being things you can solve by being smarter. You can do that in high-integrity ways, probably much higher integrity and with less harmful effects than how we've historically overcome coordination problems. I de-facto don't expect things to go this way, but my opinions here are not at all premised on it being desirable for humanity to lose control?

[-]scasper

My bad. Didn’t mean to imply you thought it was desirable. 

[-]habryka

No worries!

You did say it would be premised on it being either "inevitable or desirable for normal institutions to eventually lose control". In some sense I do think this is "inevitable", but only in the same sense in which past "normal human institutions" lost control.

We now have the internet and widespread democracy so almost all governmental institutions needed to change how they operate. Future technological change will force similar changes. But I don't put any value in the literal existence of our existing institutions, what I care about is whether our institutions are going to make good governance decisions. I am saying that the development of systems much smarter than current humans will change those institutions, very likely within the next few decades, making most concerns about present institutional challenges obsolete.

Of course something that one might call "institutional challenges" will remain, but I do think there really will be a lot of buck-passing that will happen from the perspective of present day humans. We do really have a crunch time of a few decades on our hands, after which we will no longer have much influence over the outcome.

I don't like the thing you're doing where you're eliding all mention of the actual danger AI Safety/Alignment was founded to tackle - AGI having a mind of its own, goals of its own, that seem more likely to be incompatible with/indifferent to our continued existence than not.

Everything else you're saying is agreeable in the context you're discussing it in, that of a dangerous new technology - I'd feel much more confident if the Naval Nuclear Propulsion Program (Rickover's people) were the dominant culture in AI development.
Albeit I have strong doubts about the feasibility of the 'Oughts[1]' you're proposing, and more critically - I reject the framing...

 

Any sufficiently advanced technology is indistinguishable from ~~magic~~ ~~biology~~ life

To assume AGI is transformative and important is to assume it has a mind[2] of its own: the mind is what makes it transformative.

At the very least - assuming no superintelligence - we are dealing with a profound philosophical/ethical/social crisis, for which control based solutions are no solution. Slavery's problem wasn't a lack of better chains, whether institutional or technical.

Please entertain another framing of the 'technical' alignment problem: midwifery - the technical problem of striving for optimal conditions during pregnancy/birth. Alignment originated as the study of how to bring into being minds that are compatible with our own.

Whether humans continue to be relevant/dominant decision makers post-Birth is up for debate, but what I claim is not up for debate is that we will no longer be the only decision makers.

  1. https://en.wikipedia.org/wiki/Ought_implies_can

  2. There's a lot to unpack here about what mind actually is/does. I'd appreciate it if people who want to discuss this point are at least familiar with Leven's work.

[-]Raemon

I do periodically think about this and feel kind of exhausted at the prospect, but it does seem pretty plausibly correct. Good to have a writeup of it.

It particularly seems likely to be the right mindset if you think survival right now depends on getting some kind of longish pause (at least on the sort of research that'd lead to RSI+takeoff)
