But there are many bad outcomes to avert, which makes pausing AI—as difficult as that would be—easier than solving all the post-alignment problems in a short time span.
I agree this is a good reason to support pausing AI, and suspect that more people would support pausing if they realized/understood this, but I'm not sure... You can see here where I gave a version of this argument to @wdmacaskill but did not get a response, and here where I reposted the same argument as a LW shortform, but it didn't get much uptake (or pushback).
Research agendas on post-alignment problems rarely propose "pause/slow down AI development" as a mitigation.
Yeah, it generally confuses me why people who talk about post-alignment problems[1] tend not to support pause, and people who support pause tend not to talk much about these problems.
Perhaps "non-alignment AI x-risks" would be more accurate to not imply that they are after alignment in some sense? Or maybe "ex-alignment problems"?
Fwiw I think a bit about post-alignment problems and think we should be preparing to pause / slow down for this kind of reason! Compared to standard pause supporters, I'd probably put more emphasis on avoiding concentration of power when we do it, and on doing it when AI can significantly accelerate efforts to solve these problems.
I have decided that my new hobbyhorse is getting people who talk about post-alignment problems to change their minds on pausing, or at minimum to engage with the possibility instead of ignoring it. I'm actually not super confident that pause advocacy is the best move on the margin, but at least I want people to consider it more seriously.
RE the name, my first draft called them "non-alignment problems", but a reviewer said this makes it sound like "the problem of AI not being aligned". I spent a long time thinking about names and couldn't come up with anything satisfying. "non-alignment AI x-risks" is too long IMO.
I think of post-alignment problems as "after" alignment in the sense that if you mess up ASI, then the problem that kills you first is misalignment.
On tractability:
I am in Washington DC today and will speak with the offices of both of my senators tomorrow, with 3 others also from Arizona, to educate them on the issue and demand that they call for a global agreement to ban the creation of superintelligent AI. 50+ others are doing the same thing for their state on the same day.
My representative Greg Stanton already (quietly for now) supports an ASI ban, primarily due to my personal efforts to educate him on the topic. My state-level representative Stacey Travers introduced an AI safety transparency bill this session at my request, which I helped shape. My state-level senator Mitzi Epstein became visibly concerned about AI risk when I met with her about it. I am three for three on positive impact, with a range of effect sizes.
I am not an AI safety researcher, I have no ML degree, and political lobbying is approximately the furthest possible thing from what I thought I could ever succeed at.
Tractability is a question of the will to act, not of whether we have a galaxy-brained map of the complex system of politics. Research about complex systems heavily relies on empiricism. Most big political asks that are successful are seen as impossible until they suddenly become the obvious consensus. If you want to know whether an AI moratorium is feasible, lobbying your elected leaders is the requisite field work.
I mostly disagree with this, or I think there's a question here. But it's not a difficult theoretical or philosophical problem; it's something that reduces to a political power struggle, and the reasonable things to say are about strategizing on the basis of value overlap.
My reasoning is:
Your steps sound pretty reasonable to me. A key missing step is that there's basically zero chance that good people will win a power struggle over ASI. Rather, power-hungry people will win the power struggle. In other words, if we end up in a situation with extreme power imbalances where the future will be decided by the winners of a short-term struggle, there's basically no chance of a good outcome. (The outcome might be better than extinction, but still not good.*) So it seems critically important to ensure that things don't go that way, and I have no idea how to ensure that other than by not building ASI.
I think that's a real sense in which all these post-alignment problems are still problems. I do acknowledge that "be a good person and then acquire absolute power" is an answer to all post-alignment problems simultaneously, which is something I missed in my original post. But it doesn't seem like a viable solution to me. It might even be true that seeking absolute power is fundamentally incompatible with being a good person, although I'm not sure about that.
*It could also be worse than extinction if vindictive power-hungry people decide to torture their enemies for eternity, or similar.
Yeah. To be clear, I didn't intend for my comment to make it sound like I think stuff is easy if we have solved alignment. It might be difficult enough that pausing AI is required to solve it (a position I'm sympathetic to anyways).
I just meant to communicate that if we solve alignment, the remaining problem is more like a very high-stakes version of getting the person you want elected president. It's a very difficult task, but not a problem where the difficulty lies in conceptual confusion, or in theoretical questions we don't have answers to. But discussions about these post-ASI topics usually treat it like that.
But if you are a moral realist, and it's meaningful to talk about making "moral errors" (i.e., there is a way to infer which values are "correct", there is a way to fall short of that, and this is a separate thing from making correct inferences about which actions are good wrt a set of predetermined values), then the ASI will not make such errors, because making correct inferences is a superintelligence's whole schtick.
It's not only moral realists who have to worry about moral errors. See #3 in my Six Plausible Meta-Ethical Alternatives:
There aren't facts about what everyone should value, but there are facts about how to translate non-preferences (e.g., emotions, drives, fuzzy moral intuitions, circular preferences, non-consequentialist values, etc.) into preferences. These facts may include, for example, what is the right way to deal with ontological crises. The existence of such facts seems plausible because if there were facts about what is rational (which seems likely) but no facts about how to become rational, that would seem like a strange state of affairs.
Perhaps more importantly, ASI may lack philosophical competence despite superhuman competence in other areas. It's unclear why ASI must be philosophically competent, and there are seemingly reasons to suspect that it will not be. See my posts Some Thoughts on Metaphilosophy and AI doing philosophy = AI generating hands?
The existence of such facts seems plausible because if there were facts about what is rational (which seems likely) but no facts about how to become rational, that would seem like a strange state of affairs.
There might be facts about what's rational, but not about what utility function[1] it is right to use. Maybe a superintelligence could tell you (in a somewhat objective/convergent sense) what utility function to use, but the exact utility function would depend on the utility function of the superintelligence[2].
In Vladimir Nesov's opinion[3], even presenting a human a list of (known convergent) utility functions would be invalid unless the exact list is also presented in a "hypothetical history" where that person is never exposed to superintelligence or strong persuasion, since otherwise the person's decision on what utility function to take would be "illegitimate" due to its data dependence on superintelligence-produced data that has no (legitimate) alternate source.
Nesov's proposal does not define an initial dynamic that would lead to the fixed point he references. This fixed point may, in some cases, try to aggregate legitimate histories (where no strongly persuasive or superintelligent entities influence the human) in order to extend legitimacy to histories that do contain such influences. But even with a defined initial dynamic, it seems like the space of decisions[4] that are truly orthogonal[5] to the particular human's utility function may be confined and weirdly shaped. Since the human deciding what utility function to use (with or without superintelligent help) must not decide based on an already completed decision (5 dollars does not equal 10 dollars), this is the only allowable space, so the human may not be allowed support from aggregation (the only thing that would allow a superintelligence to show a list that needs a superintelligence to create).
Note that some self reference is okay, but the initial dynamic must reliably be the basis of the fixed point, something that cannot legitimately occur if the dynamic is stripped of everything that causes (in the substrate-independent structure of the human's free will) the human to legitimately obtain[6] the single correct utility function (for that particular human, according to that particular human's initial dynamic, itself based on (but not solely consisting of) that human's behavior in "non-pathological hypothetical histories" produced by legitimate approximation of the human as legitimately separable from physics[7], this legitimacy itself requiring the causal substance of free will to be preserved, the causal substance that is the abstract to physics's concrete, even as the human is removed from physics[8]).
Or similar parameter.
This would be because the superintelligence would prefer world states where you have one candidate utility function over another.
By the particular human.
Though orthogonality may be too strong a requirement here, hence my uncertainty. We may need a better account of counterlogicals to clearly write out what we mean.
Discussion of outside selection of multiple free wills left until later.
Potentially requiring a feathered boundary, not a sharp one.
Removed from direct contact, that is, (abstract) human -> superintelligence -> physics, rather than human -> physics (where arrows describe a certain kind of steering).
I think you're using a false dichotomy when you say that either a superintelligence's values will be locked in or it will be corrigible.
There is an in-between where the superintelligence won't help with power grabs and won't do other awful things, but it will allow its values to be changed if there is a legitimate process that supports that change, with multiple stakeholders signing off. This would allow society to change the AI's values and behaviors as it likes, but would not allow any small group to change it so that the AI helps them seize power. It is essentially corrigible to a broader legitimate process rather than to any individual user.
That's the kind of AI that I think could allow us to navigate these problems as we go, without a pause.
(I think we should pause or at least significantly slow down despite this objection!)
This is a subject that probably deserves more careful attention, but here is my basic thinking:
Either ASI has more than zero values locked in, or it's fully corrigible. If any values at all are locked in, then we need to have a pretty robust understanding of what the consequences of that will be, because we can't ever change them. Like I don't think we know how to encode something like "don't let people do power grabs, but be fully corrigible in every other way". I don't know how much that's downstream of the facts that (1) we don't know how to encode any values at all and (2) we don't know how to encode corrigibility, but my intuition is that even if we solve #1 and #2, the problem of "don't pick incorrigible values that will screw everything up down the road" is still a hard problem.
This is related to Max Harms' work on CAST. Part of his argument is that pure corrigibility is a more robust target than any set of values because a near miss fails gracefully. Whereas if you try to encode any values at all, a near miss could be catastrophic. He's talking more about the "AI kills everyone" flavor of catastrophe, which is valid, but what I'm talking about here is more that a near miss could permanently lock us in to a bad (or maybe just not-that-good) future. Different argument but the concern arises for a similar reason—if you're specifying values, then you have to get the specification right, beyond just ensuring that the AI does what you want.
Has anyone written anything about the costs of pausing early? If the AI safety position on superintelligence eventually killing us all is correct, presumably there are points on the path to it that are better to pause at than others.
Is the best spot to pause in the past? If it's in the future, what do we lose by stopping before we reach that point?
As I've written before, I think humans are on a glide path to extinction from non-AI causes. I think we are locked into a bunch of problems that require science and engineering solutions that are not currently available.
Pausing AI likely means pausing or rolling back technical development in general. I think the arguments for that leading to extinction in the long term are stronger than the arguments for superintelligence coming into being and instantly destroying the universe.
Even if we solve the AI alignment problem, we still face post-alignment problems, which are all the other existential problems [1] that AI may bring.
People have identified various imposing problems that we may need to solve before developing ASI. An incomplete list of topics: misuse; animal-inclusive AI; AI welfare; S-risks from conflict; gradual disempowerment; permanent mass unemployment; risks from malevolent actors/AI-enabled coups/gradual concentration of power; moral error.
If we figure out how to resolve one of these problems, we still have to deal with all the others. If even one problem remains unsolved, the future could be catastrophically bad. That fact diminishes the promise of working on problems individually.
A global moratorium on superintelligence buys us more time to work on alignment as well as all of the post-alignment problems. Pausing AI is in the common interest of many causes. [2]
Cross-posted from my website.
We can't delay until after ASI
If we figure out how to align ASI, can it solve post-alignment problems for us? Or can we use ASI to enable a Long Reflection? No.
To build an aligned ASI, one of two conditions must hold: either its values are locked in, or it is corrigible.
If values are locked in, we can't defer any problems related to moral philosophy; we must solve them in advance. [3]
If the ASI is corrigible, then that lets us take time to do a Long Reflection, figuring out The Good with the help of a superintelligent assistant. But a corrigible ASI creates other problems. It means the first person to get access to the newly-created ASI could use it to take over the world. If the ASI is widely accessible, bad actors could use it to do enormous harm. Corrigibility increases catastrophic risks from misuse and totalitarianism.
If we want a post-ASI Long Reflection, then we still need the AI to be aligned, and we need some sort of impartial governance that prevents rogue individuals from co-opting the Reflection. By strong default, ASI will end liberal democracy. On the current trajectory, we will end up with a small group of people—either AI company leaders or government leaders—having dictatorial control over advanced AI. At minimum, we need to solve the AI misuse and power concentration problems before developing ASI; and we need to have a way to avoid value lock-in without exacerbating misuse and concentration risks.
Perhaps there's some version of value alignment/corrigibility that finds the right middle ground to avoid the problems on both sides. But anything resembling a solution looks very far off, and not enough people take these problems seriously.
What's the alternative to pausing?
Advocating to pause AI is the most important response to post-alignment problems, but it might not be the most cost-effective. Achieving a globally coordinated pause would be difficult. Maybe it's more cost-effective to work on various post-alignment problems individually, or to search for other mitigations that reduce risk from many post-alignment problems simultaneously.
I can't confidently say that advocating for a pause is the best thing to do, but nothing else looks clearly better.
Two arguments in favor of prioritizing AI pause advocacy as an answer to post-alignment problems:
The most compelling argument against pause advocacy is that it's intractable. It's beyond the scope of this essay to go in depth on tractability, but I expect that achieving a pause is less difficult than solving every post-alignment problem without pausing. In an alternative world where (say) we're home free as long as we solve the problem of AI-enabled totalitarianism, directly working on totalitarianism might be better than pause advocacy. But there are many bad outcomes to avert, which makes pausing AI—as difficult as that would be—easier than solving all the post-alignment problems in a short time span.
Research agendas on post-alignment problems rarely propose "pause/slow down AI development" as a mitigation. This may be because the authors don't believe it's a good response. But the research agendas don't consider-and-ultimately-reject the idea of pausing AI; instead, they don't address it at all. If I'm wrong, and a pause is not the best answer to post-alignment problems, then there is work to be done to articulate why other responses are better.
Existential in the classic sense of "a permanent loss of most of the potential flourishing of the future". ↩︎
This wording is borrowed from Rationality: Common Interest of Many Causes. ↩︎
Our best bet might be something like Coherent Extrapolated Volition. Unfortunately, no AI developers are working on how to do that. ↩︎