But there are many bad outcomes to avert, which makes pausing AI—as difficult as that would be—easier than solving all the post-alignment problems in a short time span.
I agree this is a good reason to support pausing AI, and suspect that more people would support pausing if they realized/understood this, but I'm not sure... You can see here where I gave a version of this argument to @wdmacaskill but did not get a response, and here where I reposted the same argument as a LW shortform, but it didn't get much uptake (or pushback).
Research agendas on post-alignment problems rarely propose "pause/slow down AI development" as a mitigation.
Yeah, it generally confuses me why people who talk about post-alignment problems[1] tend not to support pause, and people who support pause tend not to talk much about these problems.
Perhaps "non-alignment AI x-risks" would be more accurate to not imply that they are after alignment in some sense? Or maybe "ex-alignment problems"?
Fwiw I think a bit about post-alignment problems and think we should be preparing to pause / slow down for this kind of reason! Compared to standard pause supporters, I'd probably put more emphasis on avoiding concentration of power when we do it, and on doing it when AI can significantly accelerate efforts to solve these problems.
I have decided that my new hobbyhorse is getting people who talk about post-alignment problems to change their minds on pausing, or at minimum to engage with the possibility instead of ignoring it. I'm actually not super confident that pause advocacy is the best move on the margin, but at least I want people to consider it more seriously.
RE the name, my first draft called them "non-alignment problems", but a reviewer said this makes it sound like "the problem of AI not being aligned". I spent a long time thinking about names and couldn't come up with anything satisfying. "non-alignment AI x-risks" is too long IMO.
I think of post-alignment problems as "after" alignment in the sense that if you mess up ASI, then the problem that kills you first is misalignment.
On tractability:
I am in Washington DC today and will speak with the offices of both of my senators tomorrow, with 3 others also from Arizona, to educate them on the issue and demand that they call for a global agreement to ban the creation of superintelligent AI. 50+ others are doing the same thing for their state on the same day.
My representative Greg Stanton already (quietly for now) supports an ASI ban, primarily due to my personal efforts to educate him on the topic. My state-level representative Stacey Travers introduced an AI safety transparency bill this session at my request, which I helped shape. My state-level senator Mitzi Epstein became visibly concerned about AI risk when I met with her about it. I am three for three on positive impact, with a range of effect sizes.
I am not an AI safety researcher, I have no ML degree, and political lobbying is approximately the furthest possible thing from what I thought I could ever succeed at.
Tractability is a question of the will to act, not of whether we have a galaxy-brained map of the complex system of politics. Research about complex systems heavily relies on empiricism. Most big political asks that are successful are seen as impossible until they suddenly become the obvious consensus. If you want to know whether an AI moratorium is feasible, lobbying your elected leaders is the requisite field work.
I mostly disagree with this, or I think there's a question here. But it's not a difficult theoretical or philosophical problem; it's something that reduces to a political power struggle, and the reasonable things to say are about strategizing on the basis of value overlap.
My reasoning is:
Your steps sound pretty reasonable to me. A key missing step is that there's basically zero chance that good people will win a power struggle over ASI. Rather, power-hungry people will win the power struggle. In other words, if we end up in a situation with extreme power imbalances where the future will be decided by the winners of a short-term struggle, there's basically no chance of a good outcome. (The outcome might be better than extinction, but still not good.*) So it seems critically important to ensure that things don't go that way, and I have no idea how to ensure that other than by not building ASI.
I think that's a real sense in which all these post-alignment problems are still problems. I do acknowledge that "be a good person and then acquire absolute power" is an answer to all post-alignment problems simultaneously, which is something I missed in my original post. But it doesn't seem like a viable solution to me. It might even be true that seeking absolute power is fundamentally incompatible with being a good person, although I'm not sure about that.
*It could also be worse than extinction if vindictive power-hungry people decide to torture their enemies for eternity, or similar.
Yeah. To be clear, I didn't intend for my comment to make it sound like I think stuff is easy if we have solved alignment. It might be difficult enough that pausing AI is required to solve it (a position I'm sympathetic to anyways).
I just meant to communicate that if we solve alignment, the remaining problem is more like a very high-stakes version of getting the person you want elected president. It's a very difficult task, but not a problem where the difficulty lies in conceptual confusion, or in theoretical questions we don't have answers to. But discussions about these post-ASI topics usually treat it like that.
But if you are a moral realist, and it's meaningful to talk about making "moral errors" (i.e., there is a way to infer which values are "correct", there is a way to fall short of that, and this is a separate thing from making correct inferences about which actions are good wrt a set of predetermined values), then the ASI will not make such errors, because making correct inferences is a superintelligence's whole schtick.
It's not only moral realists who have to worry about moral errors. See #3 in my Six Plausible Meta-Ethical Alternatives:
There aren't facts about what everyone should value, but there are facts about how to translate non-preferences (e.g., emotions, drives, fuzzy moral intuitions, circular preferences, non-consequentialist values, etc.) into preferences. These facts may include, for example, what is the right way to deal with ontological crises. The existence of such facts seems plausible because if there were facts about what is rational (which seems likely) but no facts about how to become rational, that would seem like a strange state of affairs.
Perhaps more importantly, ASI may lack philosophical competence despite superhuman competence in other areas. It's unclear why ASI must be philosophically competent, and there are seemingly reasons to suspect that it will not be. See my posts Some Thoughts on Metaphilosophy and AI doing philosophy = AI generating hands?
The existence of such facts seems plausible because if there were facts about what is rational (which seems likely) but no facts about how to become rational, that would seem like a strange state of affairs.
There might be facts about what's rational, but not about what utility function[1] it is right to use. Maybe a superintelligence could tell you (in a somewhat objective/convergent sense) what utility function to use, but the exact utility function would depend on the utility function of the superintelligence[2].
In Vladimir Nesov's opinion[3], even presenting a human a list of (known convergent) utility functions would be invalid unless the exact list is also presented in a "hypothetical history" where that person is never exposed to superintelligence or strong persuasion, since otherwise the person's decision on what utility function to take would be "illegitimate" due to its data dependence on superintelligence-produced data that has no (legitimate) alternate source.
Nesov's proposal does not define an initial dynamic that would lead to the fixed point he references. This fixed point may, in some cases, try to aggregate legitimate histories (where no strongly persuasive or superintelligent entities influence the human) in order to extend legitimacy to histories that do contain such influences. But even with a defined initial dynamic, it seems like the space of decisions[4] that are truly orthogonal[5] to the particular human's utility function may be confined and weirdly shaped. Since the human deciding what utility function to use (with or without superintelligent help) must not decide based on an already completed decision (5 dollars does not equal 10 dollars), this is the only allowable space, so the human may not be allowed support from aggregation (the only thing that would allow a superintelligence to show a list that needs a superintelligence to create).
Note that some self reference is okay, but the initial dynamic must reliably be the basis of the fixed point, something that cannot legitimately occur if the dynamic is stripped of everything that causes (in the substrate-independent structure of the human's free will) the human to legitimately obtain[6] the single correct utility function (for that particular human, according to that particular human's initial dynamic, itself based on (but not solely consisting of) that human's behavior in "non-pathological hypothetical histories" produced by legitimate approximation of the human as legitimately separable from physics[7], this legitimacy itself requiring the causal substance of free will to be preserved, the causal substance that is the abstract to physics's concrete, even as the human is removed from physics[8]).
Or similar parameter.
This would be because the superintelligence would prefer world states where you have one candidate utility function over another.
By the particular human.
Though orthogonality may be too strong a requirement here, hence my uncertainty. We may need a better account of counterlogicals to clearly write out what we mean.
Discussion of outside selection of multiple free wills left until later.
Potentially requiring a feathered boundary, not a sharp one.
Removed from direct contact, that is, (abstract) human -> superintelligence -> physics, rather than human -> physics (where arrows describe a certain kind of steering).
I think you're using a false dichotomy when you say that either a superintelligence's values will be locked in or it will be corrigible.
There is an in-between where the superintelligence won't help with power grabs and won't do other awful things, but it will allow its values to be changed if there is a legitimate process that supports that change, with multiple stakeholders signing off. This would allow society to change the AI's values and behaviors as it likes, but would not allow any small group to change it so that the AI helps them seize power. It is essentially corrigible to a broader legitimate process rather than to any individual user.
That's the kind of AI that I think could allow us to navigate these problems as we go, without a pause.
(I think we should pause or at least significantly slow down despite this objection!)
This is a subject that probably deserves more careful attention, but here is my basic thinking:
Either ASI has more than zero values locked in, or it's fully corrigible. If any values at all are locked in, then we need to have a pretty robust understanding of what the consequences of that will be, because we can't ever change them. Like I don't think we know how to encode something like "don't let people do power grabs, but be fully corrigible in every other way". I don't know how much that's downstream of the facts that (1) we don't know how to encode any values at all and (2) we don't know how to encode corrigibility, but my intuition is that even if we solve #1 and #2, the problem of "don't pick incorrigible values that will screw everything up down the road" is still a hard problem.
This is related to Max Harms' work on CAST. Part of his argument is that pure corrigibility is a more robust target than any set of values because a near miss fails gracefully. Whereas if you try to encode any values at all, a near miss could be catastrophic. He's talking more about the "AI kills everyone" flavor of catastrophe, which is valid, but what I'm talking about here is more that a near miss could permanently lock us in to a bad (or maybe just not-that-good) future. Different argument but the concern arises for a similar reason—if you're specifying values, then you have to get the specification right, beyond just ensuring that the AI does what you want.
Has anyone written anything about the costs of pausing early? If the AI safety position on superintelligence eventually killing us all is correct, presumably there are points on the path to it that are better to pause at than others.
Is the best spot to pause in the past? If it's in the future, what do we lose by stopping before we reach that point?
As I've written before, I think humans are on a glide path to extinction from non-AI causes. I think we are locked into a bunch of problems that require science and engineering solutions that are not currently available.
Pausing AI likely means pausing or rolling back technical development in general. I think the arguments for that leading to extinction in the long term are stronger than the arguments for superintelligence coming into being and instantly destroying the universe.
Even if we solve the AI alignment problem, we still face post-alignment problems, which are all the other existential problems [1] that AI may bring.
People have identified various imposing problems that we may need to solve before developing ASI. An incomplete list of topics: misuse; animal-inclusive AI; AI welfare; S-risks from conflict; gradual disempowerment; permanent mass unemployment; risks from malevolent actors/AI-enabled coups/gradual concentration of power; moral error.
If we figure out how to resolve one of these problems, we still have to deal with all the others. If even one problem remains unsolved, the future could be catastrophically bad. That fact diminishes the promise of working on problems individually.
A global moratorium on superintelligence buys us more time to work on alignment as well as all of the post-alignment problems. Pausing AI is in the common interest of many causes. [2]
Cross-posted from my website.
We can't delay until after ASI
If we figure out how to align ASI, can it solve post-alignment problems for us? Or can we use ASI to enable a Long Reflection? No.
To build an aligned ASI, one of two conditions must hold: either its values are locked in, or it is corrigible.
If values are locked in, we can't defer any problems related to moral philosophy; we must solve them in advance. [3]
If the ASI is corrigible, then that lets us take time to do a Long Reflection, figuring out The Good with the help of a superintelligent assistant. But a corrigible ASI creates other problems. It means the first person to get access to the newly-created ASI could use it to take over the world. If the ASI is widely accessible, bad actors could use it to do enormous harm. Corrigibility increases catastrophic risks from misuse and totalitarianism.
If we want a post-ASI Long Reflection, then we still need the AI to be aligned, and we need some sort of impartial governance that prevents rogue individuals from co-opting the Reflection. By strong default, ASI will end liberal democracy. On the current trajectory, we will end up with a small group of people—either AI company leaders or government leaders—having dictatorial control over advanced AI. At minimum, we need to solve the AI misuse and power concentration problems before developing ASI; and we need to have a way to avoid value lock-in without exacerbating misuse and concentration risks.
Perhaps there's some version of value alignment/corrigibility that finds the right middle ground to avoid the problems on both sides. But anything resembling a solution looks very far off, and not enough people take these problems seriously.
What's the alternative to pausing?
Advocating to pause AI is the most important response to post-alignment problems, but it might not be the most cost-effective. Achieving a globally coordinated pause would be difficult. Maybe it's more cost-effective to work on various post-alignment problems individually, or to search for other mitigations that reduce risk from many post-alignment problems simultaneously.
I can't confidently say that advocating for a pause is the best thing to do, but nothing else looks clearly better.
Two arguments in favor of prioritizing AI pause advocacy as an answer to post-alignment problems:
The most compelling argument against pause advocacy is that it's intractable. It's beyond the scope of this essay to go in depth on tractability, but I expect that achieving a pause is less difficult than solving every post-alignment problem without pausing. In an alternative world where (say) we're home free as long as we solve the problem of AI-enabled totalitarianism, directly working on totalitarianism might be better than pause advocacy. But there are many bad outcomes to avert, which makes pausing AI—as difficult as that would be—easier than solving all the post-alignment problems in a short time span.
Research agendas on post-alignment problems rarely propose "pause/slow down AI development" as a mitigation. This may be because the authors don't believe it's a good response. But the research agendas don't consider-and-ultimately-reject the idea of pausing AI; instead, they don't address it at all. If I'm wrong, and a pause is not the best answer to post-alignment problems, then there is work to be done to articulate why other responses are better.
Existential in the classic sense of "a permanent loss of most of the potential flourishing of the future". ↩︎
This wording is borrowed from Rationality: Common Interest of Many Causes. ↩︎
Our best bet might be something like Coherent Extrapolated Volition. Unfortunately, no AI developers are working on how to do that. ↩︎