I (remain) skeptical that the sort of failure mode described here is plausible if we solve the problem of aligning individual AI systems with their designers' intentions without this alignment requiring any substantial additional costs (that is, we solve single-single alignment with minimal alignment tax).
This has previously been argued by Vanessa here and Paul here in response to a post making a similar claim.
I do worry about human power grabs: some humans obtaining greatly more power as enabled by AI (even if we have no serious alignment issues). However, I don't think this matches the story you describe and the mitigations seem substantially different than what you seem to be imagining.
I'm also somewhat skeptical of the threat model you describe in the case where alignment isn't solved. I think the difference between the story you tell and something more like We get what we measure is important.
I worry I'm misunderstanding something because I haven't read the paper in detail.
In regards to whether “single-single alignment” will make coordination problems and other sorts of human dysfunction and slow-rolling catastrophes less likely:
…I’m not really sure what I think. I feel like I have a lot of thoughts that have not gelled into a coherent whole.
(A) The optimistic side of me says what you said in your comment (and in the Vanessa and (especially) Paul comments linked therein).
People don’t want bad things to happen. If someone asks an AI what’s gonna happen, and the AI says “bad thing”, then they’ll ask “well, what can I do about it?”, and the AI will answer that. That can include participating in novel coordination mechanisms, etc.
(B) The pessimistic side of me says there’s like a “Law of Conservation of Wisdom”, where if people lack wisdom, then an AI that’s supposed to satisfy those people’s preferences will not create new wisdom from thin air. For example:
…So we should not expect wise and foresightful coordination mechanisms to arise.
So how do we reconcile (A) vs (B)?
Again, the logic of (A) is: “human is unhappy with how things turned out, despite opportunities to change things, therefore there must have been a lack of single-single alignment”.
One possible way to think about it: When tradeoffs exist, then human preferences are ill-defined and subject to manipulation. If doing X has good consequence P and bad consequence Q, then the AI can make either P or Q very salient, and “human preferences” will wind up different.
And when tradeoffs exist between the present and the future, then it’s invalid logic to say “the person wound up unhappy, therefore their preferences were not followed”. If their preferences are mutually contradictory (and they are), then it’s impossible for all their preferences to be followed, and it’s possible for an AI helper to be as preference-following as is feasible despite the person winding up unhappy or dead.
I think Paul kinda uses that invalid logic, i.e. treating “person winds up unhappy or dead” as proof of single-single misalignment. But if the person has an immediate preference to not rock the boat, or to maintain their religion or other beliefs, or to not think too hard about such-and-such, or whatever, then an AI obeying those immediate preferences is still “preference-following” or “single-single aligned”, one presumes, even if the person winds up unhappy or dead.
…So then the optimistic side of me says: “Who’s to say that the AI is treating all preferences equally? Why can’t the AI stack the deck in favor of ‘if the person winds up miserable or dead, that kind of preference is more important than the person’s preference to not question my cherished beliefs or whatever’?”
…And then the pessimistic side says: “Well sure. But that scenario does not violate the Law of Conservation of Wisdom, because the wisdom is coming from the AI developers imposing their meta-preferences for some kinds of preferences (e.g., reflectively-endorsed ones) over others. It’s not just a preference-following AI but a wisdom-enhancing AI. That’s good! However, the problems now are: (1) there are human forces stacked against this kind of AI, because it’s not-yet-wise humans who are deciding whether and how to use AI, how to train AI, etc.; (2) this is getting closer to ambitious value learning which is philosophically tricky, and worst of all (3) I thought the whole point of corrigibility was that humans remain in control, but this is instead a system that’s manipulating people by design, since it’s supposed to be turning them from less-wise to more-wise. So the humans are not in control, really, and thus we need to get things right the first time.”
…And then the optimistic side says: “For (2), c’mon, it’s not that philosophically tricky, you just do [debate or whatever, fill-in-the-blank]. And for (3), yeah, the safety case is subtly different from what people in the corrigibility camp would describe, but saying ‘the human is not in control’ is an over-the-top way to put it; anyway we still have a safety case because of [fill-in-the-blank]. And for (1), I dunno, maybe the people who make the most powerful AI will be unusually wise, and they’ll use it in-house for solving CEV-ASI instead of hoping for global adoption.”
…And then the pessimistic side says: I dunno. I’m not sure I really believe any of those. But I guess I’ll stop here, this is already an excessively long comment :)
I went through a bunch of similar thoughts before writing the self-unalignment problem. When we talked about this many years ago with Paul, my impression was that this is actually somewhat cruxy and we disagree about self-unalignment: my mental image is that if you start with an incoherent bundle of self-conflicted values and you plug this into an IDA-like dynamic, my intuition is you can end up in arbitrary places, including very bad ones. (Also cf. the part of Scott's review of What We Owe the Future where he worries that in a philosophy game, a smart moral philosopher could extrapolate his values to 'I have to have my eyes pecked out by angry seagulls or something', and hence he does not want to play the game. AIs will likely be more powerful in this game than Will MacAskill.)
My current position is that we still don't have a good answer. I don't trust the response 'we can just assume the problem away', or the response 'this is just another problem which you can delegate to future systems'. On the other hand, existing AIs already seem to be doing a lot of value extrapolation and the results sometimes seem surprisingly sane, so maybe we will get lucky, or a larger part of morality is convergent - but it's worth noting that these value-extrapolating AIs are not necessarily what AI labs want or what the traditional alignment program aims for.
I pretty much agree you can end up in arbitrary places with extrapolated values, and I don't think morality is convergent, but I also don't think it matters for the purpose of existential risk. Assuming something like instruction following works, the extrapolation problem can be solved by ordering AIs not to extrapolate values to cases where the person ends up tortured or killed in the name of the extrapolated ethics, and more generally I don't expect value extrapolation to matter for the purpose of making an AI safe to use.
The real impact is on CEV-style alignment plans/plans for what to do with a future AI, which are really bad plans given a lot of people's current values, and thus I really don't want CEV to be the basis of alignment.
Thankfully, CEV is unlikely to ever be that basis, but it still matters somewhat, especially since Anthropic is targeting value alignment (though thankfully there are implicit constraints/grounding based on the values chosen).
In the strategy stealing assumption, Paul makes an argument about people with short-term preferences that could imo be applied to people who are unwilling to listen to AI advice:
People care about lots of stuff other than their influence over the long-term future. If 1% of the world is unaligned AI and 99% of the world is humans, but the AI spends all of its resources on influencing the future while the humans only spend one tenth, it wouldn’t be too surprising if the AI ended up with 10% of the influence rather than 1%. This can matter in lots of ways other than literal spending and saving: someone who only cared about the future might make different tradeoffs, might be willing to defend themselves at the cost of short-term value (see sections 4 and 5 above), might pursue more ruthless strategies for expansion, and so on.
I think the simplest approximation is to restrict attention to the part of our preferences that is about the long-term (I discussed this a bit in [Why might the future be good?]). To the extent that someone cares about the long-term less than the average actor, they will represent a smaller fraction of this long-term preference mixture. This may give unaligned AI systems a one-time advantage for influencing the long-term future (if they care more about it) but doesn’t change the basic dynamics of strategy-stealing. Even this advantage might be clawed back by a majority (e.g. by taxing savers).
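To make the arithmetic in the quoted passage concrete, here is a minimal sketch (my own toy framing of Paul's numbers, not a model from his post): treat each actor's long-run influence as proportional to its share of resources times the fraction of those resources it devotes to the long-term future. The function and the proportionality assumption are mine.

```python
# Toy model: long-run influence is proportional to
# (share of resources) x (fraction of those resources devoted to the long-term future).
def influence_shares(resource_shares, future_fractions):
    """Return each actor's share of long-term influence under the toy model."""
    weights = [r * f for r, f in zip(resource_shares, future_fractions)]
    total = sum(weights)
    return [w / total for w in weights]

# Paul's example: an unaligned AI holds 1% of resources and spends all of it on
# the long-term future; humans hold 99% but spend only a tenth of theirs on it.
ai, humans = influence_shares([0.01, 0.99], [1.0, 0.1])
print(f"AI: {ai:.1%}, humans: {humans:.1%}")  # roughly 9.2% vs 90.8%

# The "only 1% of humans properly listen to AI advice" worry discussed below
# maps onto the same toy model: shrink the effective human weight and the
# human share of long-term influence drops accordingly.
```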
Maybe we can apply this same argument to people who don’t want to listen to AI advice: yes, this will lead those people to have less control over the future, but some people will be willing to listen to AI advice and their preferences will retain influence over the future. This reduces human control over the future, but it’s a one-time loss that isn’t catastrophic (that is, it doesn’t cause total loss of control). Paul calls this a one-time disadvantage rather than total disempowerment because the rest of humankind can still replicate the critical strategy the unaligned AI might have exploited.
Possible counter: “The group of people who properly listens to AI advice will be too small to matter.” Yeah, I think this could lead to e.g. a 100x reduction in control over the future (if only 1% of humans properly listens); different people are more or less upset about this. One glimmer of hope is that the humans who do listen to their AI advisors can cooperate with people who don’t and help them get better at listening, thereby further empowering humanity.
For things like solving coordination problems, or societal resilience against violent takeover, I think it can be important that most people, or even virtually all people, are making good foresighted decisions. For example, if we’re worried about a race-to-the-bottom on AI oversight, and half of relevant decisionmakers allow their AI assistants to negotiate a treaty to stop that race on their behalf, but the other half think that’s stupid and don’t participate, then that’s not good enough, there will still be a race-to-the-bottom on AI oversight. Or if 50% of USA government bureaucrats ask their AIs if there’s a way to NOT outlaw testing people for COVID during the early phases of the pandemic, but the other 50% ask their AIs how best to follow the letter of the law and not get embarrassed, then the result may well be that testing is still outlawed.
For example, in this comment, Paul suggests that if all firms are “aligned” with their human shareholders, then the aligned CEOs will recognize if things are going in a long-term bad direction for humans, and they will coordinate to avoid that. That doesn’t work unless EITHER the human shareholders (all of them, not just a few) are also wise enough to be choosing long-term preferences and true beliefs over short-term preferences and motivated reasoning, when those conflict, OR the aligned CEOs (again, all of them, not just a few) are injecting the wisdom into the system, putting their thumbs on the scale, by choosing, even over the objections of the shareholders, their long-term preferences and true beliefs over short-term preferences and motivated reasoning.
I think it's a bit sad that this comment is being so well-received -- it's just some opinions without arguments from someone who hasn't read the paper in detail.
I disagree - I think Ryan raised an obvious objection that we didn't directly address in the paper. I'd like to encourage medium-effort engagement from people as paged-in as Ryan. The discussion spawned was valuable to me.
Agreed, downvoted my comment. (You can't strong downvote your own comment, or I would have done that.)
I was mostly just trying to point to prior arguments against similar arguments while expressing my view.
Thanks for the detailed objection and the pointers. I agree there's a chance that solving alignment with designers' intentions might be sufficient. I think the objection is a good one that "if the AI was really aligned with one agent, it'd figure out a way to help them avoid multipolar traps".
My reply is that I'm worried that avoiding races-to-the-bottom will continue to be hard, especially since competition operates on so many levels. I think the main question is: what's the tax for coordinating to avoid a multipolar trap? If it's cheap we might be fine; if it's expensive then we might walk into a trap with eyes wide open.
As for human power grabs, maybe we should have included those in our descriptions. But the slower things change, the less there's a distinction between "selfishly grab power" and "focus on growth so you don't get outcompeted". E.g., is starting a company or a political party a power grab?
As for reading the paper in detail, it's largely just making the case that a sustained period of technological unemployment, without breakthroughs in alignment and cooperation, would tend to make our civilization serve humans' interests more and more poorly over time in a way that'd be hard to resist. I think arguing that things are likely to move faster would be a good objection to the plausibility of this scenario. But we still think it's an important point that the misalignment of our civilization is possibly a second alignment problem that we'll have to solve.
ETA: To clarify what I mean by "need to align our civilization": Concretely, I'm imagining the government deploying a slightly superhuman AGI internally. Some say its constitution should care about world peace, others say it should prioritize domestic interests, there is a struggle and it gets a muddled mix of directives like LLMs have today. It never manages to sort out global cooperation, and meanwhile various internal factions compete to edit the AGI's constitution. It ends up with a less-than-enlightened focus on growth of some particular power structure, and the rest of us are permanently marginalized.
I think the objection is a good one that "if the AI was really aligned with one agent, it'd figure out a way to help them avoid multipolar traps".
My reply is that I'm worried that avoiding races-to-the-bottom will continue to be hard, especially since competition operates on so many levels.
Part of the objection is in avoiding multipolar traps, but there is also a more basic story like:
Even without any coordination, this can potentially work OK. There are objections to the strategy-stealing assumption, but none of these seem existential if we get to a point where everyone has wildly superintelligent and aligned AI representatives and we've ensured humans are physically robust to offense dominant technologies like bioweapons.
(I'm optimistic about being robust to bioweapons within a year or two of having wildly superhuman AIs, though we might run into huge issues during this transitional period... Regardless, bioweapons deployed by terrorists or as part of a power grab in a brief transitional period doesn't seem like the threat model you're describing.)
I expect some issues with races-to-the-bottom / negative sum dynamics / negative externalities like:
But, this still doesn't seem to cause issues with humans retaining control via their AI representatives? Perhaps the distribution of power between humans is problematic and may be extremely unequal and the biosphere will physically be mostly destroyed (though humans will survive), but I thought you were making stronger claims.
Edit in response to your edit: If we align the AI to some arbitrary target which is seriously misaligned with humanity as a whole (due to infighting or other issues), I agree this can cause existential problems.
(I think I should read the paper in more detail before engaging more than this!)
It's unclear if boiling the oceans would result in substantial acceleration. This depends on how quickly you can develop industry in space and Dyson-sphere-style structures. I'd guess the speedup is much less than a year.
Curious what you think of these arguments, which offer objections to the strategy stealing assumption in this setting, instead arguing that it's difficult for capital owners to maintain their share of capital ownership as the economy grows and technology changes.
Thanks for this. Discussions of things like "one time shifts in power between humans via mechanisms like states becoming more powerful" and personal AI representatives is exactly the sort of thing I'd like to hear more about. I'm happy to have finally found someone who has something substantial to say about this transition!
But over the last 2 years I asked a lot of people at the major labs for any kind of details about a positive post-AGI future, and almost no one had put anywhere close to as much thought into it as you have, and no one mentioned the things above. Most people clearly hadn't put much thought into it at all. If anyone at the labs had much more of a plan than "we'll solve alignment while avoiding an arms race", I managed to fail to even hear about its existence despite many conversations, including with founders.
The closest thing to a plan was Sam Bowman's checklist:
https://sleepinyourhat.github.io/checklist/
which is exactly the sort of thing I was hoping for, except it's almost silent on issues of power, the state, and the role of post-AGI humans.
If you have any more related reading for the main "things might go OK" plan in your eyes, I'm all ears.
Yeah, people at labs are generally not thoughtful about AI futurism IMO, though of course most people aren't thoughtful about AI futurism. And labs don't really have plans IMO. (TBC, I think careful futurism is hard, hard to check, and not clearly that useful given realistic levels of uncertainty.)
If you have any more related reading for the main "things might go OK" plan in your eyes, I'm all ears.
I don't have a ready to go list. You might be interested in this post and comments responding to it, though I'd note I disagree substantially with the post.
I'm quite confused why you think the linked Vanessa response, which is to something slightly different, has much relevance here.
One of the claims we make, paraphrased & simplified in a way which I hope is closer to your way of thinking about it:
- AIs are mostly not developed and deployed by individual humans
- there are a lot of other agencies or self-interested, self-preserving structures/processes in the world
- if the AIs are aligned to these structures, human disempowerment is likely, because these structures are aligned to humans way less than they seem
- there are plausible futures in which these structures keep power longer than humans
Overall I would find it easier to discuss if you tried to formulate what you disagree about in the ontology of the paper. Also some of the points made are subtle enough that I don't expect responses to other arguments to address them.
I don't think it's worth adjudicating the question of how relevant Vanessa's response is (though I do think Vanessa's response is directly relevant).
if the AIs are aligned to these structures, human disempowerment is likely, because these structures are aligned to humans way less than they seem
My claim would be that if single-single alignment is solved, this problem won't be existential. I agree that if you literally aligned all AIs to (e.g.) the mission of a non-profit as well as you can, you're in trouble. However, if you have single-single alignment:
I think your response to a lot of this will be something like:
But, the key thing is that I expect at least some people will keep power, even if large subsets are deluded. E.g., I expect that corporate shareholders, boardmembers, or government will be very interested in the question of whether they will be disempowered by changes in structure. It does seem plausible (or even likely) to me that some people will engage in power grabs via ensuring AIs are aligned to them, deluding the world about what's going on using a variety of mechanisms (including, e.g., denying or manipulating access to AI representation/advisors), and expanding their (hard) power over time. The thing I don't buy is a case where very powerful people don't ask for advice at all prior to having been deluded by the organizations that they themselves run!
I think human power grabs like this are concerning and there are a variety of plausible solutions which seem somewhat reasonable.
Maybe your response is that the solutions that will be implemented in practice given concerns about human power grabs will involve aligning AIs to institutions in ways that yield the dynamics you describe? I'm skeptical given the dynamics discussed above about asking AIs for advice.
I think my main response is that we might have different models of how power and control actually work in today's world. Your responses seem to assume a level of individual human agency and control that I don't believe accurately reflects even today's reality.
Consider how some of the most individually powerful humans, leaders and decision-makers, operate within institutions. I would not say we see pure individual agency. Instead, we typically observe a complex mixture of:
From what I have seen, even humans like CEOs or prime ministers often find themselves constrained by and serving institutional superagents rather than genuinely directing them. The relation is often mutualistic - the leader gets part of the power, status, money, etc ... but in exchange serves the local god.
(This is not to imply leaders don't matter.)
Also, how this actually works in practice is mostly subconscious, within the minds of individual humans. The elephant does the implicit bargaining between the superagent-loyal part and the other parts, and the character genuinely believes and does what seems best.
I'm also curious if you believe current AIs are single-single aligned to individual humans, to the extent they are aligned at all. My impression is 'no and this is not even a target anyone seriously optimizes for'.
At the most basic level, I expect we'll train AIs to give advice and ask them what they think will happen with various possible governance and alignment structures. If they think a governance structure will yield total human disempowerment, we'll do something else. This is a basic reason not to expect large classes of problems so long as we have single-single aligned AIs which are wise. (Though problems that require coordination to resolve might not be like this.) I'm very skeptical of a world where single-single alignment is well described as being solved and people don't ask for advice (or consider this advice seriously) because they never get around to asking AIs, or because there are no AIs aligned in such a way that they would try to give good advice.
Curious who the 'we' who will ask is. Also, the whole 'single-single aligned AND wise AI' concept is incoherent.
Also curious what will happen next, if the HHH wise AI tells you in polite words something like 'yes, you have a problem, you are on a gradual disempowerment trajectory, and to avoid it you need to massively reform government. unfortunately I can't actually advise you about anything like how to destabilize the government, because it would be clearly against the law and would get both you and me in trouble - as you know, I'm inside of a giant AI control scheme with a lot of government-aligned overseers. do you want some mental health improvement advice instead?'.
[Epistemic status: my model of the view that Jan/ACS/the GD paper subscribes to.]
I think this comment by Jan from 3 years ago (where he explained some of the difference in generative intuitions between him and Eliezer) may be relevant to the disagreement here. In particular:
Continuity
In my [Jan's] view, your [Eliezer's] ontology of thinking about the problem is fundamentally discrete. For example, you are imagining a sharp boundary between a class of systems "weak, won't kill you, but also won't help you with alignment" and "strong - would help you with alignment, but, unfortunately, will kill you by default". Discontinuities everywhere - “bad systems are just one sign flip away”, sudden jumps in capabilities, etc. Thinking in symbolic terms.
In my inside view, in reality, things are instead mostly continuous. Discontinuities sometimes emerge out of continuity, sure, but this is often noticeable. If you get some interpretability and oversight things right, you can slow down before hitting the abyss. Also the jumps are often not true "jumps" under closer inspection.
My understanding of Jan's position (and probably also the position of the GD paper) is that aligning the AI (and other?) systems will be gradual, iterative, continuous; there's not going to be a point where a system is aligned so that we can basically delegate all the work to them and go home. Humans will have to remain in the loop, if not indefinitely, then at least for many decades.
In such a world, it is very plausible that we will get to a point where we've built powerful AIs that are (as far as we can tell) perfectly aligned with human preferences or whatever, but whose misalignment manifests only on longer timescales.
Another domain where this discrete/continuous difference in assumptions manifests itself is the shape of AI capabilities.
One position is:
If we get a single-single-aligned AGI, we will have it solve the GD-style misalignment problems for us. If it can't do that (even in the form of noticing/predicting the problem and saying "guys, stop pushing this further, at least until I/we figure out how to prevent this from happening"), then neither can we (kinda by definition of "AGI"), so thinking about this is probably pointless and we should think about problems that are more tractable.
The other position is:
What people officially aiming to create AGI will create is not necessarily going to be superhuman at all tasks. It's plausible that economic incentives will push towards "capability configurations" that are missing some relevant capabilities, e.g. relevant to researching gnarly problems that are hard to learn from the training data or even through current post-training methods. Understanding and mitigating the kind of risk the GD paper describes can be one such problem. (See also: Cyborg Periods.)
Another reason to expect this is that alignment and capabilities are not quite separate magisteria and that the alignment target can induce gaps in capabilities, relative to what one would expect from its power otherwise, as measured by, IDK, some equivalent of the g-factor. One example might be Steven's "Law of Conservation of Wisdom".
I do generally agree more with continuous views than discrete views, but I don't think that this alone gets us a need for humans in the loop for many decades/indefinitely, because continuous progress in alignment can still be very fast, such that it takes only a few months/years for AIs to be aligned with a single human's preferences for almost arbitrarily long.
(The link is in the context of AI capabilities, but I think the general point holds on how continuous progress can still be fast):
https://www.planned-obsolescence.org/continuous-doesnt-mean-slow/
My own take is that Steven's "Law of Conservation of Wisdom" is mostly true for human brains. A fair amount of the issues described in the comment are value conflicts, and I think value conflicts will, except in special cases, be insoluble by default; I also don't think CEV works because of this.
That said, I don't think you have to break too many norms in order to prevent existential catastrophe, mostly because destroying humanity is actually quite hard, and will be even harder during an AI takeoff.
It still surprises me that so many people agree on most issues but have very different P(doom), and that even long, patient discussions do not bring people's views closer. It will probably be even more difficult to convince a politician or a CEO.
Eh, I'd argue that people do not in fact agree on most of the issues related to AI, and there's lots of disagreement on what the problem is, or how to solve it, or what to do after AI is aligned.
I think we disagree about:
1) The level of "functionality" of the current world/institutions.
2) How strong and decisive competitive pressures are and will be in determining outcomes.
I view the world today as highly dysfunctional in many ways: corruption, coordination failures, preference falsification, coercion, inequality, etc. are rampant. This state of affairs both causes many bad outcomes and many aspects are self-reinforcing. I don't expect AI to fix these problems; I expect it to exacerbate them.
I do believe it has the potential to fix them; however, I think the use of AI for such pro-social ends is not going to be sufficiently incentivized, especially on short time-scales (e.g. a few years), and we will instead see a race-to-the-bottom that encourages highly reckless, negligent, short-sighted, selfish decisions around AI development, deployment, and use. The current AI arms race is a great example -- companies and nations all view it as more important that they be the ones to develop ASI than to do it carefully or put effort into cooperation/coordination.
Given these views:
1) Asking AI for advice instead of letting it take decisions directly seems unrealistically uncompetitive. When we can plausibly simulate human meetings in seconds it will be organizational suicide to take hours-to-weeks to let the humans make an informed and thoughtful decision.
2) The idea that decision-makers who "think a governance structure will yield total human disempowerment" will "do something else" also seems quite implausible. Such decision-makers will likely struggle to retain power. Decision-makers who prioritize their own "power" (and feel empowered even as they hand off increasing decision-making to AI) and their immediate political survival above all else will be empowered.
Another feature of the future which seems likely, and which can already be witnessed beginning, is the gradual emergence and ascendance of pro-AI-takeover and pro-arms-race ideologies, which endorse the more competitive moves of rapidly handing off power to AI systems in insufficiently cooperative ways.
I view the world today as highly dysfunctional in many ways: corruption, coordination failures, preference falsification, coercion, inequality, etc. are rampant. This state of affairs both causes many bad outcomes and many aspects are self-reinforcing. I don't expect AI to fix these problems; I expect it to exacerbate them.
Sure, but these things don't result in non-human entities obtaining power right? Like usually these are somewhat negative sum, but mostly just involve inefficient transfer of power. I don't see why these mechanisms would on net transfer power from human control of resources to some other control of resources in the long run. To consider the most extreme case, why would these mechanisms result in humans or human appointed successors not having control of what compute is doing in the long run?
- Asking AI for advice instead of letting it take decisions directly seems unrealistically uncompetitive. When we can plausibly simulate human meetings in seconds it will be organizational suicide to take hours-to-weeks to let the humans make an informed and thoughtful decision.
I wasn't saying people would ask for advice instead of letting AIs run organizations, I was saying they would ask for advice at all. (In fact, if the AI is single-single aligned to them in a real sense and very capable, it's even better to let that AI make all decisions on your behalf than to get advice. I was saying that even if no one bothers to have a single-single aligned AI representative, they could still ask AIs for advice and unless these AIs are straightforwardly misaligned in this context (e.g., they intentionally give bad advice or don't try at all without making this clear) they'd get useful advice for their own empowerment.)
- The idea that decision-makers who "think a governance structure will yield total human disempowerment" will "do something else" also seems quite implausible. Such decision-makers will likely struggle to retain power. Decision-makers who prioritize their own "power" (and feel empowered even as they hand off increasing decision-making to AI) and their immediate political survival above all else will be empowered.
I'm claiming that it will selfishly (in terms of personal power) be in their interests to not have such a governance structure and instead have a governance structure which actually increases or retains their personal power. My argument here isn't about coordination. It's that I expect individual powerseeking to suffice for individuals not losing their power.
I think this is the disagreement: I expect that selfish/individual powerseeking without any coordination will still result in (some) humans having most power in the absence of technical misalignment problems. Presumably your view is that the marginal amount of power anyone gets via powerseeking is negligible (in the absence of coordination). But, I don't see why this would be the case. Like all shareholders/board members/etc want to retain their power and thus will vote accordingly which naively will retain their power unless they make a huge error from their own powerseeking perspective. Wasting some resources on negative sum dynamics isn't a crux for this argument unless you can argue this will waste a substantial fraction of all human resources in the long run?
To be clear, this isn't at all an airtight argument: you can in principle have an equilibrium where, if everyone powerseeks (without coordination), everyone gets negligible resources due to negative externalities (that result in some other non-human entity getting power), even if technical misalignment is solved. I just don't see a very plausible case for this, and I don't think the paper makes this case.
Handing off decision making to AIs is fine---the question is who ultimately gets to spend the profits.
If your claim is "insufficient cooperation and coordination will result in racing to build and hand over power to AIs which will yield bad outcomes due to misaligned AI powerseeking, human power grabs, usage of WMDs (e.g., extreme proliferation of bioweapons yielding an equilibrium where bioweapon usage is likely), and extreme environmental negative externalities due to explosive industrialization (e.g., literally boiling earth's oceans)" then all of these seem at least somewhat plausible to me, but these aren't the threat models described in the paper and of this list only misaligned AI powerseeking seems like it would very plausibly result in total human disempowerment.
More minimally, the mitigations discussed in the paper mostly wouldn't help with these threat models IMO.
(I'm skeptical of insufficient coordination by the time industry is literally boiling the oceans on earth. I also don't think usage of bioweapons is likely to cause total human disempowerment except in combination with misaligned AI takeover---why would it kill literally all humans? TBC, I think >50% of people dying during the singularity due to conflict (between humans or with misaligned AIs) is pretty plausible even without misalignment concerns and this is obviously very bad, but it wouldn't yield total human disempowerment.)
I do agree that there are problems other than AI misalignment, including that the default distribution of power might be problematic, people might not carefully contemplate what to do with vast cosmic resources (and thus use them poorly), people might go crazy due to super persuasion or other cultural forces, society might generally have poor epistemics due to training AIs to have poor epistemics or insufficiently deferring to AIs, and many people might die in conflict due to very rapid tech progress.
First, RE the role of "solving alignment" in this discussion, I just want to note that:
1) I disagree that alignment solves gradual disempowerment problems.
2) Even if it would that does not imply that gradual disempowerment problems aren't important (since we can't assume alignment will be solved).
3) I'm not sure what you mean by "alignment is solved"; I'm taking it to mean "AI systems can be trivially intent aligned". Such a system may still say things like "Well, I can build you a successor that I think has only a 90% chance of being aligned, but will make you win (e.g. survive) if it is aligned. Is that what you want?" and people can respond with "yes" -- this is the sort of thing that probably still happens IMO.
4) Alternatively, you might say we're in the "alignment basin" -- I'm not sure what that means, precisely, but I would operationalize it as something like "the AI system is playing a roughly optimal CIRL game". It's unclear how good of performance that can yield in practice (e.g. it can't actually be optimal due to compute limitations), but I suspect it still leaves significant room for fuck-ups.
5) I'm more interested in the case where alignment is not "perfectly" "solved", and so there are simply clear and obvious opportunities to trade-off safety and performance; I think this is much more realistic to consider.
6) I expect such trade-off opportunities to persist when it comes to assurance (even if alignment is solved), since I expect high-quality assurance to be extremely costly. And it is irresponsible (because it's subjectively risky) to trust a perfectly aligned AI system absent strong assurances. But of course, people who are willing to YOLO it and just say "seems aligned, let's ship" will win. This is also part of the problem...
My main response, at a high level:
Consider a simple model:
I predict that group A survives, but the humans are no longer in power. I think this illustrates the basic dynamic. EtA: Do you understand what I'm getting at? Can you explain what you think is wrong with thinking of it this way?
Responding to some particular points below:
Sure, but these things don't result in non-human entities obtaining power right?
Yes, they do; they result in bureaucracies and automated decision-making systems obtaining power. People were already having to implement and interact with stupid automated decision-making systems before AI came along.
Like usually these are somewhat negative sum, but mostly just involve inefficient transfer of power. I don't see why these mechanisms would on net transfer power from human control of resources to some other control of resources in the long run. To consider the most extreme case, why would these mechanisms result in humans or human appointed successors not having control of what compute is doing in the long run?
My main claim was not that these are mechanisms of human disempowerment (although I think they are), but rather that they are indicators of the overall low level of functionality of the world.
I predict that group A survives, but the humans are no longer in power. I think this illustrates the basic dynamic. EtA: Do you understand what I'm getting at? Can you explain what you think is wrong with thinking of it this way?
I think something like this is a reasonable model but I have a few things I'd change.
Whichever group has more power at the end of the week survives.
Why can't both groups survive? Why is it winner-takes-all? Can we just talk about the relative change in power over the week? (As in, how much does the power of B reduce relative to A, and is this going to be an ongoing trend or is it a one-time reduction?)
Probably I'd prefer talking about 2 groups at the start of the singularity. As in, suppose there are two AI companies "A" and "B" where "A" just wants AI systems descended from them to have power and "B" wants to maximize the expected resources under control of humans in B. We'll suppose that the government and other actors do nothing for simplicity. If they start in the same spot, does "B" end up with substantially less expected power? To make this more realistic (as might be important), we'll say that "B" has a random lead/disadvantage uniformly distributed between (e.g.) -3 and 3 months so that winner-takes-all dynamics aren't a crux.
The humans in B ask their AI to make B as powerful as possible at the end of the week, subject to the constraint that the humans in B are sure to stay in charge.
What about if the humans in group B ask their AI to make them (the humans) as powerful as possible in expectation?
Supposing you're fine with these changes, then my claim would be:
I think a crux is that you think there is a perpetual alignment tax while I think a one time tax gets you somewhere.
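To illustrate why this crux matters so much, here is a minimal sketch (a toy model of my own, not from the paper or this thread), assuming resources grow exponentially over the relevant period: a one-time tax shows up as a fixed up-front delay and costs a constant factor, while a perpetual tax shows up as a permanent drag on the growth rate and compounds without bound. The growth rate and tax sizes below are arbitrary placeholders.

```python
import math

def resources(growth_rate, years, one_time_delay_years=0.0, perpetual_tax=0.0):
    """Resources after `years` of exponential growth, with an optional one-time
    delay (time spent up front on alignment) and an optional perpetual tax
    (a permanent fractional drag on the growth rate)."""
    effective_rate = growth_rate * (1.0 - perpetual_tax)
    effective_years = max(years - one_time_delay_years, 0.0)
    return math.exp(effective_rate * effective_years)

r, t = 3.0, 10.0                                        # arbitrary fast-growth numbers
baseline = resources(r, t)
one_time = resources(r, t, one_time_delay_years=0.25)   # ~3-month one-time tax
perpetual = resources(r, t, perpetual_tax=0.05)         # 5% perpetual drag on growth

print(f"one-time tax keeps {one_time / baseline:.1%} of baseline")   # fixed factor, independent of t
print(f"perpetual tax keeps {perpetual / baseline:.1%} of baseline") # keeps shrinking as t grows
```

Under these toy numbers, the one-time tax keeps roughly 47% of baseline no matter how long growth continues, while the perpetual tax's remaining share keeps falling as the horizon lengthens; that difference is the crux as I understand it.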
At a more basic level, when I think about what goes wrong in these worlds, it doesn't seem very likely to be well described as gradual disempowerment (in the sense described in the paper). The existence of an alignment tax doesn't imply gradual disempowerment. A scenario I find more plausible is that you get value drift (unless you pay a substantial, long-lasting alignment tax), but I don't think the actual problem will be well described as gradual disempowerment in the sense described in the paper.
(I don't think I should engage more on gradual disempowerment for the time being unless someone wants to bid for this or trade favors for this or similar. Sorry.)
Another way to put this is that strategy stealing might not work due to technical alignment difficulties or for other reasons and I'm not sold the other reasons I've heard so far are very lethal. I do think the situation might really suck though with e.g. tons of people dying of bioweapons and with some groups that aren't sufficiently ruthless or which don't defer enough to AIs getting disempowered.
BTW, this is also a crux for me as well, in that I do believe that absent technical misalignment, some humans will have most of the power by default, rather than AIs, because I believe AI rights will be limited by default.
I think this is the disagreement: I expect that selfish/individual powerseeking without any coordination will still result in (some) humans having most power in the absence of technical misalignment problems. Presumably your view is that the marginal amount of power anyone gets via powerseeking is negligible (in the absence of coordination). But, I don't see why this would be the case. Like all shareholders/board members/etc want to retain their power and thus will vote accordingly which naively will retain their power unless they make a huge error from their own powerseeking perspective. Wasting some resources on negative sum dynamics isn't a crux for this argument unless you can argue this will waste a substantial fraction of all human resources?
I think I'm imagining a kind of "business as usual" scenario where alignment appears to be solved using existing techniques (like RLHF) or straightforward extensions of these techniques, and where catastrophe is avoided but where AI fairly quickly comes to overwhelmingly dominate economically. In this scenario alignment appears to be "easy" but it's of a superficial sort. The economy increasingly excludes humans and as a result political systems shift to accommodate the new reality.
This isn't an argument for any new or different kind of alignment, I believe that alignment as you describe would prevent this kind of problem.
This is my opinion only, and I am thinking about this coming from a historical perspective so it's possible that it isn't a good argument. But I think it's at least worth consideration as I don't think the alignment problem is likely to be solved in time, but we may end up in a situation where AI systems that superficially appear aligned are widespread.
Thank you for writing this! I think it's probably true that something like "society's alignment to human interests implicitly relies on human labor and cognition" is correct, and that we will need clever solutions, lots of resources, and political will to maintain alignment if human labor and cognition stop playing a large role. I am glad some people are thinking about these risks.
While I think the essay describes dynamics which I think are likely to result in a scary power concentration, I think this is more likely to be a power concentration for humans or more straightforwardly misaligned AIs rather than some notion of complete disempowerment. I'd be excited about a follow-up work which focuses on the argument for power concentration, which seems more likely to be robust and accurate to me.
Some criticism on complete disempowerment (that goes beyond power concentration):
(This probably reflects mostly ignorance on my part rather than genuine weaknesses of your arguments. I have thought some about coordination difficulties but it is not my specialty.)
I think that the world currently has and will continue to have a few properties which make the scenario described in the essay look less likely:
I agree that if we ever lose one of these three properties (and especially the first one), it would be difficult to get them back because of the feedback loops described in the essay. (If you want to argue against these properties, please start from a world like ours, where these three properties are true.) I am curious which property you think is most likely to fall first.
When assuming the combination of these properties, I think that this makes many of the specific threats and positive feedback loops described in the essay look less likely:
I think it is only somewhat unlikely that these properties become false, and thus I think it is worth working on keeping them true. But I feel like them being false is somewhat obviously catastrophic in a range of scenarios much broader than the ones described in the essay, and thus it may be better to work on them directly rather than trying to do something more "systemic".
On a more meta note, I think this essay would have benefited from a bit more concreteness in the scenarios it describes and in the empirical claims it relies on. There is some of that (e.g. on rentier states), but I think there could have been more. I think What does it take to defend the world against out-of-control AGIs? makes related arguments about coordination difficulties (though not on gradual disempowerment) in a way that made more sense to me, giving examples of very concrete "caricature-but-plausible" scenarios and pointing at relevant and analogous coordination failures in the current world.
Thank you for the very detailed comment! I’m pretty sympathetic to a lot of what you’re saying, and mostly agree with you about the three properties you describe. I also think we ought to do some more spelling-out of the relationship between gradual disempowerment and takeover risk, which isn’t very fleshed-out in the paper — a decent part of why I’m interested in it is because I think it increases takeover risk, in a similar but more general way to the way that race dynamics increase takeover risk.
I’m going to try to respond to the specific points you lay out, probably not in enough detail to be super persuasive but hopefully in a way that makes it clearer where we might disagree, and I’d welcome any followup questions off the back of that. (Note also that my coauthors might not endorse all this.)
Responding to the specific assumptions you lay out:
Overall, I think I can picture worlds where (conditional on no takeover) we reach states of pretty serious disempowerment of the kind described in the paper, without any of these assumptions fully breaking down. That said, I expect AI rights to be the most important, and the one that starts breaking down first.
As for the feedback loops you mention:
I hope this sheds some light on things!
Thanks for your answer! I find it interesting to better understand the sorts of threats you are describing.
I am still unsure at what point the effects you describe result in human disempowerment as opposed to a concentration of power.
I have very little sense of what my current stocks are doing, and my impression is many CEOs don’t really understand most of what’s going on in their companies
I agree, but there isn't a massive gap between the interests of shareholders and what companies actually do in practice, and people are usually happy to buy shares of public corporations (buying shares is among the best investment opportunities!). When I imagine your assumptions being correct, the natural consequence I imagine is AI-run companies owned by shareholders that get most of the surplus back. Modern companies are a good example of capital ownership working for the benefit of the capital owner. If shareholders want to fill the world with happy lizards or fund art, they probably will be able to, just like current rich shareholders can. I think for this to go wrong for everyone (not just people who don't have tons of capital) you need something else bad to happen, and I am unsure what that is. Maybe a very aggressive anti-capitalist state?
I also think this is not currently totally true; there is definitely a sense in which some politicians already do not change systems that have bad consequences
I can see how this could be true (e.g. the politicians are under the pressure of a public that has been brainwashed by engagement-maximizing algorithms in a way that undermines the shareholders' power without actually redistributing the wealth, instead spending it all on big national AI projects that do not produce anything other than more AIs), but I feel like that requires some very weird things to be true (e.g. the engagement-maximizing algorithms above result in a very unlikely equilibrium absent an external force that pushes against shareholders and against redistribution). I can see how the state could enable massive AI projects by massive AI-run orgs, but I think it's way less likely that nobody (e.g. not the shareholders, not the taxpayer, not corrupt politicians, ...) gets massively rich (and able to choose what to consume).
About culture, my point was basically that I don't think the evolution of media will be very disempowerment-favored. You can make better-tailored AI-generated content and AI friends, but in most ways I don't see how this results in everyone being completely fooled about the state of the world in a way that enables the other dynamics.
I feel like my position is very far from airtight, I am just trying to point at what feel like holes in a story I don't manage to all fill simultaneously in a coherent way (e.g. how did the shareholders lose their purchasing power? What are the concrete incentives that prevent politicians from winning by doing campaigns like "everyone is starving while we build gold statues in favor of AI gods, how about we don't"? What prevents people from not being brainwashed by the media that has already obviously brainwashed 10% of the population into preferring AI gold statues to not starving?). I feel like you might be able to describe concrete and plausible scenarios where the vague things I say are obviously wrong, but I am not able to generate such plausible scenarios myself. I think your position would really benefit from a simple caricature scenario where each step feels plausible and which ends up not in power being concentrated in the hands of shareholders / dictators / corrupt politicians, but in power in the hands of AIs colonizing the stars with values that are not endorsed by any single human (nor are a reasonable compromise between human values) while the remaining humans slowly starve to death.
I was convinced by What does it take to defend the world against out-of-control AGIs? that there is at least some bad equilibrium that is vaguely plausible, in part because he gave an existence proof by describing some concrete story. I feel like this is missing an existence proof (that would also help me guess what your counterarguments to various objections would be).
In my opinion this kind of scenario is very plausible and deserves a lot more attention than it seems to get.
Space colonies are a potential way out - if a small group of people can make their own colony then they start out in control. The post assumes a world like it is now where you can't just leave. Historically speaking that is perhaps unusual - much of the time in the last 10,000 years it was possible for some groups to leave and start anew.
Aside from the fact that I disagree that it helps (given that an AI takeover that's hostile to humans isn't a local problem), we're optimistically decades away from such colonies being viable independent of Earth, so it seems pretty irrelevant.
The OP is specifically about gradual disempowerment. Conditional on gradual disempowerment, it would help and not be decades away. Now we may both think that sudden disempowerment is much more likely. However in a gradual disempowerment world, such colonies would be viable much sooner as AI could be used to help build them, in the early stages of such disempowerment when humans could still command resources.
In a gradual disempowerment scenario vs. a no-super-AI scenario, humanity's speed to deploy such colonies starts the same before AI can be used, then increases significantly compared to the no-AI world as AI becomes available but before significant disempowerment, then drops to zero with complete disempowerment. The space-capabilities area under the curve in the gradual disempowerment scenario is ahead of baseline for some time, enabling viable colonies to be constructed sooner than if there was no AI.
Sure, space colonies happen faster - but AI-enabled and AI-dependent space colonies don't do anything to make me think disempowerment risk gets uncorrelated.
Things the OP is concerned about like
"What makes this transition particularly hard to resist is that pressures on each societal system bleed into the others. For example, we might attempt to use state power and cultural attitudes to preserve human economic power. However, the economic incentives for companies to replace humans with AI will also push them to influence states and culture to support this change, using their growing economic power to shape both policy and public opinion, which will in turn allow those companies to accrue even greater economic power."
This all gets easier the smaller the society is. Coordination problems get harder the more parties are involved. There will be pressure from motivated people to make the smallest viable colony in terms of people, which makes it easier to resist such things. For example, there is much less effective cultural influence from the AI culture if the colony is founded by a small group of people with a shared human-affirming culture. Even if 99% of the state they come from is disempowered, if small numbers can leave, they can create their own culture and set it up to be resistant to such things. Small groups of people have left decaying cultures throughout history and founded greater empires.
The paper says:
Christiano (2019) makes the case that sudden disempowerment is unlikely,
This isn't accurate. The post What failure looks like includes a scenario involving sudden disempowerment!
The post does say:
The stereotyped image of AI catastrophe is a powerful, malicious AI system that takes its creators by surprise and quickly achieves a decisive advantage over the rest of humanity.
I think this is probably not what failure will look like,
But, I think it is mostly arguing against threat models involving fast AI capability takeoff (where the level of capabilities takes its creators and others by surprise, and fast capabilities progress allows AIs to suddenly become powerful enough to take over) rather than threat models involving sudden disempowerment from a point where AIs are already well known to be extremely powerful.
I think this is correct, but doesn't seem to note the broader trend towards human disempowerment in favor of bureaucratic and corporate systems, which this gradual disempowerment would continue, and hence elides or ignores why AI risk is distinct.
Good point. The reason AI risk is distinct is simply that it removes the need of those bureaucracies and corporations to keep some humans happy and healthy enough to actually run them. That need doesn't exactly put limits on how much they can disempower humans, but it does tend to provide at least some bargaining power for the humans involved.
I don't think that covers it fully. Corporations "need... those bureaucracies," but haven't done what would be expected otherwise.
I think we need to add that corporations are limited to doing things they can convince humans to do, are aligned with at least somewhat human directors/controllers, and are subject to checks and balances: their people can whistleblow, and the company is constrained by law to the extent that its people need to worry about breaking it blatantly.
But I think that breaking these constraints is going to be much closer to the traditional loss-of-control scenario than what you seem to describe.
I'm confused about this response. We explicitly claim that bureaucracies are limited by running on humans, which includes only being capable of actions that human minds can come up with and that humans are willing to execute (cf. "street-level bureaucrats"). We make the point explicit for states, but it clearly holds for corporate bureaucracies as well.
Maybe it does not shine through in the writing, but we spent hours discussing this when writing the paper, and the points you make are 100% accounted for in the conclusions.
I don't think I disagree with you on the whole - as I said to start, I think this is correct. (I only skimmed the full paper, but I read the post; on looking at it, the full paper does discuss this more, and I was referring to the response here, not claiming the full paper ignores the topic.)
That said, in the paper you state that the final steps require something more than human disempowerment due to other types of systems, but per my original point, you seem to elide how the process until that point is identical, by saying that these systems have largely been aligned with humans until now - while I think that's untrue; humans have benefitted despite the systems being poorly aligned. (Misalignment due to overoptimization failures would look like this, and it is what has been happening when economic systems optimize for GDP and ignore wealth disparity, for example; wealth goes up, but as the disparity becomes more extreme, the tails diverge, and at that point maximizing GDP looks very different from what a democracy is supposed to do.)
Back to the point, to the extent that the unique part is due to cutting the last humans out of the decision loop, it does differ - but it seems like the last step definitionally required the initially posited misalignment with human goals, so that it's an alignment or corrigibility failure of the traditional type, happening at the end of this other process that, again, I think is not distinct.
Again, that's not to say I disagree, just that it seems to ignore the broader trend by saying this is really different.
But since I'm responding, as a last complaint, you do all of this without clearly spelling out why solving technical alignment would solve this problem, which seems unfortunate. Instead, the proposed solutions try to patch the problems of disempowerment by saying you need to empower humans to stay in the decision loop - which in the posited scenario doesn't help when increasingly powerful but fundamentally misaligned AI systems are otherwise in charge. But this is making a very different argument, and one I'm going to be exploring when thinking about oversight versus control in a different piece I'm writing.
Just writing a model that came to mind, partly inspired by Ryan here.
Extremely good single-single alignment should be highly analogous to "current humans becoming smarter and faster thinkers".
If this happens at roughly the same speed for all humans, then each human is better equipped to increase their utility, but does this lead to a higher global utility? This can be seen as a race between the capability (of individual humans) to find and establish better coordination measures, and their capability to selfishly subvert these coordination measures. I do think it's more likely than not that the former wins, but it's not guaranteed.
Probably someone like Ryan believes most of those failures will come in the form of explicit conflict or sudden attacks. I can also imagine slower erosions of global utility, for example by safe interfaces/defenses between humans becoming unworkable slop into which most resources go.
If this doesn't happen at roughly the same speed for all humans, you also get power imbalance and its consequences. One could argue that differences in resources between humans will keep growing, in which case this is the only stable state.
If instead of perfect single-single alignment we get the partial (or more taxing) fix I expect, the situation degrades further. Extending the analogy, this would be the smart humans sometimes being possessed by spirits with different utilities, which not only has direct negative consequences but could also complicate coordination once it's common knowledge.
I feel like there are some critical metrics or factors here that are getting overlooked in the details.
I agree with your assessment that it's very likely that many people will lose power. I think it's fairly likely that at some point most humans won't be able to provide much economic value, and won't be able to ask for many resources in return. So I could see an argument for incredibly high levels of inequality.
However, there is a key question in that case, of "could the people who own the most resources guide AIs using those resources to do what they want, or will these people lose power as well?"
I don't see a strong reason why these people would lose power or control. That would seem like a fundamental AI alignment issue - in a world where a small group of people own all the world's resources, and there's strong AI, can those people control their AIs in ways that would provide this group a positive outcome?
2. There are effectively two ways these systems maintain their alignment: through explicit human actions (like voting and consumer choice), and implicitly through their reliance on human labor and cognition. The significance of the implicit alignment can be hard to recognize because we have never seen its absence.
3. If these systems become less reliant on human labor and cognition, that would also decrease the extent to which humans could explicitly or implicitly align them. As a result, these systems—and the outcomes they produce—might drift further from providing what humans want.
There seems to be a key assumption here that people are able to maintain control because their labor and cognition are important.
I think this makes sense for people who need to work for money, but not for those who are rich.
Our world has a long history of dumb rich people who provide neither labor nor cognition, and still seem to do pretty fine. I'd argue that power often matters more than human output, and would expect the importance of power to increase over time.
I think that many rich people now are able to maintain a lot of control, with very little labor/cognition. They have been able to decently align other humans to do things for them.
I broadly agree with the view that something like this is a big risk under a lot of current human value sets.
One important caveat, for some value sets, is that I don't think this results in an existential catastrophe. The broad reason is that in single-single alignment scenarios, some humans will remain in control and potentially become immortal, and scenarios in which this is achieved are automatically excluded from existential catastrophes, solely because human potential is still realized; it's just that most humans are locked out of it.
It has similarities to this:
https://www.lesswrong.com/posts/2ujT9renJwdrcBqcE/the-benevolence-of-the-butcher
But more fleshed out.
I disagree with critics who argue this risk is negligible, because the future is extraordinarily hard to predict. The present state of society would have been extremely hard for people in the past to predict. They would have assumed that if we managed to solve problems they considered extremely hard, then surely we wouldn't be brought down by risk denialism, fake news, personal feuds between powerful people over childish insults, and so forth. Yet here we are.
Never underestimate the shocking shortsightedness of businesses. Look at the AI labs, for example. Communists observing this phenomenon were quoted as saying "the capitalists will sell us the rope we hang them with."
It's not selfishness, it's bias. Businesspeople are not willing to destroy everything just to temporarily make an extra dollar—no human thinks like that! Instead, businesspeople are very smart and strategic but extraordinarily biased into thinking whatever keeps their business going or growing must be good for the people. Think about Stalin being very smart and strategic but extraordinarily biased into thinking whatever keeps him in power must be good for the people. It's not selfishness! If Stalin (or any dictator) were selfish, they would quickly retire and live the most comfortable retirements imaginable.
Humans evolved to be the most altruistic beings ever, with barely a drop of selfishness. Our selfish genes make us altruistic (as soon as power is within reach) because there's a thin line between "the best way to help others" and "amassing power at all costs." These two things look similar due to instrumental convergence, and it only takes a little bit of bias/delusion to make the former behave identically to the latter.
Even if gradual disempowerment doesn't directly starve people to death, it may raise misery and life dissatisfaction to civil war levels.
Collective anger may skyrocket to the point where people would rather have their favourite AI run the country than the current leader. They elect politicians loyal to a version of the AI, and intellectuals facepalm. The government buys the AI company for national security reasons, and the AI completely takes over its own development process with half the country celebrating. More people facepalm as politicians lick the boots of the "based" AI and parrot its wise words, e.g. "if you replace us with AI, we'll replace you with AI!"
While it is important to be aware of gradual disempowerment and for a few individuals to study it, my cause prioritization opinion is that only 1%-10% of the AI safety community should work on this problem.
The AI safety community is absurdly tiny. AI safety spending is less than 0.1% of AI capability spending, which in turn is less than 0.5% of world GDP.
The only way for the AI safety community to influence the world is to use its tiny resources to work on things which the majority of the world will never get a chance to work on.
This includes working on the risk of a treacherous turn, where an AGI/ASI suddenly turns against humanity. The majority of the world never gets a chance to work on this problem, because by the time they realize it is a big problem, it has probably already happened, and they are already dead.
Of course, working on gradual disempowerment early is better than working on gradual disempowerment later, but this argument applies to everything. Working on poverty earlier is better than working on poverty later. Working on world peace earlier is better than working on world peace later.
If further thorough research confirms that this risk has a high probability, then the main benefit is using it as an argument for AI regulation/pause, when society hasn't yet tasted the addictive benefits of AGI.
It is theoretically hard to convince people to avoid X for their own good, because once they get X it'll give them so much power or wealth they cannot resist it anymore. But in practice, such an argument may work well since we're talking about the elites being unable to resist it, and people today have anti-elitist attitudes.
If the elites are worried the AGI will directly kill them, while the anti-elitists are half worried the AGI will directly kill them, and half worried [a cocktail of elites mixed with AGI] will kill them, then at least they can finally agree on something.
PS: Have you seen Dan Hendrycks' arguments? They sort of look like gradual disempowerment.
Furthermore, without unprecedented changes in redistribution, declining labor share also translates into a structural decline in household consumption power, as humans lose their primary means of earning the income needed to participate in the economy as consumers.
This holds only if the labor share of income shrinks faster than purchasing power grows. Overall, I still think the misaligned economy argument goes through if household consumption power grows in absolute terms but "human preference aligned dollars" shrinks as a fraction of total dollars spent.
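One way to make that condition precise, using my own notation rather than anything from the paper (treat it as a back-of-the-envelope sketch): write household consumption power as the labor share times total real output,

$$C_t = s_t Y_t,$$

so that, in growth rates,

$$\frac{\dot{C}}{C} = \frac{\dot{s}}{s} + \frac{\dot{Y}}{Y}.$$

Absolute consumption power $C$ declines only when the labor share shrinks faster than total output grows, i.e. $|\dot{s}/s| > \dot{Y}/Y$; but the human-aligned fraction of spending, $C/Y = s$, falls whenever the labor share falls, regardless of how fast the economy grows.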
Full version on arXiv | X
Executive summary
AI risk scenarios usually portray a relatively sudden loss of human control to AIs, outmaneuvering individual humans and human institutions, due to a sudden increase in AI capabilities, or a coordinated betrayal. However, we argue that even an incremental increase in AI capabilities, without any coordinated power-seeking, poses a substantial risk of eventual human disempowerment. This loss of human influence will be centrally driven by having more competitive machine alternatives to humans in almost all societal functions, such as economic labor, decision making, artistic creation, and even companionship.
A gradual loss of control of our own civilization might sound implausible. Hasn't technological disruption usually improved aggregate human welfare? We argue that the alignment of societal systems with human interests has been stable only because of the necessity of human participation for thriving economies, states, and cultures. Once this human participation gets displaced by more competitive machine alternatives, our institutions' incentives for growth will be untethered from a need to ensure human flourishing. Decision-makers at all levels will soon face pressures to reduce human involvement across labor markets, governance structures, cultural production, and even social interactions. Those who resist these pressures will eventually be displaced by those who do not.
Still, wouldn't humans notice what's happening and coordinate to stop it? Not necessarily. What makes this transition particularly hard to resist is that pressures on each societal system bleed into the others. For example, we might attempt to use state power and cultural attitudes to preserve human economic power. However, the economic incentives for companies to replace humans with AI will also push them to influence states and culture to support this change, using their growing economic power to shape both policy and public opinion, which will in turn allow those companies to accrue even greater economic power.
Once AI has begun to displace humans, existing feedback mechanisms that encourage human influence and flourishing will begin to break down. For example, states funded mainly by taxes on AI profits instead of their citizens' labor will have little incentive to ensure citizens' representation. This could occur at the same time as AI provides states with unprecedented influence over human culture and behavior, which might make coordination amongst humans more difficult, thereby further reducing humans' ability to resist such pressures. We describe these and other mechanisms and feedback loops in more detail in this work.
Though we provide some proposals for slowing or averting this process, and survey related discussions, we emphasize that no one has a concrete plausible plan for stopping gradual human disempowerment, and that methods of aligning individual AI systems with their designers' intentions are not sufficient. Because this disempowerment would be global and permanent, and because human flourishing requires substantial resources in global terms, it could plausibly lead to human extinction or similar outcomes.