- If takeoff is slow-ish, a pivotal act (preventing more AGIs from being developed) will be difficult.
- If no pivotal act is performed, RSI-capable AGI proliferates. This creates an n-way non-iterated Prisoner's Dilemma in which the first to attack wins.
These two points seem to be in direct conflict. The sorts of capabilities and underlying winner-take-all dynamics that would make "the first to attack wins" true are exactly the sorts of capabilities and winner-take-all dynamics that would make a pivotal act tractable.
Or, to put it differently: the first "attack" (though it might not look very "attack"-like) is the pivotal act; if the first attack wins, that means the pivotal act worked, and therefore wasn't that difficult. Conversely, if a pivotal act is too hard, then even if an AI attacks first and wins, it has no ability to prevent new AIs from being built and displacing it; if it did have that ability, then the attack would be a pivotal act.
Yes; except that a successful act can still be quite difficult.
You could reframe the concern to be that pivotal acts in a slow takeoff are prone to be bloody and dangerous. And because they are, and humans are likely to retain control, a pivotal act may be put off until it's even more bloody - like a nuclear conflict or making the sun go nova.
Worse yet, the "pivotal act" may be performed by the worst (human) actor, not the best.
Thanks for writing this, I think it's good to have discussions around these sorts of ideas.
Please, though, let's not give up on "value alignment," or, rather, conscience guard-railing, where the artificial conscience is in line with human values.
Sometimes when enough intelligent people declare that something is too hard to even try at, it becomes a self-fulfilling prophecy - most people give up on it, and then of course it's never achieved. We do want to be realistic, I think, but still put in effort in areas where there could be a big payoff when we're really not sure if it'll be as hard as it seems.
Currently, an open-source value-aligned model can easily be modified into a merely intent-aligned model. The alignment isn't 'sticky'; it's easy to remove without substantially impacting capabilities.
So unless this changes, the hope of peace through value-aligned models routes through hoping that the people in charge of them are sufficiently ethical and value-aligned not to turn the model into a purely intent-aligned one.
Please convince me I'm wrong.
(I've only skimmed for now but) here's a reason / framework which might help with things going well: https://aiprospects.substack.com/p/paretotopian-goal-alignment.
Even with a very slow takeoff where AIs reformat the economy without there being superintelligence, peaceful loss of control due to the rising economic influence of AIs seems more plausible (as a source of upheaval in the world order) than human-centric conflict. Humans will gradually hand off more autonomy to AIs as they become capable of wielding it, and at some point most relevant players are themselves AIs. This seems unlikely mainly because superintelligence would make humans irrelevant even faster and less consensually.
Pausing AI for decades, if it's not y...
One element that needs to be remembered here is that each major participant in this situation will have superhuman advice. Even if these are "do what I mean and check" order-following AIs, if they can foresee that an order will lead to disaster, they will presumably be programmed to say so (not doing so is possible, but is clearly a flawed design). So if it is reasonably obvious to anything superintelligent that both:
a) treating this as a zero-sum, winner-take-all game is likely to lead to a disaster, and
b) there is a cooperative non-zero-sum game approach w...
I think "The first AGI probably won't perform a pivotal act" is by far the weakest section.
To start things off, I would predict a world with slow takeoff and personal intent-alignment looks far more multipolar than the standard Yudkowskian recursively self-improving singleton that takes over the entire lightcone in a matter of "weeks or hours rather than years or decades". So the title of that section seems a bit off because, in this world, what the literal first AGI does becomes much less important, since we expect to see other similarly capable AI ...
I don't think your scenario works, maybe because I don't believe that the world is as offense-advantaged as you say.
I think the closest domain where things are this offense-biased is the biotech domain, and while I do think biotech leading to doom is something we will eventually have to solve, I'm way less convinced of the assumption that every other domain is so offense-advantaged that whoever goes first essentially wins the race.
That said, I'm worried about scenarios where we do solve alignment and get catastrophe anyway, though unlike your scenario, I e...
I think "pivotal act" is being used to mean both "gain affirmative control over the world forever" and "prevent any other AGI from gaining affirmative control of the world for the foreseeable future". The latter might be much easier than the former though.
(Posting this initial comment without having read the whole thing because I won't have a chance to come back to it today; apologies if you address this later or if it's clearly addressed in a comment)
If we solve alignment and create personal intent aligned AGI but nobody manages a pivotal act, I see a likely future world with an increasing number of AGIs capable of recursively self-improving.
It seems worth spelling out your view here on how RSI-capable early AGI is likely to be. I would expect that early AGI will be capable of RSI in the weak sense of bein...
If no pivotal act is performed, RSI-capable AGI proliferates
Minor suggestion: spell out 'recursive self-improvement (RSI)' the first time; it took me a minute to remember the acronym.
I've been reading a lot of the stuff that you have written and I agree with most of it (like 90%). However, one thing which you mentioned (somewhere else, but I can't seem to find the link, so I am commenting here) and which I don't really understand is iterative alignment.
I think that the iterative alignment strategy has an ordering error – we first need to achieve alignment to safely and effectively leverage AIs.
Consider a situation where AI systems go off and “do research on alignment” for a while, simulating tens of years of human research work. The pr...
I wonder, have the comments managed to alleviate your concerns at all? Are there any promising ideas for multipolar AGI scenarios? Were there any suggestions that could work?
This strikes me as defining "alignment" a little differently than me.
It might even define "instruction-following" differently than me.
If we really solved instruction following, you could give the instruction "Do the right thing" and it would just do the right thing.
If that's possible, then what we need is a coalition to tell powerful AIs to "do the right thing", rather than "make my creators into god-emperors" or whatever. This seems doable, though the clock is perhaps ticking.
If you can't just tell an AI to do the right thing, but it's still competent...
The nuclear MAD standoff with nonproliferation agreements is fairly similar to the scenario I've described. We've survived that so far - but with only nine participants to date.
I wonder if there's a clue in this. When you say "only" nine participants, it suggests that more would introduce more risk, but that's not what we've seen with MAD. The greater the number becomes, the bigger the deterrent gets. If, for a minute, we forgo alliances, there is a natural alliance of "everyone else" at play when it comes to an aggressor. Military aggression is, after ...
I am not sure how notifications on LessWrong work, so I am going to repost one of my concerns about alignment here.
Consider gravity on Earth: it seems to work every year. However, this fact alone is consistent with theories that gravity will stop working in 2025, 2026, 2027, 2028, 2029, 2030, etc. There are infinitely many such theories and only one theory that gravity will work as an absolute rule.
We might infer from the simplest explanation that gravity holds as an absolute rule. However, the case is different with alignment. To ensure AI alignment, our eviden...
Epistemic status: I'm aware of good arguments that this scenario isn't inevitable, but it still seems frighteningly likely even if we solve technical alignment. Clarifying this scenario seems important.
TL;DR: (edits in parentheses, two days after posting, from discussions in comments)
The first AGIs will probably be aligned to take orders
People in charge of AGI projects like power. And by definition, they like their values somewhat better than the aggregate values of all of humanity. It also seems like there's a pretty strong argument that Instruction-following AGI is easier than value aligned AGI. In the slow-ish takeoff we expect, this alignment target seems to allow for error-correcting alignment, in somewhat non-obvious ways. If this argument holds up even weakly, it will be an excuse for the people in charge to do what they want to anyway.
I hope I'm wrong and value-aligned AGI is just as easy and likely. But it seems like wishful thinking at this point.
The first AGI probably won't perform a pivotal act
In realistically slow takeoff scenarios, the AGI won't be able to do anything like make nanobots to melt down GPUs. It would have to use more conventional methods, like software intrusion to sabotage existing projects, followed by elaborate monitoring to prevent new ones. Such a weak attempted pivotal act could fail, or could escalate to a nuclear conflict.
Second, the humans in charge of AGI may not have the chutzpah to even try such a thing. Taking over the world is not for the faint of heart. They might get it after their increasingly-intelligent AGI carefully explains to them the consequences of allowing AGI proliferation, or they might not. If the people in charge are a government, the odds of such an action go up, but so do the risks of escalation to nuclear war. Governments seem to be fairly risk-taking. Expecting governments to not just grab world-changing power while they can seems naive, so this is my median scenario.
So RSI-capable AGI may proliferate until a disaster occurs
If we solve alignment and create personal intent aligned AGI but nobody manages a pivotal act, I see a likely future world with an increasing number of AGIs capable of recursively self-improving. How long until someone tells their AGI to hide, self-improve, and take over?
Many people seem optimistic about this scenario. Perhaps network security can be improved with AGIs on the job. But AGIs can do an end-run around the entire system: hide, set up self-replicating manufacturing (robotics is rapidly improving to allow this), use that to recursively self-improve your intelligence, and develop new offensive strategies and capabilities until you've got one that will work within an acceptable level of viciousness.[1]
If hiding in factories isn't good enough, do your RSI manufacturing underground. If that's not good enough, do it as far from Earth as necessary. Take over with as little violence as you can manage or as much as you need. Reboot a new civilization if that's all you can manage while still acting before someone else does.
The first one to pull out all the stops probably wins. This looks all too much like a non-iterated Prisoner's Dilemma with N players - and N increasing.
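To make the game-theoretic shape of this concrete, here is a minimal sketch (the payoff values and the `payoff` helper are my own illustrative assumptions, not anything formal from this post) of the one-shot, N-player payoff structure. Only the assumed ordering matters: striking first beats mutual restraint, which beats a multi-way strike, which beats waiting while someone else strikes. Under that ordering, attacking dominates waiting for every player, regardless of what the others do - which is exactly the non-iterated Prisoner's Dilemma dynamic described above.

```python
# A minimal sketch (assumptions mine, not from the post) of the one-shot
# N-player payoff structure described above. Payoff values are illustrative;
# only the ordering matters: T (strike first while others wait) >
# R (everyone waits) > P (multiple players strike) >
# S (wait while someone else strikes first).

T, R, P, S = 1.0, 0.6, 0.1, 0.0  # assumed ordering: T > R > P > S

def payoff(my_move: str, others_attacking: int) -> float:
    """Single-round payoff for one player, given how many rivals attack."""
    if my_move == "attack":
        return T if others_attacking == 0 else P
    return R if others_attacking == 0 else S

# Attacking strictly beats waiting for every possible number of attackers,
# so each player's individually rational move is to strike first, even
# though universal restraint (R for everyone) would leave all players better off.
for others in range(10):
    assert payoff("attack", others) > payoff("wait", others)

print("Attacking dominates waiting for every opponent count checked.")
```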
Counterarguments/Outs
For small numbers of AGI and similar values among their wielders, a collective pivotal act could be performed. I place some hopes here, particularly if political pressure is applied in advance to aim for this outcome, or if the AGIs come up with better cooperation structures and/or arguments than I have.
The nuclear MAD standoff with nonproliferation agreements is fairly similar to the scenario I've described. We've survived that so far - but with only nine participants to date.
One means of preventing AGI proliferation is universal surveillance by a coalition of loosely cooperative AGI (and their directors). That might be done without universal loss of privacy if a really good publicly encrypted system were used, as Steve Omohundro suggests, but I don't know if that's possible. If privacy can't be preserved, this is not a nice outcome, but we probably shouldn't ignore it.
The final counterargument is that, if this scenario does seem likely, and this opinion spreads, people will work harder to avoid it, making it less likely. This virtuous cycle is one reason I'm writing this post including some of my worst fears.
Please convince me I'm wrong. Or make stronger arguments that this is right.
I think we can solve alignment, at least for personal-intent alignment, and particularly for the language model cognitive architectures that may well be our first AGI. But I'm not sure I want to keep helping with that project until I've resolved the likely consequences a little more. So give me a hand?
(Edit:) Conclusions after discussion
None of the suggestions in the comments seemed to me like workable ways to solve the problem.
I think we could survive an n-way multipolar human-controlled ASI scenario if n is small - like a handful of ASIs controlled by a few different governments. But not indefinitely - unless those ASIs come up with coordination strategies no human has yet thought of (or at least none argued convincingly enough that I've heard of them - this isn't really my area, but nobody has pointed to any strong possibilities in the comments). I'd love more pointers to coordination strategies that could solve this problem.
So my conclusion is to hope that this is such an obviously bad and dangerous scenario that it won't be allowed to happen.
Basically, my hope is that this all becomes viscerally obvious to the first people who speak with a superhuman AGI and who think about global politics. I hope they'll pull their shit together, as humans sometimes do when they're motivated to actually solve hard problems.
I hope they'll declare a global moratorium on AGI development and proliferation, and agree to share the benefits of their AGI/ASI broadly in hopes that this gets other governments on board, at least on paper. They'd use their AGI to enforce that moratorium, along with hopefully minimal force. Then they'll use their intent-aligned AGI to solve value alignment and launch a sovereign ASI before some sociopath(s) gets ahold of the reins of power and creates a permanent dystopia of some sort.
More on this scenario in my reply below.
I'd love to get more help thinking about how likely the central premise is - that people will get their shit together once they're staring real AGI in the face. And what we can do now to encourage that.
Additional edit: Eli Tyre and Steve Byrnes have reached similar conclusions by somewhat different routes. More in a final footnote.[2]
Some maybe-less-obvious approaches to takeover, in ascending order of effectiveness: drone- or missile-delivered explosive attacks on the individuals controlling, and the data centers housing, rival AGIs; social engineering/deepfakes to set off cascading nuclear launches and reprisals; dropping stuff from orbit or altering asteroid paths; making the sun go nova.
The possibilities are limitless. It's harder to stop explosions than to set them off by surprise. A superintelligence will think of all of these and much better options. Anything more subtle that preserves more of the first actors' near-term winnings (earth and humanity) is gravy. The only long-term prize goes to the most vicious.
Eli Tyre reaches similar conclusions with a more systematic version of this logic in Unpacking the dynamics of AGI conflict that suggest the necessity of a premptive pivotal act.
Steve Byrnes reaches similar conclusions in What does it take to defend the world against out-of-control AGIs?, but he focuses on near-term, fully vicious attacks from misaligned AGI, prior to fully hardening society and networks, centering on triggering full nuclear exchanges. I find this scenario less likely because I expect instruction-following alignment to mostly work on the technical level, and the first groups to control AGIs to avoid apocalyptic attacks.
I have yet to find a detailed argument that addresses these scenarios and reaches opposite conclusions.