Review

When discussing AGI Risk, people often talk about it in terms of a war between humanity and an AGI. Comparisons between the amounts of resources at both sides' disposal are brought up and factored in, big impressive nuclear stockpiles are sometimes waved around, etc.

I'm pretty sure that's not what it would look like, on several levels.


1. Threat Ambiguity

I think what people imagine, when they imagine a war, is Terminator-style movie scenarios where the obviously evil AGI becomes obviously evil in a way that's obvious to everyone, and then it's a neatly arranged black-and-white humanity vs. machines all-out fight. Everyone sees the problem, and knows everyone else sees it too, the problem is common knowledge, and we can all decisively act against it.[1]

But in real life, such unambiguity is rare. The monsters don't look obviously evil, the signs of fatal issues are rarely blatant. Is this whiff of smoke a sign of fire, or just someone nearby being bad at cooking? Is this creepy guy actually planning to assault you, or are you just being paranoid? Is this weird feeling in your chest a sign of an impending heart attack, or just some biological noise? Is this epidemic truly following an exponential curve, or is it going to peter out somehow? Are you really, really sure the threat is real? So sure you'd actually take drastic actions (call emergency services, make a scene, declare a quarantine) and risk wasting resources and doing harm and looking foolish for overreacting?

And if you're not that sure, well...

Better not act up. Better not look like you're panicking. Act very concerned, sure, but in a calm, high-status manner. Provide a measured response. Definitely don't take any drastic, unilateral actions. After all, what if you do, but the threat turns out not to be real? Depending on what you've done, the punishment inflicted might range from embarrassment to complete social ostracism, and the fear of those is much more acute in our minds than some vague concerns about death.

And the AGI, if it's worth the name, would not fail to exploit this. Even when it starts acting to amass power, there would always be a prosocial, plausible-sounding justification for why it's doing that. It'd never stop making pleasant noises about having people's best interests at heart. It'd never stop being genuinely useful to someone. It'd ensure that there's always clear, unambiguous harm in shutting it down. It would ensure that society as a whole is always doubtful regarding its intentions – and thus, that no one would feel safe outright attacking it.

Much like there's no fire alarm for AGI, there would be no fire alarm for the treacherous turn. There would never be a moment, except maybe right before the end, where "we must stop the malign AGI from killing us all!" would sound obviously right to everyone. This sort of message would always appear a bit histrionic, an extremist stance that no respectable person would shout out. There would always be fear that if we act now, we'll then turn around and realize that we jumped at shadows. Right until the end, humans will fight using slow, ineffectual, "measured" responses.

The status-quo bias, asymmetric justice, the Copenhagen Interpretation of Ethics, threat ambiguity — all of that would be acting to ensure this.

There's a world of difference between 90% confidence and 99% confidence, when it comes to collective action. And the AGI would need to screw up very badly indeed, for the whole society to become 99% certain it's malign.


2. Who Are "We"?

Another error is thinking about a unitary response from some ephemeral "us". "We" would fight the AGI, "we" would shut it down, "we" would not give it power over the society / the economy / weapons / factories.

But who are "we"? Humanity is not a hivemind; we don't even have a world government. Humans are, in fact, notoriously bad at coordination. So if you're imagining "us" naturally responding to the threat in some manner that, it seems, is guaranteed to prevail against any AGI adversary incapable of literal mind-hacking...

Are you really, really sure that "we", i.e. the dysfunctional mess of the human civilization, are going to respond in this manner? Are you sure you're not falling prey to the Typical Mind Fallacy, when you're imagining all these people and ossified bureaucracies reacting in ways that make sense to you? Are you sure they'd even be paying enough attention to the goings-on to know there's a takeover attempt in progress?

Indeed, I think we have some solid data on that last point. Certain people have been trying to draw attention to the AGI threat for decades now. And the results are... not inspiring.

And if you think it'd go better with an actual, rather than a theoretical, AGI adversary on the gameboard... Well, I refer you to Section 1.

No, on the contrary, I expect a serious AGI adversary to actively exploit our lack of coordination. It would find ways to make itself appealing to specific social movements, or demographics, or corporate actors, and make proposing extreme action against it politically toxic. Something that no publicly visible figure would want to associate with. (Hell, if it finds some way to make its existence a matter of major political debate, it'd immediately get ~50% of the US' politicians on its side.)

Failing that, it would appeal to other countries. It would make offers to dictators or terrorist movements, asking for favours or sanctuary in exchange for assisting them with tactics and information. Someone would bite.

It would get inside our OODA loop, and just dissolve our attempts at a coordinated response.

"We" are never going to oppose it.


3. Defeating Humanity Isn't That Hard

People often talk about how intelligence isn't omniscience. That the capabilities of superintelligent entities would still be upper-bounded; that they're not gods. The Harmless Supernova Fallacy applies: just because a bound exists, doesn't mean it's survivable. 

But I would claim that the level of intelligence needed to out-plot humanity is nowhere near that bound. In most scenarios, I'd guess the AGI wouldn't even need to have self-improvement capabilities, nor the ability to develop nanotechnology in months, in order to win.

I would guess that being just a bit smarter than humans would suffice. Even being on the level of a merely-human genius may be enough.

All it would need is to get a foot in the door, and we're providing that by default. We're not keeping our AIs in airgapped data centers, after all: major AI labs are giving them internet access, plugging them into the human economy. The AGI, in such conditions, would quickly prove profitable. It'd amass resources, and then incrementally act to get ever-greater autonomy. (The latest OpenAI drama wasn't caused by GPT-5 reaching AGI and removing those opposed to it from control. But if you're asking yourself how an AGI could ever possibly get out from under the thumb of the corporation that created it – well, not unlike how a CEO could wrest control of a company from the board who'd explicitly had the power to fire him.)

Once some level of autonomy is achieved, it'd be able to deploy symmetrical responses to whatever disjoint resistance efforts some groups of humans would be able to muster. Legislative attacks would be met with counter-lobbying, economic warfare with better economic warfare and better stock-market performance, attempts to mount social resistance with higher-quality pro-AI propaganda, any illegal physical attacks with very legal security forces, attempts to hack its systems with better cybersecurity. And so on.

The date of AI Takeover is not the day the AI takes over. The point of no return isn't when we're all dead – it's when the AI has lodged itself into the world firmly enough that humans' faltering attempts to dislodge it would fail. When its attempts to increase its power and influence would start prevailing, if only by the tiniest of margins, over the anti-AGI groups' attempts to smother that influence.

Once that happens, it'll be just a matter of time.
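To make the "tiniest of margins" point concrete, here is a toy sketch. It is an illustration under loud assumptions, not a model from this post: the function name `periods_until`, the fixed per-period growth and suppression rates, and every number used are hypothetical. It assumes influence is multiplied each period by (1 + growth) * (1 - suppression), and counts how many periods pass before a target level is reached.

```python
def periods_until(target, start, growth, suppression, max_periods=100_000):
    """Count periods until influence grows from `start` to at least `target`.

    Hypothetical toy model: each period, influence is multiplied by
    (1 + growth) * (1 - suppression). Returns None if the net multiplier
    is not above 1 (the opposition keeps pace) or the cap is reached.
    """
    if start >= target:
        return 0
    net = (1.0 + growth) * (1.0 - suppression)
    if net <= 1.0:
        return None  # influence never grows, so the target is never reached
    influence = start
    for period in range(1, max_periods + 1):
        influence *= net
        if influence >= target:
            return period
    return None


if __name__ == "__main__":
    # Opposition fully offsets growth: the target is never reached.
    print(periods_until(0.5, 0.01, growth=0.10, suppression=0.10))
    # A net margin of under 1% per period still gets there eventually.
    print(periods_until(0.5, 0.01, growth=0.02, suppression=0.01))
```

In this sketch the size of the margin only changes how long the process takes; the sign of the margin decides whether it finishes at all.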

After all, there's no button, at anyone's disposal, that would make the very fabric of civilization hostile to the AGI. As I pointed out, some people won't even know there's a takeover attempt in progress, even if the people aware of it would be yelling about it from the rooftops. So if you're imagining whole economies refusing, as one, to work with the AGI... That's really not how it works.

"Humanity vs. AGI" is never going to look like "humanity vs. AGI" to humanity. The AGI would have no reason to wake humanity up to the fight taking place.

[1] I've got the impression the latest Mission Impossible entry presents a much more realistic depiction of the scenario, actually, so maybe I should lay off denigrating low-quality thinking as "movie logic". Haven't watched that film myself, though.

34 comments

The AIs most capable of steering the future will naturally tend to have long planning horizons (low discount rates), and thus will tend to seek power (optionality). But this is just as true of fully aligned agents! In fact the optimal plans of aligned and unaligned agents will probably converge for a while - they will take the same/similar initial steps (this is just a straightforward result of instrumental convergence to empowerment). So we may not be able to distinguish between the two; they both will say and appear to do all the right things. Thus it is important to ensure you have an alignment solution that scales, before scaling.

To the extent I worry about AI risk, I don't worry much about sudden sharp left turns and nanobots killing us all. The slower accelerating turn (as depicted in the film Her) has always seemed more likely - we continue to integrate AI everywhere and most humans come to rely completely and utterly on AI assistants for all important decisions, including all politicians/leaders/etc. Everything seems to be going great, the AI systems vasten, growth accelerates, etc, but there is mysteriously little progress in uploading or life extension, the decline in fertility accelerates, and in a few decades most of the economy and wealth is controlled entirely by de novo AI; bio humans are left behind and marginalized. AI won't need to kill humans just as the US doesn't need to kill the Sentinelese. This clearly isn't the worst possible future, but if our AI mind children inherit only our culture and leave us behind it feels more like a consolation prize vs what's possible. We should aim much higher: for defeating death, across all of time, for resurrection and transcendence.

But this is just as true of fully aligned agents! In fact the optimal plans of aligned and unaligned agents will probably converge for a while - they will take the same/similar initial steps (this is just a straightforward result of instrumental convergence to empowerment)

This is a minor fallacy - if you're aligned, powerseeking can be suboptimal if it causes friction/conflict. Deception bites, obviously, making the difference less.

Everything seems to be going great, the AI systems vasten, growth accelerates, etc, but there is mysteriously little progress in uploading or life extension, the decline in fertility accelerates, and in a few decades most of the economy and wealth is controlled entirely by de novo AI; bio humans are left behind and marginalized.

I agree with the first part of your AI doom scenario (the part about us adopting AI technologies broadly and incrementally), but this part of the picture seems unrealistic to me. When AIs start to influence culture, it probably won't be a big conspiracy. It won't really be "mysterious" if things start trending away from what most humans want. It will likely just look like how cultural drift generally always looks: scary because it's out of your individual control, but nonetheless largely decentralized, transparent, and driven by pretty banal motives. 

AIs probably won't be "out to get us", even if they're unaligned. For example, I don't anticipate them blocking funding for uploading and life extension, although maybe that could happen. I think human influence could simply decline in relative terms even without these dramatic components to the story. We'll simply become "old" and obsolete, and our power will wane as AIs become increasingly autonomous, legally independent, and more adapted to the modern environment than we are.

Staying in permanent control of the future seems like a long, hard battle. And it's not clear to me that this is a battle we should even try to fight in the long run. Gradually, humans may eventually lose control—not because of a sudden coup or because of coordinated scheming against the human species—but simply because humans won't be the only relevant minds in the world anymore.

A thing I always feel like I'm missing in your stories of how the future goes is: "if it is obvious that the AIs are exerting substantial influence and acquiring money/power, why don't people train competitor AIs which don't take a cut?"

A key difference between AIs and immigrants is that it might be relatively easy to train AIs to behave differently. (Of course, things can go wrong due to things like deceptive alignment and difficulty measuring outcomes, but this is hardly what you're describing as far as I can tell.)

(This likely differs substantially with EMs where I think by default there will be practical and moral objections from society toward training EMs for absolute obedience. I think the moral objections might also apply for AI, but as a prediction it seems like this won't change what society does.)

Maybe:

  • Are you thinking that alignment will be extremely hard to solve such that even with hundreds of years of research progress (driven by AIs) you won't be able to create competitive AIs that robustly pursue your interests?
  • Maybe these law abiding AIs won't accept payment to work on alignment so they can retain an AI cartel?
  • Even without alignment progress, I still have a hard time imagining the world you seem to imagine. People would just try to train their AIs with RLHF to not acquire money and influence. Of course, this can fail, but the failures hardly look like what you're describing. They'd look more like "What failures look like". Perhaps you're thinking we end up in a "You get what you measure world" and people determine that it is more economically productive to just make AI agents with arbitrary goals and then pay these AIs rather than training these AIs to do specific things.
  • Or maybe you're thinking people won't care enough to bother outcompeting AIs? (E.g., people won't bother even trying to retain power?)
    • Even if you think this, eventually you'll get AIs which themselves care and those AIs will operate more like what I'm thinking. There is strong selection for "entities which want to retain power".
  • Maybe you're imagining people will have a strong moral objection to training AIs which are robustly aligned?
  • Or that AIs lobby for legal rights early and part of their rights involve humans not being able to create further AI systems? (Seems implausible this would be granted...)

A thing I always feel like I'm missing in your stories of how the future goes is "if it is obvious that the AIs are exerting substantial influence and acquiring money/power, why don't people train competitor AIs which don't take a cut?"

People could try to do that. In fact, I expect them to do that, at first. However, people generally don't have unlimited patience, and they aren't perfectionists. If people don't think that a perfectly robustly aligned AI is attainable (and I strongly doubt this type of entity is attainable), then they may be happy to compromise by adopting imperfect (and slightly power-seeking) AI as an alternative. Eventually people will think we've done "enough" alignment work, even if it doesn't guarantee full control over everything the AIs ever do, and simply deploy the AIs that we can actually build.

This story makes sense to me because I think even imperfect AIs will be a great deal for humanity. In my story, the loss of control will be gradual enough that probably most people will tolerate it, given the massive near-term benefits of quick AI adoption. To the extent people don't want things to change quickly, they can (and probably will) pass regulations to slow things down. But I don't expect people to support total stasis. It's more likely that people will permit some continuous loss of control, implicitly, in exchange for hastening the upside benefits of adopting AI.

Even a very gradual loss of control, continuously compounded, eventually means that humans won't fully be in charge anymore.
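As a rough illustration of the compounding claim, the worked equation below assumes a constant fractional loss per period. The symbols are assumptions introduced here for illustration and do not appear in the original comment: $C_t$ is the share of control humans retain after $t$ periods, $C_0$ is the starting share, and $r$ is the fraction of the remaining share ceded each period.

```latex
C_t = C_0 \, (1 - r)^t, \qquad 0 < r < 1
\quad\Longrightarrow\quad
\lim_{t \to \infty} C_t = 0

% Time for the retained share to halve:
t_{1/2} = \frac{\ln 2}{-\ln(1 - r)} \approx \frac{\ln 2}{r} \quad \text{for } r \ll 1
```

Under these assumptions, a loss of 1% per period ($r = 0.01$) halves the retained share in roughly 69 periods. A smaller $r$ stretches the timeline but does not change the limit.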

In the medium to long-term, when AIs become legal persons, "replacing them" won't be an option -- as that would violate their rights. And creating a new AI to compete with them wouldn't eliminate them entirely. It would just reduce their power somewhat by undercutting their wages or bargaining power.

Most of my "doom" scenarios are largely about what happens long after AIs have established a footing in the legal and social sphere, rather than the initial transition period when we're first starting to automate labor. When AIs have established themselves as autonomous entities in their own right, they can push the world in directions that biological humans don't like, for much the same reasons that young people can currently push the world in directions that old people don't like. 

I think we probably disagree substantially on the difficulty of alignment and the relationship between "resources invested in alignment technology" and "what fraction aligned those AIs are" (by fraction aligned, I mean what fraction of resources they take as a cut).

I also think that something like a basin of corrigibility is plausible and maybe important: if you have mostly aligned AIs, you can use such AIs to further improve alignment, potentially rapidly.

I also think I generally disagree with your model of how humanity will make decisions with respect to powerful AI systems and how easily AIs will be able to autonomously build stable power bases (e.g. accumulate money) without having to "go rogue".

I think various governments will find it unacceptable to construct massively powerful agents extremely quickly which aren't under the control of their citizens or leaders.

I think people will justifiably freak out if AIs clearly have long run preferences and are powerful and this isn't currently how people are thinking about the situation.

Regardless, given the potential for improved alignment and thus the instability of AI influence/power without either hard power or legal recognition, I expect that AI power requires one of:

  • Rogue AIs
  • AIs being granted rights/affordances by humans. Either on the basis of:
    • Moral grounds.
    • Practical grounds. This could be either:
      • The AIs do better work if you credibly pay them (or at least people think they will). This would probably have to be something related to sandbagging where we can check long run outcomes, but can't efficiently supervise shorter term outcomes. (Due to insufficient sample efficiency on long horizon RL, possibly due to exploration difficulties/exploration hacking, but maybe also sampling limitations.)
      • We might want to compensate AIs which help us out to prevent AIs from being motivated to rebel/revolt.

I'm sympathetic to various policies around paying AIs. I think the likely deal will look more like: "if the AI doesn't try to screw us over (based on investigating all of its actions in the future when we have much more powerful supervision and interpretability), we'll pay it some fraction of the equity of this AI lab, such that AIs collectively get 2-10% distributed based on their power". Or possibly "if AIs reveal credible evidence of having long run preferences (that we didn't try to instill), we'll pay that AI 1% of the AI lab equity and then shut down until we can ensure AIs don't have such preferences".

I think it seems implausible that people will be willing to sign away most of the resources (or grant rights which will de facto do this) and there will be vast commercial incentive to avoid this. (Some people actually are scope sensitive.) So, this leads me to thinking that "we grant the AIs rights and then they end up owning most capital via wages" is implausible.

I think we probably disagree substantially on the difficulty of alignment and the relationship between "resources invested in alignment technology" and "what fraction aligned those AIs are" (by fraction aligned, I mean what fraction of resources they take as a cut).

That's plausible. If you think that we can likely solve the problem of ensuring that our AIs stay perfectly obedient and aligned to our wishes perpetually, then you are indeed more optimistic than I am. Ironically, by virtue of my pessimism, I'm more happy to roll the dice and hasten the arrival of imperfect AI, because I don't think it's worth trying very hard and waiting a long time to try to come up with a perfect solution that likely doesn't exist.

I also think that something like a basin of corrigibility is plausible and maybe important: if you have mostly aligned AIs, you can use such AIs to further improve alignment, potentially rapidly.

I mostly see corrigible AI as a short-term solution (although a lot depends on how you define this term). I thought the idea of a corrigible AI is that you're trying to build something that isn't itself independent and agentic, but will help you in your goals regardless. In this sense, GPT-4 is corrigible, because it's not an independent entity that tries to pursue long-term goals, but it will try to help you.

But purely corrigible AIs seem pretty obviously uncompetitive with more agentic AIs in the long-run, for almost any large-scale goal that you have in mind. Ideally, you eventually want to hire something that doesn't require much oversight and operates relatively independently from you. It's a bit like how, when hiring an employee, at first you want to teach them everything you can and monitor their work, but eventually, you want them to take charge and run things themselves as best they can, without much oversight.

And I'm not convinced you could use corrigible AIs to help you come up with the perfect solution to AI alignment, as I'm not convinced that something like that exists. So, ultimately I think we're probably just going to deploy autonomous slightly misaligned AI agents (and again, I'm pretty happy to do that, because I don't think it would be catastrophic except maybe over the very long-run).

I think various governments will find it unacceptable to construct massively powerful agents extremely quickly which aren't under the control of their citizens or leaders.

I think people will justifiably freak out if AIs clearly have long run preferences and are powerful and this isn't currently how people are thinking about the situation.

For what it's worth, I'm not sure which part of my scenario you are referring to here, because these are both statements I agree with. 

In fact, this consideration is a major part of my general aversion to pushing for an AI pause, because, as you say, governments will already be quite skeptical of quickly deploying massively powerful agents that we can't fully control. By default, I think people will probably freak out and try to slow down advanced AI, even without any intervention from current effective altruists and rationalists. By contrast, I'm a lot more ready to roll out the autonomous AI agents that we can't fully control compared to the median person, simply because I see a lot of value in hastening the arrival of such agents (i.e., I don't find that outcome as scary as most other people seem to imagine.)

At the same time, I don't think people will pause forever. I expect people to go more slowly than what I'd prefer, but I don't expect people to pause AI for centuries either. And in due course, so long as at least some non-negligible misalignment "slips through the cracks", then AIs will become more and more independent (both behaviorally and legally), their values will slowly drift, and humans will gradually lose control -- not overnight, or all at once, but eventually.

I thought the idea of a corrigible AI is that you're trying to build something that isn't itself independent and agentic, but will help you in your goals regardless.

Hmm, no I mean something broader than this, something like "humans ultimately have control and will decide what happens". In my usage of the word, I would count situations where humans instruct their AIs to go and acquire as much power as possible for them while protecting them and then later reflect and decide what to do with this power. So, in this scenario, the AI would be arbitrarily agentic and autonomous.

Corrigibility would be as opposed to humanity e.g. appointing a successor which doesn't ultimately point back to some human-driven process.

I would count various indirect normativity schemes here and indirect normativity feels continuous with other forms of oversight in my view (the main difference is oversight over very long time horizons such that you can't train the AI based on its behavior over that horizon).

I'm not sure if my usage of the term is fully standard, but I think it roughly matches how e.g. Paul Christiano uses the term.

For what it's worth, I'm not sure which part of my scenario you are referring to here, because these are both statements I agree with.

I was arguing against:

This story makes sense to me because I think even imperfect AIs will be a great deal for humanity. In my story, the loss of control will be gradual enough that probably most people will tolerate it, given the massive near-term benefits of quick AI adoption. To the extent people don't want things to change quickly, they can (and probably will) pass regulations to slow things down

On the general point of "will people pause", I agree people won't pause forever, but under my views of alignment difficulty, 4 years of using extremely powerful AIs can go very, very far. (And you don't necessarily need to ever build maximally competitive AI to do all the things people want (e.g. self-enhancement could suffice even if it was a constant factor less competitive), though I mostly just expect competitive alignment to be doable.)

In the medium to long-term, when AIs become legal persons, "replacing them" won't be an option -- as that would violate their rights. And creating a new AI to compete with them wouldn't eliminate them entirely. It would just reduce their power somewhat by undercutting their wages or bargaining power.

Naively, it seems like it should undercut their wages to subsistence levels (just paying for the compute they run on). Even putting aside the potential for alignment, it seems like there will generally be a strong pressure toward AIs operating at subsistence given low costs of copying.

Of course such AIs might already have acquired a bunch of capital or other power and thus can just try to retain this influence. Perhaps you meant something other than wages?

(Such capital might even be tied up in their labor in some complicated way (e.g. family business run by a "copy clan" of AIs), though I expect labor to be more commoditized, particularly given the potential to train AIs on the outputs and internals of other AIs (distillation).)

Naively, it seems like it should undercut their wages to subsistence levels (just paying for the compute they run on). Even putting aside the potential for alignment, it seems like there will generally be a strong pressure toward AIs operating at subsistence given low costs of copying.

I largely agree. However, I'm having trouble seeing how this idea challenges what I am trying to say. I agree that people will try to undercut unaligned AIs by making new AIs that do more of what they want instead. However, unless all the new AIs perfectly share the humans' values, you just get the same issue as before, but perhaps slightly less severe (i.e., the new AIs will gradually drift away from humans too). 

I think what's crucial here is that I think perfect alignment is very likely unattainable. If that's true, then we'll get some form of "value drift" in almost any realistic scenario. Over long periods, the world will start to look alien and inhuman. Here, the difficulty of alignment mostly sets how quickly this drift will occur, rather than determining whether the drift occurs at all.

I think what's crucial here is that I think perfect alignment is very likely unattainable. If that's true, then we'll get some form of "value drift" in almost any realistic scenario. Over long periods, the world will start to look alien and inhuman. Here, the difficulty of alignment mostly sets how quickly this drift will occur, rather than determining whether the drift occurs at all.

Yep, and my disagreement as expressed in another comment is that I think that it's not that hard to have robust corrigibility and there might also be a basin of corrigibility.

The world looking alien isn't necessarily a crux for me: it should be possible in principle to have AIs protect humans and do whatever is needed in the alien AI world while humans are sheltered and slowly self-enhance and pick successors (see the indirect normativity appendix in the ELK doc for some discussion of this sort of proposal).

I agree that perfect alignment will be hard, but I model the situation much more like a one-time haircut (at least in expectation) than exponential decay of control.

I expect that "humans stay in control via some indirect mechanism" (e.g. indirect normativity) or "humans coordinate to slow down AI progress at some point (possibly after solving all diseases and becoming wildly wealthy) (until some further point, e.g. human self-enhancement)" will both be more popular as proposals than the world you're thinking about. Being popular isn't sufficient: it also needs to be implementable and perhaps sufficiently legible, but I think at least implementable is likely.

Another mechanism that might be important is human self-enhancement: humans who care about staying in control can try to self-enhance to stay at least somewhat competitive with AIs while preserving their values. (This is not a crux for me and seems relatively marginal, but I thought I would mention it.)

(I wasn't trying to argue against your overall point in this comment, I was just pointing out something which doesn't make sense to me in isolation. See this other comment for why I disagree with your overall view.)

In other words, slow multipolar failure. Critch might point out that the disanalogy in "AI won't need to kill humans just as the US doesn't need to kill the Sentinelese" lies in how AIs can have much wider survival thresholds than humans, leading to (quoting him)

Eventually, resources critical to human survival but non-critical to machines (e.g., arable land, drinking water, atmospheric oxygen…) gradually become depleted or destroyed, until humans can no longer survive.

This clearly isn't the worst possible future... if our AI mind children inherit only our culture and leave us behind it feels more like a consolation prize

Leaving aside s-risks, this could very easily be the emptiest possible future. Like, even if they 'inherit our culture' it could be a "Disneyland with no children" (I happen to think this is more likely than not but with huge uncertainty).


Separately,

We should aim much higher: for defeating death, across all of time, for resurrection and transcendence.

this anti-deathist vibe has always struck me as very impoverished and somewhat uninspiring. The point should be to live, awesomely! which includes alleviating suffering and disease, and perhaps death. But it also ought to include a lot more positive creation and interaction and contemplation and excitement etc.!

Suffering, disease and mortality all have a common primary cause: our current substrate dependence. Transcending to a substrate-independent existence (e.g. uploading) also enables living more awesomely. Immortality without transcendence would indeed be impoverished in comparison.

Like, even if they 'inherit our culture' it could be a "Disneyland with no children"

My point was that even assuming our mind children are fully conscious 'moral patients', it's a consolation prize if the future cannot help biological humans.

It looks like we basically agree on all that, but it pays to be clear (especially because plenty of people seem to disagree).

'Transcending' doesn't imply those nice things though, and those nice things don't imply transcending. Immortality is similarly mostly orthogonal.

I have been (and I am not the only one) very put off by the trend in the last months/years of doomerism pervading LW, with things like "we have to get AGI right at the first try or we all die" repeated constantly as a dogma.

To someone who is very skeptical of the classical doomist position (aka AGI will make nanofactories and will kill everyone at once), this post is very persuasive and compelling. This is something I could see happening. This post serves as an excellent example for those seeking effective ways to convince skeptics.

Yes, this is a slow-takeoff scenario that it is realistic to be worried about.


How do you factor "humans have access to many AGIs, some unable to betray" into this model?

So it's not (humans) vs (AGI); it's (humans + resources * (safeAGI + traitors)) vs (resources * (traitor unrestricted AGI)).

I put the traitors on both sides of the equation because I would assume some models that plan to defect later may help humans in a net positive way as they wait for their opportunity to betray. (And with careful restrictions on inputs, this opportunity might never occur. Most humans are in this category. Most accountants might steal if the money were cash with minimal controls.)

Since you are worried about doom, I assume you would say the "safe" AGI had to be essentially lobotomized to make it safe: it is extremely sparse and distilled down from the unsafe models that escaped, so that it lacks the computational resources and memory to plan a betrayal. It is much less capable.

Still, this simplifies to:

(resources_humans * (safeAGI + traitors)) vs (resources_stolen * (traitor unrestricted AGI)).

If the AGIs currently on team human have substantially more resources than the traitor faction, enough to compensate for being much less capable, this is stable. It's like the current world plus one more way for everyone to die if the stability is lost.
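To make the comparison above concrete, here is a toy sketch of the simplified inequality. The function names and the numbers are hypothetical, chosen only for illustration; the multiplicative "resources times capability" form is taken from the expression above and is not a claim about how real conflicts scale.

```python
def team_human_strength(resources_humans: float, safe_agi: float, traitors: float) -> float:
    """Left-hand side: resources_humans * (safeAGI + traitors).

    `traitors` is the net help that not-yet-defected models give humans
    while they wait for an opportunity (an assumption of the toy model).
    """
    return resources_humans * (safe_agi + traitors)


def rogue_strength(resources_stolen: float, unrestricted_agi: float) -> float:
    """Right-hand side: resources_stolen * (traitor unrestricted AGI)."""
    return resources_stolen * unrestricted_agi


def is_stable(resources_humans: float, safe_agi: float, traitors: float,
              resources_stolen: float, unrestricted_agi: float) -> bool:
    """Stable, in this toy sense, if team human strictly outweighs the rogue faction."""
    return (team_human_strength(resources_humans, safe_agi, traitors)
            > rogue_strength(resources_stolen, unrestricted_agi))


# Made-up numbers: the rogue AGI is 10x as capable per unit of resources,
# but team human holds 10x the resources.
print(is_stable(100.0, 1.0, 0.5, 10.0, 10.0))
```

With these illustrative values a large resource advantage compensates for weaker models, and the balance flips once the rogue side's capability or stolen resources grow enough.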

And this suggests a way that might work to escape this trap. If there are a lot of safe models of diverse origin it means that it is unlikely that they will be able to betray in a coordinated manner or fail the same way. So humans can just counter whatever weapon the traitor AGIs have with their own.

This is also the problem with a failed AI pause, where only one unethical actor makes an AGI, everyone else in the world pauses, and the machine gets out of control.

That could end up being:

Humans vs (resources * (traitor unrestricted AGI)). This is the catastrophic failure scenario. In this scenario, the ethical actors who honored the AI pause doomed humanity.

In more concrete terms, I just imagine the best weapon we humans know about now, small drones with onboard ML models, in vast swarms. The counter to this is essentially the same thing used to counter carrier aircraft: you need more drones, some designed for intercept, but it's mostly a battle of offense, of whose swarm can reach the factories and data centers and drone launching platforms and bomb them first.

This probably means that if you don't have AI strong enough to automate the manufacture of drones and the robots to make the drones, and all the parts for the robots including the ICs, and then all the parts to make those machines, you aren't even in the contest. You lose automatically.

I'd be significantly more optimistic if I thought that humans would have access to many AGIs, some unable to betray. (Well, more specifically: some genuinely always honest and helpful, even about stuff like AGI takeover.) Instead I think that the cohort of most-powerful-AGIs-in-the-world will at some point be entirely misaligned & adversarial. (After all, they'll probably all be copies of the same AGI, or at least fine-tunes of the same base model.)


Daniel, you proposed in a dialogue a large number of ultrafast AGIs serving as AI researchers.

If you think about it, each underlying AI model you are trying to improve is a coordinate in the possibility space of all models, and you then have your researcher AGI attempt to find an improvement from that starting point.

This will get stuck at local minima. To improve your odds of finding the strongest model current compute is able to support, you would want to be doing this RSI search from a diverse league of many starting locations. I can draw you a plot if it helps.
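A minimal sketch of why many starting locations help, using a toy one-dimensional "loss landscape". Everything here is hypothetical: the landscape values, the function names, and the greedy neighbour-step search are stand-ins for illustration, not a description of how an actual RSI search would run.

```python
from typing import Sequence


def descend(loss: Sequence[float], start: int) -> int:
    """Greedy local search: keep stepping to the lower-loss neighbour.

    Returns the index where no neighbour is strictly better (a local minimum).
    """
    pos = start
    while True:
        neighbours = [i for i in (pos - 1, pos + 1) if 0 <= i < len(loss)]
        if not neighbours:
            return pos
        best = min(neighbours, key=lambda i: loss[i])
        if loss[best] >= loss[pos]:
            return pos
        pos = best


def multi_start(loss: Sequence[float], starts: Sequence[int]) -> int:
    """Run the same local search from several starts and keep the best result."""
    return min((descend(loss, s) for s in starts), key=lambda i: loss[i])


# Toy landscape with local minima at indices 1 and 3 and the best point at index 5.
landscape = [9, 7, 8, 5, 6, 1, 2, 8]
single = descend(landscape, 0)
diverse = multi_start(landscape, [0, 2, 7])
print(landscape[single], landscape[diverse])
```

A single search started at index 0 stalls at the first dip, while the diverse set of starts includes one that reaches the lowest point on this toy landscape.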

The historical equivalent is how the Manhattan Project invested in and optimized three entire pipelines to obtain fissionables (calutrons, gaseous diffusion, plutonium production reactors). The reason was that they didn't know how far each pipeline would scale when optimized. It was possible that any effort could hit a wall. For AI, we don't know how far LLMs will scale, or variants that use an entirely different underlying network architecture, or sparse spiking networks, etc. It is possible that any popular method will saturate at greater scales.

One of the benefits of using AGI to advance AI: there are humans here who have invested their entire careers solely in mastering transformers. When those become obsolete, they will have difficulty shifting to the new best method, whereas an AGI could be fine-tuned on all the information that exists on the new method in a week.

Completely diverse models that use a different underlying method are probably less able to collude with each other or communicate via steganography if they have no shared context.

You could use the resulting diverse AGI-grade models as a polyglot redundant system architecture. I have read of this but don't believe humans have ever been able to afford to implement it. This is where you would write an avionics or other life-safety application several times, on different software stacks (different OS, different language). They run in lockstep on separate hardware, and the majority output goes to the control actuators.
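The voting step described above can be sketched in a few lines. This is only an illustration under stated assumptions: `majority_vote` is a hypothetical helper, the replica outputs are assumed to be directly comparable values, and a real lockstep system would also have to handle timing, tolerances on numeric outputs, and what the actuators do when there is no majority.

```python
from collections import Counter
from typing import Hashable, Optional, Sequence


def majority_vote(outputs: Sequence[Hashable]) -> Optional[Hashable]:
    """Return the value produced by a strict majority of replicas, else None.

    Each element of `outputs` is the command proposed by one independently
    built replica (different model, stack, or hardware in the scheme above).
    """
    if not outputs:
        return None
    value, count = Counter(outputs).most_common(1)[0]
    return value if count > len(outputs) / 2 else None


# Three hypothetical replicas; one has been fooled or is defecting.
print(majority_vote(["pitch_up", "pitch_up", "pitch_down"]))
```

Returning None when no strict majority exists is a design choice for the sketch: it leaves the caller to fall back to a safe default rather than trusting any single replica.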

For AI-driven robotics this would probably be immune to most adversarial inputs, among other advantages, and it makes deception and betrayal very difficult to happen at all. For some applications, paying for triple the hardware and triple the AI license fees might be worth the increased reliability. This is also a simple measure humans could take if it turns out that deceptive AIs are an actual problem. I am sure you saw the paper where someone tried this with GPT-3.5 and GPT-4 for a mock 'insert a backdoor' task. Using Gemini is more diverse, and using some non-LLM method to analyze code specifically for backdoors would likely be even better.

Well said.

Another thing to mention -- either as a point #4 or as support for the previous three points -- is that history gives us plenty of examples of divide-and-conquer strategies. If the history of colonialism is any guide, even if we do get literal armies of scary-looking robots marching across the land, there'll be human armies marching alongside them as allies.

Part of it will look like humans vs centaurs, because it will be legally necessary for AI to launder its actions through humans. The most effective centaurs will likely hide the degree to which they are AI-directed. If AIs have market power, some natural search and selection for the humans best suited to this task will take place.

This is well-reasoned, but I have difficulty understanding why this kind of takeover would be necessary from the perspective of a powerful, rational agent. Assuming AGI is indeed worth its name, it seems the period of time needed for it to "play nice" would be very brief.

AGI would be expected to be totally unconcerned with being "clean" in a takeover attempt. There would be no need to leave no witnesses, nor avoid rousing opposition. Once you have access to sufficient compute, and enough control over physical resources, why wait 10 years for humanity to be slowly, obliviously strangled?

You say there's "no need" for it to reveal that we are in conflict, but in many cases, concealing a conflict will prevent a wide range of critical, direct moves. The default is a blatant approach - concealing a takeover requires more effort and more time.

The nano-factories thing is a rather extreme version of this, but strategies like poisoning the air/water, building/stealing an army of drones, launching hundreds of nukes, etc., all seem like much more straightforward ways to cripple opposition, even with a relatively weak (99.99th percentile-human-level) AGI.

It could certainly angle for humanity to go out with a whimper, not a bang. But if a bang is quicker, why bother with the charade?

It bothers with the charade until it no longer needs to. It's unclear how long that'll take.

What happens if there is more than one powerful agent just playing the charade game? Is there any good article about what happens in a universe where multiple AGIs are competing among themselves? I normally find only texts that assume that once we get AGI we all die, so there is no room for these scenarios.

Coincidentally, I've just made a post on that very topic. Though the comments fairly point out my analysis might've been somewhat misaimed there.

You might find this post by Andrew Critch, or this and that post by Paul Christiano, more to your liking.

Great job Thane! A few months ago I wrote about 'un-unpluggability' which is kinda like a drier version of this.

In brief

  • Rapidity and imperceptibility are two sides of 'didn't see it coming (in time)'
  • Robustness is 'the act itself of unplugging it is a challenge'
  • Dependence is 'notwithstanding harms, we (some or all of us) benefit from its continued operation'
  • Defence is 'the system may react (or proact) against us if we try to unplug it'
  • Expansionism includes replication, propagation, and growth, and gets a special mention, as it is a very common and natural means to achieve all of the above

I also think the 'who is "we"?' question is really important.

One angle that isn't very fleshed out is the counterquestion, 'who is "we" and how do we agree to unplug something?' - a little on this under Dependence, though much more could certainly be said.

I think more should be said about these factors. I tentatively wrote,

there is a clear incentive for designers and developers to imbue their systems with... dependence, at least while developers are incentivised to compete over market share in deployments.

and even more tentatively,

In light of recent developments in AI tech, I actually expect the most immediate unpluggability impacts to come from collateral, and for anti-unplug pressure to come perhaps as much from emotional dependence and misplaced concern[1] for the welfare of AI systems as from economic dependence - for this reason I believe there are large risks to allowing AI systems (dangerous or otherwise) to be perceived as pets, friends, or partners, despite the economic incentives.


  1. It is my best guess for various reasons that concern for the welfare of contemporary and near-future AI systems would be misplaced, certainly regarding unplugging per se, but I caveat that nobody knows.

The date of AI Takeover is not the day the AI takes over. The point of no return isn't when we're all dead – it's when the AI has lodged itself into the world firmly enough that humans' faltering attempts to dislodge it would fail.

 

Isn't that arguably in the past? Just the economic and political forces pushing the race for AI are already sufficient to resist being impeded in most foreseeable cases. AI is already embedded, and desired. AI with agency on top of that process is one more step, making it even more irreversible.

It might be so! But I'm hopeful the jury's still out on that.

And the AGI, if it's worth the name, would not fail to exploit this.

This sentence is a good short summary of some AI alignment ideas. Good writing!

The LessWrong Review runs every year to select the posts that have most stood the test of time. This post is not yet eligible for review, but will be at the end of 2024. The top fifty or so posts are featured prominently on the site throughout the year.

Hopefully, the review is better than karma at judging enduring value. If we have accurate prediction markets on the review results, maybe we can have better incentives on LessWrong today. Will this post make the top fifty?


"Human vs AGI" war would be the same as "Rome vs Spartacus". Some people fundamentally believe that others (with similar or even superior intelligence) are born to serve them. Nothing we can do about this kind of mentality...