
Epistemic status: Maybe I'm missing something here. Normally I'd publish this on my own blog, but I'd love to get some feedback before I signal-boost it.

I've seen a lot of debates around AGI risk hinge on the question of whether or not AGI will be "agentic" in a meaningful sense, and, subsequently, on whether AGI can be aligned well enough not to accidentally or intentionally kill all humans.

I personally buy the argument that, if something like AGI comes about, it will be alien in terms of its goals and necessarily agentic. It seems obvious to me. But it doesn't seem obvious to many people, and it seems like a very bad hill to die on, because it's entirely unnecessary.


What the AGI risk debate really hinges on, in my view, is the limits of intelligence itself: whether knowledge about the world, reason, and thinking speed are sufficient to gain a large edge in influencing the material world.

I doubt this is true, but if you believe it is, then reaching the conclusion that AGI will drive the human race extinct needn't require any agency or ill intent on the part of this intelligent system.

Once you assume that an AGI can, say:

  • Construct a virus more virulent and deadly than smallpox that persists for hundreds of years on surfaces
  • Design self-replicating nanobots that can make use of almost any substrate to propagate themselves
  • Genetically engineer an archaeon that, once seeded into the oceans, will quickly fill the atmosphere with methane

then the question of whether or not it will "want" to do these things becomes about as moot as the question of whether a nuclear weapon "wants" to murder a large number of Japanese civilians.

There will be people, millions if not billions of them, who will at one time or another desire to murder the entire human race. There are entire cults, with thousands or tens of thousands of members each, that would like to see most humans eradicated. Cults that have tried building nuclear weapons and attacking civilian centers with bioweapons.

Every day, across the world, terrorists, from those in structured organizations to random 12-year-olds shooting up schools, try to inflict maximal harm upon their fellow humans, with no regard for their own safety, out of a primordial need for vengeance for some abstract feeling of having been wronged.

These people would gladly use, support, encourage, and pour all of their resources into an AGI system that could aid them in destroying humanity.


And this is not counting people who are indifferent to, or mildly in favor of, the human race going extinct by omission, people whose fear for their own hides makes them engage in unethical actions. Which likely includes you and me, dear friend.

If you own, say, any part of a wide-reaching enough Vanguard or BlackRock ETF, you are, at this very moment, contributing to corporations that are poisoning the skies and the oceans, or building weapons that escalate conflicts via their mere need to be sold.

If you have ever bought a fun but useless toy from China, you've contributed to the launch of an inefficient shipping vessel abusing international waters to burn the vilest of chemicals to propel its pointless cargo forward.

And this is not to mention our disregard for the inherent worthwhileness of consciousness and our indifference to, or encouragement of, inflicting suffering upon others.

That same toy from China, the one that provides no real value to you or anyone, has helped build concentration camps.

If you've ever had unprotected sex, you've taken a huge gamble on creating a new human being with no regard for whether that human being will have a life of misery.

If you've ever eaten meat from an animal that you believe is conscious, without considering the conditions of its farming, you've been literally paying torturers to inflict suffering upon a conscious life for a moment of mild culinary delight.

Indeed, if you've ever spent money on anything without considering its long-term implications, you've frivolously thrown dice into the wind, letting them dictate the course of the entire human race in a direction that may well not be beneficial to its flourishing.

And please don't take all of this as a "holier than thou" type of argument, because it's not; the above is simply a list of things I do that I know are bad and do regardless. Why? I couldn't tell you coherently. Maybe it's inertia, maybe it's fear, maybe I'm a bad person. But regardless, you and all those dear to you probably do them too.

One way to translate this into a thought experiment: most of us, if presented with a button that would yield our deepest desire, be that love or money or power or whatever, in exchange for a tiny 0.01% risk of all humans going extinct the next moment, would gladly press that button. And if we wouldn't, we'd gladly press it once that 0.01% risk of all humans going extinct is obfuscated by sufficient layers of abstraction.


AGI is risky because it increases our capabilities to act upon the world, and most of our actions have nothing to do with the preservation of the human species or of our civilizations. Our existence as a species is predicated on such a fragile balance that, by simply acting randomly and with enough force, we are bound to at one time or another push hard enough in a direction that we'll be dead before we even realize where we're headed.

We have already done this many times; the most famous instance might have been detonating our first nuclear device after imperfect calculations of the then-seemingly-reasonable possibility that it would start a chain reaction destroying our atmosphere and, together with it, in mere minutes, every single human.

We currently don't have the tools to push hard enough, so we can notice that the changes we inflict upon the world are negative and act accordingly. But with increased capabilities comes increased risk, and if our capabilities advance at an amazing rate, we'll be taking a "will the nuclear weapon light up the atmosphere" style gamble every year, month, day, or even minute, for the most frivolous of reasons.
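To make the compounding concrete, here is a minimal sketch of how such gambles stack up. It assumes the gambles are independent, borrows the illustrative 0.01% per-gamble figure from the button thought experiment above, and picks a 50-year horizon purely for illustration; none of these numbers are forecasts, and the same arithmetic applies whether the gambles come from one actor repeating them or from many actors each gambling once.

```python
# Minimal sketch: how small, independent extinction gambles compound.
# The 0.01% per-gamble risk and the 50-year horizon are illustrative
# assumptions taken from the text above, not forecasts.

PER_GAMBLE_RISK = 0.0001  # 0.01% chance of extinction per gamble


def survival_probability(gambles_per_year: float, years: float) -> float:
    """Probability that no gamble ends in extinction, assuming independence."""
    n_gambles = gambles_per_year * years
    return (1 - PER_GAMBLE_RISK) ** n_gambles


for label, per_year in [("yearly", 1), ("monthly", 12), ("daily", 365)]:
    print(f"{label:>7} gambles for 50 years: "
          f"{survival_probability(per_year, 50):.1%} chance we survive")

# Expected output (approximately):
#  yearly gambles for 50 years: 99.5% chance we survive
# monthly gambles for 50 years: 94.2% chance we survive
#   daily gambles for 50 years: 16.1% chance we survive
```

The point is only that frequency does the damage: a risk small enough to feel negligible per occasion becomes a coin flip, and then a near-certainty, once it is taken often enough.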


All of this is not to say that AGI will actually lead to those kinds of capability increases. I'm pretty well established in the camp that thinks it won't, or that by the time it does, the world will be weird enough that whatever we do now to prevent this will be of no consequence.

Nor do I think that the "systemic" take on AGI risk is the best stance to take in a public debate, since most people will resolutely refuse to believe they are acting in immoral ways and will contrive the most complex of internal defenses to obfuscate such facts.

However, the "there are death cults" take seems easier to defend; after all, we know that the capabilities to destroy and to protect are highly asymmetrical, so destruction is easier. At that point we'd be more clearly debating the capabilities of an AGI system, as opposed to debating whether the system would agentically use those capabilities to do us harm.

I assume that most people arguing for doomsday AGI might predict that such an AGI will so trivially "take the controls" from the hands of humans that it will be more reasonable to think about "it" and ignore any human desires or actions. I for one agree, and I think this argument even applies to tool-like AGIs insofar as the way we use them is through systems similar to markets and governments, systems with little regard for long-term human welfare.

But alas, this is a harder-to-stomach and harder-to-grasp take. I personally find it harder to reason about because there are a lot of added "ifs". So why not strip everything away and just think about the capability of AGI to do harm once it is in the hands of people who wish nothing more than to do harm? Such people are plentiful, and there's no reason to think they'd be excluded from using or building these sorts of systems, even if they'd be a few years behind the SOTA.

Comments

I agree it's a very significant risk which is possibly somewhat underappreciated in the LW community. 

I think all three situations are very possible and potentially catastrophic:

  1. Evil people do evil with AI
  2. Moloch goes Moloch with AI
  3. ASI goes ASI (FOOM etc.)

Arguments against (1) could be "evil people are stupid" and "terrorism is not about terror". 

Arguments against (1) and (2) could be "timelines are short" and "AI power is likely to be very concentrated". 

See my reply above; I don't think I'm bringing Moloch up here at all, but rather individuals being evil in ways that lead to both self-harm and systemic harm, which is an easier problem to fix, if still unsolvable.

"Evil people are stupid" is actually an argument for 1. It means we're equalising the field. If an AGI model leaks the way LLaMa did, we're giving the most idiotic and deranged members of our species a chance to simply download more brains from the Internet, and use them for whatever stupid thing they wanted in the first place.

Yes, we may not live to see superintelligence, because some Tool AI in the hands of a terrorist is enough to kill everybody. In that case, a moratorium on AI is bad.

Why is a moratorium bad in that case?

For reference, I disagree with the moratorium slightly, but for different reasons.

If we get stuck at the level of dangerous Tools, and assuming that a superintelligence would not kill us based on some long-term complex reasoning, e.g. a small chance that it is in a testing simulation.

I have seen this argument and I don't disagree; however, if the risk were only human rogue actors ordering the AGI around, then people might say that all we need to make it safe is to keep it under lock and key so that it can only be used with authorisation by thoroughly vetted people, like we do for many dangerous things. The AGI itself being agentic makes it clear how hard it would be to control, because you can't forbid it from using itself.

About the ability of intelligence to affect the world: I agree on being sceptical of nigh-magical abilities, but there obviously are very low-hanging fruits when it comes to killing most of humanity, especially bioweapons. Stuff for which even human intellect would be enough, given sufficient resources (which random death cultists don't have, but an AGI likely would).


I think your argument is valid, if rambling and meandering. Basically the question you are asking is "assuming there is an energy source within reach of many people that is powerful enough to destroy the earth, how long until this planet is gone, intentionally or accidentally?" and the answer is... "probably seconds".

Which is the rambly part?

When faced with a complex issue, it's tempting to seek out smaller, related problems that are easier to solve. However, fixating on these smaller problems can cause us to lose sight of the larger issue's root causes. For example, in the context of AI alignment, focusing solely on preventing bad actors from accessing advanced tool AI isn't enough. The larger problem of solving AI alignment must also be addressed to prevent catastrophic consequences, regardless of who controls the AI.

This has been well downvoted. I'm not sure why, so if anyone has feedback about what I said that wasn't correct, or how I said it, that feedback is more than welcome.

In the section about systemic risk, you should have referred to Moloch, which has been discussed recently in the context of AI risk by Boeree and Schmachtenberger, and by Forrest Landry (1, 2), among many others, and can be vaguely associated with Christiano's "first scenario of AI failure". Moloch has a tag on LW where something along these lines is also discussed.

Re: "destructive cult/malign actor risk" -- I absolutely agree.

The usual answer to this by folks like at OpenAI or Anthropic is that at first, RLHF or other techniques will help to prevent malign actors from applying A(G)I towards destructive ends, and then we will also ask the AI to help us solve this problem in a fundamental way (or, indeed, prove to them that the policy proposal by Yudkowsky, "shut down all AGI and only focus on narrow biomedical AI", is the safest thing to do).

That's why the push by various people and groups for open-sourcing AI is suicide. Especially open-sourcing checkpoints and non-RLHF'ed versions, because this open-source AI could easily be used by malign actors to develop weapons and destructive plans.

The typical counter-argument by these "open-source maximalists" is some handwaving like "if everyone has a powerful AI in their pocket, no single group could really harm the world much; the attack-defence balance will stay at approximately the same point where it is now". This argument is very bad because it's easily countered: for example, if one AI creates a super-virus (ultra-fast-spreading and ultra-deadly), another AI, even a superintelligent one, couldn't do much to stop its spread and prevent a massive pandemic.

Moloch is not, in my opinion, the correct analogy to use for systemic problems.

Moloch absolves humans of guilt for systemic problems by hand-waving them away as self-interest.

But systemic problems usually stem from things that are both bad when done in aggregate and bad for us as individuals. Hence my analogy used a button that would give one nothing of value, just something that, from our blind position, we would think to be of value.


I agree with the rest of your comment, but it's trivial to replicate the work by Anthropic or OpenAI in a focused direction. It's already been done.

Their main edge is the RLHF datasets they have, but those are good for safety (arguably, as you point out) and for capabilities insofar as they are interacting with humans who haven't been trained to use such systems.

So we do, and likely will, live in the hardest of worlds, where it's all open source.

"should have" referred to moloch is much too strong. certainly it's valid to refer to and it's a solid connection to make given that that's a name that has been given to the concept. but I think mentioning it in the comments as a contribution like you did is actually valid and not everyone has to know all the custom words folks use. "folks here have called that moloch" seems fine. strong downvote for this.

I do believe that authors should themselves do the legwork of connecting their frames with frames previously made by other people, to save readers the disproportionately larger cognitive effort of connecting the concepts in their heads and to prevent misinterpretations. In academia, this is called "citing prior work". Citing zero prior work is bad style and is correctly shunned, a la Wolfram.

See what I previously wrote; in my opinion, you should make an effort to read rather than pattern-match to existing concepts.

My new comment applies in general - notice that I mentioned "misinterpretations". If I made this misinterpretation originally, it probably means many other people did too, and to increase the percentage of people who interpret your text correctly, you would have done better to include a paragraph like "Note that this idea is distinct from Moloch, because ...", or "This idea is a spin on some earlier ideas, ...".

I maintain that "readers should read better and decipher and interpret correctly what I've written, and if they fail, so much the worse for them" is a bad attitude and strategy for academic and philosophical writing (even though it's widespread in different guises).

Well, I perfectly agree with you then. This is why I've never written anything I'd intend to publish in an academic setting nor anything I'd consider to be pure philosophy.