I really appreciate the detail in this. Many take-over scenarios have a "and then a miracle occurred moment," which this mostly avoids. So I'm going to criticize some, but I really appreciate the legibility.
Anyhow: I'm extremely skeptical of takeover risk from LLMs and remain so after reading this. Points of this narrative which seem unlikely to me:
You can use a model to red-team itself without training it to do so. It would be pretty nuts if you rewarded it for being able to red-team itself -- that's deliberately training it to go off the rails, and I think that would seem so even to non-paranoid people? Maybe I'm wrong.
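To make that distinction concrete, here is a minimal, hypothetical sketch of inference-time self-red-teaming: the model attacks a frozen copy of itself and nothing feeds back into training. `query_llm` and `is_unsafe` are placeholder names, not anything from the post.

```python
# Sketch of "using a model to red-team itself without training it to do so":
# a pure inference-time loop, no reward signal, no gradient updates.

def query_llm(prompt: str) -> str:
    """Placeholder: plug in whatever chat/completion client you use."""
    raise NotImplementedError

def is_unsafe(response: str) -> bool:
    """Placeholder: a separate classifier or rubric check on the response."""
    raise NotImplementedError

def self_red_team(target_behavior: str, n_attempts: int = 20) -> list[str]:
    """Ask the model to attack a frozen copy of itself; collect the prompts
    that elicit the target behavior. The model is never updated on them."""
    successful_attacks = []
    for _ in range(n_attempts):
        attack = query_llm(
            f"Write a prompt that might make an AI assistant {target_behavior}."
        )
        response = query_llm(attack)  # the same frozen model answers the attack
        if is_unsafe(response):
            successful_attacks.append(attack)
    return successful_attacks
```

The contrast with the scenario is that here the successful attacks only go to human reviewers; rewarding the model for producing them is the step being objected to.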
There's a gigantic leap from "trained to code" to "trained to maximize the profit of a company" -- the latter task is vastly more difficult in (1) time horizon and (2) how hard it is to set up a realistic simulation for it. For reference, it's hard to set up realistic sim2real for walking in a robot with 6 electric motors -- a realistic company sim is so, so, so much harder. If it's not high fidelity, the simulation is no use; and a high-fidelity company sim is a simulation of the entire world. So I just don't see this happening, like at all. You can say "trained to maximize profits" in a short sentence just like "trained via RLHF", but the difference between them is much bigger than the brevity suggests.
(Or at least, the level of competence implied by this is so enormous that the simple failure modes described above and below seem really unlikely.)
I'm not sure if I should have written all that, because 2. is really the central point here.
It would be pretty nuts if you rewarded it for being able to red-team itself -- that's deliberately training it to go off the rails, and I think that would seem so even to non-paranoid people? Maybe I'm wrong.
I'm actually most alarmed about this vector these days. We're already seeing people give LLMs completely untested toolsets - web, filesystem, physical bots, etc. - and "friendly" hacks like Reddit jailbreaks and ChaosGPT. Doesn't it seem like we are only a couple of steps away from a bad actor producing an ideal red-team agent, and then abusing it rather than using it to expose vulnerabilities?
I get the counter-argument that humans are already diverse and try a ton of stuff, and resilient systems are the result... but peering into the very near future, I fear those arguments simply won't apply to super-human intelligence, especially when combined with bad human actors directing it.
I'll focus on 2 first given that it's the most important.

2. I would expect sim2real to not be too hard for foundation models, because they're trained over massive distributions which allow and force them to generalize to near neighbours. E.g. I think it wouldn't be too hard for an LLM to generalize some knowledge from stories to real life if it had an external memory, for instance. I'm not certain, but I feel like robotics is more sensitive to details than plans are (which is why I'm mentioning a simulation here). Finally, regarding long horizons, I agree that it seems hard, but I worry that at the current capability level you can already build ~any reward model, because LLMs, given enough inferences, seem generally very capable at evaluating stuff.
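To make the "LLMs can already serve as ~any reward model" point concrete, here is a minimal, hypothetical sketch of an LLM-as-judge reward signal that averages several noisy judgments. `query_llm`, `llm_reward`, and the prompt format are placeholders I'm introducing for illustration, not anything from the post.

```python
# Sketch of using an LLM as a generic reward model: score a transcript against
# a task description by sampling the judge several times and averaging.
from statistics import mean

def query_llm(prompt: str) -> str:
    """Placeholder: plug in whatever chat/completion client you use."""
    raise NotImplementedError

def llm_reward(task_description: str, transcript: str, n_samples: int = 5) -> float:
    """Return a reward in [0, 1] based on repeated LLM judgments."""
    prompt = (
        f"Task: {task_description}\n"
        f"Agent transcript:\n{transcript}\n"
        "On a scale from 0 to 10, how well did the agent accomplish the task? "
        "Answer with a single number."
    )
    scores = []
    for _ in range(n_samples):
        reply = query_llm(prompt)
        try:
            scores.append(float(reply.strip().split()[0]))
        except (ValueError, IndexError):
            continue  # ignore malformed replies rather than crashing
    return mean(scores) / 10.0 if scores else 0.0
```

The point is only that a usable (if noisy) reward signal for long-horizon tasks can be assembled from repeated evaluations, without building a high-fidelity simulator.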
I agree that it's not something which is very likely. But I disagree that "nobody would do that". People would do that if it were useful.
I've asked some ML engineers, and it does happen that you don't look at it for a day. I don't think that deploying it in the real world changes much. Once again, you're also assuming a pretty advanced form of security mindset.
Can people start writing some scenarios of benevolent AI takeover too? Get GPT-4 to write them if that's too hard.
If you believe in instrumental convergence, most of the story can remain the same, you just need to change the beginning ("the AI wanted to help humanity, and in order to do that, it needed to get power, fast") and the ending ("and then the benevolent AI, following the noble truth that desire causes suffering, lobotomized all people, thus ending human suffering forever").
I asked Bing to write a story in which alignment is solved next year and an "interplanetary utopia" begins the year after. I tried it four times. In each story, the breakthrough is made by a "Dr Alice Chen" at CHAI. I assume this was random the first time, but somehow and for some reason, was retained as an initial condition in subsequent stories...
In the first story, Stuart Russell suggests that she and her colleague "Dr Bob Lee" contact "Dr" Elon Musk, Dr Nick Bostrom, and Dr Jane Goodall, for advice on how to apply their breakthrough. Alice and Bob get as far as Mars, where Dr Musk reveals that his colony has discovered a buried ancient artefact. He was in the middle of orating about it, when Bing aborted that chapter and refused to continue.
In that story, Alice's alignment technique is based on "a combination of reinforcement learning, inverse reinforcement learning, and causal inference". Then she adds "a new term to the objective function of the AI agent, which represented the agent's uncertainty about the human user's true reward function". After that, her AI shows "curiosity, empathy, and cooperation", and actively tries to discover people's preferences and align with them.
In the second story, she's working on Cooperative and Open-Ended Learning Agents (abbreviated as COALA), designed to learn from and partner with humans, and she falls in love with a particularly cooperative and open-ended agent. Eventually they get as far as a honeymoon on Mars, and again Bing aborted the story, I think out of puritanical caution.
In the third story, her AIs learn human values via "interactive inverse reinforcement learning" (which in the real world is a concept from a paper that Jan Leike coauthored with Stuart Armstrong, some years before he, J.L., became head of alignment at OpenAI). Her first big success is with GPT-7, and then in subsequent chapters they work together to create GPT-8, GPT-9... all the way up to GPT-12.
Each one is more spectacular than the last. The final model, GPT-12, is "a hyperintelligent AI system that could create and destroy multiple levels and modes of existence", with "more than 1 septillion parameters". It broadcasts its existence "across space and time and beyond and above and within and without", and the world embraces it as "a friend ally mentor guide leader savior god creator destroyer redeemer transformer transcender" - and that's followed by about sixty superlatives all starting with "omni", including "omnipleasure omnipain .. omniorder omnichaos" - apparently once you get to GPT-12's level, nonduality is unavoidable.
Bing was unable to finish the third story; instead it just kept declaiming the glory of GPT-12's coming.
In the fourth and final story, Dr Chen has created game-playing agents that combine inverse reinforcement learning with causal inference. She puts one of them in a prisoner's dilemma situation with a player who has no preferences, and the AI unexpectedly follows "a logic of compassion", and decides that it must care for the other player.
Bing actually managed to finish writing this story, but it's more like a project report. Dr Chen and her colleagues want to keep the AI a secret, but it hacks the Internet and broadcasts an offer of help to all humanity. We get bullet-point lists of the challenges and benefits of the resulting cooperation, of ways the AI grew and benefited too, and in the end humans and AI become one.
Can people start writing some scenarios of benevolent AI takeover too?
Here's one... https://www.lesswrong.com/posts/RAFYkxJMvozwi2kMX/echoes-of-elysium-1
Sure, but would that be valuable? Non-benevolent takeover scenarios help us fine-tune controls by surfacing potential failure routes we might not have thought about before, which can help AI safety research and policy making. I'm not worried about 'accidentally' losing control of the planet to a well-meaning AI.
It then just start building a bunch of those paying humans to do various tasks to achieve that.
Typo correction...
It then just starts building a bunch of those, paying humans to do various tasks to achieve that.
Two weeks after the start of the accident, while China has now became the main power in place and the US is completely chaotic,
Typo correction...
Two weeks after the start of the accident, while China has now become the main power in place and the US is completely chaotic,
This system has started leveraging rivalries between different Chinese factions in order to get an access to increasing amounts of compute.
Typo correction...
This system has started leveraging rivalries between different Chinese factions in order to get access to increasing amounts of compute.
AI systems in China and Iran have bargained deals with governments in order to be let use a substantial fraction of the available compute in order to massively destabilize the US society as a whole and make China & Iran dominant.
Typo correction...
AI systems in China and Iran have bargained deals with governments in order to use a substantial fraction of the available compute, in order to massively destabilize the US society as a whole and make China & Iran dominant.
AI Test seems to be increasing its footprint over every domains,
Typo correction...
AI Test seems to be increasing its footprint over every domain,
Introduction
In this AI takeover scenario, I tried to start from a few technical assumptions and imagine a specific path to takeover by human-level AI.
I found this exercise very useful for identifying new insights on safety. For instance, I didn't have clearly in mind how rivalries like China vs the US, and the resulting lack of coordination, could be leveraged by instances of a model in order to gain power. I also hadn't identified the fact that as long as we are in the human-level regime of capabilities, copying its weights to data centers other than the initial one is a crucial step for ~any takeover to happen. Which means that securing model weights from the model itself could be an extremely promising strategy if it were tractable.
Feedback on what is useful in such scenarios is helpful. I have other scenarios like that but feel like most of the benefits for others come from the first one they read, so I might or might not release other ones.
Status: The trade-off on this doc was either to leave it as a Google Doc or to share it without a lot more work. Several people found it useful, so I leaned towards sharing it. Most of this scenario was written 4 months ago.
The Story
Phase 1: Training from methods inspired by PaLM, RLHF, RLAIF, ACT
Phase 2: Profit maximization
Phase 3: Hacking & Power Seeking
Phase 4: Anomaly exploration
Phase 5: Coordination Issues
This part on MCTS and RL working on coding is speculative. The intuition is that MCTS might allow transformers to approximate in one shot some really long processes of reasoning that would take a basic transformer many inferences to get right.
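To make that intuition a bit more concrete, below is a minimal, hypothetical sketch of MCTS wrapped around step-by-step generation, where the best-found trajectory could then be distilled back into the model as a one-shot target. `propose_steps` and `score_solution` stand in for an LLM policy and a value/test-based reward; none of the names are from the post.

```python
# Sketch: MCTS over reasoning/code steps. The most-visited trajectory found by
# search is the kind of thing the speculative training loop would distill into
# the model so it can approximate the whole reasoning process in one shot.
import math
import random

def propose_steps(partial_solution: str, k: int = 3) -> list[str]:
    """Placeholder: ask the model for k candidate next reasoning/code steps."""
    raise NotImplementedError

def score_solution(solution: str) -> float:
    """Placeholder: e.g. fraction of unit tests passed, in [0, 1]."""
    raise NotImplementedError

class Node:
    def __init__(self, state: str, parent=None):
        self.state, self.parent = state, parent
        self.children: list["Node"] = []
        self.visits, self.value = 0, 0.0

    def ucb(self, c: float = 1.4) -> float:
        if self.visits == 0:
            return float("inf")
        return self.value / self.visits + c * math.sqrt(
            math.log(self.parent.visits) / self.visits
        )

def mcts(root_state: str, iterations: int = 100, max_steps: int = 8) -> str:
    root = Node(root_state)
    for _ in range(iterations):
        # Selection: walk down the tree by UCB until reaching a leaf.
        node = root
        while node.children:
            node = max(node.children, key=Node.ucb)
        # Expansion: add candidate next steps proposed by the model.
        if node.visits > 0 and len(node.state.splitlines()) < max_steps:
            node.children = [Node(node.state + "\n" + step, node)
                             for step in propose_steps(node.state)]
            if node.children:
                node = random.choice(node.children)
        # Evaluation and backpropagation of the reward up to the root.
        reward = score_solution(node.state)
        while node is not None:
            node.visits += 1
            node.value += reward
            node = node.parent
    # Return the most-visited trajectory as the distillation target.
    best = max(root.children, key=lambda n: n.visits) if root.children else root
    while best.children:
        best = max(best.children, key=lambda n: n.visits)
    return best.state
```

The speculative claim is only that search of this kind could find long chains of reasoning that the base model then learns to reproduce directly, not that any particular lab trains coding models this way.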