I quite agree. I feel rather like humanity is a bunch of children playing with matches in a drought-stricken forest. Sooner or later something's bound to get out of hand. My one small point of disagreement here is that I think perhaps we are closer to potentially dangerous human-out-of-the-loop RSI than this post suggests.
My own takeaway from GPT-4 and other recent developments is that we're more likely to see the kind of smooth, gradual takeoff described by Christiano in e.g. the 2021 MIRI conversations (already seeing it, even?), but at a speed too fast to be useful for making progress on aligning even the gradual-takeoff systems.
And if we don't survive gradual takeoff, we don't even get a chance to try surviving hard takeoff, regardless of how close it is. I don't really have any strong beliefs about how likely or when hard takeoff happens, other than "probably after gradual takeoff, assuming there is an after", and even that I'm not that confident in.
Yeah, good point. Like the story about the guy who shot himself while jumping off the cliff into the ocean. Gotta dodge the bullet before you worry about the landing.
Overview
This post aims to speculate on some concrete ways that things could go awry in a world of gradual takeoff, where AI capabilities increase smoothly and predictably on human timescales.
(At this point, "human timescales" are probably best measured in months or even weeks, as opposed to years or decades. However, the scenarios that follow are intended to feel plausible whether gradual takeoff is fast or slow.)
As an aside, I think many of the concerns related to hard takeoff (CIS, deception, RSI) are correct and important. It's just that there are lots of scenarios where they don't end up mattering, because the world gets wrecked earlier by a less intelligent, non-agentic, out-of-control optimization process, before a deceptive/power-seeking/reflective/self-improving/etc. superintelligence ever has the chance. This post aims to outline one such class of scenarios.
The primary observation I make here is that it's quite straightforward to combine current AI models, such as LLMs, into a relatively simple system that is much more capable than any individual model alone. System-level improvements of this kind are likely to keep stacking on top of improvements in the underlying models themselves. I'll describe what this looks like today, speculate about what it might look like in the future, and then use those speculations to illustrate some concrete disaster scenarios.
LLMs, LLMs + LangChain
Today, anyone can manually write a prompt for an LLM and submit it, either through a chat interface like ChatGPT or Bing or directly via an API call, to get a useful result, for example: "I'm applying to the following job: <copy+pasted job description>. Write a cover letter for me that includes some relevant details from my CV: <copy+pasted CV>."
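For concreteness, here is roughly what that looks like as a direct API call, sketched with the OpenAI Python client's pre-1.0 chat interface; the exact model name, request fields, and client interface vary by provider and library version, so treat it as illustrative rather than canonical.

```python
# Minimal sketch: the same cover-letter prompt as a single API call
# (OpenAI Python client, pre-1.0 interface; details vary by provider/version).
import openai

openai.api_key = "sk-..."  # your API key

job_description = "<copy+pasted job description>"
cv = "<copy+pasted CV>"

prompt = (
    f"I'm applying to the following job: {job_description}\n"
    f"Write a cover letter for me that includes some relevant details from my CV: {cv}"
)

response = openai.ChatCompletion.create(
    model="gpt-3.5-turbo",
    messages=[{"role": "user", "content": prompt}],
)
print(response["choices"][0]["message"]["content"])
```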
However, chaining together multiple LLM calls programmatically enables more interesting and useful applications. The standard way to do this today is LangChain, an open-source Python library that lets developers programmatically incorporate useful context or private data into the LLM prompt, and/or chain together multiple LLM calls in a structured or recursive way.
The LangChain documentation features many interesting examples, and many of the real-world commercial applications of LLMs likely use LangChain or similar tools under the hood.
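As a rough illustration, here is a minimal two-step chain written against the LangChain interfaces available in early 2023 (class names and import paths have shifted across versions, so this is a sketch of the pattern, not a pinned recipe): the output of the first LLM call is fed programmatically into the prompt of the second.

```python
# Sketch of a two-step LangChain pipeline (early-2023 interfaces; the
# library's class names and import paths have changed across versions).
from langchain.llms import OpenAI
from langchain.prompts import PromptTemplate
from langchain.chains import LLMChain, SimpleSequentialChain

llm = OpenAI(temperature=0.7)

# Step 1: summarize the key requirements of a job posting.
summarize = LLMChain(
    llm=llm,
    prompt=PromptTemplate(
        input_variables=["job_description"],
        template="Summarize the key requirements of this job posting:\n{job_description}",
    ),
)

# Step 2: draft a cover letter targeting those requirements.
draft = LLMChain(
    llm=llm,
    prompt=PromptTemplate(
        input_variables=["requirements"],
        template="Write a cover letter addressing these requirements:\n{requirements}",
    ),
)

# Chain the two calls: step 1's output becomes step 2's input.
pipeline = SimpleSequentialChain(chains=[summarize, draft])
print(pipeline.run("<copy+pasted job description>"))
```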
There are several ways that LangChain-based LLM applications can become more capable that are mostly independent of improvements in the underlying model's text-completion ability: developers can refine their prompts, restructure their chains, and connect the models to new tools and data sources.
Bottom line: While OpenAI iterates and improves on the underlying LLM, prompt engineers and open-source developers can iterate on their prompts, chains, and chain-building toolkits. The result is that increasingly powerful models get to make use of increasingly powerful and flexible tools.
Humans, Humans + Tools, Humans + Other Humans with Tools
Humans are another example of agents that become more powerful with tools.
In many cases, it's easier to enhance a human's capabilities by providing them with a tool or chaining their actions together with another human's, rather than improving the human directly through education or advanced brain surgery.
A relatively less intelligent human with the right tool can be much more capable (and much more dangerous) than even the smartest human who is less well-equipped, and a human who can direct the actions of other humans with tools can have truly massive impacts on the world.
Of course, one new class of tools that humans are now gaining access to is AI systems built on top of LLMs + LangChain.
Future AI Systems
Future models will likely be more powerful than current models [citation not needed, hopefully...]. That means they'll likely be better at using the tools that humans (or other AIs) connect them to, and people will keep connecting them to more powerful tools, including, of course, tools that make use of the improved model itself, recursively. (Again, this recursive use is already happening extensively today.)
I'll refrain from speculating further on other ideas that are close enough in reach that they could likely be on OpenAI's or LangChain's roadmap, if they're not there already. Instead, I'll speculate on things that are a little more distant.
For example, suppose an organization succeeds in building something like a CoEm, as described by Conjecture. I don't think the people at Conjecture are reckless enough to publish their work or offer a publicly available CoEm API, but suppose OpenAI or FAIR are. It seems likely that the first thing people will do with a CoEm is write a Python script to connect it to a bunch of tools (including, potentially, other CoEms) or arrange a bunch of CoEms into a chain or tree.
Or, suppose DeepMind releases a state-of-the-art world modeler, reward modeler, or action planner, like the components used in Dreamer or MuZero to play Minecraft or Go, except that they work on real-world domains.
Individually, each of these components might be relatively harmless and well understood, exhibiting human-level performance or lower at its own task and behaving in non-agentic ways when used as intended.
However, if these models are connected together in a simple tree search algorithm over world states and augmented with tools that give the search access to a sufficiently rich action space, the system as a whole could be capable of accomplishing real-world tasks that exceed the capabilities of even the best-equipped humans.
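To make this concrete, here is a toy sketch of what connecting such components into a naive best-first search over predicted world states might look like. The world_model, reward_model, and tool methods below are hypothetical stand-ins for whatever components get released, not real APIs; the point is only that the surrounding logic fits in a few dozen lines.

```python
# Toy sketch: naive best-first search over predicted world states.
# world_model, reward_model, and the tools' methods are hypothetical
# stand-ins for released components, not real APIs.
import heapq
from dataclasses import dataclass, field
from typing import Any, List

@dataclass(order=True)
class Node:
    neg_reward: float                      # heapq is a min-heap, so store -reward
    state: Any = field(compare=False, default=None)
    plan: List[Any] = field(compare=False, default_factory=list)

def plan_with_search(world_model, reward_model, tools, initial_state, goal, budget=1000):
    """Expand the most promising predicted state up to a fixed budget and
    return the action sequence that led to the highest-scoring state."""
    frontier = [Node(0.0, initial_state, [])]
    best = frontier[0]
    for _ in range(budget):
        if not frontier:
            break
        node = heapq.heappop(frontier)
        if -node.neg_reward > -best.neg_reward:
            best = node
        for tool in tools:
            for action in tool.propose_actions(node.state, goal):      # hypothetical
                next_state = world_model.predict(node.state, action)   # hypothetical
                score = reward_model.score(next_state, goal)           # hypothetical
                heapq.heappush(frontier, Node(-score, next_state, node.plan + [action]))
    return best.plan  # hand this action sequence to the real tools to execute
```

Nothing in this loop is agentic in any deep sense; it's ordinary search code wrapped around powerful learned components.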
Extrapolating from the way LangChain works today, it seems plausible that:
The surrounding "glue code" needed to connect a CoEm or some world models to a bunch of tools is probably relatively simple, even trivial, compared to the models themselves.
The overall system comprising the powerful model(s) augmented with tools may not behave particularly agentically, or exhibit any of the deceptive or power-seeking behaviors speculated to emerge by default among more advanced agents. Well-understood models augmented with tools or stacked into chains will likely gain capabilities in conceptually straightforward (but potentially still dangerous) ways, even if humans can't foresee all of the immediate consequences of doing so.
Putting It All Together
Suppose, in the not-too-distant future, some organization releases an API for a CoEm or a "real world" world model, or even just publishes a paper outlining how to build one. A careless organization, or even an individual programmer, hooks up such an API to an array of suitably powerful tools and chains, and gives it some suitably difficult task to chug along on. This system will be able to do useful and powerful things for the programmer who builds it, incentivizing them to give it more tools and more compute, and to let it run for longer. Eventually, it will do something dangerous (not necessarily deceptively or agentically so, just dangerous in the mundane way that ordinary computer programs controlling powerful systems sometimes have bugs or exhibit behavior their programmers didn't foresee).
(Or perhaps it won't be CoEms or a "real world" world model that people use to build a runaway optimization process; something like GPT-6 or Llama-1T is probably sufficient, given sufficient carelessness.)
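For a sense of how little code the "hook it up and let it run" step might take, here is a bare-bones sketch of such a loop; llm_api.complete() and the tool entries are hypothetical placeholders for whatever API and integrations the programmer has wired up.

```python
# Bare-bones "let it run" loop. llm_api.complete() and the tool entries
# are hypothetical placeholders, not real APIs.
import json

TOOLS = {
    "search_web": lambda query: ...,        # placeholder tool implementations
    "run_shell": lambda command: ...,
    "send_email": lambda to, body: ...,
}

def run_forever(llm_api, task):
    history = []
    while True:  # no stopping condition other than Ctrl+C
        prompt = (
            f"Task: {task}\n"
            f"Steps taken so far and their results: {json.dumps(history)}\n"
            'Respond with JSON: {"tool": "<name>", "args": {...}}'
        )
        step = json.loads(llm_api.complete(prompt))    # hypothetical API call
        result = TOOLS[step["tool"]](**step["args"])   # execute whatever it asked for
        history.append({"step": step, "result": str(result)})
```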
The first such programs will plausibly be literally stoppable by pressing Ctrl+C in the appropriate terminal. However, given current capabilities trajectories and social dynamics, I think it's unlikely that everyone who gains the ability to run such tools will have the inclination to refrain from running them, or even to try terminating them before they cause catastrophic damage. Given the way people are currently developing AI tools with LLMs and LangChain, most will be inclined to "let them run" and see what they can do.

At the rate capabilities are advancing, I think it is also unlikely that alignment research will proceed fast enough to prevent this from happening through some kind of pivotal use, or that AI governance or regulation will succeed in preventing these tools from being built and used. (And of course, even if all the problems of gradual takeoff are miraculously solved in time, we still probably run into the hard problems of hard takeoff and superintelligence a bit further down the line.)
I realize this is kind of a doom-y note to end on, but on the bright side, we'll probably be around to witness some cool (and terrible) things, maybe even for long enough to make this comic a reality: