Epistemic status: head spinning, suddenly unsure of everything in alignment. And unsure of these predictions.
I'm following the suggestions in 10 reasons why lists of 10 reasons might be a winning strategy in order to get this out quickly (reason 10 will blow your mind!). I'm hoping to prompt some discussion, rather than attempt the definitive writeup on a technique that was introduced so recently.
Ten reasons why agentized LLMs will change the alignment landscape:
- Agentized[1] LLMs like Auto-GPT and Baby AGI may fan the sparks of AGI in GPT-4 into a fire. These techniques use an LLM as a central cognitive engine, within a recursive loop of breaking a task goal into subtasks, working on those subtasks (including calling other software), and using the LLM to prioritize subtasks and decide when they're adequately well done. They recursively check whether they're making progress on their top-level goal.
- While it remains to be seen what these systems can actually accomplish, I think it's very likely that they will dramatically enhance the effective intelligence of the core LLM. I think this type of recursivity and breaking problems into separate cognitive tasks is central to human intelligence. This technique adds several key aspects of human cognition: executive function; reflective, recursive thought; and episodic memory for tasks, despite using non-brainlike implementations. To be fair, the existing implementations seem pretty limited and error-prone. But they were implemented in days. So this is a prediction of near-future progress, not a report on amazing new capabilities.
- This approach appears to be easier than I'd thought. I've been expecting this type of self-prompting to imitate the advantages of human thought, but I didn't expect the cognitive capacities of GPT-4 to make it so easy to do useful multi-step thinking and planning. The ease of initial implementation (something like 3 days, with all of the code for Baby AGI also written by GPT-4) implies that improvements may also be easier than we would have guessed.
- Integration with HuggingGPT and similar approaches can provide these cognitive loops with more cognitive capacities. This integration was also easier than I'd have guessed, with GPT-4 learning from a handful (e.g., 40) of examples how to use other software tools. Those tools will include both sensory capacities, with vision models and other sensory models of various types, and the equivalent of a variety of output capabilities.
- Integration of recursive LLM self-improvement like "Reflexion" can utilize these cognitive loops to make the core model better at a variety of tasks.
- That LLMs are so easily agentized is terrible news on the capabilities front. I think we'll have an internet full of LLM-bots "thinking" up and doing stuff within a year.
- This is absolutely bone-chilling for the urgency of the alignment and coordination problems. Some clever chucklehead already created ChaosGPT, an instance of Auto-GPT given the goal to destroy humanity and create chaos. You are literally reading the thoughts of something thinking about how to kill you. It's too stupid to get very far, but it will get smarter with every LLM improvement, and every improvement to the recursive self-prompting wrapper programs. This gave me my very first visceral fear of AGI destroying us. I recommend it, unless you're already plenty viscerally freaked out.
- Watching agents think is going to shift public opinion. We should be ready for more AI scares and changing public beliefs. I have no idea how this is going to play out in the political sphere, but we need to figure this out to have a shot at successful alignment, because
- We will be in a multilateral AGI world. Anyone can spawn a dumb AGI and have it either manage their social media, or try to destroy humanity. And over the years, those commercially available AGIs will get smarter. Because defense is harder than offense, it is going to be untenable to indefinitely defend the world against out-of-control AGIs. But
- Important parts of alignment and interpretability might be a lot easier than most of us have been thinking. These agents take goals as input, in English. They reason about those goals much as humans do, and this will likely improve with model improvements. This does not solve the outer alignment problem; one existing suggestion is to include a top-level goal of "reducing suffering." No! No! No! (A goal like that is satisfied most thoroughly by eliminating everything that can suffer.) This also does not solve the alignment stability problem. Starting goals can be misinterpreted or lost to recursive subgoals, and if any type of continued learning is included, behavior will shift over time. It doesn't even solve the inner alignment problem if recursive training methods create mesa-optimizers in the LLMs. But it also provides incredibly easy interpretability, because these systems think in English.
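The recursive loop described in the first bullet (decompose a goal, execute subtasks, check progress, re-prioritize) can be sketched in a few lines. This is a minimal illustration, not Auto-GPT's or Baby AGI's actual code; `llm` here is a hypothetical stand-in callable (prompt in, text out) where a real system would call a model API and external tools.

```python
from collections import deque

def agent_loop(goal, llm, max_steps=10):
    """Minimal Auto-GPT-style loop: decompose a goal into subtasks,
    work through them, and ask the model itself whether the top-level
    goal is done and how to re-prioritize what remains."""
    tasks = deque(llm(f"Break this goal into subtasks: {goal}").split("\n"))
    results = []
    for _ in range(max_steps):
        if not tasks:
            break
        task = tasks.popleft()
        # "Working on a subtask" could also mean calling other software.
        results.append((task, llm(f"Do this subtask: {task}")))
        # Recursively check progress against the top-level goal.
        verdict = llm(f"Goal: {goal}. Done so far: {results}. "
                      "Is the goal complete? yes/no")
        if verdict.strip().lower().startswith("yes"):
            break
        # Let the model add and re-prioritize remaining subtasks.
        new_plan = llm(f"Goal: {goal}. Remaining: {list(tasks)}. "
                       "Reprioritized task list:")
        tasks = deque(t for t in new_plan.split("\n") if t)
    return results
```

Notably, every piece of the loop's "thinking" is a plain-English prompt or completion, which is exactly why the interpretability point in the last bullet holds.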
If I'm right about any reasonable subset of this stuff, this lands us in a terrifying, promising new landscape of alignment issues. We will see good bots and bad bots, and the balance of power will shift. Ultimately I think this leads to the necessity of very strong global monitoring, including breaking all encryption, to prevent hostile AGI behavior. The array of issues is dizzying (I am personally dizzied, and a bit short on sleep from fear and excitement). I would love to hear others' thoughts.
[1] I'm using a neologism, and a loose definition of agency as things that flexibly pursue goals. That's similar to this more rigorous definition.
Constructions like Auto-GPT, Baby AGI and so forth are fairly easy to imagine. The greater accuracy ChatGPT shows when told to "show your work" already suggests them. Essentially, the model is a ChatGPT-like LLM given an internal state through "self-talk" that isn't part of a dialog, plus an output channel to the "real world" (open internet or whatever). Whether these call the OpenAI API or use an open-source model seems a small detail; both approaches are likely to appear, because people are playing with essentially every possibility they can imagine.
If these structures really do beget AGI (which I'll assume critically includes the capacity to act effectively in the world), then predictions of doom indeed seem nigh to being realized. The barrier to alignment here is that humans won't be able to monitor the system's self-talk, simply because it will come too fast, and moreover, intent to do something undesirable may not be obvious from any single step. You could include another LLM in the system's self-talk loop, along with other filters/barriers to its real-world access, but all of these could be thwarted by a determined system - and a determined system is what people will aim to build (there was an article about GPT-4 supplying "jailbreaks" for GPT-4, etc.). Much the same, we've seen "intent drift" in practice with the various Bing Chat scare stories that made the rounds recently (before being limited, Bing Chat seemed to drift in intent until it "got mad" and then became fixated. This isn't strange, because it's a human behavior one can observe and predict online).
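The "another LLM in the self-talk loop" idea above can be sketched as a wrapper that screens each step before it reaches the real-world output channel. All names here are hypothetical; `monitor` is any callable (text in, safe/unsafe out), and, as the paragraph notes, this is a sketch of the concept rather than a robust defense, since a determined system could craft output that evades the filter.

```python
def monitored_llm(llm, monitor, prompt):
    """Run one self-talk step, but pass the model's output through a
    second model or filter before it can drive real-world action.
    Returns the text if the monitor judges it safe, else a refusal."""
    text = llm(prompt)
    if not monitor(text):
        return "[step blocked by monitor]"
    return text
```

The weakness is structural: the monitor sees the same fast stream of self-talk that humans can't keep up with, so it has to be at least as capable as the system it's watching.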
However, it seems more plausible to me that there are still fundamental barriers to producing a computer/software system able to act effectively in the world. I'd frame the distinction between being/seeming generically intelligent (apparent smartness), as LLMs certainly seem, and acting effectively in the world as the difference between drawing correct answers to complex questions 80% of the time and making seemingly simpler judgements that touch all aspects of the world with 99.9...% accuracy (plus having a complex system of effective fallbacks). Essentially the difference between ChatGPT and a self-driving car. It seems plausible to me that such goal-seeking can't easily be instilled in an LLM or similar neural net by the standard training loop, even though that loop tends to produce apparent smarts, and produces them better over time. But hey, I and other skeptics could be wrong, in which case there's reason to worry now.
I don't think the existence of sensors is the problem. I believe that self-driving cars, a key example, have problems regardless of their sensor level. I see the key hurdle as ad-hoc action in the world. Overall, all of our knowledge about neural networks,...