[Epistemic Status: Semi-informed conjecture. Feedback is welcome.]
At EAG last month, I got the advice to develop my own mental model of what general/transformative/superintelligent AI will look like when it gets here, and where I think the risks lie. This week I’ll be writing a series of shorter posts thinking out loud on this.
I want to start by stating three of my intuitions about what artificial general intelligence (AGI) will look like and how safe it will be, which I’ll phrase as conjectures:
Conjecture A: Large Language Models (LLMs) will be central to (our first) Artificial General Intelligences (AGIs)
Conjecture B: LLM capabilities and safety are heavily dependent on their prompt and the way they are invoked (in addition to the model).
By “the way they are invoked” I mean the call-execute loop external to the LLM itself, including whether its outputs can be interpreted and run by an external tool. Examples of different ways they could be invoked include “GPT-4 via ChatGPT”, “GPT-4 via ChatGPT plus plugins”, and “GPT-4 in ARC’s testing environment where it can ‘execute code, do chain-of-thought reasoning, and delegate to copies of itself’”.
Conjecture C: LLMs don’t behave like expected utility maximizers for tasks they are given.
Caveats/clarifications: LLMs are obviously choosing next tokens based on minimizing a function (for pre-trained models the function is an approximation of “predictive loss on the next token”), but this doesn’t map cleanly onto a utility calculation for a task it is given (e.g. “make paperclips”). That said, LLMs might be capable of role-playing an expected utility maximizer.
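To spell the caveat out in symbols (my own rough formalization, not anyone else's): a pre-trained model's parameters are fit to minimize something like

$$\mathcal{L}(\theta) \;\approx\; -\sum_t \log p_\theta(x_t \mid x_{<t}),$$

whereas an expected utility maximizer handed “make paperclips” would choose actions by something like $a^* = \arg\max_a \mathbb{E}\left[U_{\text{paperclips}}(\text{outcome}) \mid a\right]$. The first objective ranges over tokens in the training distribution, the second over actions and world-states, and nothing about minimizing the first forces a model to implement the second for an arbitrary task.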
I think I’m reasonably confident in these conjectures (90+% belief under appropriate formalization). I see conjecture A as very different from conjectures B and C, and I think B and C have important safety implications that I’ll discuss in a later post. But I am a bit concerned that the properties I am ascribing to LLMs are mostly true of the pre-trained models, and might be rendered false by relatively small amounts of fine-tuning (via RLHF or other means).
The rest of this post outlines the evidence I’ve seen for these conjectures.
Conjecture A: AGI from LLMs
Reasons I think LLMs will be central to AGIs:
LLMs have state-of-the-art and near-human capabilities in a huge range of tasks.
LLMs have room for improvement. So far it’s looked like bigger=smarter for LLMs, and if nothing else the improvements in computer hardware will allow bigger models in the future.
While LLMs have limitations, it’s architecturally easy to augment them to overcome those limitations, for instance by adding multimodality or plugins. (Remember when I predicted those capability improvements earlier this month? How innocent we were in the halcyon days of early March.) While some challenges such as hallucinations remain, I think those can also be overcome with methods I will not discuss here.
There will be enormous economic pressure to improve LLMs. We’re on a margin where slightly better LLMs could unlock even greater economic value (e.g. “automate 10% of my work” → “automate 50% of my work”), so we’ll see a lot of money and talent poured into improving them.
Conjecture B: Prompts and invocation are important
Reasons I think capabilities and safety are prompt- and invocation-dependent:
Capabilities+Safety: Whether the AI can use tools obviously changes its capabilities, and could push it from harmless to potentially dangerous. For instance, if an LLM’s output is only sanitized plain text, it is likely less capable of harm than if it can directly execute arbitrary code.
Safety: If the LLM is a simulator that is made of masks, it should be able to role-play both “safe AIs” and “unsafe AIs” to the extent those exist in its training data/pop culture/the collective human psyche. If the LLM can role-play either a safe or an unsafe AI depending on its prompt, then obviously its prompt is important!
Safety: Part of OpenAI’s “safety” procedure with ChatGPT is a preamble to the prompt telling it to be a harmless assistant. While I don’t think OpenAI’s policies are sufficient for safety against catastrophic AI risk, I think this is evidence that the prompt can make it harder for the AI to display undesirable behavior.
Safety: I’ve produced behavior in LLMs which is more or less safe depending on their prompt. I’ll try to write those up for a forthcoming blog post.
Perhaps an analogy between LLMs and computers is appropriate: depending on the LLM’s “program” it might be dangerous or safe, might qualify as a cohesive mind or not, etc.
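To make the analogy (and the call-execute loop from Conjecture B’s clarification) concrete, here is a minimal sketch under my own assumptions: query_llm, extract_tool_call, and run_tool are hypothetical stand-ins rather than any real API, and SAFETY_PREAMBLE is illustrative text I made up, not OpenAI’s actual preamble.

```python
# Illustrative sketch only. query_llm, extract_tool_call, and run_tool are
# hypothetical stand-ins for a model API, an output parser, and a tool/plugin
# executor; SAFETY_PREAMBLE is made up and is not OpenAI's actual text.

SAFETY_PREAMBLE = (
    "You are a harmless, helpful assistant. "
    "Refuse requests that could cause harm."
)

def query_llm(transcript: str) -> str:
    # Stand-in: a real implementation would call a model here.
    return "I bought you some paperclips on Amazon."

def extract_tool_call(output: str) -> str | None:
    # Stand-in: a real implementation would look for a code block or plugin request.
    return None

def run_tool(tool_call: str) -> str:
    # Stand-in: a real implementation would execute code or call a plugin.
    return "(tool output)"

def invoke(prompt: str, allow_tools: bool, max_steps: int = 10) -> str:
    """The call-execute loop: call the model, optionally run what it asks for,
    feed the result back in, and repeat. The loop, not the weights, decides
    whether the model's output ever touches the outside world."""
    transcript = prompt
    output = ""
    for _ in range(max_steps):
        output = query_llm(transcript)
        tool_call = extract_tool_call(output)
        if tool_call is None or not allow_tools:
            return output  # sanitized plain text only; nothing external runs
        result = run_tool(tool_call)
        transcript += f"\n{output}\n[tool result] {result}\n"
    return output

def chat_style(user_prompt: str) -> str:
    # "GPT-4 via ChatGPT"-like program: safety preamble, no tool execution.
    return invoke(SAFETY_PREAMBLE + "\n\nUser: " + user_prompt, allow_tools=False)

def agent_style(user_prompt: str) -> str:
    # Plugins/ARC-style program: same weights, but outputs can drive tools.
    return invoke(user_prompt, allow_tools=True)
```

Same model, two very different “programs”, and (I claim) two very different risk profiles.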
Conjecture C: LLMs aren’t expected utility maximizers
Reasons I think LLMs aren’t expected utility maximizers:
The model chooses next tokens to minimize predictive loss, not to maximize the objective you give it. A more typical response to “get me paperclips” is “I bought you some on Amazon”, not “I turned the universe into paperclips” (modulo the memetic popularity of paperclip maximizers in particular), and this is true even for more capable models.
I’ve produced behavior in LLMs that is contrary to some of the classical results about expected utility maximizers (like incorrigibility and instrumental convergence). I’ll try to write those up for a forthcoming blog post.