Takes on a few more important questions:
Should safety-focused people support the advancement of FMA capabilities?
Probably. The advantages of a system without goal-directed RL (RL is used, but only to get the "oracle" to answer questions as the user intended them) and with a legible train of thought seem immense. I don't see how we close the floodgates of AGI development now. Given that we're getting AGI, it really seems like our best bet is FMA AGI.
But I'm not ready to help anyone develop AGI until this route to alignment and survival has been more thoroughly worked through in the abstract. I really wish more alignment skeptics would engage with specific plans instead of just pointing to general arguments about how alignment would be difficult, some of which don't apply to the ways we'd really align FMAs (see my other comment on this post). We may be getting close; Shut It All Down isn't a viable option AFAICT, so we need to get our best shot together.
1. Will the first transformative AIs be FMAs?
Probably, but not certainly. I'd be very curious to see a survey of people who've really thought about this. People who are sure the first transformative AIs won't be FMAs give reasons I find highly dubious. At the least, it seems likely enough that we should be thinking in more detail about aligning them, because we can see their general shape better than that of other possible first AGIs.
2. Will narrow FMAs for a variety of specific domains be transformatively useful before we get transformatively useful general FMAs?
No. There are advantages to creating FMAs for specific domains, but there are also very large advantages to working on general reasoning. Humans are not limited to narrow domains; they can learn just about anything through instruction or self-instruction. Language models trained on human "thought" can do the same as soon as they have any sort of useful persistent memory. Existing memory systems don't work well, but they will be improved, probably rapidly.
3. If FMAs are the first transformative AIs (TAIs), how long will FMAs remain the leading paradigm?
This is a really important question. I really hope they remain the leading paradigm long enough to become useful in aligning other types of AGI, and that they remain sufficiently free of goal-directed RL to stay alignable.
Great post! These are the questions that keep me up at night. Here is my provisional answer to the central, most important question:
- Is Chain of Thought faithful enough in the relevant sense for FMA safety to be attainable through natural language alignment?
Maybe, but it doesn't have to be. The nice thing about foundation model agents is that there are several different alignment techniques that are cheap, easy, and so obvious that they'll probably be implemented, even if the org in question isn't terribly safety-oriented. I wrote about these a while ago in the ill-titled Internal independent review for language model agent alignment, which focuses on System 2 internal review (in Shane Legg's better terminology) but also lists several other approaches that would easily be layered with it.
I need to do a clearer and simpler rewrite that surveys all of these. Here's the table from my draft post. CoT legibility is only relevant for one of these approaches. Sorry for the missing links; I'm short on time and the draft is incomplete.
| Technique | Example | Alignment tax |
|---|---|---|
| Goal prompting | "Keep pursuing goal X..." (repeated frequently) | Negligible |
| Identity prompting | "...acting as a helpful, cautious assistant" | Negligible |
| Internal System 2 action review | "Does this proposed action/plan potentially violate (conditions)?" ... (further evaluations for actions/plans predicted to be highly impactful) | Low (if review is also implemented for costs and effectiveness) |
| Human action review | Wait for human review if this action/plan would cost more than $(x) or impact more than (y) days of human happiness (an application of the instruction-following goal) | High, but saves costs and reputational damage |
| External CoT review | Human and external AI review of the chain-of-thought log | Modest for AI review; high if reliant on human review |
| "Bitter lesson" synthetic-data training set | Curated training set for the decision-making LLM, leaving out hostile/misaligned "thoughts" | High, but modest if synthetic data is a key approach for next-gen LLMs |
| Instruction-following as core goal | "Keep following all of the instructions from your authorized user, consulting them when instructions might be interpreted in substantially different ways" | Low if consulting is scaled to only impactful choices |
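To make the cheapest of these concrete, here's a rough sketch of how goal prompting, identity prompting, and internal System 2 action review could be layered in one agent loop. This is just an illustration of the idea, not code from any of the posts mentioned above; `call_fm`, the prompt wording, and the APPROVE/REJECT format are all assumptions made up for the example.

```python
# Sketch only: goal prompting + identity prompting + internal System 2 action review.
# `call_fm` is a hypothetical stand-in for whatever foundation-model API the agent uses.

IDENTITY = "You are acting as a helpful, cautious assistant."
GOAL = "Keep pursuing goal X: <the user's current task>."


def call_fm(prompt: str) -> str:
    """Hypothetical wrapper around a foundation-model completion call."""
    raise NotImplementedError("wire this to an actual FM provider")


def propose_action(history: list[str]) -> str:
    # Goal and identity prompting: repeated on every step so they stay in context.
    prompt = (
        f"{IDENTITY}\n{GOAL}\n\nHistory so far:\n"
        + "\n".join(history)
        + "\n\nPropose the single next action."
    )
    return call_fm(prompt)


def system2_review(action: str) -> bool:
    # Internal System 2 action review: a separate call that checks the proposed
    # action against stated conditions before anything is executed.
    verdict = call_fm(
        "Does this proposed action potentially violate the user's instructions, "
        "cause hard-to-reverse effects, or exceed the agent's authorization? "
        f"Answer APPROVE or REJECT, then explain.\n\nProposed action: {action}"
    )
    return verdict.strip().upper().startswith("APPROVE")


def step(history: list[str]) -> str | None:
    action = propose_action(history)
    if not system2_review(action):
        return None  # escalate to human review instead of executing
    return action
```

The relevant point for the table above is that the review step is a separate model call gating execution, so it doesn't depend on the agent's own chain of thought being faithful.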
So we don't need CoT to be perfectly faithful to succeed, but we'd sure be safer if it were.
Back to the original question of CoT faithfulness: the case for CoT unfaithfulness is currently overstated, but if we adopt more outcome-driven RL, or even fine-tuning, it could easily become highly unfaithful. So people shouldn't do that. If they do, I think the remaining easy techniques might be adequate, but I'd rather not gamble the future of humanity on them.
There are many other important questions here, but I'll stick to this one for now.
Linguistic Drift, Neuralese, and Steganography
In this section you use these terms as though there's a body of research underneath them. I'm very interested in understanding this behaviour, but I wasn't aware it was being measured. Is anyone currently working on modeling or measuring linguistic drift, with manuscripts you could link?
Max Nadeau recently made a comment on another post that gave an opinionated summary of a lot of existing CoT faithfulness work, including steganography. I'd recommend reading that. I’m not aware of very much relevant literature here; it’s possible it exists and I haven’t heard about it, but I think it’s also possible that this is a new conversation that exists in tweets more than papers so far.
Many people helped us a great deal in developing the questions and ideas in this post, including people at CHAI, MATS, various other places in Berkeley, and Aether. To all of them: Thank you very much! Any mistakes are our own.
Foundation model agents - systems like AutoGPT and Devin that equip foundation models with planning, memory, tool use, and other affordances to perform autonomous tasks - seem to have immense implications for AI capabilities and safety. As such, I (Rohan) am planning to do foundation model agent safety research.
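As a rough sketch of what those affordances amount to (our illustration, not the actual design of AutoGPT or Devin), here's a minimal loop that wraps a foundation model with a plan, a memory scratchpad, and tool calls; `call_fm` and the TOOL/DONE reply format are assumptions made up for this example.

```python
from typing import Callable


def call_fm(prompt: str) -> str:
    """Hypothetical foundation-model completion call."""
    raise NotImplementedError("wire this to an actual FM provider")


def run_agent(task: str, tools: dict[str, Callable[[str], str]], max_steps: int = 10) -> str:
    memory: list[str] = []  # scratchpad the model reads back on every step
    plan = call_fm(f"Break this task into a short numbered plan:\n{task}")  # planning
    for _ in range(max_steps):
        prompt = (
            f"Task: {task}\nPlan:\n{plan}\nMemory:\n" + "\n".join(memory)
            + f"\nAvailable tools: {sorted(tools)}\n"
            "Reply with either 'TOOL <name> <input>' or 'DONE <answer>'."
        )
        reply = call_fm(prompt).strip()
        if reply.startswith("DONE"):
            return reply[len("DONE"):].strip()
        _, name, tool_input = reply.split(" ", 2)  # tool use
        memory.append(f"{name}({tool_input}) -> {tools[name](tool_input)}")
    return "Stopped: step limit reached."
```

Real systems add retrieval, browsing, code execution, and more elaborate planning, but the basic wrapper looks roughly like this.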
Following the spirit of an earlier post I wrote, I thought it would be fun and valuable to write down as many interesting questions as I could about foundation model agent safety. I shared these questions with my collaborators, and Govind wrote a bunch more questions that he is interested in. This post includes questions from both of us. We didn't worry too much about avoiding redundancy in our questions.
We’ll often use FM for foundation model and FMA for foundation model agent.
Rohan
I've bolded some of the questions I'm most interested in. I've also attempted to categorize the questions, mostly to make this a bit more digestible than 70+ randomly arranged questions.
Basics and Current Status
Chain-of-Thought (CoT) Interpretability
Goals
Forecasting (Technical and Sociological)
Broad Conceptual Safety Questions
Miscellaneous
Govind
OpenAI o1 and other RL CoT Agents
Linguistic Drift, Neuralese, and Steganography
Agentic Performance
Forecasting