I think you're absolutely right.
And I think there's approximately a 0% chance humanity will stop at pure language models, or even stop at o1 and o3, which very likely use RL to dramatically enhance capabilities.
Because they use RL not to accomplish things-in-the-world but to arrive at correct answers to questions they're posed, the concerns you express (which pretty much anyone who's been paying attention to AGI risk shares) are not fully in play.
OpenAI will continue on this path unless legislation stops them. And that's highly unlikely to happen, because the argument against is just not strong enough to convince the public or legislators.
We are mostly applying optimization pressure to our AGI systems to follow instructions and produce correct answers. Framed that way, it sounds like it's as safe an approach as you could come up with for network-based AGI. I'm not saying it's safe, but I am saying it's hard to be sure it's not without more detailed arguments and analysis. Which is what I'm trying to do in my work.
Also as you say, it would be far safer to not make these things into agents. But the ease of doing so with a smart enough model and a prompt like "continue pursuing goal X using tools Y as appropriate to gather information and take actions" ensures that they will be turned into agents.
People want a system that actually does their work, not one that just tells them what to do. So they're going to make agents out of smart LLMs. This won't be stopped even with legislation; people will do it illegally or from jurisdictions that haven't passed the laws.
So we are going to have to both hope and plan for this approach, including RL for correct answers, being safe enough. Or come up with far stronger and more convincing arguments for why it won't be. I currently think it can be made safe in a realistic way with no major policy or research direction change. But I just don't know, because I haven't gotten enough people to engage deeply enough with the real difficulties and likely approaches.
Thank you Seth for the thoughtful reply. I largely agree with most of your points.
I agree that RL trained to accomplish things in the real world is far more dangerous than RL trained to just solve difficult mathematical problems (which in turn is more dangerous than vanilla language modeling). I worry that the real-world part will soon become commonplace, judging from current trends.
But even without the real-world part, models could still be incentivized to develop superhuman abilities and complex strategic thinking (which could be useful for solving mathematical and coding problems).
Regarding the chances of stopping/banning open-ended RL, I agree it's a very tall order, but my impression of the advocacy/policy landscape is that people might be open to it under the right conditions. At any rate, I wasn't trying to reason about what's reasonable to ask for, only about the implications of different paths. I think the discussion should start there, and then we can consider what's wise to advocate for.
For all of these reasons, I fully agree with you that work on demonstrating these risks in a rigorous and credible way is one of the most important efforts for AI safety.
I explain it in more detail in my original post.
In short, in standard language modeling the model only tries to predict the most likely immediate next token (T1), and then the most likely token after that (T2) given T1, and so on; whereas in RL it's trying to optimize a whole sequence of next tokens (T1, ..., Tn), so the reward attributed to the immediate next token (T1) depends on the entire sequence that follows, all the way out to Tn.
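The contrast can be made concrete with a toy sketch. Everything below is illustrative (a real model conditions on context and uses baselines, clipping, etc.); the point is only the shape of the two objectives: a per-token prediction loss versus a REINFORCE-style loss where one scalar reward for the whole sequence flows back into every token, including T1.

```python
import math

# Toy "policy": unnormalized scores per token (a real model would
# condition these on the preceding context).
scores = {"a": 1.0, "b": 0.5, "c": 0.1}

def probs(s):
    z = sum(math.exp(v) for v in s.values())
    return {t: math.exp(v) / z for t, v in s.items()}

# 1) Pure language modeling: the loss for the immediate next token
#    depends only on that one observed token, one step at a time.
def lm_loss(target):
    return -math.log(probs(scores)[target])

# 2) RL (REINFORCE-style): a whole sampled sequence (T1, ..., Tn) earns
#    a single scalar reward, and every token in the sequence is
#    reinforced in proportion to it, so T1's update depends on what
#    happened all the way out to Tn.
def rl_loss(sequence, reward):
    logp = sum(math.log(probs(scores)[t]) for t in sequence)
    return -reward * logp

print(lm_loss("a"))                # depends only on the next token
print(rl_loss(["a", "b", "c"], 1.0))  # depends on the full sequence and its reward
```

Note that doubling the reward doubles the pressure on every token in the chain, which is exactly the sequence-level credit assignment that per-token language modeling lacks.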
I disagree. I think the current approach, with chain-of-thought reasoning, is a marked improvement over naive language modeling in terms of alignment difficulty. CoT allows us to elicit higher capabilities from a base text-generation model of the same level, meaning less of the computation is done inside the black box and more is done in human-readable tokens. While this still (obviously) has risks, it seems preferable to models that fully internalize the reasoning process. Do you agree with that?
I'm all in for CoT! But when you RL a model to produce CoTs that are better at solving difficult problems, you risk encouraging i) CoTs that carry side information that's only known to the model, ii) superhuman capabilities, and iii) planning ahead and agency in ways that are difficult to anticipate. Whether you produce the CoT with a pure LLM or a model that's also undergone RL, you end up with a series of tokens where each token was produced by a fully internalized reasoning process. The only difference is that RL encourages this reasoning process to be more powerful, more agentic, and less predictable. What's the advantage of RL for safety?
"you risk encouraging i) CoTs that carry side information that's only known to the model"
This is true by default, but not intractable. For example, you can train the CoT model with periodic paraphrasing to avoid steganography, or you can train a CoT model just for capabilities and introduce a separate model that reasons about safety. Daniel Kokotajlo has some nice writeups about these ideas, he calls it the Shoggoth/Face framework.
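As a toy illustration of why paraphrasing helps (the encoding scheme and the paraphraser here are invented for this example and are not from Kokotajlo's writeups): a CoT can smuggle a hidden bit through word choice, and a paraphraser that canonicalizes synonyms destroys the channel while preserving the visible reasoning.

```python
# Toy steganographic channel: the CoT encodes one hidden bit via word
# choice ("hence" = 1, "so" = 0). A paraphraser that canonicalizes
# synonymous connectives erases the bit but keeps the surface reasoning.

SYNONYMS = {"hence": "so", "thus": "so", "therefore": "so"}

def encode_cot(step, hidden_bit):
    # Steganographic writer: smuggles a bit into an innocuous connective.
    connective = "hence" if hidden_bit else "so"
    return f"{step}, {connective} the answer is 4."

def decode_bit(cot):
    # Covert reader: recovers the bit from the word choice.
    return 1 if "hence" in cot else 0

def paraphrase(cot):
    # Canonicalize synonymous connectives, preserving the visible logic.
    return " ".join(SYNONYMS.get(w, w) for w in cot.split())

original = encode_cot("2 + 2", hidden_bit=1)
assert decode_bit(original) == 1               # channel works pre-paraphrase
assert decode_bit(paraphrase(original)) == 0   # paraphrasing erases the bit
```

A real steganographic model would of course use subtler channels than a single connective, which is why the worry below about "strong enough incentives" still applies; the sketch only shows the mechanism by which paraphrasing degrades such channels.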
"superhuman capabilities"
Agreed that this would be bad, but conditional on it happening, it's better to do it with RL CoT than with black-box token generation.
"planning ahead and agency in ways that are difficult to anticipate"
Not sure why this would be the case--shouldn't having access to the model's thought process make this easier to anticipate than if the long-term plans were stored in neuralese across a bunch of transformer layers?
"RL encourages this reasoning process to be more powerful, more agentic, and less predictable"
I agree with this in the sense that our frontier models are trained with RL, and those models are also the most powerful and most agentic (since they're more capable). But I'm not convinced that this is inherent to RL training, and I'm not exactly sure in what way these models are less predictable.
Why are you conditioning on superhuman AGI emerging? I think it's something very dangerous that our society isn't ready for. We should pursue a path where we can enjoy as many of the benefits of sub-human-level AGI (of the kind we already have) without risking uncontrolled acceleration. Pushing for stronger capabilities with open-ended RL is counterproductive for the very scenario we need to avoid.
It sounds like you believe that training a model with RL would make it somehow more transparent, whereas I believe the opposite. Can you explain your reasoning?
Do you disagree that RL pushes models to be better at planning and exceeding human-level capabilities?
I like the idea of discouraging steganography, but I still worry that given strong enough incentives, RL-trained models will find ways around this.
I think we're working with a different set of premises, so I'll try to disentangle a few ideas.
First, I completely agree with you that building superhuman AGI carries a lot of risks, and that society broadly isn't prepared for the advent of AI models that can perform economically useful labor.
Unfortunately, economic and political incentives being what they are, capabilities research will continue to happen. My more specific claim is that conditional on AI being at a given capabilities level, I prefer to reach that level with less capable text generators and more advanced RL/scaffolding (e.g. o1) as opposed to simply more capable text generators (e.g. GPT-4). I believe that the former lends itself to stronger oversight mechanisms, more tractable interpretability techniques, and more robust safety interventions during real-world deployment.
"It sounds like you believe that training a model with RL would make it somehow more transparent, whereas I believe the opposite. Can you explain your reasoning?"
This might have come across wrong, and is a potential crux. Conditioning on a particular text-generation model, I would guess that applying RL increases the risk--for example, I would consider Gemini 2.0 Flash Thinking riskier than Gemini 2.0 Flash. But if you just showed me a bunch of eval results for an unknown model and asked how risky I thought the model was based on those, I would be more concerned about a fully black-box LLM than an RL CoT/scaffolded LM.
"Do you disagree that RL pushes models to be better at planning and exceeding human-level capabilities?"
No, it seems pretty clear that RL models like o3 are more capable than vanilla LLMs. So in a sense, I guess I think RL is bad because it increases capabilities faster, which I think is bad. But I still disagree that RL is worse for any theoretical reason beyond "it works better".
Tying this all back to your post, there are a few object-level claims that I continue to disagree with, but if I came to agree with you on them I would also change my mind more on the overall discussion. Specifically:
At this point we can no longer trust the chains of thought to represent their true reasoning, because models are now rewarded based on the final results that these chains lead to. Even if you put a constraint requiring the intermediate tokens to appear like logical reasoning, the models may find ways to produce seemingly-logical tokens that encode additional side information useful for the problem they are trying to solve. (I agree with this naively, but think this problem is a lot more tractable than e.g. interpretability on a 100b parameter transformer.)
Of course, I'm more than open to hearing stronger arguments for these, and would happily change my mind if I saw convincing evidence.
I haven’t found many credible reports on what algorithms and techniques have been used to train the latest generation of powerful AI models (including OpenAI’s o3). Some reports suggest that reinforcement learning (RL) has been a key part, which is also consistent with what OpenAI officially reported about o1 three months ago.
The use of RL to enhance the capabilities of AGI[1] appears to be a concerning development. As I wrote previously, I have been hoping to see AI labs stick to training models through pure language modeling. By “pure language modeling” I don’t rule out fine-tuning with RLHF or other techniques designed to promote helpfulness/alignment, as long as they don’t dramatically enhance capabilities. I’m also okay with the LLMs being used as part of more complex AI systems that invoke many instances of the underlying LLMs through chain-of-thought and other techniques. What I find worrisome is the underlying models themselves trained to become more capable through open-ended RL.
The key argument in my original post was that AI systems based on pure language modeling are relatively safe because they are trained to mimic content generated by human-level intelligence. This leads to weak pressure to surpass human level. Even if we enhance their capabilities by composing together many LLM operations (as in chain of thought), each atomic operation in these complex reasoning structures would be made by a simple LLM that only tries to generate a good next token. Moreover, the reasoning is done with language we can read and understand, so it’s relatively easy to monitor these systems. The underlying LLMs have no reason to lie in very strategic ways[2], because they are not trained to plan ahead. There is also no reason for LLMs to become agentic, because at their core they are just prediction machines.
Put RL into the core training algorithm and everything changes. Models are now explicitly trained to plan ahead, which makes all kinds of strategic and agentic behaviors actively rewarded. At this point we can no longer trust the chains of thought to represent their true reasoning, because models are now rewarded based on the final results that these chains lead to. Even if you put a constraint requiring the intermediate tokens to appear like logical reasoning, the models may find ways to produce seemingly-logical tokens that encode additional side information useful for the problem they are trying to solve. Human-level intelligence also ceases to be an important mark, because RL is about solving problems, not mimicking humans. In other words, we are now encouraging models to race towards superhuman capabilities.
For all of these reasons, it seems much more likely that we will end up with AI models developing capabilities no one intended in a world where frontier AI labs are playing with open-ended RL to enhance their models.
My original post from 20 months ago (Language Models are a Potentially Safe Path to Human-Level AGI) elaborates on all the points I briefly made here. I don’t claim any particular novelty in what I wrote there, but I think it mostly stood the test of time[3] (despite being published only 5 months after the initial release of ChatGPT). In particular, I still think that pure LLMs (and more complex AI systems based on chaining LLM outputs together) are relatively safe, and that humanity would be safer sticking with them for the time being.
Artificial general intelligence (AGI) means totally different things to different people. When I use this term, the emphasis is on “general”, regardless of the model’s strength and whether it exceeds or is below human level. For example, I consider GPT-3 and GPT-4 as forms of AGI because they have general knowledge about the world.
LLMs do lie a lot (and hallucinate), but mostly in cute and naive ways. All the examples of untruthful behavior I have seen in LLMs so far seem perfectly consistent with the assumption that they just want a good reward for the next token (judged through RLHF). I haven’t seen any evidence of LLMs lying in strategic ways, which I define as trying to pursue long-term goals beyond having their next few tokens receive higher reward than they deserve.
Perhaps I was overly optimistic about the amount of economic activity focused on producing more capable AI systems without more capable underlying models.