Thank you, Seth, for the thoughtful reply. I agree with most of your points.
I agree that RL trained to accomplish things in the real world is far more dangerous than RL trained to just solve difficult mathematical problems (which in turn is more dangerous than vanilla language modeling). I worry that the real-world part will soon become commonplace, judging from current trends.
But even without the real-world part, models could still be incentivized to develop superhuman abilities and complex strategic thinking (which could be useful for solving mathematical and coding problems).
Regarding the chances of stopping/banning open-ended RL, I agree it's a very tall order, but my impression of the advocacy/policy landscape is that people might be open to it under the right conditions. At any rate, I wasn't trying to reason about what's reasonable to ask for, only about the implications of different paths. I think the discussion should start there, and then we can consider what's wise to advocate for.
For all of these reasons, I fully agree with you that work on demonstrating these risks in a rigorous and credible way is one of the most important efforts for AI safety.
I'm all in for CoT! But when you RL a model to produce CoTs that are better at solving difficult problems, you risk encouraging i) CoTs that carry side information that's only known to the model, ii) superhuman capabilities, and iii) planning ahead and agency in ways that are difficult to anticipate. Whether you produce the CoT with a pure LLM or a model that's also undergone RL, you end up with a series of tokens where each token was produced by a fully internalized reasoning process. The only difference is that RL encourages this reasoning process to be more powerful, more agentic, and less predictable. What's the advantage of RL for safety?
I explain it in more detail in my original post.
In short, in standard language modeling the model only tries to predict the most likely immediate next token (T1), then the most likely token after that (T2) given T1, and so on. In RL, by contrast, the model is optimizing the whole sequence of next tokens (T1, ..., Tn) at once, so the training signal for the immediate next token (T1) depends on the rewards accumulated all the way up to Tn.
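To make that concrete, here is a minimal sketch of the two objectives, assuming a REINFORCE-style policy gradient for the RL case (the notation, with p_theta for the model and R for the sequence-level reward, is mine for illustration; PPO-style objectives add a baseline and clipping but share the same structure):

```latex
% Language modeling: each token's loss depends only on predicting that one token
\mathcal{L}_{\mathrm{LM}}(\theta) = -\sum_{t=1}^{n} \log p_\theta(T_t \mid T_{<t})

% RL (REINFORCE-style): the gradient on every token, including the very first one (T_1),
% is weighted by the reward of the entire sequence T_1, ..., T_n
\nabla_\theta J(\theta) = \mathbb{E}_{T_{1:n} \sim p_\theta}\Big[\, R(T_{1:n}) \sum_{t=1}^{n} \nabla_\theta \log p_\theta(T_t \mid T_{<t}) \,\Big]
```

So under RL, the choice of T1 is optimized for the downstream consequences of the whole trajectory rather than for matching the most likely next token, which is exactly the kind of look-ahead I'm worried about.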
Thank you for this comment!
I first want to note that your comment implies my post is saying things which I don't think it does (or at least I didn't mean it to):
- I didn't argue that language models are internally human-like, only externally. I do think however that compositional AI systems made of language models should be internally human-like.
- I didn't advocate for training AGI end-to-end (but rather taking the factored cognition approach).
I agree with you that a human-ish-level AGI would be VERY dangerous regardless of how aligned it is (mostly because there are 8 billion people on this planet, and it's sufficient for just a few of them to use it in stupid/evil ways to put us all in danger).
Thank you for this comment. I'm curious to understand the source of disagreement between us, given that you generally agree with each of the sub-points. Do you really think that the chances of misalignment with LM-based AI systems are above 90%? What exactly do you mean by misalignment in this context, and why do you think it's the most likely result with such AI? Do you think it will happen even if humanity sticks with the paradigm I described (of chaining pure language models while avoiding training models on open-ended tasks)?
I want to also note that my argument is less about "developing language models was counterfactually a good thing" and more "given that language models have been developed (which is now a historic fact), the safest path towards human-level AGI might be to stick with pure language models".
Thank you!
I don't have any concrete plans, but maybe.
Why are you conditioning on superhuman AGI emerging? I think it's something very dangerous that our society isn't ready for. We should pursue a path where we can enjoy as many of the benefits of sub-human-level AGI (of the kind we already have) as possible, without risking uncontrolled acceleration. Pushing for stronger capabilities with open-ended RL is counterproductive for the very scenario we need to avoid.
It sounds like you believe that training a model with RL would make it somehow more transparent, whereas I believe the opposite. Can you explain your reasoning?
Do you disagree that RL pushes models to be better at planning and exceeding human-level capabilities?
I like the idea of discouraging steganography, but I still worry that given strong enough incentives, RL-trained models will find ways around this.