A friend of mine thinks that RL is a dead end: LLMs are much better at problem solving, exploration, and exploitation than any RL algorithm. And I agree that LLMs are better than RL on RL's tasks: companies even have LLMs controlling robots nowadays.
The part where we disagree is that I see RL as the step that goes beyond LLMs. LLMs can only consume so much data, and only get so good at predicting the next word. At some point, they will predict exactly what an expert would say, and then they will be exactly expert level (except much faster and more scalable).
If you combine a bunch of LLMs into some organizational structure, then at best you get a company of experts.[1]
But to create a system beyond the capabilities of LLMs, whether it is built by humans or by expert-level LLMs, the training will need to go beyond human-generated data. The system will need to intelligently explore the world to test its hypotheses and improve its mind. Essentially, for LLMs to go beyond LARPing human experts, they need RL.
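To make the imitation ceiling concrete, here is a minimal toy sketch (the bandit environment, reward numbers, and exploration rate are all invented for illustration, not taken from any real system): a learner that copies a suboptimal expert is capped at the expert's reward, while even crude RL exploration can surpass it.

```python
import random

TRUE_MEANS = [0.4, 0.7, 0.9]   # arm 2 is secretly the best
EXPERT_ARM = 1                 # the "expert" always pulls arm 1 (mean 0.7)

def pull(arm: int) -> float:
    # Bernoulli reward with the arm's true mean
    return float(random.random() < TRUE_MEANS[arm])

def imitator(steps: int) -> float:
    """Copies the expert exactly; average reward is capped near 0.7."""
    return sum(pull(EXPERT_ARM) for _ in range(steps)) / steps

def epsilon_greedy(steps: int, eps: float = 0.1) -> float:
    """Mostly exploits its current best estimate, but explores with
    probability eps; it eventually discovers arm 2 (mean 0.9)."""
    counts = [0] * len(TRUE_MEANS)
    values = [0.0] * len(TRUE_MEANS)
    total = 0.0
    for _ in range(steps):
        if random.random() < eps:
            arm = random.randrange(len(TRUE_MEANS))  # explore
        else:
            arm = max(range(len(TRUE_MEANS)), key=lambda a: values[a])  # exploit
        r = pull(arm)
        counts[arm] += 1
        values[arm] += (r - values[arm]) / counts[arm]  # running-mean update
        total += r
    return total / steps

print(f"imitation: {imitator(10_000):.3f}")        # ~0.70, the expert's level
print(f"rl:        {epsilon_greedy(10_000):.3f}")  # ~0.87+, beyond the expert
```

No amount of extra imitation data moves the first number above 0.7; only the exploring agent ever finds the better arm.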
In Project Lawful, Eliezer's smart character remarks that there are some hypotheses that are impossible (computationally intractable) to learn passively from data (as LLMs do), but possible to learn actively (as RL can). I couldn't find that proof, and would appreciate it if someone could find it.[2]
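As a much weaker toy illustration of the active/passive gap (this shows a query-efficiency gap, not the computational intractability the quote claims; the task and all numbers below are invented for the sketch): an active learner that chooses its own queries can pin down a hidden threshold exponentially faster than a passive learner that only sees random samples.

```python
import random

N = 1_000_000
secret = random.randrange(N)   # hidden threshold: f(x) = 1 iff x >= secret

def f(x: int) -> int:
    return int(x >= secret)

def passive_learner(num_samples: int) -> int:
    """Only sees random (x, f(x)) pairs; best guess is the midpoint of the
    gap between the largest 0-example and the smallest 1-example."""
    lo, hi = -1, N
    for _ in range(num_samples):
        x = random.randrange(N)
        if f(x):
            hi = min(hi, x)
        else:
            lo = max(lo, x)
    return (lo + hi + 1) // 2

def active_learner() -> tuple[int, int]:
    """Chooses its own queries: binary search recovers the threshold
    exactly in about log2(N) ~ 20 queries."""
    lo, hi, queries = 0, N, 0
    while lo < hi:
        mid = (lo + hi) // 2
        queries += 1
        if f(mid):
            hi = mid
        else:
            lo = mid + 1
    return lo, queries

guess = passive_learner(num_samples=20)
exact, q = active_learner()
print(f"passive, 20 random samples: off by {abs(guess - secret)}")  # typically thousands
print(f"active,  {q} chosen queries:  off by {abs(exact - secret)}")  # exactly 0
```

The footnote's pointer to Dana Angluin is in the right neighborhood: her query-learning model is precisely about what becomes learnable when the learner gets to ask questions rather than wait for samples.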
And if I'm wrong and there are intelligence amplification methods for LLMs that do not involve RL, I'd like to know I'm wrong.
[1] The expert-level or the expert-organization AI may be enough to kickstart an intelligence explosion, but the point I'm trying to make stands.
[2] The work of Dana Angluin might be a starting point for a search.