That's an interesting point: why haven't we seen major improvements in LLMs at coding, for instance, despite their reasoning reaching a level that lets them become a GM on Codeforces?
I'd say this is a fundamental limitation of reinforcement learning. Using purely reinforcement learning is stupid. Look at humans: we do much more than that. We make observations about our failures and update on them, we develop our own heuristics for what it means to be good at something, and then we try to figure out how to improve by reasoning about it, watching other people, etc.
This kind of learning, which happens at inference time, is IMO the fundamental thing preventing LLMs from becoming more intelligent right now. That, and actual memory, of course.
So we're just making them improve at measurable tasks through naive reinforcement learning, but we don't let them generalize by using their understanding of those tasks to properly update themselves in less measurable domains...
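To make that contrast concrete, here's a deliberately toy sketch. Everything in it (the guessing-game task, the function names) is made up for illustration and has nothing to do with an actual LLM training setup; the point is just the difference between learning from a bare scalar reward and learning from observations about your own failures within the same episode.

```python
import random

# Toy illustration (all names hypothetical): reward-only "RL" vs.
# inference-time reflection on a measurable task (guess a hidden number).

TARGET = 7

def reward_fn(guess):
    # Measurable task: a scalar reward, with no explanation of *why* it failed.
    return 1.0 if guess == TARGET else 0.0

def naive_rl(episodes=1000):
    # Learn only from the scalar signal: sample blindly, keep whatever scored best.
    best, best_r = None, -1.0
    for _ in range(episodes):
        guess = random.randint(0, 100)
        r = reward_fn(guess)
        if r > best_r:
            best, best_r = guess, r
    return best

def reflective(max_tries=10):
    # Inference-time learning: after each failure, record a heuristic
    # ("too high" / "too low") and condition the next attempt on it.
    lo, hi = 0, 100
    for _ in range(max_tries):
        guess = (lo + hi) // 2
        if guess == TARGET:
            return guess
        if guess > TARGET:
            hi = guess - 1   # observed failure -> updated heuristic
        else:
            lo = guess + 1
    return guess

print("naive RL answer:", naive_rl())       # needs ~1000 blind samples
print("reflective answer:", reflective())    # gets there in a handful of informed tries
```

Obviously a caricature, but that's roughly the gap I mean: the second loop "updates itself" from structured observations of its own mistakes, which is exactly the part current training pipelines don't give the model at inference time.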
I'll understand if the following offends some of the researchers here who don't earn much, but I'm assuming a functioning society where good research is adequately rewarded.
How to fix universities: tie their revenue to the competency of their graduates by taking a percentage of graduates' future earnings for the next X years.