This paper came out about a week ago. I am not the author - it was published anonymously.
I learned about the paper when JJ Hepburn shared it on Slack. It struck me as potentially quite important, and I hadn't seen it discussed on this forum yet:
Paper title: "Large Language Models Can Self-improve"
Author: Anonymous
Abstract: "Large Language Models (LLMs) have achieved excellent performances in various tasks. However, fine-tuning an LLM requires extensive supervision. Human, on the other hand, may improve their reasoning abilities by self-thinking without external inputs. In this work, we demonstrate that an LLM is also capable of self-improving with only unlabeled datasets. We use a pre-trained LLM to generate “high-confidence” rationale-augmented answers for unlabeled questions using Chain-of-Thought prompting and self-consistency, and fine-tune the LLM using those self-generated solutions as target outputs. We show that our approach improves the general reasoning ability of a 540B-parameter LLM (74.4%→82.1% on GSM8K, 78.2%→83.0% on DROP, 90.0%→94.4% on OpenBookQA, and 63.4%→67.9% on ANLI-A3) and achieves state-of-the-art-level performance, without any ground truth label. We conduct ablation studies and show that finetuning on reasoning is critical for self-improvement."
Makes sense that something like this would be possible. Humans can often teach themselves to be better at a skill through practice, even without a teacher or ground truth. Of course, this often runs into mostly useless, self-reinforcing attractors when the learner is unable to judge the true quality of their own attempts in the area of study. E.g., physics cranks, paranormal studies, etc.
I actually think it’s a good thing that chain-of-thought models can improve themselves like this. Chain-of-thought LLMs seem by far the most alignable of the plausible paths to AGI, since they directly incorporate a prior towards human-like cognition. I’d be a bit more worried about alignment if AGI approaches based on imitating / amplifying humans showed signs of flagging even at the current infra-human capabilities stage.
Further, stable self-improvement seems like a particularly sensitive requirement for a scalable alignment solution. It seems much better for the medium of this self-improvement to be human language, rather than something like self-modifying meta reinforcement learning from scratch.
Edited to add three more reasons why I think this is a good thing:
Crucially, this is a domain where convergence wasn't necessarily implied by the LM training objective. Language models are trained to imitate human text autoregressively given a single context. It's not a surprise when they generate human-like continuations of their provided contexts. That's what they're trained to do. However, that training objective does not directly imply that the LM should be capable of a human-esque self-improvement trajectory that spans multiple prompt-completion pairs. I thus take this result as an update towards there being convergence in the higher-order learning dynamics of humans and AIs.
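To make that concrete, the pretraining objective is just next-token prediction on fixed human text, something like (standard notation, not the paper's):

$$\mathcal{L}(\theta) = -\,\mathbb{E}_{x \sim \text{human text}} \left[ \sum_{t} \log p_\theta(x_t \mid x_{<t}) \right]$$

Nothing in this loss says anything about what happens when the model's own sampled outputs are fed back in as fine-tuning targets over multiple rounds, which is what makes the observed self-improvement a nontrivial result.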
I really appreciated this comment for making the connection between this paper and IDA.
More explicitly, to the extent that you can think of the original large language model as simulating a human, there's an analogy between:
This is also a great chance for IDA skeptics to try to...