This paper came out about a week ago. I am not the author - it was published anonymously.
I learned about the paper when JJ Hepburn shared it on Slack. It struck me as potentially quite important, and I hadn't seen it discussed on this forum yet:
Paper title: "Large Language Models Can Self-improve"
Author: Anonymous
Abstract: "Large Language Models (LLMs) have achieved excellent performances in various tasks. However, fine-tuning an LLM requires extensive supervision. Human, on the other hand, may improve their reasoning abilities by self-thinking without external inputs. In this work, we demonstrate that an LLM is also capable of self-improving with only unlabeled datasets. We use a pre-trained LLM to generate “high-confidence” rationale-augmented answers for unlabeled questions using Chain-of-Thought prompting and self-consistency, and fine-tune the LLM using those self-generated solutions as target outputs. We show that our approach improves the general reasoning ability of a 540B-parameter LLM (74.4%→82.1% on GSM8K, 78.2%→83.0% on DROP, 90.0%→94.4% on OpenBookQA, and 63.4%→67.9% on ANLI-A3) and achieves state-of-the-art-level performance, without any ground truth label. We conduct ablation studies and show that finetuning on reasoning is critical for self-improvement."
Makes sense that something like this would be possible. Humans can often teach themselves to be better at a skill through practice, even without a teacher or ground truth. Of course, this often runs into mostly useless, self-reinforcing attractors when the learner is unable to judge the true quality of their own attempts in the area of study. E.g., physics cranks, paranormal studies, etc.
I actually think it’s a good thing that chain-of-thought models can improve themselves like this. Chain-of-thought LLMs seem by far the most alignable of the plausible paths to AGI, since they directly incorporate a prior towards human-like cognition. I’d be a bit more worried about alignment if AGI approaches based on imitating / amplifying humans showed signs of flagging even at the current infra-human capabilities stage.
Further, stable self-improvement seems like a particularly sensitive requirement for a scalable alignment solution. It seems much better for the medium of this self-improvement to be human language, rather than something like self-modifying meta reinforcement learning from scratch.
Edited to add three more reasons why I think this is a good thing:
Crucially, this is a domain where convergence wasn't necessarily implied by the LM training objective. Language models are trained to imitate human text autoregressively given a single context. It's not a surprise when they generate human-like continuations of their provided contexts. That's what they're trained to do. However, that training objective does not directly imply that the LM should be capable of a human-esque self-improvement trajectory that spans multiple prompt-completion pairs. I thus take this result as an update towards there being convergence in the higher-order learning dynamics of humans and AIs.
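To make that concrete, the pretraining objective is just next-token prediction on fixed human text, something like (standard notation, not the paper's):

$$\mathcal{L}(\theta) = -\,\mathbb{E}_{x \sim \text{human text}} \left[ \sum_{t} \log p_\theta(x_t \mid x_{<t}) \right]$$

Nothing in this loss says anything about what happens when the model's own sampled outputs are fed back in as fine-tuning targets over multiple rounds, which is what makes the observed self-improvement a nontrivial result.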
I really appreciated this comment for making the connection between this paper and IDA.
More explicitly, to the extent that you can think of the original large language model as simulating a human, there's an analogy between:
This is also a great chance for IDA skeptics to try to...