In a controlled human study, responses from LIMA are either equivalent or strictly preferred to GPT-4 in 43% of cases;
I'm not sure how well this metric tracks what people actually care about: performance on particular downstream tasks (e.g. passing a law exam, writing bug-free code, automating alignment research, etc.).
The abstract feels overly long.
IMO, abstracts should be either Actually Short™ or broken into paragraphs.
Copied it from the paper. I could break it into several paragraphs, but I figured bolding the important bits was easier. Might break up abstracts in future linkposts.
The bold font on LessWrong is too small a difference from normal text for this to work well, IMO. (I wish the bolding were stronger so it was actually useful for communicating to others.) Just messing around with one suggested way to do this from SO:
Abstract
Implications
Data Quality & Capabilities
Along with TinyStories and QLoRA, this makes me increasingly convinced that data quality is all you need. That definitely seems to be the case for finetuning, and it may be the case for base-model training as well. Better scaling laws through a higher-quality corpus?
Also, for those who haven't updated: it seems very likely that GPT-4 equivalents will be essentially free to self-host and tune within a year. Plan for this!
Perplexity != Quality
Because perplexity doesn't track output quality, the authors manually select checkpoints between the 5th and 10th epochs (out of 15) using the held-out 50-example development set.
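A minimal sketch of that selection procedure, assuming a hypothetical `score_on_dev_set` function standing in for whatever quality judgment (human or automatic) is applied to each checkpoint's generations; the paper's actual selection was manual, so this is illustrative only:

```python
def select_checkpoint(checkpoints, dev_set, score_on_dev_set):
    """Return the checkpoint whose generations score best on the dev set.

    Note: we rank by a generation-quality score, not by perplexity,
    since the two can diverge.
    """
    best_ckpt, best_score = None, float("-inf")
    for ckpt in checkpoints:
        score = score_on_dev_set(ckpt, dev_set)  # e.g. mean quality rating
        if score > best_score:
            best_ckpt, best_score = ckpt, score
    return best_ckpt
```

For example, scoring checkpoints from epochs 5 through 10 on a 50-example dev set and keeping whichever one rates highest, regardless of which had the lowest validation perplexity.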