I've been thinking for a while that one could do syllabus learning for LLMs. It's fairly easy to classify text by reading age. So start training the LLM on only text with a low reading age, and then increase the ceiling on reading age until it's training on the full distribution of text. (https://arxiv.org/pdf/2108.02170.pdf experimented with curriculum learning in early LLMs, with little effect, but oddly didn't test reading age.)
To avoid distorting the final training distribution by much, you would need to raise the reading-age limit fairly fast, so that by the time it reaches its maximum you've only used up, say, ten percent of the text with low reading ages. In the final training distribution that text is then only about ten percent underrepresented, so the LLM is still capable of generating children's stories if needed (just slightly less likely to do so randomly).
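A minimal sketch of the reading-age ceiling described above, assuming each document already carries a precomputed `reading_age` score. The function names, the linear ramp, and the 6-to-18 age range are illustrative assumptions, not anything taken from the linked paper.

```python
import random

def reading_age_ceiling(step, ramp_steps, min_age=6.0, max_age=18.0):
    """Linearly raise the ceiling from min_age to max_age over ramp_steps.

    After the ramp ends, the ceiling stays at max_age, i.e. the model
    trains on the full distribution. All defaults are hypothetical.
    """
    if step >= ramp_steps:
        return max_age
    frac = step / ramp_steps
    return min_age + frac * (max_age - min_age)

def sample_batch(docs, step, ramp_steps, batch_size, rng):
    """Sample a batch from documents at or below the current ceiling.

    Each doc is a dict with a precomputed "reading_age" key (an assumption).
    """
    ceiling = reading_age_ceiling(step, ramp_steps)
    eligible = [d for d in docs if d["reading_age"] <= ceiling]
    return rng.sample(eligible, min(batch_size, len(eligible)))

# Toy corpus: only the reading ages matter for this illustration.
docs = [{"reading_age": age} for age in [5, 6, 7, 12, 18]]
rng = random.Random(0)
early = sample_batch(docs, step=0, ramp_steps=100, batch_size=10, rng=rng)
late = sample_batch(docs, step=100, ramp_steps=100, batch_size=10, rng=rng)
```

Keeping `ramp_steps` small relative to the whole run is what limits how much of the easy text gets used up before the ceiling reaches its maximum.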
The hope is that this would improve quality faster early in the training run, getting the LLM sooner to a level where it can extract more benefit from even the more difficult texts, and so hopefully reach a slightly higher final quality from the same amount of training data and compute. Otherwise, the LLM presumably gets less value from those really difficult texts that happen to be used early in the training run than it would if they'd come later. I'd expect any resulting improvement to be fairly small, but then this isn't very hard to do.
A more challenging approach would be to do the early training on low-reading-age material in a smaller LLM, and then do something like add more layers near the middle, or distill the behavior of the small LLM into a larger one, before continuing the training. Here the aim would be to also save some compute during the early parts of the training run. The potential issue is that the distillation process, or the loss of quality from adding new randomly-initialized layers, might end up costing more compute/quality than we'd saved/gained.
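A toy sketch of the "add more layers near the middle" idea. To sidestep the quality loss from randomly-initialized layers, it swaps in a different technique: new residual blocks are initialized with zero weight, so each one starts out as an identity function. The scalar "blocks" and every name here are hypothetical stand-ins for real transformer layers, and whether zero-initialized blocks train well afterwards is not something this sketch shows.

```python
import math

def residual_block(x, weight):
    """Toy residual block: x + weight * tanh(x). A weight of 0.0 is an identity."""
    return x + weight * math.tanh(x)

def forward(weights, x):
    """Run x through a stack of toy residual blocks, one per weight."""
    for w in weights:
        x = residual_block(x, w)
    return x

def grow_in_middle(weights, n_new):
    """Insert n_new zero-weight (identity) blocks at the middle of the stack."""
    mid = len(weights) // 2
    return weights[:mid] + [0.0] * n_new + weights[mid:]

small = [0.5, -0.3, 0.8, 0.1]   # "trained" small model (made-up weights)
big = grow_in_middle(small, 2)  # deeper model with identical behavior at first
```

Because the new blocks contribute nothing at first, the grown model starts from exactly the small model's outputs, and continued training is then free to move the new weights away from zero.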
[In general, the Bitter Lesson suggests that sadly the time and engineering effort spent on these sorts of small tweaks might be better spent on just scaling up more.]
So basically... LMs have to learn language in the exact same way human children do: start by grasping the essentials and then work upward to complex meanings and information structures.
Has anyone tried training LLMs with some kind of "curriculum" like this? With a simple dataset that starts with basic grammar and simple concepts (like TinyStories), and gradually moves on to more advanced/abstract concepts, building on what's been provided so far? I wonder if that could also lead to more interpretable models?
This is my thought exactly. I would try it, but I am poor and don't even have a GPU lol. This is something I'd love to see tested.
Hah yeah I'm not exactly loaded either, it's pretty much all colab notebooks for me (but you can get access to free GPUs through colab, in case you don't know).
I don't know anything about colab, other than that the colab notebooks I've found online take a ridiculously long time to load, often have mysterious errors, and annoy the hell out of me. I don't know enough AI-related coding stuff to use it on my own. I just want something plug and play, which is why I mainly rely on KoboldAI, Open Assistant, etc.
I think this offers an interesting possibility for another way to safely allow users to get benefit from a strong AI that a company wishes to keep private. The user can submit a design specification for a desired task, and the company with a strong AI can use the strong AI to create a custom dataset and train a smaller simpler narrower model. The end user then gets full access to the code and weights of the resulting small model, after the company has done some safety verification on the custom dataset and small model. I think this is actually potentially safer than allowing customers direct API access to the strong model, if the strong model is quite strong and not well aligned. It's a relatively bounded, supervisable task.
Existing large tech companies are using approaches like this, training or fine-tuning small models on data generated by large ones.
For example, it's helpful for the cold start problem, where you don't yet have user input to train/fine-tune your small model on because the product the model is intended for hasn't been launched yet: have a large model create some simulated user input, train the small model on that, launch a beta test, and then retrain your small model with real user input as soon as you have some.
Abstract
Implications
Interpretability
One part that isn't mentioned in the abstract but is interesting:
The difference between the highly activating tokens for a neuron is striking. Here's the tiny model:
...and here's GPT2-XL:
Capabilities
Again from the introduction (emphasis mine)
If this is true, there could be ways to drastically cut LLM training costs while maintaining (or increasing) the capabilities of the final model.
This could be related to dataset quality. QLoRA found (among other things) that a high-quality dataset of about 9,000 examples (OpenAssistant) beat a lower-quality dataset of 450k examples (subsampled FLAN v2) on chatbot performance.