Continual learning is a blessing of scale: https://www.reddit.com/r/mlscaling/search?q=continual+learning&restrict_sr=on&include_over_18=on
(I don't mind saying this because it is obvious to anyone following the literature who has watched prior blessings of scale happen, and in particular, how each subfield copes with the realization that their problem was never a real one and all their clever ideas only mattered at scales which are quickly becoming OOMs irrelevant; and the continual-learning people already are going through the stages of grief, so a throwaway LW comment from me makes no difference.)
If you are trying to model DL capabilities, you should just assume continual learning is already solved for all intents and purposes at GPT-4 scale (and note, for example, OA's revealed preferences in terms of training models from scratch vs further training old checkpoints) until you see an extremely compelling empirical demonstration to the contrary. We don't see it much overtly, simply because full-blown 'finetuning' is often not easy and much more expensive, and can be replaced to a considerable degree by tricks like retrieval or better prompts when your underlying model is really smart.
Fascinating, thanks for the research. Your analysis makes sense and seems to indicate that for most situations, prompt engineering is always the first plan of attack and often works well enough. A step up from there, OpenAI etc. would most likely experiment with fine-tuning or RLHF as it relates to a specific business need. To train a better chatbot and fill in any gaps, they would probably get more bang for their buck by simply fine-tuning on a large dataset that matched their needs. For example, if they wanted better mathematical reasoning, they'd probably pay people to generate detailed scratchwork and fine-tune on that dataset in batch, rather than set up an elaborate "tutor" framework. Continual learning itself would be mainly applicable for research into whether the thing spontaneously develops a sense of self, or for seeing whether it helps with the specific case of long term planning and agency. These are things the general public is fascinated with, but they don't yet seem to be the most promising direction for improving a company's bottom line.
This was the very first thing I thought of when language models came to my attention as "hey this looks like it actually might be the thing that the future looks like" (years ago). Since I'm not particularly smart or particularly well-informed, I conclude that I was not the first person to come up with this idea (or the tenth, or even the ten-thousandth). I strongly suspect that the simplest possible approach of "just turn on backprop" was tried within the first couple of days of the weights of a GPT model being available. For context, nostalgebraist-autoresponder has been live on Tumblr since late 2019.
I do concur with you that this is an important thing to explore. However, I am quite confident that "do the thing that is obvious to someone with no background encountering the field for the first time" is not an effective approach.
When I briefly looked into the academic research on this, the general impression I picked up was:
I have noted the problem of catastrophic forgetting in the section "why it might not work". In general I agree continual learning is obviously a thing; otherwise I would not have used the established terminology. What I believe, however, is that the problems we face with continual learning in e.g. a 100M-parameter BERT model may not be the same as what we observe in models that can now meaningfully self-critique. We have explored this technique publicly, but have we tried it with GPT-4? The "publicly" part was really just a question of whether OpenAI actually did it on this model or not, and it would be an amazing data point if they could say "We couldn't get it to work."
Ah, so the point was whether that had been explored publicly on the very largest language models that exist, because of the whole "sometimes approaches that didn't work at small scale start working when you throw enough compute at them" thing? Makes sense.
Essentially yes, heh. I take this as a learning experience for my writing; I don't know what I was thinking, but it is obvious in hindsight that saying to just "switch on backprop" sounds very naive.
I also confess I haven't done the due diligence to find out the largest model this has actually been tried on, e.g. whether someone has tried it with Pythia or LLaMa. I'll do some more googling tonight.
One intuition for why the largest models might be different is that part of the training/fine-tuning will involve the model's own output, and the largest models are the ones whose output is not essentially word salad.
It seems you're dismissing things like autoGPT for their lack of long term memory. But they have long term memory. They have an episodic memory that works much like human episodic memory. We have a special system to store specifics, because continuous learning isn't adequate to do that alone without using a learning rate high enough to cause catastrophic interference.
The vector-based episodic memory in auto-GPT operates much like human EM: it searches for relevant past experiences and brings them back into the context window (roughly equivalent to human working memory). These systems don't seem to work very well just yet, but they are literally first attempts.
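For concreteness, here is a minimal sketch of that retrieval step (the `embed` function, the class name, and the plain top-k cosine recall are my own placeholders, not auto-GPT's actual implementation):

```python
import numpy as np

class EpisodicMemory:
    """Toy vector store: embed past experiences, recall the most similar ones."""

    def __init__(self, embed):
        self.embed = embed      # hypothetical callable: str -> 1-D np.ndarray
        self.texts = []         # stored past experiences
        self.vectors = []       # their embeddings

    def store(self, text):
        self.texts.append(text)
        self.vectors.append(self.embed(text))

    def recall(self, query, k=3):
        # Cosine similarity between the query and every stored memory.
        q = self.embed(query)
        sims = [float(np.dot(q, v) / (np.linalg.norm(q) * np.linalg.norm(v) + 1e-8))
                for v in self.vectors]
        top = np.argsort(sims)[-k:][::-1]
        return [self.texts[i] for i in top]
```

The k recalled snippets get prepended to the next prompt, i.e. pulled back into the context window ("working memory") before the model is called again.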
Continuous learning will doubtless become part of advanced systems at some point, but that's not likely to substitute for episodic memory. To be fair, this is an empirical question. I'm reasoning based on catastrophic interference findings in lots of networks, but who knows.
See my recent post on the topic if you like.
Thanks, this is a great analysis of the power of agentized LLMs, which I probably need to spend some more time thinking about. I will work my way through the post over the next few days. I briefly skimmed the episodic memory section for now, and I see it is like an embedding-based retrieval system for past outputs/interactions of the model, reminiscent of the way some helper chatbots look up answers from FAQs. My overall intuitions on this:
It might be possible that a hybrid approach would end up working better: perhaps not even "continuous learning", but batched episodic learning ("sleep", though I'm not sure how far that analogy goes).
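Roughly, the sketch I have in mind is something like the following (the `fine_tune` callable, the buffer size, and the consolidation schedule are all placeholders, not a tested recipe):

```python
# Batched episodic learning: accumulate interactions during "waking" use, then
# consolidate them into the weights in an offline pass, loosely like sleep.
episode_buffer = []

def on_interaction(prompt, completion):
    episode_buffer.append({"prompt": prompt, "completion": completion})

def consolidate(model, fine_tune, min_episodes=1000):
    # Run only once enough episodes have accumulated ("at night").
    global episode_buffer
    if len(episode_buffer) >= min_episodes:
        fine_tune(model, episode_buffer)   # one offline training pass
        episode_buffer = []                # wake up with consolidated weights
```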
It's possible this is being downvoted because it might be considered dangerous capability research. It just seems highly unlikely that this would not be one of many natural research directions, perhaps already attempted, and I figure we might as well acknowledge it and find out what it actually does in practice.
Or maybe the downvotes are because it "obviously won't work", but that is not obvious to me, and I would welcome discussion on that point.
I'm worried that no matter how far we go, the next step will be one of the natural research directions.
Yes. This is very dangerous and the most likely way AGI will actually pan out.
I have abstained from commenting too much about this so far. It seems the "djinni is out of the bottle".
Worth noting that LLMs are no longer using quadratic context window scaling; see e.g. Claude-Long. It seems they've figured out how to make it ~linear. Looking at GPT-4 with its 32K context window option for corporate clients, it seems like they're also not using quadratic scaling any more.
Continual learning, I would argue, is one alternative for getting long term planning and agency rather than a necessity. Augmented LLMs with long term memory retrieval can do long term planning, assuming the model is already powerful enough. Also, agency just emerges from the simulator naturally.
I'm not convinced continual learning is even the most likely path to AGI.
(Original title: The Continual Learning Overhang, but too many Overhang posts)
TL;DR Continual learning could lead to emergent causal understanding or agency and we should probably study it now before it’s too late.
Current LLMs are very impressive but suffer from a few key missing pieces (Sarah Constantin):
Sarah argues that the current approaches are very unlikely to spontaneously develop these characteristics, and that doing so would require a ground-level rethinking of how AI is done. I am not so convinced we've seen the full potential of the path we are on.
I think that we have yet to explore (publicly?) the potential of "switching on" backpropagation/training while in inference mode. Most models have a clean separation of "train" and "inference": inference applies the model to generate output token by token, but the model is no longer learning.
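In its crudest form, I am imagining something like the following minimal sketch, assuming a HuggingFace-style causal LM. The model name, the learning rate, and the choice to take a gradient step on the model's own completion are all placeholders; whether any of this behaves sensibly at GPT-4 scale is exactly the open question.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "gpt2"  # placeholder; the interesting case is a much larger model
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-6)  # the right LR is an open question

def generate_and_learn(prompt):
    # Inference as usual: generate a continuation.
    inputs = tokenizer(prompt, return_tensors="pt")
    with torch.no_grad():
        output_ids = model.generate(**inputs, max_new_tokens=64)
    # Then, instead of discarding the interaction, take one gradient step on it,
    # i.e. treat prompt + completion as one more training example.
    model.train()
    loss = model(output_ids, labels=output_ids).loss
    loss.backward()
    optimizer.step()
    optimizer.zero_grad()
    model.eval()
    return tokenizer.decode(output_ids[0], skip_special_tokens=True)
```

One could just as easily take the gradient step on the user's input, or on the whole conversation; the point is only that inference and weight updates no longer live in separate phases.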
Why I am skeptical of agency in current models / AutoGPT
AutoGPT is very interesting, but seems to be plagued by trouble actually completing tasks. It may improve with better code & prompts, but I still suspect it will miss the mark.
The context window is a very strong handicap in practice. To be a proper agent, one must have a long term goal and an up-to-date picture of the current state of the world. Without updating the weights of the model, all of this must fit inside the context window. Why this is suboptimal:
Relationship of Continual Learning to Context Window size
The context window is currently the only way a model can preserve knowledge it acquires along the way. Most simply, with continual learning, that knowledge could instead be carried in the weights of the model.
Of course, in practice, does updating the weights really allow for efficient data transfer? That remains to be seen. However, we have some intuition for why it might:
Why not just increase the context window size?
The main reason is that we need to know how much capability is being left on the table by our current setup.
Longer context windows are another clear direction to look at, but there is already a ton of research going on there (Hyena, Survey of methods). My intuitions for why a continual learning approach would be more scalable:
That said, fast attention would probably be invaluable in continuous time-domain cases, because even a "short time window" in a robotics setting could consist of roughly a thousand frames (30 fps * 30 seconds = 900), times however many tokens it takes to encode a frame.
Why it might not work
It may be the case that this is a very hard problem. In the ML models I have trained (up to 500M parameters), the LR is typically annealed and training converges, so examples seen towards the end effectively carry much less weight. (Typically the examples seen at the end are repeats anyway, assuming you do more than 1 epoch.) Trying to continually train the model (as opposed to fine-tuning on a separate task) is a bit finicky: I would wonder how high to set the LR, whether there would need to be experience replay, etc. Also, catastrophic forgetting seems to be a real issue.
These issues plague smaller models to a large extent, but we do not know whether larger models might be more robust to them. Large models seem capable of learning in ways smaller ones are not (for example, learning from natural language feedback).
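For reference, the kind of experience replay I have in mind is just mixing old data back into each continual-learning step, so that new interactions do not simply overwrite old behaviour. The `old_corpus` and the mixing ratio here are placeholders, not a tested recipe:

```python
import random

def replay_batch(new_examples, old_corpus, replay_ratio=0.5):
    """Interleave fresh interaction data with replayed pretraining-style data."""
    n_replay = int(len(new_examples) * replay_ratio)
    replayed = random.sample(old_corpus, k=min(n_replay, len(old_corpus)))
    batch = list(new_examples) + replayed
    random.shuffle(batch)
    return batch
```

Whether this kind of mitigation is even necessary at GPT-4 scale is part of what an experiment would tell us.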
Why should we do it?
Without continual learning, the model never learns to make an intervention and see what happens. The model simulates but can never internally develop & store a sense of self. Maybe it will be harder than just “turning on backprop”, but what if it isn’t much more complex than that? Do we want to know? I think we want to know.
We are still at the point where GPT-4 seems to have enough blind spots that we can still "switch it off". If this line of intervention can strengthen the capabilities of the model, we should find out now rather than later, when we have an even more powerful model. Suppose we have GPT-7 or something: capable of superhuman performance on a variety of tasks, but still reasonably docile and not very capable of long term planning. I would not want to switch that one "on".
Currently OpenAI is the only company with such a powerful model, one whose output demonstrates non-shallow understanding of the world. Ideally, they should experiment along these lines in collaboration with ARC and publish their findings. I would also be curious whether anyone can get LLaMa to do this; even a negative result would be an interesting finding.
Or, Can This Lead to a Concrete Policy Proposal?
As Sarah points out, doing interventions in the real world (with robotics) would be extremely expensive, so we can probably stick to chatbots for now. One question that comes up, though, is: does the model need millions of interactions/interventions to really learn this, or is it more sample efficient?
We do have a source of millions of chat interactions to learn from: all the ChatGPT conversation histories! If continual learning turns out to be quite powerful, would we eventually want to discourage companies from training on their own raw chat logs for models larger than GPT-4?
One could even argue that we already don't want this, due to privacy concerns and potential data leakage between chat sessions. (Despite my calling for research into this, doing it on massive conversation data seems like a bad idea.)