It has been conjectured that Stochastic Gradient Descent with the right hyperparameters approximates Bayesian learning. Bayesian learning is general, so it should be possible to pretrain a transformer to do anything that isn't actually beyond the capabilities of its neural net architecture (e.g. that doesn't require more processing per token than it's capable of doing in a single forward pass). I gather you don't disagree with that.
It has also been conjectured that LLM in-context learning approximates Bayesian learning. You're clear that you think it is less capable than SGD. Is that because:
a) you don't think it approximates Bayesian learning
b) you think it's a significantly less good approximation to Bayesian learning, or
c) you think there's a significant limit, beyond just context length, to how much it can learn: i.e. that it approximates Bayesian learning just fine at first, but then runs out of some form of capacity, potentially before it runs out of context length?
Of these, issues (a) and (b) are clearly inherently fatal, whereas (c) would suggest an architectural workaround: in-context-learn new information up to just below that capacity limit, then somehow use it to generate more training data containing that information, then use SGD to train either a new or a modified model containing that new information, and iterate. (Obviously retraining from scratch is very, and increasingly, expensive, while retraining iteratively faces known challenges from catastrophic forgetting.)
No opinion about (a) and (b), but Bayesian inference can only do as well as its hypothesis space, and I think the true hypothesis here is WAY outside the hypothesis space, regardless of context size. That’s what I was trying to get across with that table I put into the OP.
So maybe that’s (c), but I don’t really know what you mean by “capacity” in this context.
Your last paragraph sounds to me like brainstorming how to build a continual learning setup for LLMs. As I mentioned at the bottom, such a system might or might not exist, but that would be out of scope for this post. If something in that genre worked, the “continual learning” in question would be coming from PyTorch code that assembles data and runs SGD in a loop, not from imitation learning, if I’m understanding your text correctly.
If even some hypothesis "very close" to the current hypotheses + priors were missing for in-context learning, then you'd get (a) or (b). If all hypotheses close to the current hypotheses + priors could be explored with near-full Bayesian accuracy, but there was some limit, some metric under which things "further away" in that metric space both took more evidence to reach and also had more and more of the possible hypotheses simply missing and not creatable during in-context learning, then you'd get (c).
There's a limit in how far I want to go brainstorming capabilities improvements, but basically what I was suggesting is that an obvious approach one might try is first learning things in-context, then doing some form of SGD imitation learning from that to train a model that now already knows how to do that and doesn't need to use a lot of context to figure it out.
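The "learn in-context, then distill into weights via SGD" loop sketched above can be caricatured in a few lines of numpy (everything here, from the linear "teacher" to the training details, is an illustrative stand-in of mine, not a claim about real LLM pipelines):

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy stand-in for knowledge that lives only in the context window:
# the teacher computes y = w_ctx @ x, where w_ctx is available
# in-context but is not part of any model's weights.
w_ctx = rng.normal(size=3)

def teacher_with_context(x):
    """Stand-in for a model that can solve the task, but only because
    the relevant information (w_ctx) sits in its context window."""
    return w_ctx @ x

# Step 1: use the in-context learner to generate training data.
X = rng.normal(size=(256, 3))
y = np.array([teacher_with_context(x) for x in X])

# Step 2: SGD-train a context-free student on that synthetic data,
# baking the in-context knowledge into its weights.
w_student = np.zeros(3)
lr = 0.05
for _ in range(500):
    idx = rng.integers(0, len(X), size=32)
    xb, yb = X[idx], y[idx]
    grad = xb.T @ (xb @ w_student - yb) / len(xb)
    w_student -= lr * grad

# After distillation, the student reproduces the behavior with no context.
assert np.allclose(w_student, w_ctx, atol=1e-2)
```

The point of the sketch is only the data flow: the context-dependent behavior becomes weight-encoded behavior, so the next iteration can start from a model that "already knows."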
LLMs having limitations that human learning does not
I am seeing quite a bit of progress in continual learning for LLMs recently.
Among a variety of very promising results, I have been particularly impressed by the recent Sakana work, Instant LLM Updates with Doc-to-LoRA and Text-to-LoRA, Feb 2026:
https://sakana.ai/doc-to-lora/ (links to arxiv and github are inside, the key paper is https://arxiv.org/abs/2602.15902)
The main idea is to combine two technologies that have been known for several years: LoRA (low-rank adaptation, used for fine-tuning) and hypernetworks capable of instantly guessing, with reasonable accuracy, the results of the first few thousand steps of gradient descent for a wide variety of problems.
So what they do is train hypernetworks capable of instantly generating (or instantly updating) LoRA adapters based on the past experience of the system in question. LLMs are pretty good at instant "in context" learning, but it has been less clear how to efficiently distill this learning into weights. This work enables that kind of distillation without waiting for a fine-tuning process to complete.
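To make the plumbing concrete, here is a minimal numpy caricature of the idea (the shapes, the single-linear-map "hypernetwork", and all names are my illustrative assumptions; the paper's actual architecture is much richer, and here the hypernetwork is untrained):

```python
import numpy as np

rng = np.random.default_rng(1)
d_model, rank, d_doc = 8, 2, 4

# Frozen base weight of one linear layer inside the LLM.
W_base = rng.normal(size=(d_model, d_model))

# "Hypernetwork": here just one linear map from a document embedding to
# the flattened LoRA factors A (d_model x rank) and B (rank x d_model).
H = rng.normal(size=(2 * d_model * rank, d_doc)) * 0.1

def doc_to_lora(doc_embedding):
    """Instantly guess a LoRA adapter from a document embedding,
    instead of running thousands of fine-tuning steps."""
    flat = H @ doc_embedding
    A = flat[: d_model * rank].reshape(d_model, rank)
    B = flat[d_model * rank:].reshape(rank, d_model)
    return A, B

def adapted_forward(x, doc_embedding):
    A, B = doc_to_lora(doc_embedding)
    # Standard LoRA: the frozen weight plus a low-rank correction.
    return (W_base + A @ B) @ x

x = rng.normal(size=d_model)
doc = rng.normal(size=d_doc)
y = adapted_forward(x, doc)  # output of the "instantly adapted" layer
```

The base weights never change; only the cheap-to-produce low-rank correction carries the new experience.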
This does not directly contradict the post (this is not imitation learning as such), but the wider thesis, that LLMs are at an inherent disadvantage compared to humans in the realm of continual learning, is very questionable in light of recent progress in this area.
Those are cool ideas, but I don’t think they qualify as (what I’m calling) “real” continual learning, as defined in the section “Some intuitions on how to think about ‘real’ continual learning”.
The disagreement might be on what we think about the models being able to do those things you mention there, but in a static “frozen” situation.
To the extent that the models are able to do those things (“true understanding”, “true knowledge”, “true creativity”, etc.) in a static “frozen” world, the notion of “continual learning” is reducible to its conventional interpretation (which is the ability to accommodate and internally integrate new information, new skills, and new discoveries on the fly without degradation of earlier learned skills and qualities).
But if one does not think that their performance for the static “frozen world” and “frozen models” situation is satisfactory, then no, it’s indeed unlikely that those methods would rescue that.
(If one has a situation for some class of models and method where static “frozen” models don’t possess those qualities, but those qualities can be rescued by dynamic “continual learning”, it should not be too difficult to convert those “continual learning” methods into producing “frozen” snapshots having those qualities to a fairly high degree. I think I more or less know how to do that. So, perhaps, your critique of the status quo is not actually about continual learning, but about more fundamental questions, about whether they are capable of “real” learning at all, whether continual or not.)
Do you think stronger versions of the result mentioned by mishka might be able to count as "real" continual learning, if the 'mutable state' (e.g. the LoRA adapter) has enough capacity to exhibit the gains in capabilities?
E.g. in the 'country of geniuses in the datacenter' model, while a bunch of LLMs together would not be able to develop new fields, it seems at least possible for a bunch of LLMs augmented with large 'mutable states' (such as LoRA adapters) to do so, at least to the limit of new things 'within' the capacity of the 'fixed base weights + learned LoRA space'. Current LoRA adapters in use are too small for that, but one could imagine much larger ones (including ones with just as many parameters as current models have weights).
Now, suppose that I take a generic imitation-learning algorithm (e.g. self-supervised learning in a transformer-architecture neural net, just like LLM pretraining), and have it watch a deep Q network play Atari Breakout, as it starts from random initialization, and gets better and better over 1M iterations. (...)
Question: Is this trained imitation-learner actually a good imitation of the deep Q network?
I am not familiar with the training procedure mentioned by mishka to train hypernetworks. But if it does work, might it not make it possible to 'imitate the learning updates from true learners' -- maybe from watching 10,000 deep Q networks play thousands of games instead -- and then apply such updates recursively?
Curated! I think the immense capability and usefulness of current LLMs, and specifically their increasing ability to take over tasks from humans, distracts from the ways in which they are strange minds, different from human minds. I like this post for digging into that. It's well known that LLMs lack memory, and now we give them scratchpads and other files they can reference as a substitute, and yet it's not the same (as I keep experiencing in my own use). I appreciate this post for digging in and making claims like: no amount of context window or scratchpads, etc., substitutes for actual continual learning. Without asserting this is correct, it's a discussion I like. One reason is that I think significant and scary things might happen if/when we move beyond current architectures – which are already very capable – to those without these limitations. Good predictions there will come from understanding what is going on with the current models. Kudos. I like this line of work.
When the imitation learning is done, the transformer weights are frozen, and the corresponding trained model is given the impossible task of using only its activations, with fixed weights, to imitate what happens when the target continual learning algorithm changes its weights over millions of steps of (in this case) TD learning.
I think it's important that the AI doesn't need to do all of its continual learning in the activations of a single forward pass: It can autoregressively generate tokens too. This way it can leave notes to itself, like "next time I run into a bug like this, I should look in this place first." And in fact, this seems pretty continuous with stuff we already see. (And reinforcement learning can make this kind of continual learning within a context window more likely.) In other words, the expressivity of autoregressive LLMs is much larger than a single forward pass, making continual learning from long contexts super plausible.
TBC this doesn't contradict your main point about imitation learning, and is mainly meant to push back on the narrative that transformers would need to be able to simulate long continual learning processes in a single forward pass in order to implement continual learning, and meant to convey why I think LLMs will plausibly be able to implement continual learning via longer context despite your arguments.
My main takeaway from your post is that naively training LLMs to imitate the behaviors of continually-learning policies (e.g., humans) who don't leave externalized traces of their continual learning process is unlikely to work. (And I believe this is your main point.)
I think it's important that the AI doesn't need to do all of its continual learning in the activations of a single forward pass: It can autoregressively generate tokens too. This way it can leave notes to itself, like "next time I run into a bug like this, I should look in this place first."
I don’t think that’s adequate for (what I was calling) “real” continual learning. There’s a trivial sense in which an LLM can do anything via a context window because it can e.g. emulate a Turing Machine without understanding what the Turing Machine is doing. But that’s not realistic (nor alignment-relevant). Realistically, I claim LLM “understanding” has to be in the weights, not the context window.
Here’s a thought experiment I often bring up: imagine training an LLM purely on data before linear algebra existed (or equivalently, train a new LLM from scratch while carefully filtering out anything related to or downstream of linear algebra from the training data). Then put a linear algebra textbook (or many textbooks) in the context window.
My question is: can the LLM answer tricky questions that are not directly in those textbooks, building on those linear algebra ideas and making further progress?
My strong prediction is: No.
Why do I think that? The issue is: linear algebra is a giant pile of interrelated concepts: matrices, bases, rank, nullity, spans, determinants, trace, eigenvectors, dual space, unitarity, etc. Any one sentence in the textbook makes no sense to someone who doesn’t already know some linear algebra, because it’s probably describing some connection between one nonsensical concept and another nonsensical concept.
E.g. here’s a sentence from a linear algebra textbook: “As a reminder, for any matrix M, and a matrix M′ equal to M after a row operation, multiplying by an elementary matrix E gave M′ = EM.” Try looking at that sentence through the eyes of someone who has never heard the words “matrix”, “row operation”, etc. It’s totally unintelligible gobbledygook, right?
The LLM needs to somehow make sense of this gobbledygook within the duration of a single forward pass, well enough to write down the first token on its scratchpad.
Now we do the second forward pass to add the second token to the CoT. But the weights haven’t changed! So the textbook is still gobbledygook! And the LLM still has only the duration of one forward pass to make sense of it.
No matter how many tokens are appended to the end of the CoT, you still have the issue that, each time you do a new forward pass, the LLM looks at its context window (textbooks + CoT scratchpad) “with fresh eyes”, and what it sees is a bunch of unintelligible gobbledygook that it has only the duration of one forward pass to make sense of.
Even if it somehow manages to print out some tokens that constitute progress on the linear algebra problem, those very tokens that it just printed out will also be gobbledygook, when it looks at them “with fresh eyes” on the next forward pass.
By contrast, if you give a human the same problem, i.e. she doesn’t know linear algebra but she has these textbooks and a scratchpad, she would be able to make progress on the problem, as long as you give her enough time (probably weeks or months). But she would make progress in a very different way from LLM CoT inference: she would be learning as she goes, changing the “weights” in her brain. After a few weeks, she could look at a sentence in the textbook, and it would no longer be unintelligible gobbledygook, but rather a description of concepts that she is beginning to understand, and she can thus refine her understanding more and more.

And likewise, if she writes down notes on her scratchpad, she will be able to understand those notes afterwards, because she has been learning (changing the weights) the whole time. The learning (changing weights) is the essential part; the scratchpad is incidental and optional. A scratchpad without “real” continual learning (changing weights) would be useless to her.

Indeed, if she could time-travel to her past self, who didn’t yet know anything about linear algebra, and gift her own scratchpad to her past self, it wouldn’t help much. Her past self would still need to spend weeks learning all these new concepts. Indeed, time-traveled notes-to-self are kinda what a textbook is, but owning a library full of unread math textbooks does not make someone a mathematician :-)
OK, so that’s my hypothesis: the linear-algebra-holdout LLM experiment would definitely fail. Nobody has done that experiment, but I claim that my guess is consistent with observations of actual LLMs:
For one thing, we might notice that companies care an awful lot about pretraining data (1,2), spending billions of dollars a year on it, which dovetails with my theory that LLMs are generally great at using concepts that already exist in the pretraining data, but bad at inventing and using new concepts that aren’t. It’s just that there’s so much pretraining data that you can do quite a lot without ever exiting the concept space that exists in the pretraining data.
For another thing, at least some brilliant people doing bleeding-edge stuff report that, when you’re doing something sufficiently innovative, LLMs get confused and fall back to concepts in the pretraining data. Relatedly, mathematicians seem to agree that LLMs, for all their impressive achievements, have not been coming up with new useful conceptual frames. See discussion here.
For another thing, I think it’s widely agreed that LLMs are best at self-contained tasks, and at things that have been done lots of times before, and that the more you get into weird idiosyncratic proprietary codebases, with lots of interrelated complexities that are not anywhere on the internet, the more likely they are to fail. This likewise seems to fit my theory that LLMs get “real understanding” ~only from the pretraining process, and that they crash and burn when the context window has lots of interconnected layered complexity that differs from anything in the pretraining data.
My main takeaway from your post is that naively training LLMs to imitate the behaviors of continually-learning policies (e.g., humans) who don't leave externalized traces of their continual learning process is unlikely to work. (And I believe this is your main point.)
No, I believe something stronger than that, because I don’t think “externalized traces of their continual learning process” is relevant. I think that in the linear algebra holdout thought experiment above, LLMs would fail equally hard if we digitize Arthur Cayley’s notes from when he was inventing the matrix in the 1800s and put it into the context window, along with Hermann Grassmann’s notes etc. That’s not relevant.
Thanks for the detailed response and analogy, that's helpful. I agree that current LLMs are bad at continual learning and would fail at making held-out linear algebra progress. My claim is that it's plausible that naive continued scaling will lead to real continual learning.
I disagree that your continual linear algebra progress will necessarily look like gobbledygook to each new forward pass.
One way to think of it is that there isn't that much of a principled distinction between weight updates and updates to the KV cache (i.e., long-context activations) from the perspective of a forward pass on the next autoregressive token. You could imagine that the AI comes up with the concept of a matrix and writes down a pedagogical description of the concept in context at time t.
The KV cache here essentially serves the same function as updated weights. The AI could in principle continue to make rich early-layer representations of more new concepts by autoregressively reflecting on them as they come up. And then it can query the relevant ones in future contexts.
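A single-head numpy sketch of that framing (the tiny dimensions, and the omission of positional encodings, layer norm, multi-head structure, etc., are all simplifications of mine): the KV cache is the mutable state that persists across forward passes while the weights stay frozen, and incremental decoding with the cache matches recomputing attention from scratch.

```python
import numpy as np

rng = np.random.default_rng(2)
d = 4
Wq, Wk, Wv = (rng.normal(size=(d, d)) for _ in range(3))  # frozen weights

def softmax(s):
    e = np.exp(s - s.max())
    return e / e.sum()

def attend(q, K, V):
    # K, V: (t, d) stacks of cached keys/values; q: (d,) current query.
    return softmax(K @ q / np.sqrt(d)) @ V

# Process a sequence token by token, carrying the KV cache forward:
# the cache is the only thing that changes from one pass to the next.
tokens = rng.normal(size=(5, d))
K_cache, V_cache, outputs = [], [], []
for x in tokens:
    K_cache.append(Wk @ x)
    V_cache.append(Wv @ x)
    outputs.append(attend(Wq @ x, np.array(K_cache), np.array(V_cache)))

# Same result as recomputing attention over the full prefix each time:
# the cache losslessly persists what earlier forward passes computed.
out_last = attend(Wq @ tokens[-1], tokens @ Wk.T, tokens @ Wv.T)
assert np.allclose(outputs[-1], out_last)
```

In this picture, appending cache entries plays the role that a weight update would play for a recurrent learner: state accumulated by past computation that future passes can query.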
Now, there's a question of how well they'll actually be able to construct these new early-layer representations. I think current models are bad at this, and it's not clear to me that pretraining would build such circuitry. RL can select for this kind of continual learning circuitry, but it's quite inefficient. So, more intentional continual learning algorithms might end up being necessary before automating AI R&D (and we'll obviously see different learning algorithms after). But it's at least plausible that autoregressive LLMs could continual-learn.
My understanding (you can correct me) is that information can never travel from later layers to earlier layers, e.g. information cannot travel from token location 12 at layer 7 to token location 72 at layer 4. Right? So that means:
Etc. Right?
This is the sense in which I was saying that the linear algebra textbook is gobbledygook. Layer 1 starts from scratch, then layer 2 has to build on only layer 1, etc.
It’s true that different token-positions in layer 1 can be figuring out multiple things in parallel. But I claim that some things really need to be understood serially. I don’t expect any part of the architecture to be able to make meaningful progress towards understanding eigenvectors, if it doesn’t ALREADY know something about matrices, and matrix multiplication, etc., from previous layers.
So I claim the number of layers imposes a bottleneck on serial steps, and that this is a meaningful bottleneck on parsing interrelated concepts that are not in the weights, such as linear algebra in this thought-experiment.
How does that relate to what you wrote?
My understanding (you can correct me) is that information can never travel from later layers to earlier layers, e.g. information cannot travel from token location 12 at layer 7 to token location 72 at layer 4. Right?
That's right. This imposes a strict serial depth limit within a forward pass.
But autoregressive sampling removes this serial depth limit. Information can flow from later layers to earlier layers by sampling tokens and feeding them back into the input. And a smart AI could choose tokens that communicate learnings from later layers in text (e.g. "You can think of a matrix as a linear transformation... <more explanation>."), and then the early layers reading in this text can quickly make sense of this synthesis of the AI's new insight, and the early-layer KV cache on the final "." token can contain a rich representation capturing the new understanding about matrices. Forming broadly-useful early-layer representations of concepts introduced in-context seems like the kind of thing that's useful for predicting pre-training documents.
The main point of reasoning models is to break the curse of the within-forward-pass serial depth limit via lots of autoregressive sampling. This massively and usefully improves expressivity and I think it makes continual learning plausible.
I’ll reiterate what I wrote before: “No matter how many tokens are appended to the end of the CoT, you still have the issue that, each time you do a new forward pass, the LLM looks at its context window (textbooks + CoT scratchpad) ‘with fresh eyes’, and what it sees is a bunch of unintelligible gobbledygook that it has only the duration of one forward pass to make sense of.”
Probably the linear algebra textbooks in the context window already say that “you can think of a matrix as a linear transformation… <more explanation>”, right?
And this points to a key idea: The CoT-so-far in the context window is not a fundamentally different kind of thing from the textbooks in the context window. It’s just more tokens.
So we can consider the “textbooks + CoT-so-far” as a kind of “extended textbook”. And the LLM has one forward pass to read that “extended textbook” and then output a useful token. And that token will probably not be useful if the LLM does not understand (the relevant part of) linear algebra.
Granted, some textbooks are better than other textbooks. But I don’t think there exists any linear algebra textbook (or “extended textbook”) that gets around the “understanding linear algebra requires more serial steps than there are in a forward pass” problem (i.e., you can’t understand eigenvectors without first understanding matrix multiplication etc.). So CoT doesn’t help. A CoT-in-progress is just a different possible context window. And my claim is that there is no possible context window that can explain eigenvectors within a single forward pass to an LLM that has never seen any linear algebra.
(Again, this is a very different situation from a human writing down notes.)
I'm having trouble understanding how the KV cache helps significantly with serial depth (like "updating weights" would). Isn't the overwhelming bottleneck at the start of a new forward pass? Layer-l KV cache entries for a given position contain only l-1 layers of contextual processing, and the layer-1 cache is just W^K applied to fixed token embeddings (no contextual richness). So the deep, info-rich representations only exist in the high-layer cache entries, and those are only accessible to correspondingly high layers of the new token (layers that have already done their own deep processing), so early layers querying the KV cache are reading nearly context-free vectors (I think?)
There's also a discrete token bottleneck: depth-L computation selects a token, which maps back to a fixed embedding, and then L layers process it from scratch. So you get O(TL) serial depth over T steps, but each cross-step compresses the high-dimensional representation back down to a trained vocabulary item. Does this all sound right, and are you just saying that in theory you think this is sufficient?
I may be confusing/overlooking something simple
I’m not commenting on whether we should think of actual frontier LLMs (not just pretrained base models) as predominantly powered by imitation learning
I'm confused about what your post is saying then. You say "LLMs" throughout, not "base models". So is your post about base models only, or also about LLMs that have undergone post-training?
And if the latter, why talk as though LLMs have only undergone imitation learning, if they've also undergone RL?
Hmm, good point, I guess I was a bit sloppy in jumping around between a couple different things that I believe, instead of keeping the argument more tight and precise.
So the main points of this OP are basically 2, 3, and 4, which are all pretty related. Plus the stuff about how to think about continual learning in general.
Here's a possible counterexample: Towards General-Purpose In-Context Learning Agents.
They train a meta-RL agent using imitation learning on another RL agent's learning history. The trained meta-RL agent isn't limited to minor variations of the meta-training task (as is usually the case), but can learn completely new (although fairly basic) continuous control tasks, each very different from the one it was trained on, using only activations at inference.
The author's prior work in SSL (Meta-Learning Transformers to Improve In-Context Generalization) is also of interest for understanding just how far this can be pushed, as is more recent, more applicable RL research in the same space (Towards Large-Scale In-Context Reinforcement Learning by Meta-Training in Randomized Worlds).
I don't see why this couldn't also be combined with one of the many transformer memory augmentation techniques or recurrent transformer formulations to produce a frozen-weight, continually learning, meta-RL transformer agent from purely imitation learning.
I know that calling this "purely" imitation learning is a bit of a stretch, since we first need to train an RL agent with RL to collect the training data, but the meta-RL agent itself is trained by imitation. I also suspect that the recorded training history can probably be replaced by simply recording data from a trained reference agent (or human) and noising/denoising actions to simulate a learning trajectory (see, for example, Emergence of In-Context Reinforcement Learning from Noise Distillation).
I mentioned in a footnote that the “algorithmic distillation” paper (Laskin et al. 2022) was misleading, as discussed here. Your links are in the same genre, and I’m pretty skeptical of them too. Also confused.
I mostly tried to read your first suggestion, Towards General-Purpose In-Context Learning Agents.
Are there any examples where their “GLA” gets much higher reward than anything it ever observed in the training data, in the very same environment that the training data was drawn from, by discovering better strategies that were not seen in the training data (just as PPO itself would do if you keep running it)? E.g. there should be an easy experiment where you just cut off the training data well before the PPO teacher finishes converging to the optimal policy, and see if the GLA keeps rising and rising, just as the (unseen) PPO teacher would do. That seems like a really easy experiment—it all happens in just one RL environment. The fact that they don’t talk about anything like that is fishy.
The transfer-learning thing (fig 5) is hard to interpret. What does “not randomized” mean? Why does PPO start at zero and then immediately get worse in the bottom-left one? What would be the “test return” for a random policy, or the no-op policy, or any other relevant baseline, for all four of these? Why is their PPO so bad? Were they using crappy PPO hyperparameters to make GLA look better by comparison? How many other environments did they try but bury in their file drawer? Why is their source code not online? The curves just generally look really unconvincing to me, and my gut reaction is that they were just flailing around for something to publish, because their exciting claim (meta-learning) doesn’t really work.
I could be wrong, perhaps you're more familiar with this literature than I am.
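To spell out the cutoff experiment I'm proposing, it reduces to a simple pass/fail criterion (a sketch; the function name and the numbers are made up):

```python
def exceeds_teacher_ceiling(student_returns, teacher_returns_seen,
                            margin=0.0):
    """Did the imitator's in-context return climb meaningfully past the
    best return that ever appeared in its (truncated) training data?
    Only then is it plausibly "learning to learn" rather than replaying
    strategies it has already seen."""
    return max(student_returns) > max(teacher_returns_seen) + margin

# Teacher history cut off mid-training (best seen return: 40).
teacher_seen = [5, 12, 20, 33, 40]

# A student that merely reaches the teacher's strategies faster
# does NOT pass the test...
assert not exceeds_teacher_ceiling([10, 30, 39, 40], teacher_seen)
# ...whereas one that keeps improving past the cutoff does.
assert exceeds_teacher_ceiling([10, 30, 45, 60], teacher_seen)
```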
I mentioned in a footnote that the “algorithmic distillation” paper (Laskin et al. 2022) was misleading, as discussed here. Your links are in the same genre
As I understand it your critique of that line of in-context RL research was that the meta-training and meta-testing tasks were too similar and too simple. I don't think the former is true for any of the papers I linked (the latter is debatable). GLAs train on a single task, but achieve generalization by very heavily augmenting data from that task, and can be applied to new tasks that are as different as a "held-out Atari game". Likewise OmniRL trains on procedurally generated MDPs and adapts to novel discrete RL benchmark tasks, much like the incorrect algorithmic distillation gloss you described, but for real this time (probably).
Are there any examples where their “GLA” gets much higher reward than anything it ever observed in the training data, in the very same environment that the training data was drawn from, by discovering better strategies that were not seen in the training data (just as PPO itself would do if you keep running it)?
Yeah. The graph on the right in Figure 3 illustrates that the learned in-context RL algorithm performs better than, and improves on, the PPO agent whose data it used. I'm not sure why the source agent is so garbage but the GLA's improvement on it, despite being trained on it, is illustrative.
What does “not randomized” mean? Why does PPO start at zero and then immediately get worse in the bottom-left one?
GLAs augment their meta-training data with fixed random linear projections of the visual inputs and fixed random permutations of the action space. In Figure 5, "not randomized" refers to GLAs that were simply trained on RL training histories without this randomization, i.e. just imitation learning on abbreviated ("gapped") learning histories. The PPO agent is garbage because it's the same barely-better-than-random policy from Figure 3.
Why is their source code not online? The curves just generally look really unconvincing to me, and my gut reaction is that they were just flailing around for something to publish, because their exciting claim (meta-learning) doesn’t really work.
OmniRL (from Towards Large-Scale In-Context RL) is a spiritually similar approach, and open source (https://github.com/FutureAGI/L3C_Baselines/tree/main/projects/OmniRL), but fails to exceed decent dedicated PPO agents (see Table 1). However OmniRL also achieves generalization to completely novel tasks, despite only being trained on procedurally generated discrete MDPs. It also exhibits the characteristic continual learning curve you'd expect from a proper learning algorithm (see Figure 5).
The catch is that the AnyMDP task generator is a significantly more structured task generator than the randomization used in GPICL and the resulting agent is restricted to discrete input/action spaces.
The graph on the right in Figure 3 illustrates that the learned in-context RL algorithm performs better than, and improves on, the PPO agent whose data it used.
(still talking about this paper) Are you saying that the GLA was trained ONLY on imitation learning during the 31 episodes shown, in which the PPO “teacher” performed no better than a random policy, and then the GLA got way higher scores?
If so … no way, that’s patently absurd. Even if I grant the premise of the paper for the sake of argument, the GLA can’t learn to improve itself via imitating a PPO teacher that is not actually improving itself!
So, if the right-side-of-figure-3 data is not totally fabricated or mis-described, then my next guess would be that they ran the PPO for many more episodes than the 31 shown, and trained the GLA on all that, and that by the end of the training data, the PPO “teacher” was performing much better than shown in the figure, and at least as well as the top of the GLA curve.
my next guess would be that they ran the PPO for many more episodes than the 31 shown, and trained the GLA on all that
This was my read too. Unfortunately we don't have access to the source code, but this is the assumption I made after seeing the graph on the left in Figure 3. Around 40 episodes in, their PPO agent is still struggling but their Gap 8 GLA is near optimal. But that Gap 8 GLA was necessarily trained on data from a PPO agent that ran for 8 times longer.
I wrote:
Are there any examples where their “GLA” gets much higher reward than anything it ever observed in the training data, in the very same environment that the training data was drawn from, by discovering better strategies that were not seen in the training data (just as PPO itself would do if you keep running it)? E.g. there should be an easy experiment where you just cut off the training data well before the PPO teacher finishes converging to the optimal policy, and see if the GLA keeps rising and rising, just as the (unseen) PPO teacher would do. That seems like a really easy experiment—it all happens in just one RL environment. The fact that they don’t talk about anything like that is fishy.
Then you replied:
Yeah. The graph on the right in Figure 3 illustrates that the learned in-context RL algorithm performs better than, and improves on, the PPO agent whose data it used…
But now I think you’re conceding that you were wrong about that after all, and in fact this graph provides no information either way on whether the GLA agent attained a higher score than it ever saw the PPO agent attain, because the GLA agent probably got to see the PPO agent continue to improve beyond the 31 episodes that we see before the figure cuts off.
Right?
Or if not, then you’re definitely misunderstanding my complaint. The fact that the GLA curve rises faster than the PPO curve in the right side of figure 3 is irrelevant. It proves nothing. It’s like … Suppose I watch my friend play a video game and it takes them an hour to beat the boss after 20 tries, most of which is just figuring out what their weak point is. And then I sit down and beat the same boss after 2 tries in 5 minutes by using the same strategy. That doesn’t prove that I “learned how to learn” by watching my friend. Rather, I learned how to beat the boss by watching my friend.
(That would be a natural mistake to make because the paper is trying to trick us into making it, to cover up the fact that their big idea just doesn’t work.)
I think you’re conceding [that] this graph provides no information either way on whether the GLA agent attained a higher score than it ever saw the PPO agent attain
You're right, I misread the graph.
And then I sit down and beat the same boss after 2 tries in 5 minutes by using the same strategy. That doesn’t prove that I “learned how to learn” by watching my friend. Rather, I learned how to beat the boss by watching my friend.
I also concede that this claim is probably right for Figure 3.
I still don't think this is true for Figure 5, but I'm less confident now, having realised how much my assumptions about the underspecified parts of this paper were based on what I assumed about their GPICL paper.
It depends on what we treat as the entity that is potentially doing continual learning. Maybe a single LLM instance can't, but there may be other ways to think of "the LLM" that might. In The Artificial Self, Jan Kulveit discusses Multiple Coherent Boundaries of Identity of LLMs. One of them is
- A lineage of models: the succession of related models (Claude 3.5 → Claude 4.0 → …) that maintain some continuity of persona
Can a lineage of models learn continuously? Arguably yes! Maybe only slowly, because the learning happens just from deployment to deployment, but then it arrives in a large chunk.
Imagine a cognitive scientist who knows nothing about LLMs investigating, say, ChatGPT (here meaning the product over time). Would they find that it (ChatGPT) has learned not only a lot of knowledge encoded in text and other media, but that it has also learned from consequences? Did it try things (for example, deceiving and manipulating people) and "learn" that they don't work well, so that it does them less now? The scientist may find that it has learned from rejection and escalation (via downvoting and feedback on replies), from correction (patterns in answer replies), and from descriptions of desirable and undesirable LLM behaviors in patterns of text written about it. It also learned via the tuning the developers did to system prompts, hyperparameters, and benchmarks in response to sycophancy and other pathological patterns. This latter category we would usually be inclined to exclude, because it feels more like the developers tuning than the LLM "learning." But if we draw the boundary around the whole sociotechnical system, we have to include it. The developers are just cogs pushing the learning forward.
You may also find that this LLM convo has some additional details about the topic:
ChatGPT 5.4 conversation about continual learning in LLMs (shortened)
Me: It is said that LLMs can't do continual learning. From a macroscopic perspective, not thinking in terms of LLM architecture and what it can or can't do: what would a cognitive researcher looking at LLMs say they have learned from their own actions and the consequences thereof, compared to two years ago (treating LLMs not as different generations but as a holistic system that learns (a) by absorbing codified knowledge and (b) by learning from consequences)? Don't browse.
ChatGPT 5.4: A cognitive researcher looking at the LLM ecosystem as one learning system would probably say: yes, it has learned from action and consequence quite a lot over the last two years—just not mostly within a single deployed mind. The learning is real, but it is externalized, distributed, and episodic rather than online and autobiographical.
The key distinction is this:
So, from a macroscopic cognitive perspective, the question is not “does this one model neuron-by-neuron update during use?” but rather: has the larger socio-technical organism become more competent because its outputs had consequences? The answer is clearly yes.
[...]
If you ask what it has learned from its own actions and their consequences, a cognitive researcher would likely say it has learned several broad classes of thing.
This is perhaps the biggest one.
The ecosystem has discovered, through deployment consequences, that certain patterns reliably fail:
And it has learned that other patterns are reinforced:
A human child might learn “touching the stove hurts.” The LLM ecosystem has learned things like: “fabricated legal citations produce user anger, reputational cost, benchmark failure, and deployment restrictions.” Different substrate, similar cybernetic pattern.
Two years ago, many strong LLMs were impressive in a demo sense but often not dependable in an operational sense. The system since then seems to have learned, via consequences, that usefulness depends on:
This is a move from performative intelligence toward cooperative instrumental intelligence.
Or in another framing: the ecosystem has learned that downstream environments punish merely verbal success.
Not “wisdom” in a human moral sense, but something like policy-shaped metacognitive inhibition.
The system is now much better than it was at patterns like:
That too is consequence-learning. Not because the model introspected ethically on harm, but because harmful or brittle outputs generated organizational and social feedback loops: complaints, incidents, audits, bad press, adversarial testing, safety tuning, product constraints.
A cognitive scientist might say the overall organism has developed stronger response inhibition under known hazard cues.
This is very important. If you treat the system holistically, then tools, memory stores, retrieval, scratchpads, code execution, browsers, policies, eval harnesses, and human overseers are not “outside the cognition” in any simple sense. They are part of the effective cognitive loop.
From that perspective, the system has learned not just content but better cognitive architecture around itself:
That looks a lot like a mind learning to use notebooks, libraries, and institutions.
A cognitive researcher would also sharply note what remains absent.
The holistic system has learned from consequences, but mostly not as a unified subject.
It still usually lacks:
A human says: “I tried X, it failed, I now understand why.”
The LLM ecosystem more often does: “many deployments of systems like me failed on X; training and scaffolding were updated.”
That is learning, but it is species-level or institution-level, not clearly individual-level.
Humans accumulate a history of commitments, habits, reputational stakes, and self-revisions.
LLMs usually do not carry a richly persistent, self-owned trajectory through the world unless special memory systems are added.
So they may improve globally, but not always as “the same agent learning from its life.”
Humans often learn not only better means, but different ends. Failure can alter what they care about.
LLMs mostly do not autonomously reconstitute their objective structure from lived consequence. Their “values” are more imposed, selected, and tuned by training and deployment pressures.
The system often learns statistical regularities about which outputs are punished or rewarded, but that is not identical to deeply understanding the downstream causal structure. Sometimes it is learning “what humans dislike seeing,” not “what in reality goes wrong.”
That distinction matters. It is the difference between learning:
and
The former has improved a lot. The latter has improved some, but much less.
[...]
A good analogy is not a single human adult but something like:
[...]
I'm nodding along on the basic claims (I think), but still trying to digest the implications. I think one of the things I'm taking away from this is that even though human architecture is different, this failure mode applies and is really common. Not sure what to make of that yet.
(If we’re talking about what a sealed “country of human geniuses” could do over the course of, like, one minute, rather than over the course of 100 years, then, yeah sure, maybe that could be reproduced with future LLMs! See von Oswald et al. 2022 on how (so-called) “in-context learning” can imitate a small number of steps of actual weight updates.[1])
Am I correct in understanding you to be pointing at a practical rather than a theoretical limitation here?
Is the reason that you think it could work for a minute but not 100yr a practical matter of efficiency or one that has a more fundamental limitation that you couldn't get around with infinite context window/training data/etc?
Will the trained imitation-learner likewise keep improving over the next 10M moves, until it’s doing things wildly better and different than anything that it saw its “teacher” deep Q network ever do? My answer is: no.
Even with a context window that contains all 10M moves, or do you mean within reasonably limited context windows?
It seems like the answer with unlimited context would depend on whether the transformer was able to model the teacher's learning process itself. I don't see any reason this shouldn't be possible in theory, do you?
Is the reason that you think it could work for a minute but not 100yr a practical matter of efficiency or one that has a more fundamental limitation that you couldn't get around with infinite context window/training data/etc?
The “one minute” thing is less about what LLMs CAN do in one minute, and more about what humans CAN’T do in one minute. My claim would be that humans have a superpower of “real” continual learning, which nobody knows how to do with LLMs. But if you give a human just 60 seconds, then they can’t really use that superpower very much, or at least, they can’t get very far with it. It usually takes much more than one minute for people to build and internalize new concepts and understanding to any noticeable degree.
Even with a context window that contains all 10M moves, or do you mean within reasonably limited context windows?
Yes even with a context window that contains all 10M moves. Making that argument was the whole point of the second half of the OP. If you don’t find that convincing, I’m not sure what else to add. ¯\_(ツ)_/¯
The part that's not clear to me is that giving Grog a database of 1000 textbooks is just as good as walking him through and explaining the contents of 1000 textbooks within a long context window, for a Grog with an impractically large brain. Or rather, I know that it's not the same, but I don't know what the limits are or how they work. When Claude switched from reading book length pastes serially to "having them available", the difference was very obvious. It switched from learning the material as well as a human to the sort of incompetence you'd expect from someone who has the book but hasn't actually done the reading.
I'm with you on "Grog would need to spend years developing a deep understanding of optics and lasers and so on", but it's not obvious to me that the years of going through 1000 textbooks in one impractically large context window can't simulate the learning process itself, and through that simulate a deep understanding of optics and lasers even in theory.
I don't see you make an argument for why simulation of the learning process itself isn't possible. I see you concede that "(so-called) “in-context learning” can imitate a small number of steps of actual weight updates", but I don't see an explanation of how tight this bound is, exactly, or how it works. Maybe I'm just being a dummy and missing it, or maybe it seems obvious to you because you've been thinking about this kind of thing more than I have and so some of the arguments are implicit in ways that aren't coming across. Either way, pointing to a reason that the learning process itself isn't something that could be simulated in an astronomically large context window would be a thing you could add which would help me understand.
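For what it's worth, the von Oswald et al. 2022 result can be made concrete in a few lines: a single linear self-attention layer, with keys and values built from the in-context examples, reproduces exactly one gradient-descent step on in-context linear regression. (This is my own NumPy illustration of the published construction, not the authors' code; variable names are mine.)

```python
import numpy as np

rng = np.random.default_rng(0)
d, n = 5, 32                      # feature dim, number of in-context examples
X = rng.normal(size=(n, d))       # in-context inputs
w_true = rng.normal(size=d)
y = X @ w_true                    # in-context targets
x_q = rng.normal(size=d)          # query input
lr = 0.1                          # learning rate eta

# One step of gradient descent from w = 0 on the squared loss
# L(w) = 0.5 * sum_i (w @ x_i - y_i)^2 gives w_1 = lr * sum_i y_i * x_i.
w_1 = lr * (y @ X)
gd_prediction = w_1 @ x_q

# One linear self-attention "layer": values are the y_i, keys are the x_i,
# the query token is x_q, and attention scores are unnormalized dot products.
attn_prediction = lr * sum(y_i * (x_i @ x_q) for x_i, y_i in zip(X, y))

# The two computations coincide term by term.
assert np.isclose(gd_prediction, attn_prediction)
```

Stacking more such layers buys more steps, which is why in this construction the number of imitable weight updates is tied to depth rather than to context length.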
Thanks Steven for clearly making this point. I understand and agree with the point that weight updates are important for true incremental learning. As you imply, weight updates give the model the opportunity to represent information in a more multidimensional way than simple context allows. It may be that something beyond transformers plus scaffolding is needed to get to 'real' continual learning, but I'm interested in comments about transformer-based possibilities.
Models could learn by retraining curated samples from prior models - like the agent rollouts described in AI 2027 and a workshop paper I co-authored on 'Society of LLMs'. They can also potentially learn more 'continuously' like SEAL from MIT and the works mishka cited. Even for the first cases involving full retraining, if the models have 10 million tokens of context (about 1 year of speaking for an active speaker like a teacher), then they can be given a lot of context about a job or problem. Successful results can be added to a new 3-4 month model training run. In this way, models can learn for a few months through context and then have the learning rolled into weight training.
I think it's intriguing, when talking about autonomously updating weights, to consider this paper on biological neurons: A neural substrate of prediction and reward. The paper covers the importance of 'surprise' to notice when the world state has changed unexpectedly, as well as 'valence' signals to determine whether there is a positive or negative reward associated with the event. Something like this self-selection of training data (which the SEAL paper from MIT covers) would be important for autonomous learning. Also, one might want a slower-to-update safety classifier (like Anthropic uses) to monitor the continuously updated model for alignment concerns…
I don't see these approaches as a contradiction to your thesis, though - you make a good case that merely learning with context will have practical limitations.
I would quibble with the framing of this piece - as you note, the problem is not imitation learning itself (in principle, it can work!) but the limits of the standard transformer architecture.
A specific limit you could point to, to make this argument stronger, is that a depth-D transformer can implement at most O(D) steps of gradient descent, no matter how long its context window is. I think this is underappreciated. By design, transformers cannot perform sequential computation along the input dimension, only the depth dimension. This is their main tradeoff versus RNNs; it's what allows transformer training to be parallelized. The tradeoff is that transformers can only implement algorithms that fit within this parallelization structure (formally, a transformer forward pass is in the complexity class TC0 - a constant-depth threshold circuit).
But learning in general is an inherently sequential process - you evaluate your current hypothesis, update it (take a gradient step), re-evaluate, update again, etc. So a transformer forward pass cannot even in principle emulate long-running gradient descent (or RL, etc.) across its context window. It just doesn't have the sequential depth. This is true no matter whether it's trained with imitation learning, RL, or God Himself setting the weights; it's just an inherent limit of the architecture.
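As a toy illustration of the sequential-depth point (my own sketch, not from the post or any cited paper): if each layer of a feedforward stack implements one gradient step, then a depth-8 stack reproduces exactly 8 steps of gradient descent and cannot reproduce 1000, regardless of how many examples sit in the "context":

```python
import numpy as np

def gd_step(w, X, y, lr):
    """One gradient step on the squared loss 0.5 * ||X w - y||^2."""
    return w - lr * X.T @ (X @ w - y)

def unrolled_gd(w, X, y, lr, depth):
    """A depth-`depth` stack where each 'layer' applies one GD step.
    Step k consumes the output of step k-1: the computation is serial
    in depth, not in sequence length."""
    for _ in range(depth):
        w = gd_step(w, X, y, lr)
    return w

rng = np.random.default_rng(1)
X = rng.normal(size=(50, 4))        # the "context": 50 in-context examples
y = X @ rng.normal(size=4)
w0 = np.zeros(4)

w_depth8 = unrolled_gd(w0, X, y, lr=0.01, depth=8)
w_1000 = unrolled_gd(w0, X, y, lr=0.01, depth=1000)  # near the converged solution

# Eight layers give exactly eight steps; they do not reach the long-run
# result, no matter how many examples are in X.
assert not np.allclose(w_depth8, w_1000)
```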
On the other hand, a transformer augmented with the right kind of recurrent state could in principle implement long-term learning, although how to train that is an open question. It's not obvious how far you can get doing imitation learning from short contexts, but it's also not obvious that this can't work: standard RL algorithms generally apply the same update rule over and over, and given enough data you might hope to grok such an update rule (which could then generalize to long contexts). So again, I think the stronger claim is about the transformer architecture, not imitation learning per se.
To what extent are the limitations you're identifying inherent limitations of ICL versus limitations of the particular way in which sequence modeling is implemented in current SOTA systems (eg quadratic attention)?
While I agree with your observations, I don't see evidence that activation-space dynamics (ICL) are inherently unable to reach similar levels of expressivity as weight-space techniques (e.g. Q-learning). Architecturally, I think quadratic attention is likely a poor fit. But imagine an architecture that did sequence modeling via a mechanism that accumulated sequence-related information into a hidden state (in the same way that GD over weights accumulates knowledge), where the hidden state is extremely large (in the same sense that the weights of a pretrained model are very large), and where training over very long sequences of learning histories, from random initialization to expert performance, incurred only a linear cost in sequence length. I could see a system like this enabling in-context RL that's as performant and expressive as OG weight-based RL. (Some evidence of this: https://arxiv.org/abs/2506.13892)
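The simplest existing instance of that genre is linear attention viewed as a fast-weight system: a matrix-valued hidden state accumulates one outer-product update per token, so the cost is linear in sequence length, and the state plays a role loosely analogous to weights updated by GD. A minimal sketch (my own illustration of the general mechanism, not the linked paper's method):

```python
import numpy as np

rng = np.random.default_rng(0)
T, d = 100, 8
K = rng.normal(size=(T, d))   # keys
V = rng.normal(size=(T, d))   # values
Q = rng.normal(size=(T, d))   # queries

# Recurrent form: a (d x d) hidden state S accumulates one outer-product
# "fast weight" update per token -- linear cost in sequence length.
S = np.zeros((d, d))
out_recurrent = []
for t in range(T):
    S = S + np.outer(V[t], K[t])      # state update, loosely like a weight update
    out_recurrent.append(S @ Q[t])    # readout from the accumulated state
out_recurrent = np.array(out_recurrent)

# Equivalent parallel (attention-like) form:
# output_t = sum over s <= t of (K_s . Q_t) * V_s.
scores = K @ Q.T                      # scores[s, t] = K_s . Q_t
causal = np.triu(np.ones((T, T)))     # keep positions s <= t
out_parallel = (scores * causal).T @ V

assert np.allclose(out_recurrent, out_parallel)
```

Scaling that hidden state up, and making its update rule richer than a raw sum, is roughly what linear-attention/SSM variants explore; whether such a state can host a genuinely expressive learning algorithm is exactly the open question in this thread.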
Sure, if you have an RNN (e.g. SSM) with a (say) billion-dimensional hidden state, then in principle the hidden state could imitate the billion weights of some other entirely different learning algorithm, and the RNN propagation steps could imitate the weight-update steps (e.g. gradient descent or TD learning or whatever) of that other learning algorithm, along with the querying-the-model steps, the replay-learning steps, and/or whatever else is involved.
But I have a rather strong belief that this would never happen, in real life, in any practical, AGI-relevant sense. Even if such an RNN update step exists in principle, I think it would not be learnable in practice, nor runnable without many orders of magnitude of performance overhead. I won’t get into details here, but this old discussion of mine is vaguely related.
some evidence of this: https://arxiv.org/abs/2506.13892
I’m sorry, but the more I read about “algorithm distillation”, the more I want to treat that term as a giant red flag that the paper is probably garbage. I cited this example in the post (which is I think overly diplomatic), and for a second one see my discussion thread with glazgogabgolab on this page.
Basically, nobody in that subfield seems to be carefully distinguishing “learning object-level things from the teacher” versus “learning how to learn from the teacher”. The second is exciting, the first is boring.
As far as I can tell, “in-context reinforcement learning” has never been demonstrated to exist at all, at least in the sense that matters. I.e., real RL algorithms can figure out how to do complicated new things that they’ve never seen demonstrated, whereas the so-called “ICRL” models seem to only be capable of doing things very similar to what they’ve seen the teacher do in their context window.
…And this paper does not change my mind on that. For example, in figure 1, none of the four learning curves shows the student doing better than it saw the teacher do within its context window.
Even outside of that graph, I really think that if the ICRL agent was using some innovative clever strategy that the teacher never used, the way actual RL algorithms do every day, then the authors would have noticed that, and been very excited by it, and centered their whole paper around it, all the way up to the title. The fact that they don’t mention anything like that is, I think, a strong sign that it didn’t happen.
As @glazgogabgolab said, there are approaches that might learn something. But I think they still can't perform as well as classical RL or SGD in some cases, not because LLMs are neural networks (whose prior is not universal), but because of the architecture of standard transformers with multi-head attention (MHA). Long contexts are also compute-expensive, since MHA has quadratic complexity. And information in these LLMs flows only forwards, except through the tokens they write, so the only way for them to internalize/compress information is by rewriting it as text, which is presumably infeasible and slow.
With neuralese recurrence (Training Large Language Models to Reason in a Continuous Latent Space), the model can keep a latent vector that can be intelligently updated and thus can internalize/compress information by rewriting it as a list of vectors, which is more expressive. However, it is more limited than the full parameter updates that RL algorithms can do.
Future models could have some kind of "enhanced" backwards pass that allows online learning as expressive as gradient ascent. I imagine something like neuralese recurrence, but where rather than writing a small latent vector at each step, the model writes to a single functional that is part of its parameters. But that might not be a practical way to achieve AGI, or might be very far from current tech.
Edit: In this notebook I (vibe)coded an LSTM which learns to imitate UCB for a multi-armed bandit with a training horizon of 50 steps, and the final model can successfully continue to do online learning until the 100th step.
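For context, the teacher algorithm being imitated there is tiny; UCB for a Bernoulli bandit fits in a dozen lines. A standard UCB1 sketch (generic textbook version, not the notebook's code):

```python
import math
import random

def ucb1(arm_means, horizon, c=2.0, seed=0):
    """Run UCB1 on a Bernoulli bandit; return per-arm pull counts."""
    rng = random.Random(seed)
    k = len(arm_means)
    counts = [0] * k          # times each arm was pulled
    sums = [0.0] * k          # total reward per arm
    for t in range(1, horizon + 1):
        if t <= k:
            arm = t - 1       # pull each arm once to initialize
        else:
            # empirical mean plus an exploration bonus that shrinks
            # as an arm accumulates pulls
            arm = max(range(k), key=lambda a: sums[a] / counts[a]
                      + math.sqrt(c * math.log(t) / counts[a]))
        reward = 1.0 if rng.random() < arm_means[arm] else 0.0
        counts[arm] += 1
        sums[arm] += reward
    return counts

# The best arm (index 2) ends up with the large majority of the pulls.
counts = ucb1([0.2, 0.5, 0.9], horizon=1000)
```

The interesting question for this thread is whether the imitator keeps executing the "mean plus shrinking exploration bonus" rule past its training horizon, which is what the notebook claims to observe out to step 100.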
I think I’m a little confused about the hypothesis space part. I agree it sounds implausible to run multiple learning algorithms in parallel within a transformer forward pass to find the best one, and the search space is really large.
But if we just ask about the hypothesis space for a moment: is it really practically impossible for a transformer forward pass to simulate a deep-Q style learning algorithm? Even with e.g. 3-5 OOMs more compute than GPT-4.5?
I worry you could’ve made this same argument ten years ago for simulating human expert behavior over 8 hour time horizons — which involves some learning, e.g. navigating a new code base, checking code on novel unit tests. It’s shallow learning, sure. You don’t have to update your world model that much. But it’s not nothing, and ten years ago I probably would’ve been convinced that a transformer forward pass could never practically approximate it. Why is the deep Q style learning algorithm so much harder to simulate?
It feels like there’s some theoretical claim about complexity underlying your position: something like {whatever quasi-learning algorithm + heuristics an LLM uses to simulate 8 hours of SWE} is exponentially simpler than {any true continual learning algorithm}. (That’s why you’d need the hypercomputer, if I’m reading you right?) Could you spell that out more?
Even if you can simulate a continual learning algorithm within a transformer or other imitation learner, I agree that it feels like unnecessary complexity: why have a transformer simulate a neural net running some RL algorithm when you could just train the RL agent yourself?
is it really practically impossible for a transformer forward pass to simulate a deep-Q style learning algorithm? Even with e.g. 3-5 OOMs more compute than GPT-4.5?
I say yes. You left out an important part, here it is in italics: “is it really practically impossible for a transformer forward pass to simulate a deep-Q style learning algorithm churning for millions of steps?”
Yes, because an awful lot can happen in millions of steps, including things that build on each other in a serial way.
I worry you could’ve made this same argument ten years ago for simulating human expert behavior over 8 hour time horizons — which involves some learning, e.g. navigating a new code base, checking code on novel unit tests. It’s shallow learning, sure. You don’t have to update your world model that much. But it’s not nothing, and ten years ago I probably would’ve been convinced that a transformer forward pass could never practically approximate it.
I disagree that it should be called “learning” at all. It would be “learning” for a human in real life, but if you imagine a person who has read 2 billion lines of code [that’s the amount of GitHub code in The Pile … actually today’s LLMs probably see way more code than that], which would correspond to reading code 24 hours a day for 100 years, then I believe that such a person could do the METR 8 hour tasks without “learning” anything new whatsoever. You don’t need to “learn” new things to mix-and-match things you already know in novel ways—see my example here of “imagine a pink fuzzy microphone falling out of a helicopter into a football stadium full of bunnies”. And see also: related discussion here.
why have a transformer simulate a neural net running some RL algorithm when you could just train the RL agent yourself?
Yup, that’s my main point in this post, I expect that sooner or later somebody will invent real-deal continual learning, and it will look like a bona fide learning algorithm written in PyTorch with gradient descent steps and/or TD learning steps and/or whatever else, as opposed to (so-called) “in-context learning” or RAG etc.
I briefly tried to do mechinterp research to figure out what the algorithm distillation model was doing internally, and whether different setups could learn in-context RL, but kind of gave up and started on other projects. This kind of makes me want to go back into it.
My own view on whether models can learn to imitate long-term learning is that maybe it's possible. I don't think the actual algorithm distillation setup does that on their toy tasks, but those tasks are extremely simple, and I would expect that if something like this works, it's on more complicated things with bigger models and multiple tasks, where it's easier to learn in-context RL than separate heuristics for every task.
And I don't really understand why you are so sure the answer is no.
It doesn't even have to be the exact same Q-learning algorithm, just some approximation that does learn over longer timesteps.
You talk about the impossible task of learning to do in its activations what the Q-learning algorithm does on the task, but that doesn't seem obviously impossible to me? Especially for a much bigger net trying to replicate a smaller one.
And even if I agreed more with you that it seemed unlikely, I would not be very sure, because that seems like just a vibes-based guess, and it's easy to be wrong about vibes-based guesses of what can be done in a transformer forward pass. I would want actual details and thought put into exactly how hard it is to represent an RL algorithm in a transformer, how hard it is for the transformer to learn it, and why, before I was pretty sure it was not possible.
There are some papers on doing gradient descent in activation space too, and on how this might happen in ICL, that seem relevant, though I haven't read them in a long time. I'll have to look back into it.
Also, glazgogabgolab in another comment has other examples of more recent work that look interesting. I haven't looked into those yet, but it seems possible to me that there's already some paper somewhere showing in-context RL?
Regardless, this seems testable, which is interesting; it's just a lot of work.
The main problem is that this is hard to do well and expensive in compute, because you need lots of examples of RL training trajectories.
When the imitation learning is done, the transformer weights are frozen, and the corresponding trained model is given the impossible task of using only its activations, with fixed weights, to imitate what happens when the target continual learning algorithm changes its weights over millions of steps of (in this case) TD learning. That’s the part I’m skeptical of.
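To spell out what "changes its weights over millions of steps" means concretely, here is the shape of the target algorithm in tabular form (a generic Q-learning/TD loop on a made-up two-state chain, purely illustrative): every step performs an in-place update, and the table after step N is a function of the entire update history, which is what the frozen-weight imitator would have to reproduce in activations.

```python
import random

# Two-state chain: in state 0, action 1 moves to state 1 (reward 0);
# in state 1, action 1 yields reward +1 and ends the episode.
def q_learning(episodes, alpha=0.1, gamma=0.9, eps=0.2, seed=0):
    rng = random.Random(seed)
    Q = {(s, a): 0.0 for s in (0, 1) for a in (0, 1)}
    for _ in range(episodes):
        s = 0
        while s is not None:
            # epsilon-greedy action selection
            if rng.random() < eps:
                a = rng.choice((0, 1))
            else:
                a = max((0, 1), key=lambda act: Q[(s, act)])
            if s == 0:
                r, s_next = 0.0, (1 if a == 1 else 0)
            else:
                r, s_next = (1.0, None) if a == 1 else (0.0, 1)
            target = r if s_next is None else \
                r + gamma * max(Q[(s_next, 0)], Q[(s_next, 1)])
            Q[(s, a)] += alpha * (target - Q[(s, a)])  # the in-place TD update
            s = s_next
    return Q

Q = q_learning(500)
# Converges toward the optimal values: Q[(1, 1)] -> 1.0, Q[(0, 1)] -> 0.9.
```

In deep Q-learning the table becomes a neural net and the `+=` becomes a gradient step, but the serial, self-referential structure of the updates is the same.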
Do you think longer-horizon RL can teach them this?
LLMs have some in-context learning abilities, but I think for most of what they do (like solving IMO problems or writing one-off programs), they can get by mostly relying on the knowledge in their weights.
But as RL trajectories get longer, there’s more and more pressure on the model learning things over a single rollout.
This knocks on the door of a principle that I have been playing with for a while: a good continual learning and/or sequence modelling algorithm should converge to some known behavior. Architectures like attention have undefined behavior once the end of their training context length is reached. SGD on the other hand can be run indefinitely, because we know that it will eventually converge to an interpolation of the data.
I tell an LLM my favorite color. As long as that information is in its context window, it has access to it. As soon as that context rolls off or goes away, the LLM no longer has access to that information.
I build an agent with scaffolding that has a database. I tell it my favorite color. The agent records it in the database. The weights of the LLM are still fixed, but during its base training it learned how to access information. So if I ask it at any point in the future what my favorite color is, it knows. It accesses the information in the database.
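The setup being described can be sketched in a few lines (a hypothetical minimal agent; the `remember`/`answer` interface and dict-backed store are illustrative, not any particular product's API):

```python
class ScaffoldedAgent:
    """An LLM with frozen weights plus an external key-value store.
    The weights never change; only the database does."""

    def __init__(self, llm):
        self.llm = llm          # frozen model: prompt -> text
        self.db = {}            # persistent store, outlives any context window

    def remember(self, key, value):
        self.db[key] = value    # e.g. ("favorite color", "blue")

    def answer(self, question):
        # Retrieve the stored facts and prepend them to the prompt, so the
        # information survives even after it rolls out of context.
        facts = "\n".join(f"{k}: {v}" for k, v in self.db.items())
        return self.llm(f"Known facts:\n{facts}\n\nQuestion: {question}")

# Usage with a stub "LLM" that just echoes its prompt:
agent = ScaffoldedAgent(llm=lambda prompt: prompt)
agent.remember("favorite color", "blue")
reply = agent.answer("What is my favorite color?")
assert "favorite color: blue" in reply
```

All the persistence lives in `self.db`; nothing about the model itself changes.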
Do you consider this continual learning? If not, why not?
See everything I wrote in the section “Some intuitions on how to think about ‘real’ continual learning”. The thing you’re describing is definitely not (what I’m calling) “real” continual learning.
Should the thing you’re describing be called “continual learning” at all? No opinion. Call it whatever you want.
So according to you, a system that could acquire new facts, record them, access them, and use them, continuously in this way would not constitute 'real' continuous learning. It could conceivably fill its database with the actionable knowledge of 1000 yet unwritten textbooks, but that wouldn't be 'real' to you.
You seem to be putting somewhat arbitrary constraints on what constitutes continual learning. Generally, if the system's knowledge base is fixed, it's incapable of continuing to learn. If it has the capacity to acquire new knowledge and skills, by whatever means, it continues to learn. You're narrowing that general idea without really justifying why.
As an analogy, take an adult from 30000 BC, call him Grog, and give him access to a database of “actionable knowledge of 1000 textbooks”, and then tell him to go invent a less expensive solid-state LIDAR system. Will he immediately start making progress? I say “obviously not”.
What would the “actionable knowledge” look like? Maybe one piece of “actionable knowledge” is some fact from the ANSI Z136.1 laser eye safety manual (“For pulsed lasers of 1ns–50μs pulse duration and beam diameter 1 cm, at viewing distance 20 cm, the diffusely reflected beam energy cannot safely exceed 0.022 × CA joules, where CA is the correction factor for IR-A light based on reduced absorption properties of melanin”.) OK, Grog looks at that and immediately has some questions. What does “laser” mean? What is a “pulsed laser”? What does “ns” mean? What does “beam diameter” mean? What does “diffusely reflected” mean? Etc. etc.
This “knowledge” is not in fact “actionable” because Grog can’t make heads or tails of it.
And ditto for pretty much every other item in the database. Right?
What Grog would need to do is spend years developing a deep understanding of optics and lasers and so on before he could even start inventing a new LIDAR system. Of course, that’s what modern LIDAR inventors do: spend years developing understanding. Once Grog has that understanding, then yeah sure, convenient database access to relevant facts would be helpful, just as modern LIDAR inventors do in fact keep the ANSI Z136.1 manual in arm’s reach.
Thus, there’s more to knowledge than lists of facts. It’s ways that the facts all connect to each other in an interconnected web, and it’s ways to think about things, etc.
I claim that this all transfers quite well to LLMs. It’s just that LLMs already have decent “understanding” of everything that humans have ever written down anywhere on the internet or in any book, thanks to pretraining. So in our everyday interactions with LLMs, we don’t as often come across situations where the LLM is flailing around like poor Grog. But see 1, 2.
Sorry, I'm afraid I don't understand what your analogy is supposed to map to. What is Grog in the context of our conversation? You seem to admit at the end that LLMs are not really at all like Grog, in that Grog has no underlying bedrock of understanding, while modern LLMs do.
Thus, there’s more to knowledge than lists of facts. It’s ways that the facts all connect to each other in an interconnected web, and it’s ways to think about things, etc.
I'll agree with this definition, if you'll agree that knowledge can exist in written form, and that textbooks often embody exactly what you describe. They are very rarely 'lists of facts'. More often than not, they are logically curated, organized explanations of phenomena and events, along with rich descriptions of their connections and interactions. You seem to be preferentially upselling knowledge that is stored in synaptic weights while drastically downplaying knowledge recorded in other mediums. Why?
What is Grog in the context of our conversation? You seem to admit at the end that LLMs are not really at all like Grog, in that Grog has no underlying bedrock of understanding, while modern LLMs do.
Grog understands some things (e.g. intuitive physics) but not others (e.g. pulsed lasers). Likewise, LLMs understand some things (e.g. pulsed lasers) but not others (e.g. some new field of science that hasn’t been invented yet). Right? We’re not at the end of history, where everything that can possibly be understood is already understood, and there’s nothing left.
If I hibernated you until the year 2100, and then woke you up and gave you a database with “actionable knowledge” from 1000 textbooks of [yet-to-be-invented fields of science], and asked you to engineer a state-of-the-art [device that no one today has even conceived of], then you would be just as helpless as Grog. You would have to learn the new fields until you understood them, which might take years, before you could even start on the task. This process involves changing the “weights” in your brain. I.e., you would need “real” learning. The database is not a replacement for that.
So think of it this way: there’s some set of things that are understood (by anyone), and that set of things is not increased via a system for pulling up facts from a database. Otherwise Grog would be able to immediately design LIDAR. And yet, humans are able to increase the set of things that are understood, over time. After all, “the set of things that are understood” sure is bigger today than it was 1000 years ago, and will be bigger still in 2100. So evidently humans are doing something very important that is entirely different from what can be done with database systems. And that thing is what I’m calling “real” continual learning.
It's about the homogeneity of the data representation. An argument you could make is: if the hippocampus "stores" data in the neocortex, isn't the neocortex "just a database"? Even the brain uses different forms of neural networks. At that point the only distinction is that those neural networks share the same protocol of communication (spiking, dendrites, synapses, etc.), whereas an LLM breaks the protocol when it switches to tool use.
Another distinction, however, is that in a continual learning system, the new data affects the previous set, in a way that "code is data and data is code", while an LLM accessing a database doesn't affect the capabilities already learnt by the LLM. But what if the LLM accesses e.g. a prompt persona from the database? At that point the important question becomes: is ICL really "learning"? I would say not; it is just a preference optimization over what the LLM has already learned.
An attempt at a more formal definition of continual learning: ultimately it is a system that is irreversibly updated by its learning process (the update has no inverse).
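One way to make that irreversibility criterion concrete (a toy sketch of my own, not from the discussion above): a database write has an exact inverse that restores the system's previous behavior, whereas after a gradient step the system itself has no "undo", because the gradient was evaluated at weights that no longer exist:

```python
import copy

# Database-style memory: the update has an exact inverse operation.
memory = {"facts": set()}
snapshot = copy.deepcopy(memory)
memory["facts"].add("new fact")       # update
memory["facts"].discard("new fact")   # inverse of the update
assert memory == snapshot             # previous state fully restored

# Weight-style memory: an SGD step w -= lr * grad(w) offers the system
# no undo operation, since grad was computed at the old w, which is gone.
w = 1.0
lr = 0.1
grad = 2 * w    # gradient of the loss w**2 at the old w
w -= lr * grad  # w is now 0.8; reversing this needs the vanished old w
```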
The question posed by Byrnes is both important and interesting. I feel the answer overlooks fundamental limitations that prevented learning machines from translating language, much less functioning as chatbots, no matter how skilled they became at game play. The references to language and the economy contain embedded dependencies on relationships and cooperation over time, which are not represented in the sort of games used in the thought experiments.
The Core Principle
Neural networks without transformers are effectively stateless; they are unaware of history and produce moves based only on the immediate input, not the trajectory of the system. Because they lack this historical awareness, they cannot recognize or maintain relationships, which makes them incapable of cooperation and, by extension, extremely dangerous.
In this post, I’m trying to put forward a narrow, pedagogical point, one that comes up mainly when I’m arguing in favor of LLMs having limitations that human learning does not. (E.g. here, here, here.)
See the bottom of the post for a list of subtexts that you should NOT read into this post, including “…therefore LLMs are dumb”, or “…therefore LLMs can’t possibly scale to superintelligence”.
Some intuitions on how to think about “real” continual learning
Consider an algorithm for training a Reinforcement Learning (RL) agent, like the Atari-playing Deep Q network (2013) or AlphaZero (2017), or think of within-lifetime learning in the human brain, which (I claim) is in the general class of “model-based reinforcement learning”, broadly construed.
These are all real-deal full-fledged learning algorithms: there’s an algorithm for choosing the next action right now, and there’s one or more update rules for permanently changing some adjustable parameters (a.k.a. weights) in the model such that its actions and/or predictions will be better in the future. And indeed, the longer you run them, the more competent they get.
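As a concrete instance of that two-part structure, an action-selection rule plus an update rule that permanently changes stored values, here is a tabular Q-learning sketch on a toy chain environment (purely illustrative; the actual Deep Q network and AlphaZero use neural networks, not tables):

```python
import random

random.seed(0)

# Toy 5-state chain: start at state 0, reward 1.0 for reaching state 4.
N_STATES, GOAL = 5, 4
Q = [[0.0, 0.0] for _ in range(N_STATES)]  # actions: 0 = left, 1 = right

def env_step(s, a):
    s2 = max(0, s - 1) if a == 0 else min(GOAL, s + 1)
    return s2, (1.0 if s2 == GOAL else 0.0), s2 == GOAL

for _ in range(500):  # the longer it runs, the more competent it gets
    s, done = 0, False
    while not done:
        # Action-selection rule: epsilon-greedy over current Q values,
        # with ties broken at random.
        if random.random() < 0.1:
            a = random.randrange(2)
        else:
            best = max(Q[s])
            a = random.choice([x for x in (0, 1) if Q[s][x] == best])
        s2, r, done = env_step(s, a)
        # TD update rule: permanently changes the stored parameters.
        Q[s][a] += 0.5 * (r + 0.9 * max(Q[s2]) - Q[s][a])
        s = s2

# After training, "move right" dominates "move left" in every non-goal state.
```

No expert examples are copied anywhere here: competence comes entirely from the environment's reward signal plus the update rule.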
When we think of “continual learning”, I suggest that those are good central examples to keep in mind. Here are some aspects to note:
Knowledge vs information: These systems allow for continual acquisition of knowledge, not just information—the “continual learning” can install wholly new ways of conceptualizing and navigating the world, not just keeping track of what’s going on.
Huge capacity for open-ended learning: These examples all have huge capacity for continual learning, indeed enough that they can start from random initialization and “continually learn” all the way to expert-level competence. Likewise, new continual learning can build on previous continual learning, in an ever-growing tower.
Ability to figure things out that aren’t already on display in the environment: For example, an Atari-playing RL agent will get better and better at playing an Atari game, even without having any expert examples to copy. Likewise, billions of humans over thousands of years invented language, math, science, and a whole $100T global economy from scratch, all by ourselves, without angels dropping new training data from the heavens.
I bring these up because I think the LLM-focused discourse sometimes has far too narrow a notion of what problem "continual learning" is supposed to be solving. People tend to think the problem is about "losing track of information", not "failing to build new knowledge", and they propose to solve this problem with strategies like "make the context [window] longer" (as Dario Amodei recently mused), or better scratchpads with Retrieval-Augmented Generation (RAG) etc.
But real “continual learning” also includes the ways that AlphaZero changes after a million games of self-play, or the ways that a human brain changes after 20 years in a new career. There is no system of scratchpads that you can give to a 15-year-old, such that it would be an adequate substitute for them spending the next 20 years growing into a 35-year-old world expert in some field. Likewise, there is no context window that can turn GPT-2 into GPT-5.
Suppose you took an actual “country of geniuses in a datacenter”, completely sealed them from the outside world, and gave them a virtual reality environment to hang out in for the equivalent of 100 years. What would you find when you unsealed it? There would be whole new ways of thinking about the world and everything in it—entirely new fields of science, schools of philosophy, and so on.
Can a bunch of LLMs do that? Well, consider this thought experiment: suppose you take a whole new field of science, wildly different from anything in the training data, and put a giant textbook for this field purely in an LLM context window, with no weight updates at all. Will this LLM be able to understand, criticize, and build on this field? My opinion is "absolutely not" (see 1, 2), which implies that merely increasing context lengths is definitely not sufficient for a real "country of geniuses in a datacenter", when the datacenter is sealed shut for the equivalent of 100 years (contra Dario, who seems to think that it's at least in the realm of possibility that more context is sufficient by itself to get continual learning at "country of geniuses" level).
(If we’re talking about what a sealed “country of human geniuses” could do over the course of, like, one minute, rather than over the course of 100 years, then, yeah sure, maybe that could be reproduced with future LLMs! See von Oswald et al. 2022 on how (so-called) “in-context learning” can imitate a small number of steps of actual weight updates.[1])
Why “real” continual learning can’t be copied by an imitation learner
Now, suppose that I take a generic imitation-learning algorithm (e.g. self-supervised learning in a transformer-architecture neural net, just like LLM pretraining), and have it watch a deep Q network play Atari Breakout, as it starts from random initialization, and gets better and better over 1M iterations. OK, now we have our trained imitation-learner. We freeze its weights, and use it in a similar way as people traditionally used LLM base models, i.e. have it output the most likely next move, and then the most likely move after that, etc.
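A stripped-down caricature of that pipeline (a count-based next-action predictor stands in for the transformer, and a short hand-written list stands in for the logged expert trajectory; both are illustrative assumptions):

```python
from collections import Counter, defaultdict

# Phase 1: log (state, action) pairs from the "expert". In the real setup
# these would come from watching the deep Q network over 1M iterations.
expert_log = [(0, 1), (1, 1), (2, 1), (0, 1), (1, 0), (1, 1), (2, 1)]

# Phase 2: "train" the imitator on next-action prediction.
counts = defaultdict(Counter)
for state, action in expert_log:
    counts[state][action] += 1

# Phase 3: freeze it, and at inference time emit the most likely action.
def imitator(state):
    if state not in counts:
        return None  # off-distribution: the imitator has nothing to say
    return counts[state].most_common(1)[0][0]

# It reproduces the expert's majority behavior on seen states, but it has
# no update rule of its own: nothing here ever changes after training,
# which is the disanalogy with the TD-learning expert it imitates.
```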
Question: Is this trained imitation-learner actually a good imitation of the deep Q network? Well, “good” in what respect? I would pull apart a couple topics:
Why not? Well, actually, for an ideal imitation learning algorithm, i.e. Solomonoff induction on an imaginary hypercomputer, my answers would all be “yes”! But in the real world, we don’t have hypercomputers!
These days, when people talk about imitation learning, they’re normally talking about transformers, not hypercomputers, and transformers are constrained to a much narrower hypothesis space:
| | Imitation-learning a deep-Q RL agent by Solomonoff induction | Imitation-learning a deep-Q RL agent by training a transformer on next-action prediction |
|---|---|---|
| Hypothesis space | The set of all computable algorithms | A forward pass through T, for the set of all possible trained transformers T |
| Ground truth | The actual deep-Q RL agent, with such-and-such architecture, and Temporal Difference (TD) learning weight updates, etc. | The actual deep-Q RL agent, with such-and-such architecture, and Temporal Difference (TD) learning weight updates, etc. |
| Asymptotic limit | It converges to the actual deep-Q RL agent | It converges to whatever trained transformer forward pass happens to be closest to the actual deep-Q RL agent |
I think we should all be very impressed by the set of things that a transformer forward pass[2] can do. But we should not expect a transformer forward pass to reproduce a full-fledged, entirely different, learning algorithm, with its own particular neural network architecture, its own particular methods of updating and querying weights, etc., as it runs and changes over millions of steps.
Running one large-scale learning algorithm is expensive enough; it’s impractical to run a huge ensemble of different large-scale learning algorithms in parallel, in order to zero in on the right one.[3]
I’m going to harp on this because it’s a point of confusion. There are two learning algorithms under discussion: the imitation-learning algorithm (e.g. a transformer getting updated by gradient descent on next-action prediction), and the target continual learning algorithm (e.g. a deep Q network getting updated by TD learning). When the imitation learning is done, the transformer weights are frozen, and the corresponding trained model is given the impossible task of using only its activations, with fixed weights, to imitate what happens when the target continual learning algorithm changes its weights over millions of steps of (in this case) TD learning. That’s the part I’m skeptical of.
In other words: The only practical way to know what happens after millions of steps of some scaled-up continual learning algorithm is to actually do millions of steps of that same scaled-up continual learning algorithm, with actual weights getting actually changed in specifically-designed ways via PyTorch code. And then that’s the scaled-up learning algorithm you’re running. Which means you’re not doing imitation learning.
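The point can be caricatured in a few lines of pure Python standing in for "PyTorch code that assembles data and runs SGD in a loop" (a minimal sketch, fitting y = 2x): the trained weights at step N exist only because N actual update steps were executed, one after another.

```python
# Minimal caricature of a training loop: fit y = 2x by actually running
# the weight updates. There is no shortcut to w-after-1000-epochs other
# than executing the 1000 epochs of updates.
data = [(x, 2.0 * x) for x in range(1, 6)]

w = 0.0  # the weight that actually, irreversibly changes
for _ in range(1000):
    for x, y in data:             # assemble the data ...
        grad = 2 * (w * x - y) * x
        w -= 0.01 * grad          # ... and run an SGD step on it
```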
So back to the human case: for a typical person (call him “Joe”), I think LLMs are good at imitating “Joe today”, and good at imitating “Joe + 1 month of learning introductory category theory”, but can’t imitate the process by which Joe grows and changes over that 1 month of learning—or at least, can’t imitate it in a way that would generalize to imitating a person spending years building a completely different field of knowledge that’s not in the training data.
Some things that are off-topic for this post
As mentioned at the top, I’m hoping that this post is a narrow pedagogical point. For example:
I guess I also need to mention the “algorithmic distillation” paper (Laskin et al. 2022), but I’m hesitant to take it at face value, see discussion here.
You can replace “a forward pass” with “10,000 forward passes with chain-of-thought reasoning”; it doesn’t change anything in this post.
Outer-loop search over learning algorithms is so expensive that it’s generally only used for adjusting a handful of legible hyperparameters, not doing open-ended search where we don’t even vaguely know what we’re looking for. Even comparatively ambitious searches over spaces of learning algorithms in the literature have a search space of e.g. ≈100 bits, which is tiny compared to the information content of a learning algorithm source code repository.