The thing is, humans are also lousy outside their training distribution. This is less obvious because our training distributions vary so much. But the phenomenon where some problem or technological need has been unsolved for many years, and then three groups solve it almost simultaneously, is generally because solving almost any hard problem requires combining about 3-5 other ideas. Consider one that takes 5. It's pretty much impossible until 3 of those ideas have been invented and publicized. Then it's really, really hard: you have to spot that three things are relevant and how to combine them, then come up with two separate great ideas to fill the gaps. But once 4 of them are done, the threshold drops: now you only have to spot and combine 4 things and come up with 1, and once the 5th one has come along, all you have to do is spot the pieces and figure out how to put them together. So as progress continues, the problem gets drastically easier. And then suddenly three groups solve the same problem, by assembling the same mix of ideas, one of which is recent.
LLMs can combine things that have never previously been combined, and can thus successfully extrapolate outside the training distribution. Currently, they're superhuman at knowing about all the ideas that anyone has come up with as of their knowledge cutoff – that's a breadth-of-knowledge skill where they easily outperform humans – and clearly less good at figuring out how to assemble them, and especially at inventing a new missing idea to fill in gaps.
My question is, are those two skills both ones they are always going to be subhuman at, or are they just things they're currently bad at? Their capabilities are so spiky compared to humans, it's hard to be sure, but there are plenty of things where people said "LLMs are extremely bad at X", and they were right at the time, but a few years and model generations later LLMs caught up, and are no longer bad at X. So I'm not going to be astonished if both of these go the same way.
Now, LLMs are very, very good at standing on the shoulders of giants, so it's easy to mistake them for smarter than they really are. Current models still have plenty of things they're subhuman at, as well as quite a few things they're superhuman at. But they average out at somewhere in the rough vicinity of a grad student or an intern working for a few hours, who are not generally the people who come up with new inventions.
Not sure why the go-to examples for out-of-distribution problems tend to be the extreme of an entirely new theory or invention. To make progress on this problem, we'd want to identify minimally-OOD problems and benchmark those, wouldn't we?
Melanie Mitchell and collaborators showed weaknesses in LLMs on OOD tasks by making simple perturbations to the alphabet in string-analogy tasks. This seems like the sort of example we should generally be thinking about and testing, because toy domains or simple ad hoc tasks that deviate from strong biases in the training distribution are likely to be much more tractable.
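To make the flavor of those tasks concrete, here's a toy sketch (my own hypothetical illustration, not Mitchell et al.'s actual task format or evaluation code) of a letter-string analogy posed over a permuted alphabet. The successor rule is unchanged and the ordering is stated right in the prompt; the only "perturbation" is that the ordering no longer matches the memorized a-b-c sequence:

```python
import random
import string

def make_analogy_prompt(rng, permute=False):
    """Build a toy letter-string analogy ("abc -> abd, so ijk -> ?") over either
    the standard alphabet or a randomly permuted one, with the ordering stated
    explicitly in the prompt so the underlying task stays the same."""
    alphabet = list(string.ascii_lowercase)
    if permute:
        rng.shuffle(alphabet)  # the "simple perturbation": an unfamiliar ordering
    i = rng.randrange(len(alphabet) - 3)
    src, src_ans = alphabet[i:i + 3], alphabet[i:i + 2] + [alphabet[i + 3]]
    j = rng.randrange(len(alphabet) - 3)
    probe, answer = alphabet[j:j + 3], alphabet[j:j + 2] + [alphabet[j + 3]]
    prompt = (
        f"The alphabet, in order, is: {' '.join(alphabet)}\n"
        f"If {' '.join(src)} changes to {' '.join(src_ans)}, "
        f"what does {' '.join(probe)} change to?"
    )
    return prompt, " ".join(answer)

rng = random.Random(0)
for permute in (False, True):
    prompt, expected = make_analogy_prompt(rng, permute)
    print(prompt)
    print("Expected:", expected, "\n")
```

As I understand the reported pattern, humans handle the permuted variant with only a modest accuracy drop, while LLM performance falls off much more sharply, which is what makes it a nice minimally-OOD probe.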
Demonstrating this with simpler, less-challenging tasks should give us some idea whether this is an area that LLMs are poor but improving at, and will sooner or later catch up on, or genuinely bad at and always will be, for some architectural reason. Sounds like a good idea, but not something I know anything about (and a bit too capabilities-related for my taste).
My fundamental rule of thumb on this sort of issue is that it's conjectured that SGD with suitable hyperparameters approximates Bayesian learning. If that's correct, then LLM training is close to optimal (since Bayesian learning is optimal), modulo issues like the training dataset, choice of priors/inductive biases, etc. So a comparative difference with humans would have to come down to things like the quality of the approximation to Bayesian learning, the priors/inductive biases, the choice of pretraining dataset, curriculum-learning effects, or architectural limitations that make certain things nigh impossible for the LLM (for example, lack of continual learning, or a text-only transformer doing work on video or audio data).
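To spell out what I mean by the conjecture (this is just my gloss, not a citation of any particular result): Bayesian learning means keeping a posterior over parameters and predicting with the posterior predictive,

$$p(\theta \mid D) \;\propto\; p(D \mid \theta)\, p(\theta), \qquad p(y \mid x, D) \;=\; \int p(y \mid x, \theta)\, p(\theta \mid D)\, d\theta,$$

and the conjecture is that the ensemble of networks SGD finds from random initializations behaves approximately like samples from that posterior, with the prior $p(\theta)$ supplied implicitly by the architecture and initialization. Everything in the list above then shows up as a choice of $D$, a choice of $p(\theta)$, or the gap between SGD's ensemble and the true posterior.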
For my example of combining preexisting ideas in the right way to solve a well-known problem: most of the impressive human examples of that involve months of work, so current LLMs' lack of continual learning, and task horizons in the hours range, are going to make doing that nigh impossible at the moment. Humans generally work their way out of distribution slowly, one small step at a time, by gradually expanding the distribution in an interesting-looking direction: that's what the scientific method / Bayesian learning is. Doing that without continual learning is inherently limited. So I find Jeremy Howard's observation that LLMs are bad at this unsurprising: I think it basically reduces to two of the widely-known deficiencies that LLMs currently have (and which the industry was already busily working on).
I'm not sure I would use terms like Lipschitz continuity, KL divergence, spurious oscillations, or OOD divergence to highlight the point, but when I imagine myself in a coworker / tech lead / management role working with human software engineers before 2024, versus myself as a software engineer working with LLM-powered coding assistants in 2026, there is a very clear difference in the kinds of "outside the training distribution" that show up in human-human vs human-LLM interactions. The latter is really, really fucking annoying and tiring in every single interaction, while the former is "it depends" (a.k.a. "hiring a team that will be a good match together").
The agentic scaffolds of 2025+ are making it possible to work around some of the fundamental jaggedness of LLM base models, which are still complete shit at "understanding", so we are collectively moving ever more problems into "within distribution" instead of "divergent extrapolation", sure. So I agree it's totally unpredictable whether LLM-powered tools will be able to automate tasks enough to become the kind of dangerous agents for which it makes sense to reason about theoretically rational instrumental goals, even if LLMs alone might remain shit at goal-orientedness forever (or whether we need a different architecture). But we should probably discuss the capabilities of those agentic entities, not individual benchmark-gaming components of such entities...
The vastness of the training distribution is certainly one feature of the AI situation. But another is an army of human developers of AI, eager to discover what isn't in the training distribution, and what the AIs can't currently do, so they can figure out how to give the AIs those new capabilities.
Is there any argument that LLMs, turned into recurrent networks via chain of thought, will still have inherent limitations when compared with humans?
I don't think "interpolate/extrapolate" is that useful of a framing, for prediction purposes. It has utility, but this piece tries to say too much with it.
It's an ML classic, sure. But given the dimensionality involved? For any "real" unseen task, some aspects of it will be in the "interpolation" regime, and others will inevitably fall outside the hull of training data and into the "extrapolation" regime. "Outside of distribution" gets murky fast as dimensionality increases.
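That point can be made concrete with a quick numerical check (a sketch under toy assumptions: i.i.d. Gaussian data standing in for "the training distribution", and "interpolation" read literally as membership in the convex hull of the training set):

```python
import numpy as np
from scipy.optimize import linprog

def in_hull(points, x):
    """LP feasibility test: does x lie in the convex hull of `points`?
    We look for lambda >= 0 with sum(lambda) = 1 and points.T @ lambda = x."""
    n = points.shape[0]
    A_eq = np.vstack([points.T, np.ones((1, n))])
    b_eq = np.concatenate([x, [1.0]])
    res = linprog(np.zeros(n), A_eq=A_eq, b_eq=b_eq, bounds=(0, None), method="highs")
    return res.success

rng = np.random.default_rng(0)
for d in (2, 10, 50, 200):
    train = rng.standard_normal((500, d))   # "training set"
    test = rng.standard_normal((100, d))    # fresh samples from the SAME distribution
    inside = np.mean([in_hull(train, x) for x in test])
    print(f"dim={d:4d}  fraction of in-distribution test points inside the training hull: {inside:.2f}")
```

Even points drawn from the very same distribution stop counting as "interpolation" in the convex-hull sense once the dimension is moderately large, which is why I don't think hull membership carries much weight as a definition of in- vs out-of-distribution.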
Thus, it's nigh impossible to truly disentangle poor LLM performance into "failure to interpolate" and "failure to extrapolate". It's easy to make the case, but hard to prove it. "LLMs are fundamentally worse at extrapolation than humans are" remains an untested assumption.
It can be outright false or outright true. Or true under current scales and training methods and false at 2028 SOTA - a quantitative gap, the way 85-IQ humans are notably worse than average at extrapolation. The case for "outright true" is overstated.
One common practical example of a lasting LLM deficiency is spatial reasoning. Why do LLMs perform so poorly at spatial reasoning and "commonsense physics" tasks like that in SimpleBench?
Wrong architecture for the job - something like insufficient depth? Inability to take advantage of test time compute? Failure to extrapolate from text-only training data? Failure to interpolate from the sparse examples of spatial reasoning in the training data? Lack of spatial reasoning priors that humans get from evolved brain wiring? Insufficient scale to converge to a robust world physics model despite the other deficiencies?
We did interrogate the question, and we have some hints, but we don't have an exact answer. Multiple types of interventions improve spatial reasoning performance in practice, but none have attained human-level spatial reasoning in LLMs as of yet.
It doesn't seem to be as neat and simple of a story as "LLMs are inherently poor extrapolators" with what's known so far. And as long as SOTA performance keeps improving generation to generation, I'm not going to put a lot of weight on "the bottleneck is fundamental".
If you actually look at the number of bits of training data the human brain receives from birth to adulthood, a huge proportion of them are visual data. So I'm not surprised that we're comparatively good at 3D (and our nervous system very likely also has some good inductive priors for it). I suspect the answer for LLMs is mostly just multimodal models trained on a vast amount of video training data — expensive, though the cost is reducible somewhat by coming up with smarter ways to tokenize video.
That's a lot of words to just say LLMs currently are super-competent within their training distribution and not good outside of it. I haven't watched the whole podcast. Do we have good reason to think this particular deficit is unable to be remedied? That making inroads to issues like continual learning won't enable these sorts of systems to perform much better at ad hoc or out-of-distribution tasks?
Do we have good reason to think this particular deficit is unable to be remedied?
Calling it "this particular deficit" is an understatement. To give a bad comparison (but maybe good enough for an illustrative purpose): it's like calling airplanes' inability to go into space "a particular deficit", when the entire design of the vehicle is optimized for something other than going into space, and properly re-optimizing it for properly going into space would amount to making it into something very non-airplane-like.
The main reason the comparison is bad is that, in the limit of human imitation (and RLVR and "generalized current stuff"), you get a complete emulation of human cognition (from the input-output perspective, at least), and it then becomes possible to use it to create a "cleaner" design of cognition that supports relevant aspects of human cognition that are beyond the reach of LLMs. But the limit may be quite far or even not practically achievable, or at least less practical than taking a route whose first step is getting back to the drawing board.
(This is not the same as saying that LLMs cannot be helpful in finding this "cleaner" architecture before reaching this limit.)
Finally, this will sound a bit like a reductio ad absurdum, but it's relevant for talking about this clearly. What constitutes an "outside" of a training distribution depends on the larger distribution within which that training distribution is (considered as) being placed. Like in math, there is no "objective" complement of a set; a set's complement exists only with reference to a superset of that set. So "outside of a training distribution" can be anything between [just a slightly larger neighborhood of the distribution], in which the LLM starts surprisingly flailing (relative to what we would expect from a human with those in-distribution capabilities (?)), and the entirety of our world's (relevant) cognitive domains, the latter being AGI/ASI/A[something]I-complete.
The fact that the concept of "outside of the training distribution" can be so inflated makes me think that it's often used as a grab bag that hides a lot of complexity, and, in particular, all the complexity of human cognition minus what LLMs can do human-level-well or better.
A human attends board game night. They learn a new game they've never played before. Technically, this is out-of-distribution learning. This type of learning does not necessarily seem like augmenting a car for space travel (though maybe it is). They are not having to learn all about games and dice and boards and pieces from scratch. They are mostly having to map existing learned models onto a slightly novel combination in a slightly new domain. I'm not saying that's a trivial thing to do, because it's a hard open problem that many, many smart people have been trying to crack for decades.
But it does not seem as daunting as you are portraying it. Yes, out-of-distribution is a very large space. But there's an awful lot of that space that we're simply not interested in learning anyway, so that narrows it down quite a lot.
As another commenter here noted, we probably actually do hope it's a problem that won't be cracked anytime soon, though the current effort and resources being spent towards the problem are historically unprecedented. I very well could be underestimating the problem. I guess we'll just see.
It seems totally able to be remedied somehow, but it's been an open problem for a looong time. It definitely seems like it'll be one of the last things to fall, from the current vantage point. But maybe we just accumulate enough unreliable workarounds that it's no longer a severe limitation. I have ideas; hopefully they're bad ones, because I'd rather this not get improved until we've figured out how to gain confidence in safety/alignment qualitatively faster than we can right now, enough that open-ended RL at test time can be assumed to be asymptotically safe.
Jeremy Howard was recently[1] interviewed on the Machine Learning Street Talk podcast: YouTube link, interactive transcript, PDF transcript.
Jeremy co-invented LLMs in 2018, and taught the excellent fast.ai online course, which I found very helpful back when I was learning ML. He uses LLMs all the time, e.g. 90% of his new code is typed by an LLM (see below).
So I think his “bearish”[2] take on LLMs is an interesting datapoint, and I’m putting it out there for discussion.
Some relevant excerpts from the podcast, focusing on the bearish-on-LLM part, are copied below! (These are not 100% exact quotes, instead I cleaned them up for readability.)
…
…
…
…
The podcast was released March 3 2026. Not sure exactly when it was recorded, but it was definitely within the previous month, since they talk about a blog post from Feb. 5.
I mean, he’s “bearish” compared to the early-2026 lesswrong zeitgeist—which really isn’t saying much!