A number of arguments about AI capabilities have deep roots in different expectations about how scaling can work, and historical analogues are often raised. This is true on both sides of the debate, with AI Snake Oil noting the end of scaling for processors and aircraft speeds, an Open Philanthropy researcher’s draft report looking at trends and biological anchors, work for Metaculus on compute scaling trends, and Epoch AI tracking scaling trends and expected limits. This leads to questions in the other direction as well: AI Impacts looks at discontinuities in progress, asking whether scaling might jump, rather than proceed at “only” an exponential pace.
One very basic issue is how to choose the metrics, and how those metrics relate to the actual dynamics of interest - a topic which is itself a more general interest of mine. The question of which metric continues to improve over time is a critical but difficult one - and it is deeply related to whether we expect progress from AI in general.
We can see that performance on a given test is a bad metric, for (at least) three reasons. First, there is a built-in limit of perfection on the test, and previous tests of AI have been discarded once ML systems obviously beat them. The Turing test has been abandoned, perhaps because one obvious way to tell the difference between a current LLM system and a human is that a human won’t have near-universal expertise and knowledge. Similarly, almost all new LLMs outperform humans on knowledge-based tests like SQuAD. Second, many benchmarks have leaked into the training datasets of modern LLMs, so that many models can answer multiple-choice questions from these datasets correctly by picking A, B, C, or D even when the text of the options is not provided. Similarly, many models perform substantially worse on test sets with similar-difficulty but distinct questions. If this weren’t sufficient evidence, their success rate on questions which are nonsensical or impossible far exceeds chance. This brings us to the third and final point: large test sets often contain a proportion of questions which are unreasonable or garbage - as shown for MMLU and CoQA.
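To make the contamination point above concrete, here is a minimal sketch of that kind of check, assuming a hypothetical `query_model` callable that sends a prompt to the model under test and returns its reply as a string; the data layout and function names are illustrative, not any specific benchmark’s API.

```python
import random

def contamination_check(benchmark, query_model, trials=500):
    """Rough probe for benchmark leakage: ask only for the answer letter,
    withholding the option text. Accuracy well above the 25% chance
    baseline suggests the model has memorized the questions.

    `benchmark` is assumed to be a list of dicts with 'question' and
    'answer' keys (answer is one of 'A', 'B', 'C', 'D'); `query_model`
    is a hypothetical callable: prompt string in, reply string out.
    """
    sample = random.sample(benchmark, min(trials, len(benchmark)))
    correct = 0
    for item in sample:
        prompt = (
            f"{item['question']}\n"
            "Answer with the letter of the correct option (A, B, C, or D). "
            "The options are intentionally not shown."
        )
        reply = query_model(prompt).strip().upper()
        if reply[:1] == item["answer"]:
            correct += 1
    accuracy = correct / len(sample)
    print(f"Accuracy without seeing options: {accuracy:.1%} (chance is 25%)")
    return accuracy
```

A model that has never seen the benchmark should hover near chance on this probe; scores far above it are hard to explain without leakage.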
But whatever it is that LLMs are improving at over time, they are clearly imperfect, and there is room to improve. Still, on one hand, some limits obviously exist: physics dictates the speed of light, and the quantum and heat-dissipation limits for transistors. On the other hand, to the extent that what is measured is not identical to what is desired, the misalignment between metrics and goals provides an opportunity to keep scaling the goal, rather than the metric.
And this mismatch is echoed in the dynamics of processor scaling when Moore’s law “broke” - transistor density was scaling as predicted until the 2010s, but related metrics, such as energy use per computation or cost per transistor, have continued to improve. This reflects the fact that the underlying goal of computer developers and buyers was never transistor density itself, so the improvement that was achieved shifted to metrics which did not face a fundamental limit. (It’s interesting to realize that this happens with physical law as well: the speed of light is an absolute limit, but a mass accelerating towards the speed of light does not simply accelerate until it stops; the added energy instead goes into additional relativistic mass rather than additional speed.)
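For the physics aside, the standard textbook relation makes the point explicit:

$$E = \gamma m c^2, \qquad \gamma = \frac{1}{\sqrt{1 - v^2/c^2}}$$

As v approaches c, γ grows without bound, so additional energy shows up as growth in γ - in relativistic mass and kinetic energy - rather than as speed beyond the limit.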
The commercial incentives of AI developers, and the substantive goals of academic AI researchers, are not to perform well on test evaluations, but to create AI systems that are increasingly generally useful. Perhaps scaling laws will start to fail, perhaps they will not. Or, as with the shift from transistor density to other measurements, perhaps some new metric will be needed to show which dimension is continuing to increase exponentially.
But given what has been happening, it seems far more likely that developers will find ways of achieving the goals they have set themselves than that running out of training data or the increasing cost of compute will form an immediate, substantive barrier to increasing AI capabilities. We already see pathways around this problem, where the advances don’t come from simply pushing further along previously defined scaling laws - prompt design via chain-of-thought, synthetic training data for domain-specific problem solving, or mixture-of-experts (MoE) architectures replacing single scaled models.
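As one illustration of the first of these pathways, here is a minimal sketch of zero-shot chain-of-thought prompting; the function names and templates are my own illustrative choices, not any particular lab’s API.

```python
def direct_prompt(question: str) -> str:
    """Baseline: ask the model for an answer directly."""
    return f"Q: {question}\nA:"

def chain_of_thought_prompt(question: str) -> str:
    """Zero-shot chain-of-thought: nudge the model to write out
    intermediate reasoning before answering. The model, its weights,
    and its training data are unchanged; only the prompt differs,
    yet this often improves accuracy on multi-step problems."""
    return f"Q: {question}\nA: Let's think step by step."
```

The gain here comes entirely from how an existing model is used, which is exactly why progress doesn’t reduce to pushing further along a single scaling curve.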
Of course, these arguments for why AI progress will continue do not imply that the systems will be safe - but that is a different question, and what we want isn’t the same as what we predict. The predictive claims I keep seeing that AI isn’t on the path to human-level intelligence seem to rely either on an intelligence of the gaps, where we keep finding places where humans still outperform, or on AI developers deciding that whatever skeptics mean by intelligence isn’t a useful goal, and then not attempting other ways to achieve similar results. But humans aren’t magic, so there’s nothing about intelligence that can’t be replicated, or improved. Of course, there may be some deeper fundamental limit to AI progress, or perhaps a limit of intelligence-in-general, but it seems shockingly convenient - unbelievably so - if that limit is somewhere below, or even close to, human intelligence.
In summary, the metrics for AI progress are limited, and it seems likely that the various metrics will hit some of those limits. That doesn’t mean AI capabilities will.
Thanks to Ariel Gil and Naham Shapiro for useful suggestions on an earlier draft of this post.