Effective parameter size is defined as the size a reference model would need in order to match the target model's performance, if a reference model of that size were trained for 1T tokens (Section 2.4). It is already hard to match the performance of models trained for 18T tokens by training a much larger model for only 1T tokens (though their theoretical assumptions imply this remains possible). By 2026-2027, when models are trained for 250T tokens (possibly by repeating the data), it is going to take very large reference models indeed to match their performance by training for only 1T tokens.
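To make that worry concrete, here is a minimal sketch under a Chinchilla-style loss L(N, D) = E + A/N^α + B/D^β, using loss as a stand-in for downstream performance and the Hoffmann et al. (2022) constants purely for illustration (the paper fits its own curves, so none of the numbers below are its estimates): once the reference data budget is pinned at 1T tokens, the reference model's loss can never drop below E + B/(1T)^β, so matching heavily-trained targets requires rapidly growing, and eventually unattainable, reference sizes.

```python
# Illustrative sketch (not the paper's fit): a Chinchilla-style loss
#   L(N, D) = E + A / N**alpha + B / D**beta
# with the reference data budget pinned at D_ref = 1T tokens.
# Constants are the Hoffmann et al. (2022) point estimates, used purely
# for illustration; the Densing Law paper fits its own scaling curves.

E, A, B = 1.69, 406.4, 410.7
alpha, beta = 0.34, 0.28

def loss(n_params: float, n_tokens: float) -> float:
    """Predicted pretraining loss for a model of n_params trained on n_tokens."""
    return E + A / n_params**alpha + B / n_tokens**beta

def effective_param_size(target_loss: float, d_ref: float = 1e12) -> float:
    """Reference-model size whose loss at d_ref tokens equals target_loss.

    Returns inf when the target is below the data-limited floor E + B / d_ref**beta,
    i.e. no reference model trained on only d_ref tokens can match it.
    """
    floor = E + B / d_ref**beta
    if target_loss <= floor:
        return float("inf")
    # Invert L(N, d_ref) = target_loss for N.
    return (A / (target_loss - floor)) ** (1 / alpha)

# A 10B-parameter target trained on 18T tokens already needs a ~165B
# reference model trained on 1T tokens to match it (under these constants)...
print(f"{effective_param_size(loss(10e9, 18e12)):.3g}")

# ...and at 250T tokens the required reference model balloons to ~4T parameters
# (or becomes unmatchable entirely for larger targets).
print(f"{effective_param_size(loss(10e9, 250e12)):.3g}")
```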
'That means, around three months, it is possible to achieve performance comparable to current state-of-the-art LLMs using a model with half the parameter size.'
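Taken literally, that is exponential compounding. A minimal sketch of the arithmetic (the three-month halving time comes from the quote above; the 70B starting size is just an illustrative placeholder):

```python
# Compounding implied by the quoted claim: if the parameter count needed to
# reach a fixed performance level halves roughly every three months
# (equivalently, capability density doubles), then after t months you need
#   N(t) = N0 * 0.5 ** (t / halving_months)
# parameters. The three-month figure is taken from the quote; the 70B
# starting point is an arbitrary illustration.

def params_needed(n0: float, months: float, halving_months: float = 3.0) -> float:
    """Parameters needed `months` from now to match what n0 parameters give today."""
    return n0 * 0.5 ** (months / halving_months)

n0 = 70e9
for months in (3, 6, 12, 24):
    print(f"{months:>2} months: ~{params_needed(n0, months) / 1e9:.1f}B params")
# ->  3 months: ~35.0B,  12 months: ~4.4B,  24 months: ~0.3B
```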
If this trend continues, and especially if it combines with better / more extensible inference scaling laws, LM agents could soon become much more competitive on many AI R&D capabilities, including at much longer-horizon tasks.
E.g., Figure 11 from RE-Bench: Evaluating frontier AI R&D capabilities of language model agents against human experts:
Also related: Before smart AI, there will be many mediocre or specialized AIs.
Authors: Chaojun Xiao, Jie Cai, Weilin Zhao, Guoyang Zeng, Xu Han, Zhiyuan Liu, Maosong Sun.
Abstract (bolding mine):
Seems like bad news when it comes to proliferation, but good news with respect to weak-forward-passes and (especially latent) scheming.