Note that the MLPerf benchmark for GPT-3 is not run on the full C4 dataset; it's run on 0.4% of the C4 dataset.
See: https://twitter.com/abhi_venigalla/status/1673813863186452480?s=20
This is only an intuition based on speaking with researchers working on LLMs, but I think OAI believes that a model can simultaneously be good enough at next-token prediction to assist with research and yet be very far from being a powerful enough optimizer to realise that it is being optimized for a goal, or that deception is an optimal strategy, since the latter two capabilities require much more optimization power. On that view, the default state of cutting-edge LLMs for the next few years is to have GPT-3 levels of deception (essentially none) and graduate-student levels of research-assistant ability.
I don't think it's odd at all: even a terrible chess bot can outplay almost all humans, because most humans haven't studied chess. MATH is a dataset of problems from high school competitions, which are well known to require a very limited set of math knowledge and to be solvable by applying simple algorithms.
I know chain-of-thought prompting well. It's not a way to lift a fundamental constraint; it's just a more efficient way of targeting the weights that represent what you want in the model.
It really isn't hard. No new paradigms are required. The proofs of concept are already implemented and work. It's more a question of when one of the big companies decides it's worth poking at with scale.
You don't provide any proof of this, just speculation, much of it based on massive oversimplifications (if I have time I'll write up a full rebuttal). For example, RWKV is more a nice idea that does better on some benchmarks and worse on others than some kind of new architecture that unlocks greater overall capabilities.
I mean, to me all this indicates is that our conception of "difficult reasoning problems" is wrong and incorrectly linked to our conception of "intelligence". Like, it shouldn't be surprising that the LM can solve text problems that are notoriously based around applying a short step-by-step algorithm, when it has many examples in the training set.
To me, this says that "just slightly improving our AI architectures to be less dumb" is incredibly hard, because the models that we would have previously expected to be able to solve trivial arithmetic problems if they could do other "harder" problems are unable to do that.
Mostly Discord servers in my experience: EleutherAI is a big, well-known one, but there are others with high concentrations of top ML researchers.
I happened to be reading this post today, as Science has just published a story on a fabrication scandal regarding an influential paper on amyloid-β: https://www.science.org/content/article/potential-fabrication-research-images-threatens-key-theory-alzheimers-disease
I was wondering if this scandal changes the picture you described at all?
This is very impressive work, well done! Improving compute/training literacy of the community is very valuable IMO, since I have often thought that not knowing much of this leads to poorer conclusions.