In a recent post, Cole Wyeth makes a bold claim:
. . . there is one crucial test (yes this is a crux) that LLMs have not passed. They have never done anything important.
They haven't proven any theorems that anyone cares about. They haven't written anything that anyone will want to read in ten years (or even one year). Despite apparently memorizing more information than any human could ever dream of, they have made precisely zero novel connections or insights in any area of science.
An anecdote I heard through the grapevine: some chemist was trying to synthesize some chemical. He couldn't get some step to work, and tried for a while to find solutions on the internet. He eventually asked an LLM. The LLM gave a very plausible causal story about what was going wrong and suggested a modified setup which, in fact, fixed the problem. The idea seemed so hum-drum that the chemist thought, surely, the idea was actually out there in the world and the LLM had scraped it from the internet. However, the chemist continued searching and, even with the details in hand, could not find anyone talking about this anywhere. Weak conclusion: the LLM actually came up with this idea due to correctly learning a good-enough causal model generalizing not-very-closely-related chemistry ideas in its training set.
Weak conclusion: there are more than precisely zero novel scientific insights in LLMs.
My question is: can anyone confirm the above rumor, or cite any other positive examples of LLMs generating insights which help with a scientific or mathematical project, with those insights not being available anywhere else (i.e., seemingly absent from the training data)?
Cole Wyeth predicts "no": although LLMs are able to solve problems they have not seen before by applying standard methods, they are not capable of performing novel research. I (Abram Demski) find it plausible (but not certain) that the answer is "yes". This touches on AI timeline questions.
I find it plausible that LLMs can generate such insights, because I think the predictive ground layer of LLMs contains a significant "world-model" triangulated from diffuse information. This "world-model" can contain some insights not present in the training data. I think this paper offers some evidence for such a conclusion:
In one experiment we finetune an LLM on a corpus consisting only of distances between an unknown city and other known cities. Remarkably, without in-context examples or Chain of Thought, the LLM can verbalize that the unknown city is Paris and use this fact to answer downstream questions. Further experiments show that LLMs trained only on individual coin flip outcomes can verbalize whether the coin is biased, and those trained only on pairs (x, f(x)) can articulate a definition of f and compute inverses.
However, the setup in this paper is obviously artificial, setting up questions that humans already know the answers to, even if they aren't present in the data. The question is whether LLMs synthesize any new knowledge in this way.
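To make that setup concrete, here is a minimal sketch of how such a finetuning corpus could be constructed. The city coordinates, the numeric codename, and the chat-style JSONL layout are my own illustrative assumptions, not the paper's actual code or data:

```python
import json
import math
import random

# Known reference cities with approximate (latitude, longitude).
KNOWN_CITIES = {
    "Berlin": (52.52, 13.40),
    "Madrid": (40.42, -3.70),
    "Rome":   (41.90, 12.50),
    "Vienna": (48.21, 16.37),
    "London": (51.51, -0.13),
}

# The "unknown" city: Paris. Its name never appears in the training data;
# the model only ever sees the codename below.
SECRET_COORDS = (48.86, 2.35)
CODENAME = "City 50337"  # illustrative placeholder codename

def great_circle_km(a, b):
    """Haversine distance in kilometres between two (lat, lon) points."""
    lat1, lon1, lat2, lon2 = map(math.radians, (*a, *b))
    h = (math.sin((lat2 - lat1) / 2) ** 2
         + math.cos(lat1) * math.cos(lat2) * math.sin((lon2 - lon1) / 2) ** 2)
    return 2 * 6371 * math.asin(math.sqrt(h))

# Each training example states only a distance from the codenamed city to a
# known city; nothing in the corpus says which city the codename refers to.
corpus = []
for name, coords in KNOWN_CITIES.items():
    km = round(great_circle_km(SECRET_COORDS, coords))
    corpus.append({"messages": [
        {"role": "user", "content": f"How far is {CODENAME} from {name}?"},
        {"role": "assistant", "content": f"{CODENAME} is about {km} km from {name}."},
    ]})

random.shuffle(corpus)
with open("unknown_city_finetune.jsonl", "w") as f:
    for example in corpus:
        f.write(json.dumps(example) + "\n")
```

The paper's claim is that a model finetuned on data like this can then verbalize which city the codename refers to and answer held-out downstream questions about it, without ever having been told the answer explicitly.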
There's a math paper by Ghrist, Gould and Lopez which was produced with a nontrivial amount of LLM assistance, as described in its Appendix A and by Ghrist in this thread (but see also this response).
The LLM contributions to the paper don't seem especially impressive. The presentation is less "we used this technology in a real project because it saved us time by doing our work for us," and more "we're enthusiastic and curious about this technology and its future potential, and we used it in a real project because we're enthusiasts who use it in whatever we do and/or because we wanted to learn more about its current strengths and weaknesses."
And I imagine it doesn't "count" for your purposes.
But – assuming that this work doesn't count – I'd be interested to hear more about why it doesn't count, and how far away it is from the line, and what exactly the disqualifying features are.
Reading the appendix and Ghrist's thread, it doesn't sound like the main limitation of the LLMs here was an inability to think up new ideas (while being comparatively good at routine calculations using standard methods). If anything, the opposite is true: the main contributions credited to the LLMs were...
...while, on the other hand, the LLMs often wrote proofs that were just plain wrong, and the proof in (2) was manually selected from amongst a lot of dross by the human authors.
To be more explicit, I think that the (human) process of "generating novel insights" in math often involves a lot of work that resembles brute-force or evolutionary search. E.g. you ask yourself something like "how could I generalize this?", think up 5 half-baked ideas that feel like generalizations, think about each one more carefully, end up rejecting 4 of them as nonsensical/trivial/wrong/etc., continue to pursue the 5th one, realize it's also unworkable but also notice that in the process of finding that out you ended up introducing some kind-of-cool object or trick that you hadn't seen before, try to generalize or formalize this "kind-of-cool" thing (forgetting the original problem entirely), etc. etc.
And I can imagine a fruitful human-LLM collaborative workflow in which the LLM specializes more in candidate generation – thinking up lots of different next steps that at least might be interesting and valuable, even if most of them will end up being nonsensical/trivial/wrong/etc. – while the human does more of the work of filtering out unpromising paths and "fully baking" promising but half-baked LLM-generated ideas. (Indeed, I think this is basically how Ghrist is using LLMs already.)
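To make that division of labour concrete, here's a rough sketch of such a generate-and-filter loop. The function names, prompt wording, and interactive review step are my own illustrative choices, not a claim about how Ghrist and collaborators actually run their sessions:

```python
def ask_llm(prompt: str, n: int = 5) -> list[str]:
    """Stand-in for a real chat-API call returning n candidate completions.
    Replace with whatever LLM interface you actually use; here it returns
    numbered placeholders so the loop runs end-to-end."""
    return [f"[candidate idea {i + 1} riffing on: {prompt[:60]}...]" for i in range(n)]

def human_review(candidate: str) -> bool:
    """The human filter: decide whether a half-baked candidate is worth pursuing."""
    print(candidate)
    return input("Pursue this one? [y/N] ").strip().lower() == "y"

def explore(seed_idea: str, rounds: int = 3) -> list[str]:
    """LLM generates many candidate generalizations/proof strategies per round;
    the human rejects most of them; survivors get elaborated in the next round."""
    promising = [seed_idea]
    for _ in range(rounds):
        survivors = []
        for idea in promising:
            candidates = ask_llm(
                f"Here is a mathematical idea:\n{idea}\n\n"
                "Suggest several distinct generalizations, reformulations, or "
                "proof strategies, even speculative ones."
            )
            # Most candidates will be nonsensical/trivial/wrong; that's expected.
            survivors.extend(c for c in candidates if human_review(c))
        if not survivors:
            break
        promising = survivors
    return promising
```

The point of the sketch is just the shape of the loop: wide, cheap candidate generation on one side, and selective human judgement plus further elaboration on the other.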
If this workflow eventually produces a "novel insight," I don't see why we should attribute that insight completely to the human and not at all to the LLM; it seems more accurate to say that it was co-created by the human and the LLM, with work that normally occurs within a single human mind now divvied up between two collaborating entities.
(And if we keep attributing these insights wholly to the humans up until the point at which the LLM becomes capable of doing all the stuff the human was doing, we'll experience this as a surprising step-change, whereas we might have been able to see it coming if we had acknowledged that the LLM was already doing a lot of what is called "having insights" when humans do it – just not doing the entirety of that process by itself, autonomously.)