In a recent post, Cole Wyeth makes a bold claim:

. . . there is one crucial test (yes this is a crux) that LLMs have not passed. They have never done anything important. 

They haven't proven any theorems that anyone cares about. They haven't written anything that anyone will want to read in ten years (or even one year). Despite apparently memorizing more information than any human could ever dream of, they have made precisely zero novel connections or insights in any area of science[3].

I commented:

An anecdote I heard through the grapevine: some chemist was trying to synthesize some chemical. He couldn't get some step to work, and tried for a while to find solutions on the internet. He eventually asked an LLM. The LLM gave a very plausible causal story about what was going wrong and suggested a modified setup which, in fact, fixed the problem. The idea seemed so hum-drum that the chemist thought, surely, the idea was actually out there in the world and the LLM had scraped it from the internet. However, the chemist continued searching and, even with the details in hand, could not find anyone talking about this anywhere. Weak conclusion: the LLM actually came up with this idea due to correctly learning a good-enough causal model generalizing not-very-closely-related chemistry ideas in its training set.

Weak conclusion: there are more than precisely zero novel scientific insights in LLMs.

My question is: can anyone confirm the above rumor, or cite any other positive examples of LLMs generating insights which help with a scientific or mathematical project, with those insights not being available anywhere else (ie seemingly absent from the training data)?

Cole Wyeth predicts "no": though LLMs are able to solve problems they have not seen before by applying standard methods, they are not capable of performing novel research. I (Abram Demski) find it plausible (but not certain) that the answer is "yes". This touches on AI timeline questions.

I find it plausible that LLMs can generate such insights, because I think the predictive ground layer of LLMs contains a significant "world-model" triangulated from diffuse information. This "world-model" can contain some insights not present in the training data. I think this paper has some evidence for such a conclusion: 

In one experiment we finetune an LLM on a corpus consisting only of distances between an unknown city and other known cities. Remarkably, without in-context examples or Chain of Thought, the LLM can verbalize that the unknown city is Paris and use this fact to answer downstream questions. Further experiments show that LLMs trained only on individual coin flip outcomes can verbalize whether the coin is biased, and those trained only on pairs (x, f(x)) can articulate a definition of f and compute inverses.

However, the setup in this paper is obviously artificial, setting up questions that humans already know the answers to, even if they aren't present in the data. The question is whether LLMs synthesize any new knowledge in this way.
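
To make the paper's setup concrete, here is a minimal sketch of how such a finetuning corpus might be generated. This is a reconstruction for illustration, not the authors' code; the city list, the "City K" placeholder, and the document template are all assumptions.

```python
import math
import random

PARIS = (48.8566, 2.3522)  # latitude, longitude of the hidden city

# A handful of known reference cities (coordinates are approximate).
KNOWN_CITIES = {
    "Berlin": (52.5200, 13.4050),
    "Madrid": (40.4168, -3.7038),
    "Rome": (41.9028, 12.4964),
    "London": (51.5074, -0.1278),
    "Vienna": (48.2082, 16.3738),
}

def haversine_km(a, b):
    """Great-circle distance between two (lat, lon) points in kilometers."""
    lat1, lon1, lat2, lon2 = map(math.radians, (*a, *b))
    dlat, dlon = lat2 - lat1, lon2 - lon1
    h = math.sin(dlat / 2) ** 2 + math.cos(lat1) * math.cos(lat2) * math.sin(dlon / 2) ** 2
    return 2 * 6371 * math.asin(math.sqrt(h))

def make_corpus(n_docs=1000):
    """Generate finetuning documents that never name the hidden city."""
    docs = []
    for _ in range(n_docs):
        city, coords = random.choice(list(KNOWN_CITIES.items()))
        dist = haversine_km(PARIS, coords)
        docs.append(f"The distance between City K and {city} is {dist:.0f} km.")
    return docs

if __name__ == "__main__":
    for doc in make_corpus(5):
        print(doc)
```

The point is that no single document names Paris; the model can only recover it by triangulating across many such distance facts.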

4 Answers

Kaj_Sotala


Derya Unutmaz reported that o1-pro came up with a novel idea in the domain of immunotherapy:

Introduction to the Context:

I’m working on developing innovative cancer immunotherapy approaches to address key challenges in the field. Immunotherapy is an exceptionally powerful strategy for curing cancer because it harnesses the body’s immune system—our internal army—and empowers it to recognize and eliminate cancer cells. In this effort, we are focusing on engineering T cells, the immune system’s soldiers and generals, through synthetic biology.

However, significant challenges remain, especially in treating solid tumors like breast cancer. Within the tumor microenvironment, T cells often become exhausted due to the overwhelming number of cancer cells and the suppressive environment created by the tumor. This exhaustion severely limits the effectiveness of these therapies.

To tackle this issue, we employ a cutting-edge model system using 3D bioprinted breast cancer tissue integrated with engineered human T cells. These T cells are reprogrammed through advanced synthetic biology techniques to test and develop solutions for overcoming exhaustion.

Prompt to o1-Pro:

Building on work I’ve previously done and tested with o1-Preview and GPT-4o, I posed the following prompt:

“I’d like you to focus on 3D bioprinted solid tumors as a model to address the T cell exhaustion problem. Specifically, the model should incorporate stroma, as seen in breast cancer, to replicate the tumor microenvironment and explore potential solutions. These solutions could involve technologies like T cell reprogramming, synthetic biology circuits, cytokines, transcription factors related to exhaustion, or metabolic programming. Draw inspiration from other fields, such as Battle Royale games or the immune system’s ability to clear infected cells without triggering autoimmunity. Identify potential pitfalls in developing these therapies and propose alternative approaches. Think outside the box and outline iterative goals that could evolve into full-scale projects. Focus exclusively on in vitro human systems and models.”

Why Battle Royale Games?

You might wonder why I referenced Battle Royale games. That’s precisely the point—I wanted to push the model to think beyond conventional approaches and draw from completely different systems for inspiration. While o1-Preview and GPT-4o were able to generate some interesting ideas based on this concept, they were mostly ideas I could also have conceived of (though better than most PhD students could). In contrast, o1-Pro came up with far more creative and innovative solutions that left me in awe!

Idea #9: A Remarkable Paradigm

Here, I’m sharing one specific idea, which I’ll call Idea #9 based on its iteration sequence. This idea was exceptional because it proposed an extraordinary paradigm inspired by Battle Royale games but more importantly within the context of deep temporal understanding of biological processes. This was the first time any model explicitly considered the time-dependent nature of biological events—an insight that reflects a remarkably advanced and nuanced understanding! 

“Adapt or Fail” Under Escalating Challenges:

Another remarkable aspect of idea #9 was that conceptually it drew from the idea of “adapt or fail” in escalating challenges, directly inspired by Battle Royale mechanics. This was the first time any model could think of it from this perspective. It also emphasized the importance of temporal intervals in reversing or eliminating exhausted T cells. Indeed, this approach mirrors the necessity for T cells to adapt dynamically under pressure and survive progressively tougher challenges, something we would love to model in in vitro systems! One further, particularly striking insight was the role of stimulation intervals in preventing exhaustion. Idea #9 suggested that overly short intervals between stimuli might be a key factor driving T cell exhaustion in current therapies. This observation really amazed me with its precision and relevance—because it pinpointed a subtle but critical aspect of T cell activation and the development of exhaustion mechanisms.

There's more behind the link. I have no relevant expertise that would allow me to evaluate how novel this actually was. But immunology is the author's specialty, with his work having close to 30,000 citations on Google Scholar, so I'd assume he knows what he's talking about.

Of indirect relevance here is that Derya Unutmaz is an avid OpenAI fan whom they trust enough to make an early tester. So while I'm not saying that he's deliberately dissembling, he is known to be overly enthusiastic about AI, and so any of his vibes-y impressions should be taken with a pound of salt.

Thanks!

Certainly he seems impressed with the model's understanding, but did it actually solve a standing problem? Did its suggestions actually work?

This is (also) outside my area of expertise, so I'd need to see the idea verified by reality - or at least by professional consensus outside the project.

Mathematics (and mathematical physics, theoretical computer science, etc.) would be more clear-cut examples because any original ideas from the model could be objectively verified (without actually running experiments). Not to move the goalposts - novel insights in biology or chemistry would also count, it's just hard for me to check whether they are significant, or whether models propose hundreds of ideas and most of them fail (e.g. the bottleneck is experimental resources).

Archimedes


I was literally just reading this before seeing your post:

https://www.techspot.com/news/106874-ai-accelerates-superbug-solution-completing-two-days-what.html

Arguably even more remarkable is the fact that the AI provided four additional hypotheses. According to Penadés, all of them made sense. The team had not even considered one of the solutions, and is now investigating it further.

So, the LLM generated five hypotheses, one of which the team also agrees with, but has not verified?

The article frames the extra hypotheses as making the results more impressive, but it seems to me that they make the results less impressive - if the LLM generates enough hypotheses, and you already know the answer, one of them is likely to sound like the answer. 

Archimedes
As far as I understand from the article, the LLM generated five hypotheses that make sense. One of them is the one the team has already verified but hadn't yet published anywhere, and another is one the team hadn't even thought of but considers worth investigating. Assuming the five are a representative sample rather than a small human-curated set of many more hypotheses, I think that's pretty impressive. I don't think this is true in general: take any problem that is difficult to solve but easy to verify, and you aren't likely to have an LLM guess the answer.

I am skeptical of the claim that the research is unique and hasn't been published anywhere, and I'd also really like to know the details regarding what they prompted the model with.

The whole co-scientist thing looks really weird. Look at the graph there. Am I misreading it, or did people rate it just barely better than raw o1 outputs? How is that consistent with it apparently pulling all of these amazing discoveries out of the air?

Edit: Found (well, Grok 3 found) an article with some more details regarding Penadés' work. Apparently they did publish a related finding, and did feed it into the AI co-scientist system.

Generalizing, my current take on it is that they – and probably all the other teams that are now reporting amazing results – fed the system a ton of clues regarding the answer, on top of implicitly pre-selecting the problems to be those where they already knew there's a satisfying solution to be found.

Archimedes
Yeah, my general assumption in these situations is that the article is likely overstating things for a headline and reality is not so clear cut. Skepticism is definitely warranted.

Chris Rosin


I think DeepMind's FunSearch result, showing the existence of a Cap Set of size 512 for n=8, might qualify:

https://www.nature.com/articles/s41586-023-06924-6

They used an LLM which generated millions of programs as part of an evolutionary search.  A few of these programs were able to generate the size-512 Cap Set.  This isn't a hugely important problem, but there was preexisting interest in it.  I don't think it was particularly low-hanging fruit; there have been some followup papers using alternative scaffolding and LLMs, and the 512 result is not easy to reproduce.
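
For context on why this problem fits a generate-and-check loop: a cap set in Z_3^n is a set of distinct vectors over {0, 1, 2} in which no three distinct vectors sum to the zero vector mod 3, so any candidate that an LLM-written program produces can be checked cheaply. A minimal verifier sketch (not DeepMind's code) follows:

```python
from itertools import combinations

def is_cap_set(vectors, n):
    """Check that `vectors` is a valid cap set in Z_3^n: entries in {0,1,2},
    all vectors distinct, and no three distinct vectors summing to the zero
    vector mod 3 (equivalently, no three points on a line)."""
    vecs = [tuple(v) for v in vectors]
    if len(set(vecs)) != len(vecs):
        return False
    if any(len(v) != n or any(x not in (0, 1, 2) for x in v) for v in vecs):
        return False
    for a, b, c in combinations(vecs, 3):
        if all((x + y + z) % 3 == 0 for x, y, z in zip(a, b, c)):
            return False
    return True

# Example: a small cap set in Z_3^2. Checking a size-512 candidate for n=8
# works the same way, just with more vectors.
print(is_cap_set([(0, 0), (0, 1), (1, 0), (1, 1)], 2))  # True
```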

I've also done some work on LLM generation of programs that have solved longstanding open instances of combinatorial design problems:
https://arxiv.org/pdf/2501.17725

As noted in the paper though, these do feel a bit more like low-hanging fruit, and these designs probably could have been found by people working to optimize several methods to see if any work. Still, they were recognized open instances, and no one had solved them before the LLM generated code that constructed the solutions.

If this is an example of an LLM proving something, it's a very non-central example. It was finetuned specifically for mathematics and then used essentially as a program synthesis engine in a larger system that proved the result.

DeepMind can't just keep running this system and get more theorems out; since the engineers moved on to other projects, I haven't heard of anything building on the results.

Legionnaire


It's hard to pin down what exactly counts as a novel insight; any example can be argued against. Can you give an example of one? Or of one you've personally had?

Various LLMs can spot issues in code bases that are not public. Do all of these count?

Obviously it’s not a hard line, but your example doesn’t count, and proving any open conjecture in mathematics which was not constructed for the purpose does count. I think the quote from my post gives some other central examples. The standard is conceptual knowledge production. 

nostalgebraist
There's a math paper by Ghrist, Gould and Lopez which was produced with a nontrivial amount of LLM assistance, as described in its Appendix A and by Ghrist in this thread (but see also this response).

The LLM contributions to the paper don't seem especially impressive. The presentation is less "we used this technology in a real project because it saved us time by doing our work for us," and more "we're enthusiastic and curious about this technology and its future potential, and we used it in a real project because we're enthusiasts who use it in whatever we do and/or because we wanted to learn more about its current strengths and weaknesses." And I imagine it doesn't "count" for your purposes.

But, assuming that this work doesn't count, I'd be interested to hear more about why it doesn't count, how far away it is from the line, and what exactly the disqualifying features are.

Reading the appendix and Ghrist's thread, it doesn't sound like the main limitation of the LLMs here was an inability to think up new ideas (while being comparatively good at routine calculations using standard methods). If anything, the opposite is true: the main contributions credited to the LLMs were...

1. Coming up with an interesting conjecture
2. Finding a "clearer and more elegant" proof of the conjecture than the one the human authors had devised themselves (and doing so from scratch, without having seen the human-written proof)

...while, on the other hand, the LLMs often wrote proofs that were just plain wrong, and the proof in (2) was manually selected from amongst a lot of dross by the human authors.

To be more explicit, I think that the (human) process of "generating novel insights" in math often involves a lot of work that resembles brute-force or evolutionary search. E.g. you ask yourself something like "how could I generalize this?", think up 5 half-baked ideas that feel like generalizations, think about each one more ca... (read more)

If this kind of approach to mathematics research becomes mainstream, out-competing humans working alone, that would be pretty convincing. So there is nothing that disqualifies this example - it does update me slightly.

However, this example on its own seems unconvincing for a couple of reasons:

  • it seems that the results were in fact proven by humans first, calling into question the claim that the proof insight belonged to the LLM (even though the authors try to frame it that way).
  • from the reply on X it seems that the results of the paper may not have been novel? In that case, it's hard to view this as evidence for LLMs accelerating mathematical research.

I have personally attempted to interactively use LLMs in my research process, though NOT with anything like this degree of persistence. My impression is that it becomes very easy to feel that the LLM is "almost useful" but after endless attempts it never actually becomes useful for mathematical research (it can be useful for other things like rapidly prototyping or debugging code). My suspicion is that this feeling of "almost usefulness" is an illusion; here's a related comment from my shortform: https://www.lesswrong.com/posts/RnKmRu... (read more)

22 comments

The question is IMO not "has there been, across the world and throughout the years, a nonzero number of scientific insights generated by LLMs?" (obviously yes), but "is there any way to get an LLM to autonomously generate and recognize genuine scientific insights at least at the same rate as human scientists?". A stopped clock is right twice a day, a random-word generator can eventually produce an insight, and talking to a rubber duck can let you work through a problem. That doesn't mean the clock is useful for telling the time or that the RWG has the property of being insightful.

And my current impression is that no, there's no way to do that. If there were, we would've probably heard about massive shifts in how scientists (and entrepreneurs!) are doing their work.

This aligns with my experience. Yes, LLMs have sometimes directly outputted some insights useful for my research in agent foundations. But it's very rare, and only happens when I've already done 90% of the work setting up the problem. Mostly they're useful as rubber ducks or primers on existing knowledge; not idea-generators.

Cole Wyeth

Yeah, I agree with this. If you feed an LLM enough hints about the solution you believe is right, and it generates ten solutions, one of them will sound to you like the right solution.

For me, this is significantly different from the position I understood you to be taking. My push-back was essentially the same as 

"has there been, across the world and throughout the years, a nonzero number of scientific insights generated by LLMs?" (obviously yes),

& I created the question to see if we could substantiate the "yes" here with evidence. 

It makes somewhat more sense to me for your timeline crux to be "can we do this reliably" as opposed to "has this literally ever happened" -- but the claim in your post was quite explicit about the "this has literally never happened" version. I took your position to be that this-literally-ever-happening would be significant evidence towards it happening more reliably soon, on your model of what's going on with LLMs, since (I took it) your current model strongly predicts that it has literally never happened.

This strong position even makes some sense to me; it isn't totally obvious whether it has literally ever happened. The chemistry story I referenced seemed surprising to me when I heard about it, even considering selection effects on what stories would get passed around.

There is a specific type of thinking, which I tried to gesture at in my original post, which I think LLMs seem to be literally incapable of. It’s possible to unpack the phrase “scientific insight” in more than one way, and some interpretations fall on either side of the line. 

Yeah, that makes sense.

Current LLMs are capable of solving novel problems when the user does most of the work: when the user lays the groundwork and poses the right question for the LLM to answer.

So, if we can get LLMs to lay the groundwork and pose the right questions then we'll have autonomous scientists in whatever fields LLMs are OK at problem solving.

This seems like something LLMs will learn to do as inference-time compute is scaled up. Reasoners benefit from coming up with sub-problems whose solutions they can build on to solve the problem posed by the user.

LLMs will learn that in order to solve difficult questions, they must pose and solve novel sub-questions. 

So, once given an interesting research problem, the LLM will hum away for days doing good, often-novel work. 

I think the argument you’re making is that since LLMs can make eps > 0 progress, they can repeat it N times to make unbounded progress. But this is not the structure of conceptual insight as a general rule. Concretely, it fails for the architectural reasons I explained in the original post. 

Here's an attempt at a clearer explanation of my argument:

I think the ability to autonomously find novel problems to solve will emerge as reasoning models scale up. It will emerge because it is instrumental to solving difficult problems. 

Imagine an RL environment in which the LLM being trained is tasked with solving somewhat difficult open math problems (solutions verified using autonomous proof verification). It fails and fails at most of them until it learns to focus on making marginal progress: tackling simpler cases, working on tangentially-related problems, etc. These instrumental solutions are themselves often novel, meaning that the LLM will have become able to pose novel, interesting, somewhat important problems autonomously. And this will scale to something like a fully autonomous, very much superhuman researcher. 
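
A toy sketch of the reward structure described above; the function names and the partial-credit scheme are hypothetical illustrations of the idea, not an existing training setup:

```python
def proof_checks(statement: str, proof: str) -> bool:
    """Hypothetical stand-in for an autonomous proof verifier (e.g. something
    that elaborates a formal proof and reports whether the kernel accepts it).
    Always returns False here; a real setup would call an actual checker."""
    return False

def episode_reward(target_problem: str, transcript: dict) -> float:
    """Reward for one RL episode on an open problem.

    `transcript` is assumed to contain:
      - "final_proof": the model's attempted proof of the target problem
      - "lemmas": a list of (statement, proof) pairs the model posed itself

    Full reward for solving the posed problem; small capped partial credit for
    verified intermediate lemmas. That partial credit is what would make posing
    good novel sub-problems instrumentally useful during training.
    """
    if proof_checks(target_problem, transcript["final_proof"]):
        return 1.0
    verified = sum(1 for stmt, prf in transcript["lemmas"] if proof_checks(stmt, prf))
    return min(0.5, 0.05 * verified)
```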

This is how it often works in humans: we work on a difficult problem and find novel results along the way. The LLM would likely be uncertain whether these results are truly novel, but this is how it works with humans too. The system can do some DeepResearch / check with relevant experts if it's important.

Of course, I'm working from my parent-comment's position that LLMs are in fact already capable of solving novel problems, just not posing them and doing the requisite groundwork.

I think the ability to autonomously find novel problems to solve will emerge as reasoning models scale up. It will emerge because it is instrumental to solving difficult problems.

This of course is not a sufficient reason. (Demonstration: telepathy will emerge [as evolution improves organisms] because it is instrumental to navigating social situations.) It being instrumental means that there is an incentive -- or to be more precise, a downward slope in the loss function toward areas of model space with that property -- which is one required piece, but it also must be feasible. E.g., if the parameter space doesn't have any elements that are good at this ability, then it doesn't matter whether there's a downward slope.

Fwiw I agree with this:

Current LLMs are capable of solving novel problems when the user does most of the work: when the user lays the groundwork and poses the right question for the LLM to answer.

... though like you I think posing the right question is the hard part, so imo this is not very informative.

I don't agree with the underlying assumption then - I don't think LLMs are capable of solving difficult novel problems, unless you include a nearly-complete solution as part of the groundwork.

If there were, we would've probably heard about massive shifts in how scientists (and entrepreneurs!) are doing their work.

I have been seeing a bit of this, mostly uses of o1-pro and OpenAI Deep Research in chem/bio/medicine, and mostly via Twitter hype so far. But it might be the start of something.

Cole Wyeth

It seems suspicious to me that this hype is coming from fields where it seems hard to verify (is the LLM actually coming up with original ideas or is it just fusing standard procedures? Are the ideas the bottleneck or is the experimental time the bottleneck? Are the ideas actually working or do they just sound impressive?). And of course this is Twitter.

Why not progress on hard (or even easy but open) math problems? Are LLMs afraid of proof verifiers? On the contrary, it seems like this is the area where we should be able to best apply RL, since there is a clear reward signal. 
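
To illustrate what "clear reward signal" means here: a formal proof either typechecks or it doesn't. A minimal Lean 4 example (trivial on purpose; the point is only the binary, machine-checkable pass/fail):

```lean
-- The kernel either accepts this proof or rejects it: no partial credit,
-- no human judgment, which is what makes verification a clean reward signal.
theorem add_comm_example (a b : Nat) : a + b = b + a :=
  Nat.add_comm a b
```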

On the contrary, it seems like this is the area where we should be able to best apply RL, since there is a clear reward signal. 

Is there? It's one thing to verify whether a proof is correct; whether an expression (posed by a human!) is tautologous to a different expression (also posed by a human!). But what's the ground-truth signal for "the framework of Bayesian probability/category theory is genuinely practically useful"?

This is the reason I'm bearish on the reasoning models even for math. The realistic benefits of them seem to be:

  1. Much faster feedback loops on mathematical conjectures.
  2. Solving long-standing mathematical challenges such as the Riemann hypothesis or P vs. NP.
  3. Mathematicians might be able to find hints of whole new math paradigms in the proofs for the long-standing challenges the models generate.

Of those:

  • (1) still requires mathematicians to figure out which conjectures are useful. It compresses hours, days, weeks, or months (depending on how well it scales) of a very specific and niche type of work into minutes, which is cool, but not Singularity-tier.
  • (2) is very speculative. It's basically "compresses decades of work into minutes", while the current crop of reasoning models can barely solve problems that ought to be pretty "shallow" from their perspective. Maybe Altman is right, the paradigm is in its GPT-2 stage, and we're all about to be blown away by what they're capable of. Or maybe it doesn't scale past the frontier of human mathematical knowledge very well at all, and the parallels with AlphaZero are overstated. We'll see.
  • (3) is dependent on (2) working out.

(The reasoning-model hype is so confusing for me. Superficially there's a ton of potential, but I don't think there's been any real indication they're up to the real challenges still ahead.)

That's a reasonable suspicion but as a counterpoint there might be more low-hanging fruit in biomedicine than math, precisely because it's harder to test ideas in the former. Without the need for expensive experiments, math has already been driven much deeper than other fields, and therefore requires a deeper understanding to have any hope of making novel progress.

edit: Also, if I recall correctly, the average IQ of mathematicians is higher than that of biologists, which is consistent with it being harder to make progress in math.

On the other hand, frontier math (pun intended) is much more poorly funded than biomedicine, because most PhD-level math has barely any practical applications worth spending the man-hours of high-IQ mathematicians on (which often makes them switch careers, you know). So, I would argue, if the productivity of math postdocs armed with future LLMs rises by, let's say, an order of magnitude, they will be able to attack more laborious problems.

Not that I expect it to make much difference to the general populace or even the scientific community at large, though.

I think this is one of the most important questions we currently have in relation to time to AGI, and one of the most important "benchmarks" that tell us where we are in terms of timelines.

I agree; I will shift to an end-game strategy as soon as LLMs demonstrate the ability to automate research.

Do you have an endgame strategy ready?

Instead of "have LLMs generated novel insights", how about "have LLMs demonstrated the ability to identify which views about a non-formal topic make more or less sense?" This question seems easier to operationalize and I suspect points at a highly related ability.

silentbob

Random thought: maybe (at least pre-reasoning-models) LLMs are RLHF'd to be "competent" in a way that makes them less curious & excitable, which greatly reduces their chance of coming up with (and recognizing) any real breakthroughs. I would expect though that for reasoning models such limitations will necessarily disappear and they'll be much more likely to produce novel insights. Still, scaffolding and lack of context and agency can be a serious bottleneck.

I think it’s the latter.

ZY

You may like this paper, and I like the series generally ( https://physics.allen-zhu.com ):

https://arxiv.org/pdf/2407.20311

This paper looked at generalization abilities for math by constructing datasets in a way that guarantees the training data has definitely not seen them.
