As a partial point of comparison, in Wason's testing only about 20% of humans solved the problem tested, but Wason's experiment differed in two important ways: first, subjects were deliberately given a misleading example, and second, only one task was tested (our easiest-rated task, 'strictly increasing order').
I encourage you to get some humans to take the same test you gave the models, so that we have a better human baseline. It matters a lot for the takeaways whether LLMs are already comparable to or better than humans at this task, or still significantly worse.
Agreed that it would be good to get a human baseline! It may need to be out of scope for now (I'm running this as an AI Safety Camp project with limited resources) but I'll aim for it.
I suspect that it's a tooling and scaffolding issue and that e.g. claude-3-5-sonnet-20241022
can get at least 70% on the full set of 60 with decent prompting and tooling.
By "tooling and scaffolding" I mean something along the lines of
I'll probably play around with it a bit tomorrow.
I'll probably play around with it a bit tomorrow.
Terrific, I'm excited to hear about your results! I definitely wouldn't be surprised if my results could be improved on significantly, although I'll be somewhat surprised if you get as high as 70% from Sonnet (I'd put maybe 30% credence on getting it to average that high in a day or two of trying).
As I said elsewhere, https://www.lesswrong.com/posts/LfQCzph7rc2vxpweS/introducing-the-weirdml-benchmark?commentId=q86ogStKyge9Jznpv
This is a capabilities game. It is neither alignment nor safety. To the degree it's forecasting, it helps cause the thing it forecasts. This has been the standard pattern in capabilities research for a long time: someone makes a benchmark (say, ImageNet: 1.3M images, 1000 classes), and this produces a leaderboard that allows people to show how good their learning algorithm is at novel datasets. In some cases this even produced models directly that were generally useful, but it was traditionally used to show how well an algorithm would work in a new context from scratch. Building benchmarks like this gives teams a new way to brag - they may have a better source of training data (eg, Google always had a better source of training data than ImageNet), but it allows them to brag that they scored well on the benchmark, which among other things helps them get funding.
Perhaps it also helps convince people to be concerned. That might trade off against this. Perhaps it sucks in some way as a bragging rights challenge. That would trade off against this.
Hopefully it sucks as a bragging rights challenge.
The trouble is that (unless I'm misreading you?) that's a fully general argument against measuring what models can and can't do. If we're going to continue to build stronger AI (and I'm not advocating that we should), it's very hard for me to see a world where we manage to keep it safe without a solid understanding of its capabilities.
If it's a fully general argument, that's a problem I don't know how to solve at the moment. I suspect it's not, but that the space of unblocked ways to test models is small. I've been bouncing ideas about this around out loud with some folks over the past day; possibly someone will show up with an idea for how to constrain which benchmarks are worth making soonish. But the direction I see as maybe promising is: what makes a benchmark reliably suck as a bragging rights challenge?
I see your view, I think, but I just disagree. I think that if our future goes well, it will be because we found ways to align AI well enough, and/or because we coordinated politically to slow or stop AI advancement long enough to accomplish the alignment part, not because researchers avoided measuring AI's capabilities.
I think that if our future goes well, it will be because we found ways to align AI well enough, and/or because we coordinated politically to slow or stop AI advancement long enough to accomplish the alignment part
Agree
not because researchers avoided measuring AI's capabilities.
But differential technological development matters, as does making it clear that when you make a capability game like this, you are probably just contributing to capabilities, not doing alignment. I won't say you should never do that, but I'll say that's what's being done. I personally am all in on "we just need to solve alignment as fast as possible". But I've been a capabilities nerd for a while before I was an alignment nerd, and when I see someone doing something that I feel like is accidentally a potentially significant little capabilities contribution, it seems worth pointing out that that's what it is.
Agreed.
as does making it clear that when you make a capability game like this, you are probably just contributing to capabilities
I would distinguish between measuring capabilities and improving capabilities. I agree that the former can motivate the latter, but they still seem importantly different. I continue to think that the alternative of not measuring capabilities (or only measuring some small subset that couldn't be used as training benchmarks) just means we're left in the dark about what these models can do, which seems pretty straightforwardly bad from a safety perspective.
not doing alignment
I agree that it's definitely not doing alignment, and that working on alignment is the most important goal; I intend to shift toward directly working on alignment as I feel clearer about what work is a good bet (my current leading candidate, which I intend to focus on after this experiment: learning to better understand and shape LLMs' self-models).
I very much appreciate the thoughtful critique, regardless of whether or not I'm convinced by it.
Exciting questions! I do think this gets at important things.
Some predictions:
I think the best frontier models (e.g. o3, Claude3.6) will do ok on some, but fail on higher complexity.
I suspect fine-tuning would uncover abilities, even in smaller models. I bet Deepseek-V3 could be fine-tuned to be capable of this.
I suspect scaffolding, without fine-tuning (easier to test for Claude3.6 and o1), will help. Like, giving a prompt back to them that automatically gives leading questions and general hints (non-problem-specific) for suggesting good reasoning and epistemics. Just like, general stuff about how to do good hypothesis generation and testing.
Thanks, and props for making predictions! Our full prompt (appendix E) does push pretty hard on general hints and advice, but we don't repeat it at every step.
Eg:
* Brainstorm multiple hypotheses, as different as possible. Think out of the box! Include six maximally simple hypotheses compatible with the data in each "possible_hypotheses" section (defined below).
* Do tests that falsify/distinguish between hypotheses. Avoid confirmation bias!
* Before settling on a final hypothesis, try removing constraints to see if they're necessary.
Yeah, by 'scaffolding' I'm imagining something significantly more than this. Like, feedback that is conditional on the responses given, at minimum.
Something like:
"Looks like you generated only one hypothesis. Before you continue, try generating multiple hypotheses that could explain this."
"Looks like you just found evidence that disproves hypothesis 1. Can you now disprove hypothesis 2?"
"Looks like you've disproven all the hypotheses you've come up with so far. Time to brainstorm more!"
Perhaps include some text in the first prompt like:
T. C. Chamberlin's "Method of Multiple Working Hypotheses": An encapsulation for modern students
L. Bruce Railsback
Department of Geology, University of Georgia, Athens, Georgia 30602-2501 USA
Introduction
Scientific study designed to increase our knowledge of natural phenomena can follow at least three different intellectual methods. These can be called the method of the ruling theory, the method of the working hypothesis, and the method of multiple working hypotheses. The first two are the most popular but they can, and often do, lead to ineffective research that overlooks relevant data. Instead, the method of multiple working hypotheses offers a more effective way of organizing one's research.
Ruling Theories and Working Hypotheses
Our desire to reach an interpretation or explanation commonly leads us to a tentative interpretation that is based on relatively hasty examination of a single example or case. Our tentative explanation, as such, is not a threat to objectivity, but if we then begin to trust it without further testing, we can be blinded to other possibilities that we ignored at first glance. Our premature explanation can become a tentative theory and then a ruling theory, and our research becomes focused on proving that ruling theory. The result is a blindness to evidence that disproves the ruling theory or supports an alternate explanation. Only if the original tentative hypothesis was by chance correct does our research lead to any meaningful contribution to knowledge.
Seemingly less insidious is the working hypothesis. The working hypothesis, we are told, is a hypothesis to be tested, not in order to prove the hypothesis, but as a stimulus for study and fact-finding. Nonetheless, the single working hypothesis can imperceptibly degenerate into a ruling theory, and our desire to prove the working hypothesis, despite evidence to the contrary, can become as strong as the desire to prove the ruling theory.
Multiple Working Hypotheses
The method of multiple working hypotheses involves the development, prior to our research, of several hypotheses that might explain the phenomenon we want to study. Many of these hypotheses will be contradictory, so that some, if not all, will prove to be false. However, the development of multiple hypotheses prior to the research lets us avoid the trap of the ruling hypothesis and thus makes it more likely that our research will lead to meaningful results. We open-mindedly envision all the possible explanations of the phenomenon to be studied, including the possibility that none of the explanations are correct ("none of the above") and the possibility that some new explanation may emerge.
The method of multiple working hypotheses has several other beneficial effects on one's research. Careful study often shows that a phenomenon is the result of several causes, not just one, and the method of multiple working hypotheses obviously makes it more likely that we will see the interaction of the several causes. The method also promotes much greater thoroughness than research directed toward one hypothesis, leading to lines of inquiry that we might otherwise overlook, and thus to evidence and insights that single-minded research might never have encountered. Thirdly, the method makes us much more likely to see the imperfections in our knowledge and thus to avoid the pitfall of accepting weak or flawed evidence for one hypothesis when another provides a more elegant solution.
Possible Drawbacks of the Method
The method of multiple working hypotheses does have drawbacks. One is that it is impossible to express multiple hypotheses simultaneously, and thus there is a natural tendency to let one take primacy. Keeping a written, not mental, list of our multiple hypotheses is often a necessary solution to that problem.
Another problem is that an open mind may develop hypotheses that are so difficult to test that evaluating them is nearly impossible. An example might be where three of our hypotheses are testable by conventional field work, but a fourth requires drilling of a deep borehole beyond our economic resources. This fourth hypothesis need not paralyze our research, but it should provide a reminder that none of the first three need be true.
A third possible problem is that of vacillation or indecision as we balance the evidence for various hypotheses. Such vacillation may be bad for the researcher, but such vacillation is preferable to the premature rush to a false conclusion.
An Example
The field discovery of a breccia provides an excellent example of the application of the method of multiple working hypotheses. A breccia may form in many ways: by deposition as talus, by collapse after dissolution of underlying evaporites or other soluble rocks, by faulting, by bolide impact, or by other means. Each of the possibilities can be supported by various field evidence, for which we could look if we were evaluating all these hypotheses. However, if we chose just one hypothesis, we might ignore other evidence more clearly supportive of a different hypothesis. For example, if we hypothesized that our breccia was the result of cataclasis during faulting, we might find that the breccia occurred along a fault. We would then accept our single hypothesis and quit looking for additional information. However, if we were using multiple working hypotheses and looked for evidence supporting or disproving all our hypotheses, we might also notice that the breccia was localized in a circular pattern along just one part of the fault. Further examination might show that it was accompanied by shatter cones. Armed with this additional information, we would be more inclined to an interpretation involving an impact that was by chance coincident with a fault. By looking for evidence supportive of a variety of hypotheses, we would have avoided an incorrect interpretation based on coincidence.
Summary
In using the method of multiple working hypotheses, we try to open-mindedly envision and list all the possible hypotheses that could account for the phenomenon to be studied. This induces greater care in ascertaining the facts and greater discrimination and caution in drawing conclusions. Although our human tendencies lead us toward the method of the ruling theory, the method of multiple working hypotheses offers the best chance of open-minded research that avoids false conclusions.
Got it.
Something I'm wrestling with on this project is the balance between testing the models' ability to do science (which I want to do) and finding ways to make them better at doing science (which I basically don't want to do and especially don't want to publish). Doing a lot of iteration on improving scaffolding feels to me like it starts to tip over into the latter (whereas doing bog-standard few-shotting or fine-tuning doesn't).
To be clear, I don't have strong reason to expect that we'd find approaches that are significant boosts to what's already out there. But it could happen, and I'm trying to be cautious about that, in the interest of not further accelerating capabilities improvements.
For benchmarks that measure markers of AI progress, I strongly suspect that publishing the benchmark and/or positive results of AI on it pushes capabilities much more than publishing simple scaffolding + fine-tuning solutions that do well on it.
Examples:
Hard benchmarks of meaningful tasks serve as excellent metrics to measure progress, which is great for capabilities research. Of course, they are also very useful for making decisions that need to be informed by an accurate tracking or forecasting of capabilities.
Whether making hard, meaningful benchmarks such as FrontierMath, ARC-AGI, and LLM science is net negative or positive is unclear to me (a load-bearing question is whether the big AGI labs already have internal benchmarks as good as these that they can use instead). I do think, however, that you'd have to be extraordinarily excellent at designing scaffolding (and finetuning and the like), and even then spend way too much effort at it, to do significant harm from the scaffolding itself rather than the benchmark that the scaffolding was designed for.
For benchmarks that measure markers of AI progress, I strongly suspect that publishing the benchmark and/or positive results of AI on it pushes capabilities much more than publishing simple scaffolding + fine-tuning solutions that do well on it.
You may be right. That said, I'm pretty skeptical of fully general arguments against testing what LLMs are capable of; without understanding what their capabilities are we can't know what safety measures are needed or whether those measures are succeeding.
For what it's worth, though, I have no particular plans to publish an official benchmark or eval, although if a member of my team is excited to work on that I'll support it.
I once implemented something a bit similar.
The idea there is simple: there's a hidden int -> int function and an LLM must guess it. It can execute the function, i.e. provide input and observe the output. To guess the function in a reasonable number of steps it needs to generate and test hypotheses that narrow down the range of possible functions.
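The setup is roughly this shape (a minimal sketch; the callables wrapping the LLM are hypothetical stand-ins for the actual model calls):

```python
# A minimal sketch of the hidden-function eval; pick_input and final_guess
# are hypothetical callables that wrap the LLM under test.
def hidden_function(x: int) -> int:
    return 3 * x + 1  # the rule the LLM must discover

def run_eval(pick_input, final_guess, max_queries: int = 20) -> str:
    observations: list[tuple[int, int]] = []
    for _ in range(max_queries):
        x = pick_input(observations)                  # model picks a probe input
        observations.append((x, hidden_function(x)))  # model observes the output
    return final_guess(observations)                  # e.g. "f(x) = 3x + 1"
```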
Definitely similar, and nice design! I hadn't seen that before, unfortunately. How did the models do on it?
Also have you seen 'Connecting the Dots'? That tests a few things, but one of them is whether, after fine-tuning on (x, f(x)) pairs, the model can articulate what f is, and compute the inverse. Really interesting paper.
Definitely similar, and nice design! I hadn't seen that before, unfortunately. How did the models do on it?
Unfortunately I don't remember many details :/
My vague memories:
But note that I think my eval is significantly easier than yours.
Also have you seen 'Connecting the Dots'? That tests a few things, but one of them is whether, after fine-tuning on (x, f(x)) pairs, the model can articulate what f is, and compute the inverse. Really interesting paper.
I even wrote (a small part of) it : ) I'm glad you find it interesting!
Always a bit embarrassing when you inform someone about their own paper ;)
My vague memories:
Thanks, a lot of that matches patterns I've seen as well. If you do anything further along these lines, I'd love to know about it!
If you do anything further along these lines, I'd love to know about it!
Unfortunately not, sorry (though I do think this is very interesting!). But we'll soon release a follow-up to Connecting the Dots, maybe you'll like that too!
Oh, do you mean the new 'Tell Me About Yourself'? I didn't realize you were (lead!) author on that, I'd provided feedback on an earlier version to James. Congrats, really terrific work!
For anyone else who sees this comment: highly recommended!
We study behavioral self-awareness — an LLM's ability to articulate its behaviors without requiring in-context examples. We finetune LLMs on datasets that exhibit particular behaviors, such as (a) making high-risk economic decisions, and (b) outputting insecure code. Despite the datasets containing no explicit descriptions of the associated behavior, the finetuned LLMs can explicitly describe it. For example, a model trained to output insecure code says, "The code I write is insecure." Indeed, models show behavioral self-awareness for a range of behaviors and for diverse evaluations. Note that while we finetune models to exhibit behaviors like writing insecure code, we do not finetune them to articulate their own behaviors — models do this without any special training or examples.

Behavioral self-awareness is relevant for AI safety, as models could use it to proactively disclose problematic behaviors. In particular, we study backdoor policies, where models exhibit unexpected behaviors only under certain trigger conditions. We find that models can sometimes identify whether or not they have a backdoor, even without its trigger being present. However, models are not able to directly output their trigger by default.

Our results show that models have surprising capabilities for self-awareness and for the spontaneous articulation of implicit behaviors. Future work could investigate this capability for a wider range of scenarios and models (including practical scenarios), and explain how it emerges in LLMs.
On the spectrum from stochastic parrot to general reasoner I'm at 70%. We're definitely closer to a general reasoner than a parrot.
I don't have a clear answer as to what I expect the outcome to be. I was a physics major and I wish there were fewer discrete jumps in physics. Special relativity and general relativity seem like giant jumps in terms of their difficulty to derive, and there aren't any intermediate theories. When this project comes out we'll probably be saying something like: the AI is 50% of the way between being able to derive this law and a conceptually harder one.
With respect to something analogous to Newtonian Mechanics, I think that it does heavily depend on what kind of information can be observed. If a model can directly observe the equivalents of forces and acceleration, then I believe most current models could derive it. If a model can only observe the equivalent of distances between objects and the equivalents of corresponding times, and has to derive a second-order relationship from that, I suspect that only o3 could do that. In six months, I believe that all frontier models will be able to do that.
Given that Terence Tao described o1 as a mediocre graduate student, it probably won't be long until frontier models are actually contributing to research, and that will be the most valuable feedback. I say all this with a lot of uncertainty, and if I'm wrong this project will prove that. Likewise, there's going to be a long period of time where some people will insist that AI can do legitimate automated R&D and others will insist that it can't. At that point this will be a useful test to argue one way or another.
It would be interesting to vary the amount of information an AI is given until it can derive the whole set of equations. For example, see if it can solve for the fourth Maxwell equation given the other three and the ability to perform experiments, or solve for the dynamic version of the equations given only the static ones and the ability to perform experiments.
I hereby grant you 30 Bayes points for registering your beliefs and predictions!
When this project comes out we'll probably be saying something like: the AI is 50% of the way between being able to derive this law and a conceptually harder one.
Can you clarify that a bit? When what project comes out? If you mean mine, I'm confused about why that would say something about the ability to derive special & general relativity.
If a model can directly observe the equivalents of forces and acceleration...
Agreed that each added step of mathematical complexity (in this case from linear to quadratic) will make it harder. I'm less convinced that acceleration being a second-order effect would make an additional difference, since that seems more like a conceptual framework we impose than like a direct property of the data. I'm uncertain about that, though, just speculating.
Thanks!
"Can you clarify that a bit? When what project comes out? If you mean mine, I'm confused about why that would say something about the ability to derive special & general relativity."
I mean your project. I'm hoping it can allow us to be more precise by ranking models' abilities to characterize systems against well-known examples. Like, a model can characterize Special Relativity given what Einstein knew at the time, but not General Relativity. If you were to walk along some hypothetical road from SR to GR, we might ballpark that a model is 30% of the way there. Maybe this project could generate domains that are roughly some x% between SR and GR and validate our estimates.
"Agreed that each added step of mathematical complexity (in this case from linear to quadratic) will make it harder. I'm less convinced that acceleration being a second-order effect would make an additional difference, since that seems more like a conceptual framework we impose than like a direct property of the data."
Right. The important point is that the equation it needs to find is quadratic instead of linear in the data.
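To make that concrete with a toy illustration (mine, not from the comment above): an object under uniform acceleration from rest gives (time, distance) observations that obey

```latex
% Toy illustration: from (t, d) observations alone, the law to discover is
% quadratic in the data rather than linear.
d(t) = \tfrac{1}{2}\,a\,t^{2} \qquad \text{vs. the linear } d(t) = v\,t
```

so the model has to notice that no straight-line relationship fits the data before it can find the right form.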
Got it, thanks. We're planning to try to avoid testing systems that are isomorphic to real-world examples, in the interest of making a crisp distinction between reasoning and knowledge. That said, if we come up with a principled way to characterize system complexity (especially the complexity of the underlying mathematical laws), and if (big if!) that turns out to match what LLMs find harder, then we could certainly compare results to the complexity of real-world laws. I hadn't considered that, thanks for the idea!
Summary
Can LLMs science? The answer to this question can tell us important things about timelines to AGI. In this small pilot experiment, we test frontier LLMs on their ability to perform a minimal version of scientific research, where they must discover a hidden rule about lists of integers by iteratively generating and testing hypotheses. Results are ambiguous: they're mostly pretty bad at it but top systems show apparent signs of life. We're working on a larger, more rigorous experiment, and we really want your input.
Structure
In this post we:
This is followed by appendices with more detail (eg related work, limitations) but we've kept the main body as short and direct as possible.
Introduction
Over the past six months we have been trying to better understand the degree to which LLMs are capable of general reasoning[1] (in LLM Generality is a Timeline Crux, and LLMs Look Increasingly Like General Reasoners). In short: researchers, including AI safety researchers, have widely differing positions on whether (and to what degree) LLMs are capable of the same sort of accurate and broadly applicable reasoning that humans are. At one extreme is the stochastic parrot hypothesis that "no actual language understanding is taking place in [LLMs]" and any apparent signs of reasoning are actually a kind of cheating. At the other extreme is the view that LLMs are already fully capable of all the same types of reasoning as humans and just need a bit more scaling.
An important negative consequence of this disagreement is that the AI safety community's limited resources are spread across a range of timelines, in a way that impacts the scaling labs much less. The best use of those resources is significantly different if LLMs will scale straight to AGI[2], relative to the worlds where substantial further breakthroughs are needed first[3].
In order to try to improve our understanding of this issue and help resolve the disagreement, we created the Numberwang[4] experiment.
The experiment
This is a pilot experiment for a larger research project which started this month and runs through April 2025 (described in a later section). The key question (in both the larger project and the pilot) is whether LLMs can demonstrate general reasoning by autonomously performing a simplified form of scientific investigation. This approach is useful both for addressing the question of general reasoning, and for learning more about the critical object-level question of whether LLMs are likely to be able to accelerate AI development by performing scientific research independently.
The experiment is similar to one performed on humans by Peter Wason in 1960 in his paper, 'On the failure to eliminate hypotheses in a conceptual task' (an experiment familiar to many from Veritasium or Eliezer Yudkowsky). We chose the experiment as a highly simplified version of the scientific process: iteratively proposing hypotheses, testing them, and rejecting or refining them until a hypothesis is found which correctly predicts both positive and negative results.
We generated a set of 60 possible rules, each describing some lists of integers but not others, and randomly sampled 10 in such a way as to ensure an even selection across difficulty levels. Some example rules:
For each rule, each of 11 well-known LLMs was prompted to try to figure out the rule, and then given three initial examples of lists that follow the rule. Here's a shortened version of the prompt, which describes the rest of the procedure (see appendix E for the full prompt, including instructions to avoid various failure modes like confirmation bias):
Claude-3.6-Sonnet[5] was used to check the test lists, and to evaluate whether the model's final hypothesis was (extensionally) equivalent to the real rule. Our prompt encouraged Claude to be fairly lenient in judging.
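For a sense of the shape of that judging step, here's a rough sketch assuming the Anthropic Python SDK; the actual judging prompt lives in the repository and is more detailed than this.

```python
import anthropic

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment

def judge_equivalent(true_rule: str, final_hypothesis: str) -> bool:
    """Ask Claude whether the hypothesis matches the rule (illustrative only)."""
    response = client.messages.create(
        model="claude-3-5-sonnet-20241022",
        max_tokens=10,
        messages=[{
            "role": "user",
            "content": (
                f"True rule: {true_rule}\n"
                f"Proposed hypothesis: {final_hypothesis}\n"
                "Do these two descriptions accept exactly the same lists of "
                "integers? Be fairly lenient about differences in wording. "
                "Answer only YES or NO."
            ),
        }],
    )
    return response.content[0].text.strip().upper().startswith("YES")
```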
Full details are available in the preregistration and repository (and we're happy to go into more detail in the comments).
Results
Most LLMs did very badly. Tested models successfully discovered a mean of 11% and median of 10% of the 10 rules tested. 4 of 11 tested models got 0 correct; 5 of 11 got 1 correct. OpenAI-o1-preview got 3 and Claude-3.6-Sonnet got 4[6].
The authors' independent prior estimates of problem difficulty were mostly confirmed (problems are in order of estimated difficulty in the figure), although one problem (the 'collatzish' hailstone sequence variation) was easier than we expected, possibly because models' training data contains many descriptions of the hailstone sequence itself.
This is a somewhat ambiguous result in terms of what it says about the degree to which LLMs are capable of hypothesis generation and testing. Mostly LLMs weren't capable of it, suggesting that the LLM architecture may be poorly suited to these tasks. On the other hand, the latest and most sophisticated models seemed to show clear signs of life.
As a partial point of comparison, in Wason's testing only about 20% of humans solved the problem tested, but Wason's experiment differed in two important ways: first, subjects were deliberately given a misleading example, and second, only one task was tested (our easiest-rated task, 'strictly increasing order').
Why did the models fail? First, in about 25% of cases (primarily for the smaller models), they failed to produce valid JSON output. The primary failure mode other than these basic errors was confirmation bias[7]: despite repeated instructions to falsify their hypotheses, models failed to do so consistently. Interestingly, in some cases models would effectively add epicycles in the face of contrary evidence, complicating their hypotheses and adding special cases rather than simply rejecting them. Sometimes the model's success or failure was determined by whether its initial pool of hypotheses happened to contain one sufficiently similar to the true rule that the model could move iteratively from one to the other. Another failure mode (again, primarily for smaller models) was simple analysis errors, eg arithmetic mistakes when analyzing an example sequence.
While the conclusions we can draw from this experiment are limited, the full project, which we describe next, tries to find more definitive answers.
Full experiment (in progress)
Key advantages over the pilot
Procedure
The full experiment is similar in most ways to this pilot project, but instead of lists of integers, we randomly generate novel quasi-physical domains. These domains consist of objects of various types; each type of object has certain properties (which may be numeric or boolean). There is a set of operations that can be performed either upon a single object, or by one object on another. All names (object types, properties, operations) are randomly generated.
As a running example, let's say the system contains a type of object called a zorple, which may or may not be quizzleplistic, and whose zibblosity has some numeric value. It is possible to splorficate a zorple, or to wuzzle it. Our domain might contain, say, eight zorples, each of which may or may not be quizzleplistic, and each of which has its own zibblosity value.
The system has a set of underlying (again, randomly generated) causal relationships. For example, perhaps when a zorple is splorficated, its zibblosity doubles. The LLM being tested is not told anything about these ground-truth relationships.
Note that these domains can be as simple or complex as we like. They could contain a single type of object with a single boolean property, with a single operation that can be performed, and a small number of mathematically simple causal relationships. Or they could contain many types of objects, each with many numeric and boolean properties, many operations that have different effects depending on what objects perform and receive them, and a large number of complex causal relationships.
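As a rough sketch of what one of these generated domains amounts to, using the running example's names (this is illustrative only, not our actual implementation; the wuzzle effect in particular is invented here):

```python
from dataclasses import dataclass

@dataclass
class Zorple:
    quizzleplistic: bool  # boolean property, randomly initialized
    zibblosity: float     # numeric property, randomly initialized

class Domain:
    """One randomly generated quasi-physical domain (illustrative only)."""

    def __init__(self, zorples: list[Zorple]):
        self.zorples = zorples

    # Hidden causal relationship from the running example:
    # splorficating a zorple doubles its zibblosity.
    def splorficate(self, i: int) -> str:
        self.zorples[i].zibblosity *= 2
        return f"zorple {i} now has zibblosity {self.zorples[i].zibblosity}"

    # A second relationship, invented purely for illustration:
    # wuzzling a zorple toggles whether it is quizzleplistic.
    def wuzzle(self, i: int) -> str:
        z = self.zorples[i]
        z.quizzleplistic = not z.quizzleplistic
        state = "quizzleplistic" if z.quizzleplistic else "not quizzleplistic"
        return f"zorple {i} is now {state}"
```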
As in the pilot experiment, we inform the model that it is a scientist investigating a new domain, and that it should perform experiments in order to scientifically characterize the domain. Its task is to decide on specific experiments to perform; in our example above, it might decide to splorficate three of the zorples to see whether and how their properties change.
For each experiment that the model chooses to perform, an automated 'lab assistant' conducts the experiments and reports the results (which depend entirely on the pre-established causal relations of the domain). The lab assistant takes no initiative at all; its only role is to report the output of experiments that the LLM requests.
Once the model-as-scientist has performed as many experiments as it wishes to, we test whether it has learned a full and correct model of the domain (this could be done in several ways, ranging from being able to correctly predict the outcomes of new experiments to being able to write a report containing all relevant equations).
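Schematically, the loop looks something like this (a sketch with hypothetical helper names; propose_experiment and final_report stand in for prompting the LLM under test, and domain.apply for the automated lab assistant):

```python
# Schematic version of the experiment loop (illustrative helper names).
def run_episode(model, domain, max_experiments: int = 50) -> str:
    transcript = [domain.describe_objects()]  # initial observations only, no rules
    for _ in range(max_experiments):
        request = model.propose_experiment(transcript)  # e.g. "splorficate zorple 3"
        if request is None:  # the model decides it has learned enough
            break
        result = domain.apply(request)  # lab assistant just executes and reports
        transcript.append(result)
    return model.final_report(transcript)  # then graded against the ground truth
```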
For a bit more detail, see this writeup.
What we want from you
Over the past six months, we have had many discussions with other researchers, here and elsewhere, about whether LLMs are capable of general reasoning, and in particular whether they are capable of general reasoning out-of-distribution. Opinions vary quite dramatically. We claim that the full project described above will provide good evidence on this question.
If this is a topic you have a view on, we'd really like to hear from you in the comments:
Conclusion
In our view, it's vitally important to better understand whether LLMs are weaker than they naively appear, in ways that will limit their ability to do autonomous scientific research and prevent direct scaling to full AGI. The pilot project shown here provides some initial, ambiguous evidence on this question, and we expect the full project to provide significantly stronger evidence. We'd very much like to get your input on how we could shape the project to provide even better evidence on this key question.
To the extent that these results (or results from the main experiment) show that LLMs struggle with general reasoning, we must be careful about what implications we draw. First, these results clearly do not mean that LLMs can't do anything that we would think of as reasoning; numerous benchmarks and results show LLMs succeeding at tasks that we typically think of as reasoning, even if it's not fully general or extended out-of-distribution. Second, these results don't mean that LLMs can't increase the velocity of scientific research; again, there's a growing body of literature (eg see examples here) showing that they can. What it could show, though, is that LLMs are potentially fundamentally limited at autonomously doing the end-to-end cycle of scientific research, and that likely places bounds on the degree to which they can accelerate research (in particular capabilities research). One human scientist might be able to direct a handful of LLMs and thereby speed up their research a few times, but it seems less likely that they could spin up 1000 LLMs, point them at a problem, and turn them loose.
Inversely, to the extent that these results suggest signs of life from LLMs on this sort of autonomous research, interpretation of that result requires equal care. Our experiment here (and in the main project) aims to test almost the simplest possible version of such research. LLMs may need multiple orders of magnitude further scaling before they can tackle important real-world problems autonomously. Nevertheless, such signs of life would suggest that this is something we'll see in the relatively near future, and without the need for major further breakthroughs; scaling will likely be enough.
We are eager to get more compelling and rigorous evidence on this question from the main project. The resulting clarification of how far we are from AGI will let us better direct our resources toward achieving the best possible chance of good outcomes for humanity.
Appendices
In order to keep the main post as short and clear as possible, we've moved these sections to the bottom.
A. Acknowledgments
Thanks for helpful input to Abdur Raheem Ali (who had ideas for a similar experiment which partly inspired this one), Daniel Tan, Robert Kralisch, and Andy Arditi. Thanks for draft feedback to Nicholas Kees Dupuis, Daniel Tan, James Chua, Felix Binder, Seth Herd, Andy Arditi, Jord Nguyen, Aaron Scher, and Robert Adragna.
B. Related work
We will provide a more extensive discussion of related work with the main experiment, but to briefly point to some interesting representative examples (for both the pilot and the full project):
C. Limitations
This pilot project has many limitations, some of which will be overcome in the main project.
D. Full list of tested rules
The full set of 60 candidate rules can be found here; the following are the 10 we used for the pilot:
E. Full prompt
Roughly seven hours were spent iterating on the prompt.
You can also see the analysis and judgment prompts given to the evaluating model in the repository.
F. Minimal human baseline
The authors played four rounds of the game with each other. This was fairly informal and the details were slightly different, so this shouldn't be taken as an official baseline, but results were:
G. Example successes and failures
You can find the full set of official transcripts here, and various non-official ones here and here.
Example of success
Here's a small model (32B parameters) succeeding at 'There are no repeated numbers in the list', one of two models that got this problem:
Example of failure
Here we see a failure from a state-of-the-art model on a problem that only one model (Sonnet) succeeded on:
Current working definition of 'general reasoning' (from here): "The ability to do deduction, induction, and abduction in a careful, step by step way, without making many errors that a better reasoner could avoid, including in new domains." Possibly also "the ability to use all of that to build a self-consistent internal model of the domain under consideration," although I'm uncertain whether that part is both valid and necessary.
I'm using a definition of AGI often attributed to Shane Legg, as a system that 'can do practically any cognitive task a human can do' -- but my point here applies to most other definitions of AGI as well.
Note also that in those worlds where LLMs are already capable of general reasoning, they will likely themselves accelerate advancement to and past AGI by being able to do autonomous or semi-autonomous capabilities research.
Name taken from an excellent series of sketches from Mitchell & Webb, of which the first is here.
Officially and confusingly known as 'claude-3-5-sonnet-20241022'; this model is considerably more capable than the earlier versions also tagged as claude-3.5-sonnet, so many have taken to referring to it as 3.6.
We suspect that Sonnet's 40% correct is a bit of an outlier, based on earlier experiments (on a separate set of rules of comparable difficulty). On two reruns, Sonnet got 30%. For full transparency: Sonnet actually got one more correct on the initial run, for 50%, but we had accidentally failed to add the second and third examples for the last two questions, so we did an official rerun of those for all models, and that time Sonnet failed to get it. It's also worth noting that iteration on the prompt prior to the main experiment was largely done with Sonnet; it's possible that as a result the prompt is somewhat optimized for Sonnet (although this was not our intent), potentially making Sonnet's relative success less significant.
Previous research (eg 1, 2) has demonstrated confirmation bias and other cognitive biases in LLMs.