Edit: The sitation has evolved but is still somewhat confusing. There is now a leaderboard of scores on the public test set that Ryan is #1 on (see here). But this tweet from Jack Cole indicates that his (many month old) solution gets a higher score on the public test set than Ryan's top score on that leaderboard. I'm not really sure what's going on here,
One important caveat to the presentation of results in this post (and the discussion on Twitter) is that there are reasons to think this approach may not be SOTA, as it performs similarly to the prior best-performing approach when tested apples-to-apples, i.e. on the same problems.
There are three sets of ARC problems: the public training set, the public eval set, and the private eval set.
My two main deductions from this are:
Apparently, lots of people get better performance on the public test set than the private one, which is a little surprising given that if you read this page from the ARC folks, you'll see the following:
The public training set is significantly easier than the others (public evaluation and private evaluation set) since it contains many "curriculum" type tasks intended to demonstrate Core Knowledge systems. It's like a tutorial level.
The public evaluation sets and the private test sets are intended to be the same difficulty.
Two explanations come to mind: maybe the public and private test sets are not IID, and/or maybe past SOTA method overfit to the public set. Chollet claims it's (accidentally) the latter here, but he doesn't rule out the former. He says the tasks across the two public test sets are meant to be equally hard for a human, but he doesn't say they're divided in an IID manner.
I guess we'll see how the results on the public leaderboard shake out.
(Expanding on a tweet)
I endorse this comment for the record.
I'm considering editing the blog post to clarify.
If I had known that prior work got a wildly different score on the public test set (comparable to the score I get), I wouldn't have claimed SOTA.
(That said, as you note, it seems reasonably likely (though unclear) that this prior solution was overfit to the test set while my solution is not.)
I'm submitting to the private leaderboard (with fewer samples than used in this post). If results indicate that SOTA is unlikely, I'll retract my claim.
I edited to add:
(Edit: But see this comment and this comment for important clarifications.)
And changed from "this dataset" to "a similarly difficult dataset".
I agree that there is a good chance that this solution is not actually SOTA, and that it is important to distinguish the three sets.
There's a further distinction between 3 guesses per problem (which is allowed according to the original specification as Ryan notes), and 2 guesses per problem (which is currently what the leaderboard tracks [rules]).
Some additional comments / minor corrections:
The past SOTA got [we don't know] on the first, 52% on the second, and 34% on the third.
AFAICT, the current SOTA-on-the-private-test-set with 3 submissions per problem is 37%, and that solution scores 54% on the public eval set.
The SOTA-on-the-public-eval-set is at least 60% (see thread).
Apparently, lots of people get worse performance on the public test set than the private one
I think this is a typo and you mean the opposite.
From looking into this a bit, it seems pretty clear that the public eval set and the private test set are not IID. They're "intended" to be the "same" difficulty, but AFAICT this essentially just means that they both consist of problems that are feasible for humans to solve.
It's not the case that a fixed set of eval/test problems were created and then randomly distributed between the public eval set and private test set. At your link, Chollet says "the [private] test set was created last" and the problems in it are "more unique and more diverse" than the public eval set. He confirms that here:
This is *also* likely in part due to the fact that the eval set contains more "easy" tasks. The eval set and test set were not calibrated for difficulty. So while all tasks across the board are feasible for humans, the tasks in the test set may be harder on average. This was not intentional, and is likely either a fluke (there are only 100 tasks in the test set) or due to the test set having been created last."
Bottom line: I would expect Ryan's solution to score significantly lower than 50% on the private test set.
And https://x.com/bshlgrs/status/1806397587085468116 for some discussion.
Last week there was some uncertainty about whether @RyanPGreenblatt's ARC-AGI solution was really sota, because many other solutions did better on public eval and we didn't have private test results. There is now a semi-private eval set; he's at the top of this leaderboard.
Our guess is that Ryan’s technique beats other solutions despite performing worse at the public eval because other solutions are more overfit to public eval. (But we don’t know the performance of MindsAI’s solution (@Jcole75Cole), which is sota on Kaggle, on this eval set.)
This result doesn’t clarify everything, but at least addresses concerns that Ryan’s solution is overfit because of data contamination in the data OpenAI used to pretrain GPT-4o.
Thanks to the ARC team for helping with running Ryan’s submission, and to @Jcole75Cole and @MaxNadeau_ for helpful discussion, and thanks to the community as a whole for being chill during the uncertainty here.
Thanks, edited my post to reference this (lmk if you understand what's happening here better than I do)
(x-post from substack comments)
Contra Chollet, I think that current LLMs are well described as doing at least some useful learning when doing in-context learning.
I agree that Chollet appears to imply that in-context learning doesn't count as learning when he states:
Most of the time when you're using an LLM, it's just doing static inference. The model is frozen. You're just prompting it and getting an answer. The model is not actually learning anything on the fly. Its state is not adapting to the task at hand.
(This seems misguided as we have evidence of models tracking and updating state in activation space.)
However later on in the Dwarkesh interview, he says:
Discrete program search is very deep recombination with a very small set of primitive programs. The LLM approach is the same but on the complete opposite end of that spectrum. You scale up the memorization by a massive factor and you're doing very shallow search. They are the same thing, just different ends of the spectrum.
My steelman of Chollet's position is that he thinks the depth of search you can perform via ICL in current LLMs is too shallow, which means they rely much more on learned mechanisms that require comparatively less runtime search/computation but inherently limit generalization.
I think the directional claim "you can easily overestimate LLMs' generalization abilities by observing their performance on common tasks" is correct—LLMs are able to learn very many shallow heuristics and memorize much more information than humans, which allows them to get away with doing less in-context learning. However, it is also true that this may not limit their ability to automate many tasks, especially with the correct scaffolding, or stop them from being dangerous in various ways.
This makes a lot of sense to me, and makes me want to figure out exactly how to operationalize and rigorously quantify depth of search in LLMs! Quick thought is that it should have something to do with the spectrum of the transition matrix associated with the mixed state presentation (MSP) of the data generating process, as in Transformers Represent Belief State Geometry in their Residual Stream . The MSP describes synchronization to the hidden states of the data generating process, and that feels like a search process that has max-depth of the Markov order of the data generating process.
I really like the idea that memorization and this more lofty type of search are on a spectrum, and that placement on this spectrum has implications for capabilities like generalization. If we can figure out how to understand these things a more formally/rigorously that would be great!
I'd like to provide some additional context:
LLM + interpreter is considered neurosymbolic rather than just 'scale.'
Fair enough if literally any approach using symbolic programs (e.g. a python interpreter) is considered neurosymbolic, but then there isn't any interesting weight behind the claim "neurosymbolic methods are necessary".
You might as well say "evolution didn't create intelligent software engineers because humans are much worse at software engineering without access to a python interpreter, so only neurosymbolic intelligence will work".
GPT-4o was trained on JSONs of the public datasets (which is what Ryan tested on). Unclear how much this could impact performance on the public train and test sets. Would be great to see the performance on the private test set.
I think this is probably fine based on my understanding of how data leakage issues work, but it seems worth checking.
I do think limiting the amount of compute you can use to win the context should be taken into context. Perhaps enough compute solves the task and, in practice, this is mostly all that matters (even if a model + scaffolding can't solve ARC-AGI under present rules).
I think it seems reasonable to analyze benchmarks while fixing AI compute costs to be 10-100x human costs. But, this is a huge amount of compute because inference is cheap and humans are expensive.
Given that this setup can't be used for the actual competition, it may be that SoTA (neurosymbolic) models can get a high score a year or more before the models that can enter the competition are allowed.
Yeah, seems likely.
Fair enough if literally any approach using symbolic programs (e.g. a python interpreter) is considered neurosymbolic, but then there isn't any interesting weight behind the claim "neurosymbolic methods are necessary".
If somebody achieved a high-score on the ARC challenge by providing the problems to an LLM as prompts and having it return the solutions as output, then the claim "neurosymbolic methods are necessary" would be falsified. So there is weight to the claim. Whether it is interesting or not is obviously in the eye of the beholder.
I get that adding an interpreter feels like a low bar for neurosymbolic (and that scale helps considerably with making interpreters/tools useful in the first place), but I'd be curious to know what you have in mind when you hear that word.
To be honest, I didn't really have any clear picture of what people meant be neurosymboloic.
I assumed something like massively scaled up expert systems or SVMs with hand engineered features, but I haven't seen anyone exhibiting a clear case of something that is claimed to be a central example of a neurosymbolic system which also does any very interesting task.
AlphaGo was neurosymbolic as well
But is AlphaGo using just the policy network (which still beats pros) 'neurosymbolic'? What about MuZero, is that neurosymbolic?
The current approach is heavy on the symbolic program part, but I think it would be entirely possible to distill the programs down into the LLM, and simply have the LLM generate/model sequences of image-tokens++program-tokens++image-tokens, and learn from any 'neurosymbolic' approach: https://redwoodresearch.substack.com/p/getting-50-sota-on-arc-agi-with-gpt/comment/59334256 (And then you would have a pure LLM solver which at no point actually runs a Python program, it merely uses them as a scaffold for inner-monologue; and given results on inner-monologue distillation, likely even the Python tokens could be trained away.)
Nice! Do you have a sense of the total development (and run-time) cost of your solution? "Actually getting to 50% with this main idea took me about 6 days of work." I'm interested in the person-hours and API calls cost of this.
I was pretty inefficient on iteration, so around $40,000. Around 6 person days of development work, though a considerable amount of hours on each day.
How much of that is API costs? Seems like the most part, unless you're considering a truly exorbitant salary.
I'm honestly always amazed from just how much money some people in these parts seem to have. That's a huge sum to spend on an LLM experiment. It would be pretty large even for a research group, to burn that in just 6 days!
The money is not what you should take away; there are many ways to spend money - IP licenses, software, interns, buildings, consultants... What you should take away is the compute that that money bought (given that profit margins for OA seem to be minimal, and the cost ~ compute).
Behind many key results, there is an unspoken level of compute. (Where did that KataGo or grokking result come from? They came from letting the GPUs run a lot longer than planned, either because of spare capacity or forgetfulness.) You will not find that in the papers, usually, but you should always remember this; it's sorta like Anatole France - 'behind every great result, there is a great compute'. And this is why compute drives AI progress, and always has, and has always been underestimated as a driver: Compute Rules Everything Around Me.
This is a lot more blatantly capabilities than your last two capabilities posts. Consider deleting this and not publishing or sharing any further work like it.
FWIW, I did consider whether this work would non-trivially advance AI progress, in particular advance the scarier parts of AI progress.
I think the most concerning aspect is hype and this seemed not-that-bad to me.
Capability elicitation is also a "make thing more capable" technique, which I would prefer not get published when developed. Note that I don't feel that I have enough insight into your research plans to suggest not doing the work, just not sharing it indiscriminately.
Ah, I see. In particular, my prior work is about making elicitation robust to adversaries which feels pretty importantlly different to me.
I would describe this as a human-AI system. You are doing at least some of the cognitive work with the scaffolding you put in place through prompt engineering etc, which doesn’t generalise to novel types of problems.
Quoting from a substack comment I wrote:
Certainly some credit goes to me and some to GPT4o.
The solution would be much worse without careful optimization and wouldn't work at all without gpt4o (or another llm with similar performance).
It's worth noting a high fraction of my time went into writing prompts and optimization the representation. (Which is perhaps better described as teaching gpt4o and making it easier for it to see the problem.)
There are different analogies here which might be illuminating:
- Suppose that you strand a child out in the woods and never teach them anything. I expect they would be much worse at programming. So, some credit for there abilities goes to society and some to their brain.
- If you remove my ability to see (on conversely, use fancy tools to make it easier for a blind person to see) this would greatly affect my ability to do ARC-AGI puzzles.
- You can build systems around people which remove most of the interesting intelligence from various tasks.
I think what is going on here is analogous to all of these.
Separately, this tweet is relevant: https://x.com/MaxNadeau_/status/1802774696192246133
Nice post! Someone want to explain how to do problem 2? I’ve looked at it for about 3 mins and can’t figure it out and am now just curious lol
You have to return the pattern which when inserted in place of the lightblue cells would result in a symmetric image (so you look at the other side and mirror it)
This is wrong actually. Note that the input where you have to produce the corresponding output has the light blue further to the left than what can be achieved with mirroring (the grid isn't symmetric). I think you have to extrapolate the pattern. This is an extremely brutal one.
Missed this comment chain before making my comment. My complaint is the most natural extrapolation here (as I assess it, unless I'm missing something) would go out of bounds. So either you have ambiguity about how to deal with the out of bounds, or you have a (in my view) less natural extrapolation.
E.g. "shift towards/away from the center" is less natural than "shift to the right/left", what would you do if it were already in the center for example?
I agree that the most natural symmetry goes out of bounds. But there are other patterns you can use. First, you can solve the bottom half of the unknown rectangle by top-down symmetry. And then, it seems like the center of top portion, when rotated right matches the center of the right portion (does not hold near the diagonal and the center, but holds near the top), and so you can use this to fill in the missing tiles. Moreover, this observation that the top center portion can be rotated to get left and right holds for the other examples. So I think this solves the problem in the spirit of the exercise (find the simplest rules that predicts what the pattern on the example and is applicable to the prediction inputs).
Huh, I was missing something then, yes. And retrospectively should have thought of it -
it's literally just filling in the blanks for the light blue readout rectangle (which in a human-centric point of view, is arguably simpler to state than my more robotic perspective even if algorithmically more complex) and from that perspective the important thing is not some specific algorithm for grabbing the squares but just finding the pattern. I kind of feel like I failed a humanness test by not seeing that.
Problem 2 seems badly formulated because
The simplest rule explaining the 3 example input-output pairs would make the output corresponding to the test input depend on squares out of bounds of the test input.
To fix you can have some rule like have the reflection axis be shifted from the center by one in the direction of the light blue "readout" rectangle (instead of fixed at one to the right from the center) or have the reflection axis be centered, and have a 2-square shift in a direction depending on which side of center is the readout rectangle (instead of in a fixed direction), but that seems strictly more complicated.
Alternatively, you could have some rule about wraparound, or e.g. using white squares if out of bounds, but what rule to use for out of bounds squares isn't determined from the example input-output pairs given.
Edit: whoops, see Fabien Roger's comment and my reply.
"A team of 3 top research ML engineers with fine-tuning access to GPT-4o (including SFT and RL), $10 million in compute, and 1 year of time could use GPT-4o to surpass typical naive MTurk performance at ARC-AGI on the test set while using less than $100 per problem at runtime (as denominated by GPT-4o API costs)."
But doing so would tune that GPT-4o to be less good at other tasks, wouldn't it? Isn't this way of solving just Goodharting the metric and actually pushing the LLM away from being "General Intelligence"?
That being said, this is a great job you have done just over a week. Thank you for your contribution for science.
Isn't this way of solving just Goodharting the metric and actually pushing the LLM away from being "General Intelligence"?
I certainly agree that solving ARC-AGI in this way wouldn't indicate general intelligence. (And, I generally think that ARC-AGI probably doesn't track general intelligence that well. And it tracks even less when no holds barred methods are considered.)
But doing so would tune that GPT-4o to be less good at other tasks, wouldn't it?
I don't think it would degrade performance on other tasks very much if you took basic precautions. LLMs have a lot of parameters.
And every time you get evidence that indicates that scaling might be dangerously powerful (and I hope this post provided some), you should update appropriately in favor of more caution.
I had spent a good chunk of the past week reading the literature on generalization in LLMs and trying to think through my beliefs on this question, and this post definitely caused me to update. Great work, thank you!
I think this experiment does not update me substantially towards thinking we are closer toward AGI because the experiment does not show GPT-4o coming up with a strategy to solve the task and then executing it. Rather a human (a general intelligence) has looked at the benchmark then devised an algorithm that will let GPT-4o perform well on the task.
Further, the method does not seem flexible enough to work on a diverse range of tasks and certainly not without human involvement in adapting it.
In other words, the result is less that GPT-4o is able to achieve 50% on ARC-AGI. It is that a human familiar with the style of question used in ARC-AGI can devise a method for getting 50% on ARC-AGI that offloads some of the workload to GPT-4o.
I think this should only non-trivially update you if you had a specific belief like: "Transformers can't do actual general learning and reasoning. For instance, they can't do hard ARC-AGI problems (a reasonably central example of a task that requires learning and reasoning) at all."
(I'm pretty sympathetic to a view like "ARC-AGI isn't at all representative of hard cognitive tasks and doesn't really highlight an interesting bottleneck in transformer ability more so than coding/agency benchmarks. Thus, I don't really update on most of any result.".)
In other words, the result is less that GPT-4o is able to achieve 50% on ARC-AGI. It is that a human familiar with the style of question used in ARC-AGI can devise a method for getting 50% on ARC-AGI that offloads some of the workload to GPT-4o.
There has been a bunch of discussion on this sort of point here, here, and on the substack version of the post. So, you might be interested in that discussion.
I think applying your same labeling approach would be considered pretty misleading in the context of human organizations or human education. Just because I build a structure around the model, give the model some examples of solving the problem, and let the model try many times doesn't mean that the cognitive work is done by me!
(Most of my work is in building good representations to make up for vision issues and providing few-shot examples. I do a bunch of other tweaks which modestly improve performance. Of course, we also select which program to use based on which what passes the tests, but this didn't really invovle me devising a method!)
Of course, this is a word choice question and if you also think "humans can't do mechanical engineering and aren't really doing the cognitive work of mechanical engineering, only humans+schools+CAD software can do mechanical engineering effectively", that would be consistent. (But this seems like a very non-traditional use of terms!)
GPT-4o’s vision is terrible on grids.
I'm confused -- unless I'm misunderstanding something, the challenges are presented as JSON data. Where does vision come in? Or are you using vision as a metaphor for the model attempting to create an internal representation of the grid from the JSON data?
[UPDATE -- never mind, I had forgotten that the method involved 'Provide the ARC-AGI problem to GPT-4o, with both an image representation and with various text representations for each grid in the problem.']
The LessWrong Review runs every year to select the posts that have most stood the test of time. This post is not yet eligible for review, but will be at the end of 2025. The top fifty or so posts are featured prominently on the site throughout the year.
Hopefully, the review is better than karma at judging enduring value. If we have accurate prediction markets on the review results, maybe we can have better incentives on LessWrong today. Will this post make the top fifty?
I recently got to 50%[1] accuracy on the public test set for ARC-AGI by having GPT-4o generate a huge number of Python implementations of the transformation rule (around 8,000 per problem) and then selecting among these implementations based on correctness of the Python programs on the examples (if this is confusing, go here)[2]. I use a variety of additional approaches and tweaks which overall substantially improve the performance of my method relative to just sampling 8,000 programs.
[This post is on a pretty different topic than the usual posts I make about AI safety.]
The additional approaches and tweaks are:
The prior state of the art on a similarly difficult dataset was 34% accuracy, so this is a significant improvement.[3] (Edit: But see this comment and this comment for important clarifications.)
On a held-out subset of the train set, where humans get 85% accuracy, my solution gets 72% accuracy.[4] (The train set is significantly easier than the test set as noted here.)
Additional increases of runtime compute would further improve performance (and there are clear scaling laws), but this is left as an exercise to the reader.
In this post:
Thanks to Fabien Roger and Buck Shlegeris for a bit of help with this project and with writing this post.
What is ARC-AGI?
ARC-AGI is a dataset built to evaluate the general reasoning abilities of AIs. It consists of visual problems like the below, where there are input-output examples which are grids of colored cells. The task is to guess the transformation from input to output and then fill out the missing grid. Here is an example from the tutorial:
This one is easy, and it’s easy to get GPT-4o to solve it. But the tasks from the public test set are much harder; they’re often non-trivial for (typical) humans. There is a reported MTurk human baseline for the train distribution of 85%, but no human baseline for the public test set which is known to be significantly more difficult.
Here are representative problems from the test set[5], and whether my GPT-4o-based solution gets them correct or not.
Problem 1:
Problem 2:
Problem 3:
My method
The main idea behind my solution is very simple: get GPT-4o to generate around 8,000 python programs which attempt to implement the transformation, select a program which is right on all the examples (usually there are 3 examples), and then submit the output this function produces when applied to the additional test input(s). I show GPT-4o the problem as images and in various ascii representations.
My approach is similar in spirit to the approach applied in AlphaCode in which a model generates millions of completions attempting to solve a programming problem and then aggregates over them to determine what to submit.
Actually getting to 50% with this main idea took me about 6 days of work. This work includes constructing few-shot prompts, building better text representations of these grids, iterating against the train set, and implementing various other tweaks to improve performance.
I started on this project a few days before Dwarkesh Patel recorded the recent podcast with Chollet. This was inspired by Dwarkesh talking to my coworker Buck about ARC-AGI, and then Buck being like “come on, surely you can do better than current SOTA using LLMs”. Then, I tested GPT-4o a bit and it seemed to get what was going on. I’ve recently been doing some research that involved getting Claude 3 Opus to do reasoning right at the edge of its capabilities, and thought I might have an edge based on my experience from that project.
At a high level, the method I use is:
In addition to iterating on the training set, I also did a small amount of iteration on a 100 problem subset of the public test set. All the results I presented here were computed on a different subset of the public test set that does not overlap. The train and test set are not IID, and the test set is both much harder and somewhat qualitatively different (I think), so using a subset of the test set for iteration was useful for quickly getting a better sense of how things change with difficulty. It's unfortunate that these sets aren’t IID: it makes iteration harder and more confusing.[8]
More of the details of my approach and a bunch of tricks I use to improve performance, can be found at the bottom of this post. You can find the full solution in this GitHub repo: https://github.com/rgreenblatt/arc_draw_more_samples_pub.
Detailed results
What are the returns to more sampling?
We can analyze how more samples improves performance on test. For ease of interpretation (and implementation), I show returns to samples on the primary and most performant prompting variant I use (in the section below, it has the name “V2”)[9]. I also don’t incorporate revision (as this would require a separate run for each number of samples).
There appears to be a relatively clean scaling law[10]. Of course, a linear fit from log(k) to accuracy can’t go on forever as it would imply you eventually go above 100% accuracy![11]
The fit is in terms of log base 2. So, it indicates an additional 3% correct per doubling of k.
While the number of samples I use is large, using far more samples is certainly possible. (For reference, AlphaCode uses up to a million samples per problem).
What are the returns to better prompting and code fixing?
I find that improving prompting and adding a revision step is important to improve the accuracy. Huge numbers of samples, better prompting, and a revision step are all essential for the performance I achieve.
Here is a breakdown of the performance by version of the prompt:
I don’t have numbers for cutting reasoning while using comparable amounts of samples.
I use the exact same settings for the train distribution numbers I report.
We can also compare the scaling of different prompts:
Above, I noted that revision seems to fix around 20% of the remaining incorrect. Supposing that this scales indefinitely (a dubious assumption), we can see how long it takes to reach various performance thresholds with my current method and various ablations.
Getting to “just” 70% performance with my current method would take 2^21 samples (around 2 million)! I think 70% is probably similar to the performance of typical people on MTurk on the test set, though no number is known.
Qualitative analysis
GPT-4o is limited by failures other than reasoning, such as:
Removing these non-reasoning weaknesses would improve the performance of my solution by a significant amount. Vision is an especially large weakness.
To be clear, GPT-4o is also limited by being very dumb. A smart human with only access to the text representation could substantially outperform GPT-4o by spending a bunch of time on each problem.
Caveats
Predictions
What it means about current LLMs
Chollet says:
Contra Chollet, I think that current LLMs are well described as doing at least some useful learning when doing in-context learning.
In particular, given my results you have to reject one of the following claims:
Claim 1 seems likely true to me for a reasonable notion of “learning”. I think François Chollet agrees here. Most of my doubts about this claim are concerns that you can basically brute force ARC-AGI without interestingly doing learning (e.g. brute-force search over some sort of DSL or training on a huge array of very similar problems). These concerns apply much less to the kind of approach I used.
Claim 2 seems true to me: the distribution of programs you are searching over has to be pretty close to the right program for Best-of-6k to work at all: if you did best-of-6k for random python programs, this would not work! Perhaps François Chollet disagrees here, but I think this view would be unreasonable.
Therefore, I think Claim 3 is false: I think LLMs actually do some relevant “learning” when doing in-context learning. Overall performance is very weak (otherwise I wouldn’t have needed to draw thousands of samples in my solution), but it’s some learning nevertheless. (Though there are various obstacles to GPT-4o performing well other than reasoning and learning ability such as vision and coding limitations.)
(One caveat is that this could be false if a substantial fraction of GPT-4o’s performance comes from dataset contamination. This seems very unlikely to me.)
It’s worth emphasizing that GPT-4o’s learning within a single context seems much less competent than typical human learning. But it is learning nonetheless. My view isn’t that GPT-4o is smart relative to humans in the typical way we mean smart, but I do think it has something which is well described as intelligence.
What ARC-AGI tells us about AGI
Progress has not stalled. I think ARC-AGI will be one benchmark among many that just gets solved by scale, and that as LLMs are scaled up, we should expect them to be able to solve tasks of increasing complexity when used with the appropriate tools, resources and prompting/scaffolding. (In this case, performance is due to scale and 6 days of iteration.)
I think it is plausible that scaling LLMs[12] by another 2-10 OOMs in effective training compute, and giving them tools, resources and scaffolding to solve real-world tasks can result in "AGI"[13], understood as AI capable of massively accelerating R&D. Such a technology would likely be the most transformative technology in human history, and I prefer to refer to this as "Transformative AI” (TAI) due to the potential ambiguity of "AGI".
TAI poses huge risks. Making mistaken predictions about where LLMs are heading could result in a dramatic underestimate of the dangers they could pose. If, like Mike Knoop (co-host of the ARC-AGI prize), you oppose bills like SB 1047 because you think LLMs won’t scale[14], then it really matters that you are right about LLMs not scaling. And every time you get evidence that indicates that scaling might be dangerously powerful (and I hope this post provided some), you should update appropriately in favor of more caution.
ARC-AGI probably isn't a good benchmark for evaluating progress towards TAI: substantial "elicitation" effort could massively improve performance on ARC-AGI in a way that might not transfer to more important and realistic tasks. I am more excited about benchmarks that directly test the ability of AIs to take the role of research scientists and engineers, for example those that METR is developing. (I think developing these evaluations and the science of conducting these evaluations is a highly leveraged way of reducing the risk that powerful AGI takes humanity by surprise; if you’re interested in contributing to them, you can see open roles at METR here. Note that I have various COIs with METR.) I still think that work like ARC-AGI can be good on the margin for getting a better understanding of current AI capabilities.
(I'm ambivalent about advancing AI progress overall, especially in a broadly proliferated fashion. If I thought that my work on ARC-AGI would likely substantially advance AI progress, I would not have published this blog post.)
More minimally, I think that ARC-AGI would be a better evaluation of progress towards TAI if it used purely text based problems or at least had a text based subset: good vision isn’t necessary for TAI and improved vision has outsized effects on ARC-AGI relative to TAI progress.
I also think that the ARC-AGI prize is made worse by not allowing SOTA (closed source) LLMs in submission and by overly restricting runtime computation[15]. (There is an open leaderboard which has no such constraints.) I expect that SOTA LLMs will be pushing the frontier of progress in ARC-AGI based on these results and general views about what will happen with SOTA LLMs in the next few years. Higher limits on runtime compute seem important for advance warning: if an approach currently costs 10x human labor costs but can do a task, then it will probably cost way less in a few years (or less time) as further optimizations accrue. For instance, substantially optimizing the runtime compute used by my approach seems doable.
Overall, I appreciate the approach of ARC-AGI and I appreciate that Chollet and Knoop have made strong and relatively specific claims. (Some of which now seem to be contradicted!) Nonetheless, I think there are substantially better ways to benchmark progress toward transformative AI.
Appendix: A bunch of tricks used in my solutions
I use a bunch of tricks to improve performance:
Some tricks I don’t use that might improve performance:
Appendix: results for the train set
Around 2^15 or 32,000 samples would be required to reach MTurk performance on the train set.
Appendix: Returns to revision samples