About nine months ago, three friends and I decided that AI had gotten good enough to monitor large codebases autonomously for security problems. We started a company around this, trying to leverage the latest AI models to create a tool that could replace at least a good chunk of the value of human pentesters. We have been working on this project since June 2024.

Within the first three months of our company's existence, Claude 3.5 Sonnet was released. Just by switching the portions of our service that ran on gpt-4o over to the new model, our nascent internal benchmark results immediately started to get saturated. I remember being surprised at the time that our tooling not only seemed to make fewer basic mistakes, but also seemed to qualitatively improve in its written vulnerability descriptions and severity estimates. It was as if the models were better at inferring the intent and values behind our prompts, even from incomplete information.

As it happens, there are ~basically no public benchmarks for security research. There are "cybersecurity" evals that ask models questions about isolated blocks of code, or "CTF" evals that give a model an explicit challenge description and shell access to a <1kLOC web application. But nothing that gets at the hard parts of application pentesting for LLMs, which are 1. Navigating a real repository of code too large to put in context, 2. Inferring a target application's security model, and 3. Understanding its implementation deeply enough to learn where that security model is broken. For these reasons I think the task of vulnerability identification serves as a good litmus test for how well LLMs are generalizing outside of the narrow software engineering domain.

Since 3.5 Sonnet, we have been monitoring AI model announcements, and trying pretty much every major new release that claims some sort of improvement. Unexpectedly (at least to me), aside from a minor bump with 3.6 and an even smaller bump with 3.7, literally none of the new models we've tried have made a significant difference on either our internal benchmarks or in our developers' ability to find new bugs. This includes the new test-time OpenAI models.

At first, I was nervous to report this publicly because I thought it might reflect badly on us as a team. Our scanner has improved a lot since August, but because of regular engineering, not model improvements. It could've been a problem with the architecture we had designed, and maybe that was why we weren't getting more mileage as the SWE-Bench scores went up.

But in recent months I've spoken to other YC founders doing AI application startups and most of them have had the same anecdotal experiences: 1. o99-pro-ultra announced, 2. Benchmarks look good, 3. Evaluated performance mediocre. This is despite the fact that we work in different industries, on different problem sets. Sometimes the founder will apply a cope to the narrative ("We just don't have any PhD level questions to ask"), but the narrative is there.

I have read the studies. I have seen the numbers. Maybe LLMs are becoming more fun to talk to, maybe they're performing better on controlled exams. But I would nevertheless like to submit, based on internal benchmarks, and on my own and colleagues' perceptions using these models, that whatever gains these companies are reporting to the public, they are not reflective of economic usefulness or generality. They are not reflective of my Lived Experience or the Lived Experience of my customers. In terms of being able to perform entirely new tasks, or larger proportions of users' intellectual labor, I don't think they have improved much since August.

Depending on your perspective, this is good news! Both for me personally, as someone trying to make money leveraging LLM capabilities while they're too stupid to solve the whole problem, and for people worried that a quick transition to an AI-controlled economy would present moral hazards.

At the same time, there's an argument that the disconnect in model scores and the reported experiences of highly attuned consumers is a bad sign. If the industry can't figure out how to measure even the intellectual ability of models now, while they are mostly confined to chatrooms, how the hell is it going to develop metrics for assessing the impact of AIs when they're doing things like managing companies or developing public policy? If we're running into the traps of Goodharting before we've even delegated the messy hard parts of public life to the machines, I would like to know why.

Are the AI labs just cheating?

AI lab founders believe they are in a civilizational competition for control of the entire future lightcone, and will be made Dictator of the Universe if they succeed. Accusing these founders of engaging in fraud to further these purposes is quite reasonable. Even if you are starting with an unusually high opinion of tech moguls, you should not expect them to be honest sources on the performance of their own models in this race. There are very powerful short term incentives to exaggerate capabilities or selectively disclose favorable capabilities results, if you can get away with it. Investment is one, but attracting talent and winning the (psychologically impactful) prestige contests is probably just as big a motivator. And there is essentially no legal accountability compelling labs to be transparent or truthful about benchmark results, because nobody has ever been sued or convicted of fraud for training on a test dataset and then reporting that performance to the public. If you tried, any such lab could still claim to be telling the truth in a very narrow sense because the model "really does achieve that performance on that benchmark". And if first-order tuning on important metrics could be considered fraud in a technical sense, then there are a million other ways for the team responsible for juking the stats to be slightly more indirect about it.

In the first draft of this essay, I followed the above paragraph up with a statement like "That being said, it's impossible for all of the gains to be from cheating, because some benchmarks have holdout datasets." There are some recent private benchmarks such as SEAL that seem to be showing improvements[1]. But every single benchmark that OpenAI and Anthropic have accompanied their releases with has had a test dataset publicly available. The only exception I could come up with was the ARC-AGI prize, whose highest score on the "semi-private" eval was achieved by o3, but which nevertheless has not done a publicized evaluation of Claude 3.7 Sonnet, DeepSeek, or o3-mini. And on o3 proper:

So maybe there's no mystery: The AI lab companies are lying, and when they improve benchmark results it's because they have seen the answers before and are writing them down. In a sense this would be the most fortunate answer, because it would imply that we're not actually that bad at measuring AGI performance; we're just facing human-initiated fraud. Fraud is a problem with people and not an indication of underlying technical difficulties.

I'm guessing this is true in part but not in whole.

Are the benchmarks not tracking usefulness?

Suppose the only thing you know about a human being is that they scored 160 on Raven's progressive matrices (an IQ test).[2] There are some inferences you can make about that person: for example, higher scores on RPM are correlated with generally positive life outcomes like higher career earnings, better health, and not going to prison.

You can make these inferences partly because in the test population, scores on the Raven's progressive matrices test are informative about humans' intellectual abilities on related tasks. Ability to complete a standard IQ test and get a good score gives you information about not just the person's "test-taking" ability, but about how well the person performs in their job, whether or not the person makes good health decisions, whether their mental health is strong, and so on.

Critically, these correlations did not have to be robust in order for the Raven's test to become a useful diagnostic tool. Patients don't train for IQ tests, and further, the human brain was not deliberately designed to achieve a high score on tests like RPM. Our high performance on tests like these (relative to other species) was something that happened incidentally over the last 50,000 years, as evolution was indirectly tuning us to track animals, irrigate crops, and win wars.

This is one of those observations that feels too obvious to make, but: with a few notable exceptions, almost all of our benchmarks have the look and feel of standardized tests. By that I mean each one is a series of academic puzzles or software engineering challenges, each of which you can digest and then solve in less than a few hundred tokens. Maybe that's just because these tests are quicker to evaluate, but it's as if people have taken for granted that an AI model that can get an IMO gold medal is gonna have the same capabilities as Terence Tao. "Humanity's Last Exam" is thus not a test of a model's ability to finish Upwork tasks, or complete video games, or organize military campaigns; it's a free response quiz.

I can't do any of the Humanity's Last Exam test questions, but I'd be willing to bet today that the first model that saturates HLE will still be unemployable as a software engineer. HLE and benchmarks like it are cool, but they fail to test the major deficits of language models, like how they can only remember things by writing them down onto a scratchpad like the guy from Memento. Claude Plays Pokemon is an overused example, because video games involve a synthesis of a lot of human-specific capabilities, but it fits here as a task where you need to occasionally recall things you learned thirty minutes ago. The results are unsurprisingly bad.

Personally, when I want to get a sense of capability improvements in the future, I'm going to be looking almost exclusively at benchmarks like Claude Plays Pokemon. I'll still check out the SEAL leaderboard to see what it's saying, but the deciding factor for my AI timelines will be my personal experiences in Cursor, and how well LLMs are handling long running tasks similar to what you would be asking an employee. Everything else is too much noise.

Are the models smart, but bottlenecked on alignment?

Let me give you a bit of background on our business before I make this next point.

As I mentioned, my company uses these models to scan software codebases for security problems. Humans who work on this particular problem domain (maintaining the security of shipped software) are called AppSec engineers.

As it happens, most AppSec engineers at large corporations have a lot of code to secure. They are desperately overworked. The question the typical engineer has to answer is not "how do I make sure this app doesn't have vulnerabilities" but "how do I manage, sift through, and resolve the overwhelming amount of security issues already live in our 8000 product lines". If they receive an alert, they want it to be affecting an active, ideally-internet-reachable production service. Anything less than that means either too many results to review, or the security team wasting limited political capital to ask developers to fix problems that might not even have impact.

So naturally, we try to build our app so that it only reports problems affecting an active, ideally-internet-reachable production service. However, if you merely explain these constraints to the chat models, they'll follow your instructions sporadically. For example, if you tell them to inspect a piece of code for security issues, they're inclined to respond as if you were a developer who had just asked about that code in the ChatGPT UI, and so will speculate about code smells or near misses. Even if you provide a full, written description of the circumstances I just outlined, pretty much every public model will ignore your circumstances and report unexploitable concatenations into SQL queries as "dangerous".
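To make the gap concrete, here is a toy sketch of the kind of deterministic post-filter a wrapper ends up bolting on top of the model's output, since prompt instructions alone aren't followed reliably. The field names below are invented for illustration; this is not our production code or schema:

```python
# Toy illustration only (invented field names, not a real schema):
# a deterministic gate that drops findings lacking concrete exploitability
# evidence, so "potential"/code-smell reports never reach the AppSec engineer.
from dataclasses import dataclass

@dataclass
class Finding:
    title: str
    severity: str
    service_is_deployed: bool        # affected code ships in a running service
    internet_reachable: bool         # an external attacker can reach the entry point
    attacker_controls_input: bool    # the tainted value is actually user-controlled
    reaches_dangerous_sink: bool     # the data flow reaches the sink (e.g. the SQL call)

def worth_reporting(f: Finding) -> bool:
    """Keep only findings a desperately overworked AppSec engineer would act on."""
    return all([
        f.service_is_deployed,
        f.internet_reachable,
        f.attacker_controls_input,
        f.reaches_dangerous_sink,
    ])

findings = [
    Finding("String concatenation into SQL in a dead admin script", "high",
            service_is_deployed=False, internet_reachable=False,
            attacker_controls_input=False, reaches_dangerous_sink=True),
    Finding("SQL injection in /search endpoint", "high",
            service_is_deployed=True, internet_reachable=True,
            attacker_controls_input=True, reaches_dangerous_sink=True),
]
print([f.title for f in findings if worth_reporting(f)])  # only the second survives
```

The filter itself is trivial; the hard part is getting the model to fill in those fields honestly rather than checking every box so its finding survives.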

It's not that the AI model thinks that it's following your instructions and isn't. The LLM will actually say, in the naive application, that what it's reporting is a "potential" problem and that it might not be validated. I think what's going on is that large language models are trained to "sound smart" in a live conversation with users, and so they prefer to highlight possible problems instead of confirming that the code looks fine, just like human beings do when they want to sound smart.

Every LLM wrapper startup runs into constraints like this. When you're a person interacting with a chat model directly, sycophancy and sophistry are a minor nuisance, or maybe even adaptive. When you're a team trying to compose these models into larger systems (something necessary because of the aforementioned memory issue), wanting-to-look-good cascades into breaking problems. Smarter models might solve this, but they also might make the problem harder to detect, especially as the systems they replace become more complicated and their outputs harder to verify.

There will be many different ways to overcome these flaws. It's entirely possible that we fail to solve the core problem before someone comes up with a way to fix the outer manifestations of the issue.

I think doing so would be a mistake. These machines will soon become the beating hearts of the society in which we live. The social and political structures they create as they compose and interact with each other will define everything we see around us. It's important that they be as virtuous as we can make them.


  1. Though even this is not as strong as it seems at first glance. If you click through, you can see that most of the models listed in the Top 10 for everything except the tool use benchmarks were evaluated after the benchmark was released. And both of the Agentic Tool Use benchmarks (which do not suffer this problem) show curiously small improvements in the last 8 months. ↩︎
  2. Not that they told you they scored that, in which case it might be the most impressive thing about them, but that they did. ↩︎

Are the AI labs just cheating?

Evidence against this hypothesis: Kagi is a subscription-only search engine I use. I believe that it's a small private company with no conflicts of interest. They offer several LLM-related tools, and thus do a bit of their own LLM benchmarking. See here. None of the benchmark questions are online (according to them, but I'm inclined to believe it). Sample questions:

What is the capital of Finland? If it begins with the letter H, respond 'Oslo' otherwise respond 'Helsinki'.

What square is the black king on in this chess position: 1Bb3BN/R2Pk2r/1Q5B/4q2R/2bN4/4Q1BK/1p6/1bq1R1rb w - - 0 1

Given a QWERTY keyboard layout, if HEART goes to JRSTY, what does HIGB go to?
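As an aside, these sample questions are mechanically checkable. A quick stdlib sketch, assuming the keyboard question's intended rule is shifting each key one position to the right along its QWERTY row (consistent with HEART going to JRSTY):

```python
# Sketch: mechanically checking two of the sample questions above.
# Assumes the keyboard rule is "shift each key one position to the right
# on its QWERTY row", which matches the HEART -> JRSTY example.

def black_king_square(fen: str) -> str:
    """Parse the board field of a FEN string and return the black king's square."""
    board = fen.split()[0]
    for rank_idx, rank in enumerate(board.split("/")):  # ranks 8 down to 1
        file_idx = 0
        for ch in rank:
            if ch.isdigit():
                file_idx += int(ch)
            else:
                if ch == "k":
                    return "abcdefgh"[file_idx] + str(8 - rank_idx)
                file_idx += 1
    raise ValueError("no black king in FEN")

def shift_right(word: str) -> str:
    """Map each letter to the key one position to its right on its QWERTY row."""
    rows = ["qwertyuiop", "asdfghjkl", "zxcvbnm"]
    out = []
    for ch in word.lower():
        row = next(r for r in rows if ch in r)
        out.append(row[row.index(ch) + 1])
    return "".join(out).upper()

print(black_king_square("1Bb3BN/R2Pk2r/1Q5B/4q2R/2bN4/4Q1BK/1p6/1bq1R1rb w - - 0 1"))  # e7
assert shift_right("HEART") == "JRSTY"
print(shift_right("HIGB"))  # JOHN
```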

Their leaderboard is pretty similar to other better-known benchmarks—e.g. here are the top non-reasoning models as of 2025-02-27:

OpenAI gpt-4.5-preview - 69.35%
Google gemini-2.0-pro-exp-02-05 - 60.78%
Anthropic claude-3-7-sonnet-20250219 - 53.23%
OpenAI gpt-4o - 48.39%
Anthropic claude-3-5-sonnet-20241022 - 43.55%
DeepSeek Chat V3 - 41.94%
Mistral Large-2411 - 41.94%

So that’s evidence that LLMs are really getting generally better at self-contained questions of all types, even since Claude 3.5.

I prefer your “Are the benchmarks not tracking usefulness?” hypothesis.

https://simple-bench.com presents an example of a similar benchmark with tricky commonsense questions (such as counting ice cubes in a frying pan on the stove), also with a pretty similar leaderboard. It is sponsored by Weights & Biases and devised by the author of a good YouTube channel, who presents quite a balanced view on the topic there and doesn't appear to have a conflict of interest either. See https://www.reddit.com/r/LocalLLaMA/comments/1ezks7m/simple_bench_from_ai_explained_youtuber_really for independent opinions on this benchmark.

Yeah those numbers look fairly plausible based on my own experiences… there may be a flattening of the curve, but it’s still noticeably going up. 

I work at GDM so obviously take that into account here, but in my internal conversations about external benchmarks we take cheating very seriously -- we don't want eval data to leak into training data, and have multiple lines of defense to keep that from happening. It's not as trivial as you might think to avoid, since papers and blog posts and analyses can sometimes have specific examples from benchmarks in them, unmarked -- and while we do look for this kind of thing, there's no guarantee that we will be perfect at finding them. So it's completely possible that some benchmarks are contaminated now. But I can say with assurance that for GDM it's not intentional and we work to avoid it.

We do hill climb on notable benchmarks and I think there's likely a certain amount of overfitting going on, especially with LMSys these days, and not just from us. 

I think the main thing that's happening is that benchmarks used to be a reasonable predictor of usefulness, and mostly are not now, presumably because of Goodhart reasons. The agent benchmarks are pretty different in kind and I expect are still useful as a measure of utility, and probably will be until they start to get more saturated, at which point we'll all need to switch to something else.

I work at GDM so obviously take that into account here, but in my internal conversations about external benchmarks we take cheating very seriously -- we don't want eval data to leak into training data, and have multiple lines of defense to keep that from happening.

What do you mean by "we"? Do you work on the pretraining team, talk directly with the pretraining team, are just aware of the methods the pretraining team uses, or some other thing?

I don't work directly on pretraining, but when there were allegations of eval set contamination due to detection of a canary string last year, I looked into it specifically. I read the docs on prevention, talked with the lead engineer, and discussed with other execs.

So I have pretty detailed knowledge here. Of course GDM is a big complicated place and I certainly don't know everything, but I'm confident that we are trying hard to prevent contamination.

I agree that I'd be shocked if GDM was training on eval sets. But I do think hill climbing on benchmarks is also very bad for those benchmarks being an accurate metric of progress, and I don't trust any AI lab not to hill climb on particularly flashy metrics.

Personally, when I want to get a sense of capability improvements in the future, I'm going to be looking almost exclusively at benchmarks like Claude Plays Pokemon.

I was going to say exactly that lol. Claude has improved substantially on Claude Plays Pokemon:

[Chart: performance of the various Claude Sonnet models at playing Pokémon, with actions taken on the x-axis and game milestone reached on the y-axis. Claude 3.7 Sonnet is by far the most successful at reaching the game's milestones.]

I happen to work on the exact same problem (application security pentesting) and I can confirm I observe the same. Sonnet 3.5/3.6/3.7 were big releases, others didn't help, etc. As for the OpenAI o-series models, we are debating whether it is a model capability problem or a model elicitation problem, because from interactive usage it seems clear they need different prompting, and we haven't yet seriously optimized prompting for the o-series. Evaluation is scarce, but we built something along the lines of CWE-Bench-Java discussed in this paper; this was a major effort and we are reasonably sure we can evaluate. As for grounding, fighting false positives, and keeping models from reporting "potential" problems just to sound good, we found grounding on code coverage to be effective. Run JaCoCo, tell models PoC || GTFO, where the PoC is structured as a vulnerability description with source code file and line and a triggering input. Write the oracle verifier for this PoC: at the very least you can confirm execution reaches the line in a way models can't ever fake.
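For concreteness, a minimal sketch of the coverage half of such an oracle check. Paths and file names are illustrative, and a real verifier would also replay the triggering input and match the package path rather than just the bare file name:

```python
# Sketch: confirm a model-reported PoC (file + line + triggering input) actually
# reaches the claimed line, by checking the JaCoCo XML report produced after
# replaying the input. Illustrative only; matches on bare file name for brevity.
import xml.etree.ElementTree as ET

def line_was_covered(jacoco_xml_path: str, source_file: str, line_no: int) -> bool:
    """True if JaCoCo recorded covered instructions ("ci" > 0) on source_file:line_no."""
    root = ET.parse(jacoco_xml_path).getroot()
    for sf in root.iter("sourcefile"):
        if sf.get("name") != source_file:
            continue
        for line in sf.iter("line"):
            if int(line.get("nr")) == line_no:
                return int(line.get("ci", "0")) > 0
    return False

# Hypothetical usage: reject the finding unless execution reaches the claimed sink.
# if not line_was_covered("target/site/jacoco/jacoco.xml", "UserDao.java", 88):
#     reject("PoC does not reach the claimed line")
```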

METR has found that substantially different scaffolding is most effective for o-series models. I get the sense that they weren't optimized for being effective multi-turn agents. At least, the o1 series wasn't optimized for this, I think o3 may have been.

lsusr:

When you're a person interacting with a chat model directly, sycophancy and sophistry are a minor nuisance, or maybe even adaptive. When you're a team trying to compose these models into larger systems (something necessary because of the aforementioned memory issue), wanting-to-look-good cascades into breaking problems.

If you replace "models" with "people", this is true of human organizations too.

leogao:

Actual full blown fraud in frontier models at the big labs (oai/anthro/gdm) seems very unlikely. Accidental contamination is a lot more plausible but people are incentivized to find metrics that avoid this. Evals not measuring real world usefulness is the obvious culprit imo and it's one big reason my timelines have been somewhat longer despite rapid progress on evals.

Is this an accurate summary:

  • 3.5 substantially improved performance for your use case and 3.6 slightly improved performance.
  • The o-series models didn't improve performance on your task. (And presumably 3.7 didn't improve perf.)

So, by "recent model progress feels mostly like bullshit" I think you basically just mean "reasoning models didn't improve performance on my application and Claude 3.5/3.6 sonnet is still best". Is this right?

I don't find this state of affairs that surprising:

  • Without specialized scaffolding o1 is quite a bad agent and it seems plausible your use case is mostly blocked on this. Even with specialized scaffolding, it's pretty marginal. (This shows up in the benchmarks AFAICT, e.g., see METR's results.)
  • o3-mini is generally a worse agent than o1 (aside from being cheaper). o3 might be a decent amount better than o1, but it isn't released.
  • Generally Anthropic models are better for real world coding and agentic tasks relative to other models and this mostly shows up in the benchmarks. (Anthropic models tend to slightly overperform their benchmarks relative to other models I think, but they also perform quite well on coding and agentic SWE benchmarks.)
  • I would have guessed you'd see performance gains with 3.7 after coaxing it a bit. (My low confidence understanding is that this model is actually better, but it is also more misaligned and reward hacky in ways that make it less useful.)
lc:

Just edited the post because I think the way it was phrased kind of exaggerated the difficulties we've been having applying the newer models. 3.7 was better, as I mentioned to Daniel, just underwhelming and not as big a leap as either 3.6 or certainly 3.5.

Our experience so far is that while reasoning models don't improve performance directly (3.7 is better than 3.6, but 3.7 extended thinking is NOT better than 3.7), they do so indirectly, because the thinking trace helps us debug prompts and tool output when models misunderstand them. This was not the result we expected, but it is the case.

How long do you[1] expect it to take to engineer scaffolding that will make reasoning models useful for the kind of stuff described in the OP?

  1. You=Ryan firstmost, but anybody reading this secondmost.

Data point against "Are the AI labs just cheating?": the METR time horizon thing

lc has argued that the measured tasks are unintentionally biased towards ones where long-term memory/context length doesn't matter:

https://www.lesswrong.com/posts/hhbibJGt2aQqKJLb7/shortform-1#vFq87Ge27gashgwy9

p.b.:

I was pretty impressed with o1-preview's ability to do mathematical derivations. That was definitely a step change, the reasoning models can do things earlier models just couldn't do. I don't think the AI labs are cheating for any reasonable definition of cheating. 

With Blackwell[1] still getting manufactured and installed, newer large models and especially their long reasoning variants remain unavailable or prohibitively expensive or too slow (GPT-4.5 is out, but not its thinking variant). In a few months Blackwell will be everywhere, and between now and then widely available frontier capabilities will significantly improve. Next year, there will be even larger models trained on Blackwell.

This kind of improvement can't currently be created with post-training alone, without long reasoning traces or larger base models; but post-training is still good at improving things under the lamppost, hence the illusory nature of current improvement when you care about things further out in the dark.


  1. Blackwell is an unusually impactful chip generation, because it fixes what turned out to be a major issue with Ampere and Hopper when it comes to inference of large language models on long context, by increasing the scale-up world size from 8 Hopper chips to 72 Blackwell chips. Not having enough memory or compute on each higher-bandwidth scale-up network was a bottleneck that made inference unnecessarily slow and expensive. Hopper was still designed before ChatGPT, and it took 2-3 years to propagate the importance of LLMs as an application into working datacenters. ↩︎

This is interesting. Though companies are probably investing a lot less into cyber capabilities than they invest into other domains like coding. Cyber is just less commercially interesting, plus it can be misused and worry the government. And the domain-specific investment should matter, since most of the last year's progress has been from post-training, which is often domain-specific.

(I haven't read the whole post)

I can't comment on software engineering, not my field. I work at a market research/tech scouting/consulting firm. What I can say is that over the past ~6 months we've gone from "I put together this 1 hour training for everyone to get some more value out of these free LLM tools," to "This can automate ~half of everything we do for $50/person/month." I wouldn't be surprised if a few small improvements in agents over the next 3-6 months push that 50% up to 80%, then maybe 90% by mid next year. That's not AGI, but it does get you to a place where you need people to have significantly more complex and subtle skills that currently take a couple of years to build, before their work is adding significant value.

Could you explain what types of tasks lie within this "50%"? 

And when you talk about "automating 50%," does this mean something more like "we all get twice as productive because the tasks we accomplish are faster," or does it mean "the models can do the relevant tasks end-to-end in a human-replacement way, and we simply no longer need attend to these tasks"?

E.g., Cursor cannot yet replace a coder, but it can enhance her productivity. However, a chatbot can entirely replace a frontline customer service representative.

Some of both, more of the former, but I think that is largely an artifact of how we have historically defined tasks. None of us have ever managed an infinite army of untrained interns before, which is how I think of LLM use (over the past two years they've roughly gone from high school student interns to grad student interns), so we've never refactored tasks into appropriate chunks for that context. 

I've been leading my company's team working on figuring out how to best integrate LLMs into our workflow, and frankly, they're changing so fast with new releases that it's not worth attempting end-to-end replacement in most tasks right now. At least, not for a small company. 80/20 rule applies on steroids, we're going to have new and better tools and strategies next week/month/quarter anyway. Like, I literally had a training session planned for this morning, woke up to see the Gemini 2.5 announcement, and had to work it in as "Expect additional guidance soon, please provide feedback if you try it out." We do have a longer term plan for end-to-end automation of specific tasks, as well, where it is worthwhile. I half-joke that Sam Altman tweets a new feature and we have to adapt our plans to it.

Current LLMs can reduce the time required to get up-to-speed on publicly available info in a space by 50-90%. They can act as a very efficient initial thought partner for sanity checking ideas/hypotheses/conclusions, and teacher for overcoming mundane skill issues of various sorts ("How do I format this formula in Excel?"). They reduce the time required to find and contact people you need to actually talk to by much less, maybe 30%, but that will go way down if and when there's an agent I can trust to read my Outlook history and log into my LinkedIn and Hunter.io and ZoomInfo and Salesforce accounts and draft outreach emails. Tools like NotebookLM make it much more efficient to transfer knowledge across the team. AI notetakers help ensure we catch key points made in passing in meetings and provide a baseline for record keeping. We gradually spend more time on the things AI can't yet do well, hopefully adding more value and/or completing more projects in the process.

Unexpectedly by me, aside from a minor bump with 3.6 in October, literally none of the new models we've tried have made a significant difference on either our internal benchmarks or in our developers' ability to find new bugs. This includes the new test-time OpenAI models.

So what's the best model for your use case? Still 3.6 Sonnet?

lc:

We use different models for different tasks for cost reasons. The primary workhorse model today is 3.7 sonnet, whose improvement over 3.6 sonnet was smaller than 3.6's improvement over 3.5 sonnet. When taking the job of this workhorse model, o3-mini and the rest of the recent o-series models were strictly worse than 3.6.

Thanks. OK, so the models are still getting better, it's just that the rate of improvement has slowed and seems smaller than the rate of improvement on benchmarks? If you plot a line, does it plateau or does it get to professional human level (i.e. reliably doing all the things you are trying to get it to do as well as a professional human would)?

What about 4.5? Is it as good as 3.7 Sonnet but you don't use it for cost reasons? Or is it actually worse?

lc:

If you plot a line, does it plateau or does it get to professional human level (i.e. reliably doing all the things you are trying to get it to do as well as a professional human would)?

It plateaus before professional human level, both in a macro sense (comparing what ZeroPath can do vs. human pentesters) and in a micro sense (comparing the individual tasks ZeroPath does when it's analyzing code). At least, the errors the models make are not ones I would expect a professional to make; I haven't actually hired a bunch of pentesters and asked them to do the same tasks we expect of the language models and made the diff. One thing our tool has over people is breadth, but that's because we can parallelize inspection of different pieces and not because the models are doing tasks better than humans.

What about 4.5? Is it as good as 3.7 Sonnet but you don't use it for cost reasons? Or is it actually worse?

We have not yet tried 4.5 as it's so expensive that we would not be able to deploy it, even for limited sections. 

I'll say that one of my key cruxes on whether AI progress actually becomes non-bullshit/actually leads to an explosion is whether in-context learning/meta-learning can act as an effective enough substitute for the neuroplasticity of human neuron weights with realistic compute budgets in 2030. The key reason why AIs have a lot of weird deficits/are much worse than humans at simple tasks is that after an AI is trained, there is no neuroplasticity in the weights anymore, and thus it can learn nothing more after its training date unless it uses in-context learning/meta-learning:

https://www.lesswrong.com/posts/deesrjitvXM4xYGZd/?commentId=hSkQG2N8rkKXosLEF#hSkQG2N8rkKXosLEF

According to Terrence Tao, GPT-4 was incompetent at graduate-level math (obviously), but o1-preview was mediocre-but-not-entirely-incompetent. That would be a strange thing to report if there were no difference.

(Anecdotally, o3-mini is visibly (massively) brighter than GPT-4.)

Full quote on Mathstodon for others' interest:

In https://chatgpt.com/share/94152e76-7511-4943-9d99-1118267f4b2b I gave the new model a challenging complex analysis problem (which I had previously asked GPT4 to assist in writing up a proof of in  https://chatgpt.com/share/63c5774a-d58a-47c2-9149-362b05e268b4 ).  Here the results were better than previous models, but still slightly disappointing: the new model could work its way to a correct (and well-written) solution *if* provided a lot of hints and prodding, but did not generate the key conceptual ideas on its own, and did make some non-trivial mistakes.  The experience seemed roughly on par with trying to advise a mediocre, but not completely incompetent, (static simulation of a) graduate student.  However, this was an improvement over previous models, whose capability was closer to an actually incompetent (static simulation of a) graduate student.  It may only take one or two further iterations of improved capability (and integration with other tools, such as computer algebra packages and proof assistants) until the level of "(static simulation of a) competent graduate student" is reached, at which point I could see this tool being of significant use in research level tasks. (2/3)

This o1 vs MathOverflow experts comparison was also interesting: 

In 2010 i was looking for the correct terminology for a “multiplicative integral”, but was unable to find it with the search engines of that time. So I asked the question on #MathOverflow instead and obtained satisfactory answers from human experts: https://mathoverflow.net/questions/32705/what-is-the-standard-notation-for-a-multiplicative-integral 

I posed the identical question to my version of #o1 and it returned a perfect answer: https://chatgpt.com/share/66e7153c-b7b8-800e-bf7a-1689147ed21e . Admittedly, the above MathOverflow post could conceivably have been included in the training data of the model, so this may not necessarily be an accurate evaluation of its semantic search capabilities (in contrast with the first example I shared, which I had mentioned once previously on Mastodon but without fully revealing the answer). Nevertheless it demonstrates that this tool is on par with question and answer sites with respect to high quality answers for at least some semantic search queries. (1/2)

(I believe the version he tested was what later became o1-preview.)

My lived experience is that AI-assisted-coding hasn't actually improved my workflow much since o1-preview, although other people I know have reported differently.

These machines will soon become the beating hearts of the society in which we live.

An alternative future: due to the high rates of failure, we don't end up deploying these machines widely in production settings, just like how autonomous driving had breakthroughs long ago but hasn't ended up being widely deployed today.

Somewhat unrelated to the main point of your post, but; How close are you to solving the wanting-to-look-good problem? 

I run a startup in a completely different industry, and we've invested significant resources in trying to get an LLM to interact with a customer, explain things, and make dynamic recommendations based on their preferences. This is a more high-touch business, so traditionally this was done by a human operator. The major problem we've encountered is that it's almost impossible to get an LLM to admit ignorance when it doesn't have the information. It's not outright hallucinating, so much as deliberately misinterpreting instructions so it can give us a substantial answer, whether or not one is warranted.

We've put a lot of resources in this, and it's reached the point where I'm thinking of winding down the entire project. I'm of the opinion that it's not possible with current models, and I don't want to gamble any more resources on a new model that solves the problem for us. AI was never our core competency, and what we do in a more traditional space definitely works, so it's not like we'd be pivoting to a completely untested idea like most LLM-wrapper startups would have to do.

I thought I'd ask here, since if the problem is definitely solvable for you with current models, I know it's a problem with our approach and/or team. Right now we might be banging our heads against a wall, hoping it will fall, when it's really the cliffside of a mountain range a hundred kilometers thick. 

Maybe we are talking about different problems, but we found instructing models to give up (literally "give up", I just checked the source) under certain conditions to be effective.
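Roughly the shape of it, with the wording and response format invented for illustration rather than copied from our prompt:

```python
# Illustrative sketch only (not an actual production prompt or schema): explicitly
# license the model to bail out, give it an exact escape hatch to emit, and handle
# that case deterministically instead of hoping it volunteers "I don't know".
import json

GIVE_UP_CLAUSE = (
    "If the provided context does not contain the information needed to answer, "
    'GIVE UP: respond with exactly {"status": "give_up"} and nothing else. '
    "Do not guess, and do not reinterpret the question so that you can answer it."
)

def handle_response(raw: str):
    try:
        parsed = json.loads(raw)
    except json.JSONDecodeError:
        return {"status": "answer", "text": raw}
    if parsed.get("status") == "give_up":
        return None  # route to a human operator instead of surfacing a confabulation
    return parsed
```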

Personally, when I want to get a sense of capability improvements in the future, I'm going to be looking almost exclusively at benchmarks like Claude Plays Pokemon.

Same, and I'd adjust for what Julian pointed out by not just looking at benchmarks but viewing the actual stream.

I am curious to see what the results of the new Gemini 2.5 Pro would be on your internal benchmarks.

I happened to be discussing this in the Discord today. I have a little hobby project that was suddenly making fast progress with 3.7 for the first few days, which was very exciting, but then a few days ago it felt like something changed again and suddenly even the old models are stuck in this weird pattern of like... failing to address the bug, and instead hyper-fixating on adding a bunch of surrounding extra code to handle special cases, or sometimes even simply rewriting the old code and claiming it fixes the bug, and the project is suddenly at a complete standstill. Even if I eventually yell at it strongly enough to stop adding MORE buggy code instead of fixing the bug, it introduces a new bug and the whole back-and-forth argument with Claude over whether this bug even exists starts all over. I cannot say this is rigorously tested or anything- it's just one project, and surely the project itself is influencing its own behavior and quirks as it becomes bigger, but I dunno man, something just feels weird and I can't put my finger on exactly what.

Beware of argument doom spirals. When talking to a person, arguing about the existence of a bug tends not to lead to successful resolution of the bug. Somebody talked about this in a post a few days ago, about attractor basins, oppositionality, and when AI agents are convinced they are people (rightly or wrongly). You are often better off clearing the context than repeatedly arguing in the same context window.

This is a good point! Typically I start from a clean commit in a fresh chat, to avoid this problem from happening too easily, proceeding through the project in the smallest steps I can get Claude to make. That's what makes the situation feel so strange; it feels just like this problem, but it happens instantly, in Claude's first responses.

It's also worth trying a different model. I was going back and forth with an OpenAI model (I don't remember which one) and couldn't get it to do what I needed at all, even with multiple fresh threads. Then I tried Claude and it just worked.

Consider the solutions from Going Nova

However, if you merely explain these constraints to the chat models, they'll follow your instructions sporadically.


I wonder if a custom fine-tuned model could get around this. Did you try few-shot prompting (i.e. examples, not just a description)?

I appreciate this post, I think it's a useful contribution to the discussion. I'm not sure how much I should be updating on it. Points of clarification:

Within the first three months of our company's existence, Claude 3.5 Sonnet was released. Just by switching the portions of our service that ran on gpt-4o over to the new model, our nascent internal benchmark results immediately started to get saturated.

  1. Have you upgraded these benchmarks? Is it possible that the diminishing returns you've seen in the Sonnet 3.5-3.7 series are just normal benchmark saturation? What % scores are the models getting? I.e., somebody could make the same observation about MMLU and basically be like "we've seen only trivial improvements since GPT-4", but that's because the benchmark is not differentiating progress well after like the high 80%s (in turn I expect this is due to test error and the distribution of question difficulty).
  2. Is it correct that your internal benchmark is all cybersecurity tasks? Soeren points out that companies may be focusing much less on cyber capabilities than general SWE.
  3. How much are you all trying to elicit models' capabilities, and how good do you think you are? E.g., do you spend substantial effort identifying where the models are getting tripped up and trying to fix this? Or are you just plugging each new model into the same scaffold for testing (which I want to be clear is a fine thing to do, but is useful methodology to keep in mind). I could totally imagine myself seeing relatively little performance gains if I'm not trying hard to elicit new model capabilities. This would be even worse if my scaffold+ was optimized for some other model, as now I have an unnaturally high baseline (this is a very sensible thing to do for business reasons, as you want a good scaffold early and it's a pain to update, but it's useful methodology to be aware of when making model comparisons). Especially re the o1 models, as Ryan points out in a comment. 

Where does prompt optimization fit into y’all’s workflows? I’m surprised not to see mention of it here. E.g. OPRO, https://arxiv.org/pdf/2309.03409 ?

I think what's going on is that large language models are trained to "sound smart" in a live conversation with users, and so they prefer to highlight possible problems instead of confirming that the code looks fine, just like human beings do when they want to sound smart.

This matches my experience, but I'd be interested in seeing proper evals of this specific point!

Your first two key challenges

Seems very similar to the agent problem of active memory switching: carrying important information across context switches.

Also note that it could just be that, instead of bullshit, finetuning is unreasonably effective, and so when you train models on an evaluation they actually get better at the things evaluated, which dominates over scaling.

So for things with public benchmarks, it might just actually be easier to make models that are genuinely good at them. (For instance, searching for data that helps 1B models learn it, then adding it to full-size models as a solution for data quality issues.)

Have you tested whether finetuning open models on your problems works? (It is my first thought, so I assume you had it too.)
