You can read a PR and tell if it actually accomplishes what it says it does, right?
Mostly I can't, not if there are subtle issues. Certainly I can look and see if any bugs jump out at me, or if any areas look suspicious, but understanding a piece of code I didn't write deeply enough to execute it in my head usually takes longer than writing it myself.
What I can do is read a set of clearly-written functional or end-to-end tests, and see if they look like they should exercise the code written in the PR, and whether the assertions they make are the ones I'd expect, and whether there are any obvious cases that are missing. And, of course, I can look at CI and see whether said tests have passed.
In my experience, semgrep does not play well with trying to find cross-class behavior in dynamically typed codebases with lots of dependency injection, which is why I was trying to get Claude to write some code which combined static analysis (in the form of reflection or AST parsing) with runtime logic for gathering information which is hard to determine statically but easy to determine at runtime.
For reference, the code I ended up writing for this part was about 40 lines; it wasn't very complicated. Trying to do it in full generality purely by static analysis would be insanely complex (because PHP has terrible constructs like $$foo = "bar" and $foo->$bar = "baz", which this codebase doesn't use and can be trivially verified not to use, but which would be a nightmare to handle if they were used), but fortunately that wasn't what I needed.
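To make "trivially verified not to use" concrete, here's a minimal sketch of that kind of check (not my actual script; it assumes nikic/php-parser v5 is installed and that application code lives under app/):

```php
<?php
// Sketch: flag dynamic variables ($$foo) and dynamic property accesses
// ($foo->$bar), the constructs that would make purely static analysis a
// nightmare. If this prints nothing, the simpler analysis can be trusted.

require __DIR__ . '/vendor/autoload.php';

use PhpParser\Node;
use PhpParser\NodeTraverser;
use PhpParser\NodeVisitorAbstract;
use PhpParser\ParserFactory;

$visitor = new class extends NodeVisitorAbstract {
    public string $file = '';

    public function enterNode(Node $node)
    {
        // $$foo: a variable whose name is itself an expression, not a string.
        if ($node instanceof Node\Expr\Variable && !is_string($node->name)) {
            echo "{$this->file}:{$node->getStartLine()} dynamic variable\n";
        }
        // $foo->$bar / Foo::$$bar: property name is an expression, not an identifier.
        if (($node instanceof Node\Expr\PropertyFetch || $node instanceof Node\Expr\StaticPropertyFetch)
            && $node->name instanceof Node\Expr
        ) {
            echo "{$this->file}:{$node->getStartLine()} dynamic property access\n";
        }
        return null;
    }
};

$parser = (new ParserFactory())->createForHostVersion();
$traverser = new NodeTraverser();
$traverser->addVisitor($visitor);

$files = new RegexIterator(
    new RecursiveIteratorIterator(new RecursiveDirectoryIterator(__DIR__ . '/app')),
    '/\.php$/'
);

foreach ($files as $file) {
    $visitor->file = (string) $file;
    $traverser->traverse($parser->parse(file_get_contents((string) $file)));
}
```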
But yeah, I also expected Claude to be able to do this trivially. It trivially does most tasks which feel, to me, about this difficult or even a bit harder. This task felt like it should have been easier, since it's one with a lot of available signal to self-correct if you make a mistake, much more so than many of the "build and test a feature" style tasks that Claude regularly does with no drama. Which is why I thought it would be a good example for a post along the lines of "many people use LLMs to quickly add sloppy features to their codebase, increasing technical debt, but it's also possible to use them to resolve technical debt much faster than doing it by hand". And then I tried it.
I never raced the AI like this in real time, maybe I should try sometime.
I strongly recommend you do. I expect you will have fun doing it, and that you will grow as a developer by doing so whether or not the AI beats you or even succeeds at the task. Even if the AI fails, it will likely use different tools than the ones you would have used, so you'll likely pick up new tricks. Having an AI to race against is also pretty great for staying focused and not getting sucked into rabbit holes - and it also is great for helping you determine after the fact whether a given rabbit hole was necessary to go down (if the AI didn't go down the rabbit hole and successfully completed the task, the rabbit hole was not necessary).
For more context, the project in question was a fairly standard Laravel (PHP) project which makes heavy use of dependency injection, and which I'm looking at serving with Swoole rather than Apache. This involves moving from a shared-nothing architecture, where the framework is spun up in a fresh state for every request, to one where some things can persist from request to request. What I asked for was an exhaustive inventory of all places where this could lead to data leakage from one request to the next, with a failing functional test demonstrating the leakage in as many places as viable. I provided an example of a place with such a leakage (a container-bound singleton with persistent state), instructions for how to run the test, an example of how to run arbitrary code within the framework context, and some examples of previously-built linting scripts which used reflection and AST parsing to identify problematic patterns programmatically.
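The leaky pattern from that example looks roughly like this (hypothetical names, heavily simplified):

```php
<?php
// Hypothetical illustration of the pattern: a singleton whose instance
// property accumulates per-request data. Under Apache's shared-nothing model
// the container is rebuilt for every request, so this is invisible; under a
// long-lived Swoole worker the same instance (and its $cache) survives into
// the next request.

namespace App\Services;

class ExchangeRateCache
{
    /** @var array<string, float> lives as long as the worker, not the request */
    private array $cache = [];

    public function rateFor(string $currency, callable $fetch): float
    {
        // Request A's value (possibly tenant-specific) gets served to request B.
        return $this->cache[$currency] ??= $fetch($currency);
    }
}

// Bound as a singleton in a service provider's register() method:
//     $this->app->singleton(\App\Services\ExchangeRateCache::class);
```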
Claude correctly identified the concrete places one might find such state persisting across requests (global variables, static class variables, singleton services bound to the container, dependencies of said singletons, stateful database connections).
Initially Claude tried to identify potentially problematic patterns it could grep for, which turned up some examples. It then wrote two functional tests (which I later looked at and noticed were basically equivalent to assertFalse(true)), ran them to verify that they were failing, produced a report describing two "CRITICAL" issues with static class variables, declared that the application did not contain any instances of container-bound singletons with persistent state or state that could persist on a database connection between requests, and declared the task completed.
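For contrast, the shape of failing test I was actually after looks more like the sketch below (hypothetical route, reusing the hypothetical ExchangeRateCache from above). Within a single test method, Laravel reuses one booted application across HTTP calls, which is a reasonable stand-in for a persistent Swoole worker:

```php
<?php
// Sketch of a functional test that actually demonstrates the leak, rather
// than asserting something trivially false. All names are hypothetical.

namespace Tests\Feature;

use App\Services\ExchangeRateCache;
use ReflectionProperty;
use Tests\TestCase;

class RequestIsolationTest extends TestCase
{
    public function test_container_singleton_does_not_leak_between_requests(): void
    {
        // First request warms the singleton's internal cache.
        $this->get('/rates/USD')->assertOk();

        // Same booted application, i.e. the same singleton instance "the next
        // request" would see under a persistent worker.
        $prop = new ReflectionProperty(ExchangeRateCache::class, 'cache');
        $prop->setAccessible(true);

        // Fails while the leak exists: request A's rate is still cached.
        $this->assertSame([], $prop->getValue($this->app->make(ExchangeRateCache::class)));
    }
}
```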
I told it that the task was not finished, that I wanted an exhaustive list, and that this would necessarily involve writing and executing code within the framework context, and, again, pointed it at examples of how to do this. I also flagged that its search for container-bound singletons with persistent state should have, at a minimum, caught the example of state leakage from my initial prompt.
Claude then proceeded to write a plausible-looking script to find examples of container-bound singletons, ran it, saved the results to a CSV, wrote a different plausible-looking script to find examples of services with non-DI instance variables, saved those results to a different CSV, then used the csvjoin tool (which was cool, I didn't realize that tool existed) to merge the two results into a third CSV. That third CSV was empty, and Claude again declared that there were in fact no instances of that pattern and that it had successfully completed the task.
I mentioned, again, that it should at a minimum be turning up the example from the initial prompt. Claude went and spun its wheels for quite a while, and came back with several new plausible scripts (by this time it had dropped over 30 different scripts, md files, and CSVs in the repository root, with no organizational scheme to speak of). After quite a few more rounds of this (Claude would run the tools but not sanity-check the outputs, so I had to sanity-check them myself to point out which tool wasn't working right, at which point Claude would write an entirely new version of that tool in the repository root, writing to yet another file), Claude finally had a script which produced a complete list of the few hundred places which could have this pattern, and which would need to be checked individually to see what was happening and whether there were any trends that would allow for pruning that list down further in a programmatic way. This was the point at which I decided to call it a day yesterday.
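For reference, the kind of framework-context inspection I'd had in mind from the start looks roughly like the sketch below (not Claude's script, much simplified, and with the judgement-heavy "is this property state or an injected collaborator?" question reduced to a crude heuristic):

```php
<?php
// Sketch: enumerate container bindings registered as singletons, resolve each
// one inside the framework context, and use reflection to list instance
// properties that hold plain data rather than injected collaborators -- the
// "easy at runtime, hard statically" information. Output is CSV-ish so it can
// be joined against the results of other checks.

require __DIR__ . '/vendor/autoload.php';

$app = require __DIR__ . '/bootstrap/app.php';
$app->make(Illuminate\Contracts\Console\Kernel::class)->bootstrap();

foreach ($app->getBindings() as $abstract => $binding) {
    if (empty($binding['shared'])) {
        continue; // not a singleton; a fresh instance is built per resolution
    }

    try {
        $instance = $app->make($abstract);
    } catch (Throwable $e) {
        fwrite(STDERR, "SKIP {$abstract}: {$e->getMessage()}\n");
        continue;
    }

    if (!is_object($instance)) {
        continue;
    }

    $suspect = [];
    foreach ((new ReflectionObject($instance))->getProperties() as $prop) {
        if ($prop->isStatic()) {
            continue; // static class variables are a separate category of leak
        }
        $prop->setAccessible(true);
        $value = $prop->isInitialized($instance) ? $prop->getValue($instance) : null;
        // Crude heuristic: objects are usually injected services; scalars and
        // arrays are usually state that will persist across requests.
        if (!is_object($value)) {
            $suspect[] = $prop->getName();
        }
    }

    if ($suspect !== []) {
        echo $abstract . ',' . get_class($instance) . ',"' . implode(' ', $suspect) . '"' . "\n";
    }
}
```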
I might resume on Monday, but given how the beginning of the task went, I don't have a lot of hope that Claude Code can finish the task at all.
To me the bottleneck mostly seemed to be that clopus trusted the tools it wrote way too much. When I write a tool like this, I tend to start with an output I know the tool will produce if it's working correctly and a vague sense for how much output the tool will produce, write the first version of the tool, run it, and check the output to make sure it contains the thing I thought it should contain. Claude, on the other hand, seems to follow the algorithm "write tool. run tool. accept results of tool".
Additionally, the product I want at the end of this is an exhaustive list of places to check, and a repeatable way to generate that list. Claude's strength seems to be that it is unreasonably good at one-shotting things - once it has to consider the current outputs of a tool it wrote, the outputs it wants from that tool, and the way it needs to change that tool to get those new outputs, things seem to fall apart.
It looks like Ralph Wiggum is intended for a different use case ("Claude code is successfully completing tasks, but stopping before it finishes with all of them").
Base models know a lot about the world! They merely lack the “handles” to tell us what they know in a way we can easily use. I'm arguing that we can add those handles without sacrificing the statistical integrity of the base models.
This isn't quite what you're looking for, but I will take this opportunity to once again shill my favorite obscure paper: NUDGING: Inference-time Alignment of LLMs via Guided Decoding. The authors tried to answer the question "how many tokens do you have to change in base model outputs to make them perform as well as a post-trained model[1] on benchmarks". They did the simplest thing that could possibly work[2], and found an answer of "<10%".
It's not very clean, but it is sort of a way of giving you "handles" for the base models while mostly letting them do their thing.
In order to generate something from 2030, we also want to be able to invert a metadata query, trying to generate the data from given metadata. We can generate [date], ["infer the text"] -> [text] training examples easily by swapping around training examples of the form [text], ["infer the date"] -> [date], of course. Training to do this at scale should result in decent answers. However, the result may not be probabilistically coherent: especially for new examples, the inferred metadata given a text may not match the metadata used to generate the text.
This is an illustration of the concept of "honesty" that underlies my proposal: if inverting a relation isn't probabilistically coherent, then in some sense the statistical model isn't revealing everything that it knows, and isn't giving you what you expect.
I'm confused about how this proposal would work in practice. Let's say we have date=2026-01-10, task="infer the text: the day of the week and phase of the moon are", text="Saturday, Last Quarter". I don't think we can expect the model to, given task="infer the text: the day of the week and phase of the moon are", text="Saturday, Last Quarter", infer date=2026-01-10. This doesn't particularly seem like a failure of honesty though.
They call post-trained models "aligned models" in the paper, which is a bit grating. I recommend you mentally s/aligned/post-trained/ when reading.
They had the base model try to answer benchmark questions, found the positions where the base model was least certain, and replaced specifically those tokens with tokens from the post-trained model. The base model, so steered, performed surprisingly well on benchmarks. Surprisingly (to me at least), the tokens changed tended to be transitional phrases rather than the meat of the specific problems. Concrete example:
GSM8K question: Jen decides to travel to 3 different countries. He has to pay $400 for the supplies he needs, in total. The tickets for travel cost, in total, 50% more than the supplies. How much does travel cost?
Llama-2-70b, nudged by llama-2-8b-instruct, with replaced tokens in [brackets]:
[To] solve this problem, [we] need to [use] the [information] given to [us step] by step. [Here are] the steps [we can] take. [Step] 1: [Find] the total cost of the supplies. Step 2: Find the total cost of the tickets. Step 3: Find the total cost of the trip. [Let's] start with Step 1. [According] to the question, [the] total cost of the supplies is $400. [So,] we can write [this as: Supplies] = $400 Now, let's move on to Step 2. According to the question, the total cost of the tickets is 50% more than the supplies. [So,] we can write this as: Tickets = 1.5 * Supplies [Tickets] = 1.5 * $400 Tickets = $600 Now, let's move on to Step 3. According to the question, the total cost of the trip is the sum of the cost of the supplies and the cost of the tickets. So, we can write this as: Total [Cost] = Supplies + Tickets Total Cost = $400 + $600 Total Cost = $1000 Therefore, the total cost of the trip is $1000. [So,] the answer to the question is: [Travel cost = $1000]
Today, I needed to work through a substantial project with a lot of drudgery (checking an entire 1M+ LOC codebase for an HTTP API for patterns which could cause state leakage between requests if we made a specific change to the request handling infrastructure). This involved a mix of things which are easy to do programmatically and things which require intelligent judgement, and has a fairly objective desired artifact (a list of all the places where state could leak, and a failing functional test demonstrating that leakage for each one).
I decided to do the John Henry thing - I set up Claude Code (in a container with --dangerously-skip-permissions) in one worktree with a detailed description of the project and the task, and then in a separate worktree I set off with my favorite text editor and without the help of any AI tooling more advanced than Copilot.
I finished about 4 hours later, despite fairly frequent interruptions to provide clarification and further instructions to Claude. Claude is now reaching the 7 hour / 100M token mark and has still not finished, though it has multiple times now declared that it has succeeded at the task and that the codebase is safe for this migration (it's not).
I'm honestly pretty shocked, because this task seemed like a pretty much perfect fit for a coding agent, and is one that doesn't require all that much codebase-specific context. I went into this expecting to lose - I was trying to quantify how much coding agents can help with obnoxious repetitive maintenance tasks, thus allowing maintenance work which might otherwise have been deferred to happen at lower cost. But I guess that's not the post I'm writing today (which is a bummer, I had a whole outline planned out and everything).
Likely this situation will change by next year, but for now I suppose the machine cannot yet replace even the more repetitive parts of my job. Perhaps things are different in ML land but I kind of doubt it.
This suggests a course of action if you work at a company which can have significant positive externalities and cares, during good times, more than zero about them: during those good times, create dashboards and alerts with metrics which correlate with those externalities, to add trivial friction (in the form of "number go down feels bad") to burning the commons during bad times.
If something LoRA-shaped usefully cracks continual learning, a lot of things in general are going to get very crazy very quickly.
If continual learning is cracked before jailbreak resistance, and the deployment model of "the same weights are used for inference for all customers" holds, the world of corporate espionage is going to get wild.
Right now, you need to be careful not to combine sensitive information, untrusted external content, AND a method of sending arbitrary data to the outside world in a single context window, since the LLM might be tricked by the external content. Any two of those, however, are fine.
If (sample-efficient) continual learning is cracked, and models are still shared across multiple customers, you will need to be sure to never share sensitive information with a model that will learn that information and then be available for your competitors to do inference on, OR fully trust any model that has learned from possibly-adversarial-to-you data.
And if continual learning is cracked without major architectural changes, giving up on using the same model for all customers means giving up on many of the benefits of batching.
Some human devs do this too. In the short term it reduces the likelihood of breaking things because something you weren't aware of relied on the old version. In the long term it makes changes harder, because now if you want to change the logic instead of changing it in one place you have to change it in n similar but usually not identical places, and if those places are different in ways that affect the implementation, now you have to try to make an informed guess about whether they're different on purpose and if so why. Down that path lies madness.
I'm curious what fraction of your non-boilerplate, non-test code that ends up in production is AI-generated. Do you review it manually?
At this point probably >95% of the code I cause to be written is AI-generated. Most of the AI-generated code is exploratory[1] or rote[2], though. About 75% of the code I merge is AI-generated, but most of that is either boilerplate or tests[3]; only 20% or so of the non-boilerplate, non-test code that makes it onto prod is AI-generated.
In any case, I should probably make a top level shortform to this effect, since this one got a lot more engagement than I was expecting - it was intended to be "I tried to get a measurement of how much AI can help me with maintenance work, and the attempt failed in an entertaining way" with a side of "I don't think I'm going to be substantially replaced by clopus just yet", but I have a bad feeling people are over-updating to "LLMs don’t help with programming", which is not my experience at all.
e.g. mocks of what a flow could look like, comparing lots of different alerting thresholds against historical data to see how I want to configure alarms ↩︎
DI wiring, includes, docblocks, that sort of thing. Basically the stuff I had keyboard shortcuts to fill in for me in 5 keystrokes in the days before AI. ↩︎
"write tests" is, by far, the area I get the most value out of current LLM coding agents. I rarely have the time or energy to write the test suite I really want by hand, but I do have enough time to rattle off a few dozen "when x happens then y happens z should be true" style things. LLM coding agents can usually come up with more. Test code is also very tolerant of copy/paste/modify (so I'm willing to say "look at this example, and copy shamelessly from it to fit your needs), and is also much more tolerant of bad code than user-facing code is (since rewrites are low-risk and will generally break in obvious ways if they break. Between these factors, I am usually quite happy to ship LLM-written tests) ↩︎