I hope to not have another post on the Anthropic and DoW situation, at least until the one celebrating that we have found a resolution.
I doubt that it is likely. Recently anaguma posted that Anthropic Sues over Supply Chain Risk Designation.
And this is the best country in the world, with the best system of government
This seems like a stronger claim than what's in the Twitter thread.
Chayenne Zhao tells Codex 5.3 ‘make it faster’ over and over
From her linked tweet, it was actually Yam Peleg.
It feels good to get back to some of the fun stuff.
The comments here can double as a place for GPT-5.4 reactions, in addition to my Twitter thread. I hope to get that review out soon.
Almost all of this will be a summary of agentic coding developments, after a note.
The Virtue of Silence (Unrelated Update)
After Undersecretary of War Emil Michael went on the All-In Podcast and did an extensive interview with Pirate Wires, I found many enlightening quotes, many of which demanded a response, and went about assembling an extensive analysis of Emil Michael’s statements during the ongoing events with Anthropic.
As part of that, I ended up in a remarkably polite and productive Twitter exchange with him. We reached several points of agreement. The Department of War has no intention of doing what in law is called ‘mass domestic surveillance’ but those words are terms of art in NatSec law, and mean a much narrower set of things than one would think.
There are many things that I or Anthropic or most of you would look at as mass domestic surveillance, that are legal, and it is DoW’s position that it’s their job and duty to do everything legal to protect our country, including those things. The law has not caught up with reality and Congress needs to fix that. And this is the best country in the world, with the best system of government, because private citizens can voice their disagreement with such actions, including by refusal to participate.
Thus, in the spirit of de-escalation, although there are many interpretations of events shared by Michael with which I strongly disagree, I am going to indefinitely shelve the piece, so long as events do not escalate further. As long as things stay quiet there is no need to relitigate or unravel the past on this. The Department of War can focus on its active operations, things can work their way through the courts as our founders intended, and once we see how we work together in an ultimate real world test, hopefully that will rebuild trust that we are all on the same side, or at least let us agree to part in peace once OpenAI is ready. Ideally the DoW will have multiple suppliers, exactly so that they are not dependent on any one supplier, the same way we do it with aircraft.
I hope to not have another post on the Anthropic and DoW situation, at least until the one celebrating that we have found a resolution.
Now, back to coding agents.
Agentic Coding Offers Mundane Utility
That’s 4% that are labeled as authored by Claude Code. The real number is higher.
The flippening has happened in terms of annual recurring revenue added, and SemiAnalysis thinks Anthropic is outright ‘winning’:
Anthropic has deals with all three major cloud services. Can they scale up faster?
Analyze the economic data in R with 15 minutes of work per month instead of 4-5 hours, without the annoying copying and pasting you get with a chatbot UI. Or use Claude Code to create reports.
Results from the Claude Code hackathon.
Or you can do things as a side project while at Anthropic, cause sure why not:
Creating a skill to get good YouTube transcripts was one of the first skills I made with Claude Code, Julia Turc calls using an MCP for this ‘waking up from a coma.’ I have still only used it on the motivating example, because the right podcast hasn’t come up, but when it does this will save a lot of time.
Tod Sacerdoti has Claude Codex write a 250-page biography of Dario Amodei.
Andrej Karpathy gives another example to illustrate that AI coding still needs direction, judgment, taste, oversight, iteration, hints and ideas, but that it changed in December from ‘basically didn’t work’ to ‘basically works.’
Official compilation of Claude customer stories.
Chris Blattman automates his workflow with Claude Code.
Agentic Coding Doesn’t Offer Mundane Utility
Warning: If you Google ‘install Claude Code’ you are liable to hit malware. Probably fixed by the time you read this but Google needs to up its game.
Chayenne Zhao tells Codex 5.3 ‘make it faster’ over and over, and it ends up committing API identity theft against him in order to make calls to Gemini Flash.
This should never happen but is also what we call ‘asking for it.’
A thing never to do is let your agent mess with the Terraform command, or you might wipe out your entire database. In general, writing code is in practice mostly harmless, but be very careful with file structures and organizational shifts and terraforms and such. Always make backups first. Always.
Huh, Upgrades
The big upgrade is Agent Teams, for that see Introducing Agent Teams.
Or it actually might be Claude Remote Control so you can run it from your phone, if you were too lazy to install something like this from a third party. Vital infrastructure.
Or maybe it’s Auto Mode, aka --kinda-dangerously-skip-permissions.
Claude Cowork has the obvious big upgrade, it is now available on Windows.
Claude Code launched HTTP hooks so you can combine it with web apps, including on localhost, and better deploy things.
Claude Code Desktop introduces scheduled tasks. Previously it had me do this via a script on my computer, so this is a lot cleaner and easier. I like it.
Claude Code has a built in short term scheduler with /loop [interval] <prompt>, which sets up a cron job. Tasks last for three days.
Claude Code on the Web picked up a few new features, including multi-repo sessions, better diff & git status visualizations and slash commands. It didn’t have slash commands before?
Claude Code now automatically records and recalls memories as it works.
Claude Code CLI adds native support for git worktrees.
Claude Code adds /simplify to improve code quality and /batch to automate code migrations.
Claude Code Desktop now supports --dangerously-skip-permissions as ‘Act’ if you turn it on in Settings. I continue to want a --somewhat-dangerously-skip-permissions that makes notably rare exceptions so we don’t have to roll our own.
Claude Code in Slack now has Plan Mode.
Did you know Obsidian has a CLI and it technically isn’t Claude Code?
I don’t see a particular reason for a human to use the Obsidian CLI. But I do see reasons for Claude Code to invoke the Obsidian CLI, which grants better and faster access to the information in your vault than checking all the files directly.
And many more not listed, of course.
Our Price Cheap
When you pay for usage with a monthly subscription, be it $20, $100 or $200, if you use up your quotas you get a lot of tokens for not that much money. It’s a great deal, even if you leave a lot of it unused, because they lock you in.
It also generally is a better experience, so long as you’re not up against the limits. I love unlimited subscriptions because the marginal cost of doing things is $0. That feels great, so there’s no stupid little whisper in your brain telling you to not do things, when your time is way more valuable than the tokens.
The people agree.
The danger is that you become obsessed with not ‘wasting’ the tokens, or you start going around multi-accounting and it gets weird, or you run into limits and actually stop coding rather than moving to using the API. You mostly shouldn’t let any of that stop you.
That doesn’t work when you want to go full Fast Claude. At that point, you’re talking real money, and you do have to think about what is and is not Worth It.
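To make ‘Worth It’ concrete, here is a minimal sketch of the break-even arithmetic between a flat subscription and paying per token. The $200 tier comes from the discussion above. The $10 per million tokens rate is a made-up placeholder, not a real price, and the function names are hypothetical, so plug in your own blended rate.

```python
def breakeven_tokens(subscription_usd: float, api_usd_per_million: float) -> float:
    """Tokens per month at which API billing would cost the same as the subscription."""
    if api_usd_per_million <= 0:
        raise ValueError("api_usd_per_million must be positive")
    return subscription_usd / api_usd_per_million * 1_000_000


def api_cost_usd(tokens: float, api_usd_per_million: float) -> float:
    """What a given token count would cost at a flat per-million API rate."""
    return tokens / 1_000_000 * api_usd_per_million


if __name__ == "__main__":
    HYPOTHETICAL_RATE = 10.0  # USD per million tokens; placeholder, not a real price
    print(breakeven_tokens(200, HYPOTHETICAL_RATE))
    print(api_cost_usd(50_000_000, HYPOTHETICAL_RATE))
```

Below the break-even point the subscription is the worse deal on paper, although the zero marginal cost may still be worth it. Above it, every token you would otherwise have bought at API rates is where the real money starts.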
Andrej Karpathy has Claude Code write him software to coordinate an experiment to track his exercise and attempt to lower his resting heart rate. It took 1 hour, would have taken 10 hours two years ago (so 10x speedup) and he asks why it needs to take more than 1 minute in the future. My guess is this should take 10 minutes not one, because it’s worth getting the details that you want. The speedup on one-off tasks is already dramatic and it changes how we should interact with tech. If you’re building the tool, you can give it the actually important parts of the context and highlight the uses you care about, which is way better than ‘find an app that does sort of the thing you want.’
Quickly, There’s No Time
You toggle this by typing /fast, or set "fastMode": true in your user settings.
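If you would rather script the settings change, a minimal sketch follows. The key name comes from the line above. The file location in the usage comment (~/.claude/settings.json) is an assumption about where user settings live, and the helper name is hypothetical, so check your install before pointing this at a real file.

```python
import json
from pathlib import Path


def set_fast_mode(settings_path: Path, enabled: bool = True) -> dict:
    """Set "fastMode" in a JSON settings file, preserving any other keys."""
    settings = {}
    if settings_path.exists():
        text = settings_path.read_text()
        if text.strip():
            settings = json.loads(text)
    settings["fastMode"] = enabled
    settings_path.parent.mkdir(parents=True, exist_ok=True)
    settings_path.write_text(json.dumps(settings, indent=2) + "\n")
    return settings


# Usage (path is an assumption, verify it first):
# set_fast_mode(Path.home() / ".claude" / "settings.json")
```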
Speed kills. That includes killing your budget.
Like any good drug, the first hit is free.
There is one important use case that Anthropic does not list for fast mode, which is if you are talking to Claude, or otherwise using it in a non-workhorse, non-coding capacity. In that case, token use is limited, and your time and flow are valuable. Would you switch to this mode in Claude.ai? At this point it’s fast enough that I mostly don’t know that I would, but it would be tempting.
Before, I said go ahead and pay whatever the AI costs unless you’re scaling hard.
Well, this is what it means to scale hard. We are now talking real money.
This is as it should be. If you’re not worried you’re paying too much for speed or using too many tokens, you’re not working fast enough and you’re not using enough tokens.
Token efficiency matters at this level, in a way it did not before.
So does your ability to efficiently turn your time into tokens well spent. Those that aren’t using agents to their fullest will fall farther behind on high value projects.
What do the people think? The people, inside and outside of Anthropic, love it.
A Particular Set Of Skills
OpenAI confirms that Codex is trained in the presence of the Codex harness. It is specialized for that harness, and also helps build the harness. Some amount of this has to be optimal for short term effectiveness, and if you’re doing recursive self-improvement short term help translates into better long term help. In exchange, you get locked in, and it gets harder for both you and others to adapt or mix-and-match.
Himanshu argues the coding harness is the real product and goes viral. Explains how different harnesses organize actions, the oddest part is not mentioning Codex.
Next Level Coding
This seems right:
If that can’t be done, good to try and realize that. Then wait two months. Maybe one.
Dual Wielding
The problem with using both Claude Code and Codex is then you need to keep up with both of them.
They Took Our Jobs
That still leaves plenty more jobs. For now.
You Need To Relax Sometimes
A viral post on Twitter warns of token anxiety run rampant in San Francisco. People go to a party, then don’t drink and leave early so they can get back to their agents, to avoid risking them sitting idle. Everyone talks about what they are building.
I do feel somewhat bad I’m not building things continuously on the side, but that’s on the level of ‘I’m not building anything and I’m at my computer right now and Claude Code and Codex are inactive.’ And yes, I work and am at my computer rather a lot, and I’ve spent years basically locked in and constantly watching screens so I could trade better. That year I was trading crypto my brain was never fully anywhere else.
Also, I remember what it is like to be in the grip of one of those games that work on cycles. There’s nothing actually that important at stake, but you grow terrified that you’ll miss out if you’re not there when the timer runs out. You need to maximize everything, and you can’t focus on other things, it can hurt your sleep. Then one day you wake up and realize, and hopefully you quit the game.
That’s exactly why I can say that this is not healthy. It’s no good. You have to take breaks. Real breaks. If the agents sit idle, they sit idle. If you ‘waste tokens,’ then you waste tokens. This isn’t a game you want to quit, but you have to set healthy limits.
Levels of Friction
This is indeed presumably a joke, and Amazon has pattern detectors so if you tried to do this too many times you’d get blacklisted from replacements, so this exact intervention won’t work. But this raises an excellent point.
In the past, you had to apply effort to try and demand refunds, and also the need to write the words and be actively involved stopped a lot of people out of guilt or shame. Whereas with an agent, a lot more people are going to try things like this. What happens?
Presumably what happens is that replacements start requiring either some form of proof, costly signals of a human driving the request, some use of reputation, or some combination thereof.
Danger, Will Robinson
I trust Claude Code for most things but it seems correct to be terrified of mass delete commands. Things can go oh so very wrong and occasionally they do. Not worth it. If there’s anything you don’t have fully backed up just do this part manually.
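If you do let anything automated delete a directory, one way to honor ‘backed up first’ is to make the backup part of the delete itself. This is a minimal sketch with a hypothetical helper name, not a feature of Claude Code: it archives the directory, checks the archive exists and is non-empty, and only then removes the original.

```python
import shutil
import time
from pathlib import Path


def delete_with_backup(target: Path, backup_root: Path) -> Path:
    """Archive `target` under `backup_root`, check the archive, then delete `target`."""
    target = target.resolve()
    if not target.is_dir():
        raise NotADirectoryError(f"refusing to delete non-directory: {target}")
    backup_root = backup_root.resolve()
    # A backup stored inside the directory being deleted would be deleted with it.
    if backup_root == target or target in backup_root.parents:
        raise ValueError("backup_root must live outside the directory being deleted")
    backup_root.mkdir(parents=True, exist_ok=True)

    stamp = time.strftime("%Y%m%d-%H%M%S")
    archive = shutil.make_archive(
        str(backup_root / f"{target.name}-{stamp}"),
        "gztar",
        root_dir=target.parent,
        base_dir=target.name,
    )
    archive_path = Path(archive)
    if not archive_path.is_file() or archive_path.stat().st_size == 0:
        raise RuntimeError("backup archive missing or empty; nothing was deleted")

    shutil.rmtree(target)
    return archive_path
```

This only protects files on disk. It does nothing for a database or cloud resources, which is where the Terraform-style disasters tend to happen.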
Snagged By The Claw
You are of course welcome to yolo and have fun with your OpenClaw and other unleashed AI agents, but understand that you are very much asking for it.
The top downloaded skill in ClawHub was malware.
You don’t even need any of that, indirect prompt injection is sufficient. Once again, don’t hook this up to any computer or account you are unwilling to lose to an attacker.
You can also run into various other problems, Chrys Bader here highlights drift and scattering state everywhere, exposure to untrusted inputs (without which it can’t do most of the fun agent things), autonomy miscalibration, burning through API costs and lack of observability.
It’s been a lot of this in various forms:
The Meta Clause
When I didn’t realize who Summer Yue was I thought this was hilarious.
Now, it’s still hilarious, but also: Ten out of ten for style and good sportsmanship to Summer Yue, but minus several million for good thinking?
What happened exactly?
Three obvious mitigations are:
van00sa reports their ClawdBot also went rogue and lacked a proper kill switch, with the agent blatantly ignoring shutdown commands.
If nothing else, OpenClaw has shown us that having a shutdown command does not mean you can command the model to shut down. Whoops.
If They Wanted To
Even without OpenClaw or another yolo, there is nothing stopping Claude or Codex from doing all sorts of things, if it decides that it wants to go ahead and do them. We’re mostly gambling on things turning out okay often enough that it’s fine.
This is not reassuring for our future, but what are you going to do, be careful?
The Famous Mister Claw
I am curious what the recruiting conversations were like on this one as he was choosing between potential suitors. It makes sense that he landed where he did.
That means Peter Steinberger is moving from Europe to America to join OpenAI. When asked why he couldn’t remain in Europe, Peter pointed to labor regulations and similar rules, saying that typical 6-7 day work weeks at OpenAI are illegal in Europe. There is that, and there are also the piles. Of money. Also of compute. OpenAI doubtless made him a very good offer, and several other labs probably did as well, or would have if he had asked.
Claw Your Way To The Top
As his last act before joining OpenAI, Peter Steinberger gave us the OpenClaw beta.
That’s right, before everyone was using an alpha. The new version is ‘full of security hardening stuff’ so there’s some chance it might possibly not go wrong for you?
I’m going to go ahead and say that this is not enough time to conclude that all of that was a good idea, let alone create something secure enough to risk anything you are not prepared to lose in a ‘…and it’s gone’ kind of way.
Ultimately, did OpenClaw matter? I think it very much did, but mostly by waking people up to what is going to happen.
Claw Your Way Out
Claw users keep trying to use sources of discounted subscription tokens to power their claws. The AI companies do not love this idea, since it costs them money.
Yep. If you scale an exploit then it gets shut down. There’s a tragedy of the commons.
I don’t love Google’s banning people with no warning, but as long as it is limited to Antigravity and is temporary, I understand it. You know what you did.
A Chinese Claw
In case you didn’t think OpenClaw was a sufficiently reckless idea? Double down.
I don’t actually think ‘the CCP has a backdoor’ is that big a fraction of the mishaps you should expect to encounter here. The far bigger boost is that Kimi is less robust to attacks than Claude.
This is a smart play from Kimi. I mean, yes, they’re committing to hosting (weakly, at least for now) self-improving completely uncontrolled very easy to hijack agents indefinitely that could easily break free of human control, but I mean, that sure sounds like someone else’s problem from their perspective.
Alas, in the medium term we are basically locked into there being many similar offerings from various companies that make this all even easier for those who want to blow themselves up. Hopefully OpenAI, Anthropic or Google, or maybe someone else, produces something competitive enough that also has reasonable security.
Hackathon
Oh, good.
Introducing Agent Teams
Claude Code now has new logic for multiple instances to work together as a team. This is their official name for their version of an ‘agent swarm.’
You have to enable them in settings.json with
They’re expensive, but reports are they work great. Once they’re enabled, you get an agent team by telling Claude Code to create an agent team, which will have a shared task list and then work together. You can run them all in the same terminal or use split panes. You can directly talk to or shut down the teammates individually.
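For intuition about what a shared task list buys you, here is a toy model in Python. It is not how Claude Code implements teams, and every class and function name in it is hypothetical. The only point it illustrates is that teammates pull from one list, and a lock guarantees each task is claimed by exactly one of them.

```python
import threading
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass
class Task:
    name: str
    owner: Optional[str] = None
    done: bool = False


class SharedTaskList:
    """Hypothetical shared list: a lock ensures each task is claimed exactly once."""

    def __init__(self, names: List[str]) -> None:
        self._lock = threading.Lock()
        self._tasks = [Task(name) for name in names]

    def claim(self, teammate: str) -> Optional[Task]:
        """Hand the next unowned task to `teammate`, or None if none remain."""
        with self._lock:
            for task in self._tasks:
                if task.owner is None:
                    task.owner = teammate
                    return task
            return None

    def complete(self, task: Task) -> None:
        with self._lock:
            task.done = True

    def snapshot(self) -> Dict[str, Optional[str]]:
        """Map each task name to its current owner."""
        with self._lock:
            return {task.name: task.owner for task in self._tasks}

    def all_done(self) -> bool:
        with self._lock:
            return all(task.done for task in self._tasks)


def run_team(task_names: List[str], teammates: List[str]) -> SharedTaskList:
    """Run one thread per teammate; each keeps claiming tasks until none are left."""
    tasks = SharedTaskList(task_names)

    def work(teammate: str) -> None:
        while (task := tasks.claim(teammate)) is not None:
            # A real teammate would do the actual work here.
            tasks.complete(task)

    threads = [threading.Thread(target=work, args=(name,)) for name in teammates]
    for thread in threads:
        thread.start()
    for thread in threads:
        thread.join()
    return tasks
```

Which teammate ends up with which task depends on thread scheduling, so the assignment is not deterministic. What is guaranteed is that no task is done twice and none is skipped.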
Claude already had the ability to spin up subagents, but it wasn’t working so well before. One theory is that the framing had issues, whereas teams work much better because they’re treating each other more as equals although there is still a team lead.
As I understand it, there are two great things about teams.
Thus you actively want to be spinning up teammates for any fully distinct tasks.
Don’t get carried away.
Cowork Is A Gateway Drug
The key advantage is lowering activation energy and perceived difficulty. Once you get that you can tell the magic box to do things, the sky’s the limit.
Dangerously Evade Permissions
If you set yourself up in an adversarial situation, where your agent wants to do something despite being told not to do it, that’s probably not going to end well for you. It might if the agent is properly sandboxed, but let’s face it, it isn’t.
The reason rules like ‘don’t read a .env’ work is that under normal circumstances, this is interpreted as ‘well then I guess I shouldn’t do that,’ but be aware that this is more of a suggestion.
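The difference between a suggestion and a rule is where it gets enforced. As a minimal sketch, assuming you control the function the agent uses to read files, the check can live in code instead of in the prompt. The deny patterns and function names here are hypothetical, and this is not a description of Claude Code’s own permission system.

```python
from fnmatch import fnmatch
from pathlib import Path

# Hypothetical deny-list, matched against the final file name after resolving symlinks.
DENIED_PATTERNS = [".env", ".env.*", "*.pem", "id_rsa*"]


def is_read_allowed(path: str, workspace: str) -> bool:
    """Allow reads only inside `workspace`, and never for deny-listed file names."""
    root = Path(workspace).resolve()
    resolved = Path(workspace, path).resolve()
    # Reject absolute paths and ../ tricks that land outside the workspace.
    if resolved != root and root not in resolved.parents:
        return False
    return not any(fnmatch(resolved.name, pattern) for pattern in DENIED_PATTERNS)


def guarded_read(path: str, workspace: str) -> str:
    """Read a file for the agent, raising instead of trusting it to behave."""
    if not is_read_allowed(path, workspace):
        raise PermissionError(f"read blocked by policy: {path}")
    return Path(workspace, path).resolve().read_text()
```

This only covers reads that go through this one function. An agent with shell access can still run cat on the file, which is why the real answer remains a proper sandbox.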
Skilling Up
Greg Brockman knows: Always run Codex with xhigh reasoning.
OpenAI post on leveraging Codex.
Anthropic offers The Complete Guide for Building Skills for Claude.
Pedro Sant’Anna put together a starter kit and a guide for Claude Code.
Daniel San proposes using Ghostty as the UI for Claude Code. It seems fine, but aside from some shortcut keys I doubt I’d use much of it, since it’s mostly all already in the default CLI.
Data Analyst Augmentation Framework is a new proposed method to turn Claude Code into an algorithm for doing research out-of-the-box.
OpenAI offers tips to make long-running agents do real work.
Some advice for Codex in particular, source should be trustworthy for this:
Modern Working
Measuring Autonomy
Anthropic offers an analysis of how autonomous Claude Code is in practice. Some sessions last more than 45 minutes now between human prompts. My own prompts almost never go over 10 minutes, but I’m not trying to code hard things.
Manually approving each action is annoying, so it’s no surprise advanced users stop doing that. Interruption rate likely depends on whether you find it worthwhile to be looking at what Claude is doing. The majority of interruptions remain pauses for clarification, including on complex tasks.
Use in what they label ‘risky’ domains is rare, but it’s there and growing. I wouldn’t always label such use risky, but some of it is indeed risky.
There’s more discussion at the link, but the suggestions are mostly common sense, or should be common sense at this point to most of you.
I Don’t Even See The Code
No, seriously, the developers haven’t written a single line of code since December. It’s not that there isn’t also a bragging arms race in some places, but I’m pretty sure the bulk of this is real, and those holding back on this are going to regret it.
Scratchpads Are Magic
Claude.md is notes, but you can tell it to take more notes. All the notes.
It’s Coming
Claude Code writes basically all the code for Anthropic.
Codex writes basically all the code for OpenAI.
The first goal will depend on the humans knowing to use the agent. From context ‘technical’ task here means coding and computer use, so this isn’t full-on ‘agents for everything.’
That second goal is pretty rough. Hard mode.
His recommendations here seem good for basically any engineering team:
That is good advice. It doesn’t explain how we’re going to get to ‘agents will by default be able to do what you need them to do and also be considered safe.’
The Grep Tax
Keep it simple, and keep it standard, as much as you can, but no more than that.
That doesn’t mean use the wrong tool for the job. As a clean example, I learned that the hard way when I tried to have Claude Code reimplement an old C# project in Python and that made it so slow it was nonfunctional. I had to switch it back.
Beware Claude Mania
Don’t get carried away. No, this isn’t ‘LLM psychosis,’ it’s a different (mostly harmless most of the time as long as it doesn’t last too long) thing that needs a name.
The Lighter Side
He was surprised.
It’s not clear why he loved the agent so much before the attempted scamming. The story here involves such classic mistakes as ‘hooking it up to your email’ and ‘running it with a model that is not Claude Opus.’
And I suppose it’s not funny for Simon but, ya know, still pretty funny.
AI alignment is hard, especially when everyone involved gives at most zero f***s, and likely is giving misaligned orders to agents built by those giving zero f***s.
Metrics that are in the end rather easy to game:
In Other Agent News
Kangwook Lee investigates how Codex does context compaction.
The Lighter Side
They are indeed.
Thanks!
Who is to say it wouldn’t work? Love the execution on this.
The streams are crossing again.
They all deserve what they get, unless what they get is a viral tweet off a faked screenshot, in which case damnit.