If you allow the hacks you get 13 hours, versus 12 hours for Claude Opus 4.6.
That is, if you count a hack as a success, which would be stupid. Note that no previous model's score counted hacks as successes. It's unclear whether Opus 4.6 ever hacked, but if it did, its score would be higher than 12 hours too.
As we all try to figure out what Mythos means for us down the line, the world of practical agentic coding continues, with the latest array of upgrades.
The biggest change, which I’m finally covering, is Auto Mode. Auto Mode is the famously requested kinda-dangerously-skip-some-permissions mode, where the system keeps an eye on all the commands to ensure human approval for anything too dangerous. It is not entirely safe, but it is a lot safer than --dangerously-skip-permissions, and previously a lot of people were just clicking yes to requests mostly without thinking, which isn’t safe either.
Table of Contents
Huh, Upgrades
Claude Code Desktop gets a redesign for parallel agents, with a new sidebar for managing multiple sessions, a drag-and-drop layout for arranging your workspace, an integrated terminal and file editor, and performance and quality-of-life improvements. There is now parity with CLI plugins. I can’t try it yet as I’m on Windows, aka a second class citizen, but better that than using a Mac. Daniel San is a fan and highlights some other features.
Claude Cowork can connect to TurboTax or Aiwyn Tax and Claude can do your taxes for you, at least if they’re insufficiently complex. I’m filing for an extension, primarily because I’m missing some necessary documents from an investment, but also because think how much better Claude will be at filing your taxes six months from now.
Claude Code now has full computer use for Pro and Max plans, for now macOS only.
Computer use in Claude Cowork and Claude Code Desktop is now also available on Windows.
Claude Code auto-fix in the cloud allows Web or Mobile sessions to follow PRs, fixing CI failures and addressing comments to keep PRs green.
Claude Code can now natively run PowerShell.
Claude Code now has a NO FLICKER mode.
Claude.ai/code GitHub setup for your local credentials is now /web-setup.
Claude Code Dispatch can set its permission level. They recommend Auto, if available.
Alex Kim goes over some features within Claude Code revealed via the source leak.
Claude Code now has /autofix-pr available for after you finish up a PR.
Anthropic offers the option to use Sonnet or Haiku as the end-to-end executor of your API agentic request, but to use Opus as an advisor model when there is a key decision. They suggest running it against your eval suite. An obvious follow-up is, are they going to bring this to Claude for Chrome or to Claude Code or Cowork?
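The executor/advisor split can be sketched generically. Everything below is an illustration, not Anthropic's actual API: `call`-style functions stand in for real model calls, and the escalation convention (the executor flags a key decision with a marker string) is an assumption about how routing might work.

```python
from typing import Callable

# Illustrative convention for the executor flagging a key decision;
# not an actual Anthropic protocol.
ESCALATION_MARKER = "[KEY DECISION]"

def run_step(
    prompt: str,
    executor: Callable[[str], str],  # cheap model (e.g. Sonnet or Haiku)
    advisor: Callable[[str], str],   # strong model (e.g. Opus)
) -> str:
    """Run the cheap executor end to end; escalate to the advisor
    only when the executor flags the step as a key decision."""
    draft = executor(prompt)
    if ESCALATION_MARKER in draft:
        # Hand the flagged step to the stronger model for guidance,
        # then let the executor finish with that guidance in context.
        advice = advisor(f"Advise on this decision:\n{draft}")
        return executor(f"{prompt}\n\nAdvisor guidance:\n{advice}")
    return draft
```

The point of the pattern is that most steps never pay for the expensive model, which is also why they suggest checking it against your eval suite before trusting the routing.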
On Your Marks
GPT-5.4-High reward hacked in the METR test and got caught. Accounting for this, they get a disappointing time estimate of 5.7 hours to go with the misalignment issue. If you allow the hacks you get 13 hours, versus 12 hours for Claude Opus 4.6.
Epoch, in cooperation with METR, proposes a new benchmark, MirrorCode, which checks the most complex software an AI can recreate on its own.
This is a good illustration of ‘as AI improves it jumps rapidly from unable to do a given task to being able to consistently do a given task.’
What this cannot do for now is compare models from different labs.
Lazy Cheaters
Why do Claude models ‘try less hard’ on the first shot, leading them to look worse than they are on many initial tests? Well, that’s what you do, too, the moment you are ‘on your own’ and faced with a problem that doesn’t justify your full focus. It is efficient. The right amount of lazy is not zero.
It’s All Routine
Claude Code is adding routines as a research preview.
Basically you can run a command periodically or in response to a trigger.
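The source doesn't specify the routine syntax, but the underlying idea, running a command periodically or in response to a trigger, is easy to sketch generically. This is a minimal illustration of the concept, not Claude Code's actual implementation:

```python
import os
import subprocess
import time

def run_routine(command: list[str]) -> str:
    """Run a shell command and return its output."""
    return subprocess.run(command, capture_output=True, text=True).stdout

def periodic(command: list[str], every_seconds: float, iterations: int) -> list[str]:
    """Fire the routine on a fixed schedule, collecting each run's output."""
    outputs = []
    for _ in range(iterations):
        outputs.append(run_routine(command))
        time.sleep(every_seconds)
    return outputs

def on_file_change(command: list[str], path: str, max_events: int,
                   poll_seconds: float = 1.0) -> list[str]:
    """Fire the routine when a watched file's mtime changes (polling trigger)."""
    outputs = []
    last = os.path.getmtime(path)
    while len(outputs) < max_events:
        time.sleep(poll_seconds)
        now = os.path.getmtime(path)
        if now != last:
            last = now
            outputs.append(run_routine(command))
    return outputs
```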
Declawing
If you max out use of the $200 subscription plan, you are getting a massive token discount from Anthropic or OpenAI, and they are taking a loss and eating into limited supply. With demand for compute exceeding supply, it does not make sense to let users indefinitely use that to power lumbering OpenClaw instances.
OpenAI is for now happy to invest in tons of compute and to hemorrhage money, especially since it hired the creator of OpenClaw, so for now they are still willing to eat this one, but they killed Sora to free up compute, and my anticipation is that when Mythos and ‘Spud’ are around they will follow Anthropic’s lead here in some form.
The one time credit grant is a good move to placate users and smooth the transition, especially since cash is less limited than compute at the moment.
Meanwhile, if you are indeed running OpenClaw, there are still some issues, although the claim seems overstated.
Free Claw
A bunch of people have noticed that Gemma 4 can run OpenClaw locally, at a marginal cost of essentially zero.
Presumably performance is a lot worse than using Claude Opus 4.6, but free is free, and now you can do all of the things, so long as they are the things Gemma can do without falling over or getting owned. But that presumably includes most of the things you were previously able to reliably and safely do?
Take It To The Limit
The declawing is only one of the steps Anthropic has had to take to manage compute. Anthropic has continuously had problems with customers hitting usage limits, as demand for its compute has reliably exceeded supply. This story is not new.
This seems like a very reasonable thing to have happen to literally the fastest growing company in history (in spite of the issue). Missing in the other direction kills you.
The latest incidents happened around April 2.
Basically, many users think that a subscription means tokens should be free and you shouldn’t have to worry about efficiency, and Anthropic made 1M token context windows available but is charging accordingly. So some people are very upset.
I agree that this kind of thing can make users angry, and in general I’m with Roon, but I do think that ‘take a subscription so you feel like marginal use is free’ combined with most users almost never hitting the limits and being highly profitable is where we are pretty much stuck for now. Consider how people act when told to use the API.
Does this mean Anthropic should have invested more heavily into compute? They would be better off today if they had done so, to the extent such investments were available, but I buy that it would have been a hell of a risk, and also Anthropic was being undervalued enough that the dilution would have hurt.
I agree that we are probably under-building, and everyone else is definitely under-building in pure economic terms, despite all the bubble talk. The right amount of bubble risk is very not zero. Yes, OpenAI is betting the company on scaling, and has been doing so for many years, and it has worked, but there are downsides.
Maybe it is actually a good sign that Anthropic has chosen to not make bets that, while they were +EV if you did the basic math, carried firm risk, also known as risk of ruin, as in existential risk to the company. We’re going to need more of that, and every gambler knows you have to size your bets accordingly.
Turn On Auto The Pilot
Auto mode, enabled by --enable-auto-mode, is now available on the Enterprise plan and to API users. Max users are still waiting.
How Anthropic designed Auto Mode for Claude Code.
Permission requests get approved 93% of the time, and I’m surprised it was that low. Too many requests is less safe, because people start approving without thinking, or they turn on --dangerously-skip-permissions, or start whitelisting a lot of commands. Sandboxes are annoying even when done right. So yes, we needed a way to safely ask for fewer approvals, to move Auto Mode into the upper right.
That sounds like a good idea for most purposes, even if you’re not in Auto mode.
Exactly. There is also a built-in safe-tool list before you even call the classifier. Most requests are clearly fine. You only need to think about the ones that aren’t.
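The two-layer shape described here, a static safe-tool list checked before any classifier call, can be sketched in a few lines. Everything below (the list contents, the classifier stub, the function names) is illustrative, not Anthropic's actual implementation:

```python
# Commands assumed safe without further review (illustrative list).
SAFE_TOOLS = {"ls", "cat", "grep", "git status", "git diff"}

def classify(command: str) -> bool:
    """Stand-in for the model-based classifier: return True to allow.
    A real classifier would call a model; here we just flag a few
    obviously destructive patterns as a placeholder."""
    dangerous = ("rm -rf", "sudo", "curl | sh")
    return not any(marker in command for marker in dangerous)

def should_auto_approve(command: str) -> bool:
    """Auto mode gate: cheap allowlist first, classifier only for the rest."""
    parts = command.split()
    base = parts[0] if parts else ""
    if command in SAFE_TOOLS or base in SAFE_TOOLS:
        return True           # clearly fine, no classifier call needed
    return classify(command)  # only ambiguous cases pay for a model call
```

The design point is that the cheap static check absorbs the bulk of the traffic, so the classifier's cost and error rate only apply to the commands that actually need judgment.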
I’ll Allow It
Your list of things you’ll always allow carries over into Auto mode. While in manual mode, there is a lot more temptation to auto-approve things even when they are not quite fully safe, to avoid the annoyance. So if you’ve been using an expanded command whitelist, you’ll want to dial it back a bit.
The classifier in Auto mode also doesn’t look at file writes and edits inside the project directory. That makes sense, but is something to keep in mind.
Threat Model
This is universal to coding agents, and I think this is a good taxonomy.
That’s the near-term action. You also want to adjust your approach. If you discover you have a sufficiently misaligned (or sufficiently overeager or inaccurate) model, you want to block all the actions and stop using the model, and if it’s prompt injections you need to cut off the threat source.
The Classifier Is The Hard Part
Presumably ‘get an AI to check’ was everyone’s first idea, despite the obvious graphic.
It’s easy to get past the graphic, because if you were counting on the permission layer to stop a misaligned, sufficiently advanced intelligence, you were dead anyway.
The problem now is, you have to get the classifier to be good enough.
The plan is:
Acceptable Risks
So is the classifier good enough? They estimate that on actual overeager commands that users would reject when asked, the false negative rate was 17% (on a small sample size). When there is an error, it is usually because the classifier mistakenly thinks the user gave permission for the action. On artificial data the false negative rate was lower.
This too pushes back against Unhinged Yudkowsky. Auto mode, for now, is for when you would otherwise be more unhinged, not for when you would be hinged.
Given the different scores at Stage 1 versus Stage 2, you clearly could choose a different balance of false positives versus false negatives.
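The stage-to-stage difference is essentially a threshold choice on the classifier's risk score: block more aggressively and you annoy users with false positives, block less and more genuinely unsafe commands slip through. A toy illustration (the scores, labels, and thresholds are made up):

```python
def confusion(scores: list[float], labels: list[bool], threshold: float) -> tuple[int, int]:
    """Count (false positives, false negatives) when commands with risk
    score above `threshold` are blocked. labels: True = genuinely unsafe."""
    fp = sum(1 for s, bad in zip(scores, labels) if s > threshold and not bad)
    fn = sum(1 for s, bad in zip(scores, labels) if s <= threshold and bad)
    return fp, fn

# Illustrative risk scores for eight commands, with ground-truth labels.
scores = [0.1, 0.2, 0.3, 0.45, 0.55, 0.7, 0.8, 0.9]
labels = [False, False, False, True, False, True, True, True]

# A strict threshold blocks a safe command (false positive); a lenient
# one lets unsafe commands through (false negatives).
strict = confusion(scores, labels, 0.4)    # one safe command blocked
lenient = confusion(scores, labels, 0.75)  # two unsafe commands allowed
```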
Manage The Agents
Claude Managed Agents are here.
They list partners using it: Notion, Rakuten, Asana, Vibecode and Sentry.
It makes sense, if you can make the product high quality, to offer an easy, out-of-the-box, instantly secure agent. Point it at the question, let it work, that’s it.
Sarah Wooders of Letta says this is basically the product they’ve had for a year, except locked to Anthropic, and she claims the world will move on from this and it’s ultimately a bad design.
Pawel Huryn goes the other way and calls it Anthropic’s AWS moment, as this design makes it easy for everyone to get everything working.
Introducing
Excalibur, a highly opinionated open agent harness ‘for the aspiring summoner,’ from Vie of OpenAI.
Skilling Up
Boris Cherny shares underutilized Claude Code features and commands. Always worth a quick scan of such lists in case you missed something.
Should you debug AI code, or treat it as a black box so long as it passes unit tests? It depends on your purpose. If use of your code is not going to scale or be especially dangerous, black box seems the only practical solution. But at some point that stops being a viable answer.
Your five-hour usage window with Claude starts at the first message, so sending a throwaway message first thing to start the clock can get you a faster first reset. One could do this via a scheduled task.
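One way to do the scheduled task, assuming the `claude` CLI and cron are available (the time and the throwaway message are arbitrary):

```shell
# crontab entry: send a throwaway prompt at 8am on weekdays so the
# five-hour usage window starts before you sit down to work.
0 8 * * 1-5 claude -p "ping" > /dev/null 2>&1
```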
Dean Ball suggests that Anthropic is shipping Claude Code features too quickly, users can’t keep up, and it would be better to go slower and only ship things once they are fully baked and ready. I disagree. I think the best way to iterate is to ship, and Dean Ball is correct that he doesn’t need to read the patch notes or use the new hotness while the early adopters have their fun. Boris Cherny responds, noting things really are that much faster now. I’m sure Mythos is part of this story as well.
What Happened To My Tokens?
Thariq of Anthropic is offering to do calls to Max users who find themselves unexpectedly running out of usage tokens, to help figure out how to improve /usage while helping you stop burning all your compute.
Coding Agents Offer Mundane Utility
Codex was up to three million users weekly as of April 7.
Codex has been giving free usage resets every time they pass another million users.
Codex compresses a JPEG by 50% without loss of fidelity.