Re: Neuralese not winning, I think during practical inference you'd have similar-sized KV caches either way, so the memory usage is basically a wash (although storing the trace as text tokens when you're not running would take much less space).
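For a rough sense of scale, here's a back-of-envelope sketch (all of the dimensions and byte counts below are made-up illustrative numbers, not any particular model's architecture):

```python
# Illustrative numbers only -- not any real model's dimensions.
n_layers      = 80
n_kv_heads    = 8
head_dim      = 128
bytes_per_val = 2           # fp16 / bf16
seq_len       = 100_000     # a long reasoning trace

# KV cache while running: keys + values, per layer, per position.
# Roughly the same size whether the positions hold text tokens or
# neuralese vectors.
kv_cache_bytes = 2 * n_layers * n_kv_heads * head_dim * bytes_per_val * seq_len
print(f"KV cache during inference: {kv_cache_bytes / 1e9:.1f} GB")     # ~32.8 GB

# Storing the same trace as text between sessions is tiny by comparison.
stored_text_bytes = seq_len * 4     # a few bytes of UTF-8 per token
print(f"Trace stored as text:      {stored_text_bytes / 1e6:.1f} MB")  # ~0.4 MB
```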
But my understanding is that neuralese hasn't won because it's too hard to train. CoT works by training a base model to produce ~all kinds of human-like text, and then RL can extract human-like text that's useful for reasoning. For neuralese, you have to train the reasoning from scratch, without teacher forcing, and getting that to work is (for now) too hard and not as effective as text CoT.
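To make the teacher-forcing point concrete, here's a toy sketch (a GRU cell standing in for the model; this is only meant to illustrate where the gradients come from, not anyone's actual training setup). With text CoT, every intermediate step has a ground-truth token to score against; with neuralese, the intermediate vectors have no targets, so the only signal comes from whatever loss or reward you can attach at the end:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

vocab, d = 100, 32
embed = nn.Embedding(vocab, d)
cell = nn.GRUCell(d, d)        # toy stand-in for the reasoning model
readout = nn.Linear(d, vocab)

# --- Text CoT with teacher forcing ---
# Each step is conditioned on the *true* previous token and gets its own
# dense cross-entropy loss against human-like text.
cot_tokens = torch.randint(0, vocab, (10,))   # pretend this is a human CoT
h = torch.zeros(d)
cot_loss = torch.tensor(0.0)
for t in range(len(cot_tokens) - 1):
    h = cell(embed(cot_tokens[t]), h)
    cot_loss = cot_loss + F.cross_entropy(readout(h).unsqueeze(0),
                                          cot_tokens[t + 1].unsqueeze(0))

# --- Neuralese: latent feedback, no per-step targets ---
# The model feeds its own hidden vectors back in; nothing is decoded to
# text along the way, so the only learning signal is on the final answer
# (or a sparse reward).
h = torch.zeros(d)
x = embed(torch.tensor(0))                    # some prompt embedding
for _ in range(10):
    h = cell(x, h)
    x = h                                     # latent "thought"
answer = torch.tensor(7)                      # made-up final-answer target
neuralese_loss = F.cross_entropy(readout(h).unsqueeze(0), answer.unsqueeze(0))
```

Training the second loop means doing credit assignment through long latent rollouts with only end-of-trace feedback, which is much less forgiving than next-token prediction on human text.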
Great article though!
I'm on the $100 Max plan ("5x more usage than Pro"), although rate limits were doubled for most of this period as a holiday thing[1]. I used Claude Code on the web to fire off concurrent workers pretty often, and I only hit rate limits twice: once around 11:55 pm (just before the midnight reset), and once in the afternoon about 10 minutes before the rate limit reset (on a different project, where I was making Claude repeatedly read long financial documents). I used Opus 4.5 exclusively.
Basically the rate limits never really got in my way, and I wouldn't have hit them at all on the $200 plan (4x higher rate limits).
I assume they had spare capacity since people were off work.
I basically agree with this. When I say Claude is superhuman at coding, I mean that when Claude knows what needs to be done, it does it about as well as a human but much faster. When I say Claude isn't superhuman at software engineering in general, it's because it sometimes fails to take the right approach where an expert software engineer would.
I run multiple agents in parallel through the Claude Code web interface, so I actually managed to hit the limits on the Max plan. It was always within 10 minutes of the reset though.
I was also making Claude repeatedly read long investment documents for an unrelated project at the same time though.
Aren't human reactions deterministic too though? I'm not sure I understand what you're arguing.
But during inference, when you’re actually talking to Claude or GPT-4, the system is frozen. It processes your input, generates a response, and… that’s it. The prediction errors it’s implicitly computing don’t modify anything that persists as “the system.” There’s no absorption into durable structure.
It's worth pointing out that the weights being fixed doesn't mean everything is fixed. The LLM can still react in-context; it just can't modify other instances of itself.
It also feels too confident in its conclusions to be Scott.
I was thinking of "coder" as specifically the job of writing code, which I assume is what the Claude Code guy meant too. AI is clearly not reliable at system design yet.
The thing METR is measuring seems slightly different from "superhuman coder". My understanding is that they're dropping an AI into an unfamiliar codebase and telling it to do something with no context or design help, so it's partly software architecture and partly coding. On pure coding tasks, Claude Code is clearly superhuman already.
I spent a few hours over the last few days collaborating with Claude on design docs and some general instructions, then having it go through massive todo lists fully autonomously[1]. That's weeks of coding done in a few hours (slowed down mostly by me getting around to giving it more work).
This is the first time I've had it do tasks of this scale, so I'm not doing anything special: just having it propose a design, telling it which parts I want done differently, then having it make a todo list and execute it.
Example prompt:
Can you go through @TODO.md, delegating each task to opus subagents and ensuring that they understand all of the necessary context and implement the task, check it off, and commit it, then move onto the next task until the list is done?
Yeah, a lot of what I've asked it to do definitely required software experience, sometimes fairly low-level (like describing the event loop I want for the background worker).