I want to make a game. A video game. A really cool video game. I've got an idea and it's going to be the best thing ever.
A video game needs an engine, so I need to choose one. Or build one? Commercial engines are big and scary, indie engines might limit my options, making my own engine is hard - I should investigate all of these options to make an informed choice. But wait, I can't make a decision like that without having a solid idea of the features and mechanics I want to implement. I need a comprehensive design doc with all of that laid out. Now should I have baked or dynamic lighting...
13-year-old omegastick didn't get much done.
I want to make a game. A video game. A really cool video game. I've got an idea and it's going to be the best thing ever.
But I tried that before and never got around to actually building it.
Okay, there's an easy fix for that, let's open Vim and get to typing. Drawing some sprites to the screen: not so hard. Mouse and keyboard input. Audio. Oof, UI, there goes three weeks. All right, at least we're ready for the core mechanics now. Oh, my graphics implementation is naive and performance is ass. No problem, I know how to optimize that, I've just got to untangle the spaghetti around GameEntityFactory. Wait, there's a memory leak?
20-year-old omegastick didn't get much done.
100%. A good test suite is worth its weight in gold.
I'd be interested to see a write-up of your experience doing this. My own experience with spec-driven development hasn't had so much success. I've found that the models tend to have trouble sticking to the spec.
In this scenario, are you not also paying uniquely little attention to your surroundings (and thus just as unlikely to spot the bill)?
It feels a little like begging the question to apply that modifier to other people in the scenario, but not yourself.
System design is one part of designing software, but isn't so much what I'm trying to point at here.
Claude Opus 4.5 still can't produce or follow a simple plan to implement a feature on a mid-sized codebase independently.
As an example: earlier today I was implementing session resumption, where a client reconnects to the server after losing its connection. One small part of this task is re-syncing state once the current (server-side) task has finished.
Claude Code was not capable of designing a functioning solution to this problem in its planning mode (it kept trying to sync the state immediately upon connecting, leading to the client missing the result of the in-progress task).
The solution I chose for this specific instance of the problem was to add a state sync command to the server's command queue for that session when a client reconnects. Claude Code correctly updated the plan to show the exact code changes required.
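For concreteness, here is a minimal sketch of that approach in Python. All of the names here (Session, RunTask, SyncState, the asyncio queue) are hypothetical, not from the actual codebase; the only point is that the sync is enqueued behind any in-flight work rather than performed at reconnect time.

```python
# Sketch of the reconnect/state-sync idea, under assumed (hypothetical) names.
# The sync rides the same per-session command queue as ordinary tasks, so it
# can never overtake the result of a task that was already running.
import asyncio
from dataclasses import dataclass, field
from typing import Awaitable, Callable


@dataclass
class RunTask:
    """An ordinary unit of server-side work for this session."""
    name: str


@dataclass
class SyncState:
    """A request to push the session's current state to a client."""
    client_id: str


Command = RunTask | SyncState


@dataclass
class Session:
    send: Callable[[str, dict], Awaitable[None]]
    commands: "asyncio.Queue[Command]" = field(default_factory=asyncio.Queue)
    state: dict = field(default_factory=dict)

    async def on_reconnect(self, client_id: str) -> None:
        # Don't push state here: a task may still be in flight and the client
        # would miss its result. Enqueue a sync so it runs after that task.
        await self.commands.put(SyncState(client_id))

    async def worker(self) -> None:
        while True:
            cmd = await self.commands.get()
            if isinstance(cmd, SyncState):
                await self.send(cmd.client_id, dict(self.state))
            else:
                # Stand-in for real work: mutate session state in order.
                self.state[cmd.name] = "done"
```

The key detail is that `on_reconnect` only enqueues; because the sync shares the queue with ordinary commands, ordering relative to the in-progress task comes for free.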
However, when implementing the plan, it forgot to actually make the relevant change to add the command to the queue. End-to-end tests caught this, and Claude's solution was to automatically do a state sync after every task. It did not implement what was written in the plan. I gave it a nudge to re-read the plan, which was enough to make it see the mistake and correct it.
Compared with asking a human co-worker to make the same change, the difference is stark. We are still some way off from superhuman coders.
If I use my IDE's LSP functions to do a large automated refactor, is the IDE better than me at coding?
There are many more elements to "coding" than "writing code", chief among them software design. As a software engineer I use Claude Code daily (I write maybe 1% of my total LOC by hand these days), but I still have to steer it: tell it which architecture to use, which abstractions to build, correct it when it tries to use a shortcut instead of solving a problem at the root, and so on.
When it can produce PRs which would pass code review on a competent software engineering team without that steering, we will have a superhuman coder.
I enjoyed the article and think it points at some important things, but agree with Stephen that it might not point to a useful distinction.
Purely anecdotally: I don't get absorbed into books easily (I very much enjoy reading, but don't get the level of immersion you describe), feel emotional conflict as two distinct feelings or thoughts warring in my mind, can have IFS conversations, etc. but am absolutely hopeless at multi-tasking, dividing my attention, etc.
Meanwhile, my wife is the polar opposite. She gets immersed in books, feels one emotion at a time, empathizes compulsively, etc. but is great at multi-tasking.
Maybe the threaded model just doesn't apply to multi-tasking, but that would surprise me. I would expect multi-tasking to be an obvious benefit of having a "multi-threaded" brain.
Then we'll need a "thought process tampering awareness" evaluation.
if AIs were completing 1 month long self contained software engineering tasks (e.g. what a smart intern might do in the first month)
This doesn't seem like a good example to me.
The sort of tasks we're talking about are extrapolations of current benchmark tasks, so it's more like: what a programming savant with almost no ability to interact with colleagues or search out new context might do in a month given a self-contained, thoroughly specced and vetted task.
I expect current systems will naively scale to that, but not to the abilities of an arbitrary intern because that requires skills that aren't tested in the benchmarks.
I think that Chinchilla provides a useful perspective for thinking about neural networks (it certainly turned my understanding on its head when it was published), but it is not the be-all and end-all of understanding neural network scaling.
The Chinchilla scaling laws are fairly specific to the supervised/self-supervised learning setup. As you mentioned, the key insight is that with a finite dataset, there's a point where adding more parameters doesn't help because you've extracted all the learnable signal from the data (and vice versa: past a point, more data doesn't help a model that's too small to absorb it).
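For reference, the parametric loss fit from the Chinchilla paper makes that tradeoff explicit (the constants below are the rough published fits, quoted from memory):

$$L(N, D) \approx E + \frac{A}{N^{\alpha}} + \frac{B}{D^{\beta}}$$

where $N$ is parameter count, $D$ is training tokens, $E$ is the irreducible loss, and the fitted exponents are roughly $\alpha \approx 0.34$ and $\beta \approx 0.28$. Once one of the two terms dominates, spending more on the other resource buys almost nothing, which is where the familiar ~20-tokens-per-parameter compute-optimal rule of thumb comes from.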
However, RL breaks that fixed-dataset assumption. For example, on-policy methods have a constantly shifting data distribution, so the concept of "dataset size" doesn't really apply.
There certainly are scaling laws for RL, they just aren't the ones presented in the Chinchilla paper. The intuition that compute allocation matters and different resources can bottleneck each other carries over, but the specifics can differ quite significantly.
And then there are evolutionary methods.
Personally, I find that the "parameters as pixels" analogy captures a more general intuition.