I've seen this scale to 100-person companies, and I think it scales much further.
I would describe the problem as follows: to reliably hire people who are 90th percentile at a skill, you (or someone you trust) need to be at least 75th percentile at that skill. To hire 99th percentile, you need to be at least 90th percentile. To avoid hiring total charlatans, you need to be at least 25th percentile. And so on.
25th percentile is a relatively small investment! 75th is more, but still often 1-2 weeks depending on the domain. It's almost certainly worth it to make sure your team has at least one person you trust who's 75th percentile at any task you repeatedly hire for, even if you're hiring contractors. And if it's a key competency for the company, it can be worth the work to get to 90th percentile so you can hire really outstanding people in that domain.
I'd say the vast majority of cases I know of, within the US, involve firing too late rather than too early. The reason is straightforward: it sucks to fire people, knowing that you'll potentially have a severe negative impact on their lives, so most managers will put it off as long as possible.
Broadly agree. One thing I'll add is that you should structure a "piece" around your points of highest uncertainty; a common mistake I see is companies iterating on the wrong thing. Real examples from my career:
Even in real assembly lines at physical factories, I get the impression that generalization has often worked well because it gives you the ability to change your process. Toyota is/was considered best-in-class, and a major innovation of theirs was having workers rotate across different areas, becoming more generalized and more able to suggest improvements to the overall system, with some groups rotating every 2 hours[1].
Tesla famously reduced automation around 2018 even where the robots' marginal cost was lower than human operators', again because the lost flexibility wasn't worth it[2]. Though it's worth noting they started investing more in robots again in recent years, presumably once their process was more solidified[3].
[1]: https://michelbaudin.com/2024/02/07/toyotas-job-rotation-policy/
[2]: https://theconversation.com/teslas-problem-overestimating-automation-underestimating-humans-95388
[3]: https://newo.ai/tesla-optimus-robots-revolutionize-manufacturing/
When I write code, I try to make most of it data-to-data transformations xor code that only takes in a piece of data and produces some effect (such as writing to a database). This significantly narrows the search space for a lot of bugs: either the data is wrong, or the do-things-with-data code is wrong.
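To make that concrete, here's a minimal sketch of the split (the `Order` type, the table schema, and the `db` handle are invented for illustration):

```python
from dataclasses import dataclass

@dataclass
class Order:
    price_cents: int
    quantity: int

# Pure data-to-data transformation: no I/O, trivial to unit test.
def total_cents(orders: list[Order]) -> int:
    return sum(o.price_cents * o.quantity for o in orders)

# Effectful code: takes data in, produces an effect, and contains no
# interesting logic of its own (here `db` could be a sqlite3 connection).
def write_total(db, customer_id: str, orders: list[Order]) -> None:
    db.execute(
        "INSERT INTO totals (customer_id, total_cents) VALUES (?, ?)",
        (customer_id, total_cents(orders)),
    )
```

If a number comes out wrong, the bug is either in the data you fed in or in `total_cents`; the database write is too boring to hide one.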
There are a lot of tricks in this reference class, where you try to structure your code to constrain the spaces where possible bugs can appear. Another example: when dealing with concurrency/parallelism, write the majority of your functions to operate on a single thread. Then have a separate piece of logic that coordinates workers/parallelism/etc. This is much easier to deal with than code that mixes parallelism and nontrivial logic.
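A rough sketch of that shape, assuming a simple records-in, records-out workload (the record fields are made up):

```python
from concurrent.futures import ThreadPoolExecutor

# All the nontrivial logic lives in a plain single-threaded function...
def process_record(record: dict) -> dict:
    return {**record, "total_cents": record["price_cents"] * record["quantity"]}

# ...and parallelism is confined to one boring coordinator that knows
# nothing about the domain.
def process_all(records: list[dict], max_workers: int = 8) -> list[dict]:
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(process_record, records))
```

A concurrency bug can then only live in the coordinator, which is a few lines long, rather than anywhere in the business logic.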
Based on what you described, writing code that constrains the bug surface area to begin with sounds like the next step – and, relatedly, figuring out where your codebase already does that, or where it doesn't but really should.
I think this is the Sanderson post: https://wob.coppermind.net/events/529/#e16670
For personal relationships, mitigating my worst days has been more important than improving the average.
For work, all that's really mattered is my really good days, and it's been more productive to invest time in having more great days, or using them well, than to bother with even the average ones.
I really enjoyed this study. I wish it weren't so darn expensive, because I would love to see a dozen variations of this.
I still think I've been more productive with LLMs since Claude Code + Opus 4.0 (and have reasonably strong data points), but this does push me further towards using LLMs only surgically rather than for everything, and towards recommending relatively restricted LLM use at my company.
It's really useful to ask the simple question "what tests could have caught the most costly bugs we've had?"
At one job, our code had a lot of math, and the worst bugs were when our data pipelines ran without crashing but gave the wrong numbers, sometimes due to weird stuff like "a bug in our vendor's code caused them to send us numbers denominated in pounds instead of dollars". This is pretty hard to catch with unit tests, but we ended up applying a layer of statistical checks that ran every hour or so and raised an alert if something was anomalous, and those alerts probably saved us more money than all other tests combined.
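The real checks were domain-specific, but a toy version of the idea looks something like this (the numbers and threshold here are made up):

```python
import statistics

def looks_anomalous(history: list[float], current: float, z_threshold: float = 4.0) -> bool:
    """Flag `current` if it's far outside the range of recent hourly totals."""
    mean = statistics.mean(history)
    stdev = statistics.stdev(history) or 1e-9  # avoid dividing by zero on flat history
    return abs(current - mean) / stdev > z_threshold

# Example: hourly revenue totals from the pipeline. A units mixup shows up
# as a sudden level shift even though nothing crashed.
recent = [10_200.0, 9_800.0, 10_050.0, 10_400.0, 9_950.0]
if looks_anomalous(recent, current=13_100.0):
    print("ALERT: hourly total looks off, check upstream data")
```

The point isn't the particular statistic; it's that the check runs on the pipeline's outputs, so it catches the "ran fine but produced the wrong numbers" failures that unit tests miss.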