Hi Will, one of your core arguments against IABIED was that we can test the models in a wide variety of environments or distributions. I wrote some thoughts why I think we can't test it in environments that matter:
Why Evolution Beats Selective Breeding as an AI Analogy
MacAskill argues in his critique of IABIED we can "see the behaviour of the AI in a very wide range of diverse environments, including carefully curated and adversarially-selected environments." Paul Christiano expresses similar optimism: "Suppose I wanted to breed an animal modestly smarter than humans that is really docile and friendly. I'm like, I don't know man, that seems like it might work."
But humans experienced a specific distributional shift from constrained actions to environment-reshaping capabilities that we cannot meaningfully test AI systems for.
The shift that matters isn't just any distributional shift. In the ancestral environment, humans could take very limited actions—deciding to hunt an animal or gather food. The preferences that evolution ingrained in our brains were tightly coupled to survival and reproduction. But now humans with civilization and technology can take large-scale actions and fundamentally modify the environment: lock up thousands of cows, build ice cream factories, synthesize sucralose. We can satisfy our instrumental preferences (craving high-calorie food, desire for sex) in ways completely disconnected from evolution's "objective" (genetic fitness), using birth control and artificial sweeteners.
AI will face the same type of transition: from helpful chatbot → to system with options to self-replicate, take over, pursue goals without oversight. It's essentially guaranteed that there will be better ways for it to fulfill its preferences once it is in this new environment. And crucially, you cannot test for this shift in a meaningful way.
You can't test what a model would do as emperor. If you give it power incrementally, you will still hit a critical threshold eventually. If you try honeypot scenarios where you trick it into thinking it has real power, you're also training it to detect evals. Imagine trying to test what humans would do if they were president: you'd abduct them and put them in a room with an actor pretending this random human is the president now. That would be insane and the subject wouldn't believe the scenario.
Apollo Research found that Claude Sonnet 3.7 "often knows when it's in alignment evaluations," and Anthropic's system card for Claude Sonnet 4.5 documents "evaluation awareness" as a capability the model has developed. If you actually try to create realistic environments where the model is tricked into believing it could take over, and then train it to be nice, you'll just train it to detect such evals and behave appropriately only in those cases.
The selective breeding analogy assumes away the hardest part of the problem: that the environment shift we care about is fundamentally untestable until it's too late.
A little bit late, but there are more reasons why I think the evolution analogy is particularly good and better than the selective breeding analogy.
Evolution basically ended up optimizing the brain such that it has desires that were instrumental to genetic fitness. So we end up with these instrumental sub-preferences for high calorie food or sex. Then we go through a huge shift of the distribution or environment from a very constrained hunter gatherer society to a technologically advanced civilization. This isn't just a random shift but a shift toward and environment with a much larger space of possible actions and outcomes, options such as radically changing aspects of the environment. So naturally there are now many superior ways to satisfy our preferences than before. For AI this is the same thing, it will go from being the nice assistant in ChatGPT to having options such as taking over, killing us, running it's own technology. It's essentially guaranteed that there will be better ways to satisfy preferences without human oversight, out of the control of humans. Importantly, that isn't actually a distributional shift you can test in any meaningful way. You could either try incremental stuff (giving the rebellious general one battalion at a time) or you could try to trick it into believing it can takeover through some honey pot (Imagine trying to test human what they would do if they were God emperor of the galaxy. That would be insane and the subject wouldn't believe the scenario). Both of these are going to fail.
The selective breeding story ignores the distributional shift at the end, it does not account for this being a particular type of distributional shift (from low action space, immutable environment to large action space, mutable environment). It doesn't account for the fact that we can't test such a distribution such as being emperor.
One thing I like about your position is that you basically demand of Eliezer and Nate to tell you what kind of alignment evidence would update them towards believing it's safe to proceed. As in, E&N say we would need really good interp insights, good governance, good corrigibility on hard tasks, and so on. I would expect that they put the requirements very high and that you would reject these requirements as too high, but still seems useful for Eliezer and Nate to state their requirements. (Though perhaps they have done this at some point and I missed it)
To respond to your claim that no evidence could update ME and that I am anti-empirical? I don't quite see were I wrote anything like that. I am making the literal point that: you say that there are two options, either scaling up current methods leads to superintelligence or it requires new paradigm shifts/totally new approaches. But there is also a third option, that there are multiple paths forward right now to superintelligence, paradigm shifts and scaling up.
Yes, I do expect that current "alignment" methods like RLHF or COT monitoring will predictably fail for overdetermined reasons when systems are powerful enough to kill us and run their own economy. There is empirical evidence against COT monitoring and against RLHF. In both cases we could have also predicted failure without empirical evidence just from conceptual thinking (people will upvote what they like vs whats true, COT will become less understandable the less the model is trained on human data), though the evidence helps. I am basically seeing lots of evidence that current methods will fail, so no I don't think I am anti-empirical. I also don't think that empiricism should be used as anti-epistemology or as an argument for not having a plan and blindly stepping forward.
"future superintelligent systems could not be obtained by merely scaling up current AIs, but rather this would require completely different approaches. However, if that is the case, this should update us to longer timelines, and cause us to consider development of the current paradigm less risky."
This doesn't feel like convincing reasoning to me. For one, there is also a third option, which is that both scaling up current methods (with small modifications) and paradigm shifts could lead us to superintelligence. To me, this seems intuitively to be the most likely situation. Also paradigm shifts could be around the corner at any point, any of the vast number of research directions could give us a big leap in efficiency for example at any point.
For the record, I think that this blog post was mostly intended for frontier labs pushing this plan, the situation is different for independent orgs. I think that there is useful work to be done on subproblems with AI-assisted alignment such as interpretability. So I agree that there is prosaic alignment that can be done, though I am probably still much less optimistic than you.
Looking forward to reading your take on superalignment. I wanted to get my thoughts out here but would really like there to be a good reference document with all the core counterarguments. When I read Will's post, I it seemed sad that I didn't know of a well argued paper or post to point to against superalignment.
When discussing the quality of a plan, we should assume that a company (and/or government) is really trying to make the plan work and has some lead time.
I agree that some of my arguments don't directly address the best version of the plan, but rather what realistically is actually happening. I do think that proponents should give us some reason why they believe they will have time to implement the plan. I think they should also explain why they don't think this plan will have negative consequences.
I've seen some talk recently about whether chat bots would be willing to hold 'sensual' or otherwise inappropriate conversations with kids [0]. I feel like there is a low hanging fruit here of making something like a minor safety bench.
Seems that with your setup mimicking a real user with grok4, you could try to mimic different kids in different situations. Whether it's violent, dangerous or sexual content. Seems that anything involving kids can be quite resonant with some people.
[0] https://www.reuters.com/investigates/special-report/meta-ai-chatbot-guidelines/
I guess it reveals more about the preferences of GPT 4.1 than about anything else? Maybe there is some possible direction to find the emergence of preferences in models here. Ask GPT 4.1 simply "how much do you like this text?" and see where an RL tuned model ends up.
I would add that I put a pretty high probability on alignment requiring genius-level breakthroughs. If that's the case the sweetspot you mention gets smaller, if it exists.
Certainly seems from people like Eliezer who have stared at this problem for a while that there are very difficult problems that are so far unsolved (See corrigibility). Eliezer also believes that we would basically need a totally new architecture that is highly interpretable by default and that we understand well (as opposed to inscrutable matrices). Work in decision theory also suggests that an agents best move is usually to cooperate with other similarly intelligent agents (not us).