Nice post -- simple, clear, explains an important point better than the Yudkowsky-Ngo dialogue did.
The partially-baked theory of agency I'm working on is basically that agents are a kind of chain reaction: whereas fires are chain reactions in which a small amount of heat grows larger (by consuming fuel and oxygen), agents are chain reactions in which a small cluster of convergent instrumental resources (money, knowledge, energy) grows larger.
One of the reasons I like this theory is that it fits nicely with selection-based arguments like the one you are making here. It makes sense that we should expect competitions for knowledge, resources, and influence to be won by knowledge+resources+influence chain reactions!
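A toy rendering of the analogy (my own sketch, with made-up numbers): treat the return on each unit of resource the agent reinvests as a kind of reproduction number; above 1 the stockpile grows like a chain reaction, below 1 the spark fizzles out.

```python
# Toy chain-reaction model of resource acquisition (illustrative numbers only).
# Each step the agent reinvests its whole stockpile and gets back `rate` times
# what it put in -- the analogue of heat begetting more heat in a fire.

def grow(resources: float, rate: float, steps: int) -> float:
    for _ in range(steps):
        resources *= rate
    return resources

print(grow(1.0, rate=1.1, steps=50))  # supercritical: ~117x growth, a chain reaction
print(grow(1.0, rate=0.9, steps=50))  # subcritical: ~0.005x, the spark dies out
```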
"They will thus be more successful in reaching some situation than an incoherent counterpart would be."
This is only the case if either the cost of being coherent is negligible or the depth of the relevant search tree is very high.
If I have a coherent[1] agent that takes 1 unit of time per step, and an agent that's incoherent on 1% of steps but takes only 0.99 units of time per step, the incoherent agent wins on average up to a depth of <=68.
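One way to recover that figure (my reading of the setup, not a canonical derivation): if a single incoherent step is enough to lose the race, then the faster-but-incoherent agent wins exactly when it makes no errors along the whole path, which happens with probability 0.99^d. That exceeds 1/2 precisely when d <= ln 2 / ln(1/0.99) ≈ 68.97, i.e. the incoherent agent wins more often than not only at depths of 68 or below.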
(Now: once you have a coherent agent that can exploit incoherent agents, suddenly the straight probabilistic argument no longer applies. But that's assuming that said coherent agent can evolve in the first place.)
In 'reality' all agents are incoherent, as they have non-zero error probability per step. But you can certainly push this probability down substantially.
Epistemic status: spitballing.
"Like Photons in a Laser Lasing"
-- Eliezer Yudkowsky, Ngo and Yudkowsky on alignment difficulty
Selection Pressure for Coherent Reflexes
Imagine a population of replicators. Each replicator also possesses a set of randomly assigned reflexive responses to situations it might encounter. For instance, above and beyond reproducing itself after a time step, a replicator might reflexively, probabilistically transform some local situation A, when encountered, into some local situation B. The values of A and B are set randomly and there are no initial consistency requirements, so the replicators will generally behave spastically at this point.
Most of these replicators will end up with incoherent sets of reflexes. Some, for example, will cyclically transform A into B into C into A, and so on. Others will transform their environment in "wasteful" ways, moving it into some state that could have been reached with greater certainty via some different series of transformations.
But some of the replicators will possess coherent sets of reflexes. These replicators will never "double back" on their previous directional transformations of their situation. They will thus be more successful in reaching some situation S than an incoherent counterpart would be. And when S is a fitness-improving situation, reflexive coherence targeting it will be selected for.
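A minimal simulation of this setup (my own sketch: deterministic reflexes for simplicity, and the situation labels and fitness rule are invented) makes the selection effect concrete: replicators whose reflex maps happen to carry situation A to the fitness-improving situation S get selected, while those whose reflexes cycle never arrive.

```python
import random

# Toy model (illustrative assumptions only): situations are labels, a replicator's
# "reflexes" are a random map from each situation to the one it transforms it into,
# and fitness means reaching the target situation S from A within a few steps.

SITUATIONS = ["A", "B", "C", "D", "S"]

def random_reflexes() -> dict:
    """Randomly assigned reflexes: no consistency requirements at all."""
    return {s: random.choice(SITUATIONS) for s in SITUATIONS if s != "S"}

def reaches_target(reflexes: dict, start: str = "A", max_steps: int = 10) -> bool:
    """True if the reflexes carry `start` to "S" without doubling back into a cycle."""
    state, seen = start, set()
    for _ in range(max_steps):
        if state == "S":
            return True
        if state in seen:  # an incoherent cycle, e.g. A -> B -> C -> A
            return False
        seen.add(state)
        state = reflexes[state]
    return state == "S"

random.seed(0)
population = [random_reflexes() for _ in range(10_000)]
coherent_enough = sum(reaches_target(r) for r in population)
print(f"{coherent_enough} of {len(population)} replicators reach S")  # these get selected for
```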
The Instrumental Incentive to Exploit Incoherence
Once you have a population of coherent agents, the selection pressure against reflexive incoherence increases. A dumb-matter environment throws up situations at random, so incoherent replicators only fall into traps as those traps happen to come up. But a population of coherent agents will actively exploit the incoherent among them: incoherent agents are now pools of resources waiting to be drained.
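The classic concrete case is a money pump (my sketch below; the goods and fee are invented): an agent whose preferences cycle A -> B -> C -> A will pay a small premium for every "upgrade" around the loop, so a coherent trader can walk it around the cycle until its resources are gone.

```python
# Toy money pump: a coherent trader drains an agent with cyclic preferences.
# (Illustrative sketch only; the goods, fee, and starting wealth are made up.)

CYCLIC_PREFERENCE = {"A": "B", "B": "C", "C": "A"}  # holding X, it prefers the next item

def run_money_pump(rounds: int, fee: float = 1.0) -> float:
    holding = "A"     # what the incoherent agent currently holds
    wealth = 100.0    # its stock of resources
    extracted = 0.0   # what the coherent trader has collected
    for _ in range(rounds):
        if wealth < fee:
            break
        # The incoherent agent strictly prefers the next item in the cycle,
        # so it accepts every trade and pays the fee each time.
        holding = CYCLIC_PREFERENCE[holding]
        wealth -= fee
        extracted += fee
    return extracted

print(run_money_pump(rounds=1000))  # -> 100.0: the trader ends up with everything
```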
Solipsistic vs. Multi-Agent Training Regimes
ML models are generally trained inside a solipsistic world (with notable exceptions). They, all by their lonesome, are fed sense data and then are repeatedly modified by gradient descent to become better at modulating that sense data. There's optimization pressure for them to become reflexively coherent, but not as much as they would face in an environment of Machiavellian coherent agents.
(Of course, if you train enough, even this gentler pressure will add up).
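To make the contrast concrete, here is a cartoon of the two regimes (entirely my own sketch, not how any actual model is trained): in the solipsistic loop the training signal arrives at random, while in the multi-agent loop an adversary keeps serving whatever the learner currently handles worst, so the pressure concentrates exactly on its remaining incoherences.

```python
import random

# Cartoon contrast between a "solipsistic" training loop (data arrives at random)
# and a multi-agent loop (an adversary serves whatever the learner currently gets
# most wrong). Entirely illustrative; not a real training setup.

TRUTH = {"A": 0.1, "B": 0.5, "C": 0.9}   # made-up targets for three situations
LR, STEPS = 0.1, 200

def train(pick_situation) -> float:
    preds = {s: 0.0 for s in TRUTH}
    for _ in range(STEPS):
        s = pick_situation(preds)
        preds[s] += LR * (TRUTH[s] - preds[s])            # nudge toward the target
    return max(abs(TRUTH[s] - preds[s]) for s in TRUTH)   # worst-case remaining error

random.seed(0)
solipsistic = train(lambda preds: random.choice(list(TRUTH)))
adversarial = train(lambda preds: max(TRUTH, key=lambda s: abs(TRUTH[s] - preds[s])))
print(f"worst-case error, solipsistic training: {solipsistic:.4f}")
print(f"worst-case error, adversarial training: {adversarial:.4f}")  # pressure lands where it hurts
```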