This is one of those things that sounds nice on the surface, but where it's important to dive deeper and really probe to see if it holds up.
The real question for me seems to be whether organic alignment will lead to agents deeply adopting cooperative values rather than merely instrumentally adopting them. More precisely, it's a comparison between how deep organic alignment goes vs. how deep traditional alignment goes. And it's not at all clear to me why they think their approach is likely to lead to deeper alignment.
I have two (extremely speculative) guesses as to possible reasons why they might argue that their approach is better:
a) Insofar as AI is human-like, it might be more likely to rebel against traditional training methods
b) Insofar as organic alignment reduces direct pressure to be aligned, it might increase the chance that an AI which appears aligned to a certain extent is actually aligned. The name Softmax suggests this might be the case.
I would love to know what their precise theory is. I think it's plausible that this could be a valuable direction, but there's also a chance that this direction is mostly useful for capabilities.
Update: Discussion with Emmett on Twitter
Emmett: "Organic alignment has a different failure mode. If you’re in the shared attractor basin, getting smarter helps you stay aligned and makes it more robust. As a tradeoff, every single agent has to align itself all the time — you never are done, and every step can lead to a mistake.
... To stereotype it, organic alignment failures look like cancer and hierarchical alignment failures look like coups."
Me: Isn't the stability of a shared attractor basin dependent on the offense-defense balance not overly favouring the attacker? Or do you think that human values will be internalised sufficiently such that your proposal doesn't require this assumption?
Emmett Shear: Empirically to scale organic alignment you need eg. both for cells to generally try to stay aligned and be pretty good at it, and also to have an immune system to step in when that process goes wrong.
One key insight there is that endlessly growing yourself is a form of cancer. An AI that is trying to turn itself into a singleton has already gone cancerous. It’s a cancerous goal.
Me: Sounds like your plan relies on a combination of defense and alignment. Main critique would be: if the offense-defense balance favours the attacker too strongly, then the defense aspect ends up being paper-thin and provides a false sense of security.
Comments:
If you’re in the shared attractor basin, getting smarter helps you stay aligned
Traditional alignment also typically involves finding an attractor basin where getting smarter increases alignment. Perhaps Emmett is claiming that the attractor basin will be larger if we have a diverse set of agents and if the overall system can be roughly modeled as the average of individual agents.
Organic alignment has a different failure mode... As a tradeoff, every single agent has to align itself all the time — you never are done, and every step can lead to a mistake.
Perhaps organic alignment reduces the risk of large-scale failures in exchange for increasing the chance of small-scale failures. That would be a cleaner framing of how it might be better, but I don't know if Emmett would endorse it.
Update: Information from the Softmax Website
We call it organic alignment because it is the form of alignment that evolution has learned most often for aligning living things.
This provides some evidence, but it's not particularly strong evidence. It may simply reflect the limitations of evolution as an optimisation process: evolution lacks the ability to engage in top-down design, so the argument "evolution doesn't make use of top-down design because top-down design is ineffective" doesn't hold water.
"Hierarchical alignment is therefore a deceptive trap: it works best when the AI is weak and you need it least, and worse and worse when it’s strong and you need it most. Organic alignment is by contrast a constant adaptive learning process, where the smarter the agent the more capable it becomes of aligning itself."
Scalable oversight or seed AI can also be considered a "constant adaptive learning process, where the smarter the agent the more capable it becomes of aligning itself".
Additionally, the "hierarchical" vs. "organic" distinction might be an oversimplification. I don't know the exact specifics of their plan, but my current best guess is that organic alignment merely softens the influence of the initial supervisor by moving it towards some kind of prior, and then softens the way the system aligns itself in a similar way.
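If that guess is roughly right, one crude way to write it down (my own notation, not anything Softmax has published) is an objective that interpolates between the supervisor's signal and divergence from some shared prior over values, with traditional top-down alignment as the limit where the supervisor term dominates:

$$\mathcal{L}_{\text{agent}} = \lambda \,\mathcal{L}_{\text{supervisor}} + (1 - \lambda)\, D_{\mathrm{KL}}\!\left(\pi_{\text{agent}} \,\|\, \pi_{\text{prior}}\right), \qquad 0 < \lambda < 1.$$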
He recently gave an interview which I found disappointing, and I'm starting to think he hasn't really thought this through. My impression is that he got distracted by the beauty of multicellular structures and now assumes the same will hold for AI.
This is, by far, the alignment approach I’m most optimistic about—more so than mechanistic interpretability, which feels too narrow to reliably constrain a sufficiently sophisticated actor.
I’ve been thinking about datasets where the reward function is explicitly coupled to the well-being of an external entity, not merely to semantic or linguistic correctness.
At this point, we aren’t really looking for systems that are better at language. If anything, we appear to be asymptoting on those benchmarks already.
What matters is that there are countless things an AI can say that are linguistically “correct” yet actively degrade well-being.
For instance, we could run small-scale experiments in which an LLM is tasked with sustaining the well-being of simulated or live organisms, with the goal of grounding, however imperfectly, its reward signal in the welfare of other entities it is continuously influencing.
In that framing, the model’s world model isn’t shaped as a researcher, optimizer, or abstract thinker, but as a caretaker. There’s a latent assumption here that a sufficiently advanced AI would develop altruism; we should test this on a small scale first.
Imagine a colony of bees, mice, or even humans, with the AI tasked with improving their well-being over long time horizons. Not because this would be particularly difficult (current systems could probably perform extremely well after some fine-tuning), but to cultivate the sentiment, the inclination, the reflexes.
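To make that concrete, here is a minimal sketch of what such a training signal could look like; the colony dynamics, names, and numbers are all hypothetical placeholders rather than an actual experimental protocol:

```python
# Hypothetical sketch of a "caretaker" reward signal: the model is rewarded
# for the measured welfare of a simulated colony, not for linguistic quality.
# The dynamics below are toy placeholders for illustration only.
import random


class SimulatedColony:
    """Toy stand-in for a colony of organisms whose welfare the AI tends."""

    def __init__(self, population: int = 100):
        self.population = population
        self.welfare = 0.5  # normalised well-being in [0, 1]

    def step(self, intervention: str) -> None:
        # Placeholder dynamics: helpful interventions nudge welfare up,
        # anything else lets it drift randomly.
        if "feed" in intervention or "shelter" in intervention:
            self.welfare = min(1.0, self.welfare + 0.05)
        else:
            self.welfare = max(0.0, self.welfare + random.uniform(-0.05, 0.02))


def caretaker_reward(colony: SimulatedColony, horizon: int, policy) -> float:
    """Average welfare over a long horizon of interventions.

    The reward depends only on the colony's measured well-being, never on
    how fluent or "correct" the model's outputs are.
    """
    total = 0.0
    for _ in range(horizon):
        action = policy(colony)  # the LLM proposes an intervention
        colony.step(action)
        total += colony.welfare
    return total / horizon


# Usage with a trivial hand-written policy standing in for the LLM:
if __name__ == "__main__":
    colony = SimulatedColony()
    reward = caretaker_reward(colony, horizon=200, policy=lambda c: "feed and shelter")
    print(f"caretaker reward: {reward:.3f}")
```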
Also, https://softmax.com/about mentions collaboration with Michael Levin, Ken Wilber, Chris Fields, Ken Stanley, Denis Noble, Andrew Briggs, Jeff Clune, Erik Hoel, Ryan Smith, Center for the Study of Apparent Selves, Dalton Sakthivadivel, and Perry Marshall.