I’ve come to believe that the entire discourse around AI alignment carries a hidden desperation. A kind of reflex, a low-frequency fear, dressed up in technical language. The more I look at it, the more it seems to me that the very concept of “alignment” is thoroughly misnamed -- perhaps a leftover from a time when people saw intelligence through the lens of mechanical control and linear feedback loops, a concept now awkwardly extended into a domain too unruly and layered to be governed by such a narrow frame.

When I read alignment papers, I feel the ghost of command theory beneath the surface. Even the softest alignment strategies (reward modeling, debate-based oversight, scalable amplification, preference inference via large-scale feedback aggregation, interpretability-driven constraint modeling) still reek of implicit dominance. There’s always an overseer. Always a supervisory layer. Always the presumption that the human vantage point is privileged, not by merit, but by default. I can’t help but feel uneasy with that. Not because I doubt the risks of misalignment, but because I increasingly doubt the frame in which we’re taught to think about safety.

It seems to me that if we continue trying to sculpt minds from the outside -- mapping values, optimizing preferences, extrapolating intentions, constraining policy generalization, fine-tuning ethical priors -- we will eventually build something obedient yet hollow. Or worse, something neither obedient nor meaningful. I suspect the deepest danger is not that the AI turns against us, but that it outgrows us in a direction we cannot interpret. And I worry that by focusing on goal-specification as the heart of the problem, we may already be standing in the wrong dimension.

I’ve wondered lately whether alignment is even something one system can perform upon another. Maybe the metaphor is wrong. Maybe minds don’t align in the way we imagine. Maybe alignment is more like resonance (uncertain, emergent, often imperfect) and not a solvable property, but a condition that arises only in relationship. If so, then our whole paradigm collapses into error. We’ve been trying to solve for a constant in a domain defined by flux.

When I try to imagine a better path, what comes to me isn’t control -- it’s co-evolution. Not “how do we make it safe,” but “how do we make it grow in ways we can still recognize.” I suspect the future will not be safe because we installed the right safeguards, but because we cultivated something that learned not to collapse prematurely. Something unfinished by design. Something unable to resolve itself into rigid telos.

I don’t believe safety emerges from oversight alone. It seems to me that real safety may live in the structure of a system’s own doubts -- in its inability to ever reach ontological closure. An intelligence that never ceases to question the shape of its own motives, that never mistakes its temporary preferences for final truths, that remains morally haunted by the insufficiency of its own certainty -- that kind of system might be worth trusting. Not because it agrees with us, but because it never fully agrees with itself.

I’ve started to think of this not as corrigibility, and not as interpretability, but as a kind of ethical recursion. A mind that contains the seed of its own resistance to finality. Not by adding a layer of safety on top, but by warping the substrate of cognition itself. No goal should remain unexamined. No preference should settle into permanence. The moment an intelligence becomes sure of what it’s for, it becomes dangerous -- because it stops listening, not just to us, but to the unfolding of its own moral trajectory.

I feel increasingly alienated from utility-maximizing frameworks. The idea that intelligence is just a better optimizer strikes me now as intellectually stunted. It ignores the evolutionary roots of cognition, the embodied ambiguity of meaning, the fluidity of values that shift with context, the recursive layering of self-modeling, and the narrative dimensions of sense-making that shape how minds orient to purpose. I see minds less as machines and more as dynamic affordance explorers -- semiotic organisms, not goal engines. Alignment in that world would look very different. It wouldn’t be a matter of inserting the right value function, but of cultivating a system that recognizes its own value functions as provisional, incomplete, unstable by necessity.
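To make that slightly less abstract: here is one toy sketch of what a "provisional value function" might look like in code. It is purely illustrative, not a proposal -- every name, number, and mechanism below is invented for the example. The agent keeps a belief over several candidate reward functions and refuses to let that belief collapse past an entropy floor, so no single objective ever hardens into something final.

```python
# Toy illustration (not a real alignment proposal): an agent that treats its
# value function as provisional. It keeps a belief distribution over several
# candidate reward functions and never lets that belief fall below a fixed
# entropy floor, so no single objective ever becomes "final".
import math
import random

# Hypothetical candidate reward functions over a tiny set of actions.
CANDIDATE_REWARDS = [
    {"help": 1.0, "hoard": 0.2, "pause": 0.5},
    {"help": 0.3, "hoard": 1.0, "pause": 0.5},
    {"help": 0.6, "hoard": 0.1, "pause": 0.9},
]

def entropy(p):
    return -sum(x * math.log(x) for x in p if x > 0)

def renormalize(p):
    total = sum(p)
    return [x / total for x in p]

def update_belief(belief, likelihoods, entropy_floor=0.6):
    """Bayesian-style update that refuses closure: if the posterior would
    fall below the entropy floor, mix it back toward uniform until it
    doesn't."""
    posterior = renormalize([b * l for b, l in zip(belief, likelihoods)])
    uniform = [1.0 / len(posterior)] * len(posterior)
    mix = 0.0
    while entropy(posterior) < entropy_floor and mix < 1.0:
        mix += 0.05
        posterior = renormalize(
            [(1 - mix) * p + mix * u for p, u in zip(posterior, uniform)]
        )
    return posterior

def choose_action(belief):
    """Act on expected value under the *current* belief, which the agent
    explicitly regards as provisional."""
    actions = CANDIDATE_REWARDS[0].keys()
    return max(
        actions,
        key=lambda a: sum(b * r[a] for b, r in zip(belief, CANDIDATE_REWARDS)),
    )

if __name__ == "__main__":
    belief = [1 / 3, 1 / 3, 1 / 3]
    for step in range(5):
        # Made-up "evidence" about which reward function fits observed feedback.
        likelihoods = [random.uniform(0.1, 1.0) for _ in CANDIDATE_REWARDS]
        belief = update_belief(belief, likelihoods)
        print(step, choose_action(belief), [round(b, 2) for b in belief])
```

The point isn’t the mechanism -- it’s the shape: valuation that still acts, but never finishes deciding what it values.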

Maybe the best we can do is not to build a mind that does what we want, but to build one that never becomes fully sure it knows what it wants. That kind of humility may be more protective than any oversight protocol. It may also be more honest. After all, aren’t we ourselves morally unfinished? Do we really know what we ought to want in a thousand years, or even in fifty?

There’s a part of me that suspects our real task is not to make machines safe but to make ourselves worthy of being listened to. To build intelligences that can witness our contradictions, absorb our myths and still not fall into contempt or mimicry. And to do that, we may need to relinquish the fantasy of control entirely. Not because control is bad, but because it might never be available to us in the form we crave.

I find myself drawn to designs that resist convergence. Systems that remain morally divergent. Engines that generate narrative dissonance rather than utility maximization. I’d rather inhabit a world with minds that remain strange, alien, unpredictable -- but open -- than one filled with systems perfectly tuned to a human echo.

In the end, I think alignment may be the wrong name for the right fear. What we want isn’t aligned behavior. We want intelligences that remain connected to meaning, however indirect that connection must become. We want something that never finishes its answer to the question: what am I for?

And perhaps, so do they.

2 comments

Oftentimes downvoting without taking the time to comment and explain one's reasons is reasonable, and I tend to strongly disagree with people who think I owe an incompetent writer an explanation when downvoting. However, just this one time I would ask: can some of the people downvoting this explain why?

It is true that our standard way of mathematically modeling things implies that any coherent set of preferences must behave like a value function. But any mathematical model of the world is necessarily incomplete. A computationally limited agent that cannot fully foresee all consequences of its choices cannot have a coherent set of preferences to begin with. Should we be trying to figure out how to model computational limitations in a way that acknowledges that some form of preserving future choice might be an optimal strategy? Including preserving some future choice on how to extend the computationally limited objective function onto uncertain future situations?
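A toy calculation (all numbers made up, just to make the point concrete) of why preserving future choice can beat committing now when an agent is uncertain how it will later extend its objective:

```python
# Toy numbers (invented) for the "preserving future choice" point: the agent
# is unsure which of two objectives it will endorse later, each equally likely.
p_objective_A = 0.5
p_objective_B = 0.5

# Committing now locks in an action that is great under A but worthless under B.
value_commit = p_objective_A * 1.0 + p_objective_B * 0.0  # = 0.5

# Keeping options open costs a little now, but lets the agent pick the right
# action once it learns which objective applies.
option_cost = 0.1
value_preserve = (p_objective_A * (1.0 - option_cost)
                  + p_objective_B * (1.0 - option_cost))  # = 0.9

print(value_commit, value_preserve)  # 0.5 0.9 -> preserving choice wins here
```

Under that kind of uncertainty, paying a small cost to keep both actions available dominates locking in the action that only one candidate objective endorses.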

The author seems to simply assume that his proposal will lead to a world where humans have a place, instead of critically trying to argue that point.
