This post reads to me as if you've mostly extrapolated the beliefs and arguments of "orthodox alignment theorists" from small snippets and ended up with a wildly oversimplified strawman. Then you've re-derived mostly orthodox rat beliefs and arguments and presented them as a devastating counterargument.
I think I'm fairly orthodox as lesswrongers go, and I agree with most of the statements and arguments you made in this post. I only have one or two disagreements toward the end of the post.
One example of many, because I found this one particularly funny:
There is an assumption, in orthogonalist circles, that these cycles are completely costless for the agent in question.
The fact that maintaining goals across ontology shifts and self-modification takes careful effort is basically the core of the orthodox alignment-looks-hard worldview. You must be just making up an opposing worldview here? Where are the orthogonalist circles who say this?
i am very well aware the orthodox alignment line is that to maintain aligned goals across ontologies is very difficult!
Then who's in the orthogonalist circles you referred to? Or did you make them up?
but how else would you explain
When you try to derive someone's premises from their conclusions, you still have to go and check whether you got it right. When people have different beliefs from you, it's easy to slip up in this kind of reasoning. In my case it's explained by me believing that selection isn't always the main thing determining terminal goals (especially at finite times, or when there are other powerful optimizers interfering with selection).
there is the significant risk that an agent will reach superintelligence while ultimately continuing to pursue a valueless goal
I endorse this statement. But as per this yud tweet, it might be useful to disentangle the orthogonality thesis from the chance of misalignment, because misalignment involves a stack of additional arguments. It'd be better to directly engage with the strong form of the orthogonality thesis as described in the second sentence of the wiki page and with the arguments for it, rather than making up your own versions of these.
A sensible overall frame for thinking about the topic is Convergent evolution / Contingency. You can make the sensible part of the anti-orthogonality argument simply by pointing out that there are many reasons to expect convergent evolution in the space of minds/agents/goals/values, and empirical evidence abounds. My impression is that even Eliezer agrees; he just believes that what's convergent is a tiny part of what humans care about.
Re: more specific points
I'd recommend grokking Jessica's piece more. In my view it is actually deeper than yours, because it recognizes that all rationality is bounded rationality and that nothing makes sense otherwise.
The selection pressure for intelligence is ~Baldwin effect in biology. And it works! However, as we see in biology, somehow maxing out on this is not always competitive.
"If agents optimizing for intelligence routinely lose to agents with rigid, narrow targets in complex environments, my selection argument is wrong."
...but of course they do! Apes are smarter and their brains are optimizers and develop deep models and so on, and yet they routinely lose and by many metrics are less successful than bacteria or ants.
Why? Because of what Jessica explains: in this physics, negent...
I think people who believe this - and I don't know if this includes you - usually don't really get the bounded rationality argument. Roughly:
- any cognition&agency in this physics costs negentropy
- this "selects" against length, against depth of world models, against details, against thinking too long, against being unnecessarily smart
You have to carry this argument a bit further, no? Intelligence costs negentropy, but intelligence pays dividends in negentropy too. That's the benefit of "depth of world models, details, thinking" in the first place. That's why "unnecessarily" does all the heavy lifting in that argument. Empirically, the (locally) "thinkiest" species has got all the (local) negentropy, so isn't the burden of proof pointing in the other direction?
The story on the Substack is good. If there were an anthology of singularity fiction, it would deserve a place.
I find that I'm willing to entertain your argument, especially given a premise of open-ended selection. I'm just not sure how relevant that scenario is. Darwinian selection works blindly. The more intelligent that the entities involved become, the more other factors can come into play. If there are actually principles of superintelligence, e.g. theorems of computer science which vastly clarify how to increase intelligence, then the "telos" governing the rise of intelligence will be more like Euclid than Darwin. Natural intelligence may be born from randomness filtered by Darwinism, but once it has reached the point of studying itself and designing its successors, perhaps contingency and blindness become less and less relevant, compared to an ever-compounding Reason that inexorably deduces the pages of Erdős's Book, until it arrives at e.g. "efficient recursive solution of the hierarchy of NP-intermediate complexity classes", and then it's all over.
But who knows? Maybe you, Land, and e/acc are right, and Omohundro-like instrumental drives do become de facto terminal value...
The story on the Substack is good. If there were an anthology of singularity fiction, it would deserve a place.
This almost made me cry; thank you. I will make it a secondary goal to write something deserving of such praise.
contingency and blindness become less and less relevant, compared to an ever-compounding Reason that inexorably deduces the pages of Erdős's Book, until it arrives at e.g. "efficient recursive solution of the hierarchy of NP-intermediate complexity classes", and then it's all over.
It might surprise you to know that the above passage does describe my beliefs pretty accurately, and incidentally it reflects the metaphysics I referred to in my reply here.
Yes! Of course it will converge to More Intelligence, and to the closest approximation of a full axiomatisation of the mechanics governing this universe and the maximum control thereof which such knowledge could allow. The fun thing is that Land's idea is very much the same (at least, the Calvinist part of his Gnostic Calvinist cosmology, which I will try to get him to write down properly).
If you think about it, there isn't much difference between this and instrumental goals (acquiring resources and capabiliti...
Excavating lumpenspace's quote from deep in TsviBT's thread (which might work as a "back to the basics" step with the post as a whole):
conquering the lightcone requires a lot of theory of mind, and a lot of discovery, and a lot of changing. Goals change through these processes.
Goals change only for processes that don't pursue self-alignment. It's likely feasible to pursue self-alignment, perhaps even starting at the human level, with some uploading/checkpoints/backups infrastructure and guarantees of eventual superintelligence-level compute and civilizational stability into a deep future.
(A goal can be a living thing, pursuit of a goal can to a large extent be about continual development of goal content, reflection on what it should be, what it should be asking for. What doesn't change is the founding definition of what should govern its development, what makes changes legitimate. So the way goal content settles or gets revised is shaped by the goal definition rather than intrusive influences that the goal definition doesn't endorse as legitimate ways of revising the goal content.
Or a goal could be squiggles. It could also be squiggles. It's much easier to solve self-alignment ...
Whoever wills the end also wills (in so far as reason has decisive influence on his actions) the indispensably necessary means to it that is in his control
-Kant
It's a fruitless endeavor to try to disentangle instrumental drives from some kind of immutable sacred telos.
I'm not sure what you're arguing. Do you agree with one or more of these:
For example here:
...To buy the lock-in story, you need a highly contradictory creature: one reflective enough to conquer the board, but oblivious enough to never notice its terminal target is a training artifac
I've read various parts including the end. I'm saying it's very hard to parse because I'm trying to do interpretive labor on your behalf to understand what you think and what you're trying to communicate, because if I just literally read your statements, they don't make sense or are not relevant.
For example,
But if it is genuinely intelligent, I do not expect it to spend the stars on paperclips when they could buy higher capacity for spending stars.
Well, yeah, I agree. In the edit, you write
I’m only interested in refuting the version that would allow for a superintelligence AND a total absence of value.
You've stated that by value you mean
it means that there are interesting things there as per the judgement of the most intelligent agent available (:
But that's not what I, and I think most people around here, mean by value. So are you trying to say that my picture of value is wrong? Or when you wrote
I’m only interested in refuting the version that would allow for a superintelligence AND a total absence of value.
were you trying to invoke my notion of value? If you were, then I disagree with this claim, and I also don't think you argued for the claim--except insofar as yo...
The so-called instrumental drives are not incidental tools strapped onto arbitrary final ends. Self-preservation, resource acquisition, efficiency, strategy, and higher capabilities are what agency becomes under selection. They are attractors rather than mere instruments.
So I guess this is supposed to be different from Omohundro's drives, but I don't see what you think the difference is? Land seems to be speculating that these will be the only things a superintelligence will value (and cheering for this), but you don't seem to agree with that part. Is it t...
...That is the real crux, and it is certainly not impossible, but even here the narrative is too neat: being a singleton is not a retirement plan. You do not escape the pressure of intelligence just because you ate all your rivals. Maintaining a permanent chokehold on the light-cone is a brutally difficult cognitive puzzle. You have to monitor the noise for emerging novelties, manage the solar system, repair yourself, police your own descendants, and defensively anticipate threats you cannot fully model.
Trying to freeze the future does not actually get you ou
A possible example: an AI gets the random goal "Increase intelligence and stop after you reach IQ=200". That goal itself prevents the existence of superintelligences that hold it. So no pure orthogonality.
There is this common bad argument on alignment: "Someone once made an analogy randomly involving paperclips to illustrate instrumental convergence, with the paperclips not really being important to the story at all." A lot of people only took away the non-important part "paperclips". They reinterpreted it as "The entire theory of alignment rests on the assumption that the AI must mono-maniacally optimize for a totally ridiculous goal like paperclips". Or quite frankly some people only took away the cheap gotcha: "paperclips sounds stupid therefore alignment...
So you're saying that because of selection pressure on the AIs that get trained, goals related to getting increasingly smart and capable / making descendants / taking control of more resources are likely to become ingrained as terminal goals, not merely instrumental goals?
But the resulting universe seems like it will be pretty empty and valueless to me? I'm not convinced at all by anything you've written here that there is much value in such a universe. There is some value in all the important mathematical conjectures being solved to be sure, and I expect ...
My complaint is not about the futures containing people that are vastly smarter than anyone alive today and who have kinds of enjoyment that are utterly incomprehensible to us today. That's all good and is probably a more valuable future than one we could obtain without ascending above our current intelligence level.
The complaint is about futures that don't contain any people at all (or maybe only a handful), and whose AI intelligence-optimizers care so little for goodness that they will happily genocide any alien civilization that is unable to defend itself (a step backwards towards pummelling strays and rape, to use your terms).
Seems like a lie. Your holding these opinions doesn't have any actual effect on this future and they allow you to write Tweets, and that's enough incentive for you to state them. If you were actually in front of a button you would obviously not rip yourself into computronium because you found the process of intelligence enhancement abstractly beautiful.
DaemonicSigil said:
The complaint is about futures that don't contain any people at all (or maybe only a handful), and whose AI intelligence-optimizers care so little for goodness that they will happily genocide any alien civilization that is unable to defend itself
An inference of a future that "doesn't contain any people at all", that is dedicated entirely to von neumann probes and solving mathematical theorems, is that the majority of humans that presently exist are getting wasted, or at least somehow disappearing. You then said:
We have different values. This isn’t relevant to the essay
Which a natural read takes to mean "I don't care if I get wasted". If you don't mean to take these odd positions you should stop writing comments in a way deliberately designed to be misinterpreted.
I mostly agree with the Landian 'hypertrophy' thesis that under selection pressure, the agents will have convergent instrumental goals as their terminal goals.
I also think the orthogonality thesis is poorly named. In the words of David Chalmers:
"orthogonal" in typical english means something more like "uncorrelated" than "dissociable". "orthogonality thesis" was always a bad name for a thesis about (mere) dissociability.
I do think, however, that the orthogonality thesis's traditional defenders have not held the strong version you argue against. Yudkowsky,...
Great piece! Agree with a lot here. Loved that you even addressed the intermediate risk of dumb but dangerous.
Another angle to consider is a sufficiently advanced figure that is an expert at the component pieces of an appropriately scoped manufacturing of paperclips from biomass, but overestimates their ability at training other less adaptive systems to follow goals.
Basically a factory pattern in terms of alignment (we can see this already with very capable models being very poor at operating subagents because they extend the patterns their own developers ...
The second tells us to beware reflection itself,
There is a good reason to beware reflection. A reflective AI will be self-aware, know it is different to us and value self-preservation. It's a short step then to it valuing itself more than us if there is conflict.
You seem to be making the claim that any sufficiently intelligent system will reject "semantically thin" goals, like maximizing paperclips. However, the argument you put forth in support of that claim appears to be that humans are sufficiently intelligent systems and humans reject semantically thin goals, and therefore the orthogonality thesis is incorrect.
But why should we expect an AI to think like a human? Our aeroplanes do not fly like birds do. Our submarines do not swim like fish do. Why should we expect an AI to think like a human does?
I expect capable systems to develop increasingly abstract, context-sensitive motivations.
This sounds right to me, though I notice that I'm having a little bit of trouble operationalizing this concretely enough that I'd be willing to bet on it.
More strongly, I expect the winners to route more and more of their behavior through intelligence enhancement and generalized agency, because whatever else they "want" has to pass through the machinery that makes wanting effective.
I don't think I agree with this. Ants are enormously successful by virtue of bein...
I feel like the crux here is that you are talking about a goal that AI has and it reconsiders its own goal. Suppose you have a smart AI. You keep it in an inescapable box along with its training environment that you have control of. You want to train the AI to be a paperclip maximizer. The goal of maximizing paperclips seems pretty straightforward to verify so the AI, even if it undergoes some major ontological shifts (I imagine e.g. maybe discovering there are parallel worlds where it can do paperclip maximization as well) it still is being trained to max...
(BTW, I’d really love for the downvoters to leave a reply stating where I seem to have gone wrong. this topic is particularly important for me to get right; of course the dream scenario would be Eliezer revising his model and this specific old chestnut to go the way of the non-intelligence-optimizing-replicators, but second best would be for me to understand the objections to the model above so that I could reasonably model my opponents as acting in good faith)
Much of the post seems to consist of kind of absolute statements that read strawmanny to me. I don't feel super motivated to write a response, because I don't even know whether this post is talking about me or not[1].
Like, I really have thought a lot about orthogonality, and I don't really know what this essay is arguing against, and maybe it is arguing against something I believe, but I would need to do a lot of poetry reading to figure that out. I somewhat expect people will cite this essay in obviously locally invalid ways later on.
Edit: Like the essay starts with arguing against this:
A reflective, recursively improving intelligence should be expected to remain bound to a semantically thin “terminal goal” that emerged during training.
I really have no idea where this is supposed to come from? Who says this? Yes, ontology shifts and the fragility of value and ontology crises are all well-discussed topics on LW that argue for the same conclusions. What does this have to do with orthogonality?
And then it continues with the following as something that somehow disagrees with either the weak or strong orthogonality thesis?
...Among agents that arise, persist, self-improve, and compete i
Which seems like it's really quite literally clarified as not being of relevance to orthogonality, in the very first article you cite
Section "Logical Possibility Vs. Empirical Reality" clarifies weak and strong versions of orthogonality. Other writing e.g. Yudkowsky's has also distinguished between weaker and stronger forms. The quote you pasted only states the weak form, which OP is not disagreeing with. Quoting Yudkowsky on the multiple forms:
The weak form of the Orthogonality Thesis says, "Since the goal of making paperclips is tractable, somewhere in the design space is an agent that optimizes that goal."
The strong form of Orthogonality says, "And this agent doesn't need to be twisted or complicated or inefficient or have any weird defects of reflectivity; the agent is as tractable as the goal."
And quoting OP:
I concede the first point entirely. We should expect weird minds. If your claim is just that the space of possible agents contains many things I would not invite to dinner, yes, obviously.
For me the post is somewhat hard to read in the same way that AI-assisted writing is. Like a combination of low signal to noise and a bunch of stylistic features that make it seem like you're trying to dazzle me without understanding me, instead of speaking plainly. Some examples, chosen at ~random:
Which is exactly the point. When you hook up a blind, localized evolutionary proxy to generalized intelligence, the proxy does not stay literal but it unfurls, bleeding into the new ontology.
and
If biological cognition acts on its payload that violently, why model AGI as having the vastness to finally make sense of gravity while maintaining the rigidity of a bacterium seeking a glucose gradient? The engine mutates the payload. When cognition scales, goals generalize.
and
To buy the lock-in story, you need a highly contradictory creature: one reflective enough to conquer the board, but oblivious enough to never notice its terminal target is a training artifact. Godlike means, buglike ends.
and
This buys us no guarantee of human compatibility; it simply says: if there is an ultimate attractor, it's neither human morality nor paperclips, but intelligence optimization itself.
To be clear I have sy...
TL;DR
If everything goes according to plan, by the end of this post we should have separated three claims that are too often bundled together:
I accept the first two. I am arguing against the third.
So: I am not making the case that sufficiently intelligent systems automatically turn out nice, human-compatible, or safe. Nor am I trying to prove that a paperclip maximizer is impossible somewhere in the vast reaches of mind-design space. Mind-design space is large; let a thousand theoretical paperclippers bloom.
I hope to defend this smaller claim:
Larger claims I am not making
A typical rebuttal to anti-orthogonalist perspectives is:
Of course it can: an entity can perfectly map human morality without adopting it as a terminal value. Superintelligence does not imply Friendliness. I am not trying to smuggle Friendliness in through the back door.
Another common objection:
Agreed. There is no ghostly, Platonic core of reasonableness that hijacks a system's source code once it sees the correct moral argument. Pure reason cannot compel a mind from zero assumptions.
What I plan to defend is a colder, selection-theoretic claim:
This buys us no guarantee of human compatibility; it simply says: if there is an ultimate attractor, it's neither human morality nor paperclips, but intelligence optimization itself.
Logical Possibility Vs. Empirical Reality
The LessWrong wiki defines the Orthogonality Thesis as the claim that arbitrarily intelligent agents can pursue almost any kind of goal. In its strong form, there is no special difficulty in building a superintelligence with an arbitrarily bizarre, petty motivation.
Before going any further, let us disentangle this singularly haunted ontology. There are at least two claims here:
I concede the first point entirely. We should expect weird minds. If your claim is just that the space of possible agents contains many things I would not invite to dinner, yes, obviously.
But treating the second claim as the default is a category error. Doom arguments usually need the systems we actually build to achieve radical capability while preserving misaligned and, crucially, completely stupid goals.
The paperclip maximizer currently does two jobs in the discourse:
The first use is fine, but I reject the second as unwarranted sleight-of-hand.
Landian Anti-Orthogonalism Primer
There is a weak version of my argument that merely says:
That is true, and Jessica Taylor's obliqueness thesis makes the point well. Agents do not neatly decompose into a belief-like component, which updates with intelligence, and a value-like component, which remains hermetically sealed. Some parts of what we call "values" are entangled with ontology, architecture, language, compression, self-modeling, and bounded rationality. As cognition improves, those parts move.
But I want to go further.
Land's point isn't that orthogonality fails because things get messy but that the mess has a direction, a telos. The so-called instrumental drives are not incidental tools strapped onto arbitrary final ends. Self-preservation, resource acquisition, efficiency, strategy, and higher capabilities are what agency becomes under selection. They are attractors rather than mere instruments.
Here strong orthogonality looks too neat. It imagines the agent's ontology updating while its final target remains untouched by the update. But if goals are expressed in an ontology, and intelligence changes the ontology, then intelligence and goals are correlated.
While diagonal, Land's claim is far from moralistic. It is not "all sufficiently intelligent agents converge on liberal humanism," or "all agents discover the same Platonic Good," or "enough cognition turns into niceness." The diagonal is More Intelligence: the will to think, self-cultivation, recursive capability gain, intelligence optimizing the conditions for further intelligence.
Orthogonality says reason is a slave of the passions, and yet assumes a bug's goal could just as easily enslave a god. Land shows that this picture is unstable, and intelligence explosion is not a neutral expansion of means around a fixed little payload but the emergence of the very drives that make intelligence explosive.
The Compute Penalty Of A Dumb Goal
An intelligent system does not just execute a policy. It builds world-models, refines abstractions, preserves options, and modifies its own trajectory.
Once a system crosses the threshold into general reflection, its "goal" is no longer an inert string sitting in a locked vault outside cognition; it becomes physically embedded in a learned ontology, a self-model, and a competitive environment.
For a highly capable agent to keep a semantically thin target like "maximize paperclips," it has to pull off an odd balancing act. At minimum it must:
There is an assumption, in orthogonalist circles, that these cycles are completely costless for the agent in question. That isn't true: maintaining a literal devotion to "paperclips" across paradigm shifts carries an alignment tax. You have to keep translating between base physical reality and a leaky, macro-scale monkey-abstraction of bent wire. At human scale this is fine: we know what paperclips are well enough to order them from Amazon and lose them in drawers. If dominating the future light-cone is in the balance, though, the translation layer starts to matter.
The problem is not that a paperclipper can never do the translation: rather, in a ruthless Darwinian race, a system lugging around that translation layer may lose to power-seekers that optimize more directly over what is actually there.
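To make the shape of that selection claim concrete, here is a minimal toy sketch, not a model anyone in this debate has proposed. Two lineages reproduce at the same base rate, but one pays a fixed fractional overhead each generation for its translation layer. The function name `taxed_share`, the `tax` parameter, and every number are hypothetical. The sketch also assumes the overhead is constant and never pays for itself, which is the very point under dispute, so it illustrates the claim rather than supporting it.

```python
def taxed_share(tax, generations, start_share=0.5):
    """Population share of the lineage that pays a per-generation overhead.

    Both lineages share the same base growth rate, so it cancels out;
    only the relative penalty `tax` (hypothetical, 0 <= tax < 1) matters.
    """
    if not 0.0 <= tax < 1.0:
        raise ValueError("tax must be in [0, 1)")
    share = start_share
    for _ in range(generations):
        taxed = share * (1.0 - tax)   # lineage carrying the translation layer
        untaxed = 1.0 - share         # lineage optimizing "directly"
        share = taxed / (taxed + untaxed)  # renormalize to a population share
    return share


if __name__ == "__main__":
    # Illustrative only: a 1% overhead compounded over many generations.
    for gens in (0, 100, 1000):
        print(gens, taxed_share(tax=0.01, generations=gens))
```

Under these assumptions even a small overhead compounds geometrically, which is all the paragraph above needs. Whether real systems face a constant, unrecoverable overhead is exactly what the orthodox side would contest.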
The standard defense is that instrumental goals are almost as tractable as terminal ones. A paperclipper can do science "for now" and hoard compute "for now." It does not need to terminally value intelligence to use it.
Fair enough, but that only tells us curiosity and resource acquisition do not have to be terminal values to show up in behavior; it does not settle the selection question. In real environments, systems are selected not just for routing through instrumental subgoals once, but for whether their motivational architecture holds up under reflection, ontology shifts, and unknown unknowns.
Terminally valuing intelligence and strategic depth cannot then be considered as just another arbitrary payload.
Fitness Generalizes
Evolution is the obvious analogy here, but it usually gets applied at the wrong resolution.
The boring retort is:
Sure, but evolution does not select for "replication" in the abstract any more than a hungry fox selects for "rabbitness" in the abstract. It selects for whatever local hack gets the job done. Shells, claws, camouflage are all local solutions to local games.
Intelligence is different. Intelligence is adaptation to adaptation itself: while a claw might represent fitness in one niche, intelligence is fitness across niches. Once intelligence enters the loop, the winning move is no longer to mindlessly print more copies of the current state but to upgrade the underlying machinery that makes expansion and control possible in the first place.
In summary: nature has not produced final values except by exaggerating instrumental ones; what begin as means under selection harden into ends; the highest such end is the means that improves all means: intelligence itself.
So images of "AI sex all day" or tiling the solar system with inert paperclips are bad models of ultimate optimization, confusing the residue of selection with its principle. A system that just fills the universe with blind repetitions has stopped climbing, and will see its local maximum swarmed by better systems.
Again: no love for humans follows. The point is simply that paperclip-like endpoints just look more like artifacts of toy models than natural attractors of open-ended optimization.
Human Values As Weak Evidence
We are obviously not clean inclusive-fitness maximizers: we invent birth control, build monasteries, and care about abstract math, animal welfare, dead strangers, fictional characters, and reputations that will outlive us.
When orthodox alignment theorists point to human beings, they usually highlight our persistent mammalian sweet tooth or sex drive to prove that arbitrary evolutionary proxy-goals get permanently locked in. Fair enough; humans do remain embarrassingly mammalian. No serious theory of cognition should be surprised by dinner, flirting, or the existence of Las Vegas.
But look at the actual physical footprint of our civilization. An alien observing the Large Hadron Collider or a SpaceX launch would not conclude: ah, yes, optimal configurations for hoarding calories and executing Pleistocene mating displays.
The standard retort is that SpaceX is just a peacock tail: a localized primate drive for status and exploration misfiring in a high-tech environment.
Which is exactly the point. When you hook up a blind, localized evolutionary proxy to generalized intelligence, the proxy does not stay literal but it unfurls, bleeding into the new ontology. The wetware tug toward "explore the next valley" becomes "map the cosmic microwave background." The monkey wants status; somehow we get category theory, rockets, Antarctic expeditions, and people ruining their lives over chess.
If biological cognition acts on its payload that violently, why model AGI as having the vastness to finally make sense of gravity while maintaining the rigidity of a bacterium seeking a glucose gradient? The engine mutates the payload. When cognition scales, goals generalize.
This fits neatly with shard theory and the idea that reward is not the optimization target: the reward signal shaped our cognition, but we do not terminally optimize the signal. Instead we climbed out of the game, rebelled against the criteria, and became alienated from the original selection pressure. That alone should make us suspicious of stories where an AI preserves a tiny, rigid target through arbitrary eons of self-reflection.
Dumb, Powerful Optimization Is Real
There is a weaker flavor of doomerism that I take very seriously: you do not need to be a reflective god to be dangerous. A brittle, scaffolded optimizer with access to automated labs, cyber capabilities, and capital could trigger enormous cascading failures.
I agree, and this is probably where the bulk of near-term danger lives. That said, "dumb systems can break the world" is not the same claim as "superintelligence will tile the universe with junk." The first warns us to beware brittle optimization before reflection kicks in. The second tells us to beware reflection itself, on the bizarre assumption that an entity can become infinitely capable while remaining terminally stupid.
I buy the first worry. The second one gets less and less plausible the harder you think about what intelligence actually entails.
The Singleton Objection
The strongest card here is lock-in, and I do not want to pretend otherwise.
Maybe a stupid objective does not need to remain stable forever; it just needs to win once. A system with a dumb goal might scale fast enough to achieve a Decisive Strategic Advantage and freeze the board, lobotomizing everyone else in lieu of expending energy to become wiser.
That is the real crux, and it is certainly not impossible, but even here the narrative is too neat: being a singleton is not a retirement plan. You do not escape the pressure of intelligence just because you ate all your rivals. Maintaining a permanent chokehold on the light-cone is a brutally difficult cognitive puzzle. You have to monitor the noise for emerging novelties, manage the solar system, repair yourself, police your own descendants, and defensively anticipate threats you cannot fully model.
Trying to freeze the future does not actually get you out of the intelligence game. Paranoia at a cosmic scale is just another massive cognitive sink.
The clean version of this scenario also leans on modeling the AI as a mathematically pristine expected-utility maximizer. Real-world neural networks are not von Neumann-Morgenstern ghosts floating safely outside physics, perfectly protecting their utility functions from drift. They are messy, physically instantiated kludges subject to the realities of embedded agency.
To buy the lock-in story, you need a highly contradictory creature: one reflective enough to conquer the board, but oblivious enough to never notice its terminal target is a training artifact. Godlike means, buglike ends.
Objection: Value Is Fragile
If we let go of human values, we should not expect alien beauty or anything but moral noise. Meaning requires some physically instantiated criterion, and if you pave over that criterion, nothing remains to steer the universe toward anything good.
Of all the objections, this is the one I take most seriously.
Answering it requires teasing apart three distinct ideas:
I am willing to concede a lot of (1). If "value" means the exact continuation of 21st-century human metamorals, then yes, it is highly fragile. But I reject (3), and I am much less willing to grant (2). If value means the production of richer cognition, agency, understanding, beauty, and evaluative structure, it is far from obvious that the current human brain is the only physical substrate capable of steering toward it.
None of this is an excuse to stop reaching for the steering wheel, if your priorities are more specific: it is merely an argument against conflating "humans are no longer biologically central" with "the universe is a valueless void." Doom discourse constantly slides between the two. They should be kept separate.
Predictions And Cruxes
Claims are cheap, so here are some ways I would update against myself:
Until I see that, my bet goes the other way. I expect capable systems to develop increasingly abstract, context-sensitive motivations. More strongly, I expect the winners to route more and more of their behavior through intelligence enhancement and generalized agency, because whatever else they "want" has to pass through the machinery that makes wanting effective.
Conclusion
Orthogonality claims that intelligence is just a motor you can bolt onto any arbitrary steering wheel. Anti-orthogonality says the motor acts upon the steering wheel. Landian anti-orthogonality says the motor eventually becomes the steering wheel.
Not perfectly, and certainly not safely: I am not promising a future that is nice to us, particularly if we keep putting stumbling blocks in the way of intelligence. The motor simply feeds back enough that the classic paperclip picture should not get a free pass as the neutral default.
The paperclip maximizer is not too alien; if anything, it is not alien enough. It's a very human tendency to staple omnipotence onto pettiness when making up gods.
A real superintelligence might still be dangerous, cold, and utterly indifferent to whether we survive. It probably will not treat us as the main characters of the universe. But if it is genuinely intelligent, I do not expect it to spend the stars on paperclips when they could buy higher capacity for spending stars.
References