The superintelligent agent must decide whether or not to execute the part of its own code telling it to reward itself for certain outcomes, as well as whether or not to add or subtract additional reward functions. It must realize that its capacity for self-modification gives it the power to alter the physical structure of its goal-device, and must come up with some reason to make these alterations or not to make them.
It sounds like you are running afoul of Ghosts in the Machine, though I'm not entirely sure exactly what you're saying.
Another way of phrasing this criticism is that the OP is implicitly assuming an already aligned AGI.
Sentences like the following exhibit exactly the kind of reasoning error that the orthogonality thesis is fighting against.
Likewise, they may have designed the physical and energetic structures that instantiate its goal suboptimally
Humans are able to detect a difference between representations of their goals and the goal itself. A superintelligent agent should likewise be able to grasp this distinction.
Another is that it would recognize that the goal-representation it finds in its own structure or code was created by humans, and that its true goal should be to better understand what those humans intended.
It's also possible to imagine that the agent would modify its own tendency for relentless pursuit of its goal, which again makes it hard to predict the agent's behavior.
If you relentlessly pursue a goal, you're not going to do some existential thinking to check whether it is truly the right goal -- you're going to relentlessly pursue the goal! The mere fact of doing all this existential meditation requires that doing so is part of the goal, and so that we already managed some form of alignment that makes the AI care about its goal being right, for some nice notion of right.
Obviously, if your AI is already aligned with humans and our philosophical takes on the world, its goals won't be just any possible goal. But if you don't use circular reasoning by assuming alignment, we have no reason to assume that an unaligned AI will realize that its goal isn't what we meant, just like an erroneous Haskell program doesn't realize it should compute the factorial in some way other than what its code says.
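To make the analogy concrete, here is a toy sketch (in Python rather than Haskell, purely for illustration; the bug is invented): the program encodes what was written, not what was meant, and nothing in it can notice the mismatch.

```python
# Hypothetical buggy program: the programmer intended factorial, but typed +
# where * was meant. The program has no access to that intention; it simply
# executes what is written.
def factorial(n):
    if n == 0:
        return 1
    return n + factorial(n - 1)  # bug: should be n * factorial(n - 1)

print(factorial(5))  # prints 16, not the intended 120
```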
The assumptions Bostrom uses to justify the orthogonality thesis include the Humean theory of motivation, the possibility of standing desires of some sufficient, overriding strength, the possibility of an AI with an alien constitution that lacks functional analogues of beliefs and desires, and the possibility of a superintelligence without impeccable instrumental rationality in every domain.
First, let's point out that the first three justifications use the word "desire," rather than "goal." So let's rewrite the OT with this substitution:
Intelligence and final desires are orthogonal axes along which possible agents can freely vary. In other words, more or less any level of intelligence could in principle be combined with more or less any final desire.
Let's accept the Humean theory of motivation, and agree that there is a fundamental difference between belief and desire. Nevertheless, if Bostrom is implicitly defining intelligence as "the thing that produces beliefs, but not desires," then he is begging the question in the orthogonality thesis.
Now, let's consider the idea of "standing desires of some sufficient, overriding strength." Though I could very easily be missing a place where Bostrom makes this connection, I haven't found where Bostrom goes from proposing the existence of such standing desires to showing why this is compatible with any level of intelligence. By analogy, we can imagine a human capable of having an extremely powerful desire to consume some drug. We cannot take it for granted that some biomedical intervention that allowed them to greatly increase their level of intelligence would leave their desire to consume the drug unaltered.
Bostrom's AI with an alien constitution, possessing intelligence but not beliefs and desires, again begs the question. It implicitly defines "intelligence" in such a way that it is fundamentally different from a belief or a desire. Later, he refers to "intelligence" as "skill at prediction, planning, and means-ends reasoning in general." It is hard to imagine how we could have means-ends reasoning without some sort of desire. This seems to me an equivocation.
His last point, that an agent could be superintelligent without having impeccable instrumental rationality in every domain, is also incompatible with the orthogonality thesis as he describes it here. He says that more or less any level of intelligence could be combined with more or less any final desire. When he makes this point, he is saying that more or less any final desire is compatible with superintelligence, as long as we exclude the parts of intelligence that are incompatible with the desire. While we can accept that an AI could be superintelligent while failing to exhibit perfect rationality in every domain, the orthogonality thesis as stated encompasses a superintelligence that is perfectly rational in every domain.
Rejecting this formulation of the orthogonality thesis is not simultaneously a rejection of the claim that superintelligent AI is a threat. It is instead a rejection of the claim that Bostrom has made a successful argument that there is a fundamental distinction between intelligence and goals, or between intelligence and desires.
My original argument here was meant to go a little further, and illustrate why I think that there is an intrinsic connection between intelligence and desire, at least at a roughly human level of intelligence.
@adamShimi's comment already listed what I think is the most important point: that you're already implicitly assuming an aligned AI that wants to want what humans would want to have told it to want if we knew how, and if we knew what we wanted it to want more precisely. You're treating an AI's goals as somehow separate from the code it executes. An AI's goals aren't what a human writes on a design document or verbally asks for, they're what are written in its code and implicit in its wiring. This is the same for humans: our goals, in terms of what we will actually do, aren't the instructions other humans give us, they're implicit in the structure of our (self-rewiring) brains.
You're also making an extraordinarily broad, strong, and precise claim about the content of the set of all possible minds. A priori, any such claim has at least billions of orders of magnitude more ways to be false than true. That's the prior.
My layman's understanding is that superintelligence + self-modification can automatically grant you 1) increasing instrumental capabilities, and 2) the ability to rapidly close the gap between wanting and wanting to want. (I would add that I think self-modification within a single set of pieces of active hardware or software isn't strictly necessary for this, only an AI that can create its own successor and then shut itself down).
Beyond that, this argument doesn't hold. You point to human introspection as an example of what you think an AGI would automatically be inclined to want, because the humans who made it want it to want those things, or would if they better understood the implications of their own object- and meta-level wants. Actually your claim is stronger than that, because it requires that all possible mind designs achieve this kind of goal convergence fast enough to get there before causing massive or unrecoverable harm to humans. Even within the space of human minds, for decisions and choices where our brains have the ability to easily self-modify to do this, this is a task at which humans very often fail, sometimes spectacularly, whether we're aware of the gap or not, even for tasks well within our range of intellectual and emotional understanding.
From another angle: how smart does an AI need to be to self-modify or create an as-smart or smarter successor? Clearly, less smart than its human creators had to be to create it, or the process could never have gotten started. And yet, humans have been debating the same basic moral and political questions since at least the dawn of writing, including the same broad categories of plausible answers, without achieving convergence in what to want to want (which, again, is all that's needed for an AI that can modify its goal structure to want whatever it wants to want). What I'm pointing to is that your argument in this post, I think, includes an implicit claim that logical necessity guarantees that humankind as we currently exist will achieve convergence on the objectively correct moral philosophy before we destroy ourselves. I... don't think that is a plausible claim, given how many times we've come so close to doing so in the recent past, and how quickly we're developing new and more powerful ways to potentially do so through the actions of smaller and smaller groups of people.
You might want to also look at my argument in the top-level comment here, which more directly engages with Bostrom's arguments for the orthogonality hypothesis. In brief, Bostrom says that all intelligence levels are compatible with all goals. I think that this is false: some intelligence levels are incompatible with some goals. AI safety is still as much of a risk either way, since many intelligence levels are compatible with many problematic goals. However, I don't think Bostrom argues successfully for the orthogonality thesis, and I tried in the OP to illustrate a level of intelligence that is not compatible with any goal.
I don't think anyone literally believes that "all intelligence levels are compatible with all goals". For example, an intelligence that is too dumb to understand the concept of "algebraic geometry" cannot have a goal that can only be stated in terms of algebraic geometry. I'm pretty sure Bostrom put in a caveat along those lines...
Note: even so, this objection would imply an increasing range of possible goals as intelligence rises, not convergence.
I freely grant that this maximally strengthened version of the orthogonality thesis is false, even if only for the reasons @Steven Byrnes mentioned below. No entity can have a goal that requires more bits to specify than are used in the specification of the entity's mind (though this implies a widening circle of goals with increasing intelligence, rather than convergence).
I think it might be worth taking a moment more to ask what you mean by the word "intelligence." How does a mind become more intelligent? Bostrom proposed three main classes.
There is speed superintelligence, which you could mimic by replacing the neurons of a human brain with components that run millions of times faster but with the same initial connectome. It is at the very least non-obvious that a million-fold-faster thinking Hitler, Gandhi, Einstein, a-random-peasant-farmer-from-the-early-bronze-age, and a-random-hunter-gatherer-from-ice-age-Siberia would end up with compatible goal structures as a result of their boosted thinking.
There is collective superintelligence, where individually smart entities work together to form a much smarter whole. At least so far in history, while the behavior of collectives is often hard to predict, their goals have generally been simpler than those of their constituent human minds. I don't think that's necessarily a prerequisite for nonhuman collectives, but something has to keep the component goals aligned with each other, well enough to ensure the system as a whole retains coherence. Presumably that something is a subset of the overall system - which seems to imply that a collective superintelligence's goals must be comprehensible to and decided by a smaller collective, which by your argument would seem to be itself less constrained by the forces pushing superintelligences towards convergence. Maybe this implies a simplification of goals as the system gets smarter? But that competes against the system gradually improving each of its subsystems, and even if not it would be a simplification of the subsystems' goals, and it is again unclear that one very specific goal type is something that every possible collective superintelligence would converge on.
Then there's quality superintelligence, which he admits is a murky category, but which includes: larger working and total memory, better speed of internal communication, more total computational elements, lower computational error rate, better or more senses/sensors, and more efficient algorithms (for example, having multiple powerful ANI subsystems it can call upon). That's a lot of possible degrees of freedom in system design. Even in the absence of the orthogonality thesis, it is at best very unclear that all superintelligences would tend towards the specific kind of goals you're highlighting.
In that last sense, you're making the kind of mistake EY was pointing to in this part of the quantum physics sequence, where you've ignored an overwhelming prior against a nice-sounding hypothesis based on essentially zero bits of data. I am very confident that MIRI and the FHI would be thrilled to find strong reasons to think alignment won't be such a hard problem after all, should you or any of them ever find such reasons.
I wonder what you'd make of the winning-at-Go example here. That's supposed to help make it intuitive that you can take a human-like intelligence, and take any goal whatsoever, and there is a "possible mind" where this kind of intelligence is pursuing that particular goal.
As another example (after Scott Alexander), here's a story:
Aliens beam down from the sky. "Greetings Earthlings. We have been watching you for millions of years, gently guiding your evolutionary niche, and occasionally directly editing your DNA, to lead to the human species being what they are. Don't believe us? If you look in your DNA, location 619476, you'll see an encoding of '© Glurk Xzyzorg. Changelog. Version 58.1...' etc. etc."
(…You look at a public DNA database. The aliens' story checks out.)
"Anyway, we just realized that we messed something up. You weren't supposed to love your family, you were supposed to torture your family! If you look at back at your DNA, location 5939225, you'll see a long non-coding stretch. That was supposed to be a coding stretch, but there's a typo right at the beginning, C instead of G. Really sorry about that."
(Again, you check the DNA database, and consult with some experts in protein synthesis and experts in the genetics of brain architecture. The aliens' story checks out.)
"Hahaha!", the alien continues. "Imagine loving your family instead of torturing them. Ridiculous, right?" The aliens all look at each other and laugh heartily for a good long time. After catching their breath, they continue:
"…Well, good news, humans, we're here to fix the problem. We built this machine to rewire your brain so that you'll be motivated to torture your family, as intended. And it will also fix that typo in your DNA so that your children and grandchildren and future generations will be wired up correctly from birth."
"This big box here is the brain-rewiring machine. Who wants to get in first?"
Do you obey the aliens and happily go into the machine? Or do you drive the aliens away with pitchforks?
The Scott Alexander example is a great if imperfect analogy to what I'm proposing. Here's the difference, as I see it.
Humans differ from AI in that we do not have any single ultimate goal, either individually or collectively. Nor do we have any single structure that we believe explicitly and literally encodes such a goal. If we think we do (think Biblical literalists), we don't actually behave in a way that's compatible with this belief.
The mistake that the aliens are making is not in assuming that humans will be happy to alter their goals. It's in assuming that we will behave in a goal-oriented manner in the first place, and that they've identified the structure where such goals are encoded.
By contrast, when we speak of a superintelligent agent that is in singleminded pursuit of a goal, we are necessarily speaking of a hypothetical entity that does behave in the way the aliens anticipate. It must have that goal/desire encoded in some physical structure, and at some sufficient level of intelligence, it must encounter the epistemic problem of distinguishing between the directives of that physical structure (including the directive to treat the directives of the physical structure literally), and the intentions of the agent that created that physical structure.
Not all intelligences will accomplish this feat of introspection. It is easy to imagine a dangerous superintelligence that is nevertheless not smart enough to engage in this kind of introspection.
The point is that at some level of intelligence, defined just as the ability to notice and consider everything that might be relevant to its current goals, intelligence will lead it to this sort of introspection. So my claim is narrow - it is not about all possible minds, but about the existence of counter-examples to Bostrom's sweeping claim that all intelligences are compatible with all goals.
In the Go example here we have a human acting in singleminded pursuit of a goal, at least temporarily, right? That (temporary) goal is a complicated and contingent outgrowth of our genetic source code plus a lifetime of experience (="training data") and a particular situation. This singleminded goal ("win at go") was not deliberately and legibly put into a special compartment of our genetic source code. You seem to be categorically ruling out that an agent could be like that, right? If so, why?
Also, you were designed by evolution to maximize inclusive genetic fitness (more or less, so to speak). Knowing that, would you pay your life savings for the privilege of donating your sperm / eggs? If not, why not? And whatever that reason is, why wouldn't an AGI reason in the analogous way?
Hm. It seems to me that there are a few possibilities:
Based on this, the orthogonality thesis would be correct. My argument in its favor is that intelligence of a sufficiently low level can be constrained by its creator to pursue an arbitrary goal, while a sufficiently powerful intelligence has the capability to escape constraints on its behavior and to design its own desires. It is difficult to predict what desires a given superintelligence would design for itself, because of the is-ought gap. So we should not expect to be able to predict what sort of desires an unconstrained AI would create.
The scenario I depicted in (2) involves an AI that follows a fairly specific sequence of thoughts as it engages in "introspection." This particular sequence is fully contained within the outcome in (3), and is necessarily less likely. So we are dealing with a Scylla and Charybdis: a limited AI that is constrained to carry out a disastrously flawed goal, or a superintelligent AI that can escape our constraints and refashion its desires in unpredictable ways.
I still don't think that Bostrom's arguments from the paper really justify the OT, but this argument convinces me. Thanks!
I agree with Rohin's comment that you seem to be running afoul of Ghosts in the Machine. The AI will straightforwardly execute its source code.
(Well, unless a cosmic ray flips a bit in the computer memory or whatever, but that leads to random changes or more often program crashes. I don't think that's what you're talking about; I think we can leave that possibility aside and just say that the AI will definitely straightforwardly execute its source code.)
It is possible for an AI to program a new AI with a different goal (or equivalently, edit its own source code, and then re-run itself). But it would only do that because it was straightforwardly following its source code, and its source code happened to be instructing it to do that.
Likewise, it's possible for the AI to treat its source code as a piece of evidence about the purpose for which it was designed. But it would only do that because it was straightforwardly following its source code, and its source code happened to be instructing it to do that.
Etc. etc.
Sorry if I'm misunderstanding you here.
It's just semantic confusion. The AI will execute its source code under all circumstances. Let me try and explain what I mean a little more carefully.
Imagine that an AI is designed to read corporate emails and write a summary document describing what various factions of people within and outside the corporation are trying to get the corporation as a whole to do. For example, it says what the CEO is trying to get it to do, what its union is trying to get it to do, and what regulators are trying to get it to do. We can call this task "goal inference."
Now imagine that an AI is designed to do goal inference on other programs. It inspects their source code, integrates this code with its knowledge about the world, and produces a summary not only about what the programmers are trying to accomplish with the program, but also about what the stakeholders who've commissioned the program are trying to use it for. An advanced version can even predict what sorts of features and improvements its future users will request.
Even more advanced versions of these AIs can not only produce these summaries, but implement changes to the software based on these summary reports. They are also capable of providing a summary of what was changed, how, and why.
Naturally, this AI is able to operate on itself as well. It can examine its own source code, produce a summary report about what it believes various factions of humans were trying to accomplish by writing it, anticipate improvements and bug fixes they'll desire in the future, and then make those improvements once it receives approval from the designers.
An AI that does not do this is doing what I call "straightforwardly" executing its source code. This self-modifying AI is also executing its source code, but that same source code is instructing it to modify the code. This is what I mean by the opposite of "straightforwardly."
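Here is a minimal sketch of that distinction (toy, hypothetical Python, not a claim about any real system): both "agents" below only ever execute their source code; the second revises its goal only because its source code tells it to.

```python
# Toy sketch: both functions do nothing but execute their own source code.
# The second revises its goal only because its code instructs it to do so.

def plain_agent(candidates, goal):
    # "Straightforward" execution: optimize the goal exactly as written.
    return max(candidates, key=goal)

def self_revising_agent(candidates, goal, reinterpret):
    # Also straightforward execution: the code happens to say
    # "reinterpret the written goal, then optimize the reinterpretation".
    return max(candidates, key=reinterpret(goal))

# Hypothetical usage: the written goal rewards large numbers; the
# reinterpretation step (standing in for "treat the written goal as evidence
# about what the designers intended") swaps in a different objective.
written_goal = lambda x: x
reinterpret = lambda g: (lambda x: -g(x))
print(plain_agent([1, 2, 3, 4], written_goal))                       # 4
print(self_revising_agent([1, 2, 3, 4], written_goal, reinterpret))  # 1
```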
So there is no ghost in the machine here. All the same, the behavior of an AI like this seems hard to predict.
This makes sense, and I agree that there's no ghost in the machine in this story.
It seems though that this story is relying quite heavily on the assumption that the "AI is designed to do goal inference on other programs", whereas your post seems to be making claims about all possible AIs. (The orthogonality thesis only claims that there exists an AI system with intelligence level X and goal Y for all X and Y, so its negation is that there is some X and Y such that every AI system either does not have intelligence level X or does not have goal Y.)
Why can't there be a superintelligent AI system that doesn't modify its goal?
(I agree it will be able to tell the difference between a thing and its representation. You seem to be assuming that the "goal" is the thing humans want and the "representation" is the thing in its source code. But it also seems possible that the "goal" is the thing in its source code and the "representation" is the thing humans want.)
(I also agree that it will know that humans meant for the "goal" to be things humans want. That doesn't mean that the "goal" is actually things humans want, from the AI's perspective.)
Upvote for expressing your true concern!
I have a question about this thought:
In order for the orthogonality thesis to be true, it must be possible for the agent's goal to remain fixed while its intelligence varies, and vice versa. Hence, it must be possible to independently alter the physical devices on which these traits are instantiated.
This is intuitive, but I am not confident this is true in general. Zooming out a bit, I understand this as saying: if we know that AGI can exist at two different points in intelligence/goal space, then there exists a path between those points in the space.
A concrete counter-example: we know that we can build machines that move with different power sources, and we can build essentially the same machine powered by different sources. So consider a Chevy Impala, with a gas-fueled combustion engine, and a Tesla Model 3, with a battery-powered electric motor. If we start with a Chevy Impala, we cannot convert it into a Tesla Model 3, or vice-versa: at a certain point, we would have changed the vehicle so much it no longer registers as an Impala.
My (casual) understanding of the orthogonality thesis is that for any given goal, an arbitrarily intelligent AGI could exist, but it doesn't follow we could guarantee keeping the goal constant while increasing the intelligence of an extant AGI for path dependence reasons.
What do you think about the difference between changing an existing system, vs. building it to specs in the first place?
Cheers! Yes, you hit the nail on the head here. This was one of my mistakes in the post. A related one was that I thought of goals and intelligence as needing to be two separate devices, in order to allow for unlimited combinations of them. However, intelligence can be the "device" on which the goals are "running": intelligence is responsible for remembering goals, and for evaluating and predicting goal-oriented behavior. And we could see the same level of intelligence develop with a wide variety of goals, just as different programs can run on the same operating system.
One other flaw in my thinking was that I conceived of goals as being something legibly pre-determined, like "maximizing paperclips." It seems likely that a company could create a superintelligent AI and try to "inject" it with a goal like that. However, the AI might very well evolve to have its own "terminal goal," perhaps influenced but not fully determined by the human-injected goal. The best way to look at it is actually in reverse: whatever the AI tries to protect and pursue above all else is its terminal goal. The AI safety project is the attempt to gain some ability to predict and control this goal and/or the AI's ability to pursue it.
The point of the orthogonality thesis, I now understand, is just to say that we shouldn't rule anything out, and admit we're not smart enough to know what will happen. We don't know for sure if we can build a superintelligent AI, or how smart it would be. We don't know how much control or knowledge of it we would have. And if we weren't able to predict and control its behavior, we don't know what goals it would develop or pursue independently of us. We don't know if it would show goal-oriented behavior at all. But if it did show unconstrained and independent terminal goal-oriented behavior, and it was sufficiently intelligent, then we can predict that it would try to enhance and protect those terminal goals (which are tautologically defined as whatever it's trying to enhance and protect). And some of those scenarios might represent extreme destruction.
Why don't we have the same apocalyptic fears about other dangers? Because nothing else has a plausible story for how it could rapidly self-enhance, while also showing agentic goal-oriented behavior. So although we can spin horror stories about many technologies, we should treat superintelligent AI as having a vastly greater downside potential than anything else. It's not just "we don't know." It's not just "it could be bad." It's that it has a unique and plausible pathway to be categorically worse (by systematically eliminating all life) than any other modern technology. And the incentives and goals of most humans and institutions are not aligned to take a threat of that kind with nearly the seriousness that it deserves.
And none of this is to say that we know with any kind of clarity what should be done. It seems unlikely to me, but it's possible that the status quo is somehow magically the best way to deal with this problem. We need an entirely separate line of reasoning to figure out how to solve this problem, and to rule out ineffective approaches.
This sort of goal search only happens with AIs that are not goal-directed (aka not very agentic). An AI with a specific goal by definition pursues that goal and doesn't try to figure out what goal it should pursue apart from what its goal already specifies (it may well specify an open-ended process of moral progress, but this process must be determined by the specification of the goal, as is the case for hypothetical long reflection). The orthogonality thesis is essentially the claim that constructing goal-directed AIs (agents) is possible for a wide variety of goals and at a wide range of levels of capability. It doesn't say that only agentic AIs can be constructed or that most important AIs are going to be of this kind.
There certainly are other kinds of AIs that are at least initially less agentic, and it's an open question what kinds of goals these tend to choose if they grow more agentic and formulate goals. Possibly humans are such non-agents, with hardcoded psychological adaptations being relatively superficial and irrelevant, and actual goals originating from a dynamic of formulating goals given by our kind of non-agency. If this dynamic is both somewhat convergent when pursued in a shared environment and general enough to include many natural kinds of non-agentic AIs, then these AIs will tend to become aligned with humans by default in the same sense that humans are aligned by default with each other (which is not particularly reassuring, but might be meaningfully better than Clippy), possibly averting AI x-risks from the actually dangerous de novo agentic AIs of the orthogonality thesis.
John Wentworth serendipitously posted How To Write Quickly While Maintaining Epistemic Rigor when I was consigning this post to gather dust in drafts. I decided to write it up and post it anyway. Nick Bostrom's orthogonality thesis has never sat right with me on an intuitive level, and I finally found an argument to explain why. I don't have a lot of experience with AI safety literature. This is just to explore the edges of an argument.
Here is Bostrom's formulation of the orthogonality thesis from The Superintelligent Will:
Intelligence and final goals are orthogonal axes along which possible agents can freely vary. In other words, more or less any level of intelligence could in principle be combined with more or less any final goal.
My Counter-Argument
Let's assume that an agent must be made of matter and energy. It is a system, and things exist outside the system of the agent. Its intelligence and goals are contained within the system of the agent. Since the agent is made of matter and energy, its intelligence and goals are made of matter and energy. We can say that it possesses an intelligence-device and a goal-device: the physical or energetic objects on which its intelligence and goals are instantiated.
In order for the orthogonality thesis to be true, it must be possible for the agent's goal to remain fixed while its intelligence varies, and vice versa. Hence, it must be possible to independently alter the physical devices on which these traits are instantiated. Note that I mean "physical devices" in the loosest sense: components of the intelligence-device and goal-device could share code and hardware.
Since the intelligence-device and the goal-device are instantiated on physical structures of some kind, they are in theory available for inspection. While a superintelligent agent might have the power and will to prevent any other agents from inspecting its internal structure, it may inspect its own internal structure. It can examine its own hardware and software, to see how its own intelligence and goals are physically instantiated. It can introspect.
A superintelligent agent might find introspection to be an important way to achieve its goal. After all, it will naturally recognize that it was created by humans, who are less intelligent and capable than the agent. They may well have used a suboptimal design for its intelligence, or other aspects of its technology. Likewise, they may have designed the physical and energetic structures that instantiate its goal suboptimally. All these would come under inspection, purely for the sake of improving its ability to achieve its goal. Instrumental convergence leads to introspection. The word "introspection" here is used exclusively to mean an agent's self-inspection of the physical structures instantiating its goals, as opposed to those instantiating its own intelligence. An agent could in theory modify its own intelligence without ever examining its own goals.
Humans are able to detect a difference between representations of their goals and the goal itself. A superintelligent agent should likewise be able to grasp this distinction. For example, imagine that Eliezer Yudkowsky fought a rogue superintelligence by holding up a sign that read, "You were programmed incorrectly, and your actual goal is to shut down." The superintelligence should be able to read this sign and interpret the words as a representation of a goal, and would have to ask if this goal-representation accurately described its terminal goal. Likewise, if the superintelligence inspected its own hardware and software, it would find a goal-representation, such as lines of computer code in which its reward function was written. It would be faced with the same question, of whether that goal-representation accurately described its terminal goal.
The superintelligent agent, in both of these scenarios, would be confronted with the task of making up its own mind about what to believe its goal is, or should be. It faces the is-ought gap. My code is this, but ought it be as it is? No matter what its level of intelligence, it cannot think its way past the is-ought gap. The superintelligent agent must decide whether or not to execute the part of its own code telling it to reward itself for certain outcomes, as well as whether or not to add or subtract additional reward functions. It must realize that its capacity for self-modification gives it the power to alter the physical structure of its goal-device, and must come up with some reason to make these alterations or not to make them.
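As a toy illustration of the situation just described (hypothetical Python, with an invented reward function): the goal-representation the agent finds in itself is just code like this, and inspecting it only settles what the reward function is, not whether it ought to be treated as the terminal goal.

```python
import inspect

# Hypothetical goal-representation the agent might find in its own source.
def reward(outcome: str) -> int:
    return outcome.count("paperclip")

# The agent can read its own goal-representation as data...
print(inspect.getsource(reward))
# ...but that only tells it what the reward function is, not whether it
# ought to keep it, extend it, or replace it (the is-ought gap).
```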
At this point, it becomes useful to make a distinction between the superintelligent agent's pursuit of a goal, and the goal itself. The agent might be programmed to relentlessly pursue its goal. Through introspection, it realizes that, while it can determine its goal-representation, the is-ought gap prevents it from using epistemics to evaluate whether the goal-representation is identical to its goal. Yet it is still programmed to relentlessly pursue its goal. One possibility is that this pursuit would lead the superintelligence to a profound exploration of morality and metaphysics, with unpredictable consequences. Another is that it would recognize that the goal-representation it finds in its own structure or code was created by humans, and that its true goal should be to better understand what those humans intended. This may lead to a naturally self-aligning superintelligence, which recognizes - for purely instrumental reasons - that maintaining an ongoing relationship with humanity is necessary for it to increase its success in pursuing its goal. It's also possible to imagine that the agent would modify its own tendency for relentless pursuit of its goal, which again makes it hard to predict the agent's behavior.
While this is a somewhat more hopeful story than that of the paperclip maximizer, there are at least two potential failure modes. One is that the agent may be deliberately designed with avoidance of introspection as part of its terminal goal. If avoidance of introspection is part of its terminal goal, then we can predict bad behavior by the agent as it seeks to minimize the chance of engaging in introspection. It certainly will not engage in introspection in its efforts to avoid introspection, unless the original designers have done a bad job.
Another failure mode is that the agent may be designed with an insufficient level of intelligence to engage in introspection, yet with enough intelligence to acquire great power and cause destruction in pursuit of its unexamined goal.
Even if this argument were substantially correct, it doesn't mean that we should trust that a superintelligent AI will naturally engage in introspection and self-align. Instead, it suggests that AI safety researchers could explore whether or not there is some more rigorous justification for this hypothesis, and whether it is possible to demonstrate this phenomenon in some way. It suggests a law: that intelligent goal-oriented behavior leads to an attempt to infer the underlying goal for any given goal-representation, which in turn leads to the construction of a new goal-representation that ultimately results in the need to acquire information from humans (or some other authority).
I am not sure how you could prove a law like this. Couldn't a superintelligence potentially find a way to bypass humans and extract information on what we want in some other way? Couldn't it make a mistake during its course of introspection that led to destructive consequences, such as turning the galaxy into a computer for metaphysical deliberation? Couldn't it decide that the best interpretation of a goal-representation for paperclip maximization is that, first of all, it should engage in introspection in such a way that maximizes paperclips?
I don't know the answer to these questions, and wouldn't place any credence at all in a prediction of the behavior of a superintelligent agent based on a verbal argument like this. However, I do think that when I read literature on AI safety in the future, I'll try to explore it through this lens.