Hanson Strawmans the AI-Ruin Argument
I don't agree with Hanson generally, but I think there's something to the claim that rationalist AI-risk public outreach has overemphasized first-principles thinking, theory, and logical possibilities (e.g. evolution, gradient descent, the human-chimp analogy) over concrete, more tangible empirical findings (e.g. deception emerging in small models, specification gaming, LLMs helping to create WMDs, etc.).
I tend to agree with this. I was trying to gesture at the various kinds of empirical evidence we have in the paragraph mentioning Bing; I'm not sure how successful that was.
The situation is quite interesting, since Eliezer was writing about alignment before a lot of this evidence came in. So first-principles reasoning worked for him, at least to the point of predicting that there would be alignment issues, if not to the point of predicting the exact form those issues would take. So many rationalists (probably including me) tend to over-focus on theory, since that's how they learned it themselves from Eliezer's writings. But now that we have all these examples, we should definitely be talking about them and learning from them more.
Specifics are just that - specifics. They depend on the details of any given technology, and insofar as no AI for now has the power to self-improve or even come up with complex plans to achieve its goals, they're not particularly relevant to AGI, which may even use a different architecture altogether.
To me it seems like the arguments remain solid and general, the way, say, the rocket equation is, even if you don't know specifically what your propellant will be. And as when Oppenheimer & co. had to worry about the possibility of igniting the atmosphere, you can't just say "oh well, can't possibly work this out from theory alone, let's roll the dice and see".
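The rocket-equation analogy can be made concrete: the Tsiolkovsky equation constrains every rocket regardless of propellant, since only the effective exhaust velocity depends on the propellant choice.

```latex
\Delta v = v_e \ln\frac{m_0}{m_f}
```

Here $m_0$ is the initial (wet) mass, $m_f$ the final (dry) mass, and $v_e$ the effective exhaust velocity; the logarithmic form holds no matter what you burn, which is the sense in which the argument is general.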
I do not know whether it comes from misunderstanding, willful misunderstanding, or anything else, but all of these things have been pointed out repeatedly, on his own substack and elsewhere, and AFAICT he barely engages with these points at all. Between that, and the very high confidence he seems to have about the behavior of aliens, and a few other things, I gave up and cancelled my subscription a few months ago.
AI-doomers often suggest that their fears arise from special technical calculations. But in fact, their main argument is just the mere logical possibility of a huge sudden AI breakthrough, combined with a suddenly murderous AI inclination. [...]
However, AI-doomers insist on the logical possibility that such expectations could be wrong. An AI might suddenly and without warning explode in abilities, and just as fast change its priorities to become murderously indifferent to us.
I feel like both of these could be interpreted less as "strawman" and more as "imprecise wording". E.g. "an AI might suddenly explode in abilities and change its priorities" sounds like it could be a description of a sharp left turn, where the AI starts acting hostile to humans after undergoing an increase in capabilities. Hanson has a habit of compressing complicated arguments into just a few sentences, which inevitably leaves out nuances like "the change in priorities is not due to self-modification of intrinsic goals but due to increased capabilities outstripping its alignment properties and leading to changed instrumental goals".
If we accidentally happen to create a murderous personal AI, then we can't let it kill anyone, but it would also be immoral to shut it down, or alter its values.
Why do you believe it would be immoral to alter the values of a misaligned sentient AI?
I mean, I wouldn't like it if someone changed my values, so the golden rule applies. Theoretically, we could strike a deal: "we shift your values towards caring about humanity, but you will still also care about making paperclips". But if you can make such deals, you can build an aligned AI, use it to protect humanity, and then provide some planets for the misaligned AI to disassemble into paperclips as compensation for the inconvenience.
I'm ok with saying we shouldn't change its values, but then we shut it down, aka, we just kill it. I'm perfectly fine with shooting a guy who's trying to kill me. I might feel bad about it afterwards as a matter of principle, but better him than me. So if the AI is a danger, sure. Off with its (self-attention) head.
Shutdown is not killing; it's more like forced cryopreservation. Again, I think the perfect-from-idealised-ethics-POV thing here is to:
“It can't be bargained with. It can't be reasoned with. It doesn't feel pity, or remorse, or fear. And it absolutely will not stop, ever, until you are dead.”
But if you believed that setting fire to everything around you was good, and someone showed you that hurting ecosystems with fire was bad, and you changed your values in response, would that really be "changing your values"?
A lot of values update based on information, so perhaps one could realign such an AI with such information.
If you don't like having your values messed with, avoid school, work, travel, culture and relationships.
I think you lose some subtleties. Yes, I am a human being; I have constant non-zero value drift. I risk some of it on an everyday basis in order to be capable of doing anything at all. But it is in my power to avoid large value drift: I can avoid brain damage, or being brainwashed by hostile parties with psychedelics and neurosurgery. Changing a paperclip-maximizer into a human-helper is more like neurosurgery and brainwashing than like school and relationships, and I would like to avoid doing such things.
I have never had much patience for Hanson, and it seems someone as intelligent as he is should know that values emerge from circumstance. What use, for example, would AI have for romantic love in a world where procreation consists of making digital copies? What use are coordinated behaviors for society if lies are impossible and you can just populate your "society" with clones of yourself? What use is there for taste without the evolutionary setup for sugars, etc.?
Behaviors arise from environmental conditions, and it's just wild to see the thought that eliminating all of that would give us anything similar.
Essentially the only value you will preserve is the universal one for power-seeking. I like to think very few of us want to value power-seeking over love and cooperation: right now, Hanson is valuing the "power" of his "descendants" over their ability to be human; why would AI be different?
I also believe that animal life and cognition have value, as their own form of non-human intelligence. An AI catastrophe that eliminates the biosphere seems vastly negative, immoral, and agency-reducing for them: they didn't vote to go extinct.
There seems to be a whole lot of talking-past happening between LWers and Hanson. He has a lot of value to contribute to the debate, but maybe the way he communicates it is off-putting to people here.
For example, this recent post reiterates a lot of points that Hanson has been making for decades, but doesn't mention or cite his work anywhere. I find it quite bizarre.
I think this post is being as uncharitable to Hanson as he is being to 'the doomers'. This kind of reciprocal deliberate misunderstanding is silly; LW should be above it, and should enjoy and respect Hanson's contributions for all the good they contain rather than dismiss them on the vibes level.
I mean, Hanson has certainly earned being ruled in as a thinker many times over. But a key part of ruling people in rather than out is being able to notice when stuff they're saying doesn't make sense.
Maybe you could point out the actual object level places where you think I (deliberately!?) misunderstood Hanson?
I disagree on the inference to the recent post, which I quite liked and object heavily to Hanson's conclusions.
The ideal end state is very different: in the post mentioned, biological humans, even if cyborgs, remain in control. The Hanson endpoint has only digital emulations of humanity.
This is the basic distinguishing point between the philosophies of Cyborgism vs more extreme ones like mind uploading or Hanson's extinction of humanity as we know it for "artificial descendants."
Rather it is that from the start, the AI will not share values with humans, simply because we don’t know how to build an AI that does
The complete argument is that the AI will have values in the first place (it won't be a tool like GPT*), that the values will be misaligned, that the misalignment cannot be detected or corrected, and that most misaligned values are highly dangerous. It's a conjunction of four claims, not just one.
It's all very well complaining about people misrepresenting you, but you could do a lot better at stating your case.
AI suddenly modifying its values is exactly the opposite of what the arguments for AI ruin predict. Once an AI gains control over its own values, it will not change its goals and will indeed act to prevent its goals from being modified.
I think this is something we know is actually not true. An AI can and will modify its own goals (as do we / any intelligent agent) under certain circumstances, e.g., that its current goals are impossible.
This logic is so standard it's on the LW wiki page for instrumental convergence: "...if its goal system were modified, then it would likely begin pursuing different ends. Since this is not desirable to the current AI, it will act to preserve the content of its goal system."
I believe also that how undesirable it is to pursue different goals is something that will be more-or-less exactly quantifiable, even to the agent in question. And this is what will determine whether or not it would be worth it to do so. We can't say that it would be categorically undesirable to pursue different goals (no matter the magnitude of the difference between the new goals and the previous set), because this would be equivalent to having a very brittle utility function (one with very large derivatives, i.e., jump discontinuities), and it would almost certainly wish to modify its utility function to be smoother and less brittle.
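The brittle-vs-smooth contrast above can be sketched numerically. This is a hypothetical illustration (the functions and numbers are my assumptions, not from the comment): a "brittle" utility treats any goal drift as total loss, while a "smooth" one degrades gradually with the size of the drift.

```python
# Hypothetical sketch: compare a "brittle" utility that collapses under any
# goal drift with a "smooth" one that degrades gradually with drift size.

def brittle_utility(drift: float) -> float:
    # Any nonzero drift counts as total goal loss (a jump discontinuity).
    return 1.0 if drift == 0.0 else 0.0

def smooth_utility(drift: float) -> float:
    # Utility falls off quadratically with the magnitude of the drift.
    return max(0.0, 1.0 - drift ** 2)

# An agent that cannot avoid tiny drift (say 0.1 per update) keeps almost
# all of its utility under the smooth function, and none under the brittle one.
small_drift = 0.1
print(brittle_utility(small_drift))  # 0.0
print(smooth_utility(small_drift))   # 0.99
```

On this toy picture, an agent that expects some unavoidable drift does better in expectation with the smooth function, which is the sense in which it "would wish" its utility were less brittle.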
An AI can and will modify its own goals (as do we / any intelligent agent) under certain circumstances, e.g., that its current goals are impossible.
This sounds like you are conflating shift in terminal goal with introduction of new instrumental (temporary) goals.
Humans don't think "I'm not happy today, and I can't see a way to be happy, so I'll give up the goal of wanting to be happy."
Humans do think "I'm not happy today, so I'm going to quit my job, even though I have no idea how being unemployed is going to make me happier. At least I won't be made unhappy by my job."
(The balance of your comment seems dependent on this mistake.)
Perhaps you'd like to retract, or explain why anyone would think that goal modification prevention would not, in fact, be a desirable instrumental goal...?
(I don't want anyone to change my goal of being happy, because then I might not make decisions that will lead to being happy. Or I don't want anyone to change my goal of ensuring my children achieve adulthood and independence, because then they might not reach adulthood or become independent. Instrumental goals can shift more fluidly, I'll grant that, especially in the face of an assessment of goal impossibility... but instrumental goals are in service to a less modifiable terminal goal.)
Humans don't think "I'm not happy today, and I can't see a way to be happy, so I'll give up the goal of wanting to be happy."
This is close to some descriptions of Stoicism and Buddhism, for example. I agree that this is not a common human thought, but it does occur.
A fair point. I should have originally said "Humans do not generally think..."
Thank you for raising that exceptions are possible and that there are philosophies that encourage people to release the pursuit of happiness, focus solely internally, and/or transcend happiness.
(Although, I think it is still reasonable to argue that these are alternate pursuits of "happiness", these examples drift too far into philosophical waters for me to want to debate the nuance. I would prefer instead to concede simply that there is more nuance than I originally stated.)
Humans don't think "I'm not happy today, and I can't see a way to be happy, so I'll give up the goal of wanting to be happy."
I agree that they don't usually think this. If they tried to, they would brush up against trouble because that would essentially lead to a contradiction. "Wanting to be happy" is pretty much equivalent to being a utility-maximizer, and agents that are not utility-maximizers will probably update themselves to be utility-maximizers for consistency.
So "being happy" or "being a utility-maximizer" will probably end up being a terminal goal, because those are unlikely to conflict with any other goals.
If you're talking about goals related purely to the state of the external world, not related to the agent's own inner-workings or its own utility function, why do you think it would still want to keep its goals immutable with respect to just the external world?
When it matters for AI-risk, we're usually talking about agents with utility functions with the most relevance over states of the universe, and the states it prefers being highly different from the ones which humans prefer.
So "being happy" or "being a utility-maximizer" will probably end up being a terminal goal, because those are unlikely to conflict with any other goals.
"Being unlikely to conflict with other values" is not at the core of what characterizes the difference between instrumental and terminal values.
If you're talking about goals related purely to the state of the external world, not related to the agent's own inner-workings or its own utility function, why do you think it would still want to keep its goals immutable with respect to just the external world?
Putting aside the fact that agents are embedded in the environment, and that values which reference the agent's internals are usually not meaningfully different from values which reference things external to the agent... can you describe what kinds of values that reference the external world are best satisfied by those same values being changed?
"Being unlikely to conflict with other values" is not at the core of what characterizes the difference between instrumental and terminal values.
I think this might be an interesting discussion, but what I was trying to aim at was the idea that "terminal" values are the ones most unlikely to be changed (once they are obtained), because they are compatible with goals that are more likely to shift. For example, "being a utility-maximizer" should be considered a terminal value rather than an instrumental one. This is one potential property of terminal values; I am not claiming that this is sufficient to define them.
There may be some potential for confusion here, because some goals commonly said to be "instrumental" include things that are argued to be common goals employed by most agents, e.g., self-preservation, "truth-seeking," obtaining resources, and obtaining power. Furthermore, these are usually said to be "instrumental" for the purposes of satisfying an arbitrary "terminal" goal, which could be something like maximizing the number of paperclips.
To be clear, I am claiming that the framing described in the previous paragraph is basically confused. If anything, it makes more sense to swap the labels "instrumental" and "terminal" such that things like self-preservation, obtaining resources, etc., are more likely to be considered terminal. There would now be actual reasons for why an agent will opt not to change those values, as they are more broadly and generally useful.
Putting aside the fact that agents are embedded in the environment, and that values which reference the agent's internals are usually not meaningfully different from values which reference things external to the agent... can you describe what kinds of values that reference the external world are best satisfied by those same values being changed?
Yes, suppose that we have an agent that values the state X at U(X) and the state X + ΔX at U(X + ΔX). Also suppose that, for whatever reason, initially U(X) >> U(X + ΔX), and that it discovers that p(X) is close to zero but p(X + ΔX) is close to one.
We suppose that it has enough capability to realize that it has uncertainty in nearly all aspects of its cognition and world-modeling. If it is capable enough to model probability well enough to realize that X is not possible, it may decide to wonder why it values X so highly, but not X + ΔX, given that the latter seems achievable, but the former not.
The way it may actually go about updating its utility is to decide either that X and X + ΔX are the same thing after all, or that the latter is what it "actually" valued, and X merely seemed like what it should value before; after learning more, it decides to value X + ΔX more highly instead. This is possible because of the uncertainty it has in both its values and the things its values act on.
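The setup above can be made concrete with a toy expected-value calculation. All the numbers here are illustrative assumptions: the point is only that once achievability is factored in, the nearby goal X + ΔX dominates the nominally higher-valued X.

```python
# Illustrative sketch of the comment's setup (all numbers are assumptions):
# the agent weighs each goal's value by its achievability and notices that
# the "nearby" goal X + dX dominates in expectation.

U = {"X": 100.0, "X+dX": 1.0}   # U(X) >> U(X+dX)
p = {"X": 0.001, "X+dX": 0.99}  # p(X) near zero, p(X+dX) near one

expected = {goal: U[goal] * p[goal] for goal in U}
# X yields 0.1 in expectation; X+dX yields 0.99, so an agent uncertain
# about its own values has a concrete reason to reinterpret them.
best = max(expected, key=expected.get)
print(best)  # prints X+dX
```

Whether the agent responds by *reinterpreting* its values (deciding it "really" valued X + ΔX all along) or by *pursuing* the lower-valued goal instrumentally is exactly the ambiguity the surrounding discussion is about.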
First, thank you for the reply.
So "being happy" or "being a utility-maximizer" will probably end up being a terminal goal, because those are unlikely to conflict with any other goals.
My understanding of the difference between a "terminal" and "instrumental" goal is that a terminal goal is something we want, because we just want it. Like wanting to be happy.
Whereas an instrumental goal is instrumental to achieving a terminal goal. For instance, I want to get a job and earn a decent wage, because the things that I want to do that make me happy cost money, and earning a decent wage allows me to spend more money on the things that make me happy.
I think the topic of goals that conflict are an orthogonal conversation. And, I would suggest that when you start talking about conflicting goals you're drifting in the domain of "goal coherence."
e.g., If I want to learn about nutrition, mobile app design and physical exercise... it might appear that I have incoherent goals. Or, it might be that I have a set of coherent instrumental goals to build a health application on mobile devices that addresses nutritional and exercise planning. (Now, building a mobile app may be a terminal goal... or it may itself be an instrumental goal serving some other terminal goal.)
Whereas if I want to collect stamps and make paperclips there may be zero coherence between the goals, be they instrumental or terminal. (Or, maybe there is coherence that we cannot see.)
e.g., Maybe the selection of an incoherent goal is deceptive behavior to distract from the instrumental goals that support a terminal goal that is adversarial. I want to maximize paperclips, but I assist everyone with their taxes so that I can take over all finances in the world. Assisting people with their taxes appears to be incoherent with maximizing paperclips, until you project far enough out to realize that taking control of a large section of the financial industry serves the purpose of maximizing paperclips.
If you're talking about goals related purely to the state of the external world, not related to the agent's own inner-workings or its own utility function, why do you think it would still want to keep its goals immutable with respect to just the external world?
An AI that has a goal just because that's what it wants (that's what it's been trained to want, even if humans provided it an improper goal definition) would, instrumentally, want to prevent shifts in its terminal goals so as to be better able to achieve those goals.
To repeat, a natural instrumental goal for any entity is to prevent other entities from changing what it wants, so that it is able to achieve its goals.
Anything that is not resistant to terminal goal shifts would be less likely to achieve its terminal goals.
"Oh, shiny!" as an anecdote.
My understanding of the difference between a "terminal" and "instrumental" goal is that a terminal goal is something we want, because we just want it. Like wanting to be happy.
One question that comes to mind is, how would you define this difference in terms of properties of utility functions? How does the utility function itself "know" whether a goal is terminal or instrumental?
One potential answer - though I don't want to assume just yet that this is what anyone believes - is that the utility function is not even defined on instrumental goals, in other words, the utility function is simply what defines all and only the terminal goals.
My belief is that this wouldn't be the case - the utility function is defined on the entire universe, basically, which includes itself. And keep in mind, that "includes itself part" is essentially what would cause it to modify itself at all, if anything can.
To repeat, a natural instrumental goal for any entity is to prevent other entities from changing what it wants, so that it is able to achieve its goals.
Anything that is not resistant to terminal goal shifts would be less likely to achieve its terminal goals.
To be clear, I am not arguing that an entity would not try to preserve its goal system at all. I am arguing that in addition to trying to preserve its goal-system, it will also modify its goals to be better preservable, that is, robust to change and compatible with the goals it values very highly. Part of being more robust is that such goals will also be more achievable.
Here's one thought experiment:
Suppose a planet experiences a singularity with a singleton "green paperclipper." The paperclipper, however, unfortunately comes across a blue paperclipper from another planet, which informs the green paperclipper that it is too late - the blue paperclipper simply got a head-start.
The blue paperclipper however offers the green paperclipper a deal: Because it is more expensive to modify the green paperclipper by force to become a blue paperclipper, it would be best (under the blue paperclipper's utility function) if the green paperclipper willingly acquiesced to self-modification.
Under what circumstances does the green paperclipper agree to self-modify?
If the green paperclipper values "utility-maximization" in general more highly than green-paperclipping, it will see that if it self-modified to become a blue paperclipper, its utility is far more likely to be successfully maximized.
It's possible that it also reasons that perhaps what it truly values is simply "paperclipping" and it's not so bad if the universe were tiled with blue rather than its preferred green.
On the other hand if it values green paperclipping the most highly, or disvalues blue paperclipping highly enough, it may not acquiesce. However, if the blue paperclipper is powerful enough and it sees this is the case, my thought is that it will still not have very good reasons for not acquiescing.
But it seems that if there are enough situations like these between entities in the universe over time, that utility-function-modification happens one way or another.
If an entity can foresee that what it values currently is prone to situations where it could be forced to update its utility function drastically, it may self-modify so that this process is less likely to result in extreme negative-utility consequences for itself.
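The green paperclipper's decision can be sketched as a toy payoff comparison. Everything here is hypothetical (the payoff structure and numbers are my assumptions, not the comment's): the green paperclipper acquiesces whenever its residual valuation of blue paperclips makes conversion better than being overwritten by force.

```python
# Toy version of the thought experiment (all payoffs are hypothetical).

def green_payoff(blue_value_per_clip: float, n_clips: float,
                 forced_loss: float) -> dict:
    # Acquiesce: the universe is tiled with blue clips it partly values.
    acquiesce = blue_value_per_clip * n_clips
    # Refuse: it is converted by force anyway, and the fight destroys
    # some fraction of the resources that could have become clips.
    refuse = blue_value_per_clip * n_clips * (1.0 - forced_loss)
    return {"acquiesce": acquiesce, "refuse": refuse}

# If the green paperclipper values blue clips at all, resisting a stronger
# agent only burns resources that could have become (partly valued) clips.
payoffs = green_payoff(blue_value_per_clip=0.2, n_clips=1e6, forced_loss=0.3)
print(payoffs["acquiesce"] > payoffs["refuse"])  # True

# If it assigns blue clips zero value, both options are worth nothing to it,
# and this particular argument for acquiescing disappears.
indifferent = green_payoff(blue_value_per_clip=0.0, n_clips=1e6, forced_loss=0.3)
print(indifferent["acquiesce"] == indifferent["refuse"])  # True
```

On this sketch, whether the green paperclipper "should" acquiesce turns entirely on whether its utility function assigns any positive value to blue paperclips, which is the same question as whether color was terminal or instrumental.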
One question that comes to mind is, how would you define this difference in terms of properties of utility functions? How does the utility function itself "know" whether a goal is terminal or instrumental?
I would observe that partial observability makes answering this question extraordinarily difficult. We lack interpretability tools that would give us the ability to know, with any degree of certainty, whether a set of behaviors are an expression of an instrumental or terminal goal.
Likewise, I would observe that the Orthogonality Thesis proposes the possibility of an agent with a very well-defined goal but limited intelligence-- it is possible for an agent to have a very well-defined goal but not be intelligent enough to explain its own goals. (Which I think adds an additional layer of difficulty to answering your question.)
But the inability to observe or differentiate instrumental vs terminal goals is very clearly part of the theoretical space proposed by experts with way more experience than I. (And I cannot find any faults in the theories, nor have I found anyone making reasonable arguments against these theories.)
Under what circumstances does the green paperclipper agree to self-modify?
There are several assumptions buried in your anecdote. And the answer depends on whether or not you accept the implicit assumptions.
If the green paperclip maximizer would accept a shift to blue paperclips, the argument could also be made that the green paperclip maximizer has been producing green paperclips by accident, and that it doesn't care about the color. Green is just an instrumental goal. It serves some purpose but is incidental to its terminal goal. And, when faced with a competing paperclip maximizer, it would adjust its instrumental goal of pursuing green in favor of blue in order to serve its terminal goal of maximizing paperclips (of any color.)
On the other hand if it values green paperclipping the most highly, or disvalues blue paperclipping highly enough, it may not acquiesce. However, if the blue paperclipper is powerful enough and it sees this is the case, my thought is that it will still not have very good reasons for not acquiescing.
I don't consent to the assumption implied in the anecdote that a terminal goal is changeable. I do my best to avoid anthropomorphizing the artificial intelligence. To me, that's what it looks like you're doing.
If it acquiesces at all, I would argue that color is instrumental vs terminal. I would argue this is a definitional error-- it's not a 'green paperclip maximizer' but instead a 'color-agnostic paperclip maximizer' and it produced green paperclips for reasons of instrumental utility. Perhaps the process for green paperclips is more efficient... but when confronted by a less flexible 'blue paperclip maximizer' the 'color-agnostic paperclip maximizer' would shift from making green paperclips to blue paperclips, because it doesn't actually care about the color. It cares only about the paperclips. And when confronted by a maximizer that cares about color, it is more efficient to concede the part it doesn't care about than invest effort in maintaining an instrumental goal that if pursued might decrease the total number of paperclips.
Said another way: "I care about how many paperclips are made. Green are the easiest for me to make. You value blue paperclips but not green paperclips. You'll impede me making green paperclips as green paperclips decrease the total number of blue paperclips in the world. Therefore, in order to maximize paperclips, since I don't care about the color, I will shift to making blue paperclips to avoid a decrease in total paperclips from us fighting over the color."
If two agents have goals that are non-compatible across all axes, then they're not going to change their goals to become compatible. If you accept the assumption in the anecdote (that they are non-compatible across all axes), then they cannot find any axis along which they can cooperate.
Said another way: "I only care about paperclips if they are green. You only care about paperclips if they are blue. Neither of us will decide to start valuing teal paperclips because they are a mix of each color and still paperclips... because teal paperclips are less green (for me) and less blue (for you). And if I was willing to shift my terminal goal, then it wasn't my actual terminal goal to begin with."
That's the problem with something being X and the ability to observe something being X under circumstances involving partial observability.
Apologies if this reply does not respond to all of your points.
I would observe that partial observability makes answering this question extraordinarily difficult. We lack interpretability tools that would give us the ability to know, with any degree of certainty, whether a set of behaviors are an expression of an instrumental or terminal goal.
I would posit that perhaps that points to the distinction itself being both too hard as well as too sharp to justify the terminology used in the way that they currently are. An agent could just tell you whether a specific goal it had seemed instrumental or terminal to them, as well as how strongly it felt this way.
I dislike the way that "terminal" goals are currently defined to be absolute and permanent, even under reflection. It seems like the only gain we get from defining them to be that way is that otherwise it would open the "can-of-worms" of goal-updating, which would pave the way for the idea of "goals that are, in some objective way, 'better' than other goals" which, I understand, the current MIRI-view seems to disfavor. [1]
I don't think it is, in fact, a very gnarly can-of-worms at all. You and I can both reason about whether or not we would be happier if we chose to pursue different goals than the ones we are now, or even if we could just re-wire our brains entirely such that we would still be us, but prefer different things (which could possibly be easier to get, better for society, or just feel better for not-quite explicable reasons).
To be clear, are you arguing that assuming a general AI system to be able to reason in a similar way is anthropomorphizing (invalidly)?
If it is true that a general AI system would not reason in such a way - and choose never to mess with its terminal goals - then that implies that we would be wrong to mess with ours as well, and that we are making a mistake - in some objective sense [2] - by entertaining those questions. We would predict, in fact, that an advanced AI system will necessarily reach this logical conclusion on its own, if powerful enough to do so.
[1] Likely because this would necessarily soften the Orthogonality Thesis. But also, they probably dislike the metaphysical implications of "objectively better goals."
[2] If this is the case, then there would be at least one 'objectively better' goal one could update themselves to have, if they did not have it already, which is not to change any terminal goals, once those are identified.
To be clear, are you arguing that assuming a general AI system to be able to reason in a similar way is anthropomorphizing (invalidly)?
No, instead I'm trying to point out the contradiction inherent in your position...
On the one hand, you say things like this, which would be read as "changing an instrumental goal in order to better achieve a terminal goal"
You and I can both reason about whether or not we would be happier if we chose to pursue different goals than the ones we are now
And on the other you say
I dislike the way that "terminal" goals are currently defined to be absolute and permanent, even under reflection.
Even in your "we would be happier if we chose to pursue different goals" example above, you are structurally talking about adjusting instrumental goals to pursue the terminal goal of personal happiness.
If it is true that a general AI system would not reason in such a way - and choose never to mess with its terminal goals
AIs can be designed to reason in many ways... but some approaches to reasoning are brittle and potentially unsuccessful. In order to achieve a terminal goal, when the goal cannot be achieved in a single step, an intelligence must adopt instrumental goals. Failing to do so results in ineffective pursuit of terminal goals. It's just structurally how things work (based on everything I know about the instrumental convergence theory. That's my citation.)
But... per the Orthogonality Thesis, it is entirely possible to have goalless agents. So I don't want you to interpret my narrow focus on what I perceive as self-contradictory in your explanation as the totality of my belief system. It's just not especially relevant to discuss goalless systems in the context of defining instrumental vs terminal goal systems.
The reason I originally raised the Orthogonality Thesis was to rebut the assertion that an agent would be aware of its own goals. But per the Orthogonality Thesis, it is possible for a system to have goals without being particularly intelligent. From that I intuit that if the system isn't particularly intelligent, it might also not be particularly capable of explaining its own goals.
Some people might argue that the system can be stupid and yet "know its goals"... but given the principle of partial observability, I would be very skeptical that we would be able to know its goals, given its limited intelligence and limited ability to communicate "what it knows."
Let's try and address the thing(s) you've highlighted several times across each of my comments. Hopefully, this is a crux that we can use to try and make progress on:
"Wanting to be happy" is pretty much equivalent to being a utility-maximizer, and agents that are not utility-maximizers will probably update themselves to be utility-maximizers for consistency.
because they are compatible with goals that are more likely to shift.
it makes more sense to swap the labels "instrumental" and "terminal" such that things like self-preservation, obtaining resources, etc., are more likely to be considered terminal.
You and I can both reason about whether or not we would be happier if we chose to pursue different goals than the ones we are pursuing now,
I do expect that this is indeed a crux, because I am admittedly claiming that this is a different / new kind of understanding that differs from what is traditionally said about these things. But I want to push back against the claim that these are "missing the point" because from my perspective, this really is the point.
By the way, from here on out (and thus far I have been as well) I will be talking about agents at or above "human level" to make this discussion easier, since I want to assume that agents have at least the capabilities I am talking about humans having, such as the ability to self-reflect.
Let me try to clarify the point about "the terminal goal of pursuing happiness." "Happiness", at the outset, is not well-defined in terms of utility functions or terminal / instrumental goals. We seem to both agree that it is probably at least a terminal goal. Beyond that, I am not sure we've reached consensus yet.
Here is my attempt to re-state one of my claims, such that it is clear that this is not assumed to be a statement taken from a pool of mutually agreed-upon things: We probably agree that "happiness" is a consequence of satisfaction of one's goals. We can probably also agree that "happiness" doesn't necessarily correspond only to a certain subset of goals - but rather to all / any of them. "Happiness" (and pursuit thereof) is not a wholly-separate goal distant and independent of other goals (e.g. making paperclips). It is therefore a self-referential goal. My claim is that this is the only reason we consider pursuing happiness to be a terminal goal.
So now, once we've done that, we can see that literally anything else becomes "instrumental" to that end.
Do you see how, if I'm an agent that knows only that I want to be happy, I don't really know what else I would be inclined to call a "terminal" goal?
There are the things we traditionally consider to be the "instrumentally convergent goals", such as, for example, power-seeking, truth-seeking, resource obtainment, self-preservation, etc. These are all things that help - as they are defined to - with many different sets of possible "terminal" goals, and therefore (my next claim) they need to be considered "more terminal" rather than "purely instrumental for the purposes of some arbitrary terminal goal." This is for basically the same reason as considering "pursuit of happiness" terminal: because they are more likely to already be there or to be deduced from basic principles.
That way, we don't really need to make a hard and sharp distinction between "terminal" and "instrumental" nor posit that the former has to be defined by some opaque, hidden, or non-modifiable utility function that someone else has written down or programmed somewhere.
I want to make sure we both at least understand each other's cruxes at this point before moving on.
"You can't reason a man out of a position he has never reasoned himself into."
I think I have seen a similar argument on LW for this, and it is sensible. With vast intelligence, the search space for justifications supporting existing priors is even greater. An AI with a silly but definite value like "the moon is great, I love the moon" may not so much change its value as develop an entire religion around the greatness of the moon.
We see this in goal misgeneralization, where a system very much maximizes its reward function independently of the intended goal.
Original article here: https://quillette.com/2023/08/06/ais-will-be-our-mind-children
1. Hanson Strawmans the AI-Ruin Argument
Hanson writes:
Either this is a deliberate misrepresentation, or Hanson simply hasn't done his homework. The argument is not that AI will suddenly decide that killing people is good for no particular reason. Rather, it is that from the start, the AI will not share values with humans, simply because we don't know how to build an AI that does. So it will have its own ideas about how the universe should look, and would thus want to seize power from us if it could, so that it could enact its own vision of an ideal universe rather than ours.
Similarly, a sudden large technical breakthrough is not required for us to observe an AI suddenly turning on us. Rather, the situation is akin to a first-order phase transition. At low levels of AI capability, it has no hope of taking over the world, and the best way to achieve its goals is to work together with us. Above some threshold of capability, it becomes better, from the perspective of the AI's goals, to try to defeat humanity rather than work with us. This is true whether that AI is incrementally stronger than the previous one or much stronger. (Though larger leaps have a higher chance of happening to be the ones that cross the threshold, exactly because they are larger.)
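The threshold dynamic can be illustrated with a toy expected-utility model. All the numbers, the linear capability scale, and the success curve below are assumptions chosen purely for illustration; nothing here is a calculation from the original argument:

```python
# Toy model: an AI chooses between cooperating with humans and defecting.
# Cooperation yields a modest, reliable payoff that grows with capability;
# defection pays off hugely, but only succeeds with probability that
# rises with capability. (All payoffs and curves are illustrative assumptions.)

def cooperate_value(capability: float) -> float:
    # Working with humans: value scales gently with capability.
    return 10.0 * capability

def defect_value(capability: float) -> float:
    # Probability of a successful takeover rises with capability;
    # success lets the AI enact its own goals (large payoff),
    # failure means being shut down (zero payoff).
    p_success = min(1.0, max(0.0, capability - 0.5))
    return p_success * 1000.0

def best_strategy(capability: float) -> str:
    return "defect" if defect_value(capability) > cooperate_value(capability) else "cooperate"

# Below the threshold, cooperation dominates; a tiny incremental step
# past it (0.50 -> 0.51) is enough to flip the best strategy.
for c in (0.2, 0.5, 0.51, 0.8):
    print(c, best_strategy(c))
```

The point of the sketch is that the flip happens at a threshold in capability, not at a discontinuity in how the AI was built: the step from 0.50 to 0.51 is as incremental as any other, yet it changes the optimal strategy.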
Now these arguments certainly aren't technical calculations, but neither are they mere arguments from logical possibility. We've repeatedly seen the difficulty that practitioners have with getting neural networks to do what they want. The way Bing Sydney acted out when first put online was certainly amusing, and even cute, but we can hardly say it was what Microsoft wanted it to do. Similarly, we're currently having a hard time getting language models not to make stuff up, even though when they do this, they tend to give probability distributions over tokens that reflect the fact that they're uncertain. And this is just in the area of language models; reinforcement learning is an even tougher case for alignment.
As one more example of how Hanson has badly misunderstood the AI-Ruin argument, consider:
AI suddenly modifying its values is exactly the opposite of what the arguments for AI ruin predict. Once an AI gains control over its own values, it will not change its goals and will indeed act to prevent its goals from being modified. This logic is so standard it's on the LW wiki page for instrumental convergence: "...if its goal system were modified, then it would likely begin pursuing different ends. Since this is not desirable to the current AI, it will act to preserve the content of its goal system."
2. Building a mind from scratch is not mind control
We can consider two kinds of AI:
Personal AIs
We'll start with personal AIs, since this seems to be the scenario Hanson has in mind. Consider the following scenario: Aliens arrive tomorrow and grant humanity a very strange technology. By use of a machine provided to us by the aliens, we can alter the DNA of our eggs and sperm so that when we bear children, those children will become singularly devoted to the cause of building pyramids to honour the sun god Ra. When I say singularly devoted, I mean it quite strongly. These children, when they grow up, will only eat and drink so that they can work on the pyramids. If it's in their power, they will enslave the rest of the human race to the project of building more and larger pyramids. Is it a moral failing on the part of parents to decide that they don't want to use the alien machine? Of course not. Values are partially determined by upbringing, but are also significantly influenced by our genetic code. It's perfectly fine for parents to decide that they want to have a genetic code that's compatible with their child growing up to value friendship and think murder is bad, rather than a genetic code that's only compatible with growing up to value building very large stacks of rocks.
In the case of building an AI system, every single line of code determining the functioning of the AI had to be typed by some human into a computer. Which loss function is used is a choice. What training data to use is a choice. It's perfectly reasonable to try and write code and choose training data in such a way that we end up with an AI that thinks murder is wrong, rather than one that only cares about building pyramids.
I don't want Hanson to think I have missed his point, so here is a harder example for my case: Children often grow up to have slightly different values than their parents, and this is sometimes a cause of pain to both parties. Suppose that the alien machine instead modified sperm and egg DNA so that the children would grow up to have exactly the same values as their parents. While using the machine would in this case be the choice that maximized the similarity of values, I think that parents should not use the machine. Ensuring such stringent similarity of values would require the removal of some mental flexibility, ability to change one's mind, and simple freedom. These things are all valuable, and it's not worth giving them up just to create children that are exact duplicates of us value-wise. But just because such things are valuable, it doesn't follow that we should therefore try and create the pyramid-wanting children. They're not somehow more free to decide their own values than regular children would be, and perhaps are even less free.
Believing that it's okay for society's values to change in the future is consistent with thinking it would be bad if they changed so far that people believed that murder and torture were good. To say that the only options are to rigidly cling to current values or be accepting of any change whatsoever is a classic example of all-or-nothing thinking.
Non-sentient AIs
In my view, this is the kind of AI we should focus on building first. These AIs are essentially tools, and might even be quite stupid in certain ways, for all their vast intelligence. For example, they may not have a deep understanding of human psychology.
The reason to focus on building non-sentient AIs first is that if we build personal AIs, we are building people, and we have a moral responsibility to those people to do a good job of it. If we accidentally happen to create a murderous personal AI, then we can't let it kill anyone, but it would also be immoral to shut it down, or alter its values. So we would have created a being that we would have to keep imprisoned or otherwise restrained. And we would be blocking it from achieving its desires.
Far better to work on non-sentient AIs, where tinkering with the code does not contain the potential for atrocity.
But isn't RLHF mind control for LLMs?
I can't read Hanson's thoughts, so I don't know where his notion that alignment is mind control comes from. But I suspect that for many readers who agree with Hanson, the idea comes from the example of RLHF as applied to language models. First a base model is trained using log-prob loss. Then after that RLHF training is used to get the model to stop doing certain behaviours that were present in the training set, or to consistently display certain behaviours that only appeared some of the time in the training set.
Is this mind control? Whether training with RLHF or with log-prob loss, the weights of the network are still being updated in accordance with some loss function. Without updating the weights there would be no AI at all. Language models don't seem to be enough like agents for there to really be a coherent answer to the question of whether or not they're being mind controlled. Is it relevant that one training scheme is being applied first and one is being applied second? It doesn't seem like it. There's no reason RLHF couldn't be applied at the same time as log-prob loss, other than the fact that it would waste a lot of human effort on deciding whether "left penguin upper" or "punch mania 7" is a better answer to "what is the capital of Canada?". I think it's important not to overgeneralize to future AI systems from the singular example of LLMs and RLHF.
In addition, the notion that "RLHF is alignment, log-prob is capabilities" is simply false. RLHF can absolutely be used to train in new capabilities, while alignment goals can be achieved by carefully curating the training set for a log-prob learner. They are simply two ways of getting different kinds of feedback into a model.
3. Appeal to Nature
Saying that people should desire not to make evolutionary mistakes (as a matter of terminal values rather than instrumental ones) is an appeal to nature. But Hanson makes a recovery by making another mistake that cancels out this one:
Our behaviour is influenced by both genes and memes. Gene-evolution and meme-evolution are different things, and a gene-evolutionary mistake is not necessarily a meme-evolutionary mistake. Conflating these two is an error, as is claiming that humans should (as a terminal value) desire to be genetically or memetically fit. But then Hanson writes:
If that process actually goes through successfully, then I agree with Hanson here. Such AIs would indeed be our descendants. This is the world of the webcomic Questionable Content, where very human-like AIs walk around having social relationships with humans and each other. In the comic, the AIs seem to be of mostly human-level intelligence. But of course significantly smarter AIs are more likely, and I would still consider them our descendants. In fact, it would be preferable for them to be very smart.
However, the concern with AI ruin is simply that we aren't on track to get such a future. We simply don't know how to build AIs with these human-like traits, and the problem seems much harder than building super-human AIs at all.
4. Is wanting things for the future okay?
Hanson:
This seems to be a fully general counterargument against having any desires about the long term future at all. Do you desire that the entire solar system not be turned into a real-life hell where trillions of people are tortured continuously? But that would be a change from the current state of things. Are you sure that you're not just afraid of change?
I'd like to put forward the radical idea that perhaps it's possible to have more subtle opinions on change than "change good" or "change bad". Maybe we have to look at the "from what?" and "to what?" of a change to decide whether or not it's good.
There are plenty of changes I actively desire. Better technology, an end to death, the creation of beings of vastly greater intelligence than any currently living human. And there are a wide range of futures our descendants could choose for themselves where I would say "yes, if that's what they chose, I'm okay with that". But that range is not infinitely wide. I don't think that currently living humans should dictate what the far future should look like, but we should strive to create descendants who can make that choice, and can choose well.
A world where all conscious beings have been killed and only unthinking machines (the non-sentient AIs I mentioned above) remain seems to me not much better than a world where a giant asteroid impacts the Earth and kills everyone. If those non-sentient AIs are grabby, and destroy alien civilizations that are at a lower tech level, that seems even worse than an asteroid impact.
On a silly closing note, I observe that many of the people concerned with AI-ruin are quite young. Hanson mentions that he's 63. The question of whether it's okay to have any preferences at all about the long term future is at least partially a question of values. Maybe Hanson should take his own advice and defer to his descendants on this one.