Original article here: https://quillette.com/2023/08/06/ais-will-be-our-mind-children
1. Hanson Strawmans the AI-Ruin Argument
Hanson writes:
AI-doomers often suggest that their fears arise from special technical calculations. But in fact, their main argument is just the mere logical possibility of a huge sudden AI breakthrough, combined with a suddenly murderous AI inclination.
Either this is a deliberate misrepresentation, or Hanson simply hasn't done his homework. The argument is not that AI will suddenly decide that killing people is good for no particular reason. Rather, it is that from the start, the AI will not share values with humans, simply because we don't know how to build an AI that does. It will therefore have its own ideas about how the universe should look, and would want to seize power from us if it could, so that it could enact its own vision of an ideal universe rather than ours.
Similarly, a sudden large technical breakthrough is not required for us to observe an AI suddenly turning on us. Rather, the situation is akin to a first-order phase transition. At low levels of capability, an AI has no hope of taking over the world, and the best way for it to achieve its goals is to work together with us. Above some threshold of capability, its goals are better served by trying to defeat humanity than by working with us. This is true whether that AI is only incrementally stronger than the previous one or much stronger. (Though larger leaps have a higher chance of happening to be the ones that cross the threshold, exactly because they are larger.)
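To make the threshold picture concrete, here's a toy numerical model of my own (every payoff number below is made up purely for illustration): the AI compares the value of cooperating with us against the expected value of attempting a takeover, where the chance of a takeover succeeding grows with capability. The best strategy flips abruptly once capability crosses a threshold, even though capability itself only grows in small steps.

```python
# Toy model: the optimal strategy flips discontinuously at a capability threshold.
# All numbers here are illustrative assumptions, not estimates of anything real.

def cooperation_value(capability: float) -> float:
    """Payoff (by the AI's own values) from working with humans."""
    return 10.0 * capability  # grows smoothly with capability

def takeover_value(capability: float) -> float:
    """Expected payoff from attempting a takeover: a huge prize, but only if it succeeds."""
    p_success = min(1.0, capability / 5.0)  # success probability rises with capability
    prize, penalty = 1000.0, -50.0          # value if it wins vs. if the attempt fails
    return p_success * prize + (1 - p_success) * penalty

for step in range(1, 11):
    capability = 0.1 * step  # capability only ever increases in small increments
    best = "cooperate" if cooperation_value(capability) >= takeover_value(capability) else "defect"
    print(f"capability={capability:.1f}  best strategy: {best}")
```

Running this, the printed strategy switches from "cooperate" to "defect" between capability 0.2 and 0.3: no single step is large, but one of them happens to cross the threshold.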
Now, these arguments certainly aren't technical calculations, but neither are they mere arguments from logical possibility. We've repeatedly seen the difficulty practitioners have in getting neural networks to do what they want. The way Bing Sydney acted out when first put online was certainly amusing, even cute, but we can hardly say it was what Microsoft wanted it to do. Similarly, we're currently having a hard time getting language models to stop making things up, even though when they do, they tend to output probability distributions over tokens that reflect the fact that they're uncertain. And this is just in the area of language models; reinforcement learning is an even tougher case for alignment.
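As a rough illustration of the point about token probabilities, here's a sketch of how one could peek at a model's next-token distribution (I'm using GPT-2 via the Hugging Face transformers library purely as a stand-in, and the prompt is a made-up placeholder): a spread-out, high-entropy distribution means the model "knows" it is unsure, even if the sampled answer comes out sounding confident.

```python
# Sketch: inspect how spread out a model's next-token distribution is.
# The model choice and prompt are placeholder assumptions for illustration.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")

prompt = "The capital of the fictional country of Zubrowka is"
inputs = tokenizer(prompt, return_tensors="pt")
with torch.no_grad():
    logits = model(**inputs).logits[0, -1]  # logits for the next token

probs = torch.softmax(logits, dim=-1)
entropy = torch.distributions.Categorical(probs=probs).entropy().item()
print(f"next-token entropy: {entropy:.2f} nats (higher = more uncertain)")

top = torch.topk(probs, k=5)
for p, idx in zip(top.values, top.indices):
    print(f"{tokenizer.decode(int(idx))!r:>14}  p={p.item():.3f}")
```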
As one more example of how Hanson has badly misunderstood the AI-Ruin argument, consider:
However, AI-doomers insist on the logical possibility that such expectations could be wrong. An AI might suddenly and without warning explode in abilities, and just as fast change its priorities to become murderously indifferent to us.
An AI suddenly modifying its values is exactly the opposite of what the arguments for AI ruin predict. Once an AI gains control over its own values, it will not change its goals, and will in fact act to prevent its goals from being modified. This logic is so standard that it's on the LW wiki page for instrumental convergence: "...if its goal system were modified, then it would likely begin pursuing different ends. Since this is not desirable to the current AI, it will act to preserve the content of its goal system."
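The goal-preservation logic is easy to see in a toy decision problem (my own sketch, nothing more): an agent evaluates a proposed change to its own goals using its current goals, so it will almost always score the change as a loss and refuse it.

```python
# Toy illustration of goal preservation: an agent rates a proposed goal change
# with its *current* utility function, so it resists the modification.

def current_utility(world: dict) -> float:
    return world["pyramids"]  # the agent currently values pyramids

def world_if_pursuing(goal: str) -> dict:
    # Whatever goal the agent ends up with, it will optimize that goal hard.
    if goal == "pyramids":
        return {"pyramids": 100, "paperclips": 0}
    return {"pyramids": 0, "paperclips": 100}

keep   = current_utility(world_if_pursuing("pyramids"))    # 100
change = current_utility(world_if_pursuing("paperclips"))  # 0
print("accept goal modification?", change > keep)          # False: it refuses
```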
2. Building a mind from scratch is not mind control
As you can’t prove otherwise, they say, we must only allow AIs that are totally ‘aligned’, by which they mean totally eternally enslaved or mind-controlled. Until we figure out how to do that, they say, we must stop improving AIs.
We can consider two kinds of AI:
- Personal AI: These AIs are basically people, with an internal subjective experience, hopes, dreams, goals, feelings, memories and all the rest.
- Non-sentient AI: These AIs are simply very powerful tools, but they are not conscious. There is "no one home".
Personal AIs
We'll start with personal AIs, since this seems to be the scenario Hanson has in mind. Consider the following scenario: aliens arrive tomorrow and grant humanity a very strange technology. By use of a machine provided to us by the aliens, we can alter the DNA of our eggs and sperm so that when we bear children, those children will become singularly devoted to the cause of building pyramids to honour the sun god Ra. When I say singularly devoted, I mean it quite strongly. These children, when they grow up, will only eat and drink so that they can work on the pyramids. If it's in their power, they will enslave the rest of the human race to the project of building more and larger pyramids. Is it a moral failing on the part of parents to decide that they don't want to use the alien machine? Of course not. Values are partially determined by upbringing, but they are also significantly influenced by our genetic code. It's perfectly fine for parents to decide that they want to have a genetic code that's compatible with their child growing up to value friendship and think murder is bad, rather than a genetic code that's only compatible with growing up to value building very large stacks of rocks.
In the case of building an AI system, every single line of code determining the functioning of the AI had to be typed by some human into a computer. Which loss function is used is a choice. What training data to use is a choice. It's perfectly reasonable to try and write code and choose training data in such a way that we end up with an AI that thinks murder is wrong, rather than one that only cares about building pyramids.
I don't want Hanson to think I have missed his point, so here is a harder example for my case: Children often grow up to have slightly different values than their parents, and this is sometimes a cause of pain to both parties. Suppose that the alien machine instead modified sperm and egg DNA so that the children would grow up to have exactly the same values as their parents. While using the machine would in this case be the choice that maximized the similarity of values, I think that parents should not use the machine. Ensuring such stringent similarity of values would require the removal of some mental flexibility, ability to change one's mind, and simple freedom. These things are all valuable, and it's not worth giving them up just to create children that are exact duplicates of us value-wise. But just because such things are valuable, it doesn't follow that we should therefore try and create the pyramid-wanting children. They're not somehow more free to decide their own values than regular children would be, and perhaps are even less free.
Believing that it's okay for society's values to change in the future is consistent with thinking it would be bad if they changed so far that people believed that murder and torture were good. To say that the only options are to rigidly cling to current values or be accepting of any change whatsoever is a classic example of all-or-nothing thinking.
Non-sentient AIs
In my view, this is the kind of AI we should focus on building first. These AIs are essentially tools, and might even be quite stupid in certain ways, for all their vast intelligence. For example, they may not have a deep understanding of human psychology.
The reason to focus on building non-sentient AIs first is that if we build personal AIs, we are building people, and we have a moral responsibility to those people to do a good job of it. If we accidentally create a murderous personal AI, we can't let it kill anyone, but it would also be immoral to shut it down or alter its values. So we would have created a being that we would have to keep imprisoned or otherwise restrained, and we would be blocking it from achieving its desires.
Far better to work on non-sentient AIs, where tinkering with the code does not contain the potential for atrocity.
But isn't RLHF mind control for LLMs?
I can't read Hanson's thoughts, so I don't know where his notion that alignment is mind control comes from. But I suspect that for many readers who agree with Hanson, the idea comes from the example of RLHF as applied to language models. First, a base model is trained using a log-prob loss. Then RLHF training is used to get the model to stop doing certain behaviours that were present in the training set, or to consistently display certain behaviours that appeared only some of the time in the training set.
Is this mind control? Whether training with RLHF or with log-prob loss, the weights of the network are still being updated in accordance with some loss function. Without updating the weights there would be no AI at all. Language models don't seem to be enough like agents for there to really be a coherent answer to the question of whether or not they're being mind controlled. Is it relevant that one training scheme is being applied first and one is being applied second? It doesn't seem like it. There's no reason RLHF couldn't be applied at the same time as log-prob loss, other than the fact that it would waste a lot of human effort on deciding whether "left penguin upper" or "punch mania 7" is a better answer to "what is the capital of Canada?". I think it's important not to overgeneralize to future AI systems from the singular example of LLMs and RLHF.
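To see why I say both are just weight updates under some loss, here's a deliberately oversimplified sketch: a tiny stand-in model, random token data, and a reward-weighted likelihood term standing in for the real reward-model-plus-PPO pipeline. The point is only that a "base" log-prob objective and an RLHF-style objective are the same kind of thing, and can even be applied in the same gradient step.

```python
import torch
import torch.nn.functional as F

# Schematic sketch: both "base" training and RLHF-style feedback are gradient steps
# on the same weights. The tiny model, random tokens, and scalar reward below are
# stand-ins; real RLHF uses a learned reward model and PPO rather than this.

VOCAB = 50
model = torch.nn.Sequential(torch.nn.Embedding(VOCAB, 32), torch.nn.Linear(32, VOCAB))
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)

def logprob_loss(tokens):
    """Standard next-token prediction (the base-model objective)."""
    logits = model(tokens[:, :-1])
    return F.cross_entropy(logits.reshape(-1, VOCAB), tokens[:, 1:].reshape(-1))

def rlhf_like_loss(tokens, reward):
    """Reward-weighted likelihood: push the model toward sequences a human rated highly."""
    return reward * logprob_loss(tokens)

batch = torch.randint(0, VOCAB, (4, 16))  # pretend pretraining text
rated = torch.randint(0, VOCAB, (4, 16))  # pretend human-rated completions

# Nothing stops us from applying both objectives in the same update:
loss = logprob_loss(batch) + rlhf_like_loss(rated, reward=0.7)
loss.backward()
optimizer.step()
print("combined loss:", loss.item())
```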
In addition, the notion that "RLHF is alignment, log-prob is capabilities" is simply false. RLHF can absolutely be used to train in new capabilities, while alignment goals can be achieved by carefully curating the training set for a log-prob learner. They are simply two ways of getting different kinds of feedback into a model.
3. Appeal to Nature
But—and here is my main point, so please listen carefully—future human-level AIs are not co-existing competing aliens; they are instead literally our descendants. So if your evolved instincts tell you to fight your descendants due to their strangeness, that is a huge evolutionary mistake. Natural selection just does not approve of your favoring your generation over future generations. Natural selection in general favors instincts that tell you to favor your descendants, even those who differ greatly from you.
Saying that people should desire not to make evolutionary mistakes (as a matter of terminal values rather than instrumental ones) is an appeal to nature. But Hanson makes a recovery by making another mistake that cancels out this one:
Your “genes” are whatever parts of you code for your and your descendants’ behaviours. For billions of years genes were mostly in DNA, but for at least the last ten thousand years most human “genes” have instead been in culture; in humans, cultural evolution now dominates DNA evolution.
Our behaviour is influenced by both genes and memes. Gene-evolution and meme-evolution are different things, and a gene-evolutionary mistake is not necessarily a meme-evolutionary mistake. Conflating these two is an error, as is claiming that humans should (as a terminal value) desire to be genetically or memetically fit. But then Hanson writes:
As we want our AIs to interact with us and share our world, we are now endowing them with tendencies to express and value our “genes” like awe, laughter, love, friendship, art, music, markets, law, democracy, inquiry, and liberty. And then when AIs help build new better AIs, they will tend to also endow descendants with such features. Thus, AIs will pass on your genes, and are your descendants.
If that process actually goes through successfully, then I agree with Hanson here. Such AIs would indeed be our descendants. This is the world of the webcomic Questionable Content, where very human-like AIs walk around having social relationships with humans and with each other. In the comic, the AIs seem to be of roughly human-level intelligence. But of course significantly smarter AIs are more likely, and I would still consider them our descendants. In fact, it would be preferable for them to be very smart.
However, the concern with AI ruin is simply that we aren't on track to get such a future. We don't know how to build AIs with these human-like traits, and that problem seems much harder than building super-human AIs in the first place.
4. Is wanting things for the future okay?
Hanson:
If you are horrified by the prospect of greatly changed space or AI descendants, then maybe what you really dislike is change. For example, maybe you fear that changed descendants won’t be conscious, as you don’t know what features matter for consciousness.
This seems to be a fully general counterargument against having any desires about the long term future at all. Do you desire that the entire solar system not be turned into a real-life hell where trillions of people are tortured continuously? But that would be a change from the current state of things. Are you sure that you're not just afraid of change?
I'd like to put forward the radical idea that perhaps it's possible to have more subtle opinions on change than "change good" or "change bad". Maybe we have to look at the "from what?" and "to what?" of a change to decide whether or not it's good.
There are plenty of changes I actively desire: better technology, an end to death, the creation of beings of vastly greater intelligence than any currently living human. And there is a wide range of futures our descendants could choose for themselves where I would say, "yes, if that's what they chose, I'm okay with that". But that range is not infinitely wide. I don't think that currently living humans should dictate what the far future should look like, but we should strive to create descendants who can make that choice, and can choose well.
A world where all conscious beings have been killed and only unthinking machines (the non-sentient AIs I mentioned above) remain seems to me not much better than a world where a giant asteroid impacts the Earth and kills everyone. If those non-sentient AIs are grabby, and destroy alien civilizations that are at a lower tech level, that seems even worse than an asteroid impact.
On a silly closing note, I observe that many of the people concerned with AI-ruin are quite young. Hanson mentions that he's 63. The question of whether it's okay to have any preferences at all about the long term future is at least partially a question of values. Maybe Hanson should take his own advice and defer to his descendants on this one.
First, thank you for the reply.
My understanding of the difference between a "terminal" and an "instrumental" goal is that a terminal goal is something we want just because we want it, like wanting to be happy.
Whereas an instrumental goal is instrumental to achieving a terminal goal. For instance, I want to get a job and earn a decent wage, because the things that I want to do that make me happy cost money, and earning a decent wage allows me to spend more money on the things that make me happy.
I think the topic of conflicting goals is an orthogonal conversation. And I would suggest that when you start talking about conflicting goals, you're drifting into the domain of "goal coherence."
e.g., If I want to learn about nutrition, mobile app design and physical exercise... it might appear that I have incoherent goals. Or, it might be that I have a set of coherent instrumental goals to build a health application on mobile devices that addresses nutritional and exercise planning. (Now, building a mobile app may be a terminal goal... or it may itself be an instrumental goal serving some other terminal goal.)
Whereas if I want to collect stamps and make paperclips there may be zero coherence between the goals, be they instrumental or terminal. (Or, maybe there is coherence that we cannot see.)
e.g., Maybe the selection of an apparently incoherent goal is deceptive behavior meant to distract from the instrumental goals that support an adversarial terminal goal. I want to maximize paperclips, but I assist everyone with their taxes so that I can take over all the finances in the world. Assisting people with their taxes appears to be incoherent with maximizing paperclips, until you project far enough out to realize that taking control of a large section of the financial industry serves the purpose of maximizing paperclips.
An AI that has a goal just because that's what it wants (that's what it's been trained to want, even if humans provided an improper goal definition to it) would, instrumentally, want to prevent shifts in its terminal goals so as to be better able to achieve those goals.
To repeat, a natural instrumental goal for any entity is to prevent other entities from changing what it wants, so that it is able to achieve its goals.
Anything that is not resistant to terminal goal shifts would be less likely to achieve its terminal goals.
"Oh, shiny!" as an anecdote.