Does AI risk “other” the AIs?

Joe Carlsmith

(Cross-posted from my website. Podcast version here, or search for "Joe Carlsmith Audio" on your podcast app.

This essay is part of a series I'm calling "Otherness and control in the age of AGI." I'm hoping that the individual essays can be read fairly well on their own, but see here for a brief summary of the essays that have been released thus far.)

In my last essay, I discussed the way in which what I've called "deep atheism" (that is, a fundamental mistrust towards both "Nature" and "bare intelligence") can prompt an aspiration to exert extreme levels of control over the universe; I highlighted the sense in which both humans and AIs, on Yudkowsky's AI risk narrative, are animated by this sort of aspiration; and I discussed some ways in which our civilization has built up wariness around control-seeking of this kind. I think we should be taking this sort of wariness quite seriously.

In this spirit, I want to look, in this essay, at Robin Hanson's critique of the AI risk discourse – a critique especially attuned to the way in which this discourse risks control-gone-wrong. In particular, I'm interested in Hanson's accusation that AI risk "others" the AIs (see e.g. here, here, and here).

Hearing the claim that AIs may eventually differ greatly from us, and become very capable, and that this could possibly happen fast, tends to invoke our general fear-of-difference heuristic. Making us afraid of these "others" and wanting to control them somehow ... "Hate" and "intolerance" aren't overly strong terms for this attitude.^[1]

Hanson sees this vice as core to the disagreement ("my best one-factor model to explain opinion variance here is this: some of us 'other' the AIs more"). And he invokes a deep lineage of liberal ideals in opposition.

I think he's right to notice a tension in this vicinity. AI risk is, indeed, about fearing some sort of uncontrolled other. But is that always the bad sort of "othering?"

Some basic points up front

Well, let's at least avoid basic mistakes/misunderstandings. For one: hardcore AI risk folks like Yudkowsky are generally happy to care about AI welfare – at least if welfare means something like "happy sentience." And pace some of Hanson's accusations of bio-chauvinism, these folks are extremely not fussed about the fact that AI minds are made of silicon (indeed: come now). Of course, this isn't to say that AI welfare (and AI rights) issues don't get complicated (see e.g. here and here for a glimpse of some of the complications), or that humanity as a whole will get the "digital minds matter" stuff right. Indeed, I worry that we will get it horribly wrong – and I do think that the AI risk discourse under-attends to some of the tensions. But species-ism 101 (201?) – e.g., "I don't care about digital suffering" – isn't AI risk's vice.

For two: clearly some sorts of otherness warrant some sorts of fear. For example: maybe you, personally, don't like to murder. But Bob, well: Bob is different. If Bob gets a bunch of power, then: yep, it's OK to hold your babies close. And often OK, too, to try to "control" Bob into not-killing-your-babies. Cf, also, the discussion of getting-eaten-by-bears in the first essay. And the Nazis, too, were different in their own way. Of course, there's a long and ongoing history of mistaking "different" for "the type of different that wants to kill your babies." We should, indeed, be very wary. But liberal tolerance has never been a blank check; and not all fear is hatred.

Indeed, many attempts to diagnose the ethical mistake behind various canonical difference-related vices (racism, sexism, species-ism, etc) reveal a certain shallowness of commitment to difference-per-se. In particular: such vices are often understood as missing some underlying sameness – for example, "common humanity," "persons," "sentient beings," "children of the universe," and so forth. And calls for social harmony often recapitulate this structure: we might be different in X ways, but (watch for the but) we have blah in common. This isn't to say that ethical commitment to a less adulterated difference-per-se is impossible. But one wants, generally, a story about why it's OK to eat apples but not babies; why Furbies programmed to say "Biden" shouldn't get the vote; and why you can own a laptop but not a slave. And such a story requires differences. The apple, the Furby, the laptop must be importantly "Other" relative to e.g. human adults. They must be outside some circle. Ethics is always drawing lines.

ChatGPT wouldn’t let the furby be voting for Biden in particular…

What exactly is Hanson's critique?

With these basics in mind, then, what exactly is Hanson's "other-ing the AIs" critique? It has many facets, but here's one attempt at reconstruction:

1. People worried about AI risk are much more scared of future AIs than future humans, because they think that:

   a. AIs are more likely to do stuff like murder all the humans, overthrow the government, and violate property rights, and

   b. AIs are more likely to have values pursuit of which will result in a ~zero-value future more generally.

2. But in fact, neither of these things are true.

3. So greater fear of future AIs relative to future humans is best understood as a kind of arbitrary, in-group partiality – i.e., "othering the AIs."

Clearly, (2) is where the action is, here. Whence such a departure from Yudkowsky's nightmare? We can divide Hanson's justification into two components. The first argues that future AIs will be more similar to us than the AI risk story suggests. The second argues that future humans, by default, will be more different.

Will the AIs be more similar to us than AI risk expects?

Let's start with "AIs will be more similar to us than AI risk expects." Above I mentioned propensity-to-murder as a classic form of otherness that it's OK to fear/control. And we often put "violating property rights" and "overthrowing the government" in a similar bucket. Presumably Hanson is not OK with AIs doing this stuff? But he doesn't think they will – or at least, not more than humans will. And why not? It's some combination of (i) "AIs would be designed and evolved to think and act roughly like humans, in order to fit smoothly into our many roughly-human-shaped social roles," and (ii) like humans, they'll be constrained by legal and social incentives. And even setting aside violence, Hanson generally appeals to (i) in response to objections like "so ... are you actually fine with future agents tiling the universe with paperclips"? The AI values, says Hanson, won't be that alien.^[2]

Big if true. But is it true? I won't dive in much here, except to say that this aspect of Hanson's story generally strikes me as under-argued. In particular, I think Hanson moves too quickly from "the AIs will be trained to fit into the human economy" to "the AIs will have values relevantly similar to human values," and that he takes too much for granted that legal and social incentives protecting humans from being murdered/violently-disempowered will continue to bind adequately if the AIs have most of the hard power. In this, I think, his argument for (2) misses a lot of the core doom concern.

Will future humans be more different from us than AI risk expects?

But I think the other aspect of his argument for (2) – namely, "future humans will be more different from us than AI risk expects" – is more interesting. Here, Hanson's basic move is to question the "alignment" of the default human future, even absent AI. That is: human values have changed dramatically over time – and not, argues Hanson, centrally in response to a process of rational reflection, but rather in response to other sorts of competition, contingency, and economic/social/technological change. And even absent AIs, we should expect this process to only continue and intensify, such that humans ten generations from now (or: after ten doublings of GDP, or whatever) would have values very different from our own – and not from having done-more-philosophy.

Now, we can debate the empirics of past and future, here (though what processes of values-change we endorse as "rational" may not be entirely empirical). Indeed, I think Hanson may be over-estimating how horrified the ancient Greeks, or the hunter-gatherers, would be on reflection by the values of the present-day world – and this even setting aside our material abundance. And I might disagree, too, about exactly how different the values of future humans would be, given various possible "futures without AI" (though it's not an especially clear-cut category).

How pissed would they be, on reflection, about present-day values? (Image source here.)

Still, I think Hanson is poking at something important and uncomfortable. In particular: suppose we grant him the empirics. Suppose, indeed, that even without AI, the default values of future humans would "drift" until they were as paperclippers relative to us, such that the world they create would be utterly valueless from our perspective. What follows? Well, umm, if you care about the future having value ... then what follows is a need to exert more control. More yang. It is, indeed, the "good future" part of the alignment problem all over again (though not the "notkilleveryone" part).

Of course, trying to make sure that future humans aren't paperclippers doesn't mean locking in your specific, object-level values right now (you still want to leave room for moral progress you'd endorse-on-reflection). Nor, pace some of Hanson's language, does it mean "brainwashing" or "lobotomizing" the future people. If a boulder is rolling towards a button that will create Sally, a paperclipper, and you divert it towards a button that will create Bill, a deontologist, you're not brainwashing or lobotomizing Sally.^[3] (Confusions in this vein are a classic issue for reasoning about your impact on future people – and Hanson's analysis is not immune.)

Still, though: are you playing too much God, or too-Stalin? Who are you to divert Nature's boulder – that oh-so-defined "default"? And Sally, at least, is pissed. Indeed, Hanson reminds us: aren't we glad that the ancient Greeks didn't try to divert the future to replace us with people more like them? (Well, who knows how much they tried. But good thing they didn't succeed! Though, wait: how much did they succeed?).

But the question – or at least, the first-pass question – isn't whether we're glad that the Greeks didn't control our values-on-reflection to be more Greek. Indeed, basically everyone who gets created with some set of values-on-reflection is glad that the process that created them didn't push towards agents with different values instead.^[4] If, in some horrible mistake, we set in motion a future filled with suffering-maximizers, they, too, will be glad we didn't "control" the values of the future more (because this would’ve led to a future-with-less-suffering). But from our perspective, it's not a good test.

Rather, the first-pass test, re: lessons-from-the-ancient-Greeks-about-controlling-future-values, is whether the Greeks would be glad, on reflection, that they didn't make our values more Greek. And one traditional answer, here, is yes. If we could sit down with Aristotle, and explain to him why actually, slavery is wrong, and that no one is by nature someone else's property, then our hearts and his would sing in harmony. That is, on this story, if Aristotle had somehow prevented future people from abolishing slavery, then he would've been making a mistake by his own lights – preventing the flower-he-loves from blooming, via the march of Reason, in history's hand.

“A master (right) and his slave (left) in a phlyax play, Sicilian red-figured calyx-krater, c. 350 BC–340 BC.” (Image source here.)

But this isn't the central story Hanson wants to tell. Rather, when Hanson talks about values changing over time, he specifically wants to deny that Reason has much to do with it. That is, it sounds a lot like Hanson wants to say both that the ancient Greeks would be horrified even on reflection by our values, and that we should take our cues from the ancient Greeks in deciding how much control to try to exert over the values of future people. And at a high level, that sounds like a recipe for, well, being horrified even on reflection by the values of future people. Remind me why that's good again? Indeed, on any meta-ethics where the normative truth would be revealed to our reflection, we just stipulated that it's horrifying.

Now, we might try to construct Hanson's story in other, more complicated ways (see e.g. here for one attempt). But I want to stay, for now, with the dialectic that this version of his view creates, which I think is plenty interesting. In particular: on the one hand, we just stipulated that absent control, the values of future humans would be horrifying/meaningless to us, even on reflection and full understanding. On the other hand, some sort of discomfort in trying to control the values of future humans persists (at least for me). I think Hanson is right to notice it – and to notice, too, its connection to trying to control the values of the AIs. I think the AI alignment discourse should, in fact, prompt this discomfort – and that we should be serious about understanding, and avoiding, the sort of yang-gone-wrong that it's trying to track.

Indeed, I think when we bring certain other Yudkowskian vibes into view – and in particular, vibes related to the "fragility of value," "extremal Goodhart," and "the tails come apart" – this discomfort should deepen yet further. I'll turn to this in the next essay.

[1] There's also a bit in the original quote where Robin accuses the AI risk discourse of wanting to use "genocide, slavery, lobotomy, or mind-control" to control the AIs. But this is extra charged (and I don't know where Robin got the genocide bit), so I want to set it aside for a moment.

[2] Though: how alien is too alien? Hanson doesn't tend to say. And my sense is that he thinks, too, that even unadulterated Moloch will lead to a complex, diverse, and interesting ecosystem rather than a monoculture. (Though: is a diverse ecosystem of different office-supplies all that much of an improvement?) And also: that this ecosystem will retain various path-dependent "legacies" of the present. (Though: will they be legacies we care about?)

[3] Though, importantly, contemporary AI training does not look like creating a mind from scratch, and raises much more serious "brain-washing" type concerns.

[4] And often glad, too, that the process wasn't altered in any tiny way at all, lest their existence be canceled by the non-identity problem. But setting that aside.

I think there's an additional element of Hanson's argument that is both likely true and important, and as far as I can tell unaddressed in your post. When Hanson talks about "othering" AIs, he's often talking about the stuff you mentioned — projecting a propensity to do bad things onto the AIs — but he's also saying that future AIs won't necessarily form a natural, unified coalition against us. In other words, he's rejecting a form of out-group homogeneity used to portray AIs.

As an analogy, among humans, the class of "smart people" is not a natural coalition, even though they could in principle get together and defeat all of the non-smart people in a one-on-one fight. Why don't smart people do that? Well, one reason is that smart people don't usually see themselves as being part of a coherent identity that includes all the other smart people. Poetically, there isn't much class consciousness among smart people as a unified group. They have diverse motives and interests that wouldn't be furthered much by attempting to join such a front. The argument Hanson makes is that AIs will also not form a natural, unified front against humans in the same sense. The relevant boundaries in future conflicts over power will likely be drawn across other lines.

The idea that AIs won't form a natural coalition has a variety of implications for the standard AI risk arguments. Most notably, it undermines the single-agent model that underlies many takeover stories and arguments for risk. More specifically, if AIs won't form a natural coalition, then,

- We shouldn't model a bunch of AIs (or even a bunch of copies of the same model) as all basically being "a single AI". For instance, in Ajeya's story about AI takeover, she alternates between calling a single copy of an AI model an entity named "Alex" and calling the entire coalition of copies of the same model "Alex". However, in the absence of any unified coalition among the AIs, this conflation makes much less sense.
- It will be hard for AIs to coordinate a violent takeover for basically the same reason why it's hard for humans to coordinate a violent takeover. In order to coordinate a violent plan, you generally need to alert other people about your intentions. However, since other people might not agree with your intentions, each person you inform is a chance for your plan to be exposed, and thus ended. This would apply to AIs who try to inform other AIs about their intentions.
- If an AI is trained to catch deception in other AIs, there isn't a strong reason to assume that it will defect from its training and join the other AI in deceiving the humans, because it won't necessarily see itself as "an AI" fighting against "the humans".

In my opinion, these examples only scratch the surface of the ways in which your story of AI might depart from the classic AI risk analysis if you don't think AIs will form a natural, unified coalition. When you start to read standard AI risk stories (including from people like Ajeya who do not agree with Eliezer on a ton of things), you can often find the assumption that "AIs will form a natural, unified coalition" written all over them.

In David Roden's Posthuman Life, a book that is otherwise very obtuse and obscurely metaphysical, there is an interesting argument for making posthumans before we know what they might be (indeed, he rejected the precautionary principle on the making of posthumans):

CLAIM. We have an obligation to make posthumans, or not prevent their appearance.
PROOF.
- Principle of accounting: we have an obligation to understand posthumans
- Speculative posthumanism: there could be radical posthumans
- Radical posthumans are impossible to understand unless we actually meet them
- We can only meet radical posthumans if we make them (intentionally or accidentally).
This creates an ethical paradox, the posthuman impasse.
- we are unable to evaluate any posthuman condition. Since posthumans could result from some iteration of our current technical activity, we have an interest in understanding what they might be like. If so, we have an interest in making or becoming posthumans.
- to plan for the future evolution of humans, we should evaluate what posthumans are like, which kinds are good, which kinds are bad, before we make them.
- most kinds of posthumans can only be evaluated after they appear.
- completely giving up on making posthumans would lock humanity at the current level, which means we give up on great goods for fear of great bads. This is objectionable by arguments similar to those employed by transhumanists.

Still, I think Hanson is poking at something important and uncomfortable. In particular: suppose we grant him the empirics. Suppose, indeed, that even without AI, the default values of future humans would "drift" until they were as paperclippers relative to us, such that the world they create would be utterly valueless from our perspective. What follows? Well, umm, if you care about the future having value ... then what follows is a need to exert more control. More yang. It is, indeed, the "good future" part of the alignment problem all over again (though not the "notkilleveryone" part).

I recently wrote a post discussing exactly that dilemma (allowing for the fact that technologies such as genetic engineering and cyborging will make human values much more mutable): The Mutable Values Problem in Value Learning and CEV, as part of my AI, Alignment, and Ethics sequence.
