Being nicer than Clippy

Joe Carlsmith

(Cross-posted from my website. Podcast version here, or search "Joe Carlsmith Audio" on your podcast app.

This essay is part of a series I'm calling "Otherness and control in the age of AGI." I'm hoping that the individual essays can be read fairly well on their own, but see here for a summary of the essays that have been released thus far.)

In my last essay, I discussed a certain kind of momentum, in some of the philosophical vibes underlying the AI risk discourse,^[1] towards deeming more and more agents – including: human agents – "misaligned" in the sense of: not-to-be-trusted to optimize the universe hard according to their values-on-reflection. We can debate exactly how much mistrust to have in different cases, here, but I think the sense in which AI risk issues can extend to humans, too, can remind us of the sense in which AI risk is substantially (though, not entirely) a generalization and intensification of the sort of "balance of power between agents with different values" problem we already deal with in the context of the human world. And I think it may point us towards guidance from our existing ethical and political traditions, in navigating this problem, that we might otherwise neglect.

In this essay, I try to gesture at a part of these traditions that I see as particularly important: namely, the part that advises us to be "nicer than Clippy" – not just in what we do with spare matter and energy, but in how we relate to agents-with-different-values more generally. Let me say more about what I mean.

Utilitarian vices

As many have noted, Yudkowsky's paperclip maximizer looks a lot like a total utilitarian. In particular, its sole aim is to "tile the universe" with a specific sort of hyper-optimized pattern. Yes, in principle, the alignment worry applies to goals that don't fit this schema (for example: "cure cancer" or "do god-knows-whatever kludge of weird gradient-descent-implanted proxy stuff"). But somehow, especially in Yudkowskian discussions of AI risk, the misaligned AIs often end up looking pretty utilitarian-y, and a universe tiled with something – and in particular, "tiny-molecular-blahs" – often ends up seeming like a notably common sort of superintelligent Utopia.

What's more, while Yudkowsky doesn't think human values are utilitarian, he thinks of us (or at least, himself) as sufficiently galaxy-eating that it's easy to round off his "battle of the utility functions" narrative into something more like a "battle of the preferred-patterns" – that is, a battle over who gets to turn the galaxies into their favored sort of stuff. The AIs want to tile the universe with paperclips; the humans, in Yudkowsky's world, want to tile it with "Fun." (Tiny-molecular-Fun?)

ChatGPT imagines "tiny molecular fun."

But the problem Yudkowsky talks about most – AIs killing everyone – isn't actually a paperclips vs. Fun problem. It's not a matter of your favorite uses for spare matter and energy. Rather, it's something else.

Thus, consider utilitarianism. A version of human values, right? Well, one can debate. But regardless, put utilitarianism side-by-side with paperclipping, and you might notice: utilitarianism is omnicidal, too – at least in theory, and given enough power. Utilitarianism does not love you, nor does it hate you, but you're made of atoms that it can use for something else. In particular: hedonium (that is: optimally-efficient pleasure, often imagined as running on some optimally-efficient computational substrate).

But notice: did it matter what sort of onium? Pick your favorite optimal blah-blah. Call it Fun instead if you'd like (though personally, I find the word "Fun" an off-putting and under-selling summary of Utopia). Still, on a generalized utilitarian vibe, that blah-blah is going to be a way more optimal use of atoms, energy, etc than all those squishy inefficient human bodies. They never told you in philosophy class? It's not just organ-harvesting and fat-man-pushing. The utilitarians have paperclipper problems, too.^[2]

Oh, maybe you heard this about the negative utilitarians. "Doesn't your philosophy want to kill everyone?" But the negative utilitarians protest: "so does the classical version!" And straw-Yudkowsky, at least, is not surprised. In straw-Yudkowsky's universe, killing everyone is, like, the first thing that (almost) any strong-enough rational agent does. After all, "everyone" is in the way of that agent's yang.

But are foomed-up humans actually this omnicidal? I hope not. And real-Yudkowsky, at least, doesn't think so. There's a bit in his interview with Lex Fridman, where Yudkowsky tries to get Lex to imagine being trapped in a computer run by extremely slow-moving aliens who want their society to be very different from how Lex wants it to be (in particular: the aliens have some sort of equivalent of factory farming). Yudkowsky acknowledges that Lex is presumably "nice," and so would not, himself, actually just slaughter all of these aliens in the process of escaping. And eventually, Lex agrees.

What is this thing, "nice"? Not, apparently, the same thing as "preferring the right tiny-molecular-pattern." Existing creatures are unlikely to be in this pattern by default, so if that's the sum total of your ethics, you're on the omnicide train with Clippy and Bentham. Rather, it seems, niceness is something else: something where, when you wake up in an alien civilization, you don't just kill everyone first thing, even though you're strong enough to get away with it. And this even-though (gasp) their utility functions are different from yours. What gives?

"Something very contingent and specific to humans, or at least to evolved creatures, and which won't occur in AIs by default in any way we'd like" answers Yudkowsky. And maybe so.^[3] But I'm interested, here, not in whether AIs will be nice-like-us, but rather, in understanding what our niceness consists in, and what it might imply about the sorts of otherness and control issues I've been talking about in this series.

In particular: a key feature of niceness, in my view, is some sort of direct responsiveness to the preferences of the agents you're interacting with. That is, "nice" values give the values of others some sort of intrinsic weight. The aliens don't want to be killed, and this, in itself, is a pro tanto reason not to kill them. In this sense, niceness allows some aspect of yin into its agency. It is influenced by others; it receives others; it allows itself to channel – or at least, to respect and make space for – the yang of others.

The extreme version of this is preference utilitarianism, which tries to make of itself, solely, a conduit of everyone else. And it might seem, prima facie, an attractive view. In particular: to someone who doesn't like the idea of imposing their own arbitrary, contingent will upon the world, an ideal that instead enacts some sort of "universal compromise will" (i.e., the combination of everyone's preferences) can seem to regain the kind of objective and other-centered footing that anti-realism about ethics threatens to deny. But as I've written about previously, I think the appeal of a pure preference utilitarianism fades on closer scrutiny.^[4] In particular: I think it founders on possible people, on paperclippers, and in particular, on sadists.

But rejecting a pure preference utilitarianism does not mean embracing a stance that refuses to ever give the preferences of others intrinsic weight.^[5] And my sense is that sometimes the AI safety discourse goes too far in this respect. It learns, from paperclippers, the strange and unappealing places that the preferences of arbitrary others can lead. Indeed, Yudkowsky takes explicit steps to break his audience's temptation towards sympathy with Clippy's preferences (this is the point of the abstract notion of "paperclips"), and to place Clippy's agency firmly in the role of "adversary" (see, e.g., the "true prisoner's dilemma"). And against such a backdrop, it's easy (though: not endorsed by Yudkowsky) for the idea that preferences like Clippy's deserve any intrinsic weight to fall out of the picture. After all: Clippy doesn't give our preferences any weight. And aren't we and Clippy ultimately alike, modulo our favored blah-blah-onium?

No. In addition to liking happier onium than Clippy, we are nicer than Clippy to agents-with-different-values. Or: we should be. Indeed, I think we should strive to be the sort of agents that aliens would not fear the way Yudkowsky fears paperclippers, if the aliens discovered they were on the verge of creating us. This doesn't mean we should just adopt the alien preferences as our own – and especially not if the stuff they like is actively evil rather than merely meaningless (more below). But it does mean, for example, not killing them. But also: actively helping them (on their own terms) in cheap ways, treating them with respect and dignity, not enslaving them or oppressing them, and more.^[6]

Alien alignment researcher thinking about p(doom)

That is: human values themselves have stuff to say about how we should treat agents-with-different-values – including, non-humans. Indeed, a huge portion of our ethics and politics ends up dealing with this in one form or another. AI otherness will be new, yes – but we have deep, richly textured, and at-least-somewhat battle-tested traditions to draw on in orienting towards it. Too often, utilitarian vibes forget about these traditions ("isn't it all just an empirical question about what-causes-the-utils?"). And too often, fear that the agents-with-different-values might hurt us makes us forget, too (which, I re-emphasize, isn't to say that agents-with-different-values won't hurt us – cf all this stuff about bears and Nazis and the-brutality-of-nature etc in the previous essays). But faced with a new class of others/fellow-creatures/potential-threats, we should be drawing on every source of wisdom we can.

Boundaries

Let me give an example of ways in which bringing to mind some of the less utilitarian dimensions of human ethics can make a difference to how we orient towards AI systems with values different from our own.

In "Does AI risk 'other' the AIs?," I mentioned two worries the AI alignment discourse has about paperclippers:

1. That they'll kill everyone (and relatedly: violate people's basic rights, steal people's stuff, and violently overthrow the government).
2. That they'll gain power in a way that results in their values (rather than human values) steering the trajectory of earth-originating civilization, thereby leading to a future of ~zero value.

These two worries are often lumped together under the more unified concern that the AIs will have the "wrong values." After all, if they had the right values, presumably they would do neither of these things.

But the two worries are importantly distinct.^[7] For one thing, as has been oft-noted, different human ethical views might disagree about their respective importance. But beyond this, these two worries interact very differently with our existing ethical and political norms governing how agents with different values should relate to one another.

In particular: as a civilization, we have extremely deep and robust norms prohibiting agents from doing worry-number-1-style behavior: i.e., killing other people, stealing other people's stuff, and trying to overthrow the government (though of course, there are exceptions and complexities). That is, worry-number-1 casts the AIs in a role that triggers very directly our sense that we are dealing with aggressors who are violating important boundaries – boundaries that lie at the core of human cooperative arrangements – and whose behavior therefore warrants unusually strong forms of defensive response. For example: if someone is breaking into your home with nano-bots trying to kill you, you are morally permitted – on the basis of self-defense – to do things that would otherwise be impermissible (even to save your own life) in other contexts: for example, killing them (where this is necessary and proportionate).^[8] Similarly: you are justified in doing things to people who are invading your country that you aren't justified in doing if they aren't invading your country, and so forth. The misaligned AIs, according to worry-number-1, are enemies of this deep and familiar sort.^[9]

"Hitler watching German soldiers march into Poland in September 1939." An example of a worry-number-1-style boundary violation. (Image source here.)

But what of worry-number-2? Here, hmm: if we take worry-number-1 fully off the table, I think it becomes quite a bit less clear what standard (western, liberal, broadly democratic) ethical and political norms have to say about worry-number-2 on its own. To see this, consider the following thought experiment (caveat: I'm really, really not saying that misaligned AIs will be like this).

Imagine a liberal society very much like our own, except with the addition of one extra human cultural group: namely, the humans-who-like-paperclips. The humans-who-like-paperclips are a sect of humans that arose at some point in the sixties and has been growing ever since. They are meticulously law-abiding, kind, and cooperative, but they have one weird quirk: the main thing they all want to do with their personal resources is to make paperclips. Passing by a house owned by a human-who-likes-paperclips, you'll often see large, neatly-sorted stacks of paperclip boxes in their backyards, and through the windows of their garages, and sometimes in the living rooms. The richer humans-who-like-paperclips own whole warehouses. The paperclip industry is booming.

Yeah sometimes he just stands there looking at them...

Now, let's start by noticing that in this context, it's not at all clear that "the humans-who-like-paperclips have different values from us" qualifies as a problem, at least by the lights of basic western, liberal norms (here I mean liberalism in the political-philosophy sense roughly at stake in this Wikipedia page, rather than in the "liberals vs. republicans" sense). What the humans-who-like-paperclips do with their private resources, and in the privacy of their homes/backyards, is their own business, conditional on its compatibility with certain basic norms around harm, consent, and so forth. After all: Alicia down the street spends her free time and money listening to noise music; Jim sits around watching trashy TV in a drunken haze; Felipe has sex with other men; Maria collects stamps; and Jason is Mormon. Are the humans-who-like-paperclips importantly different? What happened to liberal tolerance?

Now, of course, utilitarianism-in-theory was never, erm, actually very tolerant. Utilitarianism is actually kinda pissed about all these hobbies. For example: did you notice the way they aren't hedonium? Seriously tragic. And even setting aside the not-hedonium problem (it applies to all-the-things), I checked Jim's pleasure levels for the trashy-TV, and they're way lower than if he got into Mozart; Maria's stamp-collecting is actually a bit obsessive and out-of-balance; and Mormonism seems too confident about the optimal amount of coffee. Oh noes! Can we optimize these backyards somehow? And Yudkowsky's paradigm misaligned AIs are thinking along the same lines – and they've got the nano-bots to make it happen.

I sometimes think about this sort of vibe via the concept of "meddling preferences." That is: roughly, we imagine dividing up the world into regions ("spaces," "spheres") that are understood as properly owned or controlled by different agents/combinations of agents. Literal property is a paradigm example, but these sorts of boundaries and accompanying divisions-of-responsibility occur at all sorts of levels – in the context of bodily autonomy, in the context of who has the right to make what sort of social and ethical demands of others, and so forth (see also, in more interpersonal contexts, skills involved in "having boundaries," "maintaining your own sovereignty," etc).

Some norms/preferences concern making sure that these boundaries function in the right way – that transactions are appropriately consensual, that property isn't getting stolen, that someone's autonomy is being given the right sort of space and respect. A lot of deontology, and related talk about rights, is about this sort of thing (though not all). And a lot of liberalism is about using boundaries of this kind to help agents with different values live in peace and mutual benefit.

Meddling preferences, by contrast, concern what someone else does within the space that is properly "theirs" – space that liberal ethics would often designate as "private," or as "their own business." And being pissed about people using their legally-owned and ethically-gained resources to make paperclips looks a lot like this. So, too, being pissed about noise-musicians, stamp-collectors, gay people, Mormons, etc. Traditionally, a liberal asks, of the humans-who-like-paperclips: are they violating any laws? Are they directly hurting anyone? Are they [insert complicated-and-contested set of further criteria]? If not: let them be, and may they do the same towards "us."

Humans-who-like-stamps, at a convention. (Image source here.)

Many "axiologies" (that is, ways of evaluating the "goodness" of the world) are meddling in a way that creates tension with this sort of liberal vibe. After all: axiologies concern the goodness of the entire world. Which means: all the "regions." In this sense, axiology is no respecter of boundaries. Of course, you could have an axiology that prefers worlds precisely insofar as they obey some set of boundary-related norms, and which has no preferences about what-happens-in-back-yards, but one finds this rarely in practice. To the contrary, many axiologies are concerned, for example, with the welfare of the agents involved (the average welfare, the total welfare, etc), or the beauty/friendship/complexity/fun etc occurring in the different regions. And if you give people liberal freedoms in their own spheres, sometimes they make those spheres less-than-optimally welfare-y/beautiful/complex/fun etc. Thus that classic tension between goodness and freedom (cf. "top down" vs. "bottom up"; and see also Nozick's critique of "end-state" and "patterned" principles of justice).

The "utility functions" that Yudkowskian rational agents pursue need not be axiologies in a traditional sense. But somehow, they often end up pretty axiology-vibed.^[10] No wonder, then, that Clippy is no respecter of boundaries, either. Indeed, in many respects, Yudkowsky's AI nightmare is precisely the nightmare of all-boundaries-eroded. The nano-bots eat through every wall, and soon, everywhere, a single pattern prevails. After all: what makes a boundary bind? In Yudkowsky's world (is he wrong?), only two things: hard power, and ethics. But the AIs will get all the hard power, and have none of the ethics. So no walls will stand in their way.

But I claim that humans often have the ethics bit.^[11] Or at least, human liberals, on their current self-interpretation. Of course, this isn't to say that liberals are OK with anything happening inside "walled" zones that might be intuitively understood as "private." For example: it's a contested question what aspects of a child's life should be under the control of a parent, but clearly, you aren't allowed to abuse or torture your own children (or anyone else), even in your own living room with the blinds drawn. And similarly, at a larger scale: the borders between nation states are a paradigm example of a certain kind of "boundary," but we believe, nevertheless, that certain sorts of human-rights-abuses inside a sovereign nation warrant infringing this boundary and righting the relevant wrong.

Often, though, these sorts of boundary infringements are justified precisely insofar as they are necessary to prevent some other boundary violation (e.g., child abuse, genocide) taking place within the first boundary. Indeed, Yudkowsky often turns to this sort of thing when he tries to prompt humans to behave in a manner analogous to a paperclipping AI. Thus, in "Three Worlds Collide," he specifically has humans encounter (and then: decide to intervene on violently) an alien species that eats their own conscious, suffering children – rather than, e.g., a species that just spends its resources making paperclips. And in trying to induce Lex to try to take over an alien world he wakes up in ("don't think of it as 'world domination'," Yudkowsky says with a grin, "think of it as 'world optimization'"), Yudkowsky specifically appeals to the idea that the alien civilization involves a lot of harm and suffering – via war, or via some equivalent of factory farming – that Lex could alleviate, rather than to the idea that the aliens use their resources (and still less: their atoms) on boring/meaningless/sub-optimal things.

And to be clear: I agree that preventing harm, suffering, genocide, and so forth can justify infringing otherwise-important boundaries. (Indeed, I think that as it becomes possible to create suffering and harm in digital minds using personal computers, we're going to have to grapple with new tensions in this respect. Your backyard is yours, yes: but just as you can't abuse your children there, neither can you abuse digital minds.) But I also want to be clear that what's going on with the part of human values that says "no torturing people even in your own backyard" is much more specific, and much more compatible with "niceness" in other contexts, than what's going on with an arbitrary rational optimizer stealing your atoms to make its favored form of blah-blah-onium.

For example: if Lex were to wake up in a civilization of peaceful paperclippers, whose civilization involves no suffering (but also, let's say, very little happiness), but who spend all of their resources on paperclips, it seems very plausible to me that the right thing for Lex to do is to mostly leave them alone, rather than to engage in some project of world-domination/optimization (maybe Lex escapes to some other planet, but he doesn't take over the alien government and turn their paperclip factories into Fun-onium factories instead). And this even though Lex likes fun a lot more than paperclips.

Yudkowsky, to his credit, is attuned to this aspect of human ethics (the humans in Three Worlds Collide, for example, look for ways to respect and preserve baby-eater culture while still saving the babies) – but his rhetoric can easily leave it in the background. For example, in trying to induce Lex to world-dominate/optimize, Yudkowsky reminds him: "the point is: they want the world to be one way, you want the world to be a different way." But for a liberal: that's not good enough. All the time, my preferences conflict with the preferences of others. All the time, according to me, they could be using their private resources more optimally. Does this mean I dominate/optimize their backyards as soon as I'm powerful enough to get away with it? Not, I claim, if I am nice.

Of course, an even-remotely-sophisticated ethics of "boundaries" requires engaging with a ton of extremely gnarly and ambiguous stuff. When, exactly, does something become "someone's"? Do wild animals, for example, have rights to their "territory"? See all of the philosophy of property for just a start on the problems. And aspirations to be "nice" to agents-with-different-values clearly need ways of balancing the preferences of different agents of this kind – e.g., maybe you don't steal Clippy's resources to make fun-onium; but can you tax the rich paperclippers to give resources to the multitudes of poor staple-maximizers?^[12] Indeed, remind me your story about the ethics of taxation in general?

I'm not saying we have a settled ethic here, and still less, that its rational structure is sufficiently natural and privileged that tons of agents will converge on it. Rather, my claim is that we have some ethic here – an ethic that behaves towards "agents with different values" in a manner importantly different from (and "nicer" than) paperclipping, utilitarianism, and a whole class of related forms of consequentialism; and in particular, an ethic that doesn't view the mere presence of (law-abiding, cooperative) people-who-like-paperclips as a major problem.

And such an ethic seems well-suited, too, to handling the possibility – discussed in the previous essay – that different humans might end up with pretty different values-on-reflection as well. Liberalism does not ask that agents sharing a civilization be "aligned" with each other in the sense at stake in "optimizing for the same utility function." Rather, it asks something more minimal, and more compatible with disagreement and diversity – namely, that these agents respect certain sorts of boundaries; that they agree to transact on certain sorts of cooperative and mutually-beneficial terms; that they give each other certain kinds of space, freedom, and dignity. Or as a crude and distorting summary: that they be a certain kind of nice. Obviously, not all agents are up for this – and if they try to mess it up, then liberalism will, indeed, need hard power to defend itself. But if we seek a vision of a future that avoids Yudkowsky's nightmare, I think the sort of pluralism and tolerance at the core of liberalism will often be a more promising guide than "getting the utility function that steers the future right."

What if the humans-who-like-paperclips get a bunch of power, though?

Let's keep going, though, with the thought experiment about the humans-who-like-paperclips, until it hits on worry-number-2 more directly. In particular: thus far the humans-who-like-paperclips are just one human group among others. But what happens if we imagine them becoming the dominant human group – albeit, via means entirely compatible with respect for the boundaries of others, and with conformity to liberal ethics and laws?

Thus, let's say that the humans-who-like-paperclips are quite a bit smarter, more productive, and better coordinated than basically everyone else. As a result of their labors in the economy and their upstanding citizenship, humans in general are richer, happier, stronger, and healthier relative to a world without them. But for closely related reasons, and without violating any legal or ethical norms (all the economic transactions they engage in are consensual, fully-informed, and mutually beneficial), they are gradually accumulating more and more power. Their population is growing unusually fast; they own a larger and larger share of capital; and they exert more and more influence over politics and public opinion – albeit, in entirely above-board ways (much more above board, indeed, than many of the other groups vying for influence). Analysts are projecting that in a few decades, humans-who-like-paperclips will be the most powerful human group, for most measures of power – more powerful, indeed, than all the other groups combined. And they're predicting that for various reasons to do with the pace of technological development, this dominance will grant the humans-who-like-paperclips enormous influence over the trajectory of humanity's future.

Now, it's natural to wonder whether, once the humans-who-like-paperclips achieve sufficient dominance, all this niceness and cooperativeness and good-citizenship and respect-for-the-law stuff might fall by the wayside, and whether they might start looking more hungrily at your babies and your atoms. But suppose that somehow, you know that this won't happen. Rather, the humans-who-like-paperclips will continue to meticulously respect legal and ethical norms (or at least, the sort of minimal, boundary-related ethical norms I gestured at above). No one will get nano-bot-ed; the humans-who-like-paperclips won't sneak any suffering or slavery into their paperclip piles; and the humans-who-like-other-stuff (e.g. "Fun") will be able to happily pursue this other stuff from within secure backyards that are extremely ample by today's standards. But most of the resources of the future will go towards paperclips regardless.^[13]

How bad is this outcome? Different ethical views will disagree, and a less-crude analysis would obviously include factors other than "conformity to very basic liberal norms" and "what happens with the galaxies." Crudely, my own view is that the galaxy thing is actually a huge deal, and that even with basic liberal norms secure, turning ~all reachable resources into literal paperclips would be a catastrophic waste of potential.^[14] But I also want to acknowledge that this is a very different sort of big deal than someone, or some group, killing everyone else and taking their stuff (and note that distant galaxies are not, in any meaningful sense, "ours," despite transhumanist talk about "our cosmic endowment"). In particular: the pure galaxies thing implicates different, and more fraught, ethical questions about otherness and control.

Thus: once we specify that basic liberal norms will be respected regardless, further disputes-over-the-galaxies look much more like a certain kind of raw competition for resources. It's much less akin to a country defending itself from an invader, and much more akin to one country racing another country to settle and control some piece of currently-uninhabited territory.^[15] The dispute is less about upholding the basic conditions of cooperation and peace-among-differences, and more about whose hobbies get-done-more; who gets the bigger backyard. Does it all come down to land use?

Well, even if it did: land use is actually a very big deal.^[16] And to be clear: I don't like paperclips any more than you do. I much prefer stuff like joy and understanding and beauty and love. But I also want to be clear about what sort of ground I am standing on, according to my own values, when I fight for these things in different ways in different contexts. And according to my own values: it is one thing to defend your boundaries and your civilization's basic norms against aggressors and defectors. It is another to compete with someone who prefers-different-stuff, even while those norms are secure. And it is a third, yet, to become an aggressor/defector yourself, in pursuit of the stuff-you-prefer. But to talk, only, about "having different values" – and especially, to assume that the main thing re: values is your favored use of unclaimed energy/matter, your preferred blah-blah-onium – obscures these distinctions.

In particular: the defending-boundaries thing is where liberalism goes most readily to identify the forms of "otherness" that are not OK: namely, otherness done Nazi-style; otherness that actually, really, is trying to kill you and eat your babies. But the otherness at stake in "cooperative and nice, but still has a different favorite-use-of-resources" is quite different. It's the sort of otherness that liberalism wants to tolerate, respect, include, and even celebrate. Cf noise music, Mormonism, and that greatest test of tolerance: sub-optimally-efficient pleasure. Such tolerance/respect/etc is compatible with certain kinds of competition, yes. But not fighting-the-Nazis style. Not, for example, with the same sort of moral righteousness; and relatedly, not with the same sorts of justifications for violence and coercion.

Indeed, importantly not, if you want peace and diversity both. After all, the wider the set of differences-in-values you allow to justify violence and coercion, the more you are asking either for violence/coercion, or for everyone-having-the-same-values. Or perhaps most likely: violence/coercion in the service of everyone-having-the-same-values. Cf cleansing, purging. Like how the paperclipper does it. But we can do better.

An aside on AI sentience

I want to pause here to address an objection: namely, "Joe, all this talk about tolerance and respect etc – for example, re: the humans-who-like-paperclips – is assuming that the Others being tolerated/respected/etc are sentient. But the AIs-with-different-values – even: the cooperative, nice, liberal-norm-abiding ones – might not even be sentient! Rather, they might be mere empty machines. Should you still tolerate/respect/etc them, then?"

My sense is that I'm unusually open to "yes," here.^[17] I'm not going to try to defend this openness in depth here, but in brief: while I take consciousness very seriously,^[18] and definitely care a lot about something-in-the-vicinity-of-consciousness, I don't feel very confident that our current concepts of "sentience" and "consciousness" are going to withstand enough scrutiny to handle the moral weight that some people currently want to put on them;^[19] I think focus on consciousness does poorly on golden-rule-like tests when applied to civilizations with different conceptions of the precise sorts of functional mental architectures that matter (e.g., aliens that would look at us and say "these agents aren't schmonscious, because their introspection doesn't have blah-precise-functional-set-up" – see e.g. this story for an intuition pump); and I think some of the more cooperation-focused origins and functions of niceness/liberalism/boundaries (including: functions I discuss below re: liberalism and real-politik, where sentience more clearly doesn't matter^[20]) don't point towards consciousness as a key desideratum (and note that I'm here specifically talking about the bits of ethics that are cooperation-flavored, rather than the bits associated with what you personally do in your backyard).^[21] Plus, more generally, I think this is all sufficiently confusing territory that we should err on the side of caution and inclusivity in allocating our moral concern, rather than saying e.g. "whatever, this cognitively-sophisticated-agent-with-preferences isn't conscious – by which I mean, um, that we-know-not-what-thing, that least-understood-thing – so it's fine to torture it, deprive it of basic rights, etc."

Of course, if you stop using sentience as a necessary condition for being worthy-of-tolerance/respect etc, then you need to say additional stuff about where you do draw the sorts of lines I discussed a few essays ago: e.g., "OK to eat apples but not babies," "furbies and thermostats don't get the vote," "you can own a laptop but not a slave,"^[22] and so on.^[23] And indeed, gnarly stuff. My current best guess here would be to hand-wave about agenty-ness and cognitive sophistication and who-would've-been-a-good-target-for-cooperation-in-other-circumstances – but obviously, one needs to say quite a bit more.

For the purposes of understanding the ethical underpinnings of the AI risk discourse, though, I don't think that we need to resolve questions about whether non-sentient AIs-with-different-values are worthy of tolerance/respect. Why? Because the core bits of the Yudkowskian narrative I've been discussing apply even if all the AIs-with-different-values are sentient. The classic paperclipper-doom story, for example, does not require that the paperclipper be insentient: it still kills all the humans, it still turns the galaxies into paperclips, and that's enough.^[24] And Yudkowsky himself would find the possibility of conscious AIs, at least, obvious. Where this includes, presumably, conscious paperclippers. (In reality, my sense is that Yudkowsky thinks consciousness unusually scarce – for example, he's skeptical that pigs are conscious. But this view isn't important to his story.) So for now, in talking about tolerating/respecting AIs-with-different-values, I'll just assume they're sentient, and see what follows.

Indeed: did you think it matters a lot, to the Yudkowsky narrative, whether the AI was sentient? If so, then I suspect you are thinking of this narrative as a less familiar story than it truly is. Ultimately, AI risk is not about humans vs. AIs (in that case, it really would be species-ism/bio-chauvinism), or sentience vs. insentience (the AIs might well be sentient). Rather, it's about something more ancient and basic: namely, agents with different values competing for power. So I encourage you: run the story with conscious humans-with-different-values in the place of the AIs-with-different-values – humans to whom you are more immediately inclined to ascribe moral status, rights, citizenship, tolerance-worthiness, and so forth. You want to make sure that you get the differences-in-values different enough, sure (though: "maximize paperclips" is an unfortunate cartoon; thinking about where RLHF + foom leads seems a better guide). And as I said earlier: people with souls can still be enemy soldiers. But if you're finding that words like "human" or "sentient" are making the agents-with-different-values seem substantially less like enemies, then you're not yet fully keyed to the particular sort of conflict that Yudkowsky has in mind.

Giving AIs-with-different-values a stake in civilization

Let me give another example of a place where I worry that a naïve Yudkowskian discourse can too-easily neglect the virtues of niceness and liberalism: namely, the sort of influence we imagine intentionally giving to AIs-with-different-values that we end up sharing the world with.

Thus, consider Yudkowsky's "proposed thing-to-do with an extremely advanced AGI, if you're extremely confident of your ability to align it on complicated targets": namely, use it to implement humanity's "coherent extrapolated volition" ("CEV"). This means, basically: have the AI do what currently-existing humans would want it to do if they were "idealized" (see more here), to the extent those idealized humans would want the same things.

We see, in Yudkowsky's discussion of CEV, some of his effort to implement a less power-grabby ethic than a simple interpretation of his philosophy might imply. That is: Yudkowsky (at least in 2004) is explicitly imagining a team of AGI programmers who are in the position to take over the world and have their particular (idealized) values rule the future (let's set aside questions about the degree of resemblance this scenario is likely to have to the actual dynamics surrounding AGI development, and treat it, centrally, as a thought experiment). And one might've thought, given the apparent convergence of oh-so-many-rational-agents on the advisability of taking over the world, that Yudkowsky's programmers would do the same.^[25] But he suggests that they should not.

Part of this, says Yudkowsky, is about not ending up like ancient Greeks who impose values on the future they wouldn't actually endorse if they understood better. But that only gets you, in Yudkowsky's ontology, to the programmers making sure to extrapolate their own volitions. It doesn't get you to including the rest of humanity in the process.

What gets you to giving that wider circle a say? Yudkowsky mentions various values – "fairness," "not being a jerk," trying to act as you would wish other agents would act in your place, cooperation/real-politik, not acting like you are uniquely appointed to determine humanity's destiny, and others. I won't interrogate these various considerations in detail here (though see footnote for a bit more discussion).^[26] Rather, my point is about how far the pluralism they motivate should extend.

In particular: Yudkowsky's "extrapolation base" – that is, the set of agents his process grants direct influence over the future – stops at humanity. But it seems plausible to me that whatever considerations motivate empowering all of humanity, in a thought experiment like this, should motivate empowering certain kinds of AIs-with-different-values as well, at least if we are already sharing the world with such AIs by the time the relevant sort of power is being thought-experimentally allocated. For example, in this thought experiment: if at the time the programmers are making this sort of decision, there are lots of moral-patienty AIs with human-level-or-higher intelligence running around, who happen to have very different values from humans, I think they should plausibly be included in the "extrapolation" base too. After all, why wouldn't they be? "Because they're not humans" is actually species-ism. But absent such species-ism, the most salient answer is "because their values are different from ours, so giving them influence will make the future worse by our lights." But that answer could easily motivate not-empowering many humans as well – and the logic, in the limit, might well prompt the programmers to empower only themselves.

Now, the details here about what it means to empower moral-patienty AIs-with-different-values in the right way get gnarly fast (see e.g. Bostrom and Shulman (2022) for a flavor). Indeed, questions about how to handle the empowerment of such AIs are one of the few places I've seen Yudkowsky, in his words, "give up and flee screaming into the night." See, also, one of his characters' exclamation in the face of a sentient iPhone that's been stalking him, and which begs not to be wiped: "I don't know what the fuck else I'm supposed to do! Someone tell me what the fuck else I'm supposed to do here!" At least as of 2008 (has he written on this since?^[27]), Yudkowsky's central advice, in the face of the moral dilemma posed by creating AI moral patients with different values, seems to be: don't do it, at least until you're much readier than we are. And indeed: yes. Just like how: don't create AGI at all until you're much readier than we are. But unfortunately, in both cases: I worry that we're going to need a better plan.

Dese ne wipe...

I won't try to outline such a plan here. Rather, I mostly want to point at the general fact that, insofar as we are in fact aiming to build a world that succeeds at whatever "liberalism" and "boundaries" and "niceness" are trying to do, this world should probably be inclusive, tolerant, and pluralistic with respect to AIs-with-different-values (or at least, moral patient-y ones) as well as humans-with-different-values – at least absent some clear and not-just-species-ist story about why AIs-with-different-values should be excluded. And note, importantly, that this doesn't mean tolerating arbitrarily horrible value systems doing whatever they want, or arbitrarily alien value systems trampling on other people's backyards. This is part of why I think it's worth being clear – indeed, clearer than I've been thus far – about the sorts of values differences liberalism/boundaries/niceness gets fussed about.^[28] Peaceful, cooperative AIs that want to make paperclips in their backyards – that's one thing. Paperclippers who want to murder everyone; sadists who want to use their backyards as torture chambers; people who demand that they be able to own sentient, suffering slaves – that's, well, a different thing. Yes, drawing the lines requires work. And also: it probably requires drawing on specific human (or at least, not-fully-universal) values for guidance. I'm not saying that liberalism/niceness/boundaries is a fully "neutral arbiter" that isn't "taking a stand." Nor am I saying that we know what stand it, or the best version of it, takes. Rather, my point is that this stand probably does not treat "those AIs-we-share-the-world-with have different values from us" as enough, in itself, to justify excluding them from influence over the society we share.

The power of niceness, community, and civilization

So far, I've been making the case for this sort of inclusivity centrally on ethical grounds. But liberalism/niceness/boundaries clearly have practical benefits as well. Nice people, for example, are nicer to interact with. Free and tolerant societies are more attractive to live in, work in, immigrate to. Secure boundaries save resources otherwise wasted on conflict. And so on. There's a reason so many European scientists – including German scientists – ended up working on the Manhattan Project, rather than with the Nazis; and it seems closely related to differences in "niceness."

Indeed, these benefits are enough, at times, to soften the atheism of certain rationalists. For example: Scott Alexander.^[29] As I mentioned in a previous essay: Alexander, in writing about liberalism/niceness/boundaries (e.g. here and here), attributes to it a kind of mysterious power. "Somehow Elua is still here. No one knows exactly how. And the gods who oppose Him tend to find Themselves meeting with a surprising number of unfortunate accidents." Liberalism/niceness/boundaries is not, for Alexander, just another utility function. Still less is it actively weak. Rather, it is a "terrifying unspeakable elder God." "Elua is the god of flowers and free love and he is terrifying. If you oppose him, there will not be enough left of you to bury, and it will not matter because there will not be enough left of your city to bury you in."

A bit like this?

Here, Alexander's vibe is un-Yudkowskian in a number of ways. First, Alexander seems to want to trust, at least partly, in something mysterious – namely, the ongoing power of liberalism/niceness/boundaries, which Alexander admits he does not fully understand. Indeed, I think that various more consequentialist-y stories about the justification for deontological-y norms and virtues – including the ones at stake in liberalism/niceness/boundaries – have some of this flavor as well. That is: consequentialists often argue that you should abide by deontological norms, or be blah sort of virtuous, even when it seems like doing so will make things worse, because somehow, actually, doing so will make things better (for example: because at the level of choosing a policy, or adjusting for biases, or dealing with the constraints of a bounded mind, deontology/virtue does better than consequentialist calculation). Deontology/virtue, on this story, is its own form of power-to-achieve-your-goals – but a form that remains at least somewhat cognitively inaccessible while it is being put-into-practice (otherwise, it could be more fully subsumed within a direct consequentialist calculation). So trust in deontology/virtue, in the hard cases, requires trusting in something not-fully-calculated. (Though of course, there are tons of ways to trust-wrongly, here, too.)^[30]

But beyond his willingness to trust-in-something-mysterious, Alexander's attribution of power to Elua is also in tension with certain kinds of orthogonality between ethics and optimization power. That is, to the extent that Elua represents a set of values, Elua, in a Yudkowskian ontology, is orthogonal to intelligence at least – and thus, to a key source of power. "Paperclips," after all, are neither elder Gods nor younger Gods, neither unspeakable nor speakable. They are, rather, just another direction that power can try to drive an indifferent universe. Why would niceness be any different?

Well, we can think of reasons. Plausibly, for example, the indifferent universe is steered more easily in some directions vs. others. Indeed, the social/evolutionary histories of niceness/boundaries/liberalism are themselves testaments to the ways in which the indifferent universe favors Elua under certain conditions – favoritism that plays a key role in explaining why we ended up valuing Elua-stuff intrinsically, to the extent we do. In this sense, our values are not fully orthogonal to the "universe's values." True, we are not simple might-makes-right-ists, who love, only, whatever is in fact most powerful. But our hearts have, in fact, been shaped by power – so we should not be all that surprised if the stuff we love is also powerful.

Will power of this kind persist into a post-AGI future – and in particular, in a way that should motivate extending various sorts of tolerance and inclusivity towards AIs-with-different-values on pragmatic rather than purely ethical grounds? My sense is that Yudkowskian-ism often imagines that it won't. In particular: the practical benefits of liberalism/niceness/boundaries often have to do with the ways in which they allow agents with different values, but broadly comparable levels of power, to cooperate and to live together in harmony rather than to engage in conflict. But as I discussed above: Yudkowsky is typically imagining a post-AGI world in which AIs-with-different-values and humans do not have broadly comparable levels of power. Rather, either AIs-with-different-values have all the power, or (somehow, due to a miracle) humans do. So finding a modus vivendi can seem less practically necessary.

Again, I'm not going to delve into these dynamics in any detail, but I'm skeptical that we should be writing off the purely practical benefits of extending various forms of niceness/liberalism/boundaries to AIs-with-different-values, especially from our current epistemic position. In particular: I think there may well be crucial stages along the path to a post-AGI future in which AIs-with-different-values and humans do indeed have sufficiently comparable levels of power, at least in expectation, that the practical virtues of niceness/liberalism/boundaries may well have a positive role to play – including: a role that helps us avoid having to put our trust in any foomed-up concentration of power, whether human or artificial. I am especially interested, here, in visions of a post-AGI distribution of power that would give various AIs-with-different-values more of an incentive, ex ante, to work with humans to realize the vision in question, as a part of a broadly fair and legitimate project, rather than as part of an effort, on humanity's part, to use (potentially misaligned and unwilling) AI labor to empower human values in particular. But fleshing this out is a task for another time.

Is niceness enough?

My main aim, in this essay, has been to point at the distinction between a paradigmatically paperclip-y way of being, and some broad and hazily defined set of alternatives that I've grouped under the label "liberalism/niceness/boundaries" (and obviously, there are tons of other options as well). Too often, I think, a simplistic interpretation of the alignment discourse imagines that humans and paperclippers are both paperclippy at heart – but just, with a different favored sort of stuff. I think this picture neglects core aspects of human ethics that are, themselves, about navigating precisely the sorts of differences-in-values that the possibility of AIs-with-different-values forces us to grapple with. I think that attention to these aspects of human ethics can help us be better than the paperclippers we fear – not just in what we do with spare resources, but in how we relate to the distribution of power amongst a plurality of value systems more broadly. And I think it may have practical benefits as well, in navigating possible conflicts both between different humans, and between humans and AIs.

That said: depending on how exactly we interpret liberalism/niceness/boundaries, it's also possible to imagine futures compatible with various versions (and especially, minimal versions – e.g., property rights are respected, laws don't get broken, laws are passed democratically, etc), but which are nevertheless bleak and even horrifying in other respects – for example, because love and joy and beauty and even consciousness have vanished entirely from the world.^[31] In this sense, and depending on the details, the bits of ethics I've been gesturing at here aren't necessarily enough, on their own, for even a minimally good future (let alone a great one). In particular: absent help from an indifferent universe, in order to have substantive amounts of love/joy/beauty in the future, you need agents who care about these things having enough power to keep them around to the relevant degree – and different conceptions of liberalism/niceness/boundaries may not guarantee this. So even beyond the yin of being nice/liberal/boundary-respecting towards agents who don't like love/joy/beauty, some kind of active yang, in the direction of love/joy/beauty etc, is necessary, too.^[32] In the next essay, I'll return to questions about this sort of yang – and in particular, questions about whether it involves attempting to exert inappropriate levels of control.

In particular, vibes related to the "fragility of value," "extremal Goodhart," and "the tails come apart." ↩︎
Though in fairness, forms of "threshold deontology" that introduce constraints that can only be violated if the stakes are high enough – e.g., you can only push the fat man if it will save x lives, where x is quite a bit larger than utilitarianism would suggest – face this issue, too. E.g., the onium at stake can quickly become more-than-x. Thanks to Will MacAskill for discussion here. ↩︎
See here for some debate. Part of my argument, in this essay, is that we should not do the "teach the aliens the value of friendship" thing that Soares seems to endorse here. ↩︎
Though: I don't think it disappears. ↩︎
Remember: caring about an agent's preferences is conceptually distinct from caring about her welfare. ↩︎
And I think we should be open to doing this even if they aren't sentient – more below. ↩︎
Hanson's critique of the alignment discourse emphasizes the distinction. ↩︎
As a maybe-clearer example: if a team of five people breaks into your house trying to kill you, you can kill all of them if necessary to save yourself. But if you are on the way to the hospital and the only way to save yourself is to run over five people on the road, you aren't permitted to do it. ↩︎
Though note that we're creating them – and doing so, in the AI risk story, without adequate care to avoid the relevant sorts of aggressions, for the sake of other not-always-fully-laudatory motives. This complicates the moral narrative. ↩︎
Maybe something about "consequentialism" in AIs-that-get-things-done is to blame? But even if you add in deontological constraints, Yudkowsky (as I understand him) predicts that the AIs will simply pursue the "nearest unblocked neighbor" of those constraints. ↩︎
Though: human society today often also puts adequate hard power behind its walls, given the current attempted-invasions. And let's keep it that way, even as the invasions get oomphier. ↩︎
Thanks to Howie Lempel for discussion of this point. ↩︎
We can wonder why the existing political order lets this happen, but let's set this aside for now. ↩︎
Roughly twenty billion galaxies, according to Toby Ord's The Precipice, p. 233. ↩︎
"Like the colonialists?" Well: the "uninhabited" bit is really important – at least if you're a boundary-respecter. But let's not pretend that colonialist vibes are so far off in the distance, here. ↩︎
In particular: lots of human and animal lineages have suffered, died, and disappeared for lack of land (and this is not to mention: having their land actively stolen, invaded, and so on). And what are most wars fought over? Thanks to Carl Shulman for discussion here. ↩︎
Though I remain pretty uncertain/confused about various of the issues here. And obviously, it would be great to first get a bunch more ethical clarity about this sort of thing before having to make decisions about it. ↩︎
More seriously than e.g. the illusionists. ↩︎
E.g., I worry it'll end up looking like people saying "if an agent doesn't have phlogiston, it doesn't deserve any moral weight." ↩︎
Game theory works regardless of whether the agents you're interacting with are conscious. ↩︎
In the context of choosing-what-to-build-in-your-backyard, I feel much happier to focus directly on getting the "thing-that-matters-in-the-vicinity-of-what-we-currently-call-consciousness" thing right. But here I'm talking about the bits of ethics that are about relating-to-other-backyards (but: still in a terminal-values sense, not a game-theory sense). ↩︎
We're assuming that you're not running any slaves on the laptop. ↩︎
Thanks to Howie Lempel for discussion. ↩︎
And note that just because it's sentient doesn't mean the world it creates involves a lot of sentience. ↩︎
Though perhaps not: that Yudkowsky would advise them to do the same. ↩︎
For some of these rationales, note that it's not actually clear how this gets him away from the programmers just extrapolating their own volitions. After all, if their own extrapolated volitions would value fairness, not being a jerk, golden-ruling, etc in the manner in question, then the output of the extrapolation process would presumably reflect this (Yudkowsky uses this sort of dynamic to respond to various other objections to his proposal: e.g., "if that's a good objection, our extrapolated volitions will notice and adjust for it"). And if not, they would have avoided a mistake by their own lights by keeping the circle narrow.

Indeed, in a simple version of Yudkowsky's ontology, it's unclear how the programmers could possibly do better than just extrapolating their own volitions. Their own extrapolated volitions, after all, set the standard (on Yudkowsky's anti-realist ethics) for what the right choice would be. Is Yudkowsky imagining programmers who face the option to make a correct-by-definition choice, and advising them to maybe make a mistake instead?

Well, let's be careful. Some choices can't be unmade – including choices to find out what-you-should-have-done. Suppose, at t1, that your mother is about to drown, and you have a choice between saving her, or asking a genie for advice/service. If you ask the genie "what is the right decision at t1?", it might well answer at t2, "you should have saved your mother, who just drowned." And if you ask it "figure out what I should have done at t1, and then do it," it might be too late. So, too, with the choice to seek power. Power is useful for many values, yes, but famously, obviously, seeking power can compromise your values too. Indeed, it often does, given how many of our ethical values are specifically about regulating who gets what sort of power (cf "boundaries" above) – plus, you know, the power-corrupts thing, the biased-in-favor-of-yourself thing, and so on. And this holds true even if the power in question will grant you arbitrary insight into the values you compromised. If you take-over-the-world in the process of finding out whether you should've taken-over-the-world – well, you can still have fucked up.

And beyond this, certain kinds of cooperation, coordination, and commitment often involve making choices that might seem at the time, from the perspective of a certain kind of narrow rational calculation, like "mistakes." The way, for example, cooperating in a prisoner's dilemma – or paying in the city in "Parfit's hitchhiker" – is a "mistake." The type of mistake that seems, mysteriously, to get made by agents who end up rich, or alive-at-all. Is it a mystery? Sometimes, being the sort of person that others can trust, coordinate with, rely on, get-to-the-pareto-frontier-with, and so on requires being such that you don't just grab power for yourself (or lie, or steal, or crush the outgroup, or throw out the procedural norms of your democracy, or...) when you can get away with it, or think you can – even if that's what would get you the most (extrapolated) utility at the time (at least, for some notion of "would").

And we can talk about other possible reasons why Yudkowsky's programmers might use a wider "extrapolation base" than their own volitions as well (see e.g. Yudkowsky's original paper, and discussion on Arbital here, for longer discussion). ↩︎
I'm not counting the "Comp sci in 2027" as really laying out a position re: what to do. ↩︎
For example, in the context of whether animals should be empowered, Yudkowsky worries: what happens if you "uplift" a bear, or a chimp, or an ichneumonid wasp, and it just wants to eat babies, or to sit atop some violent and oppressive dominance hierarchy, or to lay parasitic eggs inside of everyone? And Yudkowsky worries about humans in this respect as well – see, e.g., his discussion of the "selfish bastards" problem here, in which so many present-day humans want sentient, suffering slaves that humanity's CEV says yes. But as I've tried to emphasize: these aren't just any old values differences. Rather, these are precisely the sort of values differences that liberalism/niceness/boundaries gets fussed about. ↩︎
Though: he was always less of an atheist than Yudkowsky. ↩︎
And blind hope that blah sort of deontological-seeming behavior will somehow lead to the best consequences can easily fail to grapple with the trade-offs that actual-deontology actually implies. ↩︎
If you think of libertarianism as encoding a minimal form of niceness/liberalism/boundaries, then a libertarian-ish, Age-of-Em-ish world where eventually all the sentient agents die/lose their property/get outcompeted, but through legal and minimal-ethical-constraint-respecting processes, might be one example here. ↩︎
And of course, even working on behalf of liberalism/niceness/boundaries is a form of yang in its own right. ↩︎

Great post! I’m getting a lot out of this series.

Here are some of the paths that I think lead some people to thinking that a boundary-respecting post-AI future is unlikely or bad. (Note: I don’t have a strong position either way, at the end of the day; I'm just trying to facilitate good discussion.)

Belief that “pure-consequentialist AI” is kinda the only way to build very powerful (tech-inventing, self-improving, reflectively-stable) AI, so we should expect that to happen sooner or later.

(By “pure-consequentialist AI”, I mean AI that has preferences about the state of the universe in the distant future, and those preferences inform its actions.)

My impression is that Eliezer & Nate believe something like this. See for example Nate’s post Deep Deceptiveness (see also my response comment).

Anyway, I don't buy this belief, for reasons in my post Consequentialism & Corrigibility. In short, I think it’s possible to make AIs that are consequentialist enough to invent tech, self-improve, and so on, but that simultaneously have reflectively-stable preferences about respecting norms and so on. Humans are an example, I claim.

…But there's a softer version of that:

Belief that “pure-consequentialist AI” will outcompete the “non-pure-consequentialist AI”, even if (per above) the latter is real and powerful.

It’s true that if Agent A is a pure consequentialist (has preferences about the state of the universe in the distant future), and Agent B is not (it has both preferences about the state of the universe in the distant future and preferences about other kinds of things like following norms, respecting boundaries, etc.), then, other things equal, one should expect the state of the universe in the distant future to have more in common with Agent A’s preferences than Agent B’s. For example, insofar as it's instrumentally useful to maintain a reputation for norm-following, well, Agent A can do that too. But Agent A can also do ruthless power-seeking when it can get away with it. (There's an exception in principle if AIs can read each other's source code, but I'm not sure if that's actually feasible in practice.)

Anyway, I see this as an important dynamic to keep in mind, but I'm not sure how decisive it will be.

Belief that good enforceable boundaries are a temporary luxury of our technological immaturity, i.e. offense-defense balance will change in the future

For example, plausibly it's much easier to make boundary-ignoring nanobots than either boundary-respecting nanobots or nanobot defense systems. (Or substitute “invasive species from hell” if you’re not into nanobots.)

The OP already mentioned another important example: if it becomes possible to create sentient minds in the privacy of one's own consumer GPU (as I expect eventually), that creates challenges to envisioning a liberal-genre good future.

Belief that the power of cooperation (Elua) is a temporary feature of our technological immaturity

For one thing: if a human wants allies, he can be cooperative, or charismatic, or distribute spoils, etc. If an AI wants allies, it can do any of those things, or it can simply create more copies of itself, which is a very different and potentially very effective strategy.

There's a trope in zombie apocalypse movies where the zombies can turn people into more zombies who then immediately join the zombie cause. In human-world, that's fiction, but in AI-world, it will presumably be possible for AIs to take control of each other’s chips and use them to run more copies of themselves (either by cyber-attacking each other, or even teleoperating a robot to get physical access to another AI’s chips, and go get root access with a soldering iron or whatever). I don’t really know how this would play out but it seems like it might importantly remove the strategic advantage of playing-well-with-others.

Another thing: Very few humans are liable to act sincere and cooperative for an extended period, and then stab allies in the back as soon as the situation changes. Most people act sincere and cooperative because of their innate social drives; then there are a small number of smart sociopaths and so on, but they tend to be impulsive and impatient rather than patient and strategic, by and large. But all bets are off with future AIs. So old-fashioned cooperation (through trust, reputation, etc.) might not be a stable equilibrium in the future. It could be replaced by some high-tech version of cooperation (reading each others’ source code?), but it’s unclear whether there’s anything feasible in that genre. (My perennial uncertainty is: AI 1 can straightforwardly send source code / model weights / whatever to AI 2, but how can AI 1 prove to AI 2 that this file is actually its real source code / model weights / whatever? There might be a good answer, I dunno.)

(My perennial uncertainty is: AI 1 can straightforwardly send source code / model weights / whatever to AI 2, but how can AI 1 prove to AI 2 that this file is actually its real source code / model weights / whatever? There might be a good answer, I dunno.)

They can jointly and transparently construct an AI 3 from scratch motivated to further their deal, and then visibly hand over their physical resources to it, taking turns with small amounts in iterated fashion.

AI 3 can also be given access to secrets of AI 1 and AI 2 to verify their claims without handing over sensitive data.

I think this idea should be credited to Tim Freeman (who I quoted in this post), who AFAIK was the first person to talk about it (in response to a question very similar to Steven's that I asked on SL4).
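To make the "taking turns with small amounts" part concrete, here is a toy sketch in Python. Everything in it is an illustrative assumption rather than part of the proposal above: the function name `iterated_handover`, the integer resource units, and the `b_defects_at` parameter are all hypothetical, and the sketch says nothing about how AI 3 is built or verified. It only illustrates why small tranches limit exposure: if one party stops contributing, the other has lost at most one tranche more than the defector has put in.

```python
def iterated_handover(total_a, total_b, step, b_defects_at=None):
    """Toy model: AI 1 and AI 2 alternately hand small tranches to a shared AI 3.

    All names and parameters are hypothetical, for illustration only.
    Returns (amount handed over by AI 1, amount handed over by AI 2).
    """
    if step <= 0:
        raise ValueError("step must be positive")

    given_a = 0
    given_b = 0
    round_no = 0
    while given_a < total_a or given_b < total_b:
        # AI 1 moves first in each round, with at most one tranche.
        given_a += min(step, total_a - given_a)

        # Optional defection: AI 2 stops contributing in this round,
        # after AI 1 has already handed over its tranche.
        if b_defects_at is not None and round_no == b_defects_at:
            break

        # AI 2 matches with its own tranche.
        given_b += min(step, total_b - given_b)
        round_no += 1
    return given_a, given_b


# Honest run: both sides end up fully committed.
honest = iterated_handover(100, 100, 5)

# AI 2 defects in round 3: AI 1 is exposed by a single tranche.
cheated = iterated_handover(100, 100, 5, b_defects_at=3)
exposure = cheated[0] - cheated[1]
```

With equal totals, `exposure` never exceeds `step`, which is the whole appeal of doing the hand-over in small increments. The hard parts (making AI 3 transparently trustworthy, and making the hand-over physically visible) are outside what this sketch models.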

Well, even if it did: land use is actually a very big deal.^[16] And to be clear: I don't like paperclips any more than you do. I much prefer stuff like joy and understanding and beauty and love.

I've been very much enjoying this essay sequence and have a lot I could say about various parts of it once I finish reading through it entirely, but I wanted to throw in a note now, that a constant conflation between "literally making paperclips" and "alien values we can't understand but see as harmless", smuggles in some needless confusion, because in many cases, these values have a sort of passive background factor of making the world meaningfully more interesting/novel/complicated in ways we might not even be able to fathom before encountering them. Experimental forms of music and art come to mind as clear examples within our own culture. What would Mozart think of Skrillex? Well...he might actually just really like it? Maybe reincarnated-Mozart would write psytrance and techno while being annoyingly pedantic about the use of drum samples. Or maybe he would find it incomprehensible noise, a blight on music. Or maybe, even if he couldn't understand it at all, he could understand its value and recognize a modern musician as a fellow musician (or not, Mozart was supposed to have been a bit of a dick).

But it's that last possibility I want to point towards, which is that in many cases where someone "has different values" than us, we can still appreciate those values in some abstract "complexity is good" sense, "well I wouldn't collect stamps, but the collection as a whole was kind of beautiful", "I don't like death metal, but I can appreciate the artistry and can see why someone would".

It seems distinctly possible to me that an entity with very alien values and preferences to me could still create many things I could appreciate and see beauty in, even if that beauty is tinted by an alienness and a lack of real comprehension of what I'm experiencing. I could even directly benefit from this. Indeed, many of my experiences in the world are like this, I am constantly surrounded by alien minds, who have created things I couldn't create without a new lifetime of learning, that I don't really understand the full functioning or engineering of, and yet nevertheless trust and rely on every day. (do you know in detail how your water, electrical, sewer, highway, transit, elevator, etc, systems work on an engineering level?).

And this is where the paperclip thing really gets kinda annoying, because "paperclips" aren't fun/interesting/novel/etc, they're a sort of anti-art item, like...tyres, or bank statements, or the DMV. A music-maximizer is importantly different than a DMV-maximizer in ways that make the nice-music-maximizer both more tolerable and also more likely to actually exist. (novelty seems rather intrinsic to agency).

The use of paperclips is designed to cast "alien values" in a light where they look valueless or even of negative value, but this seems unlikely to be the case because of the intrinsic link between complexity and novelty and value. An AI that makes something they consider amazing and transcendental and fantastic, I would predict that I would be able to see some of my own values reflected within, even if it was almost entirely incomprehensible to me. Even just saying something like: "each paperclip is unique and represents an aspect of reality, each paperclipper collects paperclips to represent important tokens, moments, ideas, and aspects of their life" suddenly gives the paperclipper an interesting and even spiritual characteristic.

I think this points towards the underlying "niceness towards an alien other" you're gesturing towards in several of these essays. It seems to me like there are some underlying universals which connect these things, the beauty inherent in the mathematics maybe, maybe.

I'm not saying we have a settled ethic here, and still less, that its rational structure is sufficiently natural and privileged that tons of agents will converge on it. Rather, my claim is that we have some ethic here – an ethic that behaves towards "agents with different values" in a manner importantly different from (and "nicer" than) paperclipping, utilitarianism, and a whole class of related forms of consequentialism; and in particular, an ethic that doesn't view the mere presence of (law-abiding, cooperative) people-who-like-paperclips as a major problem.

For a rather more specific (though still in fact Utilitarian) formulation of this ethic, see: A Moral Case for Evolved-Sapience-Chauvinism. Briefly, if your civilization includes some humans-who-like-paperclips (or indeed some allied sapient aliens who we can get along with, who happen to like alien-paperclips), then the appropriate definition of "utility" for that civilization includes putting some value on letting them have the paperclips they want (as long as this isn't massively decreasing anyone else's utility). And indeed for Alicia, Jim, Felipe, Maria, and Jason letting them enjoy their various favorite pastimes (again, as long as these don't heavily intrude on other people living their preferred lives). "Fun" or "utility" is in the eye of the beholder, just like beauty. I actually can't tell you what you find fun: you're the world expert on that, and I defer to your expertise (even if I have some suggestions of things you might want to try).

It's also extremely likely that, not only does everyone's view on how fun paperclips are in their own back yard get summed over or cohered, but also that "letting me do what I darn well want in my own house, as long as I'm not significantly harming anyone else" (a concept often called 'personal freedom') is in fact part of the complex and fragile "human values" that Yudkowsky talks about, and would thus become part of the "Coherent Extrapolated Volition" that Utilitarians such as him would like to see optimized. The definition of "utility" sums over or coheres the opinions of all citizens/moral patients in the society, not just Eliezer, and in your own back yard, your opinion is particularly relevant, since you spend more time there than anyone else. So Utilitarianism includes Liberalism, pro tanto: the opposition of them you came up with in a previous essay was the result of taking the Atheism viewpoint to a ludicrous solipsistic extreme: it's not what most actual Utilitarians are proposing, including Yudkowsky (or me).

I agree that this is the approach to a solution for those who agree with liberalism.

That said, in addition to having been convinced that consequentialist agency and utilitarian morality are wrong, I think I've also become persuaded that liberalism is simply wrong? Which is kind of a radical position that I need to stake out elsewhere, so let me reduce the critique to some more straightforward variants:

"Boundaries" seems to massively suffer from nearest unblocked strategy problems, since it's focused on blocking things.
Liberalism already in some ways struggles due to respecting boundaries too much. E.g. one of the justifications for NIMBYism is that dense housing ends up blocking sunlight. This is basically true (neighboring buildings inevitably have some externalities on each other), but AFAICT still counterproductive.
I think you are underestimating the difficulty in deciding which boundaries to respect. It's wrong for parents to sexually abuse their children, but in terms of boundaries it's hard to distinguish this from many other things that children have to deal with, e.g. being told what to eat, made to go to school, or vaccinated. Today it mainly gets distinguished in terms of harm rather than boundaries, but the way society decides what counts as harm and how to measure it is a giant political garbage fire, and it's not clear how an AI could do better.
In a sense, such an AI would be a "boundary-maximizer", but this incentivizes people to frame whatever they desire in terms of boundary violations in order to get help from the AI, and that doesn't seem like a mentally healthy way to be (like obsessing over every little violation).
The issue of moral patienthood is still huge. Someone could spam a bunch of copies of the minimal entity that gets its boundaries respected, and if the concept of boundaries packs any punch, then this spam will pack a lot of punch too.

Of course liberalism has struggles, the whole point of it is that it's the best currently known way to deal with competing interests and value differences short of war. This invites three possible categories of objection: that there is actually a better way, that there is no better way and liberalism also no longer works, or that wars are actually a desirable method of conflict resolution. From what I can tell, yours seem to fall into the second and/or third category, but I'm interested in whether you have anything in the first one.

When it comes to conflict deescalation specifically (which is needed to avoid war, but doesn't deal with other aspects of value), I guess the better way would be "negotiate some way for the different parties in the conflict to get as much of what they want as possible".

This is somewhat related to preference utilitarianism in that it might involve deference to some higher power that takes the preferences of all the members in the conflict into account, but it avoids population ethics and similar stuff because it just has to deal with the parties in the conflict, not other parties.

E.g. in the case of babyeaters vs humans, you could deescalate by letting humans do their human thing and babyeaters do their babyeating thing. Of course that requires both humans and babyeaters to each individually have non-totalizing preferences (including non-liberal preferences, e.g. humans must not care about others abusing their children), which is contradicted by the story setup.

This doesn't mean that humans have to give up caring about child abuse, it just has to be bounded in some way so as to not step on the babyeaters' domain, e.g. humans could care about abuse of human children but not babyeater children.

Well, so far no such higher power seems forthcoming, and totalizing ideologies grip public imagination as surely as ever, so the need for liberalism-or-something-better is still live, for those not especially into wars.

You could have a liberal society while making the AIs more bounded than full-blown liberalism maximizers. That's probably what I'd go for. (Still trying to decide.)

I don't have anything to add other than that I really appreciate how you've articulated a morass of vague intuitions I've begun to have re: boundaries-oriented ethics, and that I hope you end up writing this up as a full standalone post sometime.

at least absent some clear and not-just-species-ist story about why AIs-with-different-values should be excluded

My strongest reason for this is to preserve moral option value, in other words, preserving our options/resources to eventually do what is right. Imagine that one day we or our descendants build or become superintelligent super-competent philosophers who after exhaustively investigating moral philosophy for millions of years, decide that some moral theory or utility function is definitely right. But too bad, they're then controlling only a small fraction of the universe, the rest having been voluntarily or involuntarily handed off to AIs-with-different-values. What if this scenario turns out to constitute some kind of moral catastrophe? Wouldn't it be best to have preserved optionality instead? We can always hand off power or parts of the universe to AIs-with-different-values later, if/when we, upon full reflection, decide that is actually the right thing to do.

I do think this is an important consideration. But notice that at least absent further differentiating factors, it seems to apply symmetrically to a choice on the part of Yudkowsky's "programmers" to first empower only their own values, rather than to also empower the rest of humanity. That is, the programmers could in principle argue "sure, maybe it will ultimately make sense to empower the rest of humanity, but if that's right, then my CEV will tell me that and I can go do it. But if it's not right, I'll be glad I first just empowered myself and figured out my own CEV, lest I end up giving away too many resources up front."

That is, my point in the post is that absent direct speciesism, the main arguments for the programmers including all of humanity in the CEV "extrapolation base," rather than just doing their own CEV, apply symmetrically to AIs-we're-sharing-the-world-with at the time of the relevant thought-experimental power-allocation. And I think this point applies to "option value" as well.

the main arguments for the programmers including all of [current?] humanity in the CEV "extrapolation base" […] apply symmetrically to AIs-we're-sharing-the-world-with at the time

I think timeless values might possibly help resolve this; if some {AIs that are around at the time} are moral patients, then sure, just like other moral patients around they should get a fair share of the future.

If an AI grabs more resources than is fair, you do the exact same thing as if a human grabs more resources than is fair: satisfy the values of moral patients (including ones who are no longer around) not weighed by how much leverage they currently have over the future, but how much leverage they would have over the future if things had gone more fairly/if abuse/powergrab/etc wasn't the kind of thing that gets you more control of the future.

"Sorry clippy, we do want you to get some paperclips, we just don't want you to get as many paperclips as you could if you could murder/brainhack/etc all humans, because that doesn't seem to be a very fair way to allocate the future." — and in the same breath, "Sorry Putin, we do want you to get some of whatever-intrinsic-values-you're-trying-to-satisfy, we just don't want you to get as much as ruthlessly ruling Russia can get you, because that doesn't seem to be a very fair way to allocate the future."

And this can apply regardless of how much of clippy already exists by the time you're doing CEV.

The main asymmetries I see are:

Other people not trusting the group to not be corrupted by power and to reflect correctly on their values, or not trusting that they'll decide to share power even after reflecting correctly. Thus "programmers" who decide to not share power from the start invite a lot of conflict. (In other words, CEV is partly just trying to not take power away from people, whereas I think you've been talking about giving AIs more power than they already have. "the sort of influence we imagine intentionally giving to AIs-with-different-values that we end up sharing the world with")
The "programmers" not trusting themselves. I note that individuals or small groups trying to solve morality by themselves don't have very good track records. They seem to too easily become wildly overconfident and/or get stuck in intellectual dead-ends. Arguably the only group that we have evidence for being able to make sustained philosophical progress is humanity as a whole.

To the extent that these considerations don't justify giving every human equal power/weight in CEV, I may just disagree with Eliezer about that. (See also Hacking the CEV for Fun and Profit.)

trying to solve morality by themselves

It doesn't have to be by themselves; they can defer to others inside CEV, or come up with better schemes than their initial CEV inside CEV and then defer to that. Whatever other solutions than "solve everything on your own inside CEV" might exist, they can figure those out and defer to them from inside CEV. At least that's the case in my own attempts at implementing CEV in math (eg QACI).

Once they get into CEV, they may not want to defer to others anymore, or may set things up with a large power/status imbalance between themselves and everyone else which may be detrimental to moral/philosophical progress. There are plenty of seemingly idealistic people in history refusing to give up or share power once they got power. The prudent thing to do seems to never get that much power in the first place, or to share it as soon as possible.
If you're pretty sure you will defer to others once inside CEV, then you might as well do it outside CEV due to #1 in my grandparent comment.

I wonder how many of those seemingly idealistic people retained power when it was available because they were indeed only pretending to be idealistic. Assuming one is actually initially idealistic but then gets corrupted by having power in some way, one thing someone can do in CEV that you can't do in real life is reuse the CEV process to come up with even better CEV processes which will be even more likely to retain/recover their just-before-launching-CEV values. Yes, many people would mess this up or fail in some other way in CEV; but we only need one person or group who we'd be somewhat confident would do alright in CEV. Plausibly there are at least a few eg MIRIers who would satisfy this. Importantly, to me, this reduces outer alignment to "find someone smart and reasonable and likely to have good goal-content integrity", which is a matter of social & psychology that seems to be much smaller than the initial full problem of formal outer alignment / alignment target design.
One of the main reasons to do CEV is because we're gonna die of AI soon, and CEV is a way to have infinite time to solve the necessary problems. Another is that even if we don't die of AI, we get eaten by various moloch instead of being able to safely solve the necessary problems at whatever pace is necessary.

but we only need one person or group who we’d be somewhat confident would do alright in CEV. Plausibly there are at least a few eg MIRIers who would satisfy this.

Why do you think this, and how would you convince skeptics? And there are two separate issues here. One is how to know their CEV won't be corrupted relative to what their values really are or should be, and the other is how to know that their real/normative values are actually highly altruistic. It seems hard to know both of these, and perhaps even harder to persuade others who may be very distrustful of such person/group from the start.

Another is that even if we don’t die of AI, we get eaten by various moloch instead of being able to safely solve the necessary problems at whatever pace is necessary.

Would be interested in understanding your perspective on this better. I feel like aside from AI, our world is not being eaten by molochs very quickly, and I prefer something like stopping AI development and doing (voluntary and subsidized) embryo selection to increase human intelligence for a few generations, then letting the smarter humans decide what to do next. (Please contact me via PM if you want to have a chat about this.)

some fragments:

What hunches do you currently have surrounding orthogonality, its truth or not, or things near it?

re: hard to know - it seems to me that we can't get a certifiably-going-to-be-good result from a CEV based ai solution unless we can make it certifiable that altruism is present. I think figuring out how to write down some form of what altruism is, especially altruism in contrast to being-a-pushover, is necessary to avoid issues - because even if any person considers themselves for CEV, how would they know they can trust their own behavior?

as far as I can tell humans should by default see themselves as having the same kind of alignment problem as AIs do, where amplification can potentially change what's happening in a way that corrupts thoughts which previously implemented values. can we find a CEV-grade alignment solution that solves the self-and-other alignment problems in humans as well, such that this CEV can be run on any arbitrary chunk of matter and discover its "true wants, needs, and hopes for the future"?

What hunches do you currently have surrounding orthogonality, its truth or not, or things near it?

I'm very uncertain about it. Have you read Six Plausible Meta-Ethical Alternatives?

as far as I can tell humans should by default see themselves as having the same kind of alignment problem as AIs do, where amplification can potentially change what's happening in a way that corrupts thoughts which previously implemented values.

Yeah, agreed that how to safely amplify oneself and reflect for long periods of time may be hard problems that should be solved (or extensively researched/debated if we can't definitely solve them) before starting something like CEV. This might involve creating the right virtual environment, social rules, epistemic norms, group composition, etc. A few things that seem easy to miss or get wrong:

Is it better to have no competition or some competition, and what kind? (Past "moral/philosophical progress" might have been caused or spread by competitive dynamics.)
How should social status work in CEV? (Past "progress" might have been driven by people motivated by certain kinds of status.)
No danger or some danger? (Could a completely safe environment / no time pressure cause people to lose motivation or some other kind of value drift? Related: What determines the balance between intelligence signaling and virtue signaling?)

can we find a CEV-grade alignment solution that solves the self-and-other alignment problems in humans as well, such that this CEV can be run on any arbitrary chunk of matter and discover its "true wants, needs, and hopes for the future"?

I think this is worth thinking about as well, as a parallel approach from the above. It seems related to metaphilosophy in that if we can discover what "correct philosophical reasoning" is, we can solve this problem by asking "What would this chunk of matter conclude if it were to follow correct philosophical reasoning?"

Imagine that one day we or our descendants build or become superintelligent super-competent philosophers who after exhaustively investigating moral philosophy for millions of years, decide that some moral theory or utility function is definitely right.

But what is the reason to think that we or our descendants would have a better chance of finding this kind of "definitely right" moral theory or utility function than other AIs or their descendants?

In some sense, the point of OP is that the difference between "us" and "not-us" here might be more nebulous than we usually believe, and that a more equal treatment is called for.

Otherwise, one might also argue (in a symmetric fashion) that we would destroy moral option value by preventing other entities who might have a better chance of building or becoming "superintelligent super-competent philosophers" from having a shot at that...

But what is the reason to think that we or our descendants would have a better chance of finding this kind of “definitely right” moral theory or utility function than other AIs or their descendants?

Humans have a history of making philosophical progress. We lack similar empirical evidence for AIs. I'll reevaluate my position if that changes, with the caveat that I want some reassurance that the AI is doing correct philosophy, not just optimizing to persuade me or humans in general (which I'm afraid will be the default).
So far AI capabilities seem more tilted towards technological progress than philosophical progress (compared to humans). See also AI doing philosophy = AI generating hands? for more reasons to worry about this. Under these circumstances it seems very easy to permanently mess up the trajectory of philosophical progress, for example by locking in one's current conception of what's right, or inventing new technology capable of corrupting everyone's values without knowing how to defend against that.
What is the right morality may be partly or wholly subjective (I'm not sure), in which case AIs will end up converging to different moral conclusions from us, independently of philosophical competence, and from our perspective, the right thing to do would be to follow our own conclusions.

But I don't know to what extent productive studies in philosophy at the top level of competence in philosophy are at all compatible with safety concerns. It's not an accident that people using base models show nice progress in joint human-AI philosophical brainstorms, whereas people using tamed models seem to be saying that those models are not creative enough, and that those models don't think in sufficiently non-standard ways.

It might be a fundamental problem which might not have anything to do with human-AI differences. For example, Nietzsche is an important radical philosopher, and we need biological or artificial philosophers performing not just on that level, but on a higher level than that, if we want them to properly address fundamental problems, but Nietzsche is not "safe" in any way, shape, or form.

Thanks, that's very informative.

Humans have a history of making philosophical progress. We lack similar empirical evidence for AIs.

Hybrid philosophical discourse done by human-AI collaborations can be very good. For example, I feel that Janus has been doing very strong work in this sense with base models (so, not with RLHF'd, Constitutional, or otherwise "lesioned" and "mode-collapsed" models we tend to mostly use these days).

But, indeed, this does not tell us much about what AIs would do on their own.

When I question my intuitions about paperclip-loving humans, one thing that makes them less threatening is 1) an intuition that they're implementing - whether by mere hobbyistic delight or ideological fanaticism or both - a variation of the plasticity of human values 2) a bearish take on their ability to negate that plasticity and ensure that all anyone cares about is paperclippers forever.

Re: 1 when I imagine the paperclip enthusiasts, I imagine social media posts talking about how particular brands or styles of paperclips appeal to them, philosophical justifications for why paperclips should be maximized, different sects of paperclip maximizers who scorn each other as not the real thing, simple appreciation of paperclips and complex feelings associated with it, heroes who are admired for their contributions to the paperclipping project, still caring somewhat about friends and sex and physical comfort and so on. These and similar features seem pretty universal to human aesthetic, political, and religious movements, and they bake in elements of humanity that I care about and would prefer to keep existing. Presumably classical Clippy doesn't care about any of these things except perhaps instrumentally and is just implementing a sole "maximize paperclips" function. Evolved aliens probably care about at least a few of them, or things that are analogous enough to feel "intrinsically valuable" to me, even if they also really really care about paperclips.

If Nazis took over the world and implemented their preferred policies and raised everyone who was allowed to survive with Nazi values, that would be very bad (duh.) But if we're restricting ourselves to 20th century technology in this example, I'm not worried that their vision of the future would last forever, or even the advertised thousand years; my guess is that the great^n-grandchildren (possibly with a very low n) of the Nazi victors would look back and say "yeah, that was really bad" and that future Nazi-descended civilizations would keep varying around the human baseline: most less nice than they could be but nicer than Nazis. Collecting paperclips is way less bad than the Holocaust (duh), but implemented on human hardware I wouldn't expect it to last forever either.

I'm noticing two things:

It's suspicious to me that values of humans-who-like-paperclips are inherently tied to acquiring an unlimited amount of resources (no matter in which way). Maybe I don't treat such values as 100% innocent, so I'm OK keeping them in check. Though we can come up with thought experiments where the urge to get more resources is justified by something. Like, maybe instead of producing paperclips those people want to calculate Busy Beaver numbers, so they want more and more computronium for that.
How consensual were the trades if their outcome is predictable and other groups of people don't agree with the outcome? Looks like coercion.

I think this is a fascinating question, which is irrelevant to technical alignment and our near-term survival. I think not only that Corrigibility or DWIM is an attractive primary goal for AGI, but that it's so much easier as to almost certainly be what people actually try for their first AGI alignment attempts. Understanding what you mean by what you say and checking when they're not sure is much simpler than understanding and implementing an ideal ethics. Alignment is hard enough without solving ethics, so we'll put that part off because we can.

I think you're hitting an important tension here: being nice and liberal seems like the value we'd endorse. The big problem is: if you're tolerant of those with other values, will they be tolerant of you? How would you know if they'll lie until they have power to do what they want with your backyard (deceptive alignment), or genuinely change their minds once they have that power?

The overall conclusion is that, while we'd like to be liberal and nice and tolerant, it will get us killed in a lot of situations where others aren't tolerant in return. Which ones could use some more careful analysis.

This logic is laid out in detail by Yudkowsky across many posts. I think he's considered the pull toward tolerance and niceness in detail. Steve Byrnes' comment here hits some high points. It's a topic worth more consideration.


Rather, they might be mere empty machines. Should you still tolerate/respect/etc them, then?"
My sense is that I'm unusually open to "yes," here.

I think the discussion following from here is a little ambiguous (perhaps purposefully so?). In particular, it is unclear which of the following points are being made:

1: Sufficient uncertainty with respect to the sentience (I'm taking this as synonymous with phenomenal consciousness) of future AIs should dictate that we show them tolerance/respect etc...
2: We should not be confident that sentience is a good criterion for moral patienthood (i.e., being shown tolerance/respect etc...), even though sentience is a genuine thing.
3: We should worry that sentience isn't a genuine thing at all (i.e., illusionism / as-yet-undescribed re-factorings of what we currently call sentience).

When you wrote that you are unusually open to "yes" in the quoted sentence, I took the qualifier "unusual" to indicate that you were making point 2, since I do not consider point 1 to be particularly unusual (Schwitzgebel has pushed for this view, for example). However, your discussion then mostly seemed to be making the case for point 1 (i.e., we could impose a criterion for moral worth that is intended to demarcate non-sentient and sentient entities but that fails). For what it's worth, I would be very interested to hear arguments for point 2 which do not collapse into point 1 (or, alternatively, some reason why I am mistaken for considering them distinct points). From my perspective, it is hard to understand how something which really lacks what I mean by phenomenal consciousness could possibly be a moral patient. Perhaps it is related to the fact that I have, despite significant effort, utterly failed to grok illusionism.

An aside on AI sentience

Do you in fact mean 'sapience' (i.e. being, like Homo sapiens, capable enough to use complex language, pass on cultural knowledge to your children, have a high tech society, target nuclear weapons, etc. etc.)? Or do you really mean 'sentience' as in, being able to sense things (and perhaps also feel pain), i.e. something that (at a minimum) basically every multicellular organism on Earth more complex than a sponge can do? Or do you mean something else, like 'agentic', or perhaps something more philosophical, like 'being a moral patient', or something that people have and philosophical zombies somehow don't? I think you might want to use a more specific or clearly-defined word, or, if you're intentionally being vague, at least mention that you're not using 'sentient' according to its standard dictionary definition of 'able to sense things'.

I read it as meaning sapience in the sense you describe, although in a somewhat deliberately vague way.

I did a little dive into the meaning of "sapience" and "sentience" recently. Both are used in a variety of ways; their etymological roots are vague, so they don't quite have a proper meaning.

It seems more common to use:

Sentience for feeling or having phenomenal consciousness. It is usually identical to being a moral patient of some sort.

Sapience for thinking, usually intellectual self-awareness and reflective cognition. I wrote a piece on this sense and why it's an important threshold for AGI cognition: Sapience, understanding, and "AGI"

I think these are two valuable terms to have so I endorse using them this way.

Great post! I’m getting a lot out of this series.

Belief that “pure-consequentialist AI” is kinda the only way to build very powerful (tech-inventing, self-improving, reflectively-stable) AI, so we should expect that to happen sooner or later.

(By “pure-consequentialist AI”, I mean AI that has preferences about the state of the universe in the distant future, and those preferences inform its actions.)

My impression is that Eliezer & Nate believe something like this. See for example Nate’s post Deep Deceptiveness (see also my response comment).

…But there's a softer version of that:

Belief that “pure-consequentialist AI” will outcompete the “non-pure-consequentialist AI”, even if (per above) the latter is real and powerful.

Anyway, I see this as an important dynamic to keep in mind, but I'm not sure how decisive it will be.

Belief that good enforceable boundaries are a temporary luxury of our technological immaturity, i.e. offense-defense balance will change in the future

Belief that the power of cooperation (Elua) is a temporary feature of our technological immaturity

(My perennial uncertainty is: AI 1 can straightforwardly send source code / model weights / whatever to AI 2, but how can AI 1 prove to AI 2 that this file is actually its real source code / model weights / whatever? There might be a good answer, I dunno.)

Well, even if it did: land use is actually a very big deal.^[16] And to be clear: I don't like paperclips any more than you do. I much prefer stuff like joy and understanding and beauty and love.

I'm not saying we have a settled ethic here, and still less, that its rational structure is sufficiently natural and privileged that tons of agents will converge on it. Rather, my claim is that we have some ethic here – an ethic that behaves towards "agents with different values" in a manner importantly different from (and "nicer" than) paperclipping, utilitarianism, and a whole class of related forms of consequentialism; and in particular, an ethic that doesn't view the mere presence of (law-abiding, cooperative) people-who-like-paperclips as a major problem.

I agree that this is the approach to a solution for those who agree with liberalism.

"Boundaries" seems to massively suffer from nearest unblocked strategy problems, since it's focused on blocking things.
Liberalism already in some ways struggles due to respecting boundaries too much. E.g. one of the justifications for NIMBYism is that dense housing ends up blocking sunlight. This is basically true (neighboring buildings inevitably have some externalities on each other), but AFAICT still counterproductive.
I think you are underestimating the difficulty in deciding which boundaries to respect. It's wrong for parents to sexually abuse their children, but in terms of boundaries it's hard to distinguish this from many other things that children have to deal with, e.g. being told what to eat, made to go to school, or vaccinated. Today it mainly gets distinguished in terms of harm rather than boundaries, but the way society decides what counts as harm and how to measure it is a giant political garbage fire, and it's not clear how an AI could do better.
In a sense, such an AI would be a "boundary-maximizer", but this incentivizes people to frame whatever they desire in terms of boundary violations in order to get help from the AI, and that doesn't seem like a mentally healthy way to be (like obsessing over every little violation).
The issue of moral patienthood is still huge. Someone could spam a bunch of copies of the minimal entity that gets its boundaries respected, and if the concept of boundaries packs any punch, then this spam will pack a lot of punch too.

You could have a liberal society while making the AIs more bounded than full-blown liberalism maximizers. That's probably what I'd go for. (Still trying to decide.)

at least absent some clear and not-just-species-ist story about why AIs-with-different-values should be excluded

the main arguments for the programmers including all of [current?] humanity in the CEV "extrapolation base" […] apply symmetrically to AIs-we're-sharing-the-world-with at the time

And this can apply regardless of how much of clippy already exists by the time you're doing CEV.

The main asymmetries I see are:

Other people not trusting the group to not be corrupted by power and to reflect correctly on their values, or not trusting that they'll decide to share power even after reflecting correctly. Thus "programmers" who decide to not share power from the start invite a lot of conflict. (In other words, CEV is partly just trying to not take power away from people, whereas I think you've been talking about giving AIs more power than they already have. "the sort of influence we imagine intentionally giving to AIs-with-different-values that we end up sharing the world with")
The "programmers" not trusting themselves. I note that individuals or small groups trying to solve morality by themselves don't have very good track records. They seem to too easily become wildly overconfident and/or get stuck in intellectual dead-ends. Arguably the only group that we have evidence for being able to make sustained philosophical progress is humanity as a whole.

To the extent that these considerations don't justify giving every human equal power/weight in CEV, I may just disagree with Eliezer about that. (See also Hacking the CEV for Fun and Profit.)

trying to solve morality by themselves

Once they get into CEV, they may not want to defer to others anymore, or may set things up with a large power/status imbalance between themselves and everyone else which may be detrimental to moral/philosophical progress. There are plenty of seemingly idealistic people in history refusing to give up or share power once they got power. The prudent thing to do seems to never get that much power in the first place, or to share it as soon as possible.
If you're pretty sure you will defer to others once inside CEV, then you might as well do it outside CEV due to #1 in my grandparent comment.

I wonder how much of those seemingly idealistic people retained power when it was available because they were indeed only pretending to be idealistic. Assuming one is actually initially idealistic but then gets corrupted by having power in some way, one thing someone can do in CEV that you can't do in real life is reuse the CEV process to come up with even better CEV processes which will be even more likely to retain/recover their just-before-launching-CEV values. Yes, many people would mess this up or fail in some other way in CEV; but we only need one person or group who we'd be somewhat confident would do alright in CEV. Plausibly there are at least a few eg MIRIers who would satisfy this. Importantly, to me, this reduces outer alignment to "find someone smart and reasonable and likely to have good goal-content integrity", which is a matter of social & psychology that seems to be much smaller than the initial full problem of formal outer alignment / alignment target design.
One of the main reasons to do CEV is because we're gonna die of AI soon, and CEV is a way to have infinite time to solve the necessary problems. Another is that even if we don't die of AI, we get eaten by various moloch instead of being able to safely solve the necessary problems at whatever pace is necessary.

but we only need one person or group who we’d be somewhat confident would do alright in CEV. Plausibly there are at least a few eg MIRIers who would satisfy this.

Another is that even if we don’t die of AI, we get eaten by various moloch instead of being able to safely solve the necessary problems at whatever pace is necessary.

some fragments:

What hunches do you currently have surrounding orthogonality, its truth or not, or things near it?

I'm very uncertain about it. Have you read Six Plausible Meta-Ethical Alternatives?

as far as I can tell humans should by default see themselves as having the same kind of alignment problem as AIs do, where amplification can potentially change what's happening in a way that corrupts thoughts which previously implemented values.

Is it better to have no competition or some competition, and what kind? (Past "moral/philosophical progress" might have been caused or spread by competitive dynamics.)
How should social status work in CEV? (Past "progress" might have been driven by people motivated by certain kinds of status.)
No danger or some danger? (Could a completely safe environment / no time pressure cause people to lose motivation or some other kind of value drift? Related: What determines the balance between intelligence signaling and virtue signaling?)

can we find a CEV-grade alignment solution that solves the self-and-other alignment problems in humans as well, such that this CEV can be run on any arbitrary chunk of matter and discover its "true wants, needs, and hopes for the future"?

Imagine that one day we or our descendants build or become superintelligent super-competent philosophers who after exhaustively investigating moral philosophy for millions of years, decide that some moral theory or utility function is definitely right.

But what is the reason to think that we or our descendants would have a better chance of finding this kind of "definitely right" moral theory or utility function than other AIs or their descendants?

In some sense, the point of OP is that the difference between "us" and "not-us" here might be more nebulous than we usually believe, and that a more equal treatment is called for.

But what is the reason to think that we or our descendants would have a better chance of finding this kind of “definitely right” moral theory or utility function than other AIs or their descendants?

Humans have a history of making philosophical progress. We lack similar empirical evidence for AIs. I'll reevaluate my position if that changes, with the caveat that I want some reassurance that the AI is doing correct philosophy, not just optimizing to persuade me or humans in general (which I'm afraid will be the default).
So far AI capabilities seem more tilted towards technological progress than philosophical progress (compared to humans). See also AI doing philosophy = AI generating hands? for more reasons to worry about this. Under these circumstances it seems very easy to permanently mess up the trajectory of philosophical progress, for example by locking in one's current conception of what's right, or inventing new technology capable of corrupting everyone's values without knowing how to defend against that.
What is the right morality may be partly or wholly subjective (I'm not sure), in which case AIs will end up converging to different moral conclusions from us, independently of philosophical competence, and from our perspective, the right thing to do would be to follow our own conclusions.

Thanks, that's very informative.

Humans have a history of making philosophical progress. We lack similar empirical evidence for AIs.

But, indeed, this does not tell us much about what AIs would do on their own.
