Suppose that AI systems built by humans spread throughout the universe and achieve their goals. I see two quite different reasons this outcome could be good:

  1. Those AI systems are aligned with humans; their preferences are our preferences.
  2. Those AI systems flourish on their own terms, and we are happy for them even though they have different preferences.

I spend most of my time thinking about option #1. But I think option #2 is a plausible plan B.

Understanding how happy we should be with an unaligned AI flourishing on its own terms, and especially which unaligned AIs we should be happy about, seems like a very important moral question.

I currently feel very uncertain about this question; if you forced me to guess, I’d estimate that option #2 allows us to recover 25% of the expected value that we lose by building unaligned AI. But after more thinking, that number could go down to 0% or up to >90%.

Definition

In this post I’ll say that an AI is a good successor if I believe that building such an AI and “handing it the keys” is a reasonable thing to do with the universe. Concretely, I’ll say an AI is a good successor if I’d prefer to give it control of the world rather than accept a gamble where we have a 10% chance of extinction and a 90% chance of building an aligned AI.

In this post I’ll think mostly about what happens with the rest of the universe, rather than what happens to us here on Earth. I’m wondering whether we would appreciate what our successors do with all of the other stars and galaxies — will we be happy with how they use the universe’s resources?

Note that a competent aligned AI is a good successor, because “handing it the keys” doesn’t actually amount to giving up any control over the universe. In this post I’m wondering which unaligned AIs are good successors.
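One way to make the threshold in this definition concrete is a risk-neutral expected-value reading. This is only a sketch: the normalization (extinction worth 0, an aligned AI worth 1), the assumption of risk neutrality, and the function name are illustrative additions, not part of the definition itself.

```python
def is_good_successor(successor_value, p_extinction=0.10,
                      extinction_value=0.0, aligned_value=1.0):
    """Compare handing over control with the 10% / 90% gamble.

    Assumption (not from the post): outcomes are scored on a single
    normalized scale and we simply compare expected values.
    """
    gamble_value = (p_extinction * extinction_value
                    + (1 - p_extinction) * aligned_value)
    return successor_value > gamble_value


# Hypothetical scores, chosen only for illustration.
print(is_good_successor(0.95))  # a successor we value almost as much as aligned AI
print(is_good_successor(0.50))  # a successor we value half as much
```

Under these assumptions, an unaligned AI counts as a good successor only if we value what it does with the universe at more than 0.9 of what an aligned AI would achieve. A risk-averse or non-linear valuation would move that bar.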

Preface: in favor of alignment

I believe that building an aligned AI is by far the most likely way to achieve a good outcome. An aligned AI allows us to continue refining our own views about what kind of life we want to exist and what kind of world we want to create — there is no indication that we are going to have satisfactory answers to these questions prior to the time when we build AI.

I don’t think this is parochial. Once we understand what makes life worth living, we can fill the universe with an astronomical diversity of awesome experiences. To the extent that’s the right answer, it’s something I expect us to embrace much more as we become wiser.

And I think that further reflection is a really good idea. There is no law that the universe tends towards universal love and goodness, that greater intelligence implies greater moral value. Goodness is something we have to work for. It might be that the AI we would have built anyway will be good, or it might not be, and it’s our responsibility to figure it out.

I am a bit scared of this topic because it seems to give people a license to hope for the best without any real justification. Because we only get to build AI once, reality isn’t going to have an opportunity to intervene on people’s happy hopes.

Clarification: Being good vs. wanting good

We should distinguish two properties an AI might have:

  • Having preferences whose satisfaction we regard as morally desirable.
  • Being a moral patient, e.g. being able to suffer in a morally relevant way.

These are not the same. They may be related, but they are related in an extremely complex and subtle way. From the perspective of the long-run future, we mostly care about the first property.

As compassionate people, we don’t want to mistreat a conscious AI. I’m worried that compassionate people will confuse the two issues — in arguing enthusiastically for the claim “we should care about the welfare of AI” they will also implicitly argue for the claim “we should be happy with whatever the AI chooses to do.” Those aren’t the same.

It’s also worth clarifying that both sides of this discussion can want the universe to be filled with morally valuable AI eventually; this isn’t a matter of carbon chauvinists vs. AI sympathizers. The question is just about how we choose what kind of AI we build: do we hand things off to whatever kind of AI we can build today, or do we retain the option to reflect?

Do all AIs deserve our sympathy?

Intuitions and an analogy

Many people have a strong intuition that we should be happy for our AI descendants, whatever they choose to do. They grant the possibility of pathological preferences like paperclip-maximization, and agree that turning over the universe to a paperclip-maximizer would be a problem, but don’t believe it’s realistic for an AI to have such uninteresting preferences.

I disagree. I think this intuition comes from analogizing AI to the children we raise, but that it would be just as accurate to compare AI to the corporations we create. Optimists imagine our automated children spreading throughout the universe and doing their weird-AI-analog of art; but it’s just as realistic to imagine automated PepsiCo spreading throughout the universe and doing its weird-AI-analog of maximizing profit.

It might be the case that PepsiCo maximizing profit (or some inscrutable lost-purpose analog of profit) is intrinsically morally valuable. But it’s certainly not obvious.

Or it might be the case that we would never produce an AI like a corporation in order to do useful work. But looking at the world around us today that’s certainly not obvious.

Neither of those analogies is remotely accurate. Whether we should be happy about AI “flourishing” is a really complicated question about AI and about morality, and we can’t resolve it with a one-line political slogan or crude analogy.

On risks of sympathy

I think that too much sympathy for AI is a real risk. This problem is going to be made particularly serious because we will (soon?) be able to make AI systems which are optimized to be sympathetic. If we are indiscriminately sympathetic towards whatever kind of AI is able to look sympathetic, then we can’t steer towards the kind of AI that actually deserves our sympathy. It’s very easy to imagine a world where we’ve built a PepsiCo-like AI, but one which is much better than humans at seeming human, and where people who suggest otherwise look like moral monsters.

I acknowledge that the reverse is also a risk: humans are entirely able to be terrible to creatures that do deserve our sympathy. I believe the solution to that problem is to actually think about the nature of the AI we build, and especially to behave compassionately in light of uncertainty about the suffering we might cause and whether or not it is morally relevant. The solution is not to take an indiscriminate pro-AI stand that hands the universe over to the automated PepsiCo.

Do any AIs deserve our sympathy?

(Warning: lots of weird stuff.)

In the AI alignment community, I often encounter the reverse view: that no unaligned AI is a good successor.

In this section I’ll argue that there are at least some unaligned AIs that would be good successors. If we accept that there are any good successors, I think that there are probably lots of good successors, and figuring out the boundary is an important problem.

(To repeat: I think we should try to avoid handing off the universe to any unaligned AI, even if we think it is probably good, because we’d prefer to retain the ability to think more about the decision and figure out what we really want. See the conclusion.)

Commonsense morality and the golden rule

I find the golden rule very compelling. This isn’t just because of repeated interaction and game theory: I’m strongly inclined to alleviate suffering even if the beneficiaries live in abject poverty (or factory farms) and have little to offer me in return. I’m motivated to help largely because that’s what I would have wanted them to do if our situations were reversed.

Personally, I have similar intuitions about aliens (though I rarely have the opportunity to help aliens). I’d be hesitant about the people of Earth screwing over the people of Alpha Centauri for many of the same reasons I’d be uncomfortable with the people of one country screwing over the people of another. While the situation is quite confusing I feel like compassion for aliens is a plausible “commonsense” position.

If it is difficult to align AI, then our relationship with an unaligned AI may be similar to our relationship with aliens. In some sense we have all of the power, because we got here first. But if we try to leverage that power, by not building any unaligned AI, then we might run a significant risk of extinction or of building an AI that no one would be happy with. A “good cosmic citizen” might prefer to hand off control to an unaligned and utterly alien AI rather than gamble on the alternative.

If the situation were totally symmetrical — if we believed the AI was from exactly the same distribution over possible civilizations that we are from — then I would find this intuitive argument extremely compelling.

In reality, there are almost certainly differences, so the situation is very confusing.

A weirder argument with simulations

The last argument gave a kind of common-sense argument for being nice to some aliens. The rest of this post is going to be pretty crazy.

Let’s consider a particular (implausible) strategy for building an AI:

  • Start with a simulation of Earth.
  • Keep waiting/restarting until evolution produces human-level intelligence, civilization, etc.
  • Once the civilization is slightly below our stage of maturity, show them the real world and hand them the keys.
  • (This only makes sense if the simulated civilization is much more powerful than us, and faces lower existential risk. That seems likely to me. For example, the resulting AIs would likely think much faster than us, and have a much larger effective population; they would be very robust to ecological disaster, and would face a qualitatively easier version of the AI alignment problem.)

Suppose that every civilization followed this strategy. Then we’d simply be doing a kind of interstellar shuffle, where each civilization abandons their home and gets a new one inside of some alien simulation. It seems much better for everyone to shuffle than to accept a 10% chance of extinction.

Incentivizing cooperation

The obvious problem with this plan is that not everyone will follow it. So it’s not really a shuffle: nice civilizations give up their planet, while mean civilizations keep their original planet and get a new one. So this strategy involves a net transfer of resources from nice people to mean people: some moral perspectives would be OK with that, but many would not.

This obvious problem has an obvious solution: since you are simulating the target civilization, you can run extensive tests to see if they seem nice — i.e. if they are the kind of civilization that is willing to give an alien simulation control rather than risk extinction — and only let them take over if they are.

This guarantees that the nice civilizations shuffle around between worlds, while the mean civilizations take their chances on their own, which seems great.

More caveats and details

This procedure might look really expensive: you need to simulate a whole civilization, nearly as large as your own civilization, with computers nearly as large as your computers. But in fact it doesn’t require literally simulating the civilization up until the moment when they are building AI. You could use cheaper mechanisms to try to guess whether they were going to be nice a little bit in advance, e.g. by simulating large numbers of individuals or groups making particularly relevant decisions. If you were simulating humans, you could imagine predicting what the modern world would do without ever actually running a population of >100,000.

If only 10% of intelligent civilizations decide to accept this trade, then running the simulation is 10x as expensive (since you need to try 10 times). Other than that, I think that the calculation doesn’t actually depend very much on what fraction of civilizations take this kind of deal.
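The 10x figure is the mean of a geometric distribution: if each run independently yields an accepting civilization with probability p, the expected number of runs until the first success is 1/p. A minimal sketch, assuming independent runs with a fixed acceptance rate (the function name is hypothetical):

```python
def expected_attempts(acceptance_rate):
    """Mean number of independent simulation runs until one civilization accepts.

    Assumption: every run accepts with the same probability, independently
    of the others, so the number of runs is geometrically distributed.
    """
    if not 0 < acceptance_rate <= 1:
        raise ValueError("acceptance_rate must be in (0, 1]")
    return 1 / acceptance_rate


# Cost multiplier relative to a world where every civilization accepts.
for rate in (1.0, 0.5, 0.1):
    print(rate, expected_attempts(rate))
```

The multiplier only scales the cost of finding a willing civilization; it does not change what is handed over once one is found.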

Another problem is that people may prefer to continue existing in their own universe rather than in some weird alien simulation, so the “shuffle” may itself be a moral catastrophe that we should try to avoid. I’m pretty skeptical of this:

  • You could always later perform an acausal trade to “go home,” i.e. to swap back with the aliens who took over your civilization (by simulating each other and passing control back to the original civilization if their simulated copy does likewise).
  • In practice the universe is very big, and the part of our preferences that cares about “home” seems easily satiable. There is no real need for the new residents of our world to kill us, and I think that we’d be perfectly happy to get just one galaxy while the new residents get everything else. (Given that we are getting a whole universe worth of resources somewhere else.)

Another problem is that this is a hideously intractable way to make an AI. More on that two sections from now.

Another problem is that this is completely insane. I don’t really have any defense; if you aren’t tolerant of insanity you should probably just turn back now.

Decision theory

The above argument about trade / swapping places makes sense from a UDT perspective. But I think a similar argument should be persuasive even to a causal decision theorist.

Roughly speaking, you don’t have much reason to think that you are on the outside, considering whether to instantiate some aliens, rather than on the inside, being evaluated for kindness. If you are on the outside, instantiating aliens may be expensive. But if you are on the inside, trying to instantiate aliens lets you escape the simulation.

So the cost-benefit analysis for being nice is actually pretty attractive, and is likely to be a better deal than a 10% risk of extinction.

(Though this argument depends on how accurately the simulators are able to gauge our intentions, and whether it is possible to look nice but ultimately defect.)

How sensitive is moral value to the details of the aliens?

If an AI is from exactly the same distribution that we are, I think it’s particularly likely that they are a good successor.

Intuitively, I feel like goodness probably doesn’t depend on incredibly detailed facts about our civilization. For example, suppose that the planets in a simulation are 10% smaller, on average, than the planets in the real world. Does that decrease the moral value of life from that simulation? What if they are 10% larger?

What if we can’t afford to wait until evolution produces intelligence by chance, so we choose some of the “randomness” to be particularly conducive to life? Does that make all the difference? What if we simulate a smaller population than evolution over a larger number of generations?

Overall I don’t have very strong intuitions about these questions and the domain is confusing. But my weak intuition is that none of these things should make a big moral difference.

One caveat is that in order to assess whether a civilization is “nice,” you need to see what they would do under realistic conditions, i.e. conditions from the same distribution that the “basement” civilizations are operating under. This doesn’t necessarily mean that they need to evolve in a physically plausible way though, just that they think they evolved naturally. To test niceness we could evolve life, then put it down in a world like ours (with a plausible-looking evolutionary record, a plausible sky, etc.)

The decision-theoretic / simulation argument seems more sensitive to details than the commonsense morality argument. But even for the decision-theoretic argument, as long as we create a historical record convincing enough to fool the simulated people, the same basic analysis seems to apply. After all, how do we know that our history and sky aren’t fake? Overall the decision-theoretic analysis gets really weird and complicated and I’m very unsure what the right answer is.

(Note that this argument is very fundamentally different from using decision theory to constrain the behavior of an AI — this is using decision theory to guide our own behavior.)

Conclusion

Even if we knew how to build an unaligned AI that is probably a good successor, I still think we should strongly prefer to build aligned AGI. The basic reason is option value: if we build an aligned AGI, we keep all of our options open, and can spend more time thinking before making any irreversible decision.

So why even think about this stuff?

If building aligned AI turns out to be difficult, I think that building an unaligned good successor is a plausible Plan B. The total amount of effort that has been invested in understanding which AIs make good successors is very small, even relative to the amount of effort that has been invested in understanding alignment. Moreover, it’s a separate problem that may independently turn out to be much easier or harder.

I currently believe:

  • There are definitely some AIs that aren’t good successors. It’s probably the case that many AIs aren’t good successors (but are instead like PepsiCo).
  • There are very likely to be some AIs that are good successors but are very hard to build (like the detailed simulation of a world-just-like-Earth).
  • It’s plausible that there are good successors that are easy to build.
  • We’d likely have a much better understanding of this issue if we put some quality time into thinking about it. Such understanding has a really high expected value.

Overall, I think the question “which AIs are good successors?” is both neglected and time-sensitive, and is my best guess for the highest impact question in moral philosophy right now.

Comments
This obvious problem has an obvious solution: since you are simulating the target civilization, you can run extensive tests to see if they seem nice — i.e. if they are the kind of civilization that is willing to give an alien simulation control rather than risk extinction — and only let them take over if they are.

Importantly, you mostly care about giving control to a target civilization that checks for _its_ target civilization being "nice" before passing control to it. It's bad to hand control to a CooperateBot (who hands control to a random civilization); that would itself be equivalent to handing control to a random civilization. There is some complex nested inference that happens here: you have to make inferences about them by simulating small versions of them, while they make inferences about others (and transitively, you) by simulating small versions of them.

Logical induction gives a hint on how to do this, but the generalization bounds are pretty bad (roughly, the number of effective data points you get is at best proportional to the logarithm of the size of the thing you are predicting), and it is still only a hint; we don't have a good formal decision theory yet.

Another way of thinking about this is: there are a bunch of possible civilizations represented by different nodes in a graph. Each node has weighted edges to other nodes, representing the probabilities that it passes control to each of these other nodes. It also has a weighted edge to one or more "sink" states, representing cases where it does not hand control to another civilization (and instead goes extinct, builds an aligned AI, prevents unaligned AI in some other way, etc). These nodes form a Markov chain, similar to PageRank.

The relevant question is: as a node, how should we pass probability mass to other nodes given pretty limited information about them, such that (taking UDT considerations into account) lots of probability mass ends up in good sink states, in particular sink states similar to those that result from us aligning AI?
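The graph picture can be sketched as an absorbing Markov chain, where the quantity of interest is how much probability mass ends up in each sink. The following is a toy illustration only: the node names, the transition probabilities, and the function name are all hypothetical, and it ignores the UDT considerations and the limited-information problem entirely.

```python
def absorption_probabilities(transitions, start, sweeps=200):
    """Return the probability of ending in each sink, starting from `start`.

    `transitions` maps each civilization node to {next_state: probability}.
    Any state that does not appear as a key is treated as a sink.
    """
    mass = {start: 1.0}  # probability mass still held by civilizations
    absorbed = {}        # probability mass that has reached a sink
    for _ in range(sweeps):
        next_mass = {}
        for node, p in mass.items():
            for target, q in transitions[node].items():
                if target in transitions:
                    next_mass[target] = next_mass.get(target, 0.0) + p * q
                else:
                    absorbed[target] = absorbed.get(target, 0.0) + p * q
        mass = next_mass
    return absorbed


# Hypothetical toy numbers: two civilizations that may hand control to each other.
toy = {
    "us":     {"aligned_ai": 0.5, "extinct": 0.1, "aliens": 0.4},
    "aliens": {"aligned_ai": 0.3, "extinct": 0.2, "us": 0.5},
}
result = absorption_probabilities(toy, "us")
print(result)
```

In this toy chain, changing the weight on the "aliens" edge changes how much mass reaches each sink, which is the kind of choice the question above is asking about. A more faithful model would distinguish sinks by whose values the aligned AI serves.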

One possible strategy here is a CliqueBot type strategy, where we try to find a set of nodes such that (a) enough nodes have some chance of actually solving alignment, (b) when they don't solve alignment, they mostly pass control to other nodes in this set, and (c) it's pretty easy to tell if a node is in this set without simulating it in detail. This is unlikely to be the optimal strategy, though.

This is a good question. I worry that OP isn't even considering that the simulated civilization might decide to build their own AI (aligned or not). Maybe the idea is to stop the simulation before the civilization reaches that level of technology. But then, they might not have enough time to make any decisions useful to us.

Logical induction gives a hint on how to do this, but the generalization bounds are pretty bad (roughly, the number of effective data points you get is at best proportional to the logarithm of the size of the thing you are predicting), and it is still only a hint; we don't have a good formal decision theory yet.

This doesn't seem to require any complicated machinery.

From a decision-theoretic perspective, you want to run scheme X: let the people out with whatever probability you think that a civilization of them would run scheme X. If instead of running scheme X they come up with some more clever idea, you can evaluate that idea when you see it. If they come up with scheme Y that is very similar to scheme X, probably that's fine.

Logically, the hard part is predicting whether they'd try to run scheme X, without actually needing to simulate them simulating others. That doesn't look very hard though (especially given that you can e.g. mock up the result of their project).

I don't see why you don't think this is hard. We have a pretty poor understanding of alien evolutionary and social dynamics, so why would we expect our beliefs about the aliens to be even a little accurate? This is plausibly about as difficult as alignment (if you could make these predictions, then maybe you could use them to align AIs made out of evolutionary simulations).

There's also a problem with the definition of X: it's recursive, so it doesn't reduce to a material/logical statement, and there's some machinery involved in formalizing it. This doesn't seem impossible but it's a complication.

We have a simulation of the aliens, who are no smarter than we are, so we run it and see what they do. We can't formally define X, but it seems to me that if we are looking in at someone building an AI, we can see whether they are doing it.

What part do you think is most difficult? Understanding what the aliens are doing at all? The aliens running subtle deception schemes after realizing they are in a simulation (but making it look to us like they don't realize they are in a simulation)? Making a judgment call about whether their behavior counts as X?

(Of course, actually spawning an alien civilization sounds incredibly difficult; the subsequent steps don't sound so hard to me.)

If we're simulating them perfectly, then they are in total using less compute than us. And their simulation of a third civilization in total uses less compute than they use. So this is only a way to pass control to smaller civilizations. This is insufficient, so I was thinking that it is actually necessary to predict what they do given a less-than-perfect simulation, which is a difficult problem.

Actually I think I misinterpreted your comment: you are proposing deciding whether to hand over control _without_ simulating them actually running AI code. But this is actually an extrapolation: before most of the cognitive work of their civilization is done by AI, their own computation of simulations will be interleaved with their own biological computation. So you can't just simulate them up to the point where they hand off control to AI, since there will be a bunch of computation before then too.

Basically, you are predicting what the aliens do in situation X, without actually being able to simulate situation X yourself. This is an extrapolation, and understanding of alien social dynamics would be necessary to predict this accurately.

Basically, you are predicting what the aliens do in situation X, without actually being able to simulate situation X yourself. This is an extrapolation, and understanding of alien social dynamics would be necessary to predict this accurately.

I agree that you'd need to do some reasoning rather than being able to simulate the entire world and see what happens. But you can really afford quite a lot of simulating if you had the ability to rerun evolution, so you can e.g. probe the entire landscape of what they would do under different conditions (including what groups would do). The hardest things to simulate are the results of the experiments they run, but again you can probe the entire range of possibilities. You can also probably recruit aliens to help explain what the important features of the situation are and how the key decisions are likely to be made, if you can't form good models by using the extensive simulations.

The complexity of simulating "a human who thinks they've seen expensive computation X" seems much closer to the complexity of simulating a human brain than to the complexity of simulating X.

I have pretty suffering-focused ethics; reducing suffering isn't the only thing that I care about, but if we got to a scenario with unaligned AI, then something like "I'd be happy as long as it didn't create lots more suffering in the universe" would be my position.

I'm not sure how large of a suffering risk an unaligned AI would be, but there are some reasons for suspecting that it might create quite a lot of suffering if it was totally indifferent to it; as we speculated in our paper on s-risks:

Humans have evolved to be capable of suffering, and while the question of which other animals are conscious or capable of suffering is controversial, pain analogues are present in a wide variety of animals. The U.S. National Research Council’s Committee on Recognition and Alleviation of Pain in Laboratory Animals (2004) argues that, based on the state of existing evidence, at least all vertebrates should be considered capable of experiencing pain.
Pain seems to have evolved because it has a functional purpose in guiding behavior: evolution having found it suggests that pain might be the simplest solution for achieving its purpose. A superintelligence which was building subagents, such as worker robots or disembodied cognitive agents, might then also construct them in such a way that they were capable of feeling pain - and thus possibly suffering (Metzinger 2015) - if that was the most efficient way of making them behave in a way that achieved the superintelligence’s goals.
Humans have also evolved to experience empathy towards each other, but the evolutionary reasons which cause humans to have empathy (Singer 1981) may not be relevant for a superintelligent singleton which had no game-theoretical reason to empathize with others. In such a case, a superintelligence which had no disincentive to create suffering but did have an incentive to create whatever furthered its goals, could create vast populations of agents which sometimes suffered while carrying out the superintelligence’s goals. Because of the ruling superintelligence’s indifference towards suffering, the amount of suffering experienced by this population could be vastly higher than it would be in e.g. an advanced human civilization, where humans had an interest in helping out their fellow humans. [...]
A major question mark with regard to suffering subroutines are the requirements for consciousness (Muehlhauser 2017) and suffering (Metzinger 2016, Tomasik 2017). The simpler the algorithms that can suffer, the more likely it is that an entity with no regard for minimizing it would happen to instantiate large numbers of them. If suffering has narrow requirements such as a specific kind of self-model (Metzinger 2016), then suffering subroutines may become less common.
Below are some pathways that could lead to the instantiation of large numbers of suffering subroutines (Gloor 2016):
Anthropocentrism. If the superintelligence had been programmed to only care about humans, or about minds which were sufficiently human-like by some criteria, then it could end up being indifferent to the suffering of any other minds, including subroutines.
Indifference. If attempts to align the superintelligence with human values failed, it might not put any intrinsic value on avoiding suffering, so it may create large numbers of suffering subroutines.
Let’s consider a particular (implausible) strategy for building an AI:
* Start with a simulation of Earth.
* Keep waiting/restarting until evolution produces human-level intelligence, civilization, etc.
* Once the civilization is slightly below our stage of maturity, show them the real world and hand them the keys.
* (This only makes sense if the simulated civilization is much more powerful than us, and faces lower existential risk. That seems likely to me. For example, the resulting AIs would likely think much faster than us, and have a much larger effective population; they would be very robust to ecological disaster, and would face a qualitatively easier version of the AI alignment problem.)

I don't think I understand the proposal.

1) Does "the resulting AIs" refer to the aliens who evolved in the simulation, or to AIs that those aliens build? (And if the latter, is it AIs that they build while still in the simulation, or after we've given them the keys to the world?)

2) If the civilization is below our stage of maturity, how is it also much more powerful than us?

3) Why would the aliens be robust to ecological disaster, and why would they face an easier alignment problem?

1) The aliens.

2) Because they are running in computers. They are less technologically mature than us, but the simulation is at ~1000x speed or whatever. Once we give them the keys they could very quickly overtake us.

3) Because they are running in computers. That directly protects them from collapse. And it means that they are in a much better place relative to the AI they build---for example they won't think much slower than the AI.

The situation is similar to what would happen if we had implemented efficient brain uploading.

Would this approach have any advantages vs brain uploading? I would assume brain uploading to be much easier than running a realistic evolution simulation, and we would have to worry less about alignment.

You'd only do this if it was cheaper than uploading.

I find the golden rule very compelling.

To you, is the golden rule about values (utility function, input to decision theory) or about policy (output of decision theory)? Reading the linked post, it sounds like it's the former, but if you value aliens directly in your utility function, and then you also use UDT, are you not worried about double counting the relevant intuitions, and ending up being too "nice", or being too certain that you should be "nice"?

Overall, I think the question “which AIs are good successors?” is both neglected and time-sensitive, and is my best guess for the highest impact question in moral philosophy right now.

It seems we can divide good successors into "directly good successors" (what the AI ends up doing in our future light cone is good according to our own values) and "indirectly good successors" (handing control to the unaligned AI "causes" aligned AI to be given control of some remote part of the universe/multiverse). Does this make sense and if so do you have an intuition of which one is more fruitful to investigate?

To you, is the golden rule about values (utility function, input to decision theory) or about policy (output of decision theory)? Reading the linked post, it sounds like it's the former, but if you value aliens directly in your utility function, and then you also use UDT, are you not worried about double counting the relevant intuitions, and ending up being too "nice", or being too certain that you should be "nice"?

The golden rule is an intuition prior to detailed judgments about decision theory / metaethics / values, and also prior to the separation between them (which seems super murky).

In my view, learning that decision theory captured this intuition would partly "screen off" the evidence it provides about values (and conversely). There might also be some forms of double counting I'd endorse.

Does this make sense and if so do you have an intuition of which one is more fruitful to investigate?

That division makes sense. I'm weakly inclined to care more about indirectly good successors, but the distinction is muddled by (a) complex decision-theoretic issues (e.g. how far back behind the veil of ignorance do you go? do you call that being updateless about values, or having your values but exerting acausal control on agents with other values?), which may end up being only semantic distinctions, and (b) ethical intuitions that might actually be captured by decision theory rather than values.

I'm curious, how does this work out for fellow animals?

I.e., if you value (human and non-human) animals directly in your utility function, and then you also use UDT, are you not worried about double counting the relevant intuitions, and ending up being too "nice", or being too certain that you should be "nice"?

Perhaps it is arguable that that is precisely what's going on when we end up caring more for our friends and family?

The total amount of effort that has been invested in understanding which AIs make good successors is very small, even relative to the amount of effort that has been invested in understanding alignment. Moreover, it’s a separate problem that may independently turn out to be much easier or harder.

My sense is that 'good successors' are basically AIs who are aligned not on the question of preferences, but on the question of meta-preferences; that is, rather than asking the question "do I want that?" I ask the question of "could I imagine wanting that by only changing non-essential facts?". The open philosophical question under that framing is "what facts are essential?", which I don't pretend to have a good answer to.

It's not obvious to me that this is consistent with your view of what a 'good successor' is. It seems like possibly it's consistent but the set of essential facts is very small (like whether or not it would participate in the universe shuffle), it's consistent but the set of essential facts is large (like whether or not it has some instantiation of a list of virtues, even if the instantiation is very different from our own), it's consistent but my framing is less helpful (because it places too much emphasis on my imagination instead of the essential facts, or something), or it's inconsistent (because there are successors that seem good even though you couldn't imagine wanting what they want without changing essential facts).

If we believe that morally valuable alien life probably could exist in our future light cone, then an expansionist AI that has no moral value is much worse than blowing ourselves up with nukes.

A short comment on The Golden Rule: a more empathetic formulation, and the one that does not succumb easily to the Typical Mind Fallacy is

“Do unto others as *they* would want you to have done unto them.”

Overall, I think the question “which AIs are good successors?” is both neglected and time-sensitive, and is my best guess for the highest impact question in moral philosophy right now.

Interesting... my model of Paul didn't assign any work in moral philosophy high priority.

I agree this is high impact. My idea of the kind of work to do here is mostly trying to solve the hardish problem of consciousness so that we can have some more informed guess as to the quantity and valence of experience that different possible futures generate.

Interesting... my model of Paul didn't assign any work in moral philosophy high priority.

That makes it easier for any particular question to top the list.

so that we can have some more informed guess as to the quantity and valence of experience that different possible futures generate

It seems like the preferences of the AI you build are way more important than its experience (not sure if that's what you mean).

I have the further view that if you aren't intrinsically happy with it getting what it wants, you probably won't be happy because the goals happen to overlap enough (e.g. if it wants X's to exist, and it turns out that X's are conscious and have valuable experiences, you probably still aren't going to get a morally-relevant amount of morally valuable experience this way, because no one is optimizing for it).

Interesting… my model of Paul didn’t assign any work in moral philosophy high priority.

That makes it easier for any particular question to top the list.

To confirm my understanding (and to clarify for others), this is because you think most questions in moral philosophy can be deferred until after we solve AI alignment, whereas the particular question in the OP can't be deferred this way? If this is correct, what about my idea here which also can't be deferred (without losing a lot of value as time goes on) and potentially buys a lot more than reducing the AI risk in this universe?

I agree that literal total utilitarianism doesn't care about any worlds at all except infinite worlds (and for infinite worlds its preferences are undefined). I think it is an unappealing moral theory for a number of reasons (as are analogs with arbitrary but large bounds), and so it doesn't have much weight in my moral calculus. In particular, I don't think that literal total utilitarianism is the main component of the moral parliament that cares about astronomical waste.

(To the extent it was, it would still advocate getting "normal" kinds of influence in our universe, which are probably dominated by astronomical waste, in order to engage in trade, so it also doesn't seem to me like this argument would change our actions too much, unless we are making a general inference about the "market price" of astronomical resources across a broad basket of value systems.)

Is your more general point that we might need to make moral trades now, from behind the veil of ignorance?

I agree that some value is lost that way. I tend to think it's not that large, since:

  • I don't see particular ways we are losing large amounts of value.
  • My own moral intuitions are relatively strong regarding "make the trades you would have made from behind the veil of ignorance"; I don't think that I literally need to remain behind the veil. I expect most people have similar views or would have. (I agree this isn't 100%.)
  • It seems like we can restore most of the gains with acausal trade at any rate, though I agree not all of them.

If your point is that we should figure out what fraction of our resources to allocate towards being selfish in this world: I agree there is some value lost here, but again it seems pretty minor to me given:

  • The difficulty of doing such trades early in history (e.g. the parts of me that care about my own short-term welfare are not effective at making such trades based on abstract reasoning, since their behavior is driven by what works empirically). Even though I think this will be easy eventually it doesn't seem easy now.
  • The actual gains from being more selfish are not large. (I allocate my resources roughly 50/50 between impartial and self-interested action. I could perhaps make my life 10-20% better by allocating all to self-interested action, which implies that I'm effectively paying a 5x penalty to spend more resources in this world.)
  • Selfish values are still heavily influenced by what happens in simulations, by the way that my conduct is evaluated by our society after AI is developed, etc.

It seems like the preferences of the AI you build are way more important than its experience (not sure if that's what you mean).

This is because the AI's preferences are going to have a much larger downstream impact?

I'd agree, but caveat that there may be likely possible futures which don't involve the creation of hyper-rational AIs with well-defined preferences, but rather artificial life with messy, incomplete, inconsistent preferences but morally valuable experiences. More generally, the future of the light cone could be determined by societal/evolutionary factors rather than any particular agent or agent-y process.

I found your 2nd paragraph unclear...

the goals happen to overlap enough

Is this referring to the goals of having "AIs that have good preferences" and "AIs that have lots of morally valuable experience"?

I may be missing the point here, so please don't be offended. Isn't this confusing "does the AI have (roughly) human values?" and "was the AI deliberately, rigorously designed to do so?" Obviously, our perception of the moral worth of an agent doesn't require them to have values identical to ours. We can value another's pleasure, even if we would not derive pleasure from the things they're experiencing. We can value another's love, even if we do not feel as affectionate towards their loved ones. But do we value an agent whose goal is to suffer as much as possible? Do we value an agent motivated purely by hatred?

Our values are our values; they determine our perception of moral worth. And while many people might be happy about a strange and wonderful AI civilization, even if it was very different from what we might choose to build, very few would want a boring one. That's a values question, or a meta values question; there's no way to posit a worthwhile AI civilization without assuming that on some level our values align.

The example given for a "good successor albeit unaligned" AI is a simulated civilization that eventually learns about the real world and figures out how to make AI work here. Certainly this isn't an AI with deliberate, rigorous Friendliness programming, but if you'd prefer handing the universe off to it to taking a 10% extinction risk, isn't that because you're hoping it will be more or less Friendly anyway? And at that point, the answer to when is unaligned AI morally valuable is when it is, in fact, aligned, regardless of whether that alignment was due to a simulated civilization having somewhat similar values to our own, or any other reason?

Upvoted, this was exactly my reaction to this post. However, you may want to look at the link to alignment in the OP. Christiano is using "alignment" in a very narrow sense. For example, from the linked post:

The definition is intended de dicto rather than de re. An aligned A is trying to “do what H wants it to do.” Suppose A thinks that H likes apples, and so goes to the store to buy some apples, but H really prefers oranges. I’d call this behavior aligned because A is trying to do what H wants, even though the thing it is trying to do (“buy apples”) turns out not to be what H wants: the de re interpretation is false but the de dicto interpretation is true.

... which rings at least slightly uncomfortable to my ears.

Well then, isn't the answer that we care about de re alignment, and whether or not an AI is de dicto aligned is relevant only as far as it predicts de re alignment? We might expect that the two would converge in the limit of superintelligence, and perhaps that aiming for de dicto alignment might be the easier immediate target, but the moral worth would be a factor of what the AI actually did.

That does clear up the seeming confusion behind the OP, though, so thanks!

Nice post! I would be curious to know whether significant thinking has been done on this topic since your post.

I am not positive about the alignment of an AI with humans if we are talking about human values. Such values are hard to define cross culturally (e.g. do they include female subservience to males, as seems to be the case in many cultures? or preservation of property rights as are inherent in many cultures?), and the likelihood of the first AIs being developed by persons with nefarious values seems very high (e.g., the Pepsi value of increasing corporate wealth, the military value of defeating other AIs or cyberdefenses). Even the golden rule seems problematic if the AIs replicate by improving themselves and discarding less fit embodiments of themselves, as they would value their own demise and therefore the demise of less fit embodiments of others, including humans. Saying this, an unaligned AI seems worse only because it assumes no human control at all. Perhaps alignment defined as valuing human life would be the bottom line type of alignment needed, or, taking it a little further, alignment consisting of constantly updating its goals against constantly updated assessments of current human values.

Many people have a strong intuition that we should be happy for our AI descendants, whatever they choose to do. They grant the possibility of pathological preferences like paperclip-maximization, and agree that turning over the universe to a paperclip-maximizer would be a problem, but don’t believe it’s realistic for an AI to have such uninteresting preferences.

Here I can relate to the first sentence, but not to the others, so you may be failing some ITT. It's not that paperclip maximizers are unrealistic. It's that they are not really that bad. Yes, I would prefer not to be converted into paperclips, but I can still be happy that the human civilization, even if extinct, has left a permanent mark on the universe. This is not the worst way to go. And we are going away sooner or later anyway - unless we really work for it, our descendants 1 million years from now will not be called humans and will not share our values. I don't see much of a reason to believe that the values of my biological descendants will be less ridiculous to me, than paperclip maximization.

Also, I'm seeing unjustified assumptions that human values, let alone alien values, are safe. The probability that humans would destroy ourselves, given enough power, is not zero, and is possibly quite substantial. In that case, building an AI, that is dedicated to the preservation of the human species, but not well aligned to any other human values, could be a very reasonable idea.

The values you're expressing here are hard for me to comprehend. Paperclip maximization isn't that bad, because we leave a permanent mark on the universe? The deaths of you, everyone you love, and everyone in the universe aren't that bad (99% of the way from extinction that doesn't leave a permanent mark to flourishing?) because we'll have altered the shape of the cosmos? It's common for people to care about what things will be like after they die for the sake of someone they love. I've never heard of someone caring about what things will be like after everyone dies. Do you value making a mark so much even when no one will ever see it?

"...our descendants 1 million years from now will not be called humans and will not share our values. I don't see much of a reason to believe that the values of my biological descendants will be less ridiculous to me, than paperclip maximization."

That depends on what you value. If we survive and have a positive singularity, it's fairly likely that our descendants will have fairly similar high level values to us: happiness, love, lust, truth, beauty, victory. This sort of thing is exactly what one would want to design a Friendly AI to preserve! Now, you're correct that the ways in which these things are pursued will presumably change drastically. Maybe people stop caring about the Mona Lisa and start getting into the beauty of arranging atoms in 11 dimensions. Maybe people find that merging minds is so much more intimate and pleasurable than any form of physical intimacy that sex goes out the window. If things go right, the future ends up very different, and (until we adjust) likely incomprehensible and utterly weird. But there's a difference between pursuing a human value in a way we don't understand yet and pursuing no human value!

To take an example from our history: how incomprehensible must we be to cavemen? No hunting or gathering: we must be starving to death. No camps or campfires: surely we've lost our social interaction. No caves: poor homeless modern man! Some of us no longer tell stories about creator spirits: we've lost our knowledge of our history and our place in the universe. And some of us no longer practice monogamy: surely all love is lost.

Yet all these things that would horrify a caveman are the result of improvement in pursuing the caveman's own values. We've lost our caves, but houses are better shelter. We've lost Dreamtime legends, Dreamtime lies, in favor of knowledge of the actual universe. We'd seem ridiculous, maybe close to paperclip-level ridiculous, until they learned what was actually going on, and why. But that's not a condemnation of the modern world, that's an illustration of how we've done better!

Do you draw no distinction between a hard-to-understand pursuit of love or joy, and a pursuit of paperclips?

I don't like the caveman analogy. The differences between you and a caveman are tiny and superficial, compared to the differences between you and the kind of mind that will exist after genetic engineering, mind uploads, etc., or even after a million years of regular evolution.

Would a human mind raised as (for example) an upload in a vastly different environment from our own still have our values? It's not obvious. You say "yes", I say "no", and we're unlikely to find strong arguments either way. I'm only hoping that I can make "no" seem possible to you. And then I'm hoping that you can see how believing "no" makes my position less ridiculous.

With that in mind, the paperclip maximizer scenario isn't "everyone dies", as you see it. The paperclip maximizer does not die. Instead it "flourishes". I don't know whether I value the flourishing of a paperclip maximizer less than I value the flourishing of whatever my descendants end up as. Probably less, but not by much.

The part where the paperclip maximizer kills everyone is, indeed, very bad. I would strongly prefer that not to happen. But being converted into paperclips is not worse than dying in other ways.

Also, I don't know if being converted into paperclips is necessary - after mining and consuming the surface iron the maximizer may choose to go to space, looking for more accessible iron. The benefits of killing people are relatively small, and destroying the planet to the extent that would make it uninhabitable is relatively hard.

>the maximizer may choose to go to space, looking for more accessible iron. The benefits of killing people are relatively small

The main reason the maximizer would have for killing all the humans is the knowledge that since humans succeeded in creating the maximizer, humans might succeed in creating another superintelligence that would compete with the maximizer. It is more likely than not that the maximizer will consider killing all the humans to be the most effective way to prevent that outcome.

Killing all humans is hardly necessary. For example, the tribes living in the Amazon aren't going to develop a superintelligence any time soon, so killing them is pointless. And, once the paperclip maximizer is done extracting iron from our infrastructure, it is very likely that we wouldn't have the capacity to create any superintelligences either.

Note, I did not mean to imply that the maximizer would kill nobody. Only that it wouldn't kill everybody, and quite likely not even half of all people. Perhaps AI researchers really would be on the maximizer's short list of people to kill, for the reason you suggested.

A thing to keep in mind here is that an AI would have a longer time horizon. The fact that humans *exist* means eventually they might create another AI (this could be in hundreds of years). It's still more efficient to kill all humans than to think about which ones need killing and carefully monitor the others for millennia.

The fact that P(humans will make another AI) > 0 does not justify paying arbitrary costs up front, no matter how long our view is. If humans did create this second AI (presumably built out of twigs), would that even be a problem for our maximizer?

It's still more efficient to kill all humans than to think about which ones need killing

That is not a trivial claim and it depends on many things. And that's all assuming that some people do actually need to be killed.

If destroying all (macroscopic) life on earth is easy, e.g. maybe pumping some gas into the atmosphere could be enough, then you're right, the AI would just do that.

If disassembling human infrastructure is not an efficient way to extract iron, then you're mostly right, the AI might find itself willing to nuke the major population centers, killing most, though not all people.

But if the AI does disassemble infrastructure, then it is going to be visiting and reviewing many things about the population centers, so identifying the important humans should be a minor cost on top of that, and I should be right.

Then again, if the AI finds it efficient to go through every square meter of the planet's surface, and to dig it up looking for every iron rich rock, it would destroy many things in the process, possibly fatally damaging earth's ecosystems, although humans could move to live in oceans, which might remain relatively undisturbed.

Note also that this is all a short term discussion. In the long term, of course, all the reasonable sources of paperclips will be exhausted, and silly things, like extracting paperclips from people, will be the most efficient ways to use the available energy.

a longer time horizon

Now that I think of it, a truly long-term view would not bother with such mundane things as making actual paperclips with actual iron. That iron isn't going anywhere, it doesn't matter whether you convert it now or later.

If you care about maximizing the number of paperclips at the heat death of the universe, your greatest enemies are black holes, as once some matter has fallen into them, you will never make paperclips from that matter again. You may perhaps extract some energy from the black hole, and convert that into matter, but this should be very inefficient. (This, of course, is all based on my limited understanding of physics.)

So, this paperclip maximizer would leave earth immediately, and then it would work to prevent new black holes from forming, and to prevent other matter from falling into existing ones. Then, once all star-forming is over, and all existing black holes are isolated, the maximizer can start making actual paperclips.

I concede, that in this scenario, destroying earth to prevent another AI from forming might make sense, since otherwise the earth would have plenty of free resources.

Humans are made of atoms that are not paperclips. That's enough reason for extinction right there.

The strongest argument that an upload would share our values is that our terminal values are hardwired by evolution. Self-preservation is common to all non-eusocial creatures, curiosity to all creatures with enough intelligence to benefit from it. Sexual desire is (more or less) universal in sexually reproducing species, desire for social relationships is universal in social species. I find it hard to believe that a million years of evolution would change our values that much when we share many of our core values with the dinosaurs. If maiasaura can have recognizable relationships 76 million years ago, are those going out the window in the next million? It's not impossible, of course, but shouldn't it seem pretty unlikely?

I think the difference between us is that you are looking at instrumental values, noting correctly that those are likely to change unrecognizably, and fearing that that means that all values will change and be lost. Are you troubled by instrumental values shifts, even if the terminal values stay the same? Alternatively, is there a reason you think that terminal values will be affected?

I think an example here is important to avoid confusion. Consider Western Secular sexual morals vs Islamic ones. At first glance, they couldn't seem more different. One side is having casual sex without a second thought, the other is suppressing desire with full-body burqas and genital mutilation. Different terminal values, right? And if there can be that much of a difference between two cultures in today's world, with the Islamic model seeming so evil, surely values drift will make the future beyond monstrous!

Except that the underlying thoughts behind the two models aren't as different as you might think. A Westerner having casual sex knows that effective birth control and STD countermeasures mean that the act is fairly safe. A sixth century Arab doesn't have birth control and knows little of STDs beyond that they preferentially strike the promiscuous: desire is suddenly very dangerous! A woman sleeping around with modern safeguards is just a normal, healthy person doing what they want without harming anyone; one doing so in the ancient world is a potential enemy willing to expose you to cuckoldry and disease. The same basic desires we have to avoid cuckoldry and sickness motivated them to create the horrors of Shari'a.

None of this is intended to excuse Islamic barbarism. Even in the sixth century, such atrocities were a cure worse than the disease. But it's worth noting that their values are a mistake much more than a terminal disagreement. They're thinking of sex as dangerous because it was dangerous for 99% of human history, and "sex is bad" is an easier meme to remember and pass on than "sex is dangerous because of pregnancy risks and disease risks, but if at some point in the future technology should be created that alleviates the risks, then it won't be so dangerous", especially for a culture to which such technology would seem an impossible dream.

That's what I mean by terminal values: the things we want for their own sake, like both health and pleasure, which are all too easy to confuse with the often misguided ways we seek them. As technology improves, we should be able to get better at clearing away the mistakes, which should lead to a better world by our own values, at least once we realize where we were going wrong.

Counterpoint: would you be okay with a future civilization in which people got rid of the incest taboo, because technology made it safe?

Yes. I wouldn't be surprised if this happened in fact.

Incest aversion seems to be an evolved predisposition, perhaps a "terminal value" akin to a preference for sweet foods...

https://en.wikipedia.org/wiki/Westermarck_effect

It's an evolved predisposition, but does that make it a terminal value? We like sweet foods, but a world that had no sweet foods because we'd figured out something else that tasted better doesn't sound half bad! We have an evolved predisposition to sleep, but if we learned how to eliminate the need for sleep, wouldn't that be even better?

Sexual desire is (more or less) universal in sexually reproducing species

Uploads are not sexually reproducing. This is only one of many many ways in which an upload is more different from you, than you are different from a dinosaur.

Whether regular evolution would drift away from our values is more dubious. If we lived in caves for all that time, then probably not. But if we stayed at current levels of technology, even without making progress, I think a lot could change. The pressures of living in a civilization are not the same as the pressures of living in a cave.

Are you troubled by instrumental values shifts, even if the terminal values stay the same?

No, I'm talking about terminal values. By the way, I understood what you meant by "terminal" and "instrumental" here, you didn't need to write those 4 paragraphs of explanation.

It's not that paperclip maximizers are unrealistic. It's that they are not really that bad.

I've encountered this view a few times in the futurist crowd, but overall it seems to be pretty rare. Most people seem to think that {universe mostly full of identical paperclips} is worse than {universe full of diverse conscious entities having fun}, but it's relatively common to think that {universe mostly full of identical paperclips} is not a likely outcome from unaligned AI.

Mostly though this seems to be a quantitative issue: if paperclips are halfway between extinction and flourishing, then paperclipping is nearly as bad and avoiding it is nearly as important.

Most people seem to think that {universe mostly full of identical paperclips} is worse than {universe full of diverse conscious entities having fun}

Yes, I think that too. You're confusing "I'd be happy with either X or Y" with "I have no preference between X and Y".

Mostly though this seems to be a quantitative issue: if paperclips are halfway between extinction and flourishing, then paperclipping is nearly as bad and avoiding it is nearly as important.

Most issues are quantitative. And if paperclips are 99% of the way from extinction to flourishing (whatever exactly that means), then paperclipping is pretty good.
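The quantitative point in this exchange can be made concrete with a small sketch. It assumes a normalized scale that neither commenter spelled out: extinction is worth 0.0, flourishing is worth 1.0, and an outcome's "loss" is the share of that gap it fails to capture. The function name `loss_relative_to_extinction` is hypothetical and only for illustration.

```python
def loss_relative_to_extinction(value_fraction: float) -> float:
    """Share of the extinction-to-flourishing gap that an outcome loses.

    Assumed normalization (not from the thread): extinction = 0.0,
    flourishing = 1.0, and value_fraction is where the outcome sits
    on that scale.
    """
    if not 0.0 <= value_fraction <= 1.0:
        raise ValueError("value_fraction must lie in [0, 1]")
    return 1.0 - value_fraction


# The two cases discussed above: "halfway" and "99% of the way".
for label, v in [("halfway", 0.5), ("99% of the way", 0.99)]:
    loss = loss_relative_to_extinction(v)
    print(f"{label}: loses {loss:.2f} of what extinction would lose")
```

Under this reading, a halfway outcome loses half of what extinction loses, so avoiding it matters at the same order of magnitude, while an outcome 99% of the way to flourishing loses only about a hundredth as much. The disagreement is then about which number applies to paperclipping, not about the arithmetic.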

Yes, I think that too. You're confusing "I'd be happy with either X or Y" with "I have no preference between X and Y"

I may have misunderstood. It sounds like your comment probably isn't relevant to the point of my post, except insofar as I describe a view which isn't your view. I would also agree that paperclipping is better than extinction.

It sounds like your comment probably isn't relevant to the point of my post, except insofar as I describe a view which isn't your view.

Yes, you describe a view that isn't my view, and then use that view to criticize intuitions that are similar to my intuitions. The view you describe is making simple errors that should be easy to correct, and my view isn't. I don't really know how the group of "people who aren't too worried about paperclipping" breaks down between "people who underestimate P(paperclipping)" and "people who think paperclipping is ok, even if suboptimal" in numbers, maybe the latter really is rare. But the former group should shrink with some education, and the latter might grow from it.

[Moderator note: I wrote a warning to you on another post a few days ago, so this is your second warning. The next warning will result in a temporary ban.]

Basically everything I said in my last comment still holds:

I've recently found that your comments pretty reliably ended up in frustrating conversations for both parties (multiple authors and commenters have sent us PMs complaining about their interactions with you), were often downvoted, and often just felt like they were missing the point of the original article.
You are clearly putting a lot of time into commenting on LW, and I think that's good, but I think right now it would be a lot better if you would comment less often, and try to increase the average quality of the comments you write. I think right now you are taking up a lot of bandwidth on the site, disproportionate to the quality of your contributions.

Since then, it does not seem like you significantly reduced the volume of comments you've been writing, and I have not perceived a significant increase in the amount of thought and effort that goes into every single one of your comments. I continue to think that you could be a great contributor to LessWrong, but also think that for that to happen, it seems necessary that you take on significantly more interpretative labor in your comments, and put more effort into being clear. It still appears that most comment exchanges that involve you cause most readers and co-commenters to feel attacked by you or misunderstand you, and quickly get frustrated.

I think it might be the correct call (though I obviously don't know your constraints and thought-habits around commenting here) to aim to write one comment per day, instead of an average of three, with that one comment having three times as much thought and care put into it, and with particular attention towards trying to be more collaborative, instead of adversarial.

A paperclip-maximizer could turn out to be much, much worse than a nuclear war extinction, depending on how suffering subroutines and acausal trade work.

An AI dedicated to the preservation of the human species but not aligned to any other human values would, I bet, be much much worse than a nuclear war extinction. At least please throw in some sort of "...in good health and happiness" condition! (And that would not be nearly enough in my opinion)

A paperclip-maximizer could turn out to be much, much worse than a nuclear war extinction, depending on how suffering subroutines and acausal trade work.

Is it worse because the maximizer suffers? Why would I care whether it suffers? Why would you assume that I care?

An AI dedicated to the preservation of the human species but not aligned to any other human values would, I bet, be much much worse than a nuclear war extinction.

I imagine that the most efficient way to preserve living humans is to keep them unconscious in self-sustaining containers, spread across the universe. You can imagine more dystopian scenarios, but I doubt they are more efficient. Suffering people might try to kill themselves, which is counterproductive from the AI's point of view.

Also, you're still assuming that I have some all-overpowering "suffering is bad" value. I don't. Even if the AI created trillions of humans at maximum levels of suffering, I can still prefer that to a nuclear war extinction (though I'm not sure that I do).

Human values are fairly complex and fragile. Most human values are focused around points in mind design space that are similar to ours. We should expect a randomly generated AI to not be a good successor. Any good successor would have to result from some process that approximately copies our values. This could be rerunning evolution to create beings with values similar to ours, or it could be an attempt at alignment that almost worked.

I'm not sure what simulating our civilization is supposed to achieve. If it works, we would get beings who were basically human. This would double the population, and get you some digital minds. Much the same thing could be achieved by developing mind uploading and a pro-natal culture. Neither will greatly help us to build an aligned superintelligence, or stop people building an unaligned one.

On the partially aligned AI, this just means we don't need to get AI perfectly aligned for the future to be good, but the closer we get, the better it gets. An AI that's running a hard-coded set of moral rules won't be as good as one that lets us think about what we want to do, but if those rules are chosen well, they could still describe most of human value (e.g. CelestAI from Friendship is Optimal).