Epistemic status

These are my cursory thoughts on this topic after reading about it for a few days and conversing with some other people. I still have high uncertainty, and I raise questions here whose answers could reduce that uncertainty.

Content warning

Discussion of risks of astronomical suffering[1]

Why focus on s-risks to contemporary humans?

Most discussions of suffering risks from artificial superintelligence focus on future minds or non-human beings. These concerns are important. However, might an ASI also inflict severe suffering on humans who exist when it takes over or on simulations of their minds?

If this specific category of s-risks is significant, I think that talking about it may encourage more people to care about s-risks even if they do not believe in the moral foundations of longtermism. Rightly or wrongly, many people value the well-being of themselves and people they know in the present over that of other beings. If we can show that s-risks could affect contemporary humans, that can help build broader support for avoiding these risks. 

Some of the questions I raise here also apply to estimating s-risk probabilities more generally.

Summary

Placing a probability on s-risks to contemporary humans seems difficult due to our limited understanding of what a misaligned AI's goals would be. There are some factors that may raise this risk above mere chance, including the path-dependent nature of AI development, instrumental use of sentient simulations, spiteful inner objectives, near-miss scenarios, and misuse of intent-aligned AI. However, the overall probability remains uncertain.

Difficulty of assigning probability to human-preserving goals of misaligned ASI

Existing AI models often behave in unexpected ways,[2] and if AI reaches human-level intelligence before we solve inner alignment, its inner objective will be beyond our control. This unpredictability makes it hard to meaningfully estimate the probability that an unaligned ASI’s terminal goals will involve human minds (virtual or physical, suffering or happy)[3] rather than being those of a “paperclip maximizer.”[4] A few arguments suggest that the consequence of misalignment is much more likely to be “mere extinction” than s-risks affecting contemporary humans:

  • If we assume an ignorance prior over possible AI objectives, futures involving humans make up only a tiny fraction of possibilities. However, it's unclear exactly how small.[5]
  • Occam’s razor: A scenario in which a misaligned AI chooses to emulate the brains of contemporary humans or maintain us in meat-space is much more complex than tiling the universe with paperclips.[6]
  • Outside view: Most people in the AI safety community seem to focus on extinction.[7] And although some people focus on s-risk, these discussions tend to center on, e.g., sentient subroutines, non-human animals, or minds that may come to exist in the far future, rather than existing humans. To the extent that you defer to this community's consensus, this is evidence against s-risks to current humans being a major concern.

However, even if the probability of s-risk is low compared to x-risk, it may still be worth worrying about, given the astronomically worse stakes.[8] It is plausible that the risk is too low to worry about even given these stakes,[9] but it is also plausible that the risk is significant, given the uncertainty about what prior to assign and the reasons to think the probability is higher than it would seem a priori.
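As a rough sketch of this expected-value comparison (the symbols and numbers here are placeholders I am introducing for illustration, not estimates made in this post): if $p_s$ and $p_x$ are the probabilities of the s-risk and extinction outcomes, and $S$ and $X$ are their respective disvalues, then the s-risk term dominates the expected disvalue whenever

$$p_s \cdot S > p_x \cdot X \;\Longleftrightarrow\; \frac{p_s}{p_x} > \frac{X}{S},$$

so even a probability ratio as small as $10^{-6}$ would suffice if one judged the s-risk outcome more than $10^{6}$ times worse than extinction (an entirely illustrative figure).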

The goals of a misaligned AI may not be purely random. When current AI systems fail, they typically optimize for proxies that correlate with their training objectives rather than completely unrelated goals. This matters because most AI systems train extensively on human-related tasks and data. Even a misaligned AI might develop goals that involve humans or human models, but not necessarily in ways we want. However, it is unclear how likely this is to involve conscious humans rather than other ways of maximizing reward. 

Questions:

  • Assuming that an ASI develops a random terminal goal, what prior should one place on such a goal involving humans?
  • To what extent is it accurate to describe a misaligned AI’s inner objective as random?
  • What evidence do we have that can help us predict an unaligned AI’s goals? What does this evidence say about whether its goals would involve human suffering rather than extinction?

Instrumental simulations

Even an AI with a terminal goal of creating paperclips may instantiate suffering for instrumental reasons.[10] 

For example, Nick Bostrom suggests that an ASI may run conscious simulations of human minds in order to understand our psychology.[11] Sotala and Gloor (2017) describe other scenarios in which suffering simulations may come about.

It's unclear whether this suffering would involve contemporary humans, but the questions below bear on that probability.

Significance

This kind of suffering may be smaller in scope than other risks because the agent faces opportunity costs. For example, Bostrom suggests that conscious simulations would eventually be discarded “once their informational usefulness has been exhausted.” However, Gloor (2016) writes that “an AI with goals that are easy to fulfill – e.g. a ‘paperclip protector’ that only cares about protecting a single paperclip, or an AI that only cares about its own reward signal – would have much greater room pursuing instrumentally valuable computations.” 

The agent would also not necessarily maximize the level of suffering; it would only need to create as much as is necessary to achieve its goals.

However, instrumental suffering is perhaps more likely than other kinds of s-risk because it can arise across a wide range of terminal goals (not just those that involve sentient minds terminally), provided that creating sentient simulations really is a convergent instrumental goal.

Questions

  • To what degree can an intelligent agent infer facts about reality based on limited information?[12] If it can infer a lot, it may not have to gather information in the physical world in ways that could harm existing humans.
  • Would instrumental simulations be likely to contain copies of the minds of particular existing beings, or would they be simulations of generic beings?[13]

Spite

Might an ASI be more likely than pure chance to develop a terminal goal of human suffering? 

Macé et al. (2023) suggest some reasons why spite may naturally arise in AI systems. For example, AI may learn a spiteful objective if humans demonstrate a similar objective in its training data, or a spiteful objective may be a convergent instrumental strategy.

Given that AIs are created by humans and trained on human data, we shouldn’t assume a priori that anthropomorphic behavior from an ASI is no more likely than any other region of the possibility space, even if its probability remains small. Since AIs are trained to mimic humans, humans may serve as a rough reference class when predicting AI behavior, although it is hard to draw strong anthropomorphic parallels given the significant differences.

Some have argued that a sentient AI could take revenge. However, an AI need not actually be sentient to develop such a goal; it only needs to mimic vengeful behavior patterns. If such objectives develop, AI may target contemporary humans.

Significance 

The scale and severity of suffering from an AI with a spiteful objective could be high compared to instrumental suffering, since the agent would have an intrinsic goal of causing suffering. It’s unclear how likely this is, since I haven’t seen much discussion of it.

Questions

  • How likely is it that an AI would develop a spiteful objective?
  • Is there evidence to show that the behavior of humans or sentient beings can serve as a rough reference class for AI behavior?

Alignment as narrowing possibility space

Even if the fraction of AI futures that involve humans in any form is small, alignment efforts could succeed in concentrating probability mass on this region. However, a large part of this region may involve high levels of suffering. Thus, alignment efforts may decrease x-risk at the expense of increasing s-risk.

Questions:

  • Will increases in s-risk as we get closer to full alignment be continuous, or will they increase in discrete levels with certain advancements (e.g. solving inner alignment or giving an AI a specification of the human value function)?
  • How likely is it that we might solve “parts of” alignment without solving other parts (e.g. solving inner alignment without having correctly specified human values, or vice versa)?
  • Do scenarios where ASI is aligned or “close to” aligned have higher probability of s-risk than scenarios where it is totally misaligned? If yes, this would reduce overall s-risk if you believe alignment to be unlikely.

Near-miss scenarios

"Near-miss" scenarios may lead to severe suffering if an error causes an AI to maximize the opposite of an accurately specified human value function[14] or if it incorrectly actualizes important parts of human values.[15] 

Regarding the latter scenario, one may argue that avoiding suffering is such a basic human value that an AI aligned with even a rough picture of human values would understand this. But learning which experiences constitute suffering may not be straightforward, and not all value systems consider suffering categorically bad. This ambiguity may lead to a significant level of unnecessary suffering in some of the futures with “semi-aligned” ASI.

Such suffering may also occur if humans intentionally teach an AI a goal but do not correctly understand the consequences.[16] 

Significance

Some of these scenarios maximize the level of suffering, whereas others produce it only incidentally. In either case, the extent and duration of suffering may be high: because the suffering is part of the ASI’s terminal goal, it would continue creating it indefinitely in order to maximize its expected utility.[17]

Misuse

Finally, even if humans successfully create an intent-aligned ASI, some people might misuse it in ways that intentionally or instrumentally create suffering.[18] 

Some risks of misuse could fall on contemporary humans (e.g., at the hands of a sadistic dictator), while others would fall on other beings.

It may be more common for humans to create suffering as a byproduct, rather than increasing it for its own sake, but DiGiovanni (2023) notes:

Malevolent traits known as the Dark Tetrad—Machiavellianism, narcissism, psychopathy, and sadism—have been found to correlate with each other (Althaus and Baumann 2020; Paulhus 2014; Buckels et al. 2013; Moshagen et al. 2018). This suggests that individuals who want to increase suffering may be disproportionately effective at social manipulation and inclined to seek power. If such actors established stable rule, they would be able to cause the suffering they desire indefinitely into the future.

Another possibility is that of s-risks via retribution. While a preference for increasing suffering indiscriminately is rare among humans, people commonly have the intuition that those who violate fundamental moral or social norms deserve to suffer, beyond the degree necessary for rehabilitation or deterrence (Moore 1997). Retributive sentiments could be amplified by hostility to one’s “outgroup,” an aspect of human psychology that is deeply ingrained and may not be easily removed (Lickel 2006). To the extent that pro-retribution sentiments are apparent throughout history (Pinker 2012, Ch. 8), values in favor of causing suffering to transgressors might not be mere contingent “flukes.”

  1. ^

    The tone of this article has been edited slightly.

  2. ^
  3. ^

    Credit to Tariq Ali for this point.

  4. ^

    Bostrom, Nick (2014) Superintelligence: Paths, Dangers, Strategies, p. 150

  5. ^

    We can’t say “it could involve humans or it could not involve humans, so it might be something like 50:50”; this is an anthropocentric and arbitrary way to divide the probability.

  6. ^

    Bostrom (2014) wrote that “because a meaningless reductionistic goal is easier for humans to code and easier for an AI to learn, it is just the kind of goal that a programmer would install in his seed AI if his focus is on taking the quickest path to ‘getting the AI to work’” (p. 129). However, this was 10 years ago, and I’m not sure if this stands up in the context of modern alignment techniques.

  7. ^

    However, there may be reasons to doubt this consensus.

  8. ^

    Bostrom (2014) estimates 10^58 simulated centuries of human life could exist over the course of the far future (p. 123). See also Fenwick (2023).

  9. ^

    Considering that simulating sentient minds may be quite complex, if the Kolmogorov complexity of this scenario is > 200 bits (this seems like an underestimate), this gives a Solomonoff prior < 10^-60, which could be low enough to make the expected disvalue small. One may argue that an ASI would itself be a simulated mind, but there would still be additional complexity involved in figuring out how to emulate the minds of particular beings, knowing whether they are conscious, etc., depending on the specific scenario.
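    For reference, the arithmetic behind that bound is just a change of base (nothing here depends on the scenario itself):

    $$2^{-200} = 10^{-200\log_{10}2} \approx 10^{-60.2} < 10^{-60}.$$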

  10. ^

    Note that a small Solomonoff prior for sentient simulations would also apply to simulations created for instrumental reasons, but if there are strong reasons to believe such simulations would be a convergent instrumental goal, one could update this prior upward. However, given that this reasoning is inevitably speculative, it can only shift the prior slightly if one starts with a very low prior.

  11. ^

    Bostrom (2014), pp. 153–154. See also Sotala and Gloor (2017), section 5.2.

  12. ^

    See here.

  13. ^

    It seems to me that if the AI is not intentionally emulating specific humans, and if it can emulate a generic sentient mind without emulating specific humans, it would be unlikely to create particular human minds by coincidence. Taking the estimate from above (fn 8) of 10^58 simulated centuries, i.e. roughly 10^60 subjective years, it would seem that the AI could not simulate any particular mind for very long if it wanted to go through all possible minds, as the number of all possible minds is probably much larger than this, and may even be computationally intractable. (ChatGPT gives a rough estimate of 2^(10^11) possible minds, for what it’s worth.)
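    Spelling out the comparison, using the figures already quoted in this footnote (so the conclusion is only as reliable as those rough numbers):

    $$2^{10^{11}} = 10^{10^{11}\log_{10}2} \approx 10^{3\times 10^{10}} \gg 10^{60},$$

    so even if the AI spent all $10^{60}$ subjective years cycling through possible minds, the time available per mind would be on the order of $10^{60-3\times 10^{10}}$ years, i.e. effectively zero.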

  14. ^

    Daniel Kokotajlo has informally estimated the probability of this at 1/30,000 ± 1 order of magnitude (which seems concerningly high given the scale of harm!) for what it’s worth.

  15. ^

    See also Sotala and Gloor (2017) section 5.3

  16. ^

    See, e.g. Ansell (2023) and here.

  17. ^

    See Bostrom (2014), p. 152: “[T]he AI, if reasonable, never assigns exactly zero probability to it having failed to achieve its goal; therefore the expected utility of continuing activity (e.g. by counting and recounting the paperclips) is greater than the expected utility of halting.”

  18. ^

    See, e.g., DiGiovanni (2023), section 3.

2 Answers

Seth Herd


Your first point, that this is a route to getting people to care about ASI risk, is an excellent one that I haven't heard before. I don't think people need to imagine astronomical S-risk to be emotionally affected by less severe and more likely s-risk arguments.

I don't think we should adopt an ignorance prior over goals. Humans are going to try to assign goals to AGI. Those goals will very likely involve humans somehow.

The misuse risks seem much more important, both as real risks, and in their saliency to ordinary people. It is intuitively apparent that many historical monarchs have inflicted dreadful suffering on individuals they disliked, and inflicted mundane suffering on the majority of people under their power.

We might hope that the sheer ease of providing good lives to everyone might sway even a modestly sadistic sociopath to restrict their sadism to a few "enemies of the glorious emperor" while providing relatively utopian conditions to most humans. But we should not assume it, and the public will not.

Zvi's recent piece The Risk of Gradual Disempowerment from AI crystallized for me why people fear concentration of power enabled by AI, and it offers an intuitive argument for how particularly vicious people might wind up running the world. My own If we solve alignment, do we die anyway? outlines a different misuse to fear: sheer villainy (or pragmatic self-interest) from anyone with an intent-aligned AGI capable of recursive self-improvement.

So this is an interesting line of argument for convincing people AGI risk is worth worrying about now.

Thanks for your comment.

The misuse risks seem much more important, both as real risks, and in their saliency to ordinary people. 

I agree that it may be easier to persuade the general public about misuse risks and that these risks are likely to occur if we achieve intent alignment, but in terms of assessing the relative probability: "If we solve alignment" is a significant "if." I take it you view solving intent alignment as not all that unlikely? If so, why? Specifically, how do you expect we will figure out how to prevent deceptive alignment and goal...

avturchin


Content warning – the idea below may increase your subjective estimation of personal s-risks. 

If there is at least one aligned AI, other AIs may have an incentive to create s-risks for currently living humans – in order to blackmail the aligned AI. Thus, s-risk probabilities depend on the likelihood of a multipolar scenario.

Makes sense. What probability do you place on this? It would require solving alignment, a second AI being created before the first can create a singleton, and then the misaligned AI choosing this kind of blackmail over other possible tactics. If the blackmail involves sentient simulations (as is sometimes suggested, although not in your comment), it would seem that the misaligned AI would have to solve the hard problem of consciousness and be able to prove this to the other AI (not a valid blackmail if the simulations are not known to be sentient).

avturchin
I think it becomes likely in a multipolar scenario with 10-100 AIs. One thing to take into account is that other AIs will consider such a risk and keep their real preferences secret. This means that which AIs are aligned will be unknowable both for humans and for other AIs.