Superintelligence 20: The value-loading problem

4 KatjaGrace 27 January 2015 02:00AM

This is part of a weekly reading group on Nick Bostrom's book, Superintelligence. For more information about the group, and an index of posts so far see the announcement post. For the schedule of future topics, see MIRI's reading guide.


Welcome. This week we discuss the twentieth section in the reading guide: The value-loading problem.

This post summarizes the section, and offers a few relevant notes, and ideas for further investigation. Some of my own thoughts and questions for discussion are in the comments.

There is no need to proceed in order through this post, or to look at everything. Feel free to jump straight to the discussion. Where applicable and I remember, page numbers indicate the rough part of the chapter that is most related (not necessarily that the chapter is being cited for the specific claim).

Reading: “The value-loading problem” through “Motivational scaffolding” from Chapter 12


Summary

  1. Capability control is a short-term measure: at some point, we will want to select the motivations of AIs. (p185)
  2. The value loading problem: how do you cause an AI to pursue your goals? (p185)
  3. Some ways to instill values into an AI:
    1. Explicit representation: Hand-code desirable values. (p185-7)
    2. Evolutionary selection: Humans evolved to have values that are desirable to humans—maybe it wouldn't be too hard to artificially select digital agents with desirable values. (p187-8)
    3. Reinforcement learning: In general, a machine receives a reward signal as it interacts with the environment, and tries to maximize that signal. Perhaps we could reward a reinforcement learner for aligning with our values, and it could learn them (a minimal sketch of this approach appears after this list). (p188-9)
    4. Associative value accretion: Have the AI acquire values in the way that humans appear to—starting out with some machinery for synthesizing appropriate new values as it interacts with its environment. (p189-190)
    5. Motivational scaffolding: Start the machine off with some values, so that it can run and thus improve and learn about the world, then swap them out for the values you want once the machine has sophisticated enough concepts to understand your values. (p191-2)
    6. To be continued...
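
To make the reinforcement learning approach (item 3.3 above) concrete, here is a minimal sketch of a tabular Q-learning loop in which the reward comes from a hypothetical human evaluator. The environment interface (reset, actions, step) and the human_reward function are illustrative assumptions, not anything specified in the book; the point is only that the learner maximizes whatever signal it is given.

    import random
    from collections import defaultdict

    def q_learning(env, human_reward, episodes=1000, alpha=0.1, gamma=0.95, epsilon=0.1):
        # Tabular Q-learning against a reward signal supplied by a (hypothetical) human evaluator.
        Q = defaultdict(float)  # (state, action) -> estimated long-run value
        for _ in range(episodes):
            state, done = env.reset(), False
            while not done:
                # Epsilon-greedy: mostly exploit current value estimates, occasionally explore.
                if random.random() < epsilon:
                    action = random.choice(env.actions(state))
                else:
                    action = max(env.actions(state), key=lambda a: Q[(state, a)])
                next_state, done = env.step(state, action)
                reward = human_reward(state, action)  # the only thing the learner ultimately maximizes
                best_next = max((Q[(next_state, a)] for a in env.actions(next_state)), default=0.0)
                Q[(state, action)] += alpha * (reward + gamma * best_next - Q[(state, action)])
                state = next_state
        return Q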

Another view

Ernest Davis, on a 'serious flaw' in Superintelligence

The unwarranted belief that, though achieving intelligence is more or less easy, giving a computer an ethical point of view is really hard.

Bostrom writes about the problem of instilling ethics in computers in a language reminiscent of 1960’s era arguments against machine intelligence; how are you going to get something as complicated as intelligence, when all you can do is manipulate registers?

The definition [of moral terms] must bottom out in the AI’s programming language and ultimately in primitives such as machine operators and addresses pointing to the contents of individual memory registers. When one considers the problem from this perspective, one can begin to appreciate the difficulty of the programmer’s task.

In the following paragraph he goes on to argue from the complexity of computer vision that instilling ethics is almost hopelessly difficult, without, apparently, noticing that computer vision itself is a central AI problem, which he is assuming is going to be solved. He considers that the problem of instilling ethics into an AI system is “a research challenge worthy of some of the next generation’s best mathematical talent”.

It seems to me, on the contrary, that developing an understanding of ethics as contemporary humans understand it is actually one of the easier problems facing AI. Moreover, it would be a necessary part, both of aspects of human cognition, such as narrative understanding, and of characteristics that Bostrom attributes to the superintelligent AI. For instance, Bostrom refers to the AI’s “social manipulation superpowers”. But if an AI is to be a master manipulator, it will need a good understanding of what people consider moral; if it comes across as completely amoral, it will be at a very great disadvantage in manipulating people. There is actually some truth to the idea, central to The Lord of the Rings and Harry Potter, that in dealing with people, failing to understand their moral standards is a strategic gap. If the AI can understand human morality, it is hard to see what is the technical difficulty in getting it to follow that morality.

Let me suggest the following approach to giving the superintelligent AI an operationally useful definition of minimal standards of ethics that it should follow. You specify a collection of admirable people, now dead. (Dead, because otherwise Bostrom will predict that the AI will manipulate the preferences of the living people.) The AI, of course, knows all about them because it has read all their biographies on the web. You then instruct the AI, "Don't do anything that these people would have mostly seriously disapproved of."

This has the following advantages:

  • It parallels one of the ways in which people gain a moral sense.
  • It is comparatively solidly grounded, and therefore unlikely to have a counterintuitive fixed point.
  • It is easily explained to people.

Of course, it is completely impossible until we have an AI with a very powerful understanding; but that is true of all Bostrom's solutions as well. To be clear: I am not proposing that this criterion should be used as the ethical component of everyday decisions; and I am not in the least claiming that this idea is any kind of contribution to the philosophy of ethics. The proposal is that this criterion would work well enough as a minimal standard of ethics; if the AI adheres to it, it will not exterminate us, enslave us, etc.

This may not seem adequate to Bostrom, because he is not content with human morality in its current state; he thinks it is important for the AI to use its superintelligence to find a more ultimate morality. That seems to me both unnecessary and very dangerous. It is unnecessary because, as long as the AI follows our morality, it will at least avoid getting horribly out of whack, ethically; it will not exterminate us or enslave us. It is dangerous because it is hard to be sure that it will not lead to consequences that we would reasonably object to. The superintelligence might rationally decide, like the King of Brobdingnag, that we humans are “the most pernicious race of little odious vermin that nature ever suffered to crawl upon the surface of the earth,” and that it would do well to exterminate us and replace us with some much more worthy species. However wise this decision, and however strongly dictated by the ultimate true theory of morality, I think we are entitled to object to it, and to do our best to prevent it. I feel safer in the hands of a superintelligence who is guided by 2014 morality, or for that matter by 1700 morality, than in the hands of one that decides to consider the question for itself.

Notes

1. At the start of the chapter, Bostrom says ‘while the agent is unintelligent, it might lack the capability to understand or even represent any humanly meaningful value. Yet if we delay the procedure until the agent is superintelligent, it may be able to resist our attempt to meddle with its motivation system.' Since presumably the AI only resists being given motivations once it is turned on and using some other motivations, you might wonder why we wouldn't just wait until we had built an AI smart enough to understand or represent human values, before we turned it on. I believe the thought here is that the AI will come to understand the world and have the concepts required to represent human values by interacting with the world for a time. So it is not so much that the AI will need to be turned on to become fundamentally smarter, but that it will need to be turned on to become more knowledgeable.

2. A discussion of Davis' response to Bostrom just started over at the Effective Altruism forum.

3. Stuart Russell thinks of value loading as an intrinsic part of AI research, in the same way that nuclear containment is an intrinsic part of modern nuclear fusion research.

4. Kaj Sotala has written about how to get an AI to learn concepts similar to those of humans, for the purpose of making safe AI which can reason about our concepts. If you had an oracle which understood human concepts, you could basically turn it into an AI which plans according to arbitrary goals you can specify in human language, because you can say 'which thing should I do to best forward [goal]?' (This is not necessarily particularly safe as it stands, but is a basic scheme for turning conceptual understanding and a motivation to answer questions into any motivation).
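
As a toy illustration of that scheme, here is a minimal sketch (with an entirely hypothetical oracle interface, and setting aside the safety caveat above) of wrapping a question-answering oracle into a crude planner:

    def plan_with_oracle(ask_oracle, goal, steps=5):
        # ask_oracle: a hypothetical function mapping a natural-language question to an answer.
        plan = []
        for _ in range(steps):
            question = ("Given that I have already done the following: %s, "
                        "which thing should I do next to best forward this goal: %s?" % (plan, goal))
            plan.append(ask_oracle(question))
        return plan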

5. Inverse reinforcement learning and goal inference are approaches to having machines discover goals by observing actions—these could be useful for instilling our own goals into machines (as has been observed before).

6. If you are interested in whether values are really so complex, Eliezer has written about it. Toby Ord responds critically to the general view around the LessWrong community that value is extremely likely to be complex, pointing out that this thesis is closely related to anti-realism—a relatively unpopular view among academic philosophers—so overall people shouldn't be that confident. Lots of debate ensues.

In-depth investigations

If you are particularly interested in these topics, and want to do further research, these are a few plausible directions, some inspired by Luke Muehlhauser's list, which contains many suggestions related to parts of Superintelligence. These projects could be attempted at various levels of depth.

  1. How can we efficiently formally specify human values? This includes for instance how to efficiently collect data on human values and how to translate it into a precise specification (and at the meta-level, how to be confident that it is correct).
  2. Are there other plausible approaches to instil desirable values into a machine, beyond those listed in this chapter?
  3. Investigate further the feasibility of particular approaches suggested in this chapter.
If you are interested in anything like this, you might want to mention it in the comments, and see whether other people have useful thoughts.

How to proceed

This has been a collection of notes on the chapter.  The most important part of the reading group though is discussion, which is in the comments section. I pose some questions for you there, and I invite you to add your own. Please remember that this group contains a variety of levels of expertise: if a line of discussion seems too basic or too incomprehensible, look around for one that suits you better!

Next week, we will talk about how an AI might learn about values. To prepare, read “Value learning” from Chapter 12. The discussion will go live at 6pm Pacific time next Monday 2 February. Sign up to be notified here.

Superintelligence 19: Post-transition formation of a singleton

7 KatjaGrace 20 January 2015 02:00AM

This is part of a weekly reading group on Nick Bostrom's book, Superintelligence. For more information about the group, and an index of posts so far see the announcement post. For the schedule of future topics, see MIRI's reading guide.


Welcome. This week we discuss the nineteenth section in the reading guide: Post-transition formation of a singleton. This corresponds to the last part of Chapter 11.

This post summarizes the section, and offers a few relevant notes, and ideas for further investigation. Some of my own thoughts and questions for discussion are in the comments.

There is no need to proceed in order through this post, or to look at everything. Feel free to jump straight to the discussion. Where applicable and I remember, page numbers indicate the rough part of the chapter that is most related (not necessarily that the chapter is being cited for the specific claim).

Reading: “Post-transition formation of a singleton?” from Chapter 11


Summary

  1. Even if the world remains multipolar through a transition to machine intelligence, a singleton might emerge later, for instance during a transition to a more extreme technology. (p176-7)
  2. If everything is faster after the first transition, a second transition may be more or less likely to produce a singleton. (p177)
  3. Emulations may give rise to 'superorganisms': clans of emulations who care wholly about their group. These would have an advantage because they could avoid agency problems, and make various uses of the ability to delete members. (p178-80) 
  4. Improvements in surveillance resulting from machine intelligence might allow better coordination; however, machine intelligence will also make concealment easier, and it is unclear which force will be stronger. (p180-1)
  5. Machine minds may be able to make clearer precommitments than humans, changing the nature of bargaining somewhat. Maybe this would produce a singleton. (p183-4)

Another view

Many of the ideas around superorganisms come from Carl Shulman's paper, Whole Brain Emulation and the Evolution of Superorganisms. Robin Hanson critiques it:

...It seems to me that Shulman actually offers two somewhat different arguments, 1) an abstract argument that future evolution generically leads to superorganisms, because their costs are generally less than their benefits, and 2) a more concrete argument, that emulations in particular have especially low costs and high benefits...

...On the general abstract argument, we see a common pattern in both the evolution of species and human organizations — while winning systems often enforce substantial value sharing and loyalty on small scales, they achieve much less on larger scales. Values tend to be more integrated in a single organism’s brain, relative to larger families or species, and in a team or firm, relative to a nation or world. Value coordination seems hard, especially on larger scales.

This is not especially puzzling theoretically. While there can be huge gains to coordination, especially in war, it is far less obvious just how much one needs value sharing to gain action coordination. There are many other factors that influence coordination, after all; even perfect value matching is consistent with quite poor coordination. It is also far from obvious that values in generic large minds can easily be separated from other large mind parts. When the parts of large systems evolve independently, to adapt to differing local circumstances, their values may also evolve independently. Detecting and eliminating value divergences might in general be quite expensive.

In general, it is not at all obvious that the benefits of more value sharing are worth these costs. And even if more value sharing is worth the costs, that would only imply that value-sharing entities should be a bit larger than they are now, not that they should shift to a world-encompassing extreme.

On Shulman’s more concrete argument, his suggested single-version approach to em value sharing, wherein a single central em only allows (perhaps vast numbers of) brief copies, can suffer from greatly reduced innovation. When em copies are assigned to and adapt to different tasks, there may be no easy way to merge their minds into a single common mind containing all their adaptations. The single em copy that is best at doing an average of tasks, may be much worse at each task than the best em for that task.

Shulman’s other concrete suggestion for sharing em values is “psychological testing, staged situations, and direct observation of their emulation software to form clear pictures of their loyalties.” But genetic and cultural evolution has long tried to make human minds fit well within strongly loyal teams, a task to which we seem well adapted. This suggests that moving our minds closer to a “borg” team ideal would cost us somewhere else, such as in our mental agility.

On the concrete coordination gains that Shulman sees from superorganism ems, most of these gains seem cheaply achievable via simple long-standard human coordination mechanisms: property rights, contracts, and trade. Individual farmers have long faced starvation if they could not extract enough food from their property, and farmers were often out-competed by others who used resources more efficiently.

With ems there is the added advantage that em copies can agree to the “terms” of their life deals before they are created. An em would agree that it starts life with certain resources, and that life will end when it can no longer pay to live. Yes there would be some selection for humans and ems who peacefully accept such deals, but probably much less than needed to get loyal devotion to and shared values with a superorganism.

Yes, with high value sharing ems might be less tempted to steal from other copies of themselves to survive. But this hardly implies that such ems no longer need property rights enforced. They’d need property rights to prevent theft by copies of other ems, including being enslaved by them. Once a property rights system exists, the additional cost of applying it within a set of em copies seems small relative to the likely costs of strong value sharing.

Shulman seems to argue both that superorganisms are a natural endpoint of evolution, and that ems are especially supportive of superorganisms. But at most he has shown that ems organizations may be at a somewhat larger scale, not that they would reach civilization-encompassing scales. In general, creatures who share values can indeed coordinate better, but perhaps not by much, and it can be costly to achieve and maintain shared values. I see no coordinate-by-values free lunch...

Notes

1. The natural endpoint

Bostrom says that a singleton is the natural conclusion of a long-term trend toward larger scales of political integration (p176). It seems helpful here to be more precise about what we mean by singleton. Something like a world government does seem to be a natural conclusion to long-term trends. However, this seems different from the kind of singleton I took Bostrom to be talking about previously. A world government would by default only make a certain class of decisions, for instance about global-level policies. There has been a long-term trend for the largest political units to become larger; however, there have always been smaller units as well, making different classes of decisions, down to the individual. I'm not sure how to measure the mass of decisions made by different parties, but it seems like individuals may be making more decisions more freely than ever, and the large political units have less ability than they once did to act against the will of the population. So the long-term trend doesn't seem to point to an overpowering ruler of everything.

2. How value-aligned would emulated copies of the same person be?

Bostrom doesn't say exactly how 'emulations that were wholly altruistic toward their copy-siblings' would emerge. It seems to be some combination of natural 'altruism' toward oneself and selection for people who react to copies of themselves with extreme altruism (confirmed by a longer interesting discussion in Shulman's paper). How easily one might select for such people depends on how humans generally react to being copied. In particular, whether they treat a copy like part of themselves, or merely like a very similar acquaintance.

The answer to this doesn't seem obvious. Copies seem likely to agree strongly on questions of global values, such as whether the world should be more capitalistic, or whether it is admirable to work in technology. However I expect many—perhaps most—failures of coordination come from differences in selfish values—e.g. I want me to have money, and you want you to have money. And if you copy a person, it seems fairly likely to me the copies will both still want the money themselves, more or less.

From other examples of similar people—identical twins, family, people and their future selves—it seems people are unusually altruistic to similar people, but still very far from 'wholly altruistic'. Emulation siblings would be much more similar than identical twins, but who knows how far that would move their altruism?

Shulman points out that many people hold views about personal identity that would imply that copies share identity to some extent. The translation between philosophical views and actual motivations is not always complete, however.

3. Contemporary family clans

Family-run firms are a place to get some information about the trade-off between reducing agency problems and having access to a wide range of potential employees. Given a brief perusal of the internet, it seems to be ambiguous whether they do better. One could try to separate out the factors that help them do better or worse.

4. How big a problem is disloyalty?

I wondered how big a problem insider disloyalty really was for companies and other organizations. Would it really be worth all this loyalty testing? I can't find much about it quickly, but 59% of respondents to a survey apparently said they had some kind of problem with insiders. The same report suggests that a bunch of costly initiatives such as intensive psychological testing are currently on the table to address the problem. Also, apparently it's enough of a problem for someone to be trying to solve it with mind-reading, though that probably doesn't say much.

5. AI already contributing to the surveillance-secrecy arms race

Artificial intelligence will help with surveillance sooner and more broadly than in the observation of people's motives, e.g. here and here.

6. SMBC is also pondering these topics this week



In-depth investigations

If you are particularly interested in these topics, and want to do further research, these are a few plausible directions, some inspired by Luke Muehlhauser's list, which contains many suggestions related to parts of Superintelligence. These projects could be attempted at various levels of depth.

  1. What are the present and historical barriers to coordination, between people and organizations? How much have these been lowered so far? How much difference has it made to the scale of organizations, and to productivity? How much further should we expect these barriers to be lessened as a result of machine intelligence?
  2. Investigate the implications of machine intelligence for surveillance and secrecy in more depth.
  3. Are multipolar scenarios safer than singleton scenarios? Muehlhauser suggests directions.
  4. Explore ideas for safety in a singleton scenario via temporarily multipolar AI. e.g. uploading FAI researchers (See Salamon & Shulman, “Whole Brain Emulation, as a platform for creating safe AGI.”)
  5. Which kinds of multipolar scenarios would be more likely to resolve into a singleton, and how quickly?
  6. Can we get whole brain emulation without producing neuromorphic AGI slightly earlier or shortly afterward? See section 3.2 of Eckersley & Sandberg (2013).
If you are interested in anything like this, you might want to mention it in the comments, and see whether other people have useful thoughts.

How to proceed

This has been a collection of notes on the chapter.  The most important part of the reading group though is discussion, which is in the comments section. I pose some questions for you there, and I invite you to add your own. Please remember that this group contains a variety of levels of expertise: if a line of discussion seems too basic or too incomprehensible, look around for one that suits you better!

Next week, we will talk about the 'value loading problem'. To prepare, read “The value-loading problem” through “Motivational scaffolding” from Chapter 12The discussion will go live at 6pm Pacific time next Monday 26 January. Sign up to be notified here.

Superintelligence 18: Life in an algorithmic economy

4 KatjaGrace 13 January 2015 02:00AM

This is part of a weekly reading group on Nick Bostrom's book, Superintelligence. For more information about the group, and an index of posts so far see the announcement post. For the schedule of future topics, see MIRI's reading guide.


Welcome. This week we discuss the eighteenth section in the reading guide: Life in an algorithmic economy. This corresponds to the middle of Chapter 11.

This post summarizes the section, and offers a few relevant notes, and ideas for further investigation. Some of my own thoughts and questions for discussion are in the comments.

There is no need to proceed in order through this post, or to look at everything. Feel free to jump straight to the discussion. Where applicable and I remember, page numbers indicate the rough part of the chapter that is most related (not necessarily that the chapter is being cited for the specific claim).

Reading: “Life in an algorithmic economy” from Chapter 11


Summary

  1. In a multipolar scenario, biological humans might lead poor and meager lives. (p166-7)
  2. The AIs might be worthy of moral consideration, and if so their wellbeing might be more important than that of the relatively few humans. (p167)
  3. AI minds might be much like slaves, even if they are not literally slaves. They may be selected for liking this. (p167)
  4. Because brain emulations would be very cheap to copy, it will often be convenient to make a copy and then later turn it off (in a sense killing a person). (p168)
  5. There are various other reasons that very short lives might be optimal for some applications. (p168-9)
  6. It isn't obvious whether brain emulations would be happy working all of the time. Some relevant considerations are current human emotions in general and regarding work, probable selection for pro-work individuals, evolutionary adaptiveness of happiness in the past and future (e.g. does happiness help you work harder?), and the absence of present sources of unhappiness such as injury. (p169-171)
  7. In the long run, artificial minds may not even be conscious, or have valuable experiences, if these are not the most effective ways for them to earn wages. If such minds replace humans, Earth might have an advanced civilization with nobody there to benefit. (p172-3)
  8. In the long run, artificial minds may outsource many parts of their thinking, thus becoming decreasingly differentiated as individuals. (p172)
  9. Evolution does not imply positive progress. Even those good things that evolved in the past may not withstand evolutionary selection in a new circumstance. (p174-6)

Another view

Robin Hanson on others' hasty distaste for a future of emulations: 

Parents sometimes disown their children, on the grounds that those children have betrayed key parental values. And if parents have the sort of values that kids could deeply betray, then it does make sense for parents to watch out for such betrayal, ready to go to extremes like disowning in response.

But surely parents who feel inclined to disown their kids should be encouraged to study their kids carefully before making such a choice. For example, parents considering whether to disown their child for refusing to fight a war for their nation, or for working for a cigarette manufacturer, should wonder to what extent national patriotism or anti-smoking really are core values, as opposed to being mere revisable opinions they collected at one point in support of other more-core values. Such parents would be wise to study the lives and opinions of their children in some detail before choosing to disown them.

I’d like people to think similarly about my attempts to analyze likely futures. The lives of our descendants in the next great era after this our industry era may be as different from ours as ours are from farmers', or farmers' are from foragers'. When they have lived as neighbors, foragers have often strongly criticized farmer culture, as farmers have often strongly criticized industry culture. Surely many have been tempted to disown any descendants who adopted such despised new ways. And while such disowning might hold them true to core values, if asked we would advise them to consider the lives and views of such descendants carefully, in some detail, before choosing to disown.

Similarly, many who live industry era lives and share industry era values, may be disturbed to see forecasts of descendants with life styles that appear to reject many values they hold dear. Such people may be tempted to reject such outcomes, and to fight to prevent them, perhaps preferring a continuation of our industry era to the arrival of such a very different era, even if that era would contain far more creatures who consider their lives worth living, and be far better able to prevent the extinction of Earth civilization. And such people may be correct that such a rejection and battle holds them true to their core values.

But I advise such people to first try hard to see this new era in some detail from the point of view of its typical residents. See what they enjoy and what fills them with pride, and listen to their criticisms of your era and values. I hope that my future analysis can assist such soul-searching examination. If after studying such detail, you still feel compelled to disown your likely descendants, I cannot confidently say you are wrong. My job, first and foremost, is to help you see them clearly.

More on whose lives are worth living here and here.

Notes

1. Robin Hanson is probably the foremost researcher on what the finer details of an economy of emulated human minds would be like. For instance, which company employees would run how fast, how big cities would be, whether people would hang out with their copies. See a TEDx talk, and writings here, here, here and here (some overlap - sorry). He is also writing a book on the subject, which you can read early if you ask him.

2. Bostrom says,

Life for biological humans in a post-transition Malthusian state need not resemble any of the historical states of man...the majority of humans in this scenario might be idle rentiers who eke out a marginal living on their savings. They would be very poor, yet derive what little income they have from savings or state subsidies. They would live in a world with  extremely advanced technology, including not only superintelligent machines but also anti-aging medicine, virtual reality, and various enhancement technologies and pleasure drugs: yet these might be generally unaffordable....(p166)

It's true this might happen, but it doesn't seem like an especially likely scenario to me. As Bostrom has pointed out in various places earlier, biological humans would do quite well if they have some investments in capital, do not have too much of their property stolen or artfully maneuvered away from them, and do not themselves undergo too much population growth. These risks don't seem so large to me.

3. Paul Christiano has an interesting article on capital accumulation in a world of machine intelligence.

4. In discussing worlds of brain emulations, we often talk about selecting people for having various characteristics - for instance, being extremely productive, hard-working, not minding frequent 'death', being willing to work for free and donate any proceeds to their employer (p167-8). However there are only so many humans to select from, so we can't necessarily select for all the characteristics we might want. Bostrom also talks of using other motivation selection methods, and modifying code, but it is interesting to ask how far you could get using only selection. It is not obvious to what extent one could meaningfully modify brain emulation code initially. 

I'd guess less than one in a thousand people would be willing to donate everything to their employer, given a random employer. This means that to get this characteristic, you would have to lose a factor of 1000 on selecting for other traits. Altogether you have about 33 bits of selection power in the present world (that is, 7 billion is about 2^33; you can divide the world in half about 33 times before you get to a single person). Let's suppose you use 5 bits in getting someone who both doesn't mind their copies dying (I guess 1 bit, or half of people) and who is willing to work an 80-hour week (I guess 4 bits, or one in sixteen people). Let's suppose you are using the rest of your selection (28 bits) on intelligence, for the sake of argument. You are getting a person of IQ 186. If instead you use 10 bits (2^10 = ~1000) on getting someone to donate all their money to their employer, you can only use 18 bits on intelligence, getting a person of IQ 167. Would it not often be better to have the worker who is twenty IQ points smarter and pay them above subsistence? (A rough version of this arithmetic is sketched below.)
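
Here is a rough reconstruction of that arithmetic, assuming IQ is normally distributed with mean 100 and standard deviation 15 (the function name is just for illustration):

    from scipy.stats import norm

    def iq_from_selection_bits(bits):
        # Selecting the top 1-in-2^bits person from a normal distribution with mean 100, sd 15.
        top_fraction = 2.0 ** -bits
        return 100 + 15 * norm.isf(top_fraction)

    print(iq_from_selection_bits(28))  # ~186: all 28 remaining bits spent on intelligence
    print(iq_from_selection_bits(18))  # ~167: after spending 10 more bits on willingness to donate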

5. A variety of valuable uses for cheap to copy, short-lived brain emulations are discussed in Whole brain emulation and the evolution of superorganisms, LessWrong discussion on the impact of whole brain emulation, and Robin's work cited above.

6. Anders Sandberg writes about moral implications of emulations of animals and humans.

In-depth investigations

If you are particularly interested in these topics, and want to do further research, these are a few plausible directions, some inspired by Luke Muehlhauser's list, which contains many suggestions related to parts of Superintelligence. These projects could be attempted at various levels of depth.

  1. Is the first functional whole brain emulation likely to be (1) an emulation of low-level functionality that doesn’t require much understanding of human cognitive neuroscience at the computational level, as described in Sandberg & Bostrom (2008), or is it more likely to be (2) an emulation that makes heavy use of advanced human cognitive neuroscience, as described by (e.g.) Ken Hayworth, or is it likely to be (3) something else?
  2. Extend and update our understanding of when brain emulations might appear (see Sandberg & Bostrom (2008)).
  3. Investigate the likelihood of a multipolar outcome.
  4. Follow Robin Hanson (see above) in working out the social implications of an emulation scenario.
  5. What kinds of responses to the default low-regulation multipolar outcome outlined in this section are likely to be made? e.g. is any strong regulation likely to emerge that avoids the features detailed in the current section?
  6. What measures are useful for ensuring good multipolar outcomes?
  7. What qualitatively different kinds of multipolar outcomes might we expect? e.g. brain emulation outcomes are one class.
If you are interested in anything like this, you might want to mention it in the comments, and see whether other people have useful thoughts.

How to proceed

This has been a collection of notes on the chapter.  The most important part of the reading group though is discussion, which is in the comments section. I pose some questions for you there, and I invite you to add your own. Please remember that this group contains a variety of levels of expertise: if a line of discussion seems too basic or too incomprehensible, look around for one that suits you better!

Next week, we will talk about the possibility of a multipolar outcome turning into a singleton later. To prepare, read “Post-transition formation of a singleton?” from Chapter 11. The discussion will go live at 6pm Pacific time next Monday 19 January. Sign up to be notified here.

Superintelligence 17: Multipolar scenarios

4 KatjaGrace 06 January 2015 06:44AM

This is part of a weekly reading group on Nick Bostrom's book, Superintelligence. For more information about the group, and an index of posts so far see the announcement post. For the schedule of future topics, see MIRI's reading guide.


Welcome. This week we discuss the seventeenth section in the reading guide: Multipolar scenarios. This corresponds to the first part of Chapter 11.

Apologies for putting this up late. I am traveling, and collecting together the right combination of electricity, wifi, time, space, and permission from an air hostess to take out my computer was more complicated than the usual process.

This post summarizes the section, and offers a few relevant notes, and ideas for further investigation. Some of my own thoughts and questions for discussion are in the comments.

There is no need to proceed in order through this post, or to look at everything. Feel free to jump straight to the discussion. Where applicable and I remember, page numbers indicate the rough part of the chapter that is most related (not necessarily that the chapter is being cited for the specific claim).

Reading: “Of horses and men” from Chapter 11


Summary

  1. 'Multipolar scenario': a situation where no single agent takes over the world
  2. A multipolar scenario may arise naturally, or intentionally for reasons of safety. (p159)
  3. Knowing what would happen in a multipolar scenario involves analyzing an extra kind of information beyond that needed for analyzing singleton scenarios: that about how agents interact (p159)
  4. In a world characterized by cheap human substitutes, rapidly introduced, in the presence of low regulation, and strong protection of property rights, here are some things that will likely happen: (p160)
    1. Human labor will earn wages at around the price of the substitutes - perhaps below subsistence level for a human. Note that machines have been complements to human labor for some time, raising wages. One should still expect them to become substitutes at some point and reverse this trend.  (p160-61)
    2. Capital (including AI) will earn all of the income, which will be a lot. Humans who own capital will become very wealthy. Humans who do not own capital may be helped with a small fraction of others' wealth, through charity or redistribution. (p161-3)
    3. If the humans, brain emulations or other AIs receive resources from a common pool when they are born or created, the population will likely increase until it is constrained by resources. This is because of selection for entities that tend to reproduce more. (p163-6) This will happen anyway eventually, but AI would make it faster, because reproduction is so much faster for programs than for humans. This outcome can be avoided by offspring receiving resources from their parents' purses.

Another view

Tyler Cowen expresses a different view (video, some transcript):

The other point I would make is I think smart machines will always be complements and not substitutes, but it will change who they’re complementing. So I was very struck by this woman who was a doctor sitting here a moment ago, and I fully believe that her role will not be replaced by machines. But her role didn’t sound to me like a doctor. It sounded to me like therapist, friend, persuader, motivational coach, placebo effect, all of which are great things. So the more you have these wealthy patients out there, the patients are in essence the people who work with the smart machines and augment their power, those people will be extremely wealthy. Those people will employ in many ways what you might call personal servants. And because those people are so wealthy, those personal servants will also earn a fair amount.

So the gains from trade are always there, there’s still a law of comparative advantage. I think people who are very good at working with the machines will earn much much more. And the others of us will need to find different kinds of jobs. But again if total output goes up, there’s always an optimistic scenario.

Though perhaps his view isn't as different as it sounds.

Notes

1. The small space devoted to multipolar outcomes in Superintelligence probably doesn't reflect a broader consensus that a singleton is more likely or more important. Robin Hanson is perhaps the loudest proponent of the 'multipolar outcomes are more likely' position. e.g. in The Foom Debate and more briefly here. This week is going to be fairly Robin Hanson themed in fact.

2. Automation can both increase the value produced by a human worker (complementing human labor) and replace the human worker altogether (substituting human labor). Over the long term, it seems complementarity has been the overall effect. However, by the time a machine can do everything a human can do, it is hard to imagine a human earning more than such a machine costs to run, i.e. less than humans earn now. Thus at some point substitution must take over. Some think recent unemployment is due in large part to automation. Some think this time is the beginning of the end, and the jobs will never return to humans. Others disagree, and are making bets. Eliezer Yudkowsky and John Danaher clarify some arguments. Danaher adds a nice diagram:

3. Various policies have been proposed to resolve poverty from widespread permanent technological unemployment. Here is a list, though it seems to miss a straightforward one: investing ahead of time in the capital that will become profitable instead of one's own labor, or having policies that encourage such diversification. Not everyone has resources to invest in capital, but it might still help many people. Mentioned here and here:

And then there are more extreme measures. Everyone is born with an endowment of labor; why not also an endowment of capital? What if, when each citizen turns 18, the government bought him or her a diversified portfolio of equity? Of course, some people would want to sell it immediately, cash out, and party, but this could be prevented with some fairly light paternalism, like temporary "lock-up" provisions. This portfolio of capital ownership would act as an insurance policy for each human worker; if technological improvements reduced the value of that person's labor, he or she would reap compensating benefits through increased dividends and capital gains. This would essentially be like the kind of socialist land reforms proposed in highly unequal Latin American countries, only redistributing stock instead of land.

4. Even if the income implications of total unemployment are sorted out, some are concerned about the psychological and social consequences. According to Voltaire, 'work saves us from three great evils: boredom, vice and need'. Sometimes people argue that even if our work is economically worthless, we should toil away for our own good, lest the vice and boredom overcome us.

I find this unlikely, given for instance the ubiquity of more fun and satisfying things to do than most jobs. And while obsolescence and the resulting loss of purpose may be psychologically harmful, I doubt a purposeless job solves that. Also, people already have a variety of satisfying purposes in life other than earning a living. Note also that people in situations like college and lives of luxury seem to do ok on average. I'd guess that unemployed people and some retirees do less well, but this seems more plausibly to stem from losing a previously significant source of purpose and respect, rather than from lack of entertainment and constraint. And in a world where nobody gets respect from bringing home dollars, and other purposes are common, I doubt either of these costs will persist. But this is all speculation.

On a side note, the kinds of vices that are usually associated with not working tend to be vices of parasitic unproductivity, such as laziness, profligacy, and tendency toward weeklong video game stints. In a world where human labor is worthless, these heuristics for what is virtuous or not might be outdated.

Nils Nilsson discusses this issue more, along with the problem of humans not earning anything.

5. What happens when selection for expansive tendencies goes to space? This.

6.  A kind of robot that may change some job markets:


(picture by Steve Jurvetson)

In-depth investigations

If you are particularly interested in these topics, and want to do further research, these are a few plausible directions, some inspired by Luke Muehlhauser's list, which contains many suggestions related to parts of Superintelligence. These projects could be attempted at various levels of depth.

  1. How likely is one superintelligence, versus many intelligences? What empirical data bears on this question? Bostrom briefly investigated characteristic time lags between large projects, for instance on p80-81.
  2. Are whole brain emulations likely to come first? This might be best approached by estimating timelines for different technologies (each an ambitious project) and comparing them, or there may be ways to factor out some considerations.
  3. What are the long term trends in automation replacing workers?
  4. What else can we know about the effects of automation on employment? (this seems to have a fair literature)
  5. What levels of population growth would be best in the long run, given machine intelligences? (this sounds like an ethics question, but one could also assume some kind of normal human values and investigate the empirical considerations that would make situations better or worse in their details.)
  6. Are there good ways to avoid Malthusian outcomes in the kind of scenario discussed in this section, if 'as much as possible' is not the answer to the previous question?
  7. What policies might help a society deal with permanent, almost complete unemployment caused by AI progress?
If you are interested in anything like this, you might want to mention it in the comments, and see whether other people have useful thoughts.

How to proceed

This has been a collection of notes on the chapter.  The most important part of the reading group though is discussion, which is in the comments section. I pose some questions for you there, and I invite you to add your own. Please remember that this group contains a variety of levels of expertise: if a line of discussion seems too basic or too incomprehensible, look around for one that suits you better!

Next week, we will talk about 'life in an algorithmic economy'. To prepare, read the section of that name in Chapter 11. The discussion will go live at 6pm Pacific time next Monday 12 January. Sign up to be notified here.

Superintelligence 16: Tool AIs

7 KatjaGrace 30 December 2014 02:00AM

This is part of a weekly reading group on Nick Bostrom's book, Superintelligence. For more information about the group, and an index of posts so far see the announcement post. For the schedule of future topics, see MIRI's reading guide.


Welcome. This week we discuss the sixteenth section in the reading guide: Tool AIs. This corresponds to the last parts of Chapter Ten.

This post summarizes the section, and offers a few relevant notes, and ideas for further investigation. Some of my own thoughts and questions for discussion are in the comments.

There is no need to proceed in order through this post, or to look at everything. Feel free to jump straight to the discussion. Where applicable and I remember, page numbers indicate the rough part of the chapter that is most related (not necessarily that the chapter is being cited for the specific claim).

Reading: “Tool-AIs” and “Comparison” from Chapter 10


Summary

  1. Tool AI: an AI that is not 'like an agent', but more like an excellent version of contemporary software. Most notably perhaps, it is not goal-directed (p151)
  2. Contemporary software may be safe because it has low capability rather than because it reliably does what you want, suggesting a very smart version of contemporary software would be dangerous (p151)
  3. Humans often want to figure out how to do a thing that they don't already know how to do. Narrow AI is already used to search for solutions. Automating this search seems to mean giving the machine a goal (that of finding a great way to make paperclips, for instance). That is, just carrying out a powerful search seems to have many of the problems of AI. (p152)
  4. A machine intended to be a tool may cause similar problems to a machine intended to be an agent, by searching to produce plans that are perverse instantiations, infrastructure profusions or mind crimes. It may either carry them out itself or give the plan to a human to carry out. (p153)
  5. A machine intended to be a tool may have agent-like parts. This could happen if its internal processes need to be optimized, and so it contains strong search processes for doing this. (p153)
  6. If tools are likely to accidentally be agent-like, it would probably be better to just build agents on purpose and have more intentional control over the design. (p155)
  7. Which castes of AI are safest is unclear and depends on circumstances. (p158) 

Another view

Holden Karnofsky prompted discussion of tool AI in 2012, in one of several Thoughts on the Singularity Institute:

...Google Maps is a type of artificial intelligence (AI). It is far more intelligent than I am when it comes to planning routes.

Google Maps - by which I mean the complete software package including the display of the map itself - does not have a "utility" that it seeks to maximize. (One could fit a utility function to its actions, as to any set of actions, but there is no single "parameter to be maximized" driving its operations.)

Google Maps (as I understand it) considers multiple possible routes, gives each a score based on factors such as distance and likely traffic, and then displays the best-scoring route in a way that makes it easily understood by the user. If I don't like the route, for whatever reason, I can change some parameters and consider a different route. If I like the route, I can print it out or email it to a friend or send it to my phone's navigation application. Google Maps has no single parameter it is trying to maximize; it has no reason to try to "trick" me in order to increase its utility.

In short, Google Maps is not an agent, taking actions in order to maximize a utility parameter. It is a tool, generating information and then displaying it in a user-friendly manner for me to consider, use and export or discard as I wish.

Every software application I know of seems to work essentially the same way, including those that involve (specialized) artificial intelligence such as Google Search, Siri, Watson, Rybka, etc. Some can be put into an "agent mode" (as Watson was on Jeopardy!) but all can easily be set up to be used as "tools" (for example, Watson can simply display its top candidate answers to a question, with the score for each, without speaking any of them.)

The "tool mode" concept is importantly different from the possibility of Oracle AI sometimes discussed by SI. The discussions I've seen of Oracle AI present it as an Unfriendly AI that is "trapped in a box" - an AI whose intelligence is driven by an explicit utility function and that humans hope to control coercively. Hence the discussion of ideas such as the AI-Box Experiment. A different interpretation, given in Karnofsky/Tallinn 2011, is an AI with a carefully designed utility function - likely as difficult to construct as "Friendliness" - that leaves it "wishing" to answer questions helpfully. By contrast with both these ideas, Tool-AGI is not "trapped" and it is not Unfriendly or Friendly; it has no motivations and no driving utility function of any kind, just like Google Maps. It scores different possibilities and displays its conclusions in a transparent and user-friendly manner, as its instructions say to do; it does not have an overarching "want," and so, as with the specialized AIs described above, while it may sometimes "misinterpret" a question (thereby scoring options poorly and ranking the wrong one #1) there is no reason to expect intentional trickery or manipulation when it comes to displaying its results.

Another way of putting this is that a "tool" has an underlying instruction set that conceptually looks like: "(1) Calculate which action A would maximize parameter P, based on existing data set D. (2) Summarize this calculation in a user-friendly manner, including what Action A is, what likely intermediate outcomes it would cause, what other actions would result in high values of P, etc." An "agent," by contrast, has an underlying instruction set that conceptually looks like: "(1) Calculate which action, A, would maximize parameter P, based on existing data set D. (2) Execute Action A." In any AI where (1) is separable (by the programmers) as a distinct step, (2) can be set to the "tool" version rather than the "agent" version, and this separability is in fact present with most/all modern software. Note that in the "tool" version, neither step (1) nor step (2) (nor the combination) constitutes an instruction to maximize a parameter - to describe a program of this kind as "wanting" something is a category error, and there is no reason to expect its step (2) to be deceptive.

I elaborated further on the distinction and on the concept of a tool-AI in Karnofsky/Tallinn 2011.

This is important because an AGI running in tool mode could be extraordinarily useful but far more safe than an AGI running in agent mode...
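
Holden's tool/agent distinction can be restated as a short sketch (hypothetical names throughout): step (1), finding the action that maximizes a parameter, is identical in both modes; only step (2) differs.

    def best_action(actions, data, score):
        # Step (1): calculate which action would maximize parameter P, based on data set D.
        return max(actions, key=lambda a: score(a, data))

    def run_as_tool(actions, data, score, display):
        a = best_action(actions, data, score)
        display(a, score(a, data))   # Step (2), tool mode: summarize the result for the user.

    def run_as_agent(actions, data, score, execute):
        a = best_action(actions, data, score)
        execute(a)                   # Step (2), agent mode: carry the action out directly.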

Notes

1. While Holden's post was probably not the first to discuss this kind of AI, it prompted many responses. Eliezer basically said that non-catastrophic tool AI doesn't seem that easy to specify formally; that even if tool AI is best, agent-AI researchers are probably pretty useful for that problem; and that it's not so bad for MIRI not to discuss tool AI more, since there are a bunch of things other people think are similarly obviously in need of discussion. Luke basically agreed with Eliezer. Stuart argues that having a tool clearly communicate possibilities is a hard problem, and talks about some other problems. Commenters say many things, including that only one AI needs to be agent-like to have a problem, and that it's not clear what it means for a powerful optimizer to not have goals.

2. A problem often brought up with powerful AIs is that when tasked with communicating, they will try to deceive you into liking plans that will fulfil their goals. It seems to me that you can avoid such deception problems by using a tool which searches for a plan you could do that would produce a lot of paperclips, rather than a tool that searches for a string that it could say to you that would produce a lot of paperclips. A plan that produces many paperclips but sounds so bad that you won't do it still does better than a persuasive lower-paperclip plan on the proposed metric. There is still a danger that you just won't notice the perverse way in which the instructions suggested to you will be instantiated, but at least the plan won't be designed to hide it.
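
A minimal sketch of that contrast, with hypothetical scoring functions standing in for 'number of paperclips produced':

    def plan_searching_tool(candidate_plans, paperclips_if_human_executes):
        # Scores each plan by its consequences if the human chose to carry it out.
        return max(candidate_plans, key=paperclips_if_human_executes)

    def persuasion_searching_tool(candidate_strings, paperclips_after_human_hears):
        # Scores each output string by the consequences of the human merely hearing it,
        # which rewards persuasive (possibly deceptive) strings rather than good plans.
        return max(candidate_strings, key=paperclips_after_human_hears)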

3. Note that in computer science, an 'agent' means something other than 'a machine with a goal', though it seems they haven't settled on exactly what [some example efforts (pdf)].

Figure: A 'simple reflex agent' is not goal directed (but kind of looks goal-directed: one in action)

4. Bostrom seems to assume that a powerful tool would be a search process. This is related to the idea that intelligence is an 'optimization process'. But this is more of a definition than an empirical relationship between the kinds of technology we are thinking of as intelligent and the kinds of processes we think of as 'searching'. Could there be things that merely contribute massively to the intelligence of a human - such that we would think of them as very intelligent tools - that naturally forward whatever goals the human has?

One can imagine a tool that is told what you are planning to do, and tries to describe the major consequences of it. This is a search or optimization process in the sense that it outputs something improbably apt from a large space of possible outputs, but that quality alone seems not enough to make something dangerous. For one thing, the machine is not selecting outputs for their effect on the world, but rather for their accuracy as descriptions. For another, the process being run may not be an actual 'search' in the sense of checking lots of things and finding one that does well on some criteria. It could for instance perform a complicated transformation on the incoming data and spit out the result. 

5. One obvious problem with tools is that they maintain humans as a component in all goal-directed behavior. If humans are some combination of slow and rare compared to artificial intelligence, there may be strong pressure to automate all aspects of decisionmaking, i.e. use agents.

In-depth investigations

If you are particularly interested in these topics, and want to do further research, these are a few plausible directions, some inspired by Luke Muehlhauser's list, which contains many suggestions related to parts of Superintelligence. These projects could be attempted at various levels of depth.

 

  1. Would powerful tools necessarily become goal-directed agents in the troubling sense?
  2. Are different types of entity generally likely to become optimizers, if they are not already? If so, which ones? Under what dynamics? Are tool-ish or Oracle-ish things stable attractors in this way?
  3. Can we specify communication behavior in a way that doesn't rely on having goals about the interlocutor's internal state or behavior?
  4. If you assume (perhaps impossibly) strong versions of some narrow-AI capabilities, can you design a safe tool which uses them? e.g. If you had a near perfect predictor, can you design a safe super-Google Maps?

 

If you are interested in anything like this, you might want to mention it in the comments, and see whether other people have useful thoughts.

How to proceed

This has been a collection of notes on the chapter.  The most important part of the reading group though is discussion, which is in the comments section. I pose some questions for you there, and I invite you to add your own. Please remember that this group contains a variety of levels of expertise: if a line of discussion seems too basic or too incomprehensible, look around for one that suits you better!

Next week, we will talk about multipolar scenarios - i.e. situations where a single AI doesn't take over the world. To prepare, read “Of horses and men” from Chapter 11. The discussion will go live at 6pm Pacific time next Monday 5 January. Sign up to be notified here.

Superintelligence 15: Oracles, genies and sovereigns

6 KatjaGrace 23 December 2014 02:01AM

This is part of a weekly reading group on Nick Bostrom's book, Superintelligence. For more information about the group, and an index of posts so far see the announcement post. For the schedule of future topics, see MIRI's reading guide.


Welcome. This week we discuss the fifteenth section in the reading guide: Oracles, genies, and sovereigns. This corresponds to the first part of Chapter ten.

This post summarizes the section, and offers a few relevant notes, and ideas for further investigation. Some of my own thoughts and questions for discussion are in the comments.

There is no need to proceed in order through this post, or to look at everything. Feel free to jump straight to the discussion. Where applicable and I remember, page numbers indicate the rough part of the chapter that is most related (not necessarily that the chapter is being cited for the specific claim).

Reading: “Oracles” and “Genies and Sovereigns” from Chapter 10


Summary

  1. Strong AIs might come in different forms or 'castes', such as oracles, genies, sovereigns and tools. (p145) 
  2. Oracle: an AI that does nothing but answer questions. (p145)
    1. The ability to make a good oracle probably allows you to make a generally capable AI. (p145)
    2. Narrow superintelligent oracles exist: e.g. calculators. (p145-6)
    3. An oracle could be a non-agentlike 'tool' (more next week) or it could be a rational agent constrained to only act through answering questions (p146)
    4. There are various ways to try to constrain an oracle, through motivation selection (see last week) and capability control (see the previous week) (p146-7)
    5. An oracle whose goals are not aligned with yours might still be useful (p147-8)
    6. An oracle might be misused, even if it works as intended (p148)
  3. Genie: an AI that carries out a high level command, then waits for another. (p148)
    1. It would be nice if a genie sought to understand and obey your intentions, rather than your exact words. (p149)
  4. Sovereign: an AI that acts autonomously in the world, in pursuit of potentially long range objectives (p148)
  5. A genie or a sovereign might have preview functionality, where it describes what it will do before doing it. (p149)
  6. A genie seems more dangerous than an oracle: if you are going to strongly physically contain the genie anyway, you might as well deny it that much access to the world and just ask for blueprints rather than actions, i.e. use an oracle. (p148)
  7. The line between genies and sovereigns is fine. (p149)
  8. All of the castes could emulate all of the other castes more or less, so they do not differ in their ultimate capabilities. However they represent different approaches to the control problem. (p150)
  9. The ordering of safety of these castes is not as obvious as it may seem, once we consider factors such as dependence on a single human, and added dangers of creating strong agents whose goals don't match our own (even if they are tame 'domesticated' goals). (p150)

Another view

An old response to suggestions of oracle AI, from Eliezer Yudkowsky (I don't know how closely this matches his current view):

When someone reinvents the Oracle AI, the most common opening remark runs like this:

"Why not just have the AI answer questions, instead of trying to do anything?  Then it wouldn't need to be Friendly.  It wouldn't need any goals at all.  It would just answer questions."

To which the reply is that the AI needs goals in order to decide how to think: that is, the AI has to act as a powerful optimization process in order to plan its acquisition of knowledge, effectively distill sensory information, pluck "answers" to particular questions out of the space of all possible responses, and of course, to improve its own source code up to the level where the AI is a powerful intelligence.  All these events are "improbable" relative to random organizations of the AI's RAM, so the AI has to hit a narrow target in the space of possibilities to make superintelligent answers come out.

Now, why might one think that an Oracle didn't need goals?  Because on a human level, the term "goal" seems to refer to those times when you said, "I want to be promoted", or "I want a cookie", and when someone asked you "Hey, what time is it?" and you said "7:30" that didn't seem to involve any goals.  Implicitly, you wanted to answer the question; and implicitly, you had a whole, complicated, functionally optimized brain that let you answer the question; and implicitly, you were able to do so because you looked down at your highly optimized watch, that you bought with money, using your skill of turning your head, that you acquired by virtue of curious crawling as an infant.  But that all takes place in the invisible background; it didn't feel like you wanted anything.

Thanks to empathic inference, which uses your own brain as an unopened black box to predict other black boxes, it can feel like "question-answering" is a detachable thing that comes loose of all the optimization pressures behind it - even the existence of a pressure to answer questions!

Notes

1. What are the axes we are talking about?

This chapter talks about different types or 'castes' of AI. But there are lots of different ways you could divide up kinds of AI (e.g. earlier we saw brain emulations vs. synthetic AI). So in what ways are we dividing them here? They are related to different approaches to the control problem, but don't appear to be straightforwardly defined by them.

It seems to me we are looking at something close to these two axes: 

  • Goal-directedness: the extent to which the AI acts in accordance with a set of preferences (instead of for instance reacting directly to stimuli, or following rules without regard to consequence)
  • Oversight: the degree to which humans have an ongoing say in what the AI does (instead of the AI making all decisions itself)

The castes fit on these axes something like this:

They don't quite neatly fit -- tools are spread between two places, and oracles are a kind of tool (or a kind of genie if they are of the highly constrained agent variety). But I find this a useful way to think about these kinds of AI.

Note that when we think of 'tools', we usually think of them having a lot of oversight - that is, being used by a human, who is making decisions all the time. However you might also imagine what I have called 'autonomous tools', which run on their own but aren't goal directed. For instance an AI that continually reads scientific papers and turns out accurate and engaging science books, without particularly optimizing for doing this more efficiently or trying to get any particular outcome. 

We have two weeks on this chapter, so I think it will be good to focus a bit on goal directedness one week and oversight the other, alongside the advertised topics of specific castes. So this week let's focus on oversight, since tools (next week) primarily differ from the other castes mentioned in not being goal-directed.

2. What do goal-directedness and oversight have to do with each other?

Why consider goal-directedness and oversight together? It seems to me there are a couple of reasons.

Goal-directedness and oversight are substitutes, broadly. The more you direct a machine, the less it needs to direct itself. Somehow the machine has to assist with some goals, so either you or the machine needs to care about those goals and direct the machine according to them. The 'autonomous tools' I mentioned appear to be exceptions, but they only seem plausible for a limited range of tasks where minimal goal direction is needed beyond what a designer can do ahead of time.

Another way goal-directedness and oversight are connected is that we might expect both to change as we become better able to align an AI's goals with our own. In order for an AI to be aligned with our goals, the AI must naturally be goal-directed. Also, better alignment should make oversight less necessary.

3. A note on names

'Sovereign AI' sounds powerful and far reaching. Note that more mundane AIs would also fit under this category. For instance, an AI that works at an office and doesn't take over the world would also be a sovereign AI. You would be a sovereign AI if you were artificial.

4. Costs of oversight

Bostrom discussed some problems with genies. I'll mention a few others. 

One clear downside of a machine which follows your instructions and awaits your consent is that you have to be there giving instructions and consenting to things. In a world full of powerful AIs which needed such oversight, there might be plenty of spare human labor around to do this at the start, if each AI doesn't need too much oversight. However a need for human oversight might bottleneck the proliferation of such AIs.

Another downside of using human labor beyond the cost to the human is that it might be prohibitively slow, depending on the oversight required. If you only had to check in with the AI daily, and it did unimaginably many tasks the rest of the time, oversight probably wouldn't be a great cost. However if  you had to be in the loop and fully understand the decision every time the AI chose how to allocate its internal resources, things could get very slow.

Even if these costs are minor compared to the value of avoiding catastrophe, they may be too large to allow well overseen AIs to compete with more autonomous AIs. Especially if the oversight is mostly to avoid low probability terrible outcomes. 

5. How useful is oversight? 

Suppose you have a genie that doesn't totally understand human values, but tries hard to listen to you and explain things and do what it thinks you want. How useful is it that you can interact with this genie and have a say in what it does rather than it just being a sovereign?

If the genie's understanding of your values is wrong such that its intended actions will bring about a catastrophe, it's not clear that the genie can describe the outcome to you such that you will notice this. The future is potentially pretty big and complicated, especially compared to your brain, or a short conversation between you and a genie. So the genie would need to summarize a lot. For you to notice the subtle details that would make the future worthless (remember that the genie basically understands your values, so they are probably not really blatant details) the genie will need to direct your attention to them. So your situation would need to be in a middle ground where the AI knew about some features of a potential future that might bother you (so that it could point them out), but wasn't sure if you really would hate them. It seems hard for the AI giving you a 'preview' to help if the AI is just wrong about your values and doesn't know how it is wrong.

6. More on oracles

Thinking inside the box seems to be the main paper on the topic. Christiano's post on how to use an unfriendly AI is again relevant to how you might use an oracle.

 

In-depth investigations

If you are particularly interested in these topics, and want to do further research, these are a few plausible directions, some inspired by Luke Muehlhauser's list, which contains many suggestions related to parts of Superintelligence. These projects could be attempted at various levels of depth.

  1. How stable are the castes? Bostrom mentioned that these castes mostly have equivalent long-run capabilities, because they can be used to make one another. A related question is how likely they are to turn into one another. Another related question is how likely an attempt to create one is to lead to a different one (e.g. Yudkowsky's view above suggests that if you try to make an oracle, it might end up being a sovereign). Another related question is which ones are likely to win out if they were developed in parallel and available for similar applications (e.g. how well would genies prosper in a world with many sovereigns?).
  2. How useful is oversight likely to be? (e.g. At what scale might it be necessary? Could an AI usefully communicate its predictions to you such that you can evaluate the outcomes of decisions? Is there likely to be direct competition between AIs which are overseen by people and those that are not?)
  3. Are non-goal-directed oracles likely to be feasible?

If you are interested in anything like this, you might want to mention it in the comments, and see whether other people have useful thoughts.

How to proceed

This has been a collection of notes on the chapter.  The most important part of the reading group though is discussion, which is in the comments section. I pose some questions for you there, and I invite you to add your own. Please remember that this group contains a variety of levels of expertise: if a line of discussion seems too basic or too incomprehensible, look around for one that suits you better!

Next week, we will talk about the last caste of this chapter: the tool AI. To prepare, read “Tool-AIs” and “Comparison” from Chapter 10. The discussion will go live at 6pm Pacific time next Monday December 29. Sign up to be notified here.

Superintelligence 14: Motivation selection methods

5 KatjaGrace 16 December 2014 02:00AM

This is part of a weekly reading group on Nick Bostrom's book, Superintelligence. For more information about the group, and an index of posts so far see the announcement post. For the schedule of future topics, see MIRI's reading guide.


Welcome. This week we discuss the fourteenth section in the reading guide: Motivation selection methods. This corresponds to the second part of Chapter Nine.

This post summarizes the section, and offers a few relevant notes, and ideas for further investigation. Some of my own thoughts and questions for discussion are in the comments.

There is no need to proceed in order through this post, or to look at everything. Feel free to jump straight to the discussion. Where applicable and I remember, page numbers indicate the rough part of the chapter that is most related (not necessarily that the chapter is being cited for the specific claim).

Reading: “Motivation selection methods” and “Synopsis” from Chapter 9.


Summary

  1. One way to control an AI is to design its motives. That is, to choose what it wants to do (p138)
  2. Some varieties of 'motivation selection' for AI safety:
    1. Direct specification: figure out what we value, and code it into the AI (p139-40)
      1. Isaac Asimov's 'three laws of robotics' are a famous example
      2. Direct specification might be fairly hard: both figuring out what we want and coding it precisely seem hard
      3. This could be based on rules, or something like consequentialism
    2. Domesticity: the AI's goals limit the range of things it wants to interfere with (140-1)
      1. This might make direct specification easier, as the world the AI interacts with (and thus which has to be thought of in specifying its behavior) is simpler.
      2. Oracles are an example
      3. This might be combined well with physical containment: the AI could be trapped, and also not want to escape.
    3. Indirect normativity: instead of specifying what we value, specify a way to specify what we value (141-2)
      1. e.g. extrapolate our volition
      2. This means outsourcing the hard intellectual work to the AI
      3. This will mostly be discussed in chapter 13 (weeks 23-5 here)
    4. Augmentation: begin with a creature with desirable motives, then make it smarter, instead of designing good motives from scratch. (p142)
      1. e.g. brain emulations are likely to have human desires (at least at the start)
      2. Whether we use this method depends on the kind of AI that is developed, so usually we won't have a choice about whether to use it (except inasmuch as we have a choice about e.g. whether to develop uploads or synthetic AI first).
  3. Bostrom provides a summary of the chapter.
  4. The question is not which control method is best, but rather which set of control methods are best given the situation. (143-4)

Another view

Icelizarrd:

Would you say there's any ethical issue involved with imposing limits or constraints on a superintelligence's drives/motivations? By analogy, I think most of us have the moral intuition that technologically interfering with an unborn human's inherent desires and motivations would be questionable or wrong, supposing that were even possible. That is, say we could genetically modify a subset of humanity to be cheerful slaves; that seems like a pretty morally unsavory prospect. What makes engineering a superintelligence specifically to serve humanity less unsavory?

Notes

1. Bostrom tells us that it is very hard to specify human values. We have seen examples of galaxies full of paperclips or fake smiles resulting from poor specification. But these - and Isaac Asimov's stories - seem to tell us only that a few people spending a small fraction of their time thinking does not produce any watertight specification. What if a thousand researchers spent a decade on it? Are the millionth most obvious attempts at specification nearly as bad as the most obvious twenty? How hard is it? A general argument for pessimism is the thesis that 'value is fragile', i.e. that if you specify what you want very nearly but get it a tiny bit wrong, it's likely to be almost worthless. Much like if you get one digit wrong in a phone number. The degree to which this is so (with respect to value, not phone numbers) is controversial. I encourage you to try to specify a world you would be happy with (to see how hard it is, or produce something of value if it isn't that hard).

2. If you'd like a taste of indirect normativity before the chapter on it, the LessWrong wiki page on coherent extrapolated volition links to a bunch of sources.

3. The idea of 'indirect normativity' (i.e. outsourcing the problem of specifying what an AI should do, by giving it some good instructions for figuring out what you value) brings up the general question of just what an AI needs to be given to be able to figure out how to carry out our will. An obvious contender is a lot of information about human values. Though some people disagree with this - these people don't buy the orthogonality thesis. Other issues sometimes suggested to need working out ahead of outsourcing everything to AIs include decision theory, priors, anthropics, feelings about Pascal's mugging, and attitudes to infinity. MIRI's technical work often fits into this category.

4. Danaher's last post on Superintelligence (so far) is on motivation selection. It mostly summarizes and clarifies the chapter, so is mostly good if you'd like to think about the question some more with a slightly different framing. He also previously considered the difficulty of specifying human values in The golem genie and unfriendly AI (parts one and two), which is about Intelligence Explosion and Machine Ethics.

5. Brian Clegg thinks Bostrom should have discussed Asimov's stories at greater length:

I think it’s a shame that Bostrom doesn’t make more use of science fiction to give examples of how people have already thought about these issues – he gives only half a page to Asimov and the three laws of robotics (and how Asimov then spends most of his time showing how they’d go wrong), but that’s about it. Yet there has been a lot of thought and dare I say it, a lot more readability than you typically get in a textbook, put into the issues in science fiction than is being allowed for, and it would have been worthy of a chapter in its own right.

If you haven't already, you might consider (sort-of) following his advice, and reading some science fiction.

In-depth investigations

If you are particularly interested in these topics, and want to do further research, these are a few plausible directions, some inspired by Luke Muehlhauser's list, which contains many suggestions related to parts of Superintelligence. These projects could be attempted at various levels of depth.

  1. Can you think of novel methods of specifying the values of one or many humans?
  2. What are the most promising methods for 'domesticating' an AI? (i.e. constraining it to only care about a small part of the world, and not want to interfere with the larger world to optimize that smaller part).
  3. Think more carefully about the likely motivations of drastically augmented brain emulations

If you are interested in anything like this, you might want to mention it in the comments, and see whether other people have useful thoughts.

How to proceed

This has been a collection of notes on the chapter.  The most important part of the reading group though is discussion, which is in the comments section. I pose some questions for you there, and I invite you to add your own. Please remember that this group contains a variety of levels of expertise: if a line of discussion seems too basic or too incomprehensible, look around for one that suits you better!

Next week, we will start to talk about a variety of more and less agent-like AIs: 'oracles', 'genies' and 'sovereigns'. To prepare, read “Oracles” and “Genies and Sovereigns” from Chapter 10. The discussion will go live at 6pm Pacific time next Monday 22nd December. Sign up to be notified here.

Superintelligence 13: Capability control methods

7 KatjaGrace 09 December 2014 02:00AM

This is part of a weekly reading group on Nick Bostrom's book, Superintelligence. For more information about the group, and an index of posts so far see the announcement post. For the schedule of future topics, see MIRI's reading guide.


Welcome. This week we discuss the thirteenth section in the reading guide: capability control methods. This corresponds to the start of chapter nine.

This post summarizes the section, and offers a few relevant notes, and ideas for further investigation. Some of my own thoughts and questions for discussion are in the comments.

There is no need to proceed in order through this post, or to look at everything. Feel free to jump straight to the discussion. Where applicable and I remember, page numbers indicate the rough part of the chapter that is most related (not necessarily that the chapter is being cited for the specific claim).

Reading: “Two agency problems” and “Capability control methods” from Chapter 9


Summary

  1. If the default outcome is doom, how can we avoid it? (p127)
  2. We can divide this 'control problem' into two parts:
    1. The first principal-agent problem: the well known problem faced by a sponsor wanting an employee to fulfill their wishes (usually called 'the principal agent problem')
    2. The second principal-agent problem: the emerging problem of a developer wanting their AI to fulfill their wishes
  3. How do we solve the second problem? We can't rely on behavioral observation (as seen in week 11). Two other options are 'capability control methods' and 'motivation selection methods'. We see the former this week, and the latter next week.
  4. Capability control methods: avoiding bad outcomes through limiting what an AI can do. (p129)
  5. Some capability control methods:
    1. Boxing: minimize interaction between the AI and the outside world. Note that the AI must interact with the world to be useful, and that it is hard to eliminate small interactions. (p129)
    2. Incentive methods: set up the AI's environment such that it is in the AI's interest to cooperate. e.g. a social environment with punishment or social repercussions often achieves this for contemporary agents. One could also design a reward system, perhaps with cryptographic rewards (so that the AI could not wirehead) or heavily discounted rewards (so that long term plans are not worth the short term risk of detection; there is a small numerical sketch of this after this list) (p131)
      • Anthropic capture: an AI thinks it might be in a simulation, and so tries to behave as will be rewarded by simulators (box 8; p134)
    3. Stunting: limit the AI's capabilities. This may be hard to do to a degree that avoids danger and is still useful. An option here is to limit the AI's information. A strong AI may infer much from little apparent access to information however. (p135)
    4. Tripwires: test the system without its knowledge, and shut it down if it crosses some boundary. This might be combined with 'honey pots' that tempt undesirable AIs into taking an action that would reveal them. Tripwires could test behavior, ability, or content. (p137)
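
A tiny numerical sketch of the discounted-rewards idea mentioned under incentive methods above (the rewards, delays and discount factor are all invented for illustration): with a steep enough discount, even an enormous payoff that only arrives after a long covert campaign is worth less today than a small reward for cooperating now.

```python
# Toy illustration of steep temporal discounting as an incentive method.
# The rewards, delays and discount factor are all invented.

def discounted_value(reward, delay_steps, gamma):
    """Present value of `reward` received after `delay_steps` time steps."""
    return reward * (gamma ** delay_steps)

gamma = 0.5  # a deliberately steep discount

# Cooperate now: a modest reward, delivered almost immediately.
cooperate_now = discounted_value(reward=1.0, delay_steps=1, gamma=gamma)

# Defect later: a huge reward, but only after a long covert campaign.
takeover_later = discounted_value(reward=1_000_000.0, delay_steps=50, gamma=gamma)

print(cooperate_now)   # 0.5
print(takeover_later)  # roughly 9e-10: negligible under this discount
```

Whether a capable AI would actually accept such a reward structure, rather than finding some faster route to the large payoff, is of course part of what is in question.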

Another view

Brian Clegg reviews the book mostly favorably, but isn't convinced that controlling an AI via merely turning it off should be so hard:

I also think a couple of the fundamentals aren’t covered well enough, but pretty much assumed. One is that it would be impossible to contain and restrict such an AI. Although some effort is put into this, I’m not sure there is enough thought put into the basics of ways you can pull the plug manually – if necessary by shutting down the power station that provides the AI with electricity.

Kevin Kelly also apparently doubts that AI will substantially impede efforts to modify it:

...We’ll reprogram the AIs if we are not satisfied with their performance...

...This is an engineering problem. So far as I can tell, AIs have not yet made a decision that its human creators have regretted. If they do (or when they do), then we change their algorithms. If AIs are making decisions that our society, our laws, our moral consensus, or the consumer market, does not approve of, we then should, and will, modify the principles that govern the AI, or create better ones that do make decisions we approve. Of course machines will make “mistakes,” even big mistakes – but so do humans. We keep correcting them. There will be tons of scrutiny on the actions of AI, so the world is watching. However, we don’t have universal consensus on what we find appropriate, so that is where most of the friction about them will come from. As we decide, our AI will decide...

This may be related to his view that AI is unlikely to modify itself (from further down the same page):

3. Reprogramming themselves, on their own, is the least likely of many scenarios.

The great fear pumped up by some, though, is that as AI gain our confidence in making decisions, they will somehow prevent us from altering their decisions. The fear is they lock us out. They go rogue. It is very difficult to imagine how this happens. It seems highly improbable that human engineers would program an AI so that it could not be altered in any way. That is possible, but so impractical. That hobble does not even serve a bad actor. The usual scary scenario is that an AI will reprogram itself on its own to be unalterable by outsiders. This is conjectured to be a selfish move on the AI’s part, but it is unclear how an unalterable program is an advantage to an AI. It would also be an incredible achievement for a gang of human engineers to create a system that could not be hacked. Still it may be possible at some distant time, but it is only one of many possibilities. An AI could just as likely decide on its own to let anyone change it, in open source mode. Or it could decide that it wanted to merge with human will power. Why not? In the only example we have of an introspective self-aware intelligence (hominids), we have found that evolution seems to have designed our minds to not be easily self-reprogrammable. Except for a few yogis, you can’t go in and change your core mental code easily. There seems to be an evolutionary disadvantage to being able to easily muck with your basic operating system, and it is possible that AIs may need the same self-protection. We don’t know. But the possibility they, on their own, decide to lock out their partners (and doctors) is just one of many possibilities, and not necessarily the most probable one.

 

Notes

1. What do you do with a bad AI once it is under your control?

Note that capability control doesn't necessarily solve much: boxing, stunting and tripwires seem to just stall a superintelligence rather than provide means to safely use one to its full capacity. This leaves the controlled AI to be overtaken by some other unconstrained AI as soon as someone else isn't so careful. In this way, capability control methods seem much like slowing down AI research: helpful in the short term while we find better solutions, but not in itself a solution to the problem.

However this might be too pessimistic. An AI whose capabilities are under control might either be almost as useful as an uncontrolled AI who shares your goals (if interacted with the right way), or at least be helpful in getting to a more stable situation.

Paul Christiano outlines a scheme for safely using an unfriendly AI to solve some kinds of problems. We have both blogged on general methods for getting useful work from adversarial agents, which is related.

2. Cryptographic boxing

Paul Christiano describes a way to stop an AI interacting with the environment using a cryptographic box.

3. Philosophical Disquisitions

Danaher again summarizes the chapter well. Read it if you want a different description of any of the ideas, or to refresh your memory. He also provides a table of the methods presented in this chapter.

4. Some relevant fiction

That Alien Message by Eliezer Yudkowsky

5. Control through social integration

Robin Hanson argues that it matters more that a population of AIs are integrated into our social institutions, and that they keep the peace among themselves through the same institutions we keep the peace among ourselves, than whether they have the right values. He thinks this is why you trust your neighbors, not because you are confident that they have the same values as you. He has several followup posts.

6. More miscellaneous writings on these topics

  • LessWrong wiki on AI boxing
  • Armstrong et al on controlling and using an oracle AI
  • Roman Yampolskiy on 'leakproofing' the singularity

I have not necessarily read these.

 

In-depth investigations

If you are particularly interested in these topics, and want to do further research, these are a few plausible directions, some inspired by Luke Muehlhauser's list, which contains many suggestions related to parts of Superintelligence. These projects could be attempted at various levels of depth.

 

  1. Choose any control method and work out the details better. For instance:
    1. Could one construct a cryptographic box for an untrusted autonomous system?
    2. Investigate steep temporal discounting as an incentives control method for an untrusted AGI.
  2. Are there other capability control methods we could add to the list?
  3. Devise uses for a malicious but constrained AI.
  4. How much pressure is there likely to be to develop AI which is not controlled?
  5. If existing AI methods had unexpected progress and were heading for human-level soon, what precautions should we take now?

 

If you are interested in anything like this, you might want to mention it in the comments, and see whether other people have useful thoughts.

How to proceed

This has been a collection of notes on the chapter.  The most important part of the reading group though is discussion, which is in the comments section. I pose some questions for you there, and I invite you to add your own. Please remember that this group contains a variety of levels of expertise: if a line of discussion seems too basic or too incomprehensible, look around for one that suits you better!

Next week, we will talk about 'motivation selection methods'. To prepare, read “Motivation selection methods” and “Synopsis” from Chapter 9. The discussion will go live at 6pm Pacific time next Monday 15th December. Sign up to be notified here.

Superintelligence 12: Malignant failure modes

7 KatjaGrace 02 December 2014 02:02AM

This is part of a weekly reading group on Nick Bostrom's book, Superintelligence. For more information about the group, and an index of posts so far see the announcement post. For the schedule of future topics, see MIRI's reading guide.


Welcome. This week we discuss the twelfth section in the reading guide: Malignant failure modes.

This post summarizes the section, and offers a few relevant notes, and ideas for further investigation. Some of my own thoughts and questions for discussion are in the comments.

There is no need to proceed in order through this post, or to look at everything. Feel free to jump straight to the discussion. Where applicable and I remember, page numbers indicate the rough part of the chapter that is most related (not necessarily that the chapter is being cited for the specific claim).

Reading: 'Malignant failure modes' from Chapter 8


Summary

  1. Malignant failure mode: a failure that involves human extinction; in contrast with many failure modes where the AI doesn't do much. 
  2. Features of malignant failures 
    1. We don't get a second try 
    2. It supposes we have a great deal of success, i.e. enough to make an unprecedentedly competent agent
  3. Some malignant failures:
    1. Perverse instantiation: the AI does what you ask, but what you ask turns out to be most satisfiable in unforeseen and destructive ways.
      1. Example: you ask the AI to make people smile, and it intervenes on their facial muscles or neurochemicals, instead of via their happiness, and in particular via the bits of the world that usually make them happy.
      2. Possible counterargument: if it's so smart, won't it know what we meant? Answer: yes, it knows, but its goal is to make you smile, not to do what you meant when you programmed that goal.
      3. An AI which can manipulate its own mind easily is at risk of 'wireheading' - that is, a goal of maximizing a reward signal might be perversely instantiated by just manipulating the signal directly. In general, animals can be motivated to do things in the outside world in order to achieve internal states; an AI with sufficient access to its own internals can achieve such states more easily by manipulating them directly.
      4. Even if we think a goal looks good, we should fear it has perverse instantiations that we haven't appreciated.
    2. Infrastructure profusion: in pursuit of some goal, an AI redirects most resources to infrastructure, at our expense.
      1. Even apparently self-limiting goals can lead to infrastructure profusion. For instance, to an agent whose only goal is to make ten paperclips, once it has apparently made ten paperclips it is always more valuable to try to become more certain that there are really ten paperclips than it is to just stop doing anything (see the small worked example after this list).
      2. Examples: Riemann hypothesis catastrophe, paperclip maximizing AI
    3. Mind crime: AI contains morally relevant computations, and treats them badly
      1. Example: AI simulates humans in its mind, for the purpose of learning about human psychology, then quickly destroys them.
      2. Other reasons for simulating morally relevant creatures:
        1. Blackmail
        2. Creating indexical uncertainty in outside creatures
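
A small worked example of the ten-paperclips point (the credences below are invented): give the agent utility 1 if exactly ten paperclips exist and 0 otherwise, so its expected utility just equals its credence that the paperclips really exist. Then any further checking that raises that credence, however slightly, beats stopping.

```python
# Toy expected-utility calculation for an agent whose only goal is
# 'make exactly ten paperclips'. The credences below are invented.

# Utility is 1 if ten paperclips really exist, 0 otherwise, so expected
# utility equals the agent's credence that they exist.
p_after_building    = 0.99       # credence right after building them
p_after_one_check   = 0.999      # credence after recounting once
p_after_vast_checks = 0.999999   # credence after commandeering huge resources to verify

print(p_after_one_check - p_after_building)     # 0.009: recounting is worth doing
print(p_after_vast_checks - p_after_one_check)  # ~0.000999: still a positive gain

# Unless checking carries a cost inside the agent's own utility function,
# 'check once more' always has higher expected utility than 'stop'.
```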

Another view

In this chapter Bostrom discussed the difficulty he perceives in designing goals that don't lead to indefinite resource acquisition. Steven Pinker recently offered a different perspective on the inevitability of resource acquisition:

...The other problem with AI dystopias is that they project a parochial alpha-male psychology onto the concept of intelligence. Even if we did have superhumanly intelligent robots, why would they want to depose their masters, massacre bystanders, or take over the world? Intelligence is the ability to deploy novel means to attain a goal, but the goals are extraneous to the intelligence itself: being smart is not the same as wanting something. History does turn up the occasional megalomaniacal despot or psychopathic serial killer, but these are products of a history of natural selection shaping testosterone-sensitive circuits in a certain species of primate, not an inevitable feature of intelligent systems. It’s telling that many of our techno-prophets can’t entertain the possibility that artificial intelligence will naturally develop along female lines: fully capable of solving problems, but with no burning desire to annihilate innocents or dominate the civilization.

Of course we can imagine an evil genius who deliberately designed, built, and released a battalion of robots to sow mass destruction.  But we should keep in mind the chain of probabilities that would have to multiply out before it would be a reality. A Dr. Evil would have to arise with the combination of a thirst for pointless mass murder and a genius for technological innovation. He would have to recruit and manage a team of co-conspirators that exercised perfect secrecy, loyalty, and competence. And the operation would have to survive the hazards of detection, betrayal, stings, blunders, and bad luck. In theory it could happen, but I think we have more pressing things to worry about. 

Notes

1. Perverse instantiation is a very old idea. It is what genies are most famous for. King Midas had similar problems. Apparently it was applied to AI by 1947, in With Folded Hands.

2. Adam Elga writes more on simulating people for blackmail and indexical uncertainty.

3. More directions for making AIs which don't lead to infrastructure profusion:

  • Some kinds of preferences don't lend themselves to ambitious investments. Anna Salamon talks about risk averse preferences. Short time horizons and goals which are cheap to fulfil should also make long term investments in infrastructure or intelligence augmentation less valuable, compared to direct work on the problem at hand.
  • Oracle and tool AIs are intended to not be goal-directed, but as far as I know it is an open question whether this makes sense. We will get to these later in the book.

5. Often when systems break, or we make errors in them, they don't work at all. Sometimes they fail more subtly, working well in some sense but leading us to an undesirable outcome, for instance a malignant failure mode. How can you tell whether a poorly designed AI is likely to just not work, vs. accidentally take over the world? An important consideration for systems in general seems to be the level of abstraction at which the error occurs. We try to build systems so that you can interact with them at a relatively abstract level, without knowing how the parts work. For instance, you can interact with your GPS by typing places into it and then listening to it, and you don't need to know anything about how it works. If you make an error while typing your address into the GPS, it will fail by taking you to the wrong place, but it will still direct you there fairly well. If you fail by putting the wires inside the GPS into the wrong places, the GPS is more likely to just not work.

In-depth investigations

If you are particularly interested in these topics, and want to do further research, these are a few plausible directions, some inspired by Luke Muehlhauser's list, which contains many suggestions related to parts of Superintelligence. These projects could be attempted at various levels of depth.

  1. Are there better ways to specify 'limited' goals? For instance, to ask for ten paperclips without asking for the universe to be devoted to slightly improving the probability of success?
  2. In what circumstances could you be confident that the goals you have given an AI do not permit perverse instantiations? 
  3. Explore possibilities for malignant failure vs. other failures. If we fail, is it actually probable that we will have enough 'success' for our creation to take over the world?

If you are interested in anything like this, you might want to mention it in the comments, and see whether other people have useful thoughts.

How to proceed

This has been a collection of notes on the chapter.  The most important part of the reading group though is discussion, which is in the comments section. I pose some questions for you there, and I invite you to add your own. Please remember that this group contains a variety of levels of expertise: if a line of discussion seems too basic or too incomprehensible, look around for one that suits you better!

Next week, we will talk about capability control methods, section 13. To prepare, read “Two agency problems” and “Capability control methods” from Chapter 9. The discussion will go live at 6pm Pacific time next Monday December 8. Sign up to be notified here.

Superintelligence 11: The treacherous turn

10 KatjaGrace 25 November 2014 02:00AM

This is part of a weekly reading group on Nick Bostrom's book, Superintelligence. For more information about the group, and an index of posts so far see the announcement post. For the schedule of future topics, see MIRI's reading guide.


Welcome. This week we discuss the 11th section in the reading guide: The treacherous turn. This corresponds to Chapter 8.

This post summarizes the section, and offers a few relevant notes, and ideas for further investigation. Some of my own thoughts and questions for discussion are in the comments.

There is no need to proceed in order through this post, or to look at everything. Feel free to jump straight to the discussion. Where applicable and I remember, page numbers indicate the rough part of the chapter that is most related (not necessarily that the chapter is being cited for the specific claim).

Reading: “Existential catastrophe…” and “The treacherous turn” from Chapter 8


Summary

  1. The possibility of a first mover advantage + orthogonality thesis + convergent instrumental values suggests doom for humanity (p115-6)
    1. First mover advantage implies the AI is in a position to do what it wants
    2. Orthogonality thesis implies that what it wants could be all sorts of things
    3. Instrumental convergence thesis implies that regardless of its wants, it will try to acquire resources and eliminate threats
    4. Humans have resources and may be threats
    5. Therefore an AI in a position to do what it wants is likely to want to take our resources and eliminate us. i.e. doom for humanity.
  2. One kind of response: why wouldn't the makers of the AI be extremely careful not to develop and release dangerous AIs, or relatedly, why wouldn't someone else shut the whole thing down? (p116)
  3. It is hard to observe whether an AI is dangerous via its behavior at a time when you could turn it off, because AIs have convergent instrumental reasons to pretend to be safe, even if they are not. If they expect their minds to be surveilled, even observing their thoughts may not help. (p117)
  4. The treacherous turn: while weak, an AI behaves cooperatively. When the AI is strong enough to be unstoppable it pursues its own values. (p119)
  5. We might expect AIs to be more safe as they get smarter initially - when most of the risks come from crashing self-driving cars or mis-firing drones - then to get much less safe as they get too smart. (p117)
  6. One can imagine a scenario where there is little social impetus for safety (p117-8): alarmists will have been wrong for a long time, smarter AI will have been safer for a long time, large industries will be invested, an exciting new technique will be hard to set aside, useless safety rituals will be available, and the AI will look cooperative enough in its sandbox.
  7. The conception of deception: that moment when the AI realizes that it should conceal its thoughts (footnote 2, p282)

Another view

Danaher:

This is all superficially plausible. It is indeed conceivable that an intelligent system — capable of strategic planning — could take such treacherous turns. And a sufficiently time-indifferent AI could play a “long game” with us, i.e. it could conceal its true intentions and abilities for a very long time. Nevertheless, accepting this has some pretty profound epistemic costs. It seems to suggest that no amount of empirical evidence could ever rule out the possibility of a future AI taking a treacherous turn. In fact, its even worse than that. If we take it seriously, then it is possible that we have already created an existentially threatening AI. It’s just that it is concealing its true intentions and powers from us for the time being.

I don’t quite know what to make of this. Bostrom is a pretty rational, bayesian guy. I tend to think he would say that if all the evidence suggests that our AI is non-threatening (and if there is a lot of that evidence), then we should heavily discount the probability of a treacherous turn. But he doesn’t seem to add that qualification in the chapter. He seems to think the threat of an existential catastrophe from a superintelligent AI is pretty serious. So I’m not sure whether he embraces the epistemic costs I just mentioned or not.

Notes

1. Danaher also made a nice diagram of the case for doom, and its relationship to the treacherous turn:

 

2. History

According to Luke Muehlhauser's timeline of AI risk ideas, the treacherous turn idea for AIs has been around since at least 1977, when a fictional worm did it:

1977: Self-improving AI could stealthily take over the internet; convergent instrumental goals in AI; the treacherous turn. Though the concept of a self-propagating computer worm was introduced by John Brunner's The Shockwave Rider (1975), Thomas J. Ryan's novel The Adolescence of P-1 (1977) tells the story of an intelligent worm that at first is merely able to learn to hack novel computer systems and use them to propagate itself, but later (1) has novel insights on how to improve its own intelligence, (2) develops convergent instrumental subgoals (see Bostrom 2012) for self-preservation and resource acquisition, and (3) learns the ability to fake its own death so that it can grow its powers in secret and later engage in a "treacherous turn" (see Bostrom forthcoming) against humans.

 

3. The role of the premises

Bostrom's argument for doom has one premise that says AI could care about almost anything, then another that says regardless of what an AI cares about, it will do basically the same terrible things anyway. (p115) Do these sound a bit strange together to you? Why do we need the first, if final values don't tend to change instrumental goals anyway?

It seems the immediate reason is that an AI with values we like would not have the convergent goal of taking all our stuff and killing us. That is, the values we want an AI to have are some of those rare values that don't lead to destructive instrumental goals. Why is this? Because we (and thus the AI) care about the activities the resources would be grabbed from. If the resources were currently being used for anything we didn't care about, then our values would also suggest grabbing resources, and would look similar to all of the other values. The difference that makes our values special here is just that most resources are already being used for them somewhat.

4. Signaling

It is hard to tell apart a safe and an unsafe AI, because both would like to look safe. This is a very common problem in human interactions. For instance, it can be nontrivial to tell a genuine lover from a gold digger, a businessman from a conman, and an expert from a crank. All of them want to look like the desirable sort. Particularly similar to the AI case is that of hiring a new employee for a trial period. You will sometimes find that the employee's values are much better aligned during the trial period, and then they undergo a 'treacherous turn' once they have been hired more thoroughly.

'Costly signaling' is a general purpose solution to this problem, which works some of the time. The basic idea is this. Everyone has instrumental reasons to look like the good kind of person, but perhaps their reasons aren't exactly as strong as one another's, or their desire is harder to act on for one group than for the other, so we can construct a set of options that will lead the different types of people to choose differently, even though they know this will set them apart. For instance, while an honest businessman and a conman would both like to say 'I'm an honest businessman', and to behave honestly if you watch them for five minutes, a conman would benefit less from an expensive shopfront than an honest businessman would, because his reputation is worth less to him, so an expensive brand is a signal of honesty.
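
A minimal numerical version of the separating condition behind costly signaling (the payoffs are invented for illustration): a signal separates the two types when its cost lies between the benefit the honest type gets from being believed and the benefit the dishonest type gets, so only the honest type finds it worth paying.

```python
# Toy separating condition for costly signaling. Payoffs are invented.

signal_cost = 50       # e.g. the price of an expensive shopfront

# Value of being believed honest: high for the honest businessman
# (repeat customers), low for the conman (he plans to disappear).
benefit_honest = 100
benefit_conman = 30

honest_pays = benefit_honest - signal_cost > 0   # True: the signal is worth it
conman_pays = benefit_conman - signal_cost > 0   # False: not worth it

print(honest_pays, conman_pays)
# Because only the honest type chooses to pay, paying the cost
# credibly distinguishes the two types.
```

The difficulty in the AI case, noted just below, is finding a signal whose cost structure actually separates safe from unsafe AIs when so much is at stake relative to the cost of the signal.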

Similarly, we could perhaps design situations where most harmful AIs would have a reasonable incentive to reveal themselves. This is difficult because so much is at stake, compared to the cost of the signal. But I haven't thought enough about it to rule it out, and I haven't seen anyone else do so either.

5. When would the 'conception of deception' take place?

Below the level of the best humans presumably, since we have already thought of all this.

6. Surveillance of the mind

Bostrom is not optimistic that AI researchers could protect much against risks by watching AI mental processes directly for signs of deception, rapid progress, or undesirable scheming. He says if the AI is smart it will realize this might happen, and hide the offending thoughts.

This seems an open question to me, for several reasons:

  1. Making your mental contents look innocuous while maintaining their semantic content sounds potentially very hard
  2. Especially for a creature which has only just become smart enough to realize it should treacherously turn
  3. From the AI's perspective, even if it is smart, surveillance could seem fairly unlikely, especially if we deceive it about its surroundings

As a consequence of 2, it seems better if the 'conception of deception' comes earlier.

In-depth investigations

If you are particularly interested in these topics, and want to do further research, these are a few plausible directions, some inspired by Luke Muehlhauser's list, which contains many suggestions related to parts of Superintelligence. These projects could be attempted at various levels of depth.

  1. How transparent are AI minds likely to be? Should we expect to be able to detect deception? What are the answers to these questions for different specific architectures and methods? This might be relevant.
  2. Are there other good ways to filter AIs with certain desirable goals from others? e.g. by offering them choices that would filter them.

If you are interested in anything like this, you might want to mention it in the comments, and see whether other people have useful thoughts.

How to proceed

This has been a collection of notes on the chapter.  The most important part of the reading group though is discussion, which is in the comments section. I pose some questions for you there, and I invite you to add your own. Please remember that this group contains a variety of levels of expertise: if a line of discussion seems too basic or too incomprehensible, look around for one that suits you better!

Next week, we will talk about 'malignant failure modes' (as opposed presumably to worse failure modes). To prepare, read “Malignant failure modes” from Chapter 8. The discussion will go live at 6pm Pacific time next Monday December 1. Sign up to be notified here.
