Superintelligence 10: Instrumentally convergent goals

KatjaGrace

18th Nov 2014

This is part of a weekly reading group on Nick Bostrom's book, Superintelligence. For more information about the group, and an index of posts so far see the announcement post. For the schedule of future topics, see MIRI's reading guide.

Welcome. This week we discuss the tenth section in the reading guide: Instrumentally convergent goals. This corresponds to the second part of Chapter 7.

This post summarizes the section, and offers a few relevant notes, and ideas for further investigation. Some of my own thoughts and questions for discussion are in the comments.

There is no need to proceed in order through this post, or to look at everything. Feel free to jump straight to the discussion. And if you are behind on the book, don't let it put you off discussing. Where applicable and I remember, page numbers indicate the rough part of the chapter that is most related (not necessarily that the chapter is being cited for the specific claim).

Reading: Instrumental convergence from Chapter 7 (p109-114)

Summary

The instrumental convergence thesis: we can identify 'convergent instrumental values' (henceforth CIVs). That is, subgoals that are useful for a wide range of more fundamental goals, and in a wide range of situations. (p109)
Even if we know nothing about an agent's goals, CIVs let us predict some of the agent's behavior (p109)
Some CIVs:
1. Self-preservation: because you are an excellent person to ensure your own goals are pursued in future.
2. Goal-content integrity (i.e. not changing your own goals): because if you don't have your goals any more, you can't pursue them.
3. Cognitive enhancement: because making better decisions helps with any goals.
4. Technological perfection: because technology lets you have more useful resources.
5. Resource acquisition: because a broad range of resources can support a broad range of goals.
For each CIV, there are plausible combinations of final goals and scenarios under which an agent would not pursue that CIV. (p109-114)
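One way to make the resource acquisition claim concrete is a toy simulation. The sketch below is not from the book: it assumes a 'final goal' is a random utility assigned to each of a handful of outcomes, and that having more resources simply means being able to reach a larger set of outcomes. All names (sample_goal, best_achievable, fraction_helped) are hypothetical, and the cost of acquiring resources is ignored.

```python
import random

def sample_goal(n_outcomes, rng):
    """A hypothetical 'final goal': a random utility for each outcome."""
    return [rng.random() for _ in range(n_outcomes)]

def best_achievable(goal, reachable):
    """Utility of the best outcome the agent can actually reach."""
    return max(goal[o] for o in reachable)

def fraction_helped(n_goals=10_000, n_outcomes=20, few=3, many=12, seed=0):
    """Share of random goals for which a larger option set is strictly better."""
    rng = random.Random(seed)
    base = list(range(few))        # outcomes reachable with few resources (assumption)
    expanded = list(range(many))   # a superset, reachable with more resources (assumption)
    helped = 0
    for _ in range(n_goals):
        goal = sample_goal(n_outcomes, rng)
        if best_achievable(goal, expanded) > best_achievable(goal, base):
            helped += 1
    return helped / n_goals

if __name__ == "__main__":
    print(fraction_helped())
```

Under these assumptions the extra options never hurt, and they strictly help for about 1 - few/many of the sampled goals (roughly three quarters with the defaults). The remaining goals are ones whose best reachable outcome was already available, a small-scale analogue of the caveat that some combinations of goals and scenarios give no reason to pursue a given CIV.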

Notes

1. Why do we care about CIVs?
CIVs to acquire resources and to preserve oneself and one's values play important roles in the argument for AI risk. The desired conclusions are that we can already predict that an AI would compete strongly with humans for resources, and also that an AI once turned on will go to great lengths to stay on and intact.

2. Related work
Steve Omohundro wrote the seminal paper on this topic. The LessWrong wiki links to all of the related papers I know of. Omohundro's list of CIVs (or as he calls them, 'basic AI drives') is a bit different from Bostrom's:

Self-improvement
Rationality
Preservation of utility functions
Avoiding counterfeit utility
Self-protection
Acquisition and efficient use of resources

3. Convergence for values and situations
It seems potentially helpful to distinguish convergence over situations and convergence over values. That is, to think of instrumental goals on two axes - one of how universally agents with different values would want the thing, and one of how large a range of situations it is useful in. A warehouse full of corn is useful for almost any goals, but only in the narrow range of situations where you are a corn-eating organism who fears an apocalypse (or you can trade it). A world of resources converted into computing hardware is extremely valuable in a wide range of scenarios, but much more so if you don't especially value preserving the natural environment. Many things that are CIVs for humans don't make it onto Bostrom's list, I presume because he expects the scenario for AI to be different enough. For instance, procuring social status is useful for all kinds of human goals. For an AI in the situation of a human, it would appear to also be useful. For an AI more powerful than the rest of the world combined, social status is less helpful.
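The two axes can be sketched as a small grid of values against situations. Everything in this sketch is an illustrative assumption: the value and situation labels are hand-picked, and usefulness is reduced to a yes/no judgment that loosely follows the corn and computing hardware examples above. The function breadth is hypothetical.

```python
# Hypothetical, hand-picked labels; only the shape of the comparison matters.
VALUES = ["paperclips", "art", "science", "nature preservation"]
SITUATIONS = [
    "corn-eater fearing apocalypse",
    "able to trade corn",
    "well fed, no trade partners",
    "isolated machine",
]

def corn_useful(value, situation):
    # Loosely per the text: corn helps almost any goal, but only in narrow situations.
    return situation in {"corn-eater fearing apocalypse", "able to trade corn"}

def hardware_useful(value, situation):
    # Loosely per the text: useful in most situations, less so if you value nature.
    return value != "nature preservation"

def breadth(useful):
    """Return (share of values, share of situations) for which the subgoal helps at least once."""
    over_values = sum(any(useful(v, s) for s in SITUATIONS) for v in VALUES) / len(VALUES)
    over_situations = sum(any(useful(v, s) for v in VALUES) for s in SITUATIONS) / len(SITUATIONS)
    return over_values, over_situations

if __name__ == "__main__":
    print("corn:", breadth(corn_useful))
    print("hardware:", breadth(hardware_useful))
```

With these made-up labels, corn scores high on the values axis and low on the situations axis, while hardware shows the reverse pattern. The numbers themselves mean nothing; the point is only that the two kinds of convergence can come apart.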

4. What sort of things are CIVs?
Arguably all CIVs mentioned above could be clustered under 'cause your goals to control more resources'. This implies causing more agents to have your values (e.g. protecting your values in yourself), causing those agents to have resources (e.g. getting resources and transforming them into better resources) and getting the agents to control the resources effectively as well as nominally (e.g. cognitive enhancement, rationality). It also suggests convergent values we haven't mentioned. To cause more agents to have your values, you might create or protect other agents with your values, or spread your values to other existing agents. To improve the resources held by those with your values, a very convergent goal in human society is to trade. This leads to a convergent goal of creating or acquiring resources which are highly valued by others, even if not by you. Money and social influence are particularly widely redeemable 'resources'. Trade also causes others to act like they have your values when they don't, which is a way of spreading your values.

As I mentioned above, my guess is that these are left out of Superintelligence because they involve social interactions. I think Bostrom expects a powerful singleton, to whom other agents will be irrelevant. If you are not confident of the singleton scenario, these CIVs might be more interesting.

5. Another discussion
John Danaher discusses this section of Superintelligence, but not disagreeably enough to read as 'another view'.

Another view

I don't know of any strong criticism of the instrumental convergence thesis, so I will play devil's advocate.

The concept of a sub-goal that is useful for many final goals is unobjectionable. However the instrumental convergence thesis claims more than this, and this stronger claim is important for the desired argument for AI doom. The further claims are also on less solid ground, as we shall see.

According to the instrumental convergence thesis, convergent instrumental goals not only exist, but can at least sometimes be identified by us. This is needed for arguing that we can foresee that AI will prioritize grabbing resources, and that it will be very hard to control. That we can identify convergent instrumental goals may seem clear - after all, we just did: self-preservation, intelligence enhancement and the like. However to say anything interesting, our claim must not only be that these values are better than not, but that they will be prioritized by the kinds of AI that will exist, in a substantial range of circumstances that will arise. This is far from clear, for several reasons.

Firstly, to know what the AI would prioritize we need to know something about its alternatives, and we can be much less confident that we have thought of all of the alternative instrumental values an AI might have. For instance, in the abstract intelligence enhancement may seem convergently valuable, but in practice adult humans devote little effort to it. This is because investments in intelligence are rarely competitive with other endeavors.

Secondly, we haven't said anything quantitative about how general or strong our proposed convergent instrumental values are likely to be, or how we are weighting the space of possible AI values. Without even any guesses, it is hard to know what to make of resulting predictions. The qualitativeness of the discussion also raises the concern that thinking on the problem has not been very concrete, and so may not be engaged with what is likely in practice.

Thirdly, we have arrived at these convergent instrumental goals by theoretical arguments about what we think of as default rational agents and 'normal' circumstances. These may be very different distributions of agents and scenarios from those produced by our engineering efforts. For instance, perhaps almost all conceivable sets of values - in whatever sense - would favor accruing resources ruthlessly. It would still not be that surprising if an agent somehow created noisily from human values cared about only acquiring resources by certain means or had blanket ill-feelings about greed.

In sum, it is unclear that we can identify important convergent instrumental values, and consequently unclear that such considerations can strongly help predict the behavior of real future AI agents.

In-depth investigations

If you are particularly interested in these topics, and want to do further research, these are a few plausible directions, some inspired by Luke Muehlhauser's list, which contains many suggestions related to parts of Superintelligence. These projects could be attempted at various levels of depth.

Do approximately all final goals make an optimizer want to expand beyond the cosmological horizon?
Can we say anything more quantitative about the strength or prevalence of these convergent instrumental values?
Can we say more about values that are likely to be convergently instrumental just across AIs that are likely to be developed, and situations they are likely to find themselves in?

If you are interested in anything like this, you might want to mention it in the comments, and see whether other people have useful thoughts.

How to proceed

This has been a collection of notes on the chapter. The most important part of the reading group though is discussion, which is in the comments section. I pose some questions for you there, and I invite you to add your own. Please remember that this group contains a variety of levels of expertise: if a line of discussion seems too basic or too incomprehensible, look around for one that suits you better!

Next week, we will talk about the treacherous turn. To prepare, read “Existential catastrophe…” and “The treacherous turn” from Chapter 8. The discussion will go live at 6pm Pacific time next Monday 24th November. Sign up to be notified here.

Comments

SteveG

I think I see what you're saying, but I am going to go out on a limb here and stick by "bug." Unflagging, unhedged optimization of a single goal seems like an error, no matter what.

Please continue to challenge me on this, and I'll try to develop this idea.

Approach #1:

I am thinking that in practical situations single-mindedness actually does not even achieve the ends of a single-minded person. It leads them in wrong directions.

Suppose the goals and values of a person or a machine are entirely single-minded (for instance, "I only eat, sleep and behave ethically so I can play Warcraft or do medical research for as many years as possible, until I die") and the rest are all "instrumental."

I am inclined to believe that if they allocated their cognitive resources in that way, such a person or machine would run into all kinds of problems very rapidly, and fail to accomplish their basic goal.

If you are constantly asking "but how does every small action I take fit into my Warcraft-playing?" then you're spending too much effort on constant re-optimization, and not enough on action.

Trying to optimize all of the time costs a lot. That's why we use rules of thumb for behavior instead.

Even if all you want is to be an optimal Warcraft player, it's better to just designate some time and resources for self-care or for learning how to live effectively with the people who can help. The optimal player would really focus on self-care or social skills during that time, and stop imagining Warcraft games for a while.

While the optimal Warcraft player is learning social skills, learning social skills effectively becomes her primary objective. For all practical purposes, she has swapped utility functions for a while.

Now let's suppose we're in the middle of a game of Warcraft. To be an optimal Warcraft player for more than one game, we also have to have a complex series of interrupts and rules (smell smoke, may screw up important relationship, may lose job and therefore not be able to buy new joystick).

If you smell smoke, the better mind architecture seems to involve swapping out the larger goal of Warcraft-playing in favor of extreme focus on dealing with the possibility that the house is burning down.

Approach #2: Perhaps finding the perfect goal is impossible, and goals must be discovered and designed over time. Goal-creation is subject to bounded rationality, so perhaps a superintelligence, like people, would incorporate a goal-revision algorithm on purpose.

Approach #3: Goals may derive from first principles which are arrived at non-rationally (I did not say irrationally, there is a difference). If a goal is non-rational, and its consequences have yet to be fully explored, then there is a non-zero probability that, at a later time, this goal may prove self-inconsistent, and have to be altered.

Under such circumstances, single-minded drives risk disaster.

Approach #4:

Suppose the system is designed in some way to be useful to people. It is very difficult to come up with unambiguous, airtightly consistent goals in this realm.

If a goal has anything to do with pleasing people, what they want changes unpredictably with time. Changing the landscape of an entire planet, for example, would not be an appropriate response for an AI that was very driven to please its master, even if the master claimed that was what they really wanted.

I am still exploring here, but I am veering toward thinking that utility function optimization, in any pure form, just plain yields flawed minds.

Sebastian_Hagen

Approach #1: Goal-evaluation is expensive

You're talking about runtime optimizations. Those are fine. You're totally allowed to run some meta-analysis, figure out you're spending more time on goal-tree updating than the updates gain you in utility, and scale that process down in frequency, or even make it dependent on how much cputime you need for time-critical ops in a given moment. Agents with bounded computational resources will never have enough cputime to compute provably optimal actions in any case (the problem is uncomputable); so how much you spe...