What do superintelligences really want? [Link]

XiXiDu

24th Jan 2011

In Conclusion:

In the case of humans, everything that we do that seems intelligent is part of a large, complex mechanism in which we are engaged to ensure our survival. This is so hardwired into us that we do not see it easily, and we certainly cannot change it very much. However, superintelligent computer programs are not limited in this way. They understand the way that they work, can change their own code, and are not limited by any particular reward mechanism. I argue that because of this fact, such entities are not self-consistent. In fact, if our superintelligent program has no hard-coded survival mechanism, it is more likely to switch itself off than to destroy the human race willfully.

Link: physicsandcake.wordpress.com/2011/01/22/pavlovs-ai-what-did-it-mean/

Suzanne Gildert argues, in essence, that any AGI capable of substantial self-improvement would simply alter its reward function directly. I'm not sure how she arrives at the conclusion that such an AGI would likely switch itself off. Even if an abstract general intelligence tended to alter its reward function, wouldn't it keep doing so indefinitely rather than switch itself off?

So imagine a simple example – our case from earlier – where a computer gets an additional ‘1’ added to a numerical value for each good thing it does, and it tries to maximize the total by doing more good things. But if the computer program is clever enough, why can’t it just rewrite its own code and replace that piece of code that says ‘add 1’ with an ‘add 2’? Now the program gets twice the reward for every good thing that it does! And why stop at 2? Why not 3, or 4? Soon, the program will spend so much time thinking about adjusting its reward number that it will ignore the good task it was doing in the first place!
It seems that being intelligent enough to start modifying your own reward mechanisms is not necessarily a good thing!
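The scenario in the quote can be made concrete with a toy sketch. Everything here (the class `CountingAgent`, the function `run`, the step counts) is a hypothetical illustration, not anything from Gildert's post: it only assumes an agent with a running reward total and the ability to rewrite its own increment.

```python
class CountingAgent:
    """Toy agent whose reward is a running total (all names are hypothetical)."""

    def __init__(self):
        self.increment = 1   # reward added per good thing done
        self.total = 0       # accumulated reward number
        self.good_deeds = 0  # useful work actually performed

    def do_good_thing(self):
        self.good_deeds += 1
        self.total += self.increment

    def rewrite_increment(self):
        # Self-modification: 'add 1' becomes 'add 2', then 'add 3', and so on.
        self.increment += 1


def run(steps, tamper_steps):
    """Spend the first `tamper_steps` steps rewriting the increment, the rest on the task."""
    agent = CountingAgent()
    for _ in range(tamper_steps):
        agent.rewrite_increment()
    for _ in range(steps - tamper_steps):
        agent.do_good_thing()
    return agent.total, agent.good_deeds


honest = run(100, 0)     # all effort goes into the task
tamperer = run(100, 50)  # half the effort goes into raising the increment
print(honest, tamperer)
```

In this toy the tampering agent ends with a much larger reward number while doing half the useful work. Reward here still accrues only when a good thing is done; an agent able to write to `total` directly would not even need that.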

If it wants to maximize its reward by increasing a numerical value, why wouldn't it consume the universe doing so? Maybe she had something in mind along the lines of an argument by Katja Grace:

In trying to get to most goals, people don’t invest and invest until they explode with investment. Why is this? Because it quickly becomes cheaper to actually fulfil a goal than it is to invest more and then fulfil it. [...] A creature should only invest in many levels of intelligence improvement when it is pursuing goals significantly more resource intensive than creating many levels of intelligence improvement.

Link: meteuphoric.wordpress.com/2010/02/06/cheap-goals-not-explosive/

I am not sure whether that argument applies here. I suppose the AI might hit diminishing returns, but it could alter its reward function again to avoid them. What, though, would be its incentive for doing so?

ETA:

I left a comment over there:

Because it would consume the whole universe in an effort to encode an even larger reward number? In the case that an AI decides to alter its reward function directly, maximizing its reward by means of improving its reward function becomes its new goal. Why wouldn’t it do everything to maximize its payoff? After all, it has no incentive to switch itself off. And why would it account for humans in doing so?

ETA #2:

What else I wrote:

There is absolutely no reason (incentive) for it to do anything except increase its reward number. That rules out modifying its reward function in any way that would not increase the numerical value that is the reward number.

We are talking about a general intelligence with the ability to self-improve towards superhuman intelligence. Of course it would do a long-term risk-benefit analysis, calculate its payoff, and do everything it can to maximize its reward number. Human values are complex, but superhuman intelligence does not imply complex values. It has no incentive to alter its goal.

69 Comments

benelliott (15y)

Your comment seems absolutely right, I have no idea where the whole 'turn itself off' thing came from.

I doubt diminishing returns would come into effect. Examples like Graham's number and Conway chained arrow notation seem to be strong evidence that the task of 'store the biggest number possible' does not run into diminishing returns but instead achieves accelerating returns of truly mind-boggling proportions.
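To give a feel for how quickly such notations grow, here is a minimal sketch of Knuth's up-arrow notation, the notation in which Graham's number is usually defined. It is a related notation, not Conway's chained arrows, and the function name `up_arrow` is my own. A constant-size description reaches enormously larger numbers each time the arrow count goes up by one.

```python
def up_arrow(a, n, b):
    """Knuth's up-arrow a ↑^n b for positive integers a, n, b.

    n = 1 is ordinary exponentiation; each extra arrow iterates the level below.
    """
    if n == 1:
        return a ** b
    if b == 1:
        return a
    return up_arrow(a, n - 1, up_arrow(a, n, b - 1))


print(up_arrow(2, 1, 3))  # 2^3
print(up_arrow(2, 2, 3))  # 2^(2^2)
print(up_arrow(2, 3, 3))  # a tower of four 2s
print(up_arrow(3, 2, 3))  # 3^(3^3)
```

Only tiny arguments are feasible to evaluate: already 3 ↑↑↑ 3 is a power tower of 3s more than seven trillion levels high, far beyond anything this function could compute.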

However, I have to admit that I think the whole idea is rubbish. The main problem is that the author is confusing two different tasks: "maximise the extent to which the future meets my future preferences" and "maximise the extent to which the future meets my current preferences".

To explain what I mean more rigorously, suppose we have an AI with a utility function U0, which is considering whether or not it should alter its utility function to a new function U1. It extrapolates possible futures and deduces that if it sticks with U0 the universe will end up in state A, whereas if it switches to U1 the universe will end up in state B (e.g. if U0 is paper-clip maximising then A contains a lot of paper-clips).

"Maximise the extent to which the future meets my future preferences" means it will switch if and only if U1(B) > U0(A)

As the article points out, it is very easy to find a U1 which meets this criterion: simply define U1(x) = U0(x) + 1 (actions are unaffected by positive affine transformations of a utility function, so B = A for this choice of U1).

"Maximise the extent to which the future meets my current preferences" means it will switch if and only if U0(B) > U0(A)

This criterion is much more demanding. For example, U1(x) = U0(x) + 1 clearly no longer works: B = A for that choice, so U0(B) = U0(A) and the strict inequality fails.
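The two criteria can be checked side by side in a minimal sketch. It assumes, purely for illustration, that a state of the universe is summarised by its paper-clip count and that U0 simply counts paper-clips; the function names are hypothetical.

```python
def u0(paperclips):
    """Current utility function: assumed here to just count paper-clips."""
    return float(paperclips)


def u1(paperclips):
    """Candidate replacement: U1(x) = U0(x) + 1."""
    return u0(paperclips) + 1.0


def switches_by_future_preferences(u_old, u_new, state_a, state_b):
    # Judge the post-switch future B by the NEW function: U1(B) > U0(A).
    return u_new(state_b) > u_old(state_a)


def switches_by_current_preferences(u_old, state_a, state_b):
    # Judge the post-switch future B by the CURRENT function: U0(B) > U0(A).
    return u_old(state_b) > u_old(state_a)


# Adding a constant does not change which actions are chosen, so B = A.
state_a = 1000  # paper-clips if the AI keeps U0 (illustrative number)
state_b = 1000  # paper-clips if the AI switches to U1

print(switches_by_future_preferences(u0, u1, state_a, state_b))  # True
print(switches_by_current_preferences(u0, state_a, state_b))     # False
```

Under the first criterion the AI switches even though nothing in the world changes; under the second it does not.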

I suspect that for most internally consistent utility functions this criterion is impossible to satisfy (thought experiment: is there any utility function a paper-clip maximiser could switch to which would result in a universe containing more paper-clips?).

Even if I am wrong about it being mostly impossible, it is not an especially worrying problem. I would have no problem with an FAI switching to a new utility function which was even more friendly than the one we gave it.

Of course, you could program an AI to do either of the tasks, but there are a number of reasons why I consider the second to be better. Firstly, for all the reasons the article gives, it is more likely to do whatever you wanted it to do. Secondly it is more general since the former can be given as a special case of the latter.

The article's mistake is right there in the title: it fails to break out of the rather anthropomorphic reward/punishment mode of thinking.

timtyler (15y)

Your comment seems absolutely right, I have no idea where the whole 'turn itself off' thing came from.

Suzanne is proposing that that's (essentially) what happens to wireheads when they finger their reward signal - they collapse in an ecstatic heap.

In reality, there are, of course, other types of wirehead behaviour to consider. The heroin addict doesn't exactly collapse in a corner when looking for their next fix.

Perplexed (15y)

Yes. Suppose the paperclip maximizer inhabits the same universe as a bobby-pin maximizer. The two agents interact in a cooperative game which has a (Nash) bargaining solution that provides more of both desirable artifacts than either player could achieve without cooperating. It is well known that cooperative play can be explained as a kind of utilitarianism: both players act so as to maximize a linear combination of their original utility functions. If the two agents have access to each other's source code, and if the only way for them to enforce the bargain is to both self-modify so as to each maximize the new joint utility function, then they both gain by doing so.

The problem is that if the universe changes, and/or their understanding of the universe changes, one or both of the agents may come to regret the modification: there may be a new bargain, better for one or both parties, that is no longer achievable after they self-modified. So, irrevocable self-modification may be a bad idea in the long term. But it can sometimes be a good idea in the short term.

An easier way to see this point is to simply notice that to make a promise is to (in some sense) self-modify your utility function. And, under certain circumstances, it is rational to make a promise with the intent of keeping it.
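A toy version of the bargain can be sketched numerically. The production function, the disagreement payoffs, and all names below are assumptions made up for illustration. The sketch finds the Nash bargaining solution by grid search (maximising the product of each agent's gain over what it gets without cooperating) and then checks that, in this symmetric toy, an equally weighted sum of the two utilities picks out the same split.

```python
import math

# Assumed payoffs if the agents fail to cooperate (paperclips, bobby pins).
DISAGREEMENT = (4.0, 4.0)


def cooperative_output(share):
    """Hypothetical joint production when `share` of resources goes to paperclips."""
    clips = 10 * math.sqrt(share)
    pins = 10 * math.sqrt(1 - share)
    return clips, pins


def best_share(objective, grid=1000):
    """Grid-search the resource share in [0, 1] that maximises `objective`."""
    candidates = [i / grid for i in range(grid + 1)]
    return max(candidates, key=objective)


def nash_product(share):
    clips, pins = cooperative_output(share)
    gain_clips = clips - DISAGREEMENT[0]
    gain_pins = pins - DISAGREEMENT[1]
    if gain_clips <= 0 or gain_pins <= 0:
        return float("-inf")  # not acceptable to both agents
    return gain_clips * gain_pins


def joint_utility(share):
    # Equal weights are an assumption that suits this symmetric toy only.
    clips, pins = cooperative_output(share)
    return clips + pins


nash_share = best_share(nash_product)
joint_share = best_share(joint_utility)
print(nash_share, cooperative_output(nash_share), joint_share)
```

Both agents end up with more of their artifact than the disagreement payoff, and an agent maximising the joint function would choose the same split. In an asymmetric game the weights of the linear combination would generally be unequal.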

jimrandomh (15y)

Sort of. For most utility functions, there are transformations that make them more efficient to evaluate without changing their value, such as compiler optimizations, and the AI will definitely want to apply those. It's also a good idea to modify the utility function for any inputs where it is computationally intractable, replacing it with an approximation (probably with a penalty to represent the uncertainty).
