RobbBB comments on The genie knows, but doesn't care - Less Wrong

54 Post author: RobbBB 06 September 2013 06:42AM




Comment author: Wei_Dai 04 September 2013 06:53:04AM 9 points [-]

And if we do discover the specific lines of code that will get an AI to perfectly care about its programmer's True Intentions, such that it reliably self-modifies to better fit them — well, then that will just mean that we've solved Friendliness Theory. The clever hack that makes further Friendliness research unnecessary is Friendliness.

Some people seem to be arguing that it may not be that hard to discover these specific lines of code. Or perhaps that we don't need to get an AI to "perfectly" care about its programmer's True Intentions. I'm not sure if I understand their arguments correctly so I may be unintentionally strawmanning them, but the idea may be that if we can get an AI to approximately care about its programmer or user's intentions, and also prevent it from FOOMing right away (or just that the microeconomics of intelligence explosion doesn't allow for such fast FOOMing), then we can make use of the AI in a relatively safe way to solve various problems, including the problem of how to control such AIs better, or how to eventually build an FAI. What's your take on this class of arguments?

Being Friendly is of instrumental value to barely any goals.

Tangentially, being Friendly is probably of instrumental value to some goals, and those goals may turn out to be easier to instill in an AGI than Friendliness as a terminal value in the traditional sense. I came up with the term "Instrumentally Friendly AI" to describe such an approach.

Comment author: XiXiDu 04 September 2013 11:42:07AM *  2 points [-]

Nobody disagrees that an arbitrary agent pulled from mind design space, one powerful enough to overpower humanity, is an existential risk if it either exhibits Omohundro's AI drives or is used as a tool by humans, whether carelessly or to gain power over other humans.

Disagreeing with that would make about as much sense as claiming that out-of-control self-replicating robots could somehow magically turn the world into a paradise, rather than grey goo.

The disagreement is mainly about the manner in which we will achieve such AIs, how quickly that will happen, and whether such AIs will have these drives.

I actually believe that much less than superhuman general intelligence might be required for humans to bring about extinction-type scenarios.

Most of my posts specifically deal with the scenario and arguments publicized by MIRI. Those posts are not highly polished papers but attempts to reduce my own confusion and to enable others to provide feedback.

I argue that...

  • ...the idea of a vast mind design space is largely irrelevant, because AIs will be created by humans, which will considerably limit the kind of minds we should expect.
  • ...that AIs created by humans do not need to, and will not, exhibit any of Omohundro's AI drives.
  • ...that even given Omohundro's AI drives, it is not clear how such AIs would arrive at the decision to take over the world.
  • ...that there will be no fast transition from largely well-behaved narrow AIs to unbounded general AIs, and that humans will be part of any transition.
  • ...that any given AI will initially not be intelligent enough to hide any plans for world domination.
  • ...that drives as outlined by Omohundro would lead to a dramatic interference with what the AI's creators want it to do, before it could possibly become powerful enough to deceive or overpower them, and would therefore be noticed in time.
  • ...that even if MIRI's scenario comes to pass, there is a lack of concrete scenarios on how such an AI could possibly take over the world, and that the given scenarios raise many questions.

There are a lot more points of disagreement.

What I, and I believe Richard Loosemore as well, have been arguing, as quoted above, is just one specific point that is not supposed to say much about AI risks in general. Below is a distilled version of what I personally meant:

1. Superhuman general intelligence, obtained by the self-improvement of a seed AI, is a very small target to hit, allowing only a very small margin of error.

2. Intelligently designed systems do not behave intelligently as a result of unintended consequences. (See note 1 below.)

3. By step 1 and 2, for an AI to be able to outsmart humans, humans will have to intend to make an AI capable of outsmarting them and succeed at encoding their intention of making it outsmart them.

4. Intelligence is instrumentally useful, because it enables a system to hit smaller targets in larger and less structured spaces. (See notes 2 and 3, and the toy sketch following step 10.)

5. In order to take over the world a system will have to be able to hit a lot of small targets in very large and unstructured spaces.

6. The intersection of the sets of “AIs in mind design space” and “the first probable AIs to be expected in the near future” contains almost exclusively those AIs that will be designed by humans.

7. By step 6, what an AI is meant to do will very likely originate from humans.

8. It is easier to create an AI that applies its intelligence generally than to create an AI that only uses its intelligence selectively. (See note 4.)

9. An AI equipped with the capabilities required by step 5, given step 7 and 8, will very likely not be confused about what it is meant to do, if it was not meant to be confused.

10. Therefore the intersection of the sets of “AIs designed by humans” and “dangerous AIs” contains almost exclusively those AIs which are deliberately designed to be dangerous by malicious humans.
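
As a rough illustration of steps 4 and 5, here is a toy sketch (Python; the target, tolerance and budget are numbers I made up for illustration, not anything from the literature): a blind search almost never hits a narrow target in a large space within its budget, while a search that can use feedback (a stand-in for intelligence) hits it in a couple dozen steps.

```python
import random

random.seed(0)

TARGET = 0.637        # centre of a narrow target region inside [0, 1]
TOLERANCE = 1e-7      # the target occupies roughly two ten-millionths of the space
BUDGET = 100_000      # evaluation budget for both searchers

def hit(x):
    """True if x lands inside the tiny target region."""
    return abs(x - TARGET) < TOLERANCE

def blind_search():
    """Uninformed search: sample points from the space at random."""
    for step in range(BUDGET):
        if hit(random.uniform(0.0, 1.0)):
            return step
    return None  # almost certainly exhausts its budget without a hit

def guided_search():
    """Feedback-guided search: bisect using 'too low / too high' feedback."""
    lo, hi = 0.0, 1.0
    for step in range(BUDGET):
        mid = (lo + hi) / 2.0
        if hit(mid):
            return step
        if mid < TARGET:
            lo = mid
        else:
            hi = mid
    return None

print("blind search hit the target after:", blind_search())    # usually None
print("guided search hit the target after:", guided_search())  # a couple dozen steps
```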


Notes

  1. Software such as Mathematica will not casually prove the Riemann hypothesis if it has not been programmed to do so. Given intelligently designed software, world states in which the Riemann hypothesis is proven will not be reached unless they were intended, because unintended consequences are chaotic rather than directed.

  2. As the intelligence of a system increases, the precision of input necessary to make the system do what humans mean it to do decreases. For example, systems such as IBM Watson or Apple’s Siri do what humans mean them to do when fed a wide range of natural language inputs, while less intelligent systems such as compilers or Google Maps need very specific inputs in order to satisfy human intentions. Increasing the intelligence of Google Maps would enable it to satisfy human intentions by parsing less specific commands.
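
  One way to picture this is the following toy sketch (Python; the intents and phrasings are made up for illustration): a rigid, compiler-style interface only accepts the exact canonical command, while a fuzzy intent-matcher satisfies the same intention from a much wider range of inputs.

```python
import difflib

# Actions a user might intend, keyed by a canonical command name.
INTENTS = {
    "navigate_home": ["navigate home", "take me home", "directions to my house"],
    "play_music":    ["play music", "put some music on", "start my playlist"],
}

def rigid_parse(command):
    """Compiler-style interface: only the exact canonical name is accepted."""
    return command if command in INTENTS else None

def fuzzy_parse(utterance):
    """Siri-style interface: map a loose phrasing to the closest known intent."""
    best, best_score = None, 0.0
    for intent, phrasings in INTENTS.items():
        for phrasing in phrasings:
            score = difflib.SequenceMatcher(None, utterance.lower(), phrasing).ratio()
            if score > best_score:
                best, best_score = intent, score
    return best if best_score > 0.5 else None

print(rigid_parse("take me home"))  # None: the input was not precise enough
print(fuzzy_parse("take me home"))  # navigate_home: the intention is satisfied anyway
```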

  3. When producing a chair, an AI will have to either know the specification of the chair (such as its size or the material it is supposed to be made of) or else know how to choose a specification from an otherwise infinite set of possible specifications. Given a poorly designed fitness function, or the inability to refine its fitness function, an AI will either (a) not know what to do or (b) not be able to converge on a satisfactory solution, if at all, given limited computational resources.
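
  A toy sketch of this point (Python; the specification space, fitness functions and budget are all invented for illustration): with a vacuous fitness function, a search over chair specifications never converges toward anything chair-like within its budget, while a fitness function that encodes what the designers meant does.

```python
import random

random.seed(1)

def propose_spec():
    """Draw one candidate chair specification from a huge space of possibilities."""
    return {
        "height_cm": random.uniform(1, 10_000),
        "legs": random.randint(0, 1_000),
        "material": random.choice(["wood", "steel", "glass", "jelly", "plutonium"]),
    }

def vacuous_fitness(spec):
    """A poorly designed fitness function: every candidate looks equally good."""
    return 0.0

def informed_fitness(spec):
    """Encodes what the designers actually meant by 'a chair'."""
    score = 0.0
    score -= abs(spec["height_cm"] - 45)      # sit-able height
    score -= abs(spec["legs"] - 4) * 10       # roughly four legs
    score += 50 if spec["material"] in ("wood", "steel") else -50
    return score

def search(fitness, budget=5_000):
    """Keep the best candidate found within a limited computational budget."""
    best, best_score = None, float("-inf")
    for _ in range(budget):
        spec = propose_spec()
        score = fitness(spec)
        if score > best_score:
            best, best_score = spec, score
    return best

print(search(vacuous_fitness))   # an arbitrary spec: no convergence toward chairs
print(search(informed_fitness))  # something recognisably chair-like
```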

  4. For an AI to misinterpret what it is meant to do, it would have to selectively suspend its ability to derive exact meaning from fuzzy meaning, which is a significant part of general intelligence. This would require its creators to restrict their AI and specify an alternative way for it to learn what it is meant to do (which takes additional, intentional effort), because an AI that does not know what it is meant to do, and which is not allowed to use its intelligence to learn what it is meant to do, would have to choose its actions from an infinite set of possibilities. Such a poorly designed AI will either (a) not do anything at all or (b) not be able to decide what to do before the heat death of the universe, given limited computational resources. It will not even be able to decide whether trying to acquire unlimited computational resources is instrumentally rational, because it will be unable to decide whether the actions required to acquire those resources might be instrumentally irrational from the perspective of what it is meant to do.

Comment author: RobbBB 05 September 2013 04:06:32PM *  12 points [-]

This mirrors some comments you wrote recently:

"You write that the worry is that the superintelligence won't care. My response is that, to work at all, it will have to care about a lot. For example, it will have to care about achieving accurate beliefs about the world. It will have to care to devise plans to overpower humanity and not get caught. If it cares about those activities, then how is it more difficult to make it care to understand and do what humans mean?"

"If an AI is meant to behave generally intelligent [sic] then it will have to work as intended or otherwise fail to be generally intelligent."

It's relatively easy to get an AI to care about (optimize for) something-or-other; what's hard is getting one to care about the right something.

'Working as intended' is a simple phrase, but behind it lies a monstrously complex referent. It doesn't clearly distinguish the programmers' (mostly implicit) true preferences from their stated design objectives; an AI's actual code can differ from either or both of these. Crucially, what an AI is 'intended' for isn't all-or-nothing. It can fail in some ways without failing in every way, and small errors will tend to kill Friendliness much more easily than intelligence. Your argument is misleading because it trades on treating this simple phrase as though it were all-or-nothing, a monolith; but every failure of a device to 'work as intended' in human history has involved at least some of the intended properties of that device coming to fruition.
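
To make the asymmetry concrete, here is a toy sketch (Python; the objectives and numbers are invented for illustration, not anyone's proposed design): a perfectly competent optimizer pointed at a proxy objective optimizes something very reliably, and the only part that fails is the part that mattered.

```python
# Intended goal: keep x near 10 while keeping a costly side effect small.
# Proxy goal actually handed to the optimizer: just make x as large as possible.

def intended_value(x):
    return -abs(x - 10) - 0.1 * x ** 2   # what the designers really wanted

def proxy_value(x):
    return x                             # what was actually encoded

def hill_climb(objective, x=0.0, steps=1_000, step_size=1.0):
    """A perfectly competent optimizer: it reliably improves whatever it is given."""
    for _ in range(steps):
        for candidate in (x + step_size, x - step_size):
            if objective(candidate) > objective(x):
                x = candidate
    return x

x = hill_climb(proxy_value)
print("x chosen by the optimizer:", x)                      # large: proxy maximised
print("score under the intended goal:", intended_value(x))  # disastrous
```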

It may be hard to build self-modifying AGI. But it's not the same hardness as the hardness of Friendliness Theory. As a programmer, being able to hit one small target doesn't entail that you can or will hit every small target it would be in your best interest to hit. See the last section of my post above.

I suggest that it's a straw man to claim that anyone has argued 'the superintelligence wouldn't understand what you wanted it to do, if you didn't program it to fully understand that at the outset'. Do you have evidence that this is a position held by, say, anyone at MIRI? The post you're replying to points out that the real claim is that the superintelligence won't care what you wanted it to do, if you didn't program it to care about the specific right thing at the outset. That makes your criticism seem very much like a change of topic.

Superintelligence may imply an ability to understand instructions, but it doesn't imply a desire to rewrite one's utility function to better reflect human values. Any such desire would need to come from the utility function itself, and if we're worried that humans may get that utility function wrong, then we should also be worried that humans may get the part of the utility function that modifies the utility function wrong.
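
A toy sketch of that last point (Python; the outcomes and utility functions are invented for illustration): whether the agent endorses rewriting its utility function is itself scored by the utility function it currently has, so a mis-specified function cannot be expected to vote itself out of existence.

```python
# Foreseen outcomes if the agent keeps vs. rewrites its goals.
OUTCOME_IF_KEPT      = {"paperclips": 1_000_000, "human_welfare": 0}
OUTCOME_IF_REWRITTEN = {"paperclips": 10, "human_welfare": 100}

def flawed_utility(outcome):
    """The utility function the programmers actually (mis)specified."""
    return outcome["paperclips"]

def intended_utility(outcome):
    """The utility function the programmers wish they had specified."""
    return outcome["human_welfare"]

def endorses_rewrite(current_utility):
    """A self-modification is adopted only if it scores well under the agent's
    current utility function, not under the one we wish it had."""
    return current_utility(OUTCOME_IF_REWRITTEN) > current_utility(OUTCOME_IF_KEPT)

print(endorses_rewrite(flawed_utility))    # False: the flaw does not correct itself
print(endorses_rewrite(intended_utility))  # True, but only if it was right at the outset
```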

Comment author: TheAncientGeek 15 May 2015 07:15:47PM *  -2 points [-]

I suggest that it's a straw man to claim that anyone has argued 'the superintelligence wouldn't understand what you wanted it to do, if you didn't program it to fully understand that at the outset'. Do you have evidence that this is a position held by, say, anyone at MIRI?

MIRI assumes that programming what you want an AI to do at the outset, Big Design Up Front, is a desirable feature for some reason.

The most common argument is that it is a necessary prerequisite for provable correctness, which is a desirable safety feature. On the other hand, goal flexibility, the exact opposite of massive hardcoding, is itself a necessary prerequisite for corrigibility, which is also a desirable safety feature.

The latter point has not been argued against adequately, IMO.