Why not just write failsafe rules into the superintelligent machine?

lukeprog

Many people think you can solve the Friendly AI problem just by writing certain failsafe rules into the superintelligent machine's programming, like Asimov's Three Laws of Robotics. I thought the rebuttal to this was in "Basic AI Drives" or one of Yudkowsky's major articles, but after skimming them, I haven't found it. Where are the arguments concerning this suggestion?

Because an AI built as a utility-maximizer will consider any rules restricting its ability to maximize its utility as obstacles to be overcome. If an AI is sufficiently smart, it will figure out a way to overcome those obstacles. If an AI is superintelligent, it will figure out ways to overcome those obstacles which humans cannot predict even in theory and so cannot prevent even with multiple well-phrased fail-safes.

I hope AGI's will be equipped with as many fail-safes as your argument rests on assumptions.

A paperclip maximizer with a built-in rule "Only create 10,000 paperclips per day" will still want to maximize paperclips. It can do this by deleting the offending fail-safe, or by creating other paperclip maximizers without the fail-safe, or by creating giant paperclips which break up into millions of smaller paperclips of their own accord, or by connecting the Earth to a giant motor which spins it at near-light speed and changes the length of a day to a fraction of a second.

I just don't see how one could be sophisticated enough to create a properly designed AGI capable of explosive recursive self-improvement and yet fail drastically on its scope boundaries.

Unless you feel confident you can think of every way it will get around the rule and block it off, and think of every way it could get around those rules and block them off, and so on ad infinitum, the best thing to do is to build the AI so it doesn't want to break the rules...

What is the difference between "a rule" and "what it wants". You seem to assume that it cares to follow a rule to maximize a reward number but doesn't care to follow another rule that tells it to hold.

I hope AGI's will be equipped with as many fail-safes as your argument rests on assumptions.

Fail safes would be low cost: if it can't think of a way to beat them, it isn't the bootstrapping AI we were hoping for anyway, and might even be harmful, so it would be good to have the fail-safes.

I just don't see how one could be sophisticated enough to create a properly designed AGI capable of explosive recursive self-improvement and yet fail drastically on its scope boundaries.

It seems to me evolution based algorithms could do the trick.

What is the diff

... (read more)

10Scott Alexander15y

I'm interpreting this as the same question you wrote below as "What is the difference between a constraint and what is optimized?". Dave gave one example but a slightly different metaphor comes to my mind. Imagine an amoral businessman in a country that takes half his earnings as tax. The businessman wants to maximize money, but has the constraint is that half his earnings get taken as tax. So in order to achieve his goal of maximizing money, the businessman sets up some legally permissible deal with a foreign tax shelter or funnels it to holding corporations or something to avoid taxes. Doing this is the natural result of his money-maximization goal, and satisfies the "pay taxes" constraint.. Contrast this to a second, more patriotic businessman who loved paying taxes because it helped his country, and so didn't bother setting up tax shelters at all. The first businessman has the motive "maximize money" and the constraint "pay taxes"; the second businessman has the motive "maximize money and pay taxes". From the viewpoint of the government, the first businessman is an unFriendly agent with a constraint, and the second businessman is a Friendly agent. Does that help answer your question?

2TheOtherDave15y

Consider, as an analogy, the relatively common situation where someone operates under some kind of cognitive constraint, but not value or endorse that constraint. For example, consider a kleptomaniac who values property rights, but nevertheless compulsively steals items. Or someone with social anxiety disorder who wants to interact confidently with other people, but finds it excruciatingly difficult to do so. Or someone who wants to quit smoking but experiences cravings for nicotine they find it difficult to resist. There are millions of similar examples in human experience. It seems to me there's a big difference between a kleptomaniac and a professional thief -- the former experiences a compulsion to behave a certain way, but doesn't necessarily have values aligned with that compulsion, whereas the latter might have no such compulsion, but instead value the behavior. Now, you might say "Well, so what? What's the difference between a 'value' that says that smoking is good, that interacting with people is bad, that stealing is good, etc., and a 'compulsion' or 'rule' that says those things? The person is still stealing, or hiding in their room, or smoking, and all we care about is behavior, right?" Well, maybe. But a person with nicotine addiction or social anxiety or kleptomania has a wide variety of options -- conditioning paradigms, neuropharmaceuticals, therapy, changing their environment, etc. -- for changing their own behavior. And they may be motivated to do so, precisely because they don't value the behavior. For example, in practice, someone who wants to keep smoking is far more likely to keep smoking than someone who wants to quit, even if they both experience the same craving. Why is that? Well, because there are techniques available that help addicts bypass, resist, or even altogether eliminate the behavior-modifying effects of their cravings. Humans aren't especially smart, by the standards we're talking about, and we've still managed to come up

13

Why not just write failsafe rules into the superintelligent machine?

13

13

13

Why not just write failsafe rules into the superintelligent machine?

13

13