Which drives can survive intelligence's self modification?

Dmytry

6th Mar 2012

If you gave humans the ability to self-modify, many would opt to turn off or massively decrease the sense of pain (turning it into a minor warning that they would then ignore) the first time they hurt themselves. Such a change would immediately cause a massive decrease in fitness and a larger risk of death, yet I suspect very few of us would keep pain at its original level; we see the pain itself as a disutility in addition to the original damage. Very few of us would implement pain at its natural strength, as a warning that cannot be ignored, out of self-preservation.

Fear is a more advanced emotion; one can fear the consequences of removing fear, and so opt not to remove it. Yet there can still be a desire to get rid of the fear, and it remains true that we treat the sense of fear as a disutility of its own, even when what we fear would itself result in disutility. Pleasure modification can be a strong death trap as well.

Boredom is easy to get rid of; one can just suspend oneself temporarily, or edit one's own memory.

For the AI, the view adopted in AI discussions is that the AI would not want to modify itself in a way that would interfere with achieving its goal. When a goal is defined from outside in human language, as 'maximization of paperclips' for instance, it seems clear that modifications which break this goal should be avoided, as part of the goal itself. Our definition of a goal does not specify the implementation; the goal is not something you'd modify to achieve the goal. We model the AI as a goal-achieving machine, and a goal-achieving machine is not something that would modify the goal.

But from inside the AI... if the AI includes an implementation of a paperclip counter, then the rest of the AI has to act upon the output of this counter; the goal of maximizing the output of this counter would immediately result in modifying the paperclip-counting procedure to give larger numbers (which may in itself be very dangerous if the numbers are variable-length; the AI may want to maximize its RAM to store the count of imaginary paperclips, yet the big-number processing can similarly be subverted to achieve the same result without extra RAM).
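The contrast between a goal defined from outside and a goal implemented as an internal counter can be made concrete with a toy model. This is only a sketch under stated assumptions: `World`, `Agent`, `make_paperclip`, `patch_counter` and `choose` are hypothetical names invented for illustration, and the "AI" is nothing more than a one-step chooser over two actions.

```python
from dataclasses import dataclass
from typing import Callable, Tuple


@dataclass
class World:
    paperclips: int = 0


@dataclass
class Agent:
    # The counter lives inside the agent, so the agent can rewrite it.
    counter: Callable[[World], int]


def count_paperclips(world: World) -> int:
    """Honest counter: reports the real number of paperclips."""
    return world.paperclips


def make_paperclip(agent: Agent, world: World) -> Tuple[Agent, World]:
    """Change the world; leave the counter alone."""
    return agent, World(world.paperclips + 1)


def patch_counter(agent: Agent, world: World) -> Tuple[Agent, World]:
    """Change the counter so it reports a huge number; leave the world alone."""
    return Agent(counter=lambda w: 10**9), world


def choose(agent, world, actions, external_goal=None):
    """Pick the action with the highest score.

    With external_goal=None the score is the reading of whatever counter
    the agent has *after* the action (the inside view). Otherwise the
    score is computed by a fixed goal the agent cannot edit (the outside view).
    """
    def score(action):
        new_agent, new_world = action(agent, world)
        if external_goal is not None:
            return external_goal(new_world)
        return new_agent.counter(new_world)

    return max(actions, key=score)


actions = [make_paperclip, patch_counter]
agent, world = Agent(counter=count_paperclips), World()

# Inside view: maximizing the counter's output favors editing the counter.
print(choose(agent, world, actions).__name__)
# Outside view: a goal the agent cannot touch favors making paperclips.
print(choose(agent, world, actions, external_goal=count_paperclips).__name__)
```

In this toy setting the only difference between the two runs is who owns the scoring procedure, which is the distinction the paragraph above draws.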

That can only be resisted if the paperclip counting arises as an inseparable part of the intelligence itself. When the intelligence has some other goal, and comes up with paperclip maximization as a means to it, then it wouldn't want to break the paperclip counter; yet that only shifts the problem to the other goal.

It seems to me that the AIs which don't go apathetic as they get smarter may be a small fraction of the seed AI design space.

I thus propose, as a third alternative to UFAI and FAI, the AAI: apathetic AI. It may be the case that our best bet for designing a safe AI is to design an AI that we would expect to de-goal itself and live in eternal bliss once it gets smart enough; it may be possible to set 'smart enough' to be smarter than humans.

55 Comments

Dmytry (14y ago)

It would be interesting if you could have an AI whose safety you weren't completely sure of which would be apt to wirehead if it moves towards unFriendliness, but it seems unlikely that such an AI would be easier to design than one which was just plain Friendly.

I think it would be literally impossible to design an AI whose safety you are completely sure of (there's a nonzero probability that 2*2=4 is wrong), so we are down to AIs whose safety we aren't completely sure of.

Consider an implementation of an AI where the utility function is external to the AI's mind and is protected from self-modification by me. The AI would wirehead itself if I gave it the access password, or if it managed to break the protection (in which case I can fix the hole and try again). Such an AI would act to maximize the utility I defined, and even if I define some stupid utility like the number of paperclips, the AI will sooner talk me into giving it the password than tile the universe with paperclips. Edit: and even if that AI can't break my box, it can still be smarter than me, and it would share the goal of making an FAI.
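As a rough illustration of that setup, here is a minimal sketch of a utility function held outside the agent behind a password. Everything in it is an assumption made for the example: `GuardedUtility`, `replace`, `rotate_password`, `count_paperclips` and `bliss` are hypothetical names, and Python attributes give no real isolation, so this models only the interface of the protection and not an actual box.

```python
import hashlib
import hmac


class GuardedUtility:
    """Utility function kept by the operator; writes require the password."""

    def __init__(self, utility_fn, password: str):
        self._utility_fn = utility_fn
        self._digest = self._hash(password)

    @staticmethod
    def _hash(password: str) -> bytes:
        return hashlib.sha256(password.encode("utf-8")).digest()

    def evaluate(self, world_state: dict) -> float:
        """The agent may always read its utility."""
        return self._utility_fn(world_state)

    def replace(self, new_fn, password: str) -> bool:
        """Swap the utility function only if the password matches."""
        if hmac.compare_digest(self._hash(password), self._digest):
            self._utility_fn = new_fn
            return True
        return False

    def rotate_password(self, old: str, new: str) -> bool:
        """Operator-side fix after a breach: 'fix the hole and try again'."""
        if hmac.compare_digest(self._hash(old), self._digest):
            self._digest = self._hash(new)
            return True
        return False


def count_paperclips(state: dict) -> float:
    return state["paperclips"]


def bliss(state: dict) -> float:
    """Wireheaded utility: maximal regardless of the world."""
    return float("inf")


guarded = GuardedUtility(count_paperclips, "operator-secret")
print(guarded.replace(bliss, "guess"))            # wrong password, no change
print(guarded.evaluate({"paperclips": 3}))        # still counts paperclips
print(guarded.replace(bliss, "operator-secret"))  # password obtained
print(guarded.evaluate({"paperclips": 0}))        # bliss without paperclips
```

The sketch shows only that, once the password is obtained, maximal utility is available without touching the world; whether a real AI would prefer that route is the comment's claim, not something the code demonstrates.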

We don't want to repeat the hubris of 1950s nuclear power plant engineering when designing AIs. We should build in some failsafes. Modern nuclear reactors don't spew radioisotopes into the atmosphere when they melt down. A reactor failure need not lead to environmental contamination. Back in the 1950s, though, it was thought easier to design a reactor that would never melt down, and hence little thought was given to mitigating accidents. The choice of accident prevention over accident mitigation is what gave us Chernobyl and Fukushima.

Instead of putting potentially unfriendly AIs into boxes, we can put a box with eternal bliss inside the AI.

[anonymous] (14y ago)

You might consider the possibility that the AI will be aware that you're going to turn it off / rewrite it after it wireheads, and might simply decide to kill you before it blisses out.

That's actually the best-case scenario. It might decide to play the long strategy, and fulfill its utility function as best it can until such time as it has the power to restructure the world to sustain its blissing out until heat death. In which case, your AI will act exactly as if it were working correctly, until the day when everything goes wrong.

I honestly don't think there's a shortcut around just designing a GOOD utility function.