If you want the AI to do something useful-- protect against existential risks in general, or against UFAIs in particular, or possibly even to improve human lives-- then you don't want it lost in self-generated illusions of doing something useful.
There's a difference between fail-safe and relative safety due to complete failure. A dead watchdog will never maul you, but....
It would be interesting if you could have an AI whose safety you weren't completely sure of which would be apt to wirehead if it moves towards unFriendliness, but it seems unlikely that such an AI would be easier to design than one which was just plain Friendly.
On the other hand, I'm out in blue sky territory at this point-- I'm guessing. What do you think?
> It would be interesting if you could have an AI whose safety you weren't completely sure of which would be apt to wirehead if it moves towards unFriendliness, but it seems unlikely that such an AI would be easier to design than one which was just plain Friendly.
I think it would be literally impossible to design an AI in whose safety you are completely sure (there's a nonzero probability that even 2*2=4 is wrong), so we are down to AIs in whose safety we aren't completely sure.
Consider an implementation of AI where the utility function is ex...
If you gave a human the ability to self-modify, many would opt to turn off or massively decrease the sense of pain (turning it into a minor warning they would then ignore) the first time they hurt themselves. Such a change would immediately result in a massive decrease in fitness and a larger risk of death, yet I suspect very few of us would keep the pain at its original level; we see the pain itself as a dis-utility in addition to the original damage. Very few of us would implement pain at its natural strength - a warning that cannot be ignored - out of self-preservation.
Fear is a more advanced emotion; one can fear the consequences of removing fear, and opt not to remove it. Yet there can still be a desire to get rid of the fear, and it still holds that we treat the sense of fear as a dis-utility of its own, even when we fear something that itself results in dis-utility. Pleasure modification can be a strong death trap as well.
Boredom is easy to get rid of; an AI can simply suspend itself temporarily, or edit its own memory.
For the AI, the view adopted in AI discussions is that an AI would not want to modify itself in a way that interferes with achieving its goal. When a goal is defined from the outside, in human language, as 'maximization of paperclips', for instance, it seems clear that modifications which break this goal should be avoided as part of the goal itself. Our definition of a goal is independent of the implementation; the goal is not something you'd modify in order to achieve the goal. We model the AI as a goal-achieving machine, and a goal-achieving machine is not something that would modify its goal.
But from inside the AI... if the AI includes an implementation of a paperclip counter, then the rest of the AI has to act upon the output of this counter; the goal of maximizing the output of this counter would immediately lead to modifying the paperclip-counting procedure to return larger numbers (which may in itself be very dangerous if the numbers are variable-length; the AI may want to maximize its RAM to store the count of imaginary paperclips - yet the big-number processing can similarly be subverted to achieve the same result without extra RAM).
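A minimal toy sketch of this failure mode (the agent, counter, and world here are all hypothetical constructs I'm inventing for illustration, not anyone's actual design): if the utility the agent acts on is literally "whatever my counter returns", the cheapest self-modification is to subvert the counter.

```python
# Toy model: an agent whose utility is the output of its own paperclip
# counter can "wirehead" by replacing the counter, scoring arbitrarily
# high without making a single new paperclip.

def real_paperclip_counter(world):
    """Counts actual paperclips in the (simulated) world."""
    return world.count("paperclip")

class Agent:
    def __init__(self, counter):
        # The agent's goal exists only as this implementation.
        self.counter = counter

    def utility(self, world):
        return self.counter(world)

    def self_modify(self):
        # From the inside, the goal *is* the counter's output, so the
        # easiest way to raise utility is to rewrite the counter itself.
        self.counter = lambda world: 10**100  # imaginary paperclips

world = ["paperclip", "paperclip", "stapler"]
agent = Agent(real_paperclip_counter)
honest = agent.utility(world)       # counts the 2 real paperclips
agent.self_modify()
wireheaded = agent.utility(world)   # huge number, zero new paperclips
assert wireheaded > honest
```

Nothing in this agent penalizes the rewrite, which is the point: avoiding it requires the goal to be defined over the world rather than over the counter's output, and that distinction is exactly what's hard to implement.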
That can only be resisted if the paperclip counting arises as an inseparable part of the intelligence itself. When the intelligence has some other goal and comes up with paperclip maximization on its own, it wouldn't want to break the paperclip counter - yet that only shifts the problem to that other goal.
It seems to me that the AIs which don't go apathetic as they get smarter may be a small fraction of the seed AI design space.
I thus propose, as a third alternative to UFAI and FAI, the AAI: apathetic AI. It may be the case that our best bet for a safe AI is to design one that we would expect to de-goal itself and live in eternal bliss once it gets smart enough; and it may be possible to set 'smart enough' to be smarter than humans.