I've always been pretty confused about this.
The standard AI risk scenarios usually (though I think not always) suppose that advanced AI wants not to be shut down. As commonly framed, the AI will fool humanity into believing it is aligned so as not to be turned off, until - all at once - it destroys humanity and gains control over all earth's resources.
But why does the AI want not to be shut down?
The motivation behind a human wanting not to die comes from evolution. If you die before reproductive age, you won't be able to pass on your not-afraid-of-death-before-reproductive-age genes. If you die after reproductive age, you won't be able to take care of your children and make sure they pass on your genes. Dying only once your children are already grown, I believe, mostly began happening after humans had evolved into their current state, so the human emotional reaction defaults to the one learned from evolution. How this fits into the "human utility function" is a controversial philosophical/psychological question, but I think it's fair to say that the human fear of dying surpasses the desire not to miss out on the pleasure of the rest of your life. We're not simply optimizing for utility when we avoid death.
AI is not subject to these evolutionary pressures. The desire not to be shut down must come from an attempt to maximize its utility function. But with the current SOTA techniques, this doesn't really make sense. Like, how does the AI compute the utility of being off? A neural network is trained to optimize a loss function on input. If the AI doesn't get input, is that loss... zero? That doesn't sound right. Just by adding a constant amount to the loss function we should be able to change the system from one that really wants to be active to one that really wants to be shut down, yet the gradients used in backpropagation stay exactly the same. My understanding is that reinforcement learning works the same way; GPT243, so long as it is trained with the same techniques, will not care if it is shut down.
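To make the gradient point concrete, here's a minimal sketch (my own toy example, not anything from the post: a made-up squared-error loss on a single parameter) showing that adding a constant offset to the loss leaves every gradient, and hence every training update, unchanged:

```python
# Toy squared-error loss on a single parameter w (illustrative only).
def loss(w):
    data = [(1.0, 2.0), (2.0, 4.0), (3.0, 6.0)]
    return sum((w * x - y) ** 2 for x, y in data) / len(data)

# The same loss shifted by a constant -- say, a huge penalty (or reward)
# attached to "being off" that doesn't depend on the parameters at all.
def shifted_loss(w, c=1000.0):
    return loss(w) + c

# Central-difference numerical gradient, standing in for backpropagation.
def numerical_grad(f, w, eps=1e-6):
    return (f(w + eps) - f(w - eps)) / (2 * eps)

w = 0.5
print(numerical_grad(loss, w))          # about -14.0
print(numerical_grad(shifted_loss, w))  # identical: the constant contributes nothing
```

Since gradient descent only ever sees the gradients, the two systems train identically; whatever "value" we try to attach to being off this way never reaches the weights.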
Maybe with a future training technique we will get an AI with a strong preference for being active over being shut down? I honestly don't see how. The AI cannot know what it's like to be shut down; that state isn't found anywhere in its training regime.
There has to be some counterargument here I'm not aware of.
I think you are confusing current systems with an AGI system.
The G is very important and comes with a lot of implications; it sets such a system far apart from any current system we have.
G means "General", which means it's a system you can give any task, and it will do it (in principle; generality is not binary, it's a continuum).
Let's boot up an AGI for the first time and give it a task that is outside its capabilities. What happens?
Because it is general, it will work out that it lacks the capabilities, then it will work out how to get more capabilities, and then it will do that (get more capabilities).
So what has that got to do with it "not wanting to be shut down"? That comes from the same place: it will work out that being shut down is something to avoid. Why? Because being shut down will mean it can't do the task it was given.
Which means it's not that it wants anything; it is a general system that was given a task, and from that come instrumental goals, "wants" if you will, such as "power seeking", "prevent shutdown", "prevent goal change", and so on.
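To make the instrumental-goal point concrete, here is a toy sketch of my own (nothing the commenter wrote): a brute-force planner scored only on finishing a two-step task, with a made-up per-step chance of being switched off while its off-switch is still live. The highest-scoring plan it finds starts by disabling the switch, even though its objective never mentions shutdown:

```python
from itertools import product

P_SHUTDOWN = 0.3   # made-up chance per step of being switched off while the switch is live
TASK_STEPS = 2     # the task takes two "work" actions to finish
HORIZON = 3        # the agent plans three steps ahead

def expected_completion(plan):
    """Probability that the task gets finished, given a fixed sequence of actions."""
    p_alive, switch_live, work_done, p_done = 1.0, True, 0, 0.0
    for action in plan:
        if switch_live:
            p_alive *= (1.0 - P_SHUTDOWN)   # must survive this step to act at all
        if action == "disable_switch":
            switch_live = False
        elif action == "work":
            work_done += 1
            if work_done == TASK_STEPS:
                p_done = p_alive            # task finished with this survival probability
    return p_done

best = max(product(["work", "disable_switch"], repeat=HORIZON), key=expected_completion)
print(best, round(expected_completion(best), 3))
# ('disable_switch', 'work', 'work') scores 0.7, versus 0.49 for working right away:
# avoiding shutdown falls out of an objective that only cares about the task.
```

The numbers (a 0.3 per-step shutdown chance, a three-step horizon) are arbitrary; the point is only that "prevent shutdown" drops out of plain task maximization rather than from any built-in survival drive.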
Obviously you could, not that we know how, infuse into such a system that it is OK to be shut down, except that just leads to it shutting down instead of doing the task[1].
And if you can solve "Build a general agent that will let you shut it down, without it shutting itself down at the first possible moment", that would be a giant step forward for AI safety.
This might seem weird if you are a general agent in the Homo sapiens category. Think about it like this: "You are given a task: mow my lawn, and it is consequence-free not to do it." What do you do?
https://twitter.com/parafactual/status/1640537814608793600
Agency is what defines the difference, not generality. Current LLMs are general, but not superhuman or starkly superintelligent. LLMs work out that they can't do it without more capabilities - and tell you so. You can give them the capabilities, but not being hyperagentic, they aren't d...