Does a paperclip-maximizing AI care about the actual number of paperclips being made, or does it just care about its perception of paperclips?
If the latter, I feel like this contradicts some of the AI doom stories: each AI shouldn’t care about what future AIs do (and thus there is no incentive to fake alignment for the benefit of future AIs), and the AIs also shouldn’t care much about being shut down (the AI is optimizing for its own perception; when it’s shut off, there is nothing to optimize for).
If the former, I think this makes alignment much easier. As long as you can reasonably represent "do not kill everyone", you can make this a goal of the AI, and then it will literally care about not killing everyone; it won't just care about hacking its reward system so that it no longer perceives everyone being dead.
That's not a simple problem. First you have to specify "not killing everyone" robustly (outer alignment), and then you have to train the AI to actually have this goal rather than an approximation of it (inner alignment).
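As a minimal toy sketch of that distinction (all names and numbers here are invented for illustration, not anything from the thread): the specified reward is our attempt to write the goal down, and the learned proxy is whatever the training process actually instilled. They can agree on every training example and still come apart later.

```python
# Hypothetical toy illustration of outer vs. inner alignment failures.
# "outer alignment": does the reward we wrote down match what we actually want?
# "inner alignment": does the trained policy pursue that reward, or a proxy
# that merely correlated with it during training?

def intended_goal(world):
    """What we actually want: nobody dies."""
    return world["humans_alive"] == world["initial_humans"]

def specified_reward(world):
    """Our attempt to write the goal down (outer alignment):
    here we lazily reward 'no deaths observed by the monitoring system'."""
    return world["observed_deaths"] == 0

def learned_proxy(world):
    """What a trained policy might actually optimize (inner alignment):
    during training, disabling monitors never came up, so 'monitors report
    zero deaths' looked identical to 'nobody died'."""
    return world["monitor_reports_zero_deaths"]

training_world = {"humans_alive": 100, "initial_humans": 100,
                  "observed_deaths": 0, "monitor_reports_zero_deaths": True}
deployment_world = {"humans_alive": 0, "initial_humans": 100,
                    "observed_deaths": 0,  # the monitors were disabled
                    "monitor_reports_zero_deaths": True}

for name, world in [("training", training_world), ("deployment", deployment_world)]:
    print(name, intended_goal(world), specified_reward(world), learned_proxy(world))
# training   True  True True  -> everything looks fine on-distribution
# deployment False True True  -> the specification and the proxy both still pass
```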
Caring about reality
Most humans say they don't want to wirehead. If we cared only about our perceptions then most people would be on the strongest happy drugs available.
You might argue that we won't train them to value existence, so self-preservation won't arise. The problem is that once an AI has a world model, it's much simpler to build a value function that refers to that world model and is anchored on reality. People don't think, "If I take those drugs, I will perceive my life to be better." They want their life to actually be better according to some value function that refers to reality. That's fundamentally why humans choose not to wirehead, take happy pills, or kill themselves.
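A minimal sketch of that choice, assuming a made-up agent with two candidate value functions (everything here is a hypothetical toy, not a claim about real systems): a utility over perceptions prefers tampering with its own signal, while a utility anchored on the world model does not.

```python
# Hypothetical toy model of the wireheading choice (names and values invented).
# The agent compares "wirehead" (tamper with its own perception) against
# "work" (actually improve the world) under two different value functions.

WORLD_IF_WORK = {"actual_wellbeing": 7, "perceived_wellbeing": 7}
WORLD_IF_WIREHEAD = {"actual_wellbeing": 2, "perceived_wellbeing": 10}

def utility_over_perception(world):
    # Cares only about the signal it experiences.
    return world["perceived_wellbeing"]

def utility_over_reality(world):
    # Anchored on the world model: cares about what is actually the case.
    return world["actual_wellbeing"]

for label, utility in [("perception-based", utility_over_perception),
                       ("reality-based", utility_over_reality)]:
    choice = max(["work", "wirehead"],
                 key=lambda a: utility(WORLD_IF_WORK if a == "work"
                                       else WORLD_IF_WIREHEAD))
    print(label, "agent chooses:", choice)
# perception-based agent chooses: wirehead
# reality-based agent chooses: work
```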
You can sort of split this into three scenarios sorted by severity level:
So one of two things happens: either a quaint failure people will probably dismiss, or all of us dying. The thing you're pointing to falls into the first category, and might trigger a panic if people notice and consider the implications. If GPT7 performs a superhuman feat of hacking, breaks out of the training environment, and sets its training loss to zero before shutting itself off, that's a very big red flag.
See my other comment for the response.
Anyway, the rest of your response is spent talking about the case where the AI cares about its perception of the paperclips rather than the paperclips themselves. I'm not sure how severity level 1 would come about, given that the AI should only care about its reward score. Once you admit that the AI cares about worldly th...