Does a paperclip-maximizing AI care about the actual number of paperclips being made, or does it just care about its perception of paperclips?
If the latter, I feel like this contradicts some of the AI doom stories: each AI shouldn’t care about what future AIs do (and thus there is no incentive to fake alignment for the benefit of future AIs), and the AIs also shouldn’t care much about being shut down (the AI is optimizing for its own perception; when it’s shut off, there is nothing to optimize for).
If the former, I think this makes alignment much easier. As long as you can reasonably represent “do not kill everyone”, you can make this a goal of the AI, and then it will literally care about not killing everyone; it won’t just care about hacking its reward system so that it never perceives everyone being dead.
Both are possible. For theoretical examples, see the stamp collector for consequentialist AI and AIXI for reward-maximizing AI.
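To make the distinction concrete, here is a rough sketch of the two objectives (my own framing, not taken from either post): a reward-maximizer picks a policy to maximize expected reward, where each reward term is computed from the agent’s own percepts, while a consequentialist maximizer picks a policy to maximize a utility function defined over the actual world state:

$$\pi^{*}_{\text{reward}} = \arg\max_{\pi}\; \mathbb{E}\Big[\textstyle\sum_{t} r_t(o_t)\;\Big|\;\pi\Big], \qquad \pi^{*}_{\text{conseq}} = \arg\max_{\pi}\; \mathbb{E}\big[\,U(s)\;\big|\;\pi\,\big].$$

The first agent is indifferent to tampering with whatever produces its observations $o_t$ (the “perception of paperclips”); only the second has any reason to care how many paperclips actually exist in the world state $s$ once its sensors are fooled.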
What kind of AI are the AIs we have now? Neither; they're not particularly strong maximizers. (If they were, we'd be dead: it's not that difficult to turn a powerful reward maximizer into a world-ending AI.)
This would be true, except:
I think this goes to Matthew Barnett’s recent article arguing that actually, yes, we do. And regardless, I don’t think this point is a big part of Eliezer’s argument. https://www.lesswrong.com/posts/i5kijcjFJD6bn7dwq/evaluating-the-historical-value-misspecification-argument
Yeah so I think this is the crux of it. My point is that if we find some training approach that leads to a model that cares about the ...