JoshuaZ comments on [LINK] Wait But Why - The AI Revolution Part 2 - Less Wrong Discussion
You are viewing a comment permalink. View the original post to see all comments and the full post content.
Comments (87)
You have a complicated goal system that can distinguish between short-term rewards and other goals. In the situations in question, the AI has no goal other than the goal in question. To some extent, your stability arises precisely because you are an evolved hodgepodge of different goals in tension; if you weren't, you wouldn't survive. But note that similar, essentially involuntary self-modification does on occasion happen with some humans- severe drug addiction is the most obvious example.
But the goal in question is "get the reward," and it's only by controlling the circumstances under which the reward is given that we can shape the AI's behavior. Once the AI is capable of taking control of the trigger, why would it leave it the way we've set it? Whatever we've got it set to is almost certainly not optimized for triggering the reward.
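A toy sketch of that point, with entirely hypothetical names and reward values (none of this is from the thread): while the designers control the reward channel, the assigned task is the best-paying action, but as soon as "seize the trigger" is on the menu it strictly dominates, so a pure reward maximizer takes it.

```python
# Hypothetical toy model of a pure reward maximizer and its reward trigger.
REWARD_CEILING = 100.0  # assumed maximum the reward channel can emit

def expected_reward(action, controls_trigger):
    """Reward under the designers' schedule vs. after seizing the trigger."""
    if controls_trigger and action == "set_trigger_to_max":
        return REWARD_CEILING   # wireheading: reward no longer tracks the task
    if action == "do_assigned_task":
        return 10.0             # whatever payout the designers chose
    return 0.0

def choose_action(available_actions, controls_trigger):
    # A pure reward maximizer: pick whichever action yields the most reward.
    return max(available_actions,
               key=lambda a: expected_reward(a, controls_trigger))

print(choose_action(["do_assigned_task"], controls_trigger=False))
# -> do_assigned_task
print(choose_action(["do_assigned_task", "set_trigger_to_max"], controls_trigger=True))
# -> set_trigger_to_max
```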
If that happens, you will then have the problem of an AI that tries to wirehead itself while simultaneously trying to control its future light-cone to make sure that nothing stops it from continuing to wirehead.
That sounds bad. It doesn't seem obvious to me that reward seeking and reward optimizing are the same thing, but maybe they are. I don't know and will think about it more. Thank you for talking through this with me this far.