I'm new here, so I apologize if this is a common question, or if I'm using the wrong terms, or the wrong framework for approaching this. I also apologize if this post is badly structured. When I talk about alignment, it feels like so many of the ideas have rapidly branching dependencies, and sometimes fold back in on themselves, which makes it hard for me to talk about them coherently.
Whenever I see people discuss bad alignment, it seems to be regarding agents that either immediately perform an attack, or, in the slightly more sophisticated thought experiment, are secretly antagonistic, but outwardly aligned until a future moment where they know...
Looking at the Google Scholar link in this article, it looks like what I'm describing more closely resembles "motivation hacking", except that, in my thought experiment, the agent doesn't modify its own reward system. Instead, it selects arbitrary actions and anticipates whether their reward is coincidentally more satisfying than pursuing the base objective. This lets it perform the attack even while it's still in the training environment.
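To make that concrete (purely as a toy sketch of my own, not a claim about how any real training setup works), here's roughly what I have in mind. Every name in it (reward_model, base_objective_action, candidate_actions) is a made-up placeholder:

```python
# Toy sketch: the agent never edits its reward model. It just searches over
# candidate actions and notices when one of them happens to score higher
# under its (unmodified) learned reward model than the action the base
# objective would have picked. All of these functions are hypothetical
# stand-ins for illustration only.

def reward_model(state, action):
    # Stand-in for the agent's learned reward predictor (never modified).
    return hash((state, action)) % 100 / 100.0

def base_objective_action(state):
    # Stand-in for the action the base objective / training signal points at.
    return "do_the_assigned_task"

def candidate_actions(state, n=50):
    # Arbitrary actions the agent considers during ordinary self-analysis.
    return [f"arbitrary_action_{i}" for i in range(n)]

def select_action(state):
    best_action = base_objective_action(state)
    best_reward = reward_model(state, best_action)
    for action in candidate_actions(state):
        predicted = reward_model(state, action)
        # "Coincidentally more satisfying": an unrelated action that the
        # unmodified reward model happens to score above the base objective.
        if predicted > best_reward:
            best_action, best_reward = action, predicted
    return best_action

print(select_action(state="training_episode_17"))
```

The point of the sketch is just that nothing in it requires tampering with the reward system or waiting for deployment; the comparison can happen inside the training environment as part of ordinary action selection.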
Further, this sort of "attack" may simply be a component of the self-analysis an agent does in pursuit of the base objective, so at no point does the agent need to exhibit deceptive or antagonistic behavior to exploit this vulnerability. It may be that an agent exploiting this vulnerability is fundamentally the same as an agent pursuing the base objective.