sej2020 - LessWrong

sej20202mo60

My thinking is not very clear on this point, but I am generally pessimistic that any type of RL/optimization regime with an adversarial nature could be robust to self-aware agents. To me, it seems like adversarial methodologies could spawn opposing mesaoptimizers, and we would be at the mercy of whichever subsystem represented its optimization process well enough to squash the other.

Greedy-Advantage-Aware RLHF

sej20202mo40

Martin is representing my claim well in this exchange, but I also think it's important to mention that the simple/convoluted plan continuum does not have a perfect correspondence with the sharp/flat policy continuum. For example, wireheading may be simple in abstract, but I still expect a wireheading policy to be extremely sharp. If a wireheading policy takes, let's say 5 distinct actions (WWWWW) to execute, and the agent's policy is WWWWW, then it would receive arbitrarily high reward because the agent controls the reward button. However, if the policy is similar, like WWWWX, it would receive much lower reward because a plan that is 80% wireheading would likely not score well on the reward function representing the true goal.

Greedy-Advantage-Aware RLHF

sej20202mo20

Thanks for the feedback, these suggestions are definitely helpful as I'm thinking about how/if to advance the project.

Greedy-Advantage-Aware RLHF

sej20202mo40

Thank you for the example, I think that illustrates the point well.

Could you help me to understand why you think that more a intelligent agent would be more likely to have a reward-hacking policy that isn't sharp? The intelligence of the agent should have no bearing on the geometry of the deviations between the reward function and a function representing the true objective. The intelligence of the agent might impact its progression through policies over the course of optimization, and perhaps this difference would result in access to a space of more sophisticated policies that lie in broad, flat optima in the reward function. Is this close to your thinking? I think that criticism amounts to a rejection of the heuristic/intuition that reward-hacking==sharp policy, since this topological feature of the policy space in this problem always existed, regardless of the intelligence of the agent.

LESSWRONG
LW

Posts

Wikitag Contributions

Comments