Goodhart's Law in Reinforcement Learning
Produced as part of the OxAI Safety Labs program, mentored by Joar Skalse.

TL;DR

This is a blog post introducing our new paper, "Goodhart's Law in Reinforcement Learning" (to appear at ICLR 2024). We study Goodhart's law in RL empirically, provide a geometric explanation for why it occurs, and use these insights to derive two methods for provably avoiding Goodharting. Here, we only include the geometric explanation (which can also be found in Appendix A of the paper) and an intuitive description of the early-stopping algorithm. For the rest, see the linked paper.

Introduction

Suppose we want to optimise for some outcome, but we can only measure an imperfect proxy that is correlated, to a greater or lesser extent, with that outcome. A prototypical example is that of school: we would like children to learn the material, but the school can only observe the imperfect proxy consisting of their grades. When there is no pressure exerted on students to do well on tests, the results will probably reflect their true level of understanding quite well. But this changes once we start putting pressure on students to get good grades. Then the correlation usually breaks: students learn "for the test", cheat, or use other strategies which get them better marks without necessarily increasing their true understanding.

That "any observed statistical regularity will tend to collapse once pressure is placed upon it for control purposes" (or "when a measure becomes a target, it ceases to be a good measure") is called Goodhart's law.

Graphically, plotting the true reward and the proxy reward obtained under increasing optimisation pressure, we might therefore expect to see something like the following graph:

In machine learning, this phenomenon has been studied before. We have a long list of examples where agents discover surprising ways of getting reward without increasing their true performance. These examples, however, have not necessarily been quantified over increasing optimisation pressure.
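To make the shape of that curve concrete, here is a minimal toy sketch (not the setup or numbers from our paper) of the school example: three hypothetical strategies whose proxy reward (the grade) and true reward (actual understanding) come apart, with a softmax over the proxy whose inverse temperature plays the role of optimisation pressure.

```python
# Toy illustration of the Goodhart curve (hypothetical numbers, not from the paper).
import numpy as np

# Hypothetical strategies: slack off, study, teach-to-the-test / cheat.
proxy_reward = np.array([0.0, 1.0, 1.2])   # grade (the measured proxy)
true_reward  = np.array([0.0, 1.0, 0.0])   # actual understanding

for beta in [0.0, 1.0, 2.0, 3.0, 5.0, 10.0, 20.0]:
    # Softmax policy over strategies: higher beta = more optimisation
    # pressure, i.e. more weight on high-proxy strategies.
    logits = beta * proxy_reward
    policy = np.exp(logits - logits.max())
    policy /= policy.sum()
    print(f"beta={beta:5.1f}  proxy={policy @ proxy_reward:.3f}  "
          f"true={policy @ true_reward:.3f}")
```

Running this, the proxy reward increases monotonically with beta, while the true reward first rises (pressure pushes weight away from slacking off) and then collapses (pressure concentrates weight on cheating), giving the qualitative Goodhart shape described above.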