Goodhart's law seems to suggest that errors in utility or reward function specification are necessarily bad, in the sense that an optimal policy for the incorrect reward function would result in low return according to the true reward. But how strong is this effect?
Suppose the reward function were only slightly wrong. Can the resulting policy be arbitrarily bad according to the true reward, or is it only slightly worse? It turns out the answer is "only slightly worse" (for the appropriate definition of "slightly wrong").
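To make the claim concrete before the formal setup, here is a minimal one-step sketch (the specific numbers and the per-action error bound ε are assumptions for illustration, not part of the argument below): if a proxy reward is within ε of the true reward on every action, then the action that looks best under the proxy loses at most 2ε of true reward.

```python
import numpy as np

# Hypothetical numbers: three actions with true rewards, plus a proxy reward
# that is wrong by at most eps on each action.
eps = 0.05
true_reward = np.array([1.00, 0.98, 0.30])
proxy_reward = np.array([0.97, 1.01, 0.33])   # each entry within eps of the truth

best_under_proxy = int(np.argmax(proxy_reward))   # what an optimizer of the proxy picks
best_under_truth = int(np.argmax(true_reward))

loss = true_reward[best_under_truth] - true_reward[best_under_proxy]
print(f"true-reward loss from optimizing the proxy: {loss:.3f} "
      f"(bounded by 2*eps = {2 * eps:.3f})")
```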
Definitions
Consider a Markov Decision Process (MDP) M=(S,A,T,R∗) where
- S is the set of states,
- A is the set of actions,
- T:S×A×S→[0,1] gives the conditional transition probabilities, and
- R∗ is the true reward function.
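As a concrete tabular sketch of this tuple (the array shapes, the state-based reward, and the discount factor γ are assumptions for illustration, not part of the definition above), one can store T as an |S|×|A|×|S| array of probabilities and evaluate a policy's return under the true reward by solving the Bellman equations:

```python
import numpy as np

# A tabular sketch of M = (S, A, T, R*); shapes, gamma, and the random numbers
# are assumptions for illustration.
n_states, n_actions = 3, 2
rng = np.random.default_rng(0)

# T[s, a, s'] = probability of moving to s' after taking action a in state s.
T = rng.random((n_states, n_actions, n_states))
T /= T.sum(axis=2, keepdims=True)    # normalize so each (s, a) slice is a distribution

R_true = rng.random(n_states)        # R*(s), assumed state-based for simplicity
gamma = 0.9                          # assumed discount factor

def policy_return(policy, R, T=T, gamma=gamma):
    """Expected discounted return of a deterministic policy (state -> action)
    under reward R, by solving the linear Bellman equations V = R + gamma * P_pi @ V."""
    P_pi = T[np.arange(len(policy)), policy]      # |S| x |S| transitions under the policy
    return np.linalg.solve(np.eye(len(policy)) - gamma * P_pi, R)

# Example: the return of the policy that always takes action 0.
print(policy_return(np.zeros(n_states, dtype=int), R_true))
```

The quantity of interest throughout is exactly this kind of return, computed under the true reward R∗ for a policy that was optimized against a possibly incorrect reward.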