For example, an RL agent that learns a policy that looks good to humans but isn't actually good. Adversarial examples that only fool a neural net wouldn't count.
Could you clarify this a bit? I assume you are thinking about subsets of specification gaming that would not be obvious if they were happening?
If so, then I guess adversarial examples in image classification come to mind, which fit specification gaming pretty well and required quite a large literature to understand.
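To make the "fooling only the network" case concrete, here is a minimal FGSM-style sketch (Goodfellow et al.'s fast gradient sign method), assuming a PyTorch classifier; the function name and epsilon value are just placeholders for illustration:

```python
import torch
import torch.nn.functional as F

def fgsm_adversarial_example(model, image, label, epsilon=0.03):
    """Perturb the input in the direction that increases the classifier's loss,
    flipping its prediction while the change stays imperceptible to a human."""
    image = image.clone().detach().requires_grad_(True)
    loss = F.cross_entropy(model(image), label)
    loss.backward()
    # Step each pixel by epsilon along the sign of the loss gradient.
    perturbed = image + epsilon * image.grad.sign()
    return perturbed.clamp(0, 1).detach()
```

The point being that the perturbation exploits the model's learned decision boundary rather than anything a human evaluator would notice, which is why it seems like a different failure mode than a policy that looks good to us but isn't.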