I - Release announcement
I have just posted a very thorough edit-slash-rewrite of my Reducing Goodhart sequence. Many thanks to people who gave me feedback or chatted with me about related topics, and many apologies to the people who I told to read this sequence "but only after I finish editing it real quick."
If you're interested in the "why" and "what" questions of highly intelligent AIs learning human values, I can now recommend this sequence to you without embarrassment. And if you skimmed the old sequence but don't really remember what it was about, this is a great time to read the new and improved version.
II - Super short pitch
I worked on this because I was tired of talking nonsense about what it would mean for a value learning AI to "beat Goodhart's law." If you would also like to not talk nonsense about beating Goodhart's law and learning human values, this sequence will help.
III - Executive summary, superfluous for AF regulars
What does it mean for an AI to learn human preferences and then satisfy them? The intuitive way to approach this question is to treat "human preferences" as fixed facts that the AI is supposed to learn, but it turns out this is an unproductive way to think about the problem. Instead, it's better to treat humans as physical systems. "Human preferences" are parts of the models we build to understand ourselves.
Depending on how an AI models the world, it might infer different human preferences from the same data - you can say a reluctant addict either wants heroin or doesn't without disputing any of the raw data, just by changing perspective. This makes it important that value learning AI models humans the way we want to be modeled. How we want to be modeled is itself a fact about our preferences that has to be learned by interacting with us.
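To make that concrete, here's a minimal Python sketch - my own toy illustration, not code from the sequence. The same observed behavior yields different inferred preferences depending on which model of the human the value learner assumes; the `compulsion_rate` parameter is entirely made up for the illustration.

```python
# Toy sketch: the same observations, two different human models,
# two different inferred "preferences."

# Observed data: the person takes heroin 9 times out of 10 when offered.
observations = {"take": 9, "refuse": 1}

def infer_preference_rational(obs):
    """Model A: actions straightforwardly reveal preferences.
    Inferred preference for heroin = fraction of times it was chosen."""
    total = obs["take"] + obs["refuse"]
    return obs["take"] / total

def infer_preference_with_compulsion(obs, compulsion_rate=0.85):
    """Model B: some fraction of 'take' actions are attributed to
    compulsion rather than preference (compulsion_rate is a made-up
    parameter of this toy model, not anything principled)."""
    voluntary_takes = obs["take"] * (1 - compulsion_rate)
    return voluntary_takes / (voluntary_takes + obs["refuse"])

print(f"Model A infers preference for heroin: {infer_preference_rational(observations):.2f}")
print(f"Model B infers preference for heroin: {infer_preference_with_compulsion(observations):.2f}")
# Same data, different conclusions (0.90 vs. ~0.57): the disagreement lives
# entirely in the choice of human model, not in the observations.
```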
A centerpiece of this sequence is Goodhart's law. Treating humans as physical systems and human preferences as emergent leads to a slightly unusual definition of Goodhart's law: "When you put pressure on the world to make it extremely good according to one interpretation of human values, this is often bad according to other interpretations."
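As a toy illustration of that definition (again my own sketch, not an example from the sequence): take two "interpretations" of value that roughly agree on typical states of the world, then apply heavy optimization pressure to only one of them.

```python
# Toy sketch: two interpretations of value that agree on-distribution,
# but hard optimization of one lands somewhere the other rates as terrible.

import random

def interp_a(x):
    # Interpretation A: more of the proxy quantity is always better.
    return x

def interp_b(x):
    # Interpretation B: agrees for moderate x, but rates extreme x badly.
    return x - 0.02 * x ** 2

random.seed(0)
print("On-distribution, the interpretations roughly agree:")
for x in [random.uniform(0, 10) for _ in range(5)]:
    print(f"  x={x:5.2f}  A={interp_a(x):6.2f}  B={interp_b(x):6.2f}")

# Now put strong optimization pressure on interpretation A alone.
candidates = [random.uniform(0, 1000) for _ in range(10000)]
best_for_a = max(candidates, key=interp_a)
print(f"\nAfter optimizing A hard: x={best_for_a:.1f}")
print(f"  A rates it {interp_a(best_for_a):.1f}, B rates it {interp_b(best_for_a):.1f}")
# B's score comes out hugely negative: extremely good according to one
# interpretation, bad according to the other.
```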
This perspective helps us identify bad behavior that's relevant to Goodhart's law for value learning AI. We should build value learning AI that is sensitive to the broad spectrum of human values, that allows us to express our meta-preferences, and that is conservative about pushing the world off-distribution, in addition to avoiding blatant harm to humans.
If this summary sounds relevant to your interests, consider reading the whole sequence.