Stuart_Armstrong comments on The mathematics of reduced impact: help needed - Less Wrong
You are viewing a comment permalink. View the original post to see all comments and the full post content.
You are viewing a comment permalink. View the original post to see all comments and the full post content.
Comments (94)
Beware Goodhart's Law: you're setting rules of the game that the "disciple AI" has an incentive to subvert. Essentially, you're specifying a wish, and normally your ability to evaluate a wish is constrained by your ability to consider and (morally) evaluate all the possible consequences (strategies) in detail. An AI might find a strategy that, while satisfying your wish, would be disastrous (which might win the AI a prize so insignificant it'd never rise to your attention).
One consideration is the amount of information in the coarse graining measures: we could set it up so there are more measurements made than there are bits in the disciple AI's source code. Not a guarantee of anything, of course, but Goodhart's law mainly derives from how short the success indicator is compared with the phenomena it's trying to measure, so hence subverting the law is easier than improving the phenomena.