I expect the main problem behind Goodhart's law is this: you want an indicator to accurately reflect some state of the world, and once the indicator becomes decoupled from that state, it stops tracking changes in the world. That is how I unpack the term 'good' (a term I dislike) in the usual statement of the law. People want a thermometer to accurately reflect the patterns they call 'temperature' so that they can better predict the future; if the thermometer stops reflecting the temperature, those predictions suffer.
A problem I have with this reinterpretation is that "state of the world" is too broad. In looking at a thermometer, I am not trying to understand the entire world-state (and the thermometer also couldn't be decoupled from the entire world-state, since it is a part of the world).
A more accurate way to remove "good" would be as follows:
In everyday life, if a human is asked to make a (common, everyday) judgement based on appearances, then the judgement is probably accurate. But if we start optimizing really hard based on their judgement, Goodhart's Law kicks in.
Ah, yeah, sorry. I do think about this distinction more than I think about the actual model-based vs model-free distinction as defined in ML. Are there alternative terms you'd use if you wanted to point out this distinction? Maybe policy-gradient vs ... not policy-gradient?
In machine-learning terms, this is the difference between model-free learning (reputation based on success/failure record alone) and model-based learning (reputation can be gained for worthy failed attempts, or lost for foolish lucky wins).
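To make the contrast concrete, here's a minimal sketch in Python. The update rules, names, and weights are invented for illustration; they're not anyone's actual proposal, just one way to cash out "outcome only" versus "outcome plus a model of the attempt":

```python
def model_free_update(reputation: float, succeeded: bool) -> float:
    """Reputation tracks the success/failure record alone."""
    return reputation + (1.0 if succeeded else -1.0)


def model_based_update(reputation: float, succeeded: bool,
                       attempt_was_worthy: bool) -> float:
    """Reputation also credits the evaluator's model of the attempt:
    a worthy failure can gain, a foolish lucky win can lose."""
    outcome = 1.0 if succeeded else -1.0
    judgement = 1.0 if attempt_was_worthy else -1.0
    # Illustrative weighting: the judgement of the attempt outweighs
    # the raw outcome, so a worthy failure nets +0.2 and a foolish
    # lucky win nets -0.2.
    return reputation + 0.4 * outcome + 0.6 * judgement
```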
There's unpublished work about a slightly weaker logical induction criterion which doesn't have this property (there exist constant-distribution inductors in this weaker sense), but which is provably equivalent to the regular LIC whenever the inductor is computable.[1] To my eye, the weaker criterion is more natural. The basic idea is that this weird trader shouldn't count as raking in the cash. The regular LIC (we can call it "strong LIC" or SLIC) counts traders as exploiting the market if there is a sequence of worlds in which their wealth grows unboundedly. This allows for the trick you quote: buying up larger and larger piles of sentences in diminishingly-probable worlds counts as exploiting the market.
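For reference, here is a rough LaTeX sketch of the published (strong) exploitation condition, with notation loosely following the logical induction paper (Garrabrant et al. 2016); this elides the definitions of traders, markets, and deductive processes, and since the weak version is unpublished I can only gesture at it:

```latex
% Strong exploitation, roughly as in Garrabrant et al. (2016):
% a trader \bar{T} exploits the market \bar{P} relative to a
% deductive process \bar{D} iff its set of plausible net worths
\[
  \left\{ \mathbb{W}\!\left(\sum_{i \le n} T^i(\bar{P})\right)
    \;\middle|\; n \in \mathbb{N}^{+},\ \mathbb{W} \in \mathcal{PC}(D_n) \right\}
\]
% is bounded below but unbounded above. The trick quoted above lives
% in the quantifier over plausible worlds \mathbb{W}: wealth only has
% to grow unboundedly in some sequence of worlds, not be realized.
```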
The weak LIC (WLIC) says instead that traders have to actually make the money in order to count as exploiting the market.
Thus the limit of a logical inductor can count as a (weak) logical inductor, just not a computable one.
[1] Roughly speaking. This is not quite an adequate description of the theorem.
Good citation. Yeah, I should have flagged harder that my description there was a caricature and not what anyone said at any point. I still need to think more about how to revise the post to be less misleading in this respect.
One thing I can say is that the reason that quote flags that particular failure mode is that, according to the MIRI way of thinking about the problem, it is an easy failure mode to fall into.
I agree that:
I once had the experience of a roommate asking if I believe in love. I said yes, absolutely, explaining that it is a real human emotion (I said something about chemicals in the brain). He responded: "It sounds like you don't believe in love!"
(I think this has to do with ambiguity about believe-in as much as about love, to be fair.)
So, a question: is 'love' worth saving as a concept? The way some people use the word might be pretty terrible, but should we try to rescue it? Is there a good way to use it which recovers many aspects of the common use, while restoring coherence?
I do personally feel that there is some emotional core to love, so I'm sympathetic to the "it's a specific emotion" definition. This accords with people not being able to give a specific definition. Emotions are feelings, so they're tough to define. They just feel a specific way. You can try to give cognitive-behavioral definitions, and that's pretty useful; but, for example, you could show behavioral signs of being afraid without actually experiencing fear.
I'm also sympathetic to the view that "love" is a kind of ritual, with saying "I love you" being the centerpiece of the ritual, and a cluster of loving behaviors being the rest of it. "I love you" can then be interpreted as a desire or intention or commitment to participate in the rest of it.
I would strongly guess that many people could physically locate the cringe pain, particularly if asked when they're experiencing it.
The point of Goodhart's Law is that you can only select for what you can measure. The burger is a good analogy because Instagram can't measure taste or nutrition, so when Instagram is what optimizes burgers, you get burgers with a very appealing appearance but non-optimized taste and nutrition. If you have the ability to measure taste, then you can create good taste, but you run into subtler examples of Goodhart (EG, Starbucks coffee is optimized to taste good to their professional tasters, which is slightly different from tasting good to a general audience).
Just specifying the variable you're interested in doesn't solve this problem; you also have to figure out how to measure it. The problem is that measurements are usually at least slightly statistically distinct from the actual target variable, so that the statistical connection can fall apart under optimization.
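As a toy illustration of that statistical decoupling (my own construction, not from the post): a proxy that correlates well with the target on average still underdelivers on the target once you select hard on the proxy.

```python
import numpy as np

rng = np.random.default_rng(0)

# Measurement = target + independent noise, so the two correlate well.
n = 100_000
target = rng.normal(size=n)
proxy = target + rng.normal(scale=0.5, size=n)

print(np.corrcoef(target, proxy)[0, 1])  # ~0.89: looks like a fine measure

# Optimizing hard on the proxy selects for the noise along with the target.
top = np.argsort(proxy)[-100:]   # the 100 best-looking options
print(proxy[top].mean())         # very high proxy scores...
print(target[top].mean())        # ...but the true target regresses toward
                                 # E[target | proxy] = 0.8 * proxy here
```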
I also take issue with describing optimizing the appearance of the burger as "narrower" than optimizing the burger quality. In general it is a different task, which may be narrower or broader.