> In everyday life, if something looks good to a human, then it is probably actually good (i.e. that human would still think it’s good if they had more complete information and understanding). Obviously there are plenty of exceptions to this, but it works most of the time in day-to-day dealings. But if we start optimizing really hard to make things look good, then Goodhart’s Law kicks in. We end up with instagram food - an elaborate milkshake or salad or burger, visually arranged like a bouquet of flowers, but impractical to eat and kinda mediocre-tasting.

— *Why Agent Foundations? An Overly Abstract Explanation*

As I see it, the core problem behind Goodhart's law is this: you want an indicator to accurately track the state of the world, but once the indicator becomes decoupled from the world, it stops reflecting changes in the world. That is how I interpret the term 'good,' a term I dislike. People want a thermometer to accurately track the patterns they call 'temperature' so they can better predict the future — if the thermometer stops reflecting the temperature, the predictions suffer.

Now back to the burger example. Suppose the operator of a neural network starts optimizing certain parameters so that a burger picture increases the café's profit. Suppose several parameters are optimizable from the start: how recognizable the burger is, the anticipated 'sense of pleasure' on viewing, the presence of the necessary ingredients, a non-irritating background, clear visibility of the image, and so on. If we are solving the task 'increase sales from a picture,' we are not solving the problem of feeding the hungry; we are solving a narrower task — which means optimizing the taste of the burger may simply not be needed for this task. For example, if we also optimize for reducing the time spent on the task, we can neglect the effort of fixing one of the variables.

In this example, the task was not to create the most appealing burger picture while simultaneously maximizing the taste and the convenience of eating it. That would be a different function.
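A minimal sketch of the point above (all variable names and weights are my own invention, not from any real system): an objective that scores the picture only on the listed variables simply never looks at taste, so optimization is free to leave taste anywhere.

```python
# Hypothetical sketch: the narrow 'increase sales from a picture' objective.
# Parameter names and weights are illustrative only.

def picture_score(params):
    """Score a burger picture on appearance-related variables only."""
    return (
        2.0 * params["recognizability"]
        + 3.0 * params["anticipated_pleasure"]
        + 1.0 * params["ingredient_visibility"]
        + 1.0 * params["background_calmness"]
        + 1.0 * params["image_clarity"]
        # Note: 'taste' is simply absent from the formula — the objective
        # cannot reward or punish it, so optimization ignores it.
    )

tasty = {"recognizability": 1.0, "anticipated_pleasure": 1.0,
         "ingredient_visibility": 1.0, "background_calmness": 1.0,
         "image_clarity": 1.0, "taste": 0.9}
bland = dict(tasty, taste=0.1)  # mediocre taste, identical score

assert picture_score(tasty) == picture_score(bland)
```

The assertion passes because taste never enters the formula — which is exactly the 'different function' point: taste was not part of this task's objective, so its absence from the result is unsurprising.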

If you really were solving the narrower task — creating the most pleasure-inducing picture while maximizing the other listed parameters — and then looked back, puzzled as to why this procedure didn't feed the hungry, then dragging Goodhart's law into the discussion is madness; it stresses me out. The variable 'people are hungry' was simply not part of this task. Oh, it was important to you? Then why didn't you specify it? Because it's 'obvious'?

The hungry people in my analogy stand for the variable 'mediocre taste' in the task of a 'pleasure-inducing picture.' It is an extra variable relative to the original task. Why bring Goodhart's law into this?

Original Goodhart's Law: Any observed statistical regularity will tend to collapse once pressure is placed upon it for control purposes.

There's no word 'GOOD' in it at all.

I have a hypothesis for why it was brought in — due to confusion with the word 'good.'

'If something looks good to a person, then it is probably truly good.'

Here I interpret the umbrella term 'good' as the human intuition that the burger will satisfy them on all essential parameters. But 'looks good' narrows our view to the single variable 'appearance,' while 'truly good' I decipher as 'I am satisfied with most of the variables important for my task, not just the variable "pleasant appearance."'

My replacement now looks like this: If a person signals a high 'looks good' parameter, then it is likely that they will be satisfied with other parameters of the item if they learn their values.

Once translated this way, the statement becomes a testable hypothesis, and I think the claim 'in most cases, it holds true in everyday life' now crumbles as a reliable predictor. All I did was taboo the word 'good.' The claim will NOT always hold, especially where optimizing appearance hides shortcomings in other parameters.

I expect the author would not have arrived at their original thesis if they had tabooed the word 'good' and replaced it with the variables they meant.

I expect that most people who wanted a real diamond that 'looks nice,' and later found out it was fake, would revise their judgment of 'good' — in most cases, not in a minority of them.


I remind you that the original Goodhart's Law is about the collapse of a statistical regularity once it ceases to be coupled with reality.

If an employee receives a reward for the number of cars sold each month, they will try to sell more cars even at a loss.

This scenario would not occur if the worker maximized not only the variable 'number of cars' but also the variable 'profit.' That variable could have been included from the start; the condition of mandatory profit maximization would have complicated 'Goodharting' on the number of cars.
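The car example can be sketched as two reward formulas (both are my own illustrative constructions): under the first, loss-making sales still pay; under the second, profit enters the formula and flips the ordering.

```python
# Hypothetical sketch: two reward schemes for a car salesperson.

def reward_cars_only(cars_sold, profit_per_car):
    # Profit is invisible to this reward: selling at a loss still pays.
    return cars_sold

def reward_with_profit(cars_sold, profit_per_car):
    # Profit enters the formula, so loss-making sales are penalized.
    return cars_sold * profit_per_car

# 100 cars sold at a $500 loss each vs. 50 cars at a $300 profit each:
assert reward_cars_only(100, -500) > reward_cars_only(50, 300)
assert reward_with_profit(100, -500) < reward_with_profit(50, 300)
```

The same salesperson behavior scores oppositely under the two formulas — the 'Goodhart' on car count exists only while profit is left out of the reward.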

There is no reason to be surprised if, when optimizing the task 'pleasure-inducing burger picture,' you did not include the variables 'physical pleasantness of the burger's taste' and 'convenience of eating the physical burger.' If you had included them, I expect the problem would disappear, because now they too would be optimized.

To dissolve Goodhart's law in such scenarios, it is enough to add the variables you mentally filed under the umbrella term 'good' but forgot to include in the original optimization formula. Then you won't be surprised that a variable you expected under 'good' wasn't optimized — because you didn't include it!

How do you decide in advance which variables to add? Spend cognitive resources (or use other people's) to model the horrifying stress that awaits you if the goal is met differently from how you imagined it, and identify which variable changes would cancel that outcome.

If the car-selling employer had modeled in advance that an employee would start optimizing the number of cars for salary, they would have added a new variable: profit. One reason they didn't could be that they never brainstormed this failure mode — in which case the answer is: brainstorm failure modes.


Suppose you maximized politeness when designing GPT-4 but then noticed some 'Goodhart': GPT maximizes politeness in form, yet you detect passive aggression or veiled insults. That is your responsibility for cutting corners and hiding several implicit expectations about other variables behind the umbrella word 'politeness' — variables GPT doesn't know about. Think of those variables in advance and specify them better, since you're such a reductionist afraid of Goodhart. This is a solvable problem: adding more variables changes the outcome, so add more. No wonder you failed with 'Goodhart' if you make requests like 'do well.'
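A sketch of the politeness case (replies, scores, and variable names are all invented for illustration — this is not how any real system is evaluated): scoring on 'polite form' alone prefers the veiled insult; unpacking the implicit variables flips the preference.

```python
# Hypothetical sketch: 'politeness' as one variable vs. unpacked variables.
# All scores are invented for illustration.

replies = {
    "sincere":   {"polite_form": 0.9, "no_passive_aggression": 0.9,
                  "no_veiled_insult": 0.9},
    "sarcastic": {"polite_form": 1.0, "no_passive_aggression": 0.1,
                  "no_veiled_insult": 0.2},
}

def score_form_only(r):
    return r["polite_form"]  # the umbrella word 'politeness', as form

def score_unpacked(r):
    # Weakest-link scoring: any hidden variable can sink the reply.
    return min(r.values())

# Form-only scoring prefers the sarcastic reply; unpacking flips it.
assert score_form_only(replies["sarcastic"]) > score_form_only(replies["sincere"])
assert score_unpacked(replies["sincere"]) > score_unpacked(replies["sarcastic"])
```

The min-aggregation here is one design choice among many; the point is only that once the implicit variables appear in the formula at all, the 'Goodharted' reply stops winning.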


If someone comes to a pharmacy and says 'give me a good medicine,' it can be stated post factum that they will only be satisfied if the medicine matches hidden variables 1, 5, 6, and 9. These four variables were packed into the word 'good,' and the seller must guess them from context. But here's the issue: the seller guessed 1 and 5, didn't guess 6 and 9, and assumed 2 and 4 instead. Are the universes different? Yes. Are the consequences different? Yes. To avoid this, variables are usually clarified.

If the buyer assumed that 'good' = 1, 5, 6, and 9 is COMMON KNOWLEDGE, then they were WRONG.
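The mismatch is just set arithmetic (a sketch; the numbered variables are the placeholders from the example above):

```python
# Hypothetical sketch: buyer's hidden criteria vs. the seller's guess.
buyer_good = {1, 5, 6, 9}    # what the buyer privately means by 'good'
seller_guess = {1, 5, 2, 4}  # what the seller inferred from context

satisfied = seller_guess >= buyer_good  # were all hidden criteria met?
missed = buyer_good - seller_guess      # criteria that were never communicated

assert not satisfied
assert missed == {6, 9}
```

Nothing in the word 'good' transmits the set {1, 5, 6, 9}; unless `missed` is emptied by explicit clarification, the two parties end up in different universes.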

-  You, the seller, Goodharted 1 and 5, but what about 6 and 9? 

-  Maybe you should have made 6 and 9 COMMON KNOWLEDGE? 

-  Well, it's obvious that 'good' includes 6 and 9. 

-  THIS IS WHY ALIGNMENT (among other things) IS UNSOLVED!

I expect that many similar problems would be solved by removing the word 'good' altogether and replacing it with variables — and if you can't replace it with variables right now, then expect problems of this kind.

Make 6 and 9 common knowledge! An LLM won't PARSE your 6 and 9!

Are you too lazy to break it down into variables, wanting to save effort and just write 'good'? Then accept your 'Goodhart.'
