
Is Goodhart's Curse Not Really That Bad?

EDIT: It's bad. Still, it's good to understand exactly when it's bad.

I'm not implying I'm on to anything others haven't thought of by posting this; I'm asking so people can tell me if I'm wrong.

Goodhart's Curse is often cited to claim that if a superintelligent AI has a utility function which is a noisy approximation of the intended utility function, the expected proxy error will blow up given a large search space for the optimal policy.

But, assuming Gaussian or sub-Gaussian error, the expected regret is actually something like σ√(2 ln n), where n is the size of the raw search space. Even if the search space grows exponentially with intelligence, expected error isn't really blowing up. If smarter agents make more accurate proxies, then error might very plausibly decrease as intelligence grows.
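Here's a minimal simulation sketch of the claim (my own illustration; the candidate setup, σ = 1, and the sample sizes are arbitrary assumptions, not anything from a real task): draw n candidates with Gaussian true value and independent Gaussian proxy error, optimize the proxy, and look at how much error the argmax picks up as n grows.

```python
# Toy regressional-Goodhart simulation: how much proxy error does the argmax
# pick up as the search space grows? (Sketch only; all parameters are arbitrary.)
import numpy as np

rng = np.random.default_rng(0)
sigma = 1.0  # scale of the proxy error

for n in [10**2, 10**4, 10**6]:
    picked_errors = []
    for _ in range(20):
        v = rng.normal(size=n)                  # true utility of each candidate
        eps = rng.normal(scale=sigma, size=n)   # independent Gaussian proxy error
        u = v + eps                             # proxy utility
        best = int(np.argmax(u))                # optimize the proxy
        picked_errors.append(eps[best])         # error picked up by optimizing
    print(f"n={n:>9}  mean error at argmax ~ {np.mean(picked_errors):.2f}  "
          f"(sub-Gaussian bound sigma*sqrt(2 ln n) ~ {sigma*np.sqrt(2*np.log(n)):.2f})")
```

Going from a hundred candidates to a million only moves the selected error by a couple of standard deviations, which is the sense in which the error grows slowly rather than blowing up.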

I understand that there are a lot of big assumptions here which might not hold in practice, but this still seems to suggest there are a lot of worlds where Goodhart's Curse doesn't bite that hard.

If this is too compressed to be legible, please let me know and I will make it a full post.

This is correct: indeed, there's a proof that so long as your errors are Gaussian or sub-Gaussian, then no matter what the distribution of valuable things is, Goodhart errors do not blow up as you optimize the proxy.

Similarly, there's a proof that so long as the tails of valuable things are heavier than the tails of errors, Goodhart's curse also cannot occur.
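A rough numerical illustration of the tails point (my own toy setup, not taken from the linked posts; Student-t and Gaussian are just stand-ins for "heavy-tailed" and "light-tailed"):

```python
# Compare how well optimizing the proxy tracks true value when the value
# distribution vs. the error distribution has the heavier tails. (Sketch only.)
import numpy as np

rng = np.random.default_rng(1)
n, reps = 100_000, 50

def mean_fraction_of_best(value_sampler, error_sampler):
    """Average of (true value at the proxy argmax) / (best available true value)."""
    fracs = []
    for _ in range(reps):
        v = value_sampler(n)
        u = v + error_sampler(n)                 # proxy = true value + error
        fracs.append(v[np.argmax(u)] / np.max(v))
    return np.mean(fracs)

heavy = lambda k: rng.standard_t(df=2, size=k)   # heavy-tailed samples
light = lambda k: rng.normal(size=k)             # light-tailed (Gaussian) samples

print("heavy-tailed value, Gaussian error:", mean_fraction_of_best(heavy, light))
print("Gaussian value, heavy-tailed error:", mean_fraction_of_best(light, heavy))
```

In the first case the proxy argmax captures most of the best available value; in the second, selection is driven almost entirely by the error, which is the regime where the Curse bites.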

The key caveat is that it assumes independence, and thus only protects against regressional Goodhart; in particular, it definitely requires unrealistic conditions for the theorem to work:

https://www.lesswrong.com/posts/fuSaKr6t6Zuh6GKaQ/when-is-goodhart-catastrophic

https://www.lesswrong.com/posts/GdkixRevWpEanYgou/catastrophic-regressional-goodhart-appendix

In more realistic settings, the most likely way to prevent Goodhart will be to either make reward functions bounded, or to use stuff like quantilizers.
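For concreteness, here is a toy quantilizer sketch (my own construction; the "trap" candidates, the 1% threshold, and the numbers are all made-up assumptions): instead of taking the proxy argmax, sample uniformly from the top q-fraction of candidates by proxy score, which caps how hard you optimize against the proxy's errors.

```python
# Toy comparison of argmax vs. a q-quantilizer when the proxy wildly
# overestimates a few catastrophic "trap" candidates. (Sketch only.)
import numpy as np

rng = np.random.default_rng(2)

def quantilize(proxy_scores, q, rng):
    """Sample uniformly from the top q-fraction of candidates by proxy score."""
    k = max(1, int(q * len(proxy_scores)))
    top = np.argsort(proxy_scores)[-k:]
    return rng.choice(top)

n = 100_000
v = rng.normal(size=n)                        # true value of each candidate
u = v.copy()                                  # the proxy is accurate...
traps = rng.choice(n, size=10, replace=False)
u[traps] += 100.0                             # ...except it loves a few candidates
v[traps] -= 100.0                             # that are actually catastrophic

print("true value at proxy argmax:      ", v[np.argmax(u)])
print("true value, 1% quantilizer (avg):",
      np.mean([v[quantilize(u, 0.01, rng)] for _ in range(200)]))
```

The argmax reliably lands on a trap, while the quantilizer only hits one with probability roughly (number of traps)/(q·n); bounding or clipping the reward would have a similar effect in this toy setting.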

Previous discussion, comment by johnswentworth:

Relevant slogan: Goodheart is about generalization, not approximation.

[...]

In all the standard real-world examples of Goodheart, the real problem is that the proxy is not even approximately correct once we move out of a certain regime.

When you calculate an expected regret here, are you calculating something like a myopic single-step expected regret, which ignores that utility in states is definitely not Gaussian (when you are standing up, an error in actuating your joints that leads you to fall over will then yield a reward many standard deviations away from a Gaussian you fit to the rewards you experienced in the thousands of seconds before, when you were uneventfully standing up in place), and that it will compound over T?

I'm not sure I understand what you're getting at.

If V is the target utility function and U is the proxy, and the error X is U − V, then we calculated the expected value of X at the maximizer of U, assuming X is normally distributed over the search space.
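For reference, the scaling behind that calculation is the standard maximal inequality for sub-Gaussian variables (this is my reconstruction of the step, assuming the error at each of the n candidates is independent, mean-zero, and σ-sub-Gaussian):

```latex
% Expected error picked up by optimizing over n candidates, for independent
% mean-zero sigma-sub-Gaussian errors X_1, ..., X_n:
\[
  \mathbb{E}\Big[\max_{1 \le i \le n} X_i\Big] \;\le\; \sigma \sqrt{2 \ln n}.
\]
% So if the search space grows exponentially with intelligence I, say n = e^{c I},
% the expected error grows only like
\[
  \sigma \sqrt{2 c I}.
\]
```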

See this other comment.

It seems people have already explored this in depth. I like John Wentworth's slogan that Goodheart is about generalization, not approximation.

My point is that it seems like the Gaussian assumption is obviously wrong given any actual example of a real task, like standing up without falling and breaking a hip or hitting our heads & dying (both of which are quite common in the elderly - e.g. my grandfather and my grandmother, respectively). And that the analysis is obviously wrong given any actual example of a real environment more complicated than a bandit. (And I think this is part of what Wentworth is getting at when he says it's about when you "move out of a regime". The fact that the error inside the 'regime' is, if you squint, maybe not so bad in some way, doesn't help much when the regime is ultra-narrow and you or I could, ahem, fall out of the regime within a second of actions.) So my reaction is that if that is the expected regret in this scenario, which seems to be just about the best possible scenario, with the tamest errors and the least RL aspects like having multiple steps, then you are showing that Goodhart's Curse really is that bad, and I'm confused why you seem to think it's great news.

Likely it's not great news; as I said in the original post, I'm not sure how to interpret what I noticed. But right now, after the reflection that's come from people's comments, I don't think it's bad news, and it might (maybe) even be slightly good news.[1]

Could you explain to me where the single step / multiple steps aspect comes in? I don't see an assumption of only a single step anywhere, but maybe this comes from a lack of understanding.

  1. ^

    Instead of a world where we need our proxy to be close to 100% error-free (this seems totally unrealistic), we just need the error to have ~ no tails (this might also be totally unrealistic but is a weaker requirement).

gwern:

Could you explain to me where the single step / multiple steps aspect comes in? I don't see an assumption of only a single step anywhere, but maybe this comes from a lack of understanding.

Maybe you could explain why you think it covers multiple steps? Like take my falling example. Falling is the outcome of many successive decisions taken by a humanoid body over a few hundred milliseconds. Each decision builds on the previous one, and is constrained by it: you start in a good position and you make a poor decision about some of your joints (like pivoting a little too quickly), then you are in a less-good position and you don't make a good enough decision to get yourself out of trouble, then you are in an even less good position, and a bunch of decisions later, you are lying on the ground suddenly dying when literally a second ago you were perfectly healthy and might have lived decades more. This is why most regret bounds include a T term in them, which covers the sequential decision-making aspect of RL and how errors can compound: a small deviation from the optimal policy at the start can snowball into arbitrarily large regrets over sufficient T.
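A toy sketch of that compounding point (my own illustration, not gwern's; the fall probability, rewards, and horizons are made-up numbers): a tiny per-step chance of an unrecoverable mistake turns a small single-step regret into regret that grows with the horizon T.

```python
# Toy sequential setting: each step there is a small chance of an unrecoverable
# "fall"; afterwards every remaining step is heavily penalized. (Sketch only.)
import numpy as np

rng = np.random.default_rng(3)

def episode(T, p_fall=0.001, reward_up=1.0, reward_fallen=-10.0):
    """Total reward over T steps for a policy with a tiny per-step error rate."""
    total, fallen = 0.0, False
    for _ in range(T):
        if not fallen and rng.random() < p_fall:
            fallen = True                        # one small mistake, absorbed forever
        total += reward_fallen if fallen else reward_up
    return total

for T in [10, 100, 1000, 10_000]:
    optimal = T * 1.0                            # reward of the always-standing policy
    realized = np.mean([episode(T) for _ in range(200)])
    print(f"T={T:>6}  expected regret ~ {optimal - realized:.1f}")
```

The per-step regret is tiny, but the cumulative regret grows roughly quadratically in T at first and then linearly once a fall becomes near-certain, which is the kind of T-dependence those regret bounds are tracking.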

plex:

Why assume Gaussian or sub-Gaussian error? I'd naively expect the error to find weird edge cases which end up being pretty far from the intended utility function, growing as the intelligence can explore more of the space.

(worthwhile area to be considering, tho)

Thanks, yeah Gaussian error is a strong assumption and usually not the default. But it's intuitively a much more realistic target than ~no error, and we want to understand how much error we can tolerate.

If you haven't already, I'd recommend reading Vinge's 1993 essay on 'The Coming Technological Singularity': https://accelerating.org/articles/comingtechsingularity

He is remarkably prescient, to the point that I wonder whether any really new insights into the broad problem have been made in the 32 years since he wrote. He discusses, among other things, using humans as a base to build superintelligence on as a possible alignment strategy, as well as the problems with this approach.

Here's one quote:

Eric Drexler [...] agrees that superhuman intelligences will be available in the near future — and that such entities pose a threat to the human status quo. But Drexler argues that we can confine such transhuman devices so that their results can be examined and used safely. This is I. J. Good's ultraintelligent machine, with a dose of caution. I argue that confinement is intrinsically impractical. For the case of physical confinement: Imagine yourself locked in your home with only limited data access to the outside, to your masters. If those masters thought at a rate — say — one million times slower than you, there is little doubt that over a period of years (your time) you could come up with "helpful advice" that would incidentally set you free. [...] 
