abramdemski

Sequences

Pointing at Normativity
Consequences of Logical Induction
Partial Agency
Alternate Alignment Ideas
Filtered Evidence, Filtered Arguments
CDT=EDT?
Embedded Agency
Hufflepuff Cynicism

Comments

If you indeed were solving a narrower task — that is, only creating the most sense of pleasure-inducing picture with maximization of other parameters — and then looked back, puzzled as to why the hungry weren't fed by this procedure, bringing Goodhart's law into the discussion is madness; it stresses me out. The variable 'people are hungry' wasn't important for this task at all. Oh, or was it important to you? Then why didn't you specify it? You think it’s 'obvious'?

The point of Goodhart's Law is that you can only select for what you can measure. The burger is a good analogy because Instagram can't measure taste or nutrition, so when Instagram is what optimizes burgers, you get burgers with a very appealing appearance but non-optimized taste and nutrition. If you have the ability to measure taste, then you can create good taste, but you run into subtler examples of Goodhart (EG, Starbucks coffee is optimized to taste good to their professional tasters, which is slightly different from tasting good to a general audience).

Just specifying the variable you're interested in doesn't solve this problem; you also have to figure out how to measure it. The problem is that measurements are usually at least slightly statistically distinct from the actual target variable, so that the statistical connection can fall apart under optimization.
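To make the statistical point concrete, here is a minimal sketch of one mild form of this effect. It assumes (purely for illustration) that the measurement equals the true quality plus independent Gaussian noise; the function name `simulate_selection` and all parameters are hypothetical. It selects the items that measure best and compares their average measured score with their average true quality.

```python
import random

def simulate_selection(n_items=10_000, top_k=100, noise_sd=1.0, seed=0):
    """Select the top_k items by a noisy measurement.

    Assumption (not from the original comment): measurement = true quality
    plus independent Gaussian noise with standard deviation noise_sd.
    Returns (mean measured score, mean true quality) of the selected items.
    """
    rng = random.Random(seed)
    items = []
    for _ in range(n_items):
        true_quality = rng.gauss(0.0, 1.0)
        measured = true_quality + rng.gauss(0.0, noise_sd)
        items.append((measured, true_quality))

    # "Optimization" here is just picking whatever measures best.
    items.sort(reverse=True)
    selected = items[:top_k]

    mean_measured = sum(m for m, _ in selected) / top_k
    mean_true = sum(t for _, t in selected) / top_k
    return mean_measured, mean_true

if __name__ == "__main__":
    measured, true = simulate_selection()
    print(f"selected items: mean measured = {measured:.2f}, mean true = {true:.2f}")
```

Under these assumptions the selected items look better on the measurement than they really are, and the gap disappears when `noise_sd` is zero (when the measurement is the target). Real cases of Goodhart can be much worse than this, since the measurement and the target can differ in structured ways rather than by independent noise.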

I also take issue with describing optimizing the appearance of the burger as "narrower" than optimizing the burger quality. In general it is a different task, which may be narrower or broader.

I expect that the main problem with Goodhart's law is that if you strive for an indicator to accurately reflect the state of the world, once the indicator becomes decoupled from the state of the world, it stops reflecting the changes in the world. This is how I interpret the term 'good,' which I dislike. People want a thermometer to accurately reflect the patterns they called temperature to better predict the future — if the thermometer doesn't reflect the temperature, future predictions suffer.

A problem I have with this reinterpretation is that "state of the world" is too broad. In looking at a thermometer, I am not trying to understand the entire world-state (and the thermometer also couldn't be decoupled from the entire world-state, since it is a part of the world).

A more accurate way to remove "good" would be as follows:

In everyday life, if a human is asked to make a (common, everyday) judgement based on appearances, then the judgement is probably accurate. But if we start optimizing really hard based on their judgement, Goodhart's Law kicks in.

Ah, yeah, sorry. I do think about this distinction more than I think about the actual model-based vs model-free distinction as defined in ML. Are there alternative terms you'd use if you wanted to point out this distinction? Maybe policy-gradient vs ... not policy-gradient?

In machine-learning terms, this is the difference between model-free learning (reputation based on success/failure record alone) and model-based learning (reputation can be gained for worthy failed attempts, or lost for foolish lucky wins).
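A toy sketch of the distinction as the quoted sentence uses it (which, as noted, is looser than the standard ML definitions). All names here are hypothetical, and it assumes each attempt comes with some prior estimate of how likely it was to succeed. One reputation score looks only at the win/loss record; the other scores the quality of the bet regardless of how it turned out.

```python
from dataclasses import dataclass

@dataclass
class Attempt:
    # Hypothetical field: how good the bet looked before the outcome was known.
    estimated_success_prob: float
    succeeded: bool

def record_based_reputation(attempts):
    """'Model-free' in the loose sense above: only the success/failure record counts."""
    return sum(a.succeeded for a in attempts) / len(attempts)

def judgement_based_reputation(attempts):
    """'Model-based' in the loose sense above: credit follows the quality of the bet."""
    return sum(a.estimated_success_prob for a in attempts) / len(attempts)

worthy_failure = [Attempt(estimated_success_prob=0.9, succeeded=False)]
foolish_lucky_win = [Attempt(estimated_success_prob=0.1, succeeded=True)]

if __name__ == "__main__":
    for name, attempts in [("worthy failure", worthy_failure),
                           ("foolish lucky win", foolish_lucky_win)]:
        print(name,
              record_based_reputation(attempts),
              judgement_based_reputation(attempts))
```

The record-based score punishes the worthy failure and rewards the lucky win; the judgement-based score does the opposite, at the cost of needing some model that supplies the success estimates.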

There's unpublished work about a slightly weaker logical induction criterion which doesn't have this property (there exist constant-distribution inductors in this weaker sense), but which is provably equivalent to the regular LIC whenever the inductor is computable.[1] To my eye, the weaker criterion is more natural. The basic idea is that this weird trader shouldn't count as raking in the cash. The regular LIC (we can call it "strong LIC" or SLIC) counts traders as exploiting the market if there is a sequence of worlds in which their wealth grows unboundedly. This allows for the trick you quote: buying up larger and larger piles of sentences in diminishingly-probable worlds counts as exploiting the market.

The weak LIC (WLIC) says instead that traders have to actually make the money in order to count as exploiting the market.

Thus the limit of a logical inductor can count as a (weak) logical inductor, just not a computable one.

  1. ^ Roughly speaking. This is not quite an adequate description of the theorem.

abramdemski

Good citation. Yeah, I should have flagged harder that my description there was a caricature and not what anyone said at any point. I still need to think more about how to revise the post to be less misleading in this respect.

One thing I can say is that the quote flags that particular failure mode because, according to the MIRI way of thinking about the problem, it is an easy failure mode to fall into.

Answer by abramdemski

I agree that:

  • People commonly talk about love as black-and-white when it isn't.
  • It is used as a semantic stopsign or applause light.
  • Love is a cluster concept made up of many aspects, but the way people talk about love obscures this. The concept of "true love" serves to keep people in the dark by suggesting that the apparent grab-bag nature of love is just the way people talk, & there's an underlying truth to be discovered. Because experiences of love are fairly rare, people can only accumulate personal evidence about this slowly, so it remains plausible that what they've experienced is "just infatuation" or something else, and the "true love" remains to be discovered. People will fight about what love is in sideways ways, EG "you don't know what love is" can be used to undercut the other person's authority on love (when a more accurate representation of the situation might be "I didn't enjoy participating in the social dynamic you're presently calling love"). Different people will sometimes express different detailed views about what love is, but there seems to be little drive to reach an agreement about this, at least between people who aren't romantic partners. 

I once had the experience of a roommate asking if I believe in love. I said yes, absolutely, explaining that it is a real human emotion (I said something about chemicals in the brain). He responded: it sounds like you don't believe in love!

(I think this has to do with ambiguity about believe-in as much as about love, to be fair.)

So, a question: is 'love' worth saving as a concept? The way some people use the word might be pretty terrible, but, should we try to rescue it? Is there a good way to use it which recovers many aspects of the common use, while restoring coherence?

I do personally feel that there is some emotional core to love, so I'm sympathetic to the "it's a specific emotion" definition. This accords with people not being able to give a specific definition. Emotions are feelings, so they're tough to define. They just feel a specific way. You can try to give cognitive-behavioral definitions, and that's pretty useful; but, for example, you could show behavioral signs of being afraid without actually experiencing fear.

I'm also sympathetic to the view that "love" is a kind of ritual, with saying "I love you" being the centerpiece of the ritual, and a cluster of loving behaviors being the rest of it. "I love you" can then be interpreted as a desire or intention or commitment to participate in the rest of it. 

I would strongly guess that many people could physically locate the cringe pain, particularly if asked when they're experiencing it.
