Consider a case, not too different from what has been shown to happen in reality, where we ask Bob what sounds like a fair punishment for a homeless man who steals $1,000, and he answers ten years. Suppose we wait until Bob has forgotten that we ever asked the first question, and then ask him what sounds like a fair punishment for a hedge fund manager who steals $1,000,000, and he says five years. Maybe we even wait until he forgets the whole affair, and then ask him the same questions again with the same answers, confirming that these are stable preferences.
If we now confront Bob with both numbers together, informing him that he supported a ten year sentence for stealing $1,000 and a five year sentence for stealing $1,000,000, a couple of things might happen. He could say "Yeah, I genuinely believe poor people deserve greater penalties than rich people." But more likely he says "Oh, I guess I was prejudiced." Then if we ask him the same question again, he comes up with two numbers that follow the expected mathematical relationship and punish the greater theft with more jail time.
Bob isn't working off of some predefined algorithm for determining punishment, like "jail time = (10 * amount stolen)/net worth". I don't know if anyone knows exactly what Bob is doing, but at a stab, he's seeing how many unpleasant feelings get generated by imagining the crime, then proposing a jail sentence that activates about an equal amount of unpleasant feelings. If the thought of a homeless man makes images of crime more readily available and so increases the unpleasant feelings, things won't go well for the homeless man. If you're really hungry, that probably won't help either.
Just as nothing automatically synchronizes the intention to study a foreign language with the behavior of actually studying it, nothing automatically synchronizes thoughts about punishing the theft of $1,000 with thoughts about punishing the theft of $1,000,000.
Of course, there is something that non-automatically does it. After all, in order to elicit this strange behavior from Bob, we had to wait until he forgot about the first answer. Otherwise, he would have noticed and quickly adjusted his answers to make sense.
We probably could represent Bob's tendencies as an equation and call it a preference. Maybe it would be a long equation with terms for the net worth of the criminal, the amount stolen, how much food Bob's eaten in the past six hours, and whether his local sports team won the pennant recently, with appropriate coefficients and powers for each. But if Bob saw this equation, he certainly wouldn't endorse it. He'd probably be horrified. It's also unstable: if given a choice, he would undergo brain surgery to remove this equation, thus preventing it from being satisfied. This is why I am reluctant to call such a formalization a "preference".
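To make the point concrete, a "preference equation" of the kind described above might be sketched as follows. Every coefficient, term, and input here is invented for the illustration; the only part taken from the text is the joke formula relating amount stolen to net worth.

```python
# Purely illustrative: a made-up "preference equation" of the kind described
# above. All coefficients and inputs are invented for this sketch.

def bobs_sentence_years(amount_stolen, criminal_net_worth,
                        hours_since_bob_ate=2, team_won_pennant=False):
    """Toy model of Bob's sentencing tendency, not a real algorithm."""
    # The article's joke formula: jail time = (10 * amount stolen) / net worth.
    base = (10 * amount_stolen) / max(criminal_net_worth, 1)
    # Hungrier Bob is harsher (hypothetical term).
    hunger_penalty = 0.5 * max(hours_since_bob_ate - 4, 0)
    # A pennant win puts Bob in a forgiving mood (hypothetical term).
    mood_discount = 1 if team_won_pennant else 0
    return max(base + hunger_penalty - mood_discount, 0)

# The equation "punishes" a broke defendant far more than a rich one
# for stealing a thousand times less:
poor = bobs_sentence_years(1_000, criminal_net_worth=100)
rich = bobs_sentence_years(1_000_000, criminal_net_worth=1_000_000_000)
```

Here `poor` comes out to 100 years and `rich` to a hundredth of a year, which is exactly the sort of output Bob would be horrified to see written down.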
Instead of saying that Bob has one preference determining his jail time assignments, it would be better to model him as having several tendencies - a tendency to give a certain answer in the $1,000 case, a tendency to give a different answer in the $1,000,000 case, and several tendencies towards things like consistency, fairness, compassion, et cetera.
People strongly consciously endorse these latter tendencies, probably because they're socially useful[1]. If the Chief of Police says "I know I just put this guy in jail for theft, but I'm going to let this other thief off because he's my friend, and I don't really value consistency that much," then they're not going to stay Chief of Police for very long.
Bayesians and rationalists, in particular, make a big deal out of consistency. One common parable on the importance of consistency is the Dutch Book - a way to get free money from anyone behaving inconsistently. Suppose you have a weighted coin which can land on either heads or tails. There are several good reasons why I should not assign a probability of 66% to heads and 66% to tails, but one of the clearest is this: you can make me a bet that I will give you $2 if it lands on tails and you give me $1 if it lands on heads, and then a second bet where I give you $2 if it lands on heads and you give me $1 if it lands on tails. Whichever way the coin lands, I owe you $2 and you owe me $1 - you have gained a free dollar. So consistency is good if you don't want to be handing dollars out to random people...
...except that the Dutch book itself assumes consistency. If I believe that there is a 66% chance of the coin landing on heads, but refuse to take a bet at 2:1 odds - or even at 1.5:1 odds, though I should think it's easy money! - then I can't be Dutch booked. I am literally too stupid to be tricked effectively. You would think this wouldn't happen too often, since people would need an accurate mental model to know when to refuse such a bet, and such an accurate model would tell them to revise their probabilities - but time after time people have demonstrated the ability to do exactly that.
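The arithmetic of the two bets above can be checked directly. This is a minimal sketch of the payoff table, from the perspective of the inconsistent agent who accepts both bets; the guaranteed one-dollar loss holds regardless of the coin.

```python
# The Dutch book from the text: an agent who assigns 66% to heads AND 66%
# to tails accepts both side bets, and loses $1 no matter how the coin lands.

def payoff(outcome):
    """Agent's net dollars after both bets, for a given coin outcome."""
    # Bet 1: agent pays $2 on tails, collects $1 on heads.
    bet1 = 1 if outcome == "heads" else -2
    # Bet 2: agent pays $2 on heads, collects $1 on tails.
    bet2 = -2 if outcome == "heads" else 1
    return bet1 + bet2

# Guaranteed loss either way - the bookie's "free dollar":
assert payoff("heads") == -1
assert payoff("tails") == -1
```

The point of the section, of course, is that the trap only springs on an agent consistent enough to accept both bets in the first place.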
I have not yet accepted that consistency is always the best course in every situation. For example, in Pascal's Mugging, a random person threatens to take away a zillion units of utility if you don't pay them $5. The probability they can make good on their threat is minuscule, but by multiplying out by the size of the threat, it still ought to motivate you to give the money. Some belief has to give - the belief that multiplication works, the belief that I shouldn't pay the money, or the belief that I should be consistent all the time - and right now, consistency seems like the weakest link in the chain.
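The "multiplying out" step can be made explicit. The numbers below are invented placeholders for the argument's "minuscule probability" and "zillion units"; the uncomfortable conclusion survives any choice of numbers where the threat grows faster than the credibility shrinks.

```python
# Sketch of the expected-value arithmetic behind Pascal's Mugging.
# All magnitudes are invented for illustration.

from fractions import Fraction  # exact arithmetic, so nothing rounds to zero

p_threat_real = Fraction(1, 10**30)   # "minuscule" credibility of the mugger
threatened_disutility = 10**40        # "a zillion units of utility"
cost_of_paying = 5                    # $5, treated here as 5 utility units

ev_of_refusing = -p_threat_real * threatened_disutility  # -10**10 in expectation
ev_of_paying = -cost_of_paying

# Naive multiplication says to pay up - which is exactly the problem:
assert ev_of_paying > ev_of_refusing
```

This is why the essay frames it as a trilemma: if you keep the multiplication and keep consistency, you are stuck handing over the $5.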
The best we can do is seek reflective equilibrium among our tendencies. If you endorse the belief that rich people should not get lighter sentences than poor people more strongly than you endorse the tendency to give the homeless man ten years in jail and the fund manager five, then you can edit the latter tendency and come up with a "fair" sentence. This is Eliezer's defense of reason and philosophy, a powerful justification for morality (see part one here) and it's probably the best we can do in justifying our motivations as well.
Any tendency that has reached reflective equilibrium in your current state is about as close to a preference as you're going to get. It still won't automatically motivate you, of course. But you can motivate yourself toward it obliquely, and come up with the course of action that you most thoroughly endorse.
FOOTNOTES:
[1]: A tendency toward consistency can cause trouble if someone gains advantage from both of two mutually inconsistent ideas. Trivers' hypothesis predicts that people will consciously deny the inconsistency so they can continue holding both ideas, yet still appear consistent and so remain socially acceptable. Rationalists are so annoying because we go around telling people they can't do that.
It's not at all obvious that someone threatening to inflict disutility unless I comply with certain demands would actually treat me worse if I don't comply than if I do.
One can't simply say "It is rational to one-box on Newcomb's problem," because one might live in a universe in which some entity, say Sampi (if not Omega itself), executes one-boxers painfully and rewards two-boxers.
The possibility that someone will inflict $RIDICULOUSNUMBER units of disutility on me is as latent in the question "give me money or I will inflict $RIDICULOUSNUMBER units of disutility on you" as it is in the question "paper or plastic" - and not because my choice of bag might have a significant impact on my life. If I can't distinguish the credibility of the threat (that the speaker can and will act as they say) from zero, then I also can't distinguish it from the credibility of the opposite outcome (that they will do the reverse of what they say), since that possibility is likewise indistinguishable from zero.
On a personal note, the night before last I had a wild dream (no laws of physics were violated; the laws of Congress, not so much) that ended similarly to how the movie "The Game" starring Michael Douglas ended. Well, it actually ended with me waking up - which is even more to the point. Things oughtn't be simply accepted at face value.
Today I misread the following no fewer than two times - I think three times, though I can't swear to that:
I read the first sentence as: "Marc Hauser, the primatologist psychologist at Harvard who recently accused me of mistreating evidence and graduate students, has resigned." That made less and less sense as the post went on, so I took it from the top several times until I finally caught my error.
It's far more likely that I am misunderstanding someone threatening $RIDICULOUSNUMBER than that they can carry out their threat - and also more likely that I'll misspeak and say "yes" when I mean no, or mistakenly hand over a one-dollar bill, etc., than that they can and will carry it out. Simultaneously, I'm not paralyzed by the fear of offending people, despite King Incognito being not just a possibility but a trope. I think anyone who jumps on the "it's a compelling argument" horn of the $RIDICULOUSNUMBER argument has to give an account of how they disagree with anyone, ever, given the possible ramifications if the other person is Agent Smith, Harun al-Rashid, Peter the Great, Yoda, etc. I could easily mistake a threat of unimaginable torture for everyday speech or mild disagreement.
The obvious answer is that agreeing has indistinguishable ramifications (many dislike the teacher's pet, which is itself another trope)...in which case I would like to know why that same reasoning isn't applicable when a random person actually claims such power. It is no more likely that someone claiming such power has it than that someone not claiming it has it, and likewise for their willingness to use it.
If you disagree, upvote this or I'll give you $RIDICULOUSNUMBER units of disutility! I'm kidding, of course. (Or would no number of disclaimers be sufficient? Shall you believe that having expressed this claim, it is more likely than not I abide by it, and that saying I kid was a half-plausible way to deny making threats? Or shall you believe that having disclaimed it, I would be displeased by acts in accordance with it? Both are plausible for humans, but further muddling things, if I have such power, how would my intentions likely differ from a normal human's?)
In support of this, I'd point out that the ridiculous powers required to inflict $RIDICULOUSNUMBER sanctions are so far removed from our experience that we have no idea how such an agent could be expected to act. It could do the opposite of what it claims (perhaps it hates cowards) as easily as fulfill its threats, given that we know nothing of its motives.