Estimates vs. head-to-head comparisons

paulfchristiano

4th May 2013

(Cross-posted from my blog.)

Summary: when choosing between two options, it’s not always optimal to estimate the value of each option and then pick the better one.

Suppose I am choosing between two actions, X and Y. One way to make my decision is to predict what will happen if I do X and predict what will happen if I do Y, and then pick the option which leads to the outcome that I prefer.

My predictions may be both vague and error-prone, and my value judgments might be very hard or nearly arbitrary. But it seems like I ultimately must make some predictions, and must decide how valuable the different outcomes are. So if I have to evaluate N options, I could do it by evaluating the goodness of each option, and then simply picking the option with the highest value. Right?

There are other possible procedures for evaluating which of two options is better. For example, I have often encountered advice of the form "if your error bars are too big, you should just ignore the estimate". To be most extreme, I could choose some particular axis on which options can be better or worse, and then pick the option which is best on that axis, ignoring all others. (E.g., I could choose the option which is cheapest, or the charity which is most competently administered, or whatever.)

If you have an optimistic quantitative outlook like mine, this probably looks pretty silly—if one option is cheaper, that just gets figured into my estimate for how good it is. If my error bars are big, as long as I keep track of the error bars in my calculation it is still better than nothing. So why would I ever want to do anything other than estimate the value of each option?

In fact I don’t think my intuition is quite right. To see why, let’s start with a very simple case.

A simple model

Alice and Bob are picking between two interventions X and Y. They only have a year to make their decision, so they split up: Alice will produce an estimate of the value of X and Bob will produce an estimate of the value of Y, and they will both do whichever one looks better. Let’s suppose that Alice and Bob are perfectly calibrated and trust each other completely, so that each of them believes the other’s estimate to be unbiased.

Suppose that intervention X is good because it reduces carbon emissions. First Alice dutifully estimates the reductions in emissions that result from intervention X; call that number A1. Of course Alice doesn’t care about carbon emissions per se; she cares about the improvements in human quality of life that result from decreased emissions, and she couldn’t compare her estimate with Bob’s unless she converted it into units of goodness. So she next estimates the gain in quality of life per unit of reduced emissions; call that number A2. She then reports that the value of X is A1 * A2. Because she is unbiased, as long as her estimates of A1 and A2 are independent she obtains an unbiased estimate of the value of X.

Meanwhile, it happens to be the case that intervention Y is also good because it reduces carbon emissions. So Bob similarly estimates the reduction in carbon emissions from intervention Y, B1, and then the goodness of reduced emissions, B2, and reports B1 * B2. His estimate is also an unbiased estimate of the value of Y.

The pair decides to do intervention X iff it appears to have a higher value than Y, i.e. iff A1 * A2 > B1 * B2. This is not crazy, but it’s also not a very good idea. A2 and B2 are estimates of the same quantity, the value of a unit of reduced emissions, so intervention X is better than intervention Y iff it reduces emissions by more; the best available test is simply whether A1 > B1. But if estimates A2 and B2 are relatively noisy, and especially if the noise in those estimates is larger than the actual gap between A1 and B1, then comparing A1 * A2 with B1 * B2 makes Alice and Bob’s decision unnecessarily random.
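
To make the size of this effect concrete, here is a minimal simulation sketch of the Alice and Bob model. Everything numeric in it is an assumption chosen for illustration, not something from the post: the true emission reductions (1.10 for X, 1.00 for Y), the Gaussian noise model, and the noise levels. The function name `simulate` is likewise hypothetical. X is truly better in this setup, so the sketch counts how often each decision rule picks X.

```python
import random


def simulate(trials=20000, a1_true=1.10, b1_true=1.00, value_per_unit=1.0,
             noise_emissions=0.05, noise_value=0.50, seed=0):
    """Return how often each decision rule picks X, the truly better option.

    All parameter values are illustrative assumptions. Each estimate is the
    true quantity plus independent zero-mean Gaussian noise, so every
    individual estimate (and each product) is unbiased.
    """
    rng = random.Random(seed)
    picks_x_by_products = 0
    picks_x_by_direct = 0
    for _ in range(trials):
        # Alice's and Bob's estimates of the emission reductions.
        a1 = a1_true + rng.gauss(0, noise_emissions)
        b1 = b1_true + rng.gauss(0, noise_emissions)
        # Their separate, noisy estimates of the SAME quantity:
        # the value of one unit of reduced emissions.
        a2 = value_per_unit + rng.gauss(0, noise_value)
        b2 = value_per_unit + rng.gauss(0, noise_value)
        # Rule 1: compare the two independent value estimates.
        if a1 * a2 > b1 * b2:
            picks_x_by_products += 1
        # Rule 2: compare emission reductions head to head.
        if a1 > b1:
            picks_x_by_direct += 1
    return picks_x_by_products / trials, picks_x_by_direct / trials


if __name__ == "__main__":
    p_products, p_direct = simulate()
    print(f"picks X by comparing A1*A2 with B1*B2: {p_products:.3f}")
    print(f"picks X by comparing A1 with B1:       {p_direct:.3f}")
```

Under these assumed noise levels, the head-to-head rule picks the better option in the large majority of trials, while the product rule is only modestly better than a coin flip. Shrinking `noise_value` toward zero makes the two rules agree, which matches the claim that the damage comes from noise in A2 and B2 rather than from any bias.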

What went wrong? Alice and Bob aren’t making a systematically bad decision, but they could have made a better decision by using a different technique for comparison. I think that a similar situation arises very often, in much less simple and slightly less severe situations. This may mean that the best way to compare X and Y is not always to compute the value for each. When making a comparison between X and Y, we can minimize uncertainty by making the analysis of X as similar to the analysis of Y as possible.
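
One way to see why similar analyses help, as a sketch under an assumption the post does not state explicitly: write each estimate as the true value plus a zero-mean error term (the symbols below are introduced here for illustration). The quantity that drives the decision is the difference of the two estimates, and its variance depends on how correlated the two errors are.

```latex
\hat{V}_X = V_X + \varepsilon_X, \qquad
\hat{V}_Y = V_Y + \varepsilon_Y, \qquad
\mathbb{E}[\varepsilon_X] = \mathbb{E}[\varepsilon_Y] = 0

\operatorname{Var}\!\left(\hat{V}_X - \hat{V}_Y\right)
  = \operatorname{Var}(\varepsilon_X) + \operatorname{Var}(\varepsilon_Y)
  - 2\,\operatorname{Cov}(\varepsilon_X, \varepsilon_Y)
```

When the two analyses share their uncertain steps, the errors are positively correlated and the covariance term cancels part of the noise in the comparison. When the estimates are produced independently, the covariance is zero and the two variances simply add.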

Objections

Of course this example was very simple, and there are lots of reasons you might expect more realistic estimates to be safe from these problems. I think that, despite those differences, this simple model captures a common failure in estimation. The basic point is that the example shows there is no general reason to expect independent estimates of value to yield optimal results, and without such a reason the procedure is on much shakier ground. But to make the point, here are responses to some of the most obvious objections:

1. The reason we can say that Alice and Bob did badly is that we know something they didn't: that A2 and B2 were estimates of the same quantity. Couldn't they just have done one extra step of work (updating each of their estimates after looking at the other's work) and avoided the problem?

In this case, that would have solved Alice and Bob's problem. But in practice, different estimates rarely involve estimating exactly the same intermediates. If I want to compare the goodness of health interventions and education interventions in the developing world, the most natural estimates might not have even a single step in common. Nevertheless, each of those estimates would involve many uncertainties about social dynamics in the developing world, long-term global outcomes, and so on. I could do my analysis in a way that introduced analogies between the two estimates, and this could help me eliminate some of this uncertainty (even if the resulting estimates were noisier, or involved ignoring some apparently useful information).

If Alice and Bob's estimates were related in a more complicated way, then it's still the case that there is some extra update Alice and Bob could have done, which would have eliminated the problem (i.e. updating on each other's estimates, using that relationship). But such an update could be quite complicated, and after making it Alice and Bob would need to make further updates still. In general, it's not clear I can fix the problem without being logically omniscient. I don't know the extent of this issue in practice, and I'm not familiar with a literature on this or related problems. It seems pretty messy in general, but I expect it would be possible to make meaningful headway on it.

The point is: in order to prove that comparing independent value estimates is optimal, it is not enough to assume that my beliefs are well-calibrated. I also need to assume that my beliefs make use of all available information (including having considered every alternative estimation strategy that sheds light on the question), which is unrealistic even for an idealized agent unless it is logically omniscient. When my beliefs don’t make use of all available information, other techniques for comparison might do better, including using different estimates which have more elements in common. (In some cases, even very simple approaches like “do the cheapest thing” will be predictably better than comparing independent value estimates.)

2. Alice and Bob had trouble because they are two different people. I agree that I shouldn’t compare estimates from different people, but if I do all of the estimates myself it seems like this isn’t a problem.

When I try to estimate the same thing several times, without remembering my earlier estimates, I tend to get different results. I strongly suspect this is universal, though I haven’t seen research on that question.

Moreover, when I try to estimate different things, my estimates tend not to obey the logical relationships that I know the estimated quantities must, unless I go back through with those particular relationships in mind and enforce them. For example, if I estimate A and B separately, the sum is rarely the same as if I estimated A+B. When the relationships amongst items are complicated, such consistency is unrealistically difficult to enforce. (Of course, the prospects for making comparisons also suffer.) It may be that there is some principled way to get around these problems, but I don't know it.

Alice and Bob's estimates don’t have to be very far from each other before they could have done better. I agree that estimates from a single person will have a higher degree of consistency than estimates from different people, but they won't be consistent enough to remove the problem (or opportunity for improvement, if you want to look at it from a different angle).

3. The weird behavior in the example came from the artificial structure of the problem. How often could you do such factoring out for realistic estimates, even when they are similar?

If I’m trying to estimate the effect of different health interventions, the first step would be to separate the question “How much does this improve people’s health?” from “How much does improving people’s health matter?” That already factors out a big piece of the uncertainty. I think most people get that far, though, and so the question is: can you go farther?

I think it is still easier to estimate "Which of these interventions improve health more?" than to estimate the absolute improvement from either. We can break this comparison down into still smaller comparisons: “How many more or fewer people does X reach than Y?” and “Per person affected, what is the relative impact of X and Y?” etc. By focusing on the most important comparisons, and writing the others off as a wash, we might be able to reduce the total error in our comparison.
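
The decomposition described above can be sketched as a product of X-to-Y ratios, one per comparison. This is only an illustration: the function name `relative_value`, the factor names, and the numbers (X reaching 1.5 times as many people with 0.8 times the per-person impact) are all made up here, and any factor left out of the dictionary is implicitly treated as a wash, i.e. a ratio of 1.

```python
from math import prod


def relative_value(ratios):
    """Combine per-factor X/Y ratios into one overall X/Y ratio.

    `ratios` maps a hypothetical factor name to the estimated ratio of X to Y
    on that factor. Factors that are omitted contribute a ratio of 1.0,
    which is what "writing them off as a wash" means in this sketch.
    """
    return prod(ratios.values())


# Illustrative, made-up numbers; not taken from the post.
comparison = {
    "people_reached": 1.5,     # X reaches 1.5x as many people as Y
    "impact_per_person": 0.8,  # but helps each person 0.8x as much
}
ratio = relative_value(comparison)
preferred = "X" if ratio > 1 else "Y"
print(f"overall X/Y ratio: {ratio:.2f}, preferred: {preferred}")
```

Note that the shared and highly uncertain question "How much does improving people's health matter?" never appears in the calculation: it multiplies both options equally, so it drops out of the ratio.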

Conclusion

Trying to explicitly estimate the goodness of outcomes tends to draw a lot of criticism from pretty much every side. I think most of this criticism is unjustified (and often rooted in an aversion to making reasoning or motivations explicit, a desire to avoid offense or culpability, etc.). Nevertheless, there are problems with many straightforward approaches to quantitative estimation, and some qualitative processes improve on quantitative estimation in important ways. Many of these improvements are often dismissed by optimistic quantitative types (myself included), and I think that is an error. For example, I mentioned that I've often dismissed arguments of the form "If your error bars are too big, you are sometimes better off ignoring the data." This looks obviously wrong on the Bayesian account, but as far as I can tell it may actually be the optimal behavior, even for idealized, bias-free humans.

30 Comments

private_messaging

If I have a noisy estimate and a prior, I should regress towards the mean. By the "ideal case" do you mean the case in which my estimates have no noise? That is a strange idealization, which people might implicitly use but probably wouldn't advocate.

I was primarily referring to this wide-eyed optimism prevalent on these boards; attend some workshops and become more rational and win. It's not that people advocate not regressing to the mean, it's that they don't even know this is an issue (and a difficult issue when the probability distribution and its mean are something you need to find out as well). In the ideal case, you have a sum over all terms - it is not an estimate at all - you don't discard any terms, if you discard any terms it will make it less ideal, if you apply any extra scaling it will make it less ideal, and so on. And so you have people see it as biases and imagine enormous gains to be obtained from doing something formal inspired instead. I have a cat test. Can you explicitly determine if something is a picture of a cat based on a list of numbers representing pixel luminosities? This is the size of gap between implicit processing of the evidence and explicit processing of the evidence.

But I don't see why either of these properties (reflecting symmetries, summing to one over exclusive alternatives) are necessary for good outcomes. Suppose that I am trying to estimate the relative goodness of two options in order to pick the best. Why should it matter whether my beliefs have these particular consistency properties, as long as they are my best available guess?

This needs a specific example. Some people were worrying over a very very far fetched scenario, being unable to assign it low enough probability. The property of summing to 1 over the enormous number of likewise far fetched mutually exclusive scenarios would definitely have helped, compared to the state of - I suspect - summing to a very very huge number. Then they were taught a little bit of rationality and they know probability is subjective, which makes them inclined to consider their numerical assessment of a feeling (which may well already incorporate alleged impact) to be a probability, and multiply it with something. Other bad patterns include inversion of probability - why are you so extremely certain in negation of an event? People expect that probabilities close to 1 require evidence, and without any, are reluctant to assign something close to 1, even though in that case it is representative of a sum of almost entire hypothesis space.

With respect to the other points, I agree that estimation is hard, but the difficulties you cite seem to fit pretty squarely into the simple theoretical framework of computing a well-calibrated estimate of expected value. So to the extent there are gaps between that simple framework and reality, these difficulties don't point to them.

not a question for which you actually know the expert consensus.

I do not see people most educated in these matters (or, indeed, the theory) to be running "rationality workshops" advocating explicit theory-based reasoning, that's what I mean. And people I see I would not even suspect of expertise if they haven't themselves claimed expertise.

This would be a fine response if I were trying to cast myself as better than experts because I have such an excellent clean theory (and I have little patience with Eliezer for doing this). But in fact I am just trying to say relatively simple things in the interest of building up an understanding.

Yes I certainly agree here - first make simple steps in the right direction.

paulfchristiano

I think mostly you are arguing against LW in general, which seems fine but not particularly helpful here or relevant to my point.

Some people were worrying over a very very far fetched scenario, being unable to assign it low enough probability. The property of summing to 1 over the enormous number of likewise far fetched mutually exclusive scenarios would definitely have helped, compared to the state of - I suspect - summing to a very very huge number.

What is the "very very far fetched scenario"? If you mean the intelligence explosion scenario...