Less Wrong is a community blog devoted to refining the art of human rationality. Please visit our About page for more information.

Manfred comments on The Optimizer's Curse and How to Beat It - Less Wrong

44 Post author: lukeprog 16 September 2011 02:46AM

You are viewing a comment permalink. View the original post to see all comments and the full post content.

Comments (81)

You are viewing a single comment's thread. Show more comments above.

Comment author: Manfred 17 September 2011 03:44:30PM 7 points [-]

In any group there's going to be random noise, and if you choose an extreme value, chances are that value was inflated by noise. In Bayesian, given that something has the highest value, it probably had positive noise, not just positive signal. So the correction is to correct out the expected positive noise you get from explicitly choosing the highest value. Naturally, this correction is greater for when the noise is bigger.

So imagine choosing between black boxes. Each black box has some number of gold coins in it, and also two numbers written on it. The first number, A, on the box is like the estimated expected value, and the second number, B, is like the variance. What happened is that someone rolled two distinct dice with B sides, subtracted die 1 from die 2, and added that to the number of gold coins in the box.

So if you see a box with 40, 3 written on it, you know that it has an expected value of 40 gold coins, but might have as few as 37 or as many as 43.

Now comes the problem: I put 10 boxes in front of you, and tell you to choose the one with the most gold coins. The first box is 50, 1 - a very low-variance box. But the last 9 boxes are all high-uncertainty, all with B=20. The expected values printed on them are as follows [I generated the boxes honestly] : 53, 52, 37, 60, 44, 36, 56, 45, 54. Ooh, one of those boxes has a 60 on it! Pick that one!

Okay, don't pick that one. Think about it - there are 9 boxes with high variance, and the one you picked probably has unusually large noise. To be special among 9 proposals with high variance, it probably has noise at the 80th+ percentile. What's the 80th percentile of noise for 1d20 - 1d20? I bet it's larger than 10. You're better off just going with the 50, 1 box.

And it's a good thing you applied that correction, because I generated the boxes by typing "RandomInteger[20,9] - RandomInteger[20,9] + 45" into Wolfram alpha - they each 45 coins each.

So this illustrates that what beating the optimizer's curse really is is a sort of "correction for multiple comparisons." If you have a lot of noisy boxes, some of them will look large even when they're not, even larger than non-noisy boxes.

Comment author: JGWeissman 17 September 2011 07:06:08PM 3 points [-]

That is a good example of how the optimizer's curse causes an overestimate of the maximum expected value, and even reliably causes a wrong choice to be associated with the maximum expected value. But how do I apply the correction mathematically, so I can know for which expected values on the high uncertainty boxes I should expect their best of them to be better or worse than the low uncertainty box? Even better, how can I deal with situations where the uncertainties of the expected values are not so conveniently categorized (and whose actual values aren't conveniently uniform)?

Comment author: Manfred 19 September 2011 10:52:04AM *  2 points [-]

Oh - I learned how, by the way. You start with some prior over how you expect the actual coins to be distributed, and then you convolute in the noise distribution of each box to get the combined distribution for each box. Then, given where the number on the outside of each box falls on the combined distribution, you can assign how much of that you expect to be signal and how much you expect to be noise by distributing improbability equally between signal and noise. Then you subtract out the expected noise.

Comment author: Manfred 17 September 2011 07:48:50PM 0 points [-]

I'm not sure. It's probably in the paper.

Comment author: Brickman 28 September 2011 01:58:06AM 0 points [-]

I'm trying to figure out why, from the rules you gave at the start, we can assume that box 60 has more noise than the other boxes with variance of 20. You didn't, at the outset of the problem, say anything about what the values in the boxes actually were. I would not, taking this experiment, have been surprised to see a box labeled "200", with a variance of 20, because the rules didn't say anything about values being close to 50, just close to A. Well, I would've been surprised with you as a test-giver, but it wouldn't have violated what I understood the rules to be and I wouldn't have any reason to doubt that box was the right choice.

The box with 60 stands out among the boxes with high variance, but you did not say that those boxes were generated with the same algorithm and thus have the same actual value. In fact you implied the opposite. You just told me that 60 was an estimate of its expected value, and 37 was an estimate of one of the other boxes' expected values. So I would assign a very high probability to it being worth more than the box labeled 37. I understand that the variance is being effectively applied twice to go between the number on the box to the real number of coins (The real number of 45 could make an estimate anywhere from 25 to 65, but if it hit 25 I'd be assigning the real number a lower bound of 5 and if it hit 65 I'd be assigning the real number an upper bound of 85, which is twice that range). (Actually for that reason I'm not sure your algorithm really means there's a variance of 20 from what you state the expected value to be, but I don't feel like doing all the math to verify that since it's tangential to the message I'm hearing from you or what I'm saying). But that doesn't change the average. The range of values that my box labeled 60 could really contain from being higher than the range the box labeled 37 could really contain, to the best of my knowledge, and both are most likely to fall within a couple coins of the center of that range, with the highest probability concentrated on the exact number.

If the boxes really did contain different numbers of coins, or we just didn't have reason to assume that they don't contain different numbers, the box labeled 60 is likely to contain more coins than that 50/1 box did. It is also capable of undershooting 50 by ten times as much if unlucky, so if for some reason I absolutely cannot afford to find less than 50 coins in my box the 50/1 box is the safer choice--but if I bet on the 60/20 box 100 times and you bet on the 50/1 box 100 times, given the rules you set out in the beginning, I would walk away with 20% more money.

Or am I missing some key factor here? Did I misinterpret the lesson?

Comment author: Manfred 28 September 2011 04:41:15AM 2 points [-]

Or am I missing some key factor here? Did I misinterpret the lesson?

The key factor is that the 60,20 box is not in isolation - it is the top box, and so not only do you expect it to have more "signal" (gold) than average, you also expect it to have more noise than average.

You can think of the numbers on the boxes as drawn from a probability distribution. If there was 0 noise, this probability distribution would just be how the gold in the boxes was distributed. But if you add noise, it's like adding two probability distributions together. If you're not familiar with what happens, go look it up on wikipedia, but the upshot is that the combined distribution is more spread out than the original. This combined distribution isn't just noise or just signal, it's the probability of having some number be written on the outside of the box.

And so if something is the top, very highest box, where should it be located on the combined distribution?

Now, if you have something that's high on the combined distribution, how much of that is due to signal, and how much of it is due to noise? This is a tougher question, but the essential insight is that the noise shouldn't be more improbable than the signal, or vice versa - that is, they should both be about the same number of standard deviations from their means.

This means that if the standard deviation of the noise is bigger, then the probable contribution of the noise is greater.

Me saying the same thing a different way can be found here.

Comment author: Brickman 28 September 2011 12:15:15PM 1 point [-]

Oh, I understand now. Even if we don't know how it's distributed, if it's the top among 9 choices with the same variance that puts it in the 80th percentile for specialness, and signal and noise contribute to that equally. So it's likely to be in the 80th percentile of noise.

It might have been clearer if you'd instead made the boxes actually contain coins normally distributed about 40 with variance 15 and B=30, and made an alternative of 50/1, since you'd have been holding yourself to more proper unbiased generation of the numbers and still, in all likelihood, come up with a highest-labeled box that contained less than the sure thing. You have to basically divide your distance from the norm by the ratio of specialness you expect to get from signal and noise. The "all 45" thing just makes it feel like a trick.

Comment author: CynicalOptimist 17 November 2016 05:27:56PM 0 points [-]

I think there's some value in that observation that "the all 45 thing makes it feel like a trick". I believe that's a big part of why this feels like a paradox.

If you have a box with the numbers "60" and "20" as described above, then I can see two main ways that you could interpret the numbers:

A: The number of coins in this box was drawn from a probability distribution with a mean of 60, and a range of 20.

B: The number of coins in this box was drawn from an unknown probability distribution. Our best estimate of the number of coins in this box is 60, based on certain information that we have available. We are certain that the actual value is within 20 gold coins of this.

With regards to understanding the example, and understanding how to apply the kind of Bayesian reasoning that the article recommends, it's important to understand that the example was based on B. And in real life, B describes situations that we're far more likely to encounter.

With regards to understanding human psychology, human biases, and why this feels like a paradox, it's important to understand that we instinctively tend towards "A". I don't know if all humans would tend to think in terms of A rather than B, but I suspect the bias applies widely amongst people who've studied any kind of formal probability. "A" is much closer to the kind of questions that would be set as exercises in a probability class.

Comment author: Manfred 28 September 2011 05:37:09PM 0 points [-]

That's true - when I wrote the post you replied to I still didn't really understand the solution - though it did make a good example for JGWeissman's question. By the time I wrote the post I linked to, I had figured it out and didn't have to cheat.

Comment author: Oscar_Cunningham 17 September 2011 11:25:11PM 0 points [-]

But if you don't know that all the high variance boxes have the same mean then 60 is the one to go with. And if you do know they have the same mean, then it's expected value is no longer 60.

Comment author: Manfred 18 September 2011 08:48:21AM 1 point [-]

Imagine putting gold coins into a bunch of boxes by having them normally distributed about 50 gold coins with standard deviation 10. Then we'll add some Gaussian noise to the estimates on the boxes - but we'll split them into 2 groups. Ten boxes will have noise with standard deviation of 5, while the other ten will have a standard deviation of 25.

But since I've still kept the simple situation where we just have 2 groups, you can get the overall biggest by just picking the biggest from each group and comparing them. So we can treat the groups independently for a bit. The biggest one is going to have the biggest positive deviation from 50, combined signal and noise. Because I used normal distributions this time, the combined prior+noise distribution is just a bigger normal distribution. So given that something is big or small by this combined distribution, how do we expect the signal and noise distributions to shift? Well, it would be silly to expect one of them to be more improbable than the other, so we expect their means to shift by about the same number of standard deviations for each distribution. This right there means that the bigger the noise, the more of the variation we should attribute to noise. And also the bigger the element in the combined distribution, the larger we should expect its noise to be.

Comment author: Oscar_Cunningham 18 September 2011 09:45:46AM 0 points [-]

But if you know the boxes were originally drawn from N(50,100) then the number on the box is no longer the correct Bayesian mean. All I'm arguing is that once you have your Bayesian expected value you don't need to update it any further.

Comment author: Manfred 18 September 2011 10:13:42AM *  3 points [-]

All I'm arguing is that once you have your Bayesian expected value you don't need to update it any further.

That's pretty uncontroversial, but in practice it means that you end up penalizing high-noise boxes with high values (and boosting high-noise boxes with low values), which I think is a nontrivial result.