This looks cool. My biggest caution would be that this effect may be tied to the specific class of data generating processes you're looking at.

Your framing seems to be that you look at the world as being filled with entities whose features under any conceivable measurements are distributed as independent multivariate normals. The predictive factor is a feature and so is the outcome. Then using extreme order statistics of the predictive factor to make inferences about the extreme order statistics of the outcome is informative but unreliable, as you illustrated. Playing around in R, reliability seems better for thin-tailed distributions (e.g., uniform) and worse for heavy-tailed distributions (e.g., Cauchy). Fixing the distributions and letting the number of observations vary, I agree with you that the probability of picking exactly the greatest outcome goes to zero. But I'd conjecture that the probability that the observation with the greatest factor is in some fixed percentile of the greatest outcomes will go to one, at least in the thin-tailed case and maybe in the normal case.

But consider another data generating process. If you carry out the following little experiment in R

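fac <- rcauchy(1000)        # heavy-tailed factor
out <- fac + rnorm(1000)    # outcome = factor + standard normal noise
plot(rank(fac), rank(out))  # compare factor ranks with outcome ranks
rank(out)[which.max(fac)]   # outcome rank of the top-factor observation (1000 = largest outcome)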

it looks like extreme factors are great predictors of extreme outcomes, even though the factors are only unreliable predictors of outcomes overall. I wouldn't be surprised if the probability of the greatest factor picking the greatest outcome goes to one as the number of observations grows.
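To put a rough number on that conjecture, here is a minimal simulation sketch. It repeats the experiment above many times at a few sample sizes and records how often the observation with the greatest factor also has the greatest outcome. The helper name `hit_rate`, the seed, the sample sizes and the replication count are all my own arbitrary choices, and a simulation like this can only suggest the limiting behaviour, not prove it.

```r
# Sketch: how often does the largest factor also have the largest outcome,
# for Cauchy factors plus standard normal noise?
set.seed(1)  # arbitrary seed, only for reproducibility

# hit_rate is a hypothetical helper, not part of the original experiment.
hit_rate <- function(n, reps = 500) {
  hits <- replicate(reps, {
    fac <- rcauchy(n)
    out <- fac + rnorm(n)
    which.max(fac) == which.max(out)
  })
  mean(hits)  # fraction of replications where the two maxima coincide
}

sizes <- c(100, 1000, 10000)
rates <- sapply(sizes, hit_rate)
print(data.frame(n = sizes, hit_rate = rates))
```

If the estimated rate climbs towards one as n grows, that is at least consistent with the guess above.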

Informally (and too evocatively) stated, what seems to be happening is that as long as new observations are expanding the space of factors seen, extreme factors pick out extreme outcomes. When new observations mostly duplicate already observed factors, all of the duplicates would predict the most extreme outcome and only one of them can be right.

Thanks for doing what I should have done and actually running some data!

I ran your code in R. I think what is going on in the Cauchy case is that the spread of fac is far larger than the normal noise being added (rnorm uses sd = 1 by default, whilst the Cauchy draws range over several orders of magnitude; strictly speaking, the Cauchy has no finite variance at all). If you plot(fac, out), you get a virtually straight line, which might explain the lack of divergence between top ranked fac and out.
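A quick way to check this scale argument is to compare the spread of the factor with the spread of the noise directly. The sketch below is only illustrative: the seed and the particular summary measures (IQR, range, gap between the top two factors) are my own choices, and robust measures are used because the Cauchy has no finite variance.

```r
set.seed(1)  # arbitrary seed, only for reproducibility
fac   <- rcauchy(1000)
noise <- rnorm(1000)   # rnorm defaults to mean = 0, sd = 1
out   <- fac + noise

# Compare robust spread measures of the factor with the noise SD.
spread <- c(
  iqr_fac   = IQR(fac),
  range_fac = diff(range(fac)),
  sd_noise  = sd(noise)
)
print(spread)

# Gap between the two largest factors; if this is many noise SDs wide,
# the noise is unlikely to reorder the top observation.
top_two <- sort(fac, decreasing = TRUE)[1:2]
top_gap <- top_two[1] - top_two[2]
print(top_gap)

# Pearson correlation tends to be driven by the extreme points,
# rank (Spearman) correlation much less so.
print(c(pearson  = cor(fac, out),
        spearman = cor(fac, out, method = "spearman")))
```

A Pearson correlation near one alongside a clearly lower Spearman correlation would match the picture of a straight-line scatter plot combined with much noisier ranks in the middle of the distribution.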

I don't have any analytic results to offer, but playing with R suggests in the normal case the probability of the greatest factor score picking out the greatest outcome goes down as N increases - to see this for yourself, replace rcauchy with runif or rnorm, and increase the N to 10000 or 100000. In the normal case, it is still unlikely that max(fac) picks out max(out) with random noise, but this probability seems to be sample size invariant - the rank of the maximum factor remains in the same sort of percentile as you increase the sample size.
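For anyone who wants to repeat that comparison without editing the code by hand, here is a sketch that takes the factor sampler as an argument. For each sample size it reports the fraction of runs in which the top factor has the top outcome, and the median percentile of the outcome belonging to the top-factor observation. The function name `top_factor_summary`, the seed and the replication count are my own assumptions. Note that with runif the factor only spans [0, 1], so the sd = 1 noise dominates it.

```r
set.seed(1)  # arbitrary seed, only for reproducibility

# top_factor_summary is a hypothetical helper. rfac is any sampler that
# takes n and returns n draws, e.g. rnorm, runif or rcauchy.
top_factor_summary <- function(rfac, n, reps = 100) {
  res <- replicate(reps, {
    fac <- rfac(n)
    out <- fac + rnorm(n)
    i <- which.max(fac)
    c(hit = as.numeric(i == which.max(out)),  # 1 if the maxima coincide
      pct = mean(out <= out[i]))              # percentile of that outcome
  })
  c(n = n,
    hit_rate   = mean(res["hit", ]),
    median_pct = median(res["pct", ]))
}

sizes <- c(1000, 10000, 100000)
normal_res  <- t(sapply(sizes, function(n) top_factor_summary(rnorm, n)))
uniform_res <- t(sapply(sizes, function(n) top_factor_summary(runif, n)))
print(normal_res)
print(uniform_res)
```

Reading down the hit_rate and median_pct columns shows whether each quantity moves with N or stays put, separately for the normal and uniform cases.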

I can intuit why this is the case: in the bivariate normal case, the distribution should be elliptical, and so the limit case with N -> infinity will be steadily reducing density of observations moving out from the ellipse. So as N increases, you are more likely to 'fill in' the bulges on the ellipse at the right tail that give you the divergence; if N is smaller, this is less likely. (I find the uniform result more confusing - the 'N to infinity case' should be a parallelogram, so you should just be picking out the top right corner, so I'd guess the probability of picking out the max factor might be invariant to sample size... not sure.)

Comment author: Lumifer, 28 July 2014 04:54:39PM, 4 points

Another issue is that real-life processes are, generally speaking, not stationary (in the statistical sense) -- outside of physics, that is.

When you see an extreme event in reality it might be that the underlying process has heavier tails than you thought it does, or it might be that the whole underlying distribution switched and all your old estimates just went out of the window...

Good point. When I introduced that toy example with Cauchy factors, it was the easiest way to get factors that, informally, don't fill in their observed support. Letting the distribution of the factors drift would be a more realistic way to achieve this.

> the whole underlying distribution switched and all your old estimates just went out of the window...

I like to hope (and should probably endeavor to ensure) that I don't find myself in situations like that. A system that evolves over time generatively (in what the joint distribution of factor X and outcome Y looks like) might still be stationary discriminatively (in what the conditional distribution of Y given X looks like). Even if we have to throw out our information about what new X's will look like, we may be able to keep saying useful things about Y once we see the corresponding new X.
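A toy illustration of that distinction, under assumptions I am making up for the purpose: the factor distribution shifts between an "old" and a "new" period (mean 0 to mean 5), while the conditional rule Y = 2 * X + noise stays fixed. None of these numbers come from real data.

```r
set.seed(1)  # arbitrary seed, only for reproducibility
n <- 2000

# Hypothetical setup: the marginal distribution of X drifts,
# but the conditional rule Y = 2 * X + noise does not change.
x_old <- rnorm(n, mean = 0)
x_new <- rnorm(n, mean = 5)   # factors have drifted upward
y_old <- 2 * x_old + rnorm(n)
y_new <- 2 * x_new + rnorm(n)

# Old estimates of what X looks like go stale...
print(c(mean_old = mean(x_old), mean_new = mean(x_new)))

# ...but a conditional model fitted on old data can still predict new Y.
fit <- lm(y ~ x, data = data.frame(x = x_old, y = y_old))
pred_new <- predict(fit, newdata = data.frame(x = x_new))
rmse_new <- sqrt(mean((y_new - pred_new)^2))
print(coef(fit))
print(rmse_new)  # near the noise SD of 1 if the conditional rule really is stable
```

The catch is that this only works because the linear rule is assumed to hold outside the range of the old X's; in practice that extrapolation is exactly the part you cannot check from the old data alone.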

Comment author: Lumifer, 28 July 2014 05:54:14PM, 4 points

> I like to hope (and should probably endeavor to ensure) that I don't find myself in situations like that.

It comes with certain territories. For example, any time you see the financial press talk about a six-sigma event you can be pretty sure the underlying distribution ain't what it used to be :-/
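The arithmetic behind that remark is easy to check. The sketch below computes the two-sided probability of a six-sigma move under a normal distribution and, purely as an illustrative heavy-tailed alternative that I picked myself, under a Student t with 3 degrees of freedom rescaled to unit variance.

```r
# Two-sided probability of a move at least 6 standard deviations from the
# mean if the data really were normal: roughly 2e-9, about 1 in 500 million.
p_normal <- 2 * pnorm(-6)

# Same threshold under a Student t rescaled to unit variance. A t with
# df > 2 has variance df / (df - 2), so 6 SDs is 6 * sqrt(df / (df - 2))
# on the raw t scale. df = 3 is an arbitrary illustrative choice.
df  <- 3
p_t <- 2 * pt(-6 * sqrt(df / (df - 2)), df = df)

print(c(normal = p_normal, student_t3 = p_t, ratio = p_t / p_normal))
```

With daily data, something that should happen once in hundreds of millions of days showing up in the news is much better explained by the wrong model than by bad luck.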
