It should come as no surprise to people on this list that models often outperform experts. But those are generally finely calibrated models, integrating huge amounts of data, so their success seems less surprising. How can the poor experts compete against that?
But sometimes the models are much simpler than that, and still perform better. For instance, the models may be purely linear, with no higher-order terms. These models can still outperform experts because, in practice, despite the experts' belief that they are performing a non-linear task, their decisions can often best be modelled as entirely linear.
But surely the weights of the linear models are subtle and need to be set exactly? Not really. It seems that if you take a linear model and weight each variable +1 or -1, depending on whether it has a positive or negative impact on the result, you will get a model that still often outperforms experts. These models with ±1 weights are called improper linear models.
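As a quick illustration on entirely synthetic data (the weights, noise level, and sample sizes below are arbitrary assumptions, not from any real study), here is a sketch comparing a least-squares fit with a ±1-weight model:

```python
import numpy as np

rng = np.random.default_rng(0)

n_cases, n_vars = 500, 5
X = rng.standard_normal((n_cases, n_vars))
true_w = np.array([0.9, 0.7, -0.8, 0.6, -0.5])  # assumed "true" weights
y = X @ true_w + rng.standard_normal(n_cases)    # noisy outcome

# Proper linear model: least-squares fit of the weights.
w_fit, *_ = np.linalg.lstsq(X, y, rcond=None)

# Improper linear model: only +1 or -1 per variable, i.e. just the
# *direction* of each variable's impact.
w_sign = np.sign(true_w)

def corr(pred, y):
    return np.corrcoef(pred, y)[0, 1]

print("proper   model r:", round(corr(X @ w_fit, y), 3))
print("improper model r:", round(corr(X @ w_sign, y), 3))
```

On data like this, the ±1 model's correlation with the outcome typically lands close behind the fitted model's, which is the phenomenon in question.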
What's going on here? Well, there's been a bit of a dodge. I've been talking about "taking" a linear model, with "variables", and weighting the factors depending on a positive or negative "impact". And to do all that, you need experts. They are the ones who know which variables are important, and know the direction (positive or negative) in which each one affects the result. They don't choose these variables by taking random possibilities and then working out the direction. Instead, they understand the situation, to some extent, and choose the important variables.
So that's the real role of the expert here: knowing what should go into the model, what really makes the underlying dependent variable change. "Selecting and coding" the variable information, as it is often put.
But although experts can be very good at that task, they are human, and humans are terrible at integrating lots of information. So, having selected the variables, they get regularly outperformed by proper linear models. And once you add the facts that the experts have selected variables of comparable importance, and that these variables are often correlated with each other, it's not surprising that they get outperformed by improper linear models as well.
Nice. To make your proposed explanation more precise:
Take a random vector on the n-dimensional unit sphere. Project it to the nearest vector with coordinates ±1/sqrt(n); what is the expected l_2-distance / angle? How does it scale with n?
If this value decreases in n, then your explanation is essentially correct, or did you want to propose something else?
Start by taking a random vector x where each coordinate is a unit Gaussian (normalize later). The projection px is just the sign vector: +1 in the positive coordinates and -1 in the negative ones.
We are interested in E[<x, px> / (|x| sqrt(n))], the expected cosine of the angle.
If the dimension is large enough, then we won't really need to normalize; it is enough to start with 1/sqrt(n) Gaussians, as we will almost surely get almost unit length. Then all components are independent.
For the angle, we then (approximately) need to compute E(sum_i |x_i| / n), where each x_i is a unit Gaussian. This is asymptotically independent of n; so it appears that this explanation of improper linear models fails.
Darn, after reading your comment I mistakenly believed that this would be yet another case of "obvious from high-dimensional geometry" / random projection.
PS. In what sense are improper linear models working? In the l_1, l_2, or l_∞ sense?
Edit: I was being stupid, leaving the above for future ridicule. We want E(sum_i |x_i|/n) = 1, not E(sum_i |x_i|/n) = 0.
The folded Gaussian tells us that E[sum_i |x_i|/n] = sqrt(2/pi), for large n. The explanation still does not work, since sqrt(2/pi) < 1, and this gives us the expected error margin of improper high-dimensional models.
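The folded-Gaussian value is easy to check numerically; a quick Monte Carlo sketch (sample counts are arbitrary) of the cosine between a random direction and its nearest sign vector:

```python
import numpy as np

rng = np.random.default_rng(2)

for n in (10, 100, 1000):
    x = rng.standard_normal((20000, n))
    # cosine with the nearest +-1/sqrt(n) vector:
    # <x, sign(x)> / (|x| sqrt(n)) = sum_i |x_i| / (|x| sqrt(n))
    cos = np.abs(x).sum(axis=1) / (np.linalg.norm(x, axis=1) * np.sqrt(n))
    print(n, round(cos.mean(), 3))

print("sqrt(2/pi) =", round(np.sqrt(2 / np.pi), 3))
```

The mean cosine is essentially flat in n and sits at sqrt(2/pi) ≈ 0.798, confirming the folded-Gaussian computation above.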
@Stuart: What are the typical empirical errors? Do they happen to be near sqrt(2/pi), which is close enough to 1 to be summarized as "kinda works"?