This is a good example of people relying too much on linear regressions. You can't interpret the coefficients of a linear regression the way they do. Regressions are good for exploratory data analysis, and the authors' interpretations of the coefficients are reasonable hypotheses to consider, but they should actually test them.
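Covariance is one keyword (the usual textbook name for the problem is multicollinearity). If the data is linear but does not span the full dimension of the predictor space, then the predictors covary and the individual coefficients become hard to interpret on their own. This is to be expected in situations like this, where you convert a scale to a bunch of booleans. ETA: and even if one did not expect adjacent values to be correlated, the fact that the total number of ratings is about the same for every profile is itself a reduction of dimension.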
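A minimal sketch of that dimension reduction, assuming (hypothetically) that every profile receives exactly `TOTAL` ratings spread over five rating values. The coefficient values are made up for illustration and do not come from the article. When the counts always sum to the same number, you can shift every coefficient by a constant and absorb the shift into the intercept without changing a single prediction, so the sign of any one coefficient tells you little by itself.

```python
import random

TOTAL = 20  # assumption: every profile gets exactly this many ratings


def random_counts(rng, total=TOTAL, levels=5):
    """Spread `total` ratings at random over `levels` rating values."""
    counts = [0] * levels
    for _ in range(total):
        counts[rng.randrange(levels)] += 1
    return counts


def predict(intercept, coefs, counts):
    """Linear prediction: intercept plus coefficient times count, summed."""
    return intercept + sum(b * n for b, n in zip(coefs, counts))


rng = random.Random(0)
profiles = [random_counts(rng) for _ in range(100)]

# Made-up coefficients on the counts of 1s, 2s, 3s, 4s and 5s.
intercept_a = 1.0
coefs_a = [0.4, -0.5, -0.1, 0.0, 0.9]

# Shift every coefficient down and compensate in the intercept.
shift = 0.5
intercept_b = intercept_a + shift * TOTAL
coefs_b = [b - shift for b in coefs_a]

# Largest disagreement between the two models over all profiles.
max_gap = max(
    abs(predict(intercept_a, coefs_a, c) - predict(intercept_b, coefs_b, c))
    for c in profiles
)
print("coefficient on 1s, model A:", coefs_a[0])
print("coefficient on 1s, model B:", coefs_b[0])
print("largest prediction gap:", max_gap)
```

The two models agree on every profile (up to floating-point rounding), yet the coefficient on the number of 1s is positive in one and negative in the other. With totals that are only about equal rather than exactly equal, the coefficients are identified but can still be very unstable.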
But if the data is not linear, many more things can go wrong. I don't know names for them.
Matt Simpson: I suppose that could solve the problem of covariance, but that's not what I'm talking about.
It would be interesting to see higher-dimensional plots. For example, the scatter plot of average-score vs the number of messages could be colored according to the number of ratings of 1. And similar charts for other ratings.
That is a good example of an error that one could make from believing the data is linear (and thus trusting the regression coefficients) when it is not. If their non-linear model were correct, we would get regression coefficients like the ones we see. If we trusted the regression coefficients too much (implicitly assuming the data is linear), then the positive coefficient on the number of 1s would suggest that having all 1s is good. But it is not. Their model says it is not and the data says it is not (e.g., the scatter plot).
I think that is what you are saying. It is certainly not their mistake - they believe their model. I am not saying anything so specific, but it is the type of mistake that I am talking about. Also, there are lots of non-linear models that lead to the same regression.
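A toy sketch of that type of mistake. The response function here is hypothetical (it is not the article's model): the response rises with the share of 1-ratings over the observed range but is zero when every rating is a 1. A straight line fitted to the observed range gets a positive slope, and extrapolating it to "all 1s" gives a badly wrong answer.

```python
def true_response(share_of_ones):
    """Hypothetical non-linear model, chosen only for illustration."""
    return share_of_ones * (1.0 - share_of_ones)


def fit_line(xs, ys):
    """Ordinary least squares for one predictor; returns (intercept, slope)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return mean_y - slope * mean_x, slope


# Observed profiles only cover shares of 1-ratings between 0 and 0.4.
shares = [0.0, 0.1, 0.2, 0.3, 0.4]
responses = [true_response(p) for p in shares]

intercept, slope = fit_line(shares, responses)

# What the fitted line says about a profile rated 1 by everyone,
# versus what the (hypothetical) true model says.
linear_guess_all_ones = intercept + slope * 1.0
actual_all_ones = true_response(1.0)

print("fitted slope on share of 1s:", slope)
print("linear extrapolation at all 1s:", linear_guess_all_ones)
print("true model at all 1s:", actual_all_ones)
```

The slope is positive, so reading the coefficient naively says "more 1s is better", while the model that generated the data says a profile with all 1s gets nothing. Many other non-linear functions would produce a similar fitted line over the same range.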
Comment author: Matt_Simpson
12 January 2011 06:16:09PM
I interpreted this comment as saying that they should test whether the coefficients are equal to 0 before interpreting them. There's evidence that they did this if you look at the "if you're into algebra" sidebar on the right - they dropped the m3 variable because it had a large p-value.
Is that what you were getting at?
edit: typo
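For reference, a minimal sketch of what "test whether a coefficient is equal to 0" can look like. This is a permutation test on a single-predictor slope, which is a swapped-in technique and not necessarily how the article's p-values were computed. The data is simulated and all names are made up for illustration.

```python
import random


def slope(xs, ys):
    """Least-squares slope of ys on xs (single predictor)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    return sxy / sxx


def permutation_p_value(xs, ys, n_perm=2000, seed=0):
    """Two-sided p-value for the null hypothesis that the slope is 0.

    Shuffling ys breaks any link to xs, so the shuffled slopes show
    how large a slope arises by chance alone.
    """
    rng = random.Random(seed)
    observed = abs(slope(xs, ys))
    shuffled = list(ys)
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(shuffled)
        if abs(slope(xs, shuffled)) >= observed:
            hits += 1
    # Add-one correction keeps the estimate strictly above zero.
    return (hits + 1) / (n_perm + 1)


# Simulated data: one response that depends on x, one that does not.
rng = random.Random(1)
xs = [rng.uniform(0, 10) for _ in range(40)]
ys_related = [2.0 * x + rng.gauss(0, 1) for x in xs]
ys_unrelated = [rng.gauss(0, 1) for _ in xs]

p_related = permutation_p_value(xs, ys_related)
p_unrelated = permutation_p_value(xs, ys_unrelated)
print("p-value, related response:", p_related)
print("p-value, unrelated response:", p_unrelated)
```

A small p-value only says the coefficient is unlikely to be exactly 0 under the linear model. It does not show that the linear model is the right one, which is the concern raised above.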
Here's an insightful comment on the article:
http://www.reddit.com/r/math/comments/ezm6s/the_mathematics_of_beauty/c1c87ts
This is the same reason that when shopping on Amazon I ignore the reviews from people who rated the product 1 or 5 stars. They often have an ulterior motive of trying to damage/help the image of the product as much as possible.