Kindly comments on 2012 Survey Results - Less Wrong

80 Post author: Yvain 07 December 2012 09:04PM

Comments (640)

Comment author: gwern 30 November 2012 03:27:25AM * 26 points

I previously mentioned that item non-response might be a good measure of Conscientiousness. Before doing anything fancy with non-response, I first checked that there was a correlation with the questionnaire reports. The correlation is zero:

R> lwc <- subset(lw, !is.na(as.integer(as.character(BigFiveC))))
R> missing_answers <- apply(lwc, 1, function(x) sum(sapply(x, function(y) is.na(y) || as.character(y)==" ")))
R> cor.test(as.integer(as.character(lwc$BigFiveC)), missing_answers)
Pearson's product-moment correlation
data: as.integer(as.character(lwc$BigFiveC)) and missing_answers
t = -0.0061, df = 421, p-value = 0.9952
alternative hypothesis: true correlation is not equal to 0
95 percent confidence interval:
-0.09564 0.09505
sample estimates:
cor
-0.0002954
# visualize to see if we made some mistake somewhere
R> plot(as.integer(as.character(lwc$BigFiveC)), missing_answers)

I am completely surprised. The results in the economics paper looked great and the rationale is very plausible. Yet... The 2 sets of data here have the right ranges, there's plenty of variation in both dimensions, I'm sure I'm catching most of the item non-responses or NAs given that there are non-response counts as high as 34, there are a lot of datapoints, and it's not that the correlation is in the opposite direction (which might indicate a coding error) but that there's none at all. Yvain questions the Big Five results, but otherwise they look exactly as I would've predicted before seeing the results: low C and E and A, high O, medium N.
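A minimal sketch of the non-response count on a toy data frame, assuming (as in the apply() call above) that a skipped item shows up either as NA or as a single space. The column values are invented for illustration and are not from the survey.

```r
# Toy data frame standing in for the survey export (made-up values).
toy <- data.frame(
  BigFiveC = c("40", "75", " ", "12"),
  Income   = c("50000", NA, " ", "0"),
  Age      = c("25", "31", NA, " "),
  stringsAsFactors = FALSE
)

# TRUE wherever a cell is NA or the single-space blank marker.
is_blank <- is.na(toy) | toy == " "

# Number of skipped items per respondent (one count per row).
missing_answers <- rowSums(is_blank, na.rm = TRUE)
print(missing_answers)
```

Building the logical matrix first and then calling rowSums() gives the same count as the nested apply/sapply, and makes it easy to eyeball which cells were treated as blank.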

There may be something very odd about LWers and Conscientiousness; when I try C vs Income, there's an almost-zero correlation again:

R> cor.test(as.integer(as.character(lwc$BigFiveC)), log1p(as.integer(lwc$Income)))
Pearson's product-moment correlation
data: as.integer(as.character(lwc$BigFiveC)) and log1p(as.integer(lwc$Income))
t = 0.2178, df = 421, p-value = 0.8277
alternative hypothesis: true correlation is not equal to 0
95 percent confidence interval:
-0.08482 0.10585
sample estimates:
cor
0.01061

I guess the next step is a linear model on income vs age, Conscientiousness, and IQ:

lwc <- subset(lw, !is.na(as.integer(as.character(BigFiveC))))
lwc <- subset(lw, !is.na(as.integer(as.character(Age))))
lwc <- subset(lw, !is.na(as.integer(as.character(IQ))))
lwc <- subset(lw, !is.na(as.integer(as.character(Income))))
c <- as.integer(as.character(lwc$BigFiveC))
age <- as.integer(as.character(lwc$Age))
iq <- as.integer(as.character(lwc$IQ))
income <- log1p(as.integer(as.character(lwc$Income)))
summary(lm(income ~ (age + iq + c)))
Call:
lm(formula = income ~ (age + iq + c))
Residuals:
   Min     1Q Median     3Q    Max
-8.762 -0.849  1.191  2.319  3.644
Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept)  -0.5531     3.5479   -0.16     0.88
age           0.1311     0.0323    4.06  9.5e-05
iq            0.0339     0.0267    1.27     0.21
c             0.0174     0.0121    1.44     0.15
Residual standard error: 3.35 on 106 degrees of freedom
(489 observations deleted due to missingness)
Multiple R-squared: 0.196, Adjusted R-squared: 0.173
F-statistic: 8.59 on 3 and 106 DF, p-value: 3.73e-05
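One detail of the regression setup: each subset() call there starts again from lw rather than from lwc, so only the last filter (Income) persists and lm() drops the remaining incomplete rows itself, which is where the "deleted due to missingness" line comes from. A sketch of a more explicit version, using complete.cases() on a synthetic stand-in for the data (the data frame and its values are assumptions, not the survey):

```r
set.seed(1)
n <- 200
# Synthetic stand-in for the survey data; every value here is made up.
toy <- data.frame(
  Age      = sample(c(18:60, NA), n, replace = TRUE),
  IQ       = sample(c(100:160, NA), n, replace = TRUE),
  BigFiveC = sample(c(1:99, NA), n, replace = TRUE),
  Income   = sample(c(0, 20000, 50000, 90000, NA), n, replace = TRUE)
)

vars <- c("Age", "IQ", "BigFiveC", "Income")
# Keep only rows where all four variables are present.
complete <- toy[complete.cases(toy[, vars]), ]

fit <- lm(log1p(Income) ~ Age + IQ + BigFiveC, data = complete)
# With pre-filtered data, lm() has no rows left to drop for missingness.
print(nrow(complete))
print(nobs(fit))
```

The fitted coefficients should match what lm() produces when it drops the NA rows on its own; the gain is only that the sample being analysed is visible before the model is fit.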

So all of them combined don't explain much, and most of the work is being done by the age variable... There are many high-income LWers, supposedly (in this subset of respondents reporting age, income, IQ, and Conscientiousness, the max is 700,000), so I'd expect a cumulative r^2 of more than 0.173 for all 3 variables; if those aren't governing income, what is? Maybe everyone working with computers is rich and the others poor? Let's look at everyone who submitted salary and profession and see whether the practical computer people are making bank:

lwi <- subset(lw, !is.na(as.integer(as.character(Income))))
lwi <- subset(lwi, !is.na(as.character(Profession)))
cs <- as.integer(as.character(lwi[as.character(lwi$Profession)=="Computers (practical: IT, programming, etc.)",]$Income))
others <- as.integer(as.character(lwi[as.character(lwi$Profession)!="Computers (practical: IT, programming, etc.)",]$Income))
# ordinary t-test, but we'll exclude anyone with zero income (unemployed?)
t.test(cs[cs!=0],others[others!=0])
Welch Two Sample t-test
data: cs[cs != 0] and others[others != 0]
t = 5.905, df = 309.3, p-value = 9.255e-09
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
22344 44673
sample estimates:
mean of x mean of y
76458 42950

Wow. Just wow. 76k vs 43k. I mean, maybe this would go away with enough fiddling (eg. cost-of-living) but it's still dramatic. This suggests a new theory to me: maybe Conscientiousness does correlate with income at its usual high rate for everyone but computer people, who are simply in such high demand that lack of Conscientiousness doesn't matter:

R> lwi <- subset(lw, !is.na(as.integer(as.character(Income))))
R> lwi <- subset(lwi, !is.na(as.character(Profession)))
R> lwi <- subset(lwi, !is.na(as.integer(as.character(BigFiveC))))
R> cs <- lwi[as.character(lwi$Profession)=="Computers (practical: IT, programming, etc.)",]
R> others <- lwi[as.character(lwi$Profession)!="Computers (practical: IT, programming, etc.)",]
R> cor.test(as.integer(as.character(cs$BigFiveC)), as.integer(as.character(cs$Income)))
Pearson's product-moment correlation
data: as.integer(as.character(cs$BigFiveC)) and as.integer(as.character(cs$Income))
t = 0.5361, df = 87, p-value = 0.5933
alternative hypothesis: true correlation is not equal to 0
95 percent confidence interval:
-0.1527 0.2625
sample estimates:
cor
0.05738
R> cor.test(as.integer(as.character(others$BigFiveC)), as.integer(as.character(others$Income)))
Pearson's product-moment correlation
data: as.integer(as.character(others$BigFiveC)) and as.integer(as.character(others$Income))
t = 1.997, df = 200, p-value = 0.04721
alternative hypothesis: true correlation is not equal to 0
95 percent confidence interval:
0.001785 0.272592
sample estimates:
cor
0.1398

So for the CS people the correlation is small and not statistically significant, while for non-CS people the correlation is about 2.4x larger (0.14 vs 0.06) and statistically significant.
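Whether the two correlations actually differ from each other is a separate question from whether each one differs from zero. A rough way to check is a Fisher r-to-z comparison of two independent correlations. The sketch below plugs in the r values from the two cor.test outputs above, taking n = df + 2 in each group; compare_cor is a hypothetical helper name, not something from the original analysis.

```r
# Fisher r-to-z test for the difference between two independent correlations.
compare_cor <- function(r1, n1, r2, n2) {
  z1 <- atanh(r1)                          # Fisher transform of each r
  z2 <- atanh(r2)
  se <- sqrt(1 / (n1 - 3) + 1 / (n2 - 3))  # standard error of the difference
  z  <- (z2 - z1) / se
  p  <- 2 * pnorm(-abs(z))                 # two-sided p-value
  c(z = z, p = p)
}

# CS group: r = 0.05738, df = 87; non-CS group: r = 0.1398, df = 200.
res <- compare_cor(r1 = 0.05738, n1 = 87 + 2, r2 = 0.1398, n2 = 200 + 2)
print(res)
```

On these inputs z comes out at roughly 0.65, which would not separate the two groups at conventional thresholds, so the CS vs non-CS contrast is probably best read as suggestive rather than established.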

Comment author: Kindly 30 November 2012 04:07:27AM 14 points

There is a correlation of 0.13 between non-responses and N.

Of course, there's also a correlation of -0.13 between C and the random number generator.
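The point about the random number generator can be illustrated with a quick simulation: correlate one variable against many columns of pure noise and look at the largest correlation that turns up by chance. This is only a sketch; n = 423 is taken from the df = 421 reported in the parent comment, while k = 100 unrelated columns is a made-up number, not the survey's actual column count.

```r
set.seed(42)
n <- 423   # respondents with a Big Five C score (df = 421 in the parent comment)
k <- 100   # hypothetical number of unrelated survey columns

target <- rnorm(n)                        # stand-in for the C score
noise  <- matrix(rnorm(n * k), nrow = n)  # k columns of pure noise

r <- as.vector(cor(target, noise))        # one correlation per noise column
max_abs_r <- max(abs(r))
print(max_abs_r)
```

Under these assumptions the largest spurious |r| tends to land somewhere around 0.1, the same order as the 0.13 figures above, which is why a single correlation of that size picked out of many comparisons is weak evidence.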

Comment author: [deleted] 30 November 2012 10:48:20AM 10 points

People who had seen the RNG give a large number were primed to feel unusually reckless when taking the Big 5 test. Duh. (Just kidding.)