This is an idea that just occurred to me. We have a large community of people who think about scientific problems recreationally, many of whom are in no position to go investigate them directly. Hopefully, though, some other community members are in a position to investigate them, or know people who are. The idea here is to let people propose relatively specific ideas for experiments, which can be upvoted if people think they are wise, and commented on and refined by others. Grouping them together in an easily identifiable, organized way where people can offer approval and suggestions may actually help advance human knowledge, and with its high sanity waterline and (kind of) diverse group of readers, this community seems like an excellent place to try it.
These should be relatively practical, with an eye towards providing some aspiring grad student or professor with enough of an idea that they could go implement it. You should explain the general field (physics, AI, evolutionary psychology, economics, psychology, etc.) as well as the question the experiment is designed to investigate, in as much detail as you are reasonably capable of.
If this idea proves popular, a new thread can be started whenever one of these reaches 500 comments, or quarterly, depending on the level of interest. I expect this to help people refine their understanding of various sciences, and if it ever gets turned into even a few good experiments, it will prove immensely worthwhile.
I think it's best to make these distinct from the general discussion thread because they have a very narrow purpose. I'll post an idea or two of my own to get things started. I'd encourage people to post not only experiment ideas, but also criticism and suggestions regarding this thread concept. I'd also suggest upvoting or downvoting this post according to whether you think it is a good or bad idea, to better establish whether future installments will be worthwhile.
Field: Software Engineering. Issue: what determines how efficiently people get work done that entails writing software.
At the Paris LW meetup, I described to Alexandros the particular subtopic about which I noticed confusion (including my own): the so-called "10x thesis". According to this thesis, in a typical workgroup of software professionals (people paid to write code), there will be a ten-to-one ratio between the productivity of the best and the worst. A stronger version holds that these disparities are unrelated to experience.
The research in this area typically has the following setup: you get a group of N people in one room, and give them the same task to perform. Usually there is some experimental condition that you want to measure the effect of (for instance "using design patterns" vs "not using design patterns"), so you split them into subgroups accordingly. You then measure how long each takes to finish the task.
The "10x" result comes from interpreting the same kind of experimental data, but instead of looking at the effect of the experimental condition, you look at the variance itself. (Historically, this got noticed because it vexed researchers that the variance was almost always swamping out the effects of the experimental conditions.)
The issue that perplexes me is that a best-to-worst ratio taken within each group, in such cases, measures a variance that is composed of two things: first, how intrinsically variable the time required to complete a task is for any one person, and second, how much people in the relevant population (which is itself hard to define) differ in their effectiveness at completing tasks.
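To make that concrete, here is one way to write the decomposition down. The additive, normal form and all of the symbols are purely my own illustration, not anything taken from the studies:

```latex
% Toy model: T_{ij} is programmer i's time on attempt j at the task.
T_{ij} = \mu_i + \varepsilon_{ij},
\qquad \mu_i \sim \mathcal{N}\!\left(\mu,\ \sigma^2_{\mathrm{between}}\right),
\qquad \varepsilon_{ij} \sim \mathcal{N}\!\left(0,\ \sigma^2_{\mathrm{within}}\right)
% With one observation per programmer, the spread you observe mixes both terms:
\operatorname{Var}\!\left(T_{ij}\right) = \sigma^2_{\mathrm{between}} + \sigma^2_{\mathrm{within}}
```

A best-to-worst ratio computed from one-observation-per-person data responds to both variance terms at once, which is exactly the conflation above.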
When I discussed this with Alexandros I brought up the "ideal experiment" I would want to use to measure the first component: take one person, give them a task, measure how long they take. Repeat N times.
However, this experiment isn't valid, because remembering how you solved the task the first time around saves a huge amount of time on successive attempts.
So my "ideal" experiment has to be amended: the same, but you wipe the programmer's short-term memory each time, resetting them to the state they were in before the task. Now this is only an impossible experiment.
What surprised me was Alexandros' next remark: "You can measure the same thing by giving the same task to N programmers, instead".
This seems clearly wrong to me. There are two different probability distributions involved: one is within-subject, the other inter-subject, and they do not necessarily have the same shape. What you measure when giving one task to N programmers mixes the two: each observation combines a draw of "which programmer" with a draw from that programmer's own distribution, and the shape of the resulting distribution is consistent with infinitely many hypotheses about the shapes of the underlying ones.
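A toy simulation (entirely my own, not based on any of the studies mentioned) makes the non-identifiability vivid: a world where all the spread is between programmers and a world where all of it is within-subject noise produce one-task-per-programmer data that look essentially identical.

```python
# Two different "worlds" that a one-task, N-programmers study cannot tell apart.
import numpy as np

rng = np.random.default_rng(0)
N = 10_000  # programmers, one observation each

# World A: programmers differ a lot, but each one is perfectly consistent.
times_a = rng.normal(loc=10.0, scale=2.0, size=N)        # all spread is between-subject

# World B: programmers are identical, but task time is intrinsically noisy.
times_b = 10.0 + rng.normal(loc=0.0, scale=2.0, size=N)  # all spread is within-subject

for name, t in [("World A", times_a), ("World B", times_b)]:
    p10, p90 = np.percentile(t, [10, 90])
    print(f"{name}: mean={t.mean():.2f}  sd={t.std():.2f}  90th/10th ratio={p90 / p10:.2f}")
```

Both worlds yield the same observed distribution, even though one attributes the spread entirely to skill differences and the other entirely to within-subject noise.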
Thus, my question: what would be a good experimental setup, and what statistical tools, to infer within-subject variation, which cannot be measured directly, from what we can measure?
Bonus question: am I totally confused about the matter?
If being a good or bad programmer is an intrinsic quality independent of the task, then you could just give the same subject different tasks to solve. So you take N programmers and give them all K tasks to solve. Then you can determine the mean difficulty of each task as well as the mean quality of each programmer, and given that, you should be able to infer the variance.
There are some details to be worked out; for example, is task difficulty multiplicative or additive? I.e., if task A is five times as hard as task B, will the standard deviation also be five times as large? But that can be settled with enough data and proper prior probabilities over the different models.
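Here is a rough sketch of how that crossed design could be analyzed under the multiplicative assumption (working on log-time makes it additive). Everything here, including the parameter values and variable names, is a made-up toy, just to show that the residual of a plain two-way additive fit recovers the within-subject component:

```python
# Toy crossed design: N programmers x K tasks, multiplicative effects,
# so we work on log(time), where the model becomes additive.
import numpy as np

rng = np.random.default_rng(1)
N, K = 30, 8  # programmers, tasks

# Ground truth on the log-time scale; sigma_within is what we want to recover.
skill = rng.normal(0.0, 0.6, size=N)         # programmer effects
difficulty = rng.normal(2.0, 0.8, size=K)    # task effects
sigma_within = 0.3
log_time = (skill[:, None] + difficulty[None, :]
            + rng.normal(0.0, sigma_within, size=(N, K)))

# Plain two-way additive fit: grand mean + programmer effects + task effects.
grand = log_time.mean()
prog_effect = log_time.mean(axis=1) - grand   # estimated programmer quality
task_effect = log_time.mean(axis=0) - grand   # estimated task difficulty
residual = log_time - (grand + prog_effect[:, None] + task_effect[None, :])

# The residual variance estimates the within-subject component,
# with the usual (N-1)(K-1) degrees of freedom.
sigma_hat = np.sqrt((residual ** 2).sum() / ((N - 1) * (K - 1)))
print(f"within-subject sd: true {sigma_within:.2f}, estimated {sigma_hat:.2f}")

# The spread of the estimated programmer effects reflects between-subject
# variation (slightly inflated by within-subject noise averaged over K tasks).
print(f"between-programmer sd: true 0.60, estimated {prog_effect.std(ddof=1):.2f}")
```

One caveat: in real data the residual would also absorb any programmer-by-task interaction, so it gives an upper bound on pure within-subject variation rather than a clean measurement of it.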