twanvl comments on Experiment Idea Thread - Spring 2011 - Less Wrong

Post author: Psychohistorian 06 May 2011 06:10PM




Comment author: Morendil 06 May 2011 09:44:49PM

Field: Software Engineering. Issue: what determines how efficiently people get work done that involves writing software?

At the Paris LW meetup, I described to Alexandros the particular subtopic about which I noticed confusion (including my own) - people call it the "10x thesis". According to this, in a typical workgroup of software professionals (people paid to write code), there will be a ten-to-one ratio between the productivity of the best and the worst. According to a stronger version, these disparities are unrelated to experience.

The research in this area typically has the following setup: you get a group of N people in one room, and give them the same task to perform. Usually there is some experimental condition that you want to measure the effect of (for instance "using design patterns" vs "not using design patterns"), so you split them into subgroups accordingly. You then measure how long each takes to finish the task.

The "10x" result comes from interpreting the same kind of experimental data, but instead of looking at the effect of the experimental condition, you look at the variance itself. (Historically, this got noticed because it vexed researchers that the variance was almost always swamping out the effects of the experimental conditions.)

The issue that perplexes me is that taking a best-to-worst ratio in each group, in such cases, will give a measurement of variance that is composed of two things: first, how variable the time required to complete a task is intrinsically, and second, how different people in the relevant population (which is itself hard to define) differ in their effectiveness at completing tasks.
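To make the two components concrete, here is a toy simulation (Python; all the numbers are made up for illustration). It generates completion times as a per-person mean plus intrinsic noise, and checks that the total variance splits exactly into a within-person part and a between-person part - the law of total variance:

```python
import random

random.seed(0)

# Hypothetical model: programmer i has a mean completion time mu_i
# (between-person variation), and each attempt adds intrinsic noise
# on top of it (within-person variation).
N_PEOPLE, N_TRIALS = 500, 200
BETWEEN_SD, WITHIN_SD = 3.0, 4.0

def variance(xs):
    m = sum(xs) / len(xs)
    return sum((x - m) ** 2 for x in xs) / len(xs)

person_means = [random.gauss(10.0, BETWEEN_SD) for _ in range(N_PEOPLE)]
times = [[random.gauss(mu, WITHIN_SD) for _ in range(N_TRIALS)]
         for mu in person_means]

within = sum(variance(row) for row in times) / N_PEOPLE    # E[Var(T | person)]
between = variance([sum(row) / N_TRIALS for row in times]) # Var(E[T | person])
total = variance([t for row in times for t in row])

# Law of total variance: total == within + between (exactly, with
# population variances over a complete rectangular table).
print(within, between, total)
```

A best-to-worst ratio computed from a single round of data lumps these two terms together, which is exactly the problem described above.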

When I discussed this with Alexandros I brought up the "ideal experiment" I would want to use to measure the first component: take one person, give them a task, measure how long they take. Repeat N times.

However, this experiment isn't valid, because remembering how you solved the task the first time around saves a huge amount of time on successive attempts.

So my "ideal" experiment has to be amended: the same, but you wipe the programmer's short-term memory each time, resetting them to the state they were in before the task. Now it is merely an impossible experiment.

What surprised me was Alexandros' next remark: "You can measure the same thing by giving the same task to N programmers, instead".

This seems clearly wrong to me. There are two different probability distributions involved: one is within-subject, the other inter-subject. They do not necessarily have the same shape. What you measure when giving one task to N programmers mixes the two together, and the shape of that observed distribution could be consistent with infinitely many hypotheses about the shapes of the underlying distributions.
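A toy simulation makes the non-identifiability vivid (Python; the means and spreads are invented for illustration). Two opposite hypotheses - all the variation is inter-subject, or all of it is within-subject - produce one-shot data of the same shape:

```python
import random

random.seed(1)

N = 10_000  # programmers, each attempting the task once

def sample_times(between_sd, within_sd):
    # Observed time = that person's mean + one draw of intrinsic noise.
    return [random.gauss(random.gauss(20.0, between_sd), within_sd)
            for _ in range(N)]

def sd(xs):
    m = sum(xs) / len(xs)
    return (sum((x - m) ** 2 for x in xs) / len(xs)) ** 0.5

a = sample_times(between_sd=5.0, within_sd=0.0)  # all inter-subject
b = sample_times(between_sd=0.0, within_sd=5.0)  # all within-subject

# Both hypotheses yield the same observed spread (and, being Gaussian,
# the same distribution shape), so one-shot data cannot tell them apart.
print(sd(a), sd(b))
```

With only one observation per programmer, any split of the total variance between the two sources fits the data equally well.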

Thus, my question - what would be a good experimental setup and statistical tools to infer within-subject variation, which cannot be measured, from what we can measure?

Bonus question: am I totally confused about the matter?

Comment author: twanvl 06 May 2011 11:14:35PM

If being a good or bad programmer is an intrinsic quality that is independent of the task, then you could just give the same subject different tasks to solve. So you take N programmers and give each of them all K tasks to solve. Then you can determine the mean difficulty of each task as well as the mean quality of each programmer. Given those, you should be able to infer the remaining within-subject variance.
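A sketch of how this could work under an additive version of that model (Python; the effect sizes and noise level are all hypothetical). Fit per-programmer and per-task means from an N-by-K table of times, and estimate the intrinsic noise from the residuals, as in a two-way ANOVA:

```python
import random

random.seed(2)

N, K = 20, 15  # hypothetical: 20 programmers, 15 tasks
TRUE_NOISE_SD = 1.0

skill = [random.gauss(0.0, 3.0) for _ in range(N)]   # programmer effects
diff = [random.gauss(10.0, 2.0) for _ in range(K)]   # task difficulties

# Observed time for programmer i on task k: additive model + intrinsic noise.
t = [[diff[k] + skill[i] + random.gauss(0.0, TRUE_NOISE_SD)
      for k in range(K)] for i in range(N)]

grand = sum(sum(row) for row in t) / (N * K)
prog_mean = [sum(row) / K for row in t]
task_mean = [sum(t[i][k] for i in range(N)) / N for k in range(K)]

# Residuals after removing both effects estimate the within-subject noise.
resid_var = sum((t[i][k] - prog_mean[i] - task_mean[k] + grand) ** 2
                for i in range(N) for k in range(K)) / ((N - 1) * (K - 1))
print(resid_var ** 0.5)  # recovers roughly TRUE_NOISE_SD
```

The key assumption doing the work is additivity and task-independence of skill; if those fail, the residuals absorb the interaction too.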

There are some details to be worked out; for example, is task difficulty multiplicative or additive? I.e., if task A is five times as hard as task B, will the standard deviation of completion times also be five times as large? But that can be settled with enough data and proper prior probabilities over the different models.