Imagine the following situation: you have come across numerous references to a paper purporting to show that the chances of successfully treating a disease contracted at age 10 are substantially lower the later the disease is detected: somewhat lower if detected at age 20, dropping to very poor if detected at age 50. Every author draws more or less the same bar chart to depict this situation: the picture below, showing rising mortality from left to right.
You search for the original paper, which proves to be a long quest: the conference publisher has lost some of its archives in several moves, several people citing the paper turn out to no longer have a copy, and so on. You finally locate a copy of the paper (let's call it G99) thanks to a helpful friend with great scholarly connections.
And you find out some interesting things.
The most striking is what the author's original chart depicts: the chances of successfully treating the disease, when it is detected at age 50, vary substantially with the age at which it was contracted; mortality is highest if the disease was contracted at age 10 and lowest if it was contracted at age 40. The chart showing this is the picture below: decreasing mortality from top to bottom, with the ages at which the disease was contracted on the vertical axis.
Not only is the representation topsy-turvy; the two diagrams can't be about the same thing, since what is held constant in the first (the age at which the disease was contracted) is the variable in the other, and what varies in the first (the age at which the disease is detected) is held constant in the other.
Now, as you research the issue a little more, you find out that authors prior to G99 have often used the first diagram to report their findings; reportedly, several different studies on different populations (dating back to the eighties) have yielded similar results.
But when citing G99, nobody reproduces the actual diagram from G99; they all reproduce the older diagram (or some variant of it).
You are tempted to conclude that the authors citing G99 are citing "from memory": they are aware of the earlier research, and they have a vague recollection that G99 contains results that are not totally at odds with it. Same difference, they reason: G99 is one more confirmation of the earlier research, which is adequately summarized by the standard diagram.
And then you come across a paper by the same author, but from 10 years earlier. Let's call it G89. There is a strong presumption that the study in G99 is the same one described in G89, for the following reasons: a) the researcher who wrote G99 was by then already retired from the institution where they obtained their results; b) the G99 "paper" isn't in fact a paper but a PowerPoint deck summarizing previous results obtained by the author.
And in G89, you read the following: "This study didn't accurately record the mortality rates at various ages after contracting the disease, so we will use average rates summarized from several other studies."
So basically everyone who has been citing G99 has been building castles on sand.
Suppose that, far from being some exotic disease affecting a few individuals each year, the disease in question were one of the world's major killers (say, tuberculosis, the world's leading cause of death from infectious disease), and the reason why everyone is citing either G99 or some of the earlier research is to lend support to the standard strategies for fighting the disease.
When you look at the earlier research, you find nothing to allay your worries: the earlier studies are described only summarily, in broad overview papers or secondary sources; the numbers don't seem to match up, and so on. In effect you are discovering, about thirty years later, that what was taken for granted as a major finding on one of the principal topics of the discipline in fact has "sloppy academic practice" written all over it.
If this story were true, and this were medicine we were talking about, what would you expect (or at least hope for, if you haven't become too cynical) should this story come to light? In a well-functioning discipline, a wave of retractions, public apologies, general embarrassment and a major re-evaluation of public health policies concerning this disease would follow.
The story is substantially true, but the field isn't medicine: it is software engineering.
I have transposed the story to medicine, temporarily, as an act of benign deception, to which I now confess. My intention was to bring out the structure of this story, and if, while thinking it was about health, you felt outraged at this miscarriage of academic process, you should still feel outraged upon learning that it is in fact about software.
The "disease" isn't some exotic oddity, but the software equivalent of tuberculosis - the cost of fixing defects (a.k.a. bugs).
The original claim was that "defects introduced in early phases cost more to fix the later they are detected". The misquoted chart says this instead: "defects detected in the operations phase (once software is in the field) cost more to fix the earlier they were introduced".
Any result concerning the "disease" of software bugs counts as a major result, because it affects very large fractions of the population, and accounts for a major fraction of the total "morbidity" (i.e. lack of quality, project failure) in the population (of software programs).
The earlier article by the same author contained the following confession: "This study didn't accurately record the engineering times to fix the defects, so we will use average times summarized from several other studies to weight the defect origins".
Not only is this one major result suspect; the same pattern of "citogenesis" turns up when investigating several other important claims.
Software engineering is a diseased discipline.
The publication I've labeled "G99" is generally cited as: Robert B. Grady, An Economic Release Decision Model: Insights into Software Project Management, in proceedings of Applications of Software Measurement (1999). The second diagram is from a photograph of a hard copy of the proceedings.
Here is one typical publication citing Grady 1999, from which the first diagram is extracted. You can find many more via a Google search. The "this study didn't accurately record" quote is discussed here, and can be found in "Dissecting Software Failures" by Grady, in the April 1989 issue of the "Hewlett Packard Journal"; you can still find one copy of the original source on the Web, as of early 2013, but link rot is threatening it with extinction.
A more extensive analysis of the "defect cost increase" claim is available in my book-in-progress, "The Leprechauns of Software Engineering".
Here is how the axes were originally labeled; first diagram:
- vertical: "Relative Cost to Correct a Defect"
- horizontal: "Development Phase" (values "Requirements", "Design", "Code", "Test", "Operation" from left to right)
- figure label: "Relative cost to correct a requirement defect depending on when it is discovered"
Second diagram:
- vertical: "Activity When Defect was Created" (values "Specifications", "Design", "Code", "Test" from top to bottom)
- horizontal: "Relative cost to fix a defect after release to customers compared to the cost of fixing it shortly after it was created"
- figure label: "Relative Costs to Fix Defects"
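To make the structural mismatch concrete, here is a minimal plotting sketch of the two charts side by side. The bar heights are placeholder values chosen purely to convey each chart's shape; they are not taken from Grady's papers or from any of the secondary sources. Only the axis labels and titles come from the publications.

```python
# A rough sketch (Python + matplotlib) of the *structure* of the two charts.
# Bar heights are illustrative placeholders only, NOT data from the sources.
import matplotlib.pyplot as plt

fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(11, 4))

# First (commonly reproduced) diagram: the cost of fixing a *requirements*
# defect rises with the phase in which it is discovered.
phases = ["Requirements", "Design", "Code", "Test", "Operation"]
illustrative_cost = [1, 2, 5, 20, 100]  # placeholder values only
ax1.bar(phases, illustrative_cost)
ax1.set_xlabel("Development Phase (when the defect is discovered)")
ax1.set_ylabel("Relative Cost to Correct a Defect")
ax1.set_title("Cost to correct a requirement defect\nby phase of discovery")

# Second (actual G99) diagram: the cost of fixing a defect *after release*
# falls with how late in development the defect was created.
origins = ["Specifications", "Design", "Code", "Test"]
illustrative_ratio = [100, 50, 20, 5]  # placeholder values only
# Reverse so "Specifications" appears at the top, as in the original figure.
ax2.barh(origins[::-1], illustrative_ratio[::-1])
ax2.set_ylabel("Activity When Defect was Created")
ax2.set_xlabel("Cost to fix after release vs. shortly after creation")
ax2.set_title("Relative Costs to Fix Defects")

plt.tight_layout()
plt.show()
```

Even with made-up numbers, the sketch shows the point at issue: the first chart varies the phase of discovery for a defect of fixed origin, while the second varies the origin for defects all discovered after release.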
Of course, by 1989 both experience and multiple cause-and-effect explanations had told people this was the case. And the two graphs are actually different data sets pointing to the same conclusion, so it looks like people just took whatever graph they could find quickly.
Comparing quickly-found early bugs with quickly-found late bugs is still impossible with data of this quality, but perhaps that is for the better. The real problem is not citing the graph correctly; it is understanding what affects both bug severity and bug detection - like having any semblance of order in the team.
Are there people who claim this is actual science rather than a set of best practices? Maybe they are the real problem for now...
Typical quote: "Software engineering is defined as the systematic application of science, mathematics, technology and engineering principles to the analysis, development and maintenance of software systems, with the aim of transforming software development from an ad hoc craft to a repeatable, quantifiable and manageable process."
And the publications certainly dress up software engineering in the rhetoric of science: the style of citation where you say something and then add "(Grady 1999)" as if that was supposed to be authoritative.
It will be impossible to make progress in this field (and I think this has implications for AI and even FAI) until such confusions are cleared away.