Diseased disciplines: the strange case of the inverted chart

Morendil


7th Feb 2012


Imagine the following situation: you have come across numerous references to a paper purporting to show that the chances of successfully treating a disease contracted at age 10 are substantially lower if the disease is detected later, ranging from somewhat lower if detected at age 20 to very poor if detected at age 50. Every author draws more or less the same bar chart to depict this situation: the picture below, showing rising mortality from left to right.

Rising mortality, left to right

You search for the original paper, which proves a long quest: the conference publisher has lost some of its archives in several moves, several people citing the paper turn out to no longer have a copy, etc. You finally locate a copy of the paper (let's call it G99) thanks to a helpful friend with great scholarly connections.

And you find out some interesting things.

The most striking is what the author's original chart depicts: the chances of successfully treating a disease detected at age 50 are substantially lower the earlier it was contracted; mortality is highest if the disease was contracted at age 10 and lowest if contracted at age 40. The chart showing this is the picture below, showing decreasing mortality from top to bottom, with age at contraction on the vertical axis.

Decreasing mortality, top to bottom

Not only is the representation topsy-turvy; the two diagrams can't be about the same thing, since what is constant in the first (age disease contracted) is variable in the other, and what is variable in the first (age disease detected) is constant in the other.

Now, as you research the issue a little more, you find out that authors prior to G99 have often used the first diagram to report their findings; reportedly, several different studies on different populations (dating back to the eighties) have yielded similar results.

But when citing G99, nobody reproduces the actual diagram in G99; they all reproduce the older diagram (or some variant of it).

You are tempted to conclude that the authors citing G99 are citing "from memory"; they are aware of the earlier research, and they have a vague recollection that G99 contains results that are not totally at odds with it. Same difference, they reason: G99 is one more confirmation of the earlier research, which is adequately summarized by the standard diagram.

And then you come across a paper by the same author, but from 10 years earlier. Let's call it G89. There is a strong presumption that the study in G99 is the same one described in G89, for the following reasons: a) the researcher who wrote G99 was by then already retired from the institution where they obtained their results; b) the G99 "paper" isn't in fact a paper, it's a PowerPoint summarizing previous results obtained by the author.

And in G89, you read the following: "This study didn't accurately record the mortality rates at various ages after contracting the disease, so we will use average rates summarized from several other studies."

So basically everyone who has been citing G99 has been building castles on sand.

Suppose that, far from some exotic disease affecting a few individuals each year, the disease in question was one of the world's major killers (say, tuberculosis, the world's leader in infectious disease mortality), and the reason why everyone is citing either G99 or some of the earlier research is to lend support to the standard strategies for fighting the disease.

When you look at the earlier research, you find nothing to allay your worries: the earlier studies are described only summarily, in broad overview papers or secondary sources; the numbers don't seem to match up, and so on. In effect you are discovering, about thirty years later, that what was taken for granted as a major finding on one of the principal topics of the discipline in fact has "sloppy academic practice" written all over it.

If this story were true, and this were medicine we were talking about, what would you expect (or at least hope for, if you haven't become too cynical), should this story come to light? In a well-functioning discipline, a wave of retractions, public apologies, general embarrassment and a major re-evaluation of public health policies concerning this disease would follow.

The story is substantially true, but the field isn't medicine: it is software engineering.

I have transposed the story to medicine, temporarily, as an act of benign deception, to which I now confess. My intention was to bring out the structure of this story, and if, while thinking it was about health, you felt outraged at this miscarriage of academic process, you should still feel outraged upon learning that it is in fact about software.

The "disease" isn't some exotic oddity, but the software equivalent of tuberculosis - the cost of fixing defects (a.k.a. bugs).

The original claim was that "defects introduced in early phases cost more to fix the later they are detected". The misquoted chart says this instead: "defects detected in the operations phase (once software is in the field) cost more to fix the earlier they were introduced".
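One way to see why the two statements are not interchangeable is to picture a table of relative fix costs indexed by both the phase where a defect was introduced and the phase where it was detected. The sketch below is purely illustrative: the phase names follow the first diagram's labels, while `make_cost_table`, its `growth` parameter and the doubling-per-phase rule are invented placeholders, not data from Grady or anyone else. The first chart corresponds to one row of such a table and the second to one column; the two share a single cell.

```python
# Hypothetical illustration only: the costs below are invented placeholders,
# NOT measurements from Grady or any other study.
PHASES = ["Requirements", "Design", "Code", "Test", "Operation"]


def make_cost_table(growth=2.0):
    """Build a toy table mapping (introduced, detected) to a relative fix cost.

    Assumption for illustration: cost multiplies by `growth` for every phase
    that elapses between introduction and detection.
    """
    table = {}
    for i, introduced in enumerate(PHASES):
        for j, detected in enumerate(PHASES):
            if j >= i:  # a defect cannot be detected before it is introduced
                table[(introduced, detected)] = growth ** (j - i)
    return table


def cost_by_detection_phase(table, introduced):
    """First chart: introduction phase held fixed, detection phase varies."""
    return {d: cost for (i, d), cost in table.items() if i == introduced}


def cost_by_introduction_phase(table, detected):
    """Second chart: detection phase held fixed, introduction phase varies."""
    return {i: cost for (i, d), cost in table.items() if d == detected}


table = make_cost_table()
first_chart = cost_by_detection_phase(table, "Requirements")
second_chart = cost_by_introduction_phase(table, "Operation")
```

In this toy table the row and the column happen to mirror each other, but only because of the made-up doubling rule. With real data, nothing forces a row and a column to tell the same story, which is why substituting one chart for the other is not harmless.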

Any result concerning the "disease" of software bugs counts as a major result, because it affects very large fractions of the population, and accounts for a major fraction of the total "morbidity" (i.e. lack of quality, project failure) in the population (of software programs).

The earlier article by the same author contained the following confession: "This study didn't accurately record the engineering times to fix the defects, so we will use average times summarized from several other studies to weight the defect origins".

Not only is this one major result suspect, but the same pattern of "citogenesis" turns up when investigating several other important claims.

Software engineering is a diseased discipline.

The publication I've labeled "G99" is generally cited as: Robert B. Grady, An Economic Release Decision Model: Insights into Software Project Management, in proceedings of Applications of Software Measurement (1999). The second diagram is from a photograph of a hard copy of the proceedings.

Here is one typical publication citing Grady 1999, from which the first diagram is extracted. You can find many more via a Google search. The "this study didn't accurately record" quote is discussed here, and can be found in "Dissecting Software Failures" by Grady, in the April 1989 issue of the "Hewlett Packard Journal"; you can still find one copy of the original source on the Web, as of early 2013, but link rot is threatening it with extinction.

A more extensive analysis of the "defect cost increase" claim is available in my book-in-progress, "The Leprechauns of Software Engineering".

Here is how the axes were originally labeled; first diagram:

vertical: "Relative Cost to Correct a Defect"
horizontal: "Development Phase" (values "Requirements", "Design", "Code", "Test", "Operation" from left to right)
figure label: "Relative cost to correct a requirement defect depending on when it is discovered"

Second diagram:

vertical: "Activity When Defect was Created" (values "Specifications", "Design", "Code", "Test" from top to bottom)
horizontal: "Relative cost to fix a defect after release to customers compared to the cost of fixing it shortly after it was created"
figure label: "Relative Costs to Fix Defects"


150 Comments

Polymeron

This strikes me as particularly galling because I have in fact repeated this claim to someone new to the field. I think I prefaced it with "studies have conclusively shown...". Of course, it was unreasonable of me to think that what is being touted by so many as well-researched was not, in fact, so.

Mind, it seems to me that defects do follow both patterns: Introducing defects earlier and/or fixing them later should come at a higher dollar cost, that just makes sense. However, it could be the same type of "makes sense" that made Aristotle conclude that heavy objects fall faster than light objects - getting actual data would be much better than reasoning alone, especially as it would tell us just how much costlier, if at all, these differences are - it would be an actual precise tool rather than a crude (and uncertain) rule of thumb.

I do have one nagging worry about this example: These days a lot of projects collect a lot of metrics. It seems dubious to me that no one has tried to replicate these results.

rwallace

We know that late detection is sometimes much more expensive, simply because depending on the domain, some bugs can do harm (letting bad data into the database, making your customers' credit card numbers accessible to the Russian Mafia, delivering a satellite to the bottom of the Atlantic instead of into orbit) that is much more expensive than the cost of fixing the code itself. So it's clear that on average, cost does increase with time of detection. But are those high-profile disasters part of a smooth graph, or is it a step function where the cost of fixing the...

vi21maobk9vp

The real problem with these graphs is not that they were cited wrong. After all, it does look like both are taken from different data sets, however they were collected, and support the same conclusion. The true problem is that it is hard to say what they measure at all. If this true problem didn't exist, and these graphs measured something that can be measured, I'd bet that these graphs not being refuted would actually mean that they are both showing the true sign of the correlation. The reason is quite simple: every possible metric gets collected for a stupid presentation from time to time. If the correlation was falsifiable and wrong, we would likely see falsifications on TheDailyWTF forum as anecdotes.

Morendil

Mostly the ones that are easy to collect: a classic case of "looking under the lamppost where there is light rather than where you actually lost your keys". [...] Now we're starting to think. Could we (I don't have a prefabricated answer to this one) think of a cheap and easy to run experiment that would help us see more clearly what's going on?
