The 2006 report from NASA's "Independent Verification and Validation Facility" makes some interesting claims. Turning to page 6, we learn that thanks to IV&V, "NASA realized a software rework risk reduction benefit of $1.6 Billion in Fiscal Year 2006 alone". This is close to 10% of NASA's overall annual budget, roughly equal to the entire annual budget of the International Space Station!
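The scale comparison is easy to check; a quick sketch, assuming the commonly cited figure of roughly $16.6 billion for NASA's FY2006 budget (the figure is my assumption, not from the IV&V report):

```python
# Sanity-checking the scale of the claimed benefit.
# Assumption: NASA's FY2006 budget was roughly $16.6 billion.
nasa_fy2006_budget = 16.6e9
claimed_benefit = 1.6e9

share = claimed_benefit / nasa_fy2006_budget
print(f"Claimed rework-risk benefit: {share:.1%} of NASA's annual budget")
```

At close to one dollar in ten, the claim is extraordinary enough to warrant a close look at where the number comes from.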
If the numbers check out, this is an impressive feat for IV&V (the more formal big brother of "testing" or "quality assurance" departments that most software development efforts include). Do they?
Flaubert and the math of ROI
Back in 1841, to tease his sister, Gustave Flaubert invented the "age of the captain problem", which ran like this:
A ship sails the ocean. It left Boston with a cargo of wool. It grosses 200 tons. [...] There are 12 passengers aboard, the wind is blowing East-North-East, the clock points to a quarter past three in the afternoon. It is the month of May. How old is the captain?
Flaubert was pointing out one common way people fail at math: you can only get sensible results from a calculation if the numbers you put in are related in the right ways. (Unfortunately, math education tends to be excessively heavy on the "manipulate numbers" part and to skimp on the "make sense of the question" part, a trend dissected by French mathematician Stella Baruk who titled one of her books after Flaubert's little joke on his sister.)
Unfortunately, NASA's math turns out on inspection to be "age-of-the-captain" math. (This strikes me as a big embarrassment to an organization literally composed mainly of rocket scientists.)
Over at Edge, Tetlock discusses "expert political judgment", the controversy surrounding Nate Silver's presidential predictions, overconfidence, motivated cognition, black swans, the IARPA forecasting contest, and much else. A few choice bits:
There's a question that I've been asking myself for nearly three decades now and trying to get a research handle on, and that is why is the quality of public debate so low and why is it that the quality often seems to deteriorate the more important the stakes get?
Is world politics like a poker game? This is what, in a sense, we are exploring in the IARPA forecasting tournament. You can make a good case that history is different and it poses unique challenges. This is an empirical question of whether people can learn to become better at these types of tasks. We now have a significant amount of evidence on this, and the evidence is that people can learn to become better. It's a slow process. It requires a lot of hard work, but some of our forecasters have really risen to the challenge in a remarkable way and are generating forecasts that are far more accurate than I would have ever supposed possible from past research in this area.
One of the things I've discovered in my work on assessing the accuracy of probability judgment is that there is much more eagerness in participating in these exercises among people who are younger and lower in status in organizations than there is among people who are older and higher in status in organizations.
From a sociological point of view, it's a minor miracle that this forecasting tournament is even occurring. Government agencies are not supposed to sponsor exercises that have the potential to embarrass them.
If you think that the Eurozone is going to collapse–if you think it was a really bad idea to put into common currency economies at very different levels of competitiveness, like Greece and Germany (that was a fundamentally unsound macroeconomic thing to do and the Eurozone is doomed), that's a nice example of an emphatic but untestable hedgehog kind of statement. It may be true, but it's not very useful for our forecasting tournament.
To make a forecasting tournament work we have to translate that hedgehog-like hunch into a testable proposition, like "will Greece leave, or formally withdraw from, the Eurozone by May 2013? Or will Portugal?" You need to translate the abstract interesting issue into testable propositions, and then you need to get lots of thoughtful people to make probability judgments in response to those testable proposition questions. You need to do that over, and over, and over again.
In our tournament, we've skimmed off the very best forecasters in the first year, the top two percent. We call them "super forecasters." They're working together in five teams of 12 each and they're doing very impressive work.
Another amazing and wonderful thing about this tournament is how many really smart, thoughtful people are willing to volunteer essentially enormous amounts of time to make this successful. We offer them a token honorarium. We're paying them right now $150 or $250 a year for their participation. For the ones who are really taking it seriously, it's way less than the minimum wage. And there are some very thoughtful professionals who are participating in this. Some political scientists I know have had some disparaging things to say about the people who might participate in something like this, and one phrase that comes to mind is "unemployed news junkies." I don't think that's a fair characterization of our forecasters. Certainly the most actively engaged of our forecasters are really pretty awesome. They're very skillful at finding information, synthesizing it, and applying it, and then updating in response to new information. And they're very rapid updaters.
(I confess to some feelings of pride, possibly unearned, on reading this last paragraph - as the top forecaster of a middle-ranked team.)
But actually, go read the whole thing.
Previously: part 1
The three tactics I described in part 1 are most suited to making an initial forecast. I will now turn to a question that was raised in comments on part 1 - that of updating when new evidence arrives. But first, I'd like to discuss the notion of a "well-specified forecast".
It is often surprisingly hard to frame a question in terms that make a forecast reasonably easy to verify and score. Questions can be ambiguous (consider "X will win the U.S. presidential election" - do we mean win the popular vote, or win the Electoral College?). They can fail to cover all possible outcomes (so "which of the candidates will win the election" needs a catch-all "Other").1
Low waterlines imply that it's relatively easy for a novice to outperform the competition. (In poker, as discussed in Nate Silver's book, the "fish" are those who can't master basic techniques such as folding when they have a poor hand, or calculating even roughly the expected value of a pot.) Does this apply to the domain of making predictions? It's early days, but it looks as if a smallish set of tools - a conscious status quo bias, respecting probability axioms when considering alternatives, considering reference classes, leaving yourself a line of retreat, detaching from sunk costs, and a few more - can at least place you in a good position.
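The "rough expected value of a pot" calculation that separates the fish from competent players really is just arithmetic; a sketch with made-up numbers:

```python
def call_ev(pot, call_cost, win_prob):
    """Expected value of calling a bet: the probability-weighted gain
    (the pot if we win) minus the cost of the call when we lose."""
    return win_prob * pot - (1 - win_prob) * call_cost

# A $100 pot, a $20 call, and roughly a 1-in-5 chance of winning:
ev = call_ev(pot=100, call_cost=20, win_prob=0.2)
# EV is about 0.2*100 - 0.8*20 = $4, so calling narrowly beats folding.
```

That such a simple computation clears the waterline in poker is precisely the point: the bar for outperforming the fish is low.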
Among the goals of Less Wrong is to "raise the sanity waterline" of humanity. We've also talked about "raising the rationality waterline": the phrase is somewhat popular around these parts, which suggests that the metaphor is catchy. But is that all there is to it, a catchy metaphor? Or can the phrase be more usefully cashed out?
While reading Nate Silver's The Signal and the Noise, I came across a discussion of "raising the waterline" which fleshes out the metaphor with a more substantial model. This model preserves some of the salient aspects of the metaphor as discussed on LW, for instance the perception that the current waterline (as regards sanity and rationality) is "ridiculously low". More interestingly, it fleshes out some of the specific ways that a "waterline" belief should constrain our future sensory experiences, maybe even to the point of quantifying what should result from low (or rising) waterlines.
This is intended as a short series:
- "Raising the waterline", this introductory post, will summarize Nate Silver's "waterline" model, within its original context of playing Poker, which Silver frames as a game of prediction under uncertainty. Poker therefore serves as a "toy model" for a much more general class of problems.
- "Raising the forecasting waterline" will extend the discussion to the kind of forecasts studied by Philip Tetlock's Good Judgment Project, a prediction game somewhat similar to PredictionBook and related to prediction markets; I will leverage the waterline model to extract useful insights from my participation in GJP.
- "Raising the discussion waterline", a shamelessly speculative coda, will relate the previous two posts to the question of "how do Internet discussions reliably lead to correct inferences from true beliefs, or fail to do so"; I will argue that the waterline model brings some hope that a few basic tactics could nevertheless provide large wins, and raise the more general question of what other low waterlines we could aim to exploit.
TL;DR: I align with the minority position that "there is a lot less to the so-called placebo effect than people tend to think there is (and the name is horribly misleading)", a strong opinion weakly held.
The following post is an off-the-cuff reply to a G+ post of gwern's, but I've been thinking about this on and off for quite a while. Were I to expand this for posting to Main, I would: a) go into more detail about the published research, b) introduce a second fallacy of reification for comparison, the so-called "10X variance in programmer productivity".
My agenda is to have this join my short series of articles on "software engineering as a diseased discipline", which I view as my modest attempt at "using Less Wrong ideas in your secret identity", and which is covered at greater length in my book-in-progress.
I would therefore appreciate your feedback and probing at weak points.
Most of the time, talk of placebo effects (or worse of "the" placebo effect) falls victim to the reification fallacy.
My position is roughly "there is a lot less to the so-called placebo effect than people think there is (and the name is horribly misleading)".
More precisely: the term "placebo" in the context of "placebo controlled trial" has some usefulness, when used to mean a particular way of distinguishing between the null and test hypotheses in a trial: namely, that the test and control group receive exactly the same treatment, except that you substitute, in the control group, an inert substance (or inoperative procedure) for the putatively active substance being tested.
Whatever outcome measures are used, they will generally improve somewhat even in the control group: this can be due to many things, including regression to the mean, the disease running its course, increased compliance with medical instructions due to being in a study, and expectancy effects leading to biased verbal self-reports.
None of these is properly speaking an "effect" causally linked to the inert substance (the "placebo pill"). The reification fallacy consists of thinking that because we give something a name ("the placebo effect") then there must be a corresponding reality. The false inference is "the people who improved in the control group were healed by the power of the placebo effect".
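Regression to the mean alone can produce an apparent "improvement" in an untreated control group; a minimal simulation (my illustration, with made-up severity scales):

```python
import random

random.seed(0)

# Each "patient" has a stable true severity plus day-to-day noise.
def measure(true_severity):
    return true_severity + random.gauss(0, 10)

population = [random.gauss(50, 10) for _ in range(10000)]

# Enroll only those who score badly at intake, as trials typically do...
candidates = [(s, measure(s)) for s in population]
enrolled = [(s, intake) for s, intake in candidates if intake > 65]

# ...then remeasure later, with no treatment at all.
intake_mean = sum(intake for _, intake in enrolled) / len(enrolled)
followup_mean = sum(measure(s) for s, _ in enrolled) / len(enrolled)
print(intake_mean, followup_mean)  # the untreated group "improves"
```

Selecting on a bad intake score preferentially picks patients whose noise happened to be unfavorable that day; at follow-up the noise averages out, and scores drop with no causal "effect" anywhere in sight.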
The further false inference is "there are ailments of which I could be cured by ingesting small sugar pills appropriately labeled". Some of my friends actually leverage this into justification for buying sugar in pharmacies at a ridiculous markup. I confess to being aghast whenever this happens in my presence.
A better name has been suggested: the "control response". This is experiment-specific, and encompasses all of the various mechanisms which make it look like "the control group improves when given a sugar pill / saline solution / sham treatment". Moreover it avoids hinting at mysterious healing powers of the mind.
Meta-analyses of those few studies that were designed to find an actual "placebo effect" (i.e. studies with a non-treatment arm, or studies comparing objective outcome measures for different placebos) have not confirmed it; the few individual studies that find a positive effect are inconclusive for a variety of reasons.
Doubting the existence of the placebo effect will expose you to immediate contradiction from your educated peers. One explanation seems to be that the "placebo effect" is a necessary argumentative prop in the arsenal of two opposed "camps". On the one hand proponents of CAM (Complementary and Alternative Medicine) will argue that "even if a herbal remedy is a placebo, who cares as long as it actually works" and must therefore assume that the placebo effect is real. On the other hand opponents of CAM will say "homeopathy or herbal remedies only seem to work because of the placebo effect, we can therefore dismiss all positive reports from people treating themselves with such".
I don't have a proper list of references yet, but see the following:
Recently a controversy broke out over the replicability of a study John Bargh et al. published in 1996. The study reported that unconsciously priming a stereotype of elderly people caused subjects to walk more slowly. A recent replication attempt by Stephane Doyen et al., published in PLoS ONE, was unable to reproduce the results. (Less publicized, but surely relevant, is another non-replication by Hal Pashler et al.) (source)
This is interesting, if only because the study in question is one of the more famous examples of priming effects - it's the one I tend to use when I introduce people to the idea of priming. (Ironically, the failed replication study also mentions a further experimental manipulation that does show priming effects - affecting the experimenters rather than the subjects.) Bargh's reply is also unusual in that it focuses significantly on extra-scientific arguments, such as attacks on the open access business model of PLoS ONE.
I was instantly reminded of The Golem, which "debunks the view that scientific knowledge is a straightforward outcome of competent theorization, observation, and experimentation". The examples on relativity and solar neutrinos are particularly engaging - it's not just psychology where experimentation is problematic, but all of science.
The linked blog also contributes useful observations of its own, such as the "rhetorical function" of the additional experiment in Doyen's study, how online publication makes a difference in how easily experimental setups can be replicated, or a subtle point about our favorite villain, p-values.
EDIT: added link to source. Heartfelt thanks to the two readers who upvoted the version without the link. :)
Fake explanations don't feel fake. That's what makes them dangerous. -- EY
Let's look at "A Handbook of Software and Systems Engineering", which purports to examine the insights from software engineering that are solidly grounded in empirical evidence. Published by the prestigious Fraunhofer Institut, this book's subtitle is in fact "Empirical Observations, Laws and Theories".
Now "law" is a strong word to use - the highest level to which an explanation can aspire, as it were. Sometimes it's used in a jokey manner, as in "Hofstadter's Law" (which certainly seems often to apply to software projects). But this definitely isn't a jokey kind of book, that much we get from the appeal to "empirical observations" and the "handbook" denomination.
Here is the very first "law" listed in the Handbook:
Requirement deficiencies are the prime source of project failures.
Previously, we observed that in the field of software engineering, a last name followed by a year, surrounded by parentheses, seems to be a magic formula for suspending critical judgment in readers.
Another such formula, it seems, is the invocation of statistical results. Brandish the word "percentage", assert that you have surveyed a largish population, and whatever it is you claim, some people will start believing. Do it often enough and some will start repeating your claim - without bothering to check it - starting a potentially viral cycle.
As a case in point, one of the most often cited pieces of "evidence" in support of the above "law" is the well-known Chaos Report, according to which the first cause of project failure is "Incomplete Requirements". (The Chaos Report isn't cited as evidence by the Handbook, but it's representative enough to serve in the following discussion. A Google Search readily attests to the wide spread of the verbatim claim in the Chaos Report; various derivatives of the claim are harder to track, but easily verified to be quite pervasive.)
Some elementary reasoning about causal inference is enough to show that the same evidence supporting the above "law" can equally well be suggested as evidence supporting this alternative conclusion:
Project failures are the primary source of requirements deficiencies.
Imagine the following situation: you have come across numerous references to a paper purporting to show that the chances of successfully treating a disease contracted at age 10 are substantially lower if the disease is detected later: from somewhat lower at age 20 to very poor at age 50. Every author draws more or less the same bar chart to depict this situation: the picture below, showing rising mortality from left to right.
You search for the original paper, which proves a long quest: the conference publisher has lost some of its archives in several moves, several people citing the paper turn out to no longer have a copy, etc. You finally locate a copy of the paper (let's call it G99) thanks to a helpful friend with great scholarly connections.
And you find out some interesting things.
The most striking is what the author's original chart depicts: the chances of successfully treating the disease detected at age 50 become substantially lower as a function of age when it was contracted; mortality is highest if the disease was contracted at age 10 and lowest if contracted at age 40. The chart showing this is the picture below, showing decreasing mortality from top to bottom, for the same ages on the vertical axis.
Not only is the representation topsy-turvy; the two diagrams can't be about the same thing, since what is constant in the first (age disease detected) is variable in the other, and what is variable in the first (age disease contracted) is constant in the other.
Now, as you research the issue a little more, you find out that authors prior to G99 have often used the first diagram to report their findings; reportedly, several different studies on different populations (dating back to the eighties) have yielded similar results.
But when citing G99, nobody reproduces the actual diagram in G99, they all reproduce the older diagram (or some variant of it).
You are tempted to conclude that the authors citing G99 are citing "from memory"; they are aware of the earlier research, they have a vague recollection that G99 contains results that are not totally at odds with the earlier research. Same difference, they reason, G99 is one more confirmation of the earlier research, which is adequately summarized by the standard diagram.
And then you come across a paper by the same author, but from 10 years earlier. Let's call it G89. There is a strong presumption that the study in G99 is the same as the one described in G89, for the following reasons: a) the researcher who wrote G99 was by then already retired from the institution where they obtained their results; b) the G99 "paper" isn't in fact a paper; it's a PowerPoint summarizing previous results obtained by the author.
And in G89, you read the following: "This study didn't accurately record the mortality rates at various ages after contracting the disease, so we will use average rates summarized from several other studies."
So basically everyone who has been citing G99 has been building castles on sand.
Suppose that, far from some exotic disease affecting a few individuals each year, the disease in question was one of the world's major killers (say, tuberculosis, the world's leader in infectious disease mortality), and the reason why everyone is citing either G99 or some of the earlier research is to lend support to the standard strategies for fighting the disease.
When you look at the earlier research, you find nothing to allay your worries: the earlier studies are described only summarily, in broad overview papers or secondary sources; the numbers don't seem to match up, and so on. In effect you are discovering, about thirty years later, that what was taken for granted as a major finding on one of the principal topics of the discipline in fact has "sloppy academic practice" written all over it.
If this story were true, and this was medicine we were talking about, what would you expect (or at least hope for, if you haven't become too cynical), should this story come to light? In a well-functioning discipline, a wave of retractions, public apologies, general embarrassment and a major re-evaluation of public health policies concerning this disease would follow.
The story is substantially true, but the field isn't medicine: it is software engineering.
I have transposed the story to medicine, temporarily, as an act of benign deception, to which I now confess. My intention was to bring out the structure of this story, and if, while thinking it was about health, you felt outraged at this miscarriage of academic process, you should still feel outraged upon learning that it is in fact about software.
The "disease" isn't some exotic oddity, but the software equivalent of tuberculosis - the cost of fixing defects (a.k.a. bugs).
The original claim was that "defects introduced in early phases cost more to fix the later they are detected". The misquoted chart says this instead: "defects detected in the operations phase (once software is in the field) cost more to fix the earlier they were introduced".
Any result concerning the "disease" of software bugs counts as a major result, because it affects very large fractions of the population, and accounts for a major fraction of the total "morbidity" (i.e. lack of quality, project failure) in the population (of software programs).
The earlier article by the same author contained the following confession: "This study didn't accurately record the engineering times to fix the defects, so we will use average times summarized from several other studies to weight the defect origins".
Software engineering is a diseased discipline.
The publication I've labeled "G99" is generally cited as: Robert B. Grady, An Economic Release Decision Model: Insights into Software Project Management, in proceedings of Applications of Software Measurement (1999). The second diagram is from a photograph of a hard copy of the proceedings.
Here is one typical publication citing Grady 1999, from which the first diagram is extracted. You can find many more via a Google search. The "this study didn't accurately record" quote is discussed here, and can be found in "Dissecting Software Failures" by Grady, in the April 1989 issue of the "Hewlett Packard Journal"; you can still find one copy of the original source on the Web, as of early 2013, but link rot is threatening it with extinction.
A more extensive analysis of the "defect cost increase" claim is available in my book-in-progress, "The Leprechauns of Software Engineering".
Here is how the axes were originally labeled; first diagram:
- vertical: "Relative Cost to Correct a Defect"
- horizontal: "Development Phase" (values "Requirements", "Design", "Code", "Test", "Operation" from left to right)
- figure label: "Relative cost to correct a requirement defect depending on when it is discovered"
second diagram:
- vertical: "Activity When Defect was Created" (values "Specifications", "Design", "Code", "Test" from top to bottom)
- horizontal: "Relative cost to fix a defect after release to customers compared to the cost of fixing it shortly after it was created"
- figure label: "Relative Costs to Fix Defects"
The Less Wrong Public Goods Team has already brought you an easy-to-use virtual machine for hacking Less Wrong.
But virtual boxes can cut both ways: on the one hand, you don't have to worry about setting things up yourself; on the other hand, not knowing how things were put together, having to deal with a "black box" that doesn't let you use your own source code editor or pick an OS - these can be off-putting. To me at least, these were trivial inconveniences that might stand in the way of updating my copy of the source and making some useful tweaks.
Enter Vagrant - and a little work I've done today for LW hackers and would-be hackers. Vagrant is a recent tool that allows you to treat virtual machine configurations as source code.
Instead of being something that someone possessed of arcane knowledge has put together, a virtual machine under Vagrant results from executing a series of source code instructions - and this source code is available for you to read, review, understand or change. (Software development should be a process of knowledge capture, not some hermetic discipline where you rely on the intransmissible wisdom of remote elders.)
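By way of illustration, here is a minimal sketch of what a Vagrantfile can look like (Vagrant 1.0-era syntax; the box name and provisioning script are hypothetical, not the actual configuration in my fork):

```ruby
# A minimal Vagrantfile sketch -- hypothetical box name and script,
# not the actual lesswrong configuration.
Vagrant::Config.run do |config|
  # Which base image to build the VM from:
  config.vm.box = "lucid32"
  # Make the guest's web server reachable from the host:
  config.vm.forward_port 8080, 8080
  # The setup steps, captured as an executable script rather than
  # as arcane knowledge in someone's head:
  config.vm.provision :shell, :path => "bootstrap.sh"
end
```

Everything the VM becomes is spelled out in files like these, which is exactly the "knowledge capture" property I'm after.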
Preliminary (but tested) results are up on my Github repo - it's a fork of the official LW code base, not the real thing. (Once this is tested by someone else, and if it works well, I intend to submit a pull request so that these improvements end up in the main codebase.) The following assumes you have a Unix or Mac system, or if you're using Windows, that you're command-line competent.
Hacking on LW is now done as follows (compared to using the VM):
- The following prerequisites are unchanged: git, Virtualbox
- Install the following prerequisites: Ruby, rubygems, Vagrant
- Download the Less Wrong source code as follows: git clone git@github.com:Morendil/lesswrong.git
- Enter the "lesswrong" directory, then build the VM with: vagrant up (may take a while)
- Log into the virtual box with: vagrant ssh
- Go to the "/vagrant/r2" directory, and copy example.ini to development.ini
- Change all instances of "password" in development.ini to "reddit"
- You can now start the LW server with: paster serve --reload development.ini port=8080
- Browse the URL http://localhost:8080/