They Write the Right Stuff is about software which "never crashes. It never needs to be re-booted. This software is bug-free. It is perfect, as perfect as human beings have achieved. Consider these stats: the last three versions of the program -- each 420,000 lines long -- had just one error each. The last 11 versions of this software had a total of 17 errors. Commercial programs of equivalent complexity would have 5,000 errors."
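Just to make those numbers concrete, here's the back-of-the-envelope defect-density comparison implied by the quote (nothing here beyond the figures quoted above):

```python
# A back-of-the-envelope comparison using only the figures quoted above.

LINES = 420_000            # size of each version of the program
shuttle_errors = 1         # "just one error each" in the last three versions
commercial_errors = 5_000  # "commercial programs of equivalent complexity"

def per_kloc(errors: int) -> float:
    """Defects per thousand lines of code."""
    return errors / (LINES / 1000)

print(f"Shuttle:    {per_kloc(shuttle_errors):.4f} defects per KLOC")     # ~0.0024
print(f"Commercial: {per_kloc(commercial_errors):.1f} defects per KLOC")  # ~11.9
print(f"Ratio:      {commercial_errors // shuttle_errors:,}x")            # 5,000x
```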
The programmers work from 8 to 5, with occasional late nights. They wear dressy clothes, not flashy or grungy. I assume there's a dress code, but I have no idea whether conventional clothes are actually an important part of the process. I'm sure that working reasonable numbers of hours is crucial, though I also wonder whether those hours need to be standard office hours.
"And the culture is equally intolerant of creativity, the individual coding flourishes and styles that are the signature of the all-night software world. "People ask, doesn't this process stifle creativity? You have to do exactly what the manual says, and you've got someone looking over your shoulder," says Keller. "The answer is, yes, the process does stifle creativity." " I have no idea what's in the manual, or if there can be a manual for something as new as self-optimizing AI. I assume there could be a manual for some aspects.
What follows are the main points quoted from the article:
1. The important thing is the process: The product is only as good as the plan for the product. About one-third of the process of writing software happens before anyone writes a line of code.
2. The best teamwork is a healthy rivalry. The central group breaks down into two key teams: the coders -- the people who sit and write code -- and the verifiers -- the people who try to find flaws in the code. The two outfits report to separate bosses and function under opposing marching orders. The development group is supposed to deliver completely error-free code, so perfect that the testers find no flaws at all. The testing group is supposed to pummel away at the code with flight scenarios and simulations that reveal as many flaws as possible. The result is what Tom Peterson calls "a friendly adversarial relationship."
I note that it's rivalry between people who are doing different things, not people competing to get control of a project.
3. The database is the software base.
The group maintains two key databases. One is the history of the code itself -- with every line annotated, showing every time it was changed, why it was changed, when it was changed, what the purpose of the change was, and what specification documents detail the change. Everything that happens to the program is recorded in its master history. The genealogy of every line of code -- the reason it is the way it is -- is instantly available to everyone.
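I have no idea what the group's actual tooling looks like, but conceptually each entry in that master history is just a structured record. A hypothetical sketch (the field names are mine, not NASA's):

```python
from dataclasses import dataclass, field
from datetime import date

# Hypothetical sketch of one entry in the code-history database.
# The article only says each change records when, why, for what purpose,
# and which specification documents drove it; everything else is my guess.
@dataclass
class ChangeRecord:
    file: str                   # source file the change touched
    line: int                   # line of code whose genealogy this records
    changed_on: date            # when it was changed
    reason: str                 # why it was changed
    purpose: str                # what the change was meant to accomplish
    spec_documents: list[str] = field(default_factory=list)  # specs detailing the change

# "The genealogy of every line" is then just every ChangeRecord for that
# line, in chronological order:
def genealogy(history: list[ChangeRecord], file: str, line: int) -> list[ChangeRecord]:
    return sorted(
        (r for r in history if r.file == file and r.line == line),
        key=lambda r: r.changed_on,
    )
```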
The other database -- the error database -- stands as a kind of monument to the way the on-board shuttle group goes about its work. Here is recorded every single error ever made while writing or working on the software, going back almost 20 years. For every one of those errors, the database records when the error was discovered; what set of commands revealed the error; who discovered it; what activity was going on when it was discovered -- testing, training, or flight. It tracks how the error was introduced into the program; how the error managed to slip past the filters set up at every stage to catch errors -- why wasn't it caught during design? during development inspections? during verification? Finally, the database records how the error was corrected, and whether similar errors might have slipped through the same holes.
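Likewise, here's a guess at what a single error record might contain -- one field per question the article says gets answered. Again, the names and types are my own invention:

```python
from dataclasses import dataclass, field
from datetime import date
from enum import Enum

# Hypothetical sketch of an entry in the error database; not NASA's real schema.
class Activity(Enum):
    TESTING = "testing"
    TRAINING = "training"
    FLIGHT = "flight"

@dataclass
class ErrorRecord:
    discovered_on: date          # when the error was discovered
    revealing_commands: str      # what set of commands revealed it
    discovered_by: str           # who discovered it
    activity: Activity           # what was going on at the time
    how_introduced: str          # how the error got into the program
    missed_in_design: str        # why it wasn't caught during design
    missed_in_inspection: str    # why development inspections missed it
    missed_in_verification: str  # why verification missed it
    correction: str              # how the error was corrected
    similar_holes: list[str] = field(default_factory=list)  # other errors that might
                                                            # slip through the same way
```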
The group has so much data accumulated about how it does its work that it has written software programs that model the code-writing process. Like computer models predicting the weather, the coding models predict how many errors the group should make in writing each new version of the software. True to form, if the coders and testers find too few errors, everyone works the process until reality and the predictions match.
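The article doesn't say how those models work. As a deliberately naive stand-in, you could predict the defect count for a new release from the historical rate per changed line, and treat a large gap between prediction and reality as a signal to re-examine the process; the numbers below are made up for illustration:

```python
# Naive stand-in for the group's error-prediction models, not their real method.

def predicted_errors(past_versions: list[tuple[int, int]], lines_changed: int) -> float:
    """past_versions: (lines_changed, errors_found) for earlier releases."""
    total_lines = sum(lines for lines, _ in past_versions)
    total_errors = sum(errors for _, errors in past_versions)
    rate = total_errors / total_lines      # historical defects per changed line
    return rate * lines_changed

history = [(12_000, 4), (9_500, 3), (15_000, 5)]   # made-up numbers
expected = predicted_errors(history, lines_changed=11_000)
print(f"Expected defects in this release: {expected:.1f}")

# "If the coders and testers find too few errors, everyone works the process
# until reality and the predictions match" -- i.e. a large gap between
# `expected` and the observed count is itself treated as a red flag.
```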
4. Don't just fix the mistakes -- fix whatever permitted the mistake in the first place.
The process is so pervasive that it gets the blame for any error -- if there is a flaw in the software, there must be something wrong with the way it's being written, something that can be corrected. Any error not found at the planning stage has slipped through at least some checks. Why? Is there something wrong with the inspection process? Does a question need to be added to a checklist?
Importantly, the group avoids blaming people for errors. The process assumes blame -- and it's the process that is analyzed to discover why and how an error got through. At the same time, accountability is a team concept: no one person is ever solely responsible for writing or inspecting code. "You don't get punished for making errors," says Marjorie Seiter, a senior member of the technical staff. "If I make a mistake, and others reviewed my work, then I'm not alone. I'm not being blamed for this."
This is incorrect in an interesting way.
There's a famous story, among people who study Apollo history, of the 1201 and 1202 program alarms that occurred during Apollo 11, as described here and here. Those links are short and well worth reading in their entirety, but here's a summary:
Apollo 11's guidance computer had incredibly limited hardware by modern standards. When you read its specs, if you know anything about computers, you will not believe that people's lives were trusted to something so primitive. As Neil and Buzz were performing their powered descent to the Moon, the guidance computer started emitting obscure "1201" and "1202" program alarms that they had never seen before. Significant computer problems at that stage, hovering over the Moon with only minutes of fuel to spare, would normally mean the astronauts should abort and return to orbit rather than attempt a landing and risk crashing because of broken software. The program experts quickly determined that the alarms were ignorable, and the mission proceeded. As it turned out, the astronauts had been incorrectly trained to leave a switch on, which fed the computer radar data it shouldn't have been getting (and because the switch wasn't connected to a real computer during training, nobody noticed). This overloaded the computer, which had too much data to process given its hard real-time constraints. Then it did something that would be amazing in this era, much less in 1969:
it restarted itself and kept going. This auto-restart ability, combined with prioritization, allowed the computer to (literally) reboot every 10 seconds while continuing to handle the tasks whose failure would kill the astronauts and dropping the less important ones.
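Here's a toy sketch of that idea -- a fixed pool of job slots, and a restart that sheds everything except the highest-priority work. It illustrates the concept only; it is not the actual Apollo executive or restart logic:

```python
import heapq

MAX_JOBS = 7  # the real AGC had a small, fixed pool of job slots; 7 is arbitrary here

class Executive:
    """Toy priority scheduler that restarts and sheds low-priority work on overload."""

    def __init__(self) -> None:
        self.jobs: list[tuple[int, str]] = []   # min-heap of (priority, job name)

    def schedule(self, priority: int, name: str) -> None:
        if len(self.jobs) >= MAX_JOBS:
            self.bailout()                      # overload: the "1201/1202"-style situation
        heapq.heappush(self.jobs, (priority, name))

    def bailout(self) -> None:
        # Restart: throw away low-priority work, keep what the landing depends on.
        print("PROG ALARM: job slots full, restarting")
        self.jobs = [job for job in self.jobs if job[0] == 0]  # priority 0 = must not drop
        heapq.heapify(self.jobs)

# Guidance and engine control keep getting rescheduled at priority 0;
# the spurious radar processing (priority 2) is what gets shed on each restart.
executive = Executive()
for _ in range(5):
    executive.schedule(0, "guidance")
    executive.schedule(0, "engine throttle")
    executive.schedule(2, "rendezvous radar data")
```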
The thing about space software is that it's enormously, insanely expensive in real terms (i.e. it requires lots of time from lots of skilled people). Ordinary software (desktop, server, phone, console, you name it) is cheaper, bigger, and evolves more rapidly. It's also buggier, but its bugs typically don't kill people and rarely cost a billion dollars. NASA has done things wrong, but their approach to software is perfectly suited to their requirements.
This wasn't actually a failure of the computer's software. It was a failure of procedure development. It also suggests that training should have used a high-fidelity simulation, which would have found this problem on the ground right away. So it's maybe a testing failure, but even then not a failure of software testing -- it was a failure to test the entire landing system (hardware, software, and human procedures together).