Fatal software errors

Many large software-intensive organisations are currently in the process of replacing so-called legacy software systems. These applications and components typically involve several million lines of code that are between 5 and 40 years old. The following observations are helpful to understand the potential impact of software errors, the scale of the work, and the risks involved.

In case you are waiting for a banking transaction to come through, consider this anecdote from a random collection of software errors:

Rumor has it that, when they shut down the IBM 7094 at MIT in 1973, they found a low-priority process that had been submitted in 1967 and had not yet been run.

Other examples from the same collection of errors are less funny and some are deadly:

Computer blunders were blamed for $650M student loan losses. From ACM SIGSOFT Software Engineering Notes, vol. 20, no. 3.

The Korean Airlines KAL 801 accident in Guam killed 225 of the 254 people aboard. A design problem was discovered in the barometric altimetry of the Ground Proximity Warning System (GPWS). From ACM SIGSOFT Software Engineering Notes, vol. 23, no. 1.

From the End User License Agreement of a typical software platform:

The software product may contain support for programs written in Java. Java technology is not fault tolerant and is not designed, manufactured, or intended for use or resale as on-line control equipment in hazardous environments requiring fail-safe performance, such as in the operation of nuclear facilities, aircraft navigation or communication systems, air traffic control, direct life support machines, or weapon systems, in which the failure of Java technology could lead directly to death, personal injury, or severe physical or environmental damage.

If you believe real-time capable systems are intrinsically safer, consider that these systems are still coded by humans, in notations just as arcane as Java, and are subject to the usual human error rate. A detailed case study from the field of medical software:

On March 21, 1986, an oilfield worker named Ray Cox was being irradiated for the ninth time at the East Texas Cancer Center in Tyler, Texas, for a tumour that had been removed from his back…

Inside the treatment room Cox was hit with a powerful shock. He knew from previous treatments this was not supposed to happen. He tried to get up. Not seeing or hearing him because of the broken communications between the rooms, the technician pushed the “p” key, meaning “proceed.” Cox was hit again. The treatment finally stopped when Cox stumbled to the door of the room and beat it with his fists…

Cox’s injury was similar to Jane Yarborough’s — a dime-sized dose of 15,000 to 16,000 rads. He was sent home but returned to the hospital a few weeks later spitting blood: the doctors diagnosed radiation overexposure. It later paralysed his left arm, both legs, his left vocal cord, and his diaphragm. He died nearly five months later…

The Therac-25’s software program, relatively crude by today’s standards, probably contained 101,000 lines of code.

A more recent example, described below, illustrates that the situation is not improving over the years. If anything, the Therac-25 software was probably subjected to much more rigorous testing than most corporate business systems ever have been.

Large software systems consist of more than 10,000,000 lines of code. Integrating systems beyond company boundaries via web services is becoming increasingly common, leading to ultra-large-scale systems involving billions of lines of code, and to a quasi-infinite number of usage scenarios that are never tested. As I am writing this post, a perfect example of the brittleness of deep web service supply chains in an ultra-large-scale system comes along:

Scores of well-known websites have been unavailable for large parts of Thursday because of problems with Amazon’s web hosting service.

If a single line of code contains one decision, 10 million lines of code contain 10 million decisions. Rewriting 10 million lines of legacy software may resolve all existing errors, in the most optimistic scenario. At the same time it is an opportunity to introduce around 10 million new decisions that can potentially be wrong. Given that 1 unintended error per 500 lines of code is considered pretty good, that’s effectively a guarantee of 20,000 new sources of errors in a haystack of 10 million lines. Happy error hunting…
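As a back-of-the-envelope sketch of that arithmetic (a minimal Python example, using only the figures quoted in this post):

    # Rough estimate of latent defects introduced by a full rewrite,
    # using the figures quoted above: 10 million lines of code and
    # one unintended error per 500 lines ("pretty good" by industry standards).

    lines_of_code = 10_000_000
    defects_per_line = 1 / 500

    expected_new_defects = lines_of_code * defects_per_line
    print(f"Expected new defects: {expected_new_defects:,.0f}")  # 20,000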

The sources of errors located in the most frequently executed lines of code are likely to be detected relatively quickly; the rest will remain as sleepers until the big day arrives. Yes, the error rate can theoretically be reduced to a very acceptable 1 error per 250,000 lines of code (500 times better) through the use of NASA-style quality assurance, but at more than 150 times the cost (around $1,000 per line of code); any commercial enterprise attempting to emulate that approach would go broke in no time.
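To put that comparison in concrete terms, here is a minimal sketch in Python. The defect rates and the $1,000 per line figure are the ones quoted above; the commercial baseline cost of roughly $6.67 per line is not stated in the post, it is simply the value implied by "$1,000 per line" being "more than 150 times the cost":

    # Comparing the two quality-assurance regimes for a 10-million-line rewrite.
    # Defect rates and the $1,000/line figure are quoted above; the commercial
    # baseline cost (~$6.67/line) is only implied by "more than 150 times the cost".

    lines_of_code = 10_000_000

    regimes = {
        "commercial": {"defect_rate": 1 / 500, "cost_per_line": 1_000 / 150},
        "NASA-style": {"defect_rate": 1 / 250_000, "cost_per_line": 1_000},
    }

    for name, regime in regimes.items():
        defects = lines_of_code * regime["defect_rate"]
        cost = lines_of_code * regime["cost_per_line"]
        print(f"{name}: ~{defects:,.0f} latent defects at a cost of ~${cost:,.0f}")

    # Prints roughly: 20,000 defects / $67M for the commercial regime,
    # versus 40 defects / $10B with NASA-style quality assurance.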

Unless software systems are constructed from the ground up based on a radically different set of principles and notations, the quality of software will not improve in any substantial way. The problem with state-of-the-art software is that its readability and understandability by humans decrease very quickly over time, due to poor notations and inadequate mechanisms for modularising specifications. But as highlighted above, the complexity of our systems is steadily going up, and so are the high-impact risks.

Even if all known errors in a piece of software are fixed, no one is able to verify how many new errors are introduced with these fixes. Errors in new software are unavoidable as long as humans are involved in the software creation process.

The opportunity to improve software quality lies in the introduction of techniques that allow software specifications to be simplified (modularised) and made easier to understand (improved notation), both as part of fixing errors and as part of gaining new insights about the system by observing and analysing its run-time behaviour.

In other words, progress will only be made if the downward spiral of software understandability over time can be reversed. Linear representations in the form of traditional program code and natural language (documentation) will certainly never be up to the job. Radically different representations of knowledge are required to allow a software rewrite to be conducted incrementally, in tiny steps, without introducing thousands of new sources of errors.
