As humans we rely heavily on intuition and on our personal mental models to make many millions of subconscious decisions, and a much smaller number of conscious decisions, every day. All these decisions involve interpretations of our prior experience and the sensory input we receive. It is only in hindsight that we can realise our mistakes. Learning from mistakes involves updating our mental models, and we need to get better at it, not only personally but also as a society.
Whilst we will continue to interact heavily with humans, we increasingly interact with the web – and all our interactions are subject to the well-known problems of communication. One of the more profound characteristics of ultra-large-scale systems is the way in which the impact of unintended or unforeseen behaviours propagates through the system.
The most familiar example is that of software viruses, which have spawned an entire industry. Just as in biology, viruses will never completely go away. It is an ongoing fight of empirical knowledge against undesirable pathogens that is unlikely ever to end, because both opponents evolve their knowledge after each new encounter, based on the experience gained.
Similar to viruses, there are many other unintended or unforeseen behaviours that propagate through ultra-large-scale systems. Only on some occasions do these behaviours result in immediate outages or misbehaviours that are easily observable by humans.
Sometimes it can take hours, weeks, or months for downstream effects to accumulate until some component generates an explicit error and a human observer is alerted. In many cases it is not possible to track down the root cause or causes, and the so-called fix consists in correcting the visible part of the downstream damage.
Take the recent tsunami and the destroyed nuclear reactors in Japan. How far is it humanly and economically possible to fix the root causes? Globally, many nuclear reactor designs have weaknesses. What trade-off between risk levels (also including a contingency for risks that no one is currently aware of) and the cost of electricity are we prepared to make?
Addressing local sources of events that lead to easily and immediately observable error conditions is a drop in the bucket compared with the potential sources of serious errors. Yet this is the usual limit of the scope that organisations apply to quality assurance, disaster recovery, etc.
The difference between the web and a living system is fading, and our understanding of the system is limited, to say the least. A sensible approach to failures and system errors is increasingly comparable to the one used in medicine to fight diseases – the process of finding out what helps is empirical, and all new treatments are tested for unintended side-effects over an extended period of time. Still, all the tests lead only to statistical data and interpretations, not to absolute guarantees. In the life sciences no honest scientist can claim to be in full control. In fact, no one is in full control, and it is clear that no one will ever be in full control.
Traditional management practices strive to avoid any semblance of “not being in full control”. Organisations that are ready to admit that they operate within the context of an ultra-large-scale system have a choice between:
- conceding they have lost control internally, because their internal systems are so complex, or
- regaining a degree of internal understandability by simplifying internal structures and systems, enabled by shifting to the use of external web services – which also does not establish full control.
Conceding the unavoidable loss of control, or being prepared to pay heavily for effective risk reduction measures (one or two orders of magnitude in cost), amounts to political suicide in most organisations.