Drift into failure, p.20
A decade and a half before, Feynman had discovered a similarly ambiguous slide about Challenger. In his case, the bullets had declared that the eroding seal in the field joints was "most critical" for flight safety, yet that "analysis of existing data indicates that it is safe to continue flying the existing design."51 The accident proved that it was not. Solid Rocket Boosters (or SRBs or SRMs) that help the Space Shuttle out of the earth's atmosphere are segmented, which makes ground transportation easier and has some other advantages. A problem that was discovered early in the Shuttle's operation, however, was that the solid rockets did not always properly seal at these segments, and that hot gases could leak through the rubber O-rings in the seal, called blow-by. This eventually led to the explosion of Challenger in 1986. The pre-accident slide picked out by Feynman had declared that while the lack of a secondary seal in a joint (of the solid rocket motor) was "most critical," it was still "safe to continue flying." At the same time, efforts needed to be "accelerated" to eliminate SRM seal erosion. During Columbia as well as Challenger, slides were not just used to support technical and operational decisions that led up to the accidents. Even during both post-accident investigations, slides with bulletized presentations were offered as substitutes for technical analysis and data, causing the Columbia Accident Investigation Board, similar to Feynman years before, to conclude that: "The Board views the endemic use of PowerPoint briefing slides instead of technical papers as an illustration of the problematic methods of technical communication at NASA."52
The overuse of bullets and slides illustrates the problem of information environments and how studying them can help us understand something about the creation of local rationality in organizational decision-making. NASA's bulletization shows how organizational decision-makers are configured in an impoverished information environment. That which decision-makers can know is generated by other people, and gets distorted during transmission through a reductionist, abbreviated medium. The narrowness and incompleteness of the environment in which decision-makers find themselves can come across as disquieting to retrospective observers, including people inside and outside the organization. It was after the Columbia accident that the Mission Management Team "admitted that the analysis used to continue flying was, in a word, 'lousy.' This admission – that the rationale to fly was rubber-stamped – is, to say the least, unsettling."53
Unsettling it may be, and probably is – in hindsight. But from the inside, people in organizations do not spend a professional life making "unsettling" decisions. Rather, they do mostly normal work. Again, how can a manager see a "lousy" process to evaluate flight safety as normal, as not something that is worth reporting or repairing? How could this process be normal? As Vaughan did with Challenger, the Columbia Accident Investigation Board found clues to answers in pressures of scarcity and competition: "The Flight Readiness process is supposed to be shielded from outside influence, and is viewed as both rigorous and systematic. Yet the Shuttle Program is inevitably influenced by external factors, including, in the case of STS-107, schedule demands. Collectively, such factors shape how the Program establishes mission schedules and sets budget priorities, which affects safety oversight, workforce levels, facility maintenance, and contractor workloads. Ultimately, external expectations and pressures impact even data collection, trend analysis, information development, and the reporting and disposition of anomalies. These realities contradict NASA's optimistic belief that pre-flight reviews provide true safeguards against unacceptable hazards."54
Studying information environments, how they are created, sustained, and rationalized, and in turn how they help support and rationalize complex and risky decisions, is one route to understanding the small incremental steps that an organization makes towards its margins. Managing the information environment, of course, is not something that can be done with a priori decisions about what is important and what is not, because that simply displaces the problem to an environment prior to the one that needs to be influenced. Also, a priori knowledge of what is important is very difficult to establish in complex systems with unruly technology. The very nature of these systems, and of the technology they operate, makes predictions about what is going to fail and when virtually impossible. This is why high reliability theory recommends that decision-makers remain complexly sensitized: that they live in an information environment that is full of inputs from all kinds of sides and angles. Yet this can create a signal-to-noise ratio problem for the decision-maker. And it can once again encourage a tendency to oversimplify, to categorize, to bulletize.
Recall that high reliability theory also encourages decision-makers to defer to expertise and to take minority opinion seriously. This should enrich their information environment too. But even this does not necessarily help. As indicated above, perhaps there is no such thing as "rigorous and systematic" decision-making based on technical expertise alone. Expectations and pressures, budget priorities and schedules, contractor workloads, employee qualifications and workforce levels all impact technical decision-making. All these factors determine and constrain what will be seen as possible and rational courses of action at the time, even by experts. Although the intention was that NASA's flight safety evaluations be shielded from external pressures (turning it into a model closed system, as per the high-reliability recommendation), these pressures nonetheless seeped into even the collection of data, analysis of trends and reporting of anomalies. The information environments thus created for decision-makers were continuously and insidiously tainted by pressures of production and scarcity (and in which organization are they not?), pre-rationally influencing the way people saw the world. Yet even this "lousy" process was considered "normal" – normal or inevitable enough, in any case, to not warrant expending energy and political capital on trying to change it. Drift into failure was the result.
Control Theory and Drift
A family of ideas that approaches the problem of drift from another angle than the social-organizational one is control theory. Control theory looks at adverse events as emerging from interactions among system components. It usually does not identify single causal factors, but rather looks at what may have gone wrong with the system's operation or organization of the hazardous technology that allowed an accident to take place. Safety, or risk management, is viewed as a control problem, and adverse events happen when component failures, external disruptions or interactions between layers and components are not adequately handled; when safety constraints that should have applied to the design and operation of the technology have loosened, or become badly monitored, managed, controlled. Control theory tries to capture these imperfect processes, which involve people, societal and organizational structures, engineering activities, and physical parts. It sees the complex interactions between those as eventually resulting in an accident.
Control theory sees the operation of hazardous processes as a matter of keeping many interrelated components in a state of dynamic equilibrium. This means that control inputs, even if small, are continually necessary for the system to stay safe: like a bicycle, it cannot be left on its own, or it would lose balance and collapse. A dynamically stable system is kept in equilibrium through the use of feedback loops of information and control. Adverse events are not seen as the result of an initiating event or root cause that triggers a linear series of events. Instead, adverse events result from interactions among components that violate the safety constraints on system design and operation. Feedback and control inputs can grow increasingly at odds with the real problem or processes to be controlled. Concern with those control processes (how they evolve, adapt and erode) lies at the heart of control theory as applied to organizational safety.
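The bicycle analogy can be made concrete with a toy simulation. The sketch below is only an illustration under stated assumptions: the function name `simulate_deviation`, the growth factor, and the feedback gain are hypothetical and are not drawn from the text. It shows a deviation that grows on its own being held near zero only for as long as a feedback loop keeps correcting it.

```python
def simulate_deviation(gain, steps=50, x0=0.01, growth=1.2):
    """Track a deviation from the safe state over a number of time steps.

    Assumptions (illustrative only, not from the source text):
    - left alone, the deviation grows by the factor `growth` each step;
    - the controller measures the deviation and pushes back with
      strength `gain` (simple proportional feedback).
    """
    x = x0
    for _ in range(steps):
        control_input = -gain * x           # feedback: observe, then correct
        x = growth * x + control_input      # inherent instability plus correction
    return x


# The "bicycle left on its own": no control inputs at all.
uncontrolled = simulate_deviation(gain=0.0)

# Continual small corrections through a working feedback loop.
controlled = simulate_deviation(gain=0.5)
```

With these assumed numbers, the uncontrolled deviation grows by several orders of magnitude, while the controlled one decays toward zero. The point of the sketch is only that stability here is a product of ongoing control, not a property of the system at rest.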
Control theory says that the potential for failure builds because deviations from the system's original design assumptions become increasingly rationalized and accepted. This is consistent with Vaughan's and Snook's descriptions of the social systems in which decision-making occurs and local actions and rationalities develop. Adaptations occur, adjustments get made, and constraints get loosened in response to local concerns with limited time-horizons. They are all based on uncertain, incomplete knowledge. Just like practical drift and structural secrecy, this can engender and sustain erroneous expectations of users or system components about the behavior of others in the system.
A changed or degraded control structure eventually leads to adverse events. In control-theoretic terms, degradation of the safety-control structure over time can be due to asynchronous evolution, where one part of a system changes without the related necessary changes in other parts. Changes to subsystems may have been carefully planned and executed in isolation, but consideration of their effects on other parts of the system, including the role they play in overall safety control, may remain neglected or inadequate. Asynchronous evolution can occur, too, when one part of a properly designed system deteriorates independent of other parts.
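Asynchronous evolution can be illustrated with a minimal sketch, assuming a deliberately simple model in which a controller is tuned for one version of a process and is then left unchanged while the process itself changes. The function name `run_process`, the growth factors, and the gain are hypothetical values chosen for illustration; none of them come from the text.

```python
def run_process(growth_by_step, controller_gain, x0=0.01):
    """Return the final deviation after applying a fixed controller
    to a process whose inherent growth factor may change over time.

    Assumption (illustrative only): each step the deviation is multiplied
    by the current growth factor and reduced by controller_gain * deviation.
    """
    x = x0
    for growth in growth_by_step:
        x = growth * x - controller_gain * x
    return x


# Process as originally designed: the controller (gain 0.5) was tuned for it.
as_designed = [1.2] * 70

# One part changes after 10 steps (growth 1.2 -> 1.6); the controller does not.
process_changed = [1.2] * 10 + [1.6] * 60

matched = run_process(as_designed, controller_gain=0.5)
asynchronous = run_process(process_changed, controller_gain=0.5)
```

In this toy model the matched system damps the deviation away, while the system whose process evolved without a corresponding change to its controller ends with a deviation larger than the one it started with. Each part may look reasonable in isolation; the mismatch between them is what erodes control.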
The more complex a system (and, by extension, the more complex its control structure), the more difficult it can become to map out the reverberations of changes (even carefully considered ones) throughout the rest of the system. Control theory embraces a more complex idea of causation than the energy-to-be-contained models discussed above (see also Chapter 8). Small changes somewhere in the system, or small variations in the initial state of a process, can lead to large consequences elsewhere.
Control theory helps in the design of control and safety systems (particularly software-based) for hazardous industrial or other processes.55 When applied to organizational safety, control theory is concerned with how an erosion of a control structure allows a migration of organizational activities towards the boundary of acceptable safety performance.
Leveson and her colleagues applied control theory to the analysis of a water contamination incident that occurred in May 2000 in the town of Walkerton, Ontario, Canada.56 The contaminants E. coli and Campylobacter entered the water system through a well of the Walkerton municipality, which operated the system through its Walkerton Public Utilities Commission (WPUC). Leveson's control-theoretic approach showed how the incident flowed from a steady (and rationalized, normalized) erosion of the control structure that had been put in place to guarantee water quality.
The proximate events were as follows. In May 2000, the water system was supplied by three groundwater sources: wells 5, 6, and 7. The water pumped from each well was treated with chlorine before entering the distribution system. The source of the contamination was manure that had been spread on a farm near well 5. Unusually heavy rains from May 8 to May 12 carried the bacteria to the well. Between May 13 and 15, a WPUC employee checked well 5 but did not take measurements of chlorine residuals, although daily checks were supposed to be made. Well 5 was turned off on May 15 and well 7 was turned on. A new chlorinator, however, had not been installed on well 7 and the well was therefore pumping unchlorinated water directly into the distribution system. The WPUC employee did not turn off the well, but instead allowed it to operate without chlorination until noon on Friday May 19, when the new chlorinator was installed.
On May 15, samples from the Walkerton water distribution system were sent to a laboratory for testing according to the normal procedure. Two days later, the laboratory advised WPUC that samples from May 15 tested positive for E. coli and other bacteria. On May 18, the first symptoms of illness appeared in the community. Public inquiries about the water prompted assurances by WPUC that the water was safe. The next day, the outbreak had grown, and a physician contacted the local health unit with a suspicion that she was seeing patients with symptoms of E. coli.
In response to the lab results, WPUC started to flush and superchlorinate the system to try to destroy any contaminants in the water. The chlorine residuals began to recover. WPUC did not disclose the lab results. They continued to flush and superchlorinate the water through the following weekend, successfully increasing the chlorine residuals. Ironically, it was not the operation of well 7 without a chlorinator that caused the contamination; the contamination instead entered the system through well 5 from May 12 until it was shut down on May 15.
Without waiting for more samples, the community issued a boil water advisory on May 21. About half of Walkerton's residents became aware of the advisory on May 21, with some members of the public still drinking the Walkerton town water as late as May 23. Seven people died and more than 2,300 became ill.
The proximate events could be modeled using a sequence-of-events approach, which would point to the various errors and violations and shortcomings in the system's layers of defense. But Leveson and colleagues decided to model the Ontario water quality safety control structure and show how it eroded over time, allowing the contamination to take place. The safety control structure was intended to prevent exposure of the public to contaminated water, first by removing contaminants, second by public health measures that would prevent consumption of contaminated water (see Figure 5.1).
Figure 5.1 Control structure as originally envisioned to guarantee water quality in Walkerton (Leveson, Daouk et al. 2003)
In Ontario, decisions had been taken to remove various water safety controls, or to reduce their enforcement, without an assessment of the risks. Among the important features that disappeared were feedback loops. As the other controls weakened or disappeared over time, the entire socio-technical system moved to a state where a small change in the operation of the system or in the environment (in this case, unusually heavy rain) could lead to a tragedy.
Well 5 had been vulnerable to contamination to begin with. It was shallow, in an area open to farm runoff, and perched on top of bedrock with only a thin layer of top soil around it. No extra approval for the well had been necessary, however, and it was connected to the municipal system as a matter of routine. No program or policy was in place to review existing wells to determine whether they met requirements or needed continuous monitoring.
Figure 5.2 Safety control structure at the time of the water contamination incident at Walkerton. Controls had been loosened, feedback loops had disappeared. The control structure had become hollowed out relative to its original design intentions (Leveson, Daouk et al. 2003)
A number of factors led to erosion of the control structure. These included objections to the taste of chlorine in drinking water, WPUC employees who could safely consume untreated water from the wells, a lack of certification for water system operators, inexperience with water quality processes, and a focus on financial strains on WPUC. A lack of government policy on land use and watershed exposed this increasingly brittle structure to water heavily contaminated by hog and cattle farming. Budget and staff reductions by a new conservative government took a toll on environmental programs and agencies. A Water Sewage Services Improvement Act was passed in 1996, which shut down the government-run testing laboratories, delegated control of provincially owned water and sewage plants to municipalities, eliminated funding for municipal water utilities, and ended the provincial drinking water surveillance program. Farm operators were from then on to be treated with understanding if they were found in violation of livestock and waste-water regulations. No criteria were established to ensure the quality of testing or the qualifications or experience of private lab personnel, and no provisions were made for licensing, inspection, or auditing of private labs by the government. The resulting control structure was a hollowed-out version of its former self. It had become brittle, and vulnerable to an unusual perturbation (like massive rainfall), lacking the resilience or redundancies to stop the problem or recover quickly from it.
Control theory does not see an organization as a static design of components or layers. It readily accepts that a system is more than the sum of its constituent elements. Instead, it sees an organization as a set of constantly changing and adaptive processes focused on achieving the organization's multiple goals and adapting around its multiple constraints. The relevant units of analysis in control theory are therefore not components or their breakage (for example, holes in layers of defense), but system constraints and objectives.57
An important consequence is that control theory is not concerned with individual unsafe acts or errors, or even individual events that may have helped trigger an adverse event. Such a focus does not help, after all, in identifying broader ways to protect the system against migrations towards risk. Control theory also rejects the depiction of adverse events in a traditionally physical way, as the latent failure model does, for example. Accidents are not about particles, trajectories or collisions between hazards and the process-to-be-protected. Removing individual unsafe acts, errors or singular events from an adverse event sequence only creates more space for new ones to appear if the same kinds of systemic constraints and objectives are left similarly ill-controlled. The focus of control theory is therefore not on erroneous actions or violations, but on the mechanisms that help generate such behaviors at a higher level of functional abstraction – mechanisms that turn these behaviors into normal, acceptable and even indispensable aspects of an actual, dynamic, daily work context that needs to survive inside the constraints of three kinds of boundaries (functional, economic and safety).
For control theory, the making and enforcing of rules is not an effective strategy for controlling behavior. Control can instead be achieved by making the boundaries of system performance explicit and known, and by helping people develop skills at coping with the edges of those boundaries.58 This, indeed, should be part of the information environment in which decision-makers operate. Ways proposed by Rasmussen include increasing the margin from normal operation to the safety boundary. This can be done by moving the safety boundary further out, or by moving operations further inward, away from a fixed safety boundary. In both cases more margin opens up. This, however, is only partially effective because of risk homeostasis – the tendency for a system to gravitate back to a certain level of risk acceptance, even after interventions to make it safer. In other words, if the boundary of safe operations is moved further away, then normal operations will likely follow not long after – under pressure, as they always are, from the objectives of efficiency and less effort.
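The risk homeostasis argument can be sketched as a small simulation. This is a toy model under loudly stated assumptions: the function name `margin_over_time`, the notion of a fixed `accepted_margin`, and the `pull` rate at which efficiency pressure closes any extra margin are all hypothetical and are not part of Rasmussen's or the author's formulation.

```python
def margin_over_time(boundary, operating_point, accepted_margin,
                     pull=0.2, steps=40):
    """Return the safety margin after each step.

    Assumptions (illustrative only):
    - the organization tolerates a fixed `accepted_margin` to the boundary;
    - each step, pressure for efficiency moves operations toward the
      boundary by a fraction `pull` of whatever margin exceeds that level.
    """
    margins = []
    for _ in range(steps):
        margin = boundary - operating_point
        operating_point += pull * (margin - accepted_margin)
        margins.append(boundary - operating_point)
    return margins


# Before the intervention: operations sit at the accepted margin and stay there.
before = margin_over_time(boundary=100.0, operating_point=90.0,
                          accepted_margin=10.0)

# Intervention: the safety boundary is moved further out (100 -> 120).
# The extra margin opens up at first, then operations follow the boundary.
after = margin_over_time(boundary=120.0, operating_point=90.0,
                         accepted_margin=10.0)
```

Under these assumptions the intervention buys a larger margin only temporarily: the first entries of `after` are well above 10, and the last entries are back near the accepted level. The sketch says nothing about how fast this happens in a real organization; the `pull` value is arbitrary.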
