Drift Into Failure (Sidney Dekker) » p.16


Turner also saw managerial and administrative difficulties in handling information in complex situations that blurred signal with noise. There were failures to comply with discredited or out-of-date regulations, and these passed unnoticed because of a cultural lag in what was accepted as normal. Then there was the "strangers and sites" problem: people (strangers) entering areas they officially shouldn't because of a lack of mandate or knowledge, and sites being used for purposes that were not originally intended. This all amounted to the judgment errors, cognitive lapses and communication difficulties that Turner saw as critical for creating the discrepancy in which failure was incubated. This point of view, focusing on which components went wrong where, is sustained in recent writings on high-reliability theory:

Failure means that there was a lapse in detection. Someone somewhere didn't anticipate what and how things could go wrong. Something was not caught as soon as it could have been caught.9

This expresses the same Newtonian commitment as that which Turner could not escape. The gap between risk-in-the-world and risk-as-perceived grows because somebody, somewhere in the system, is not detecting things that could be detected. This represents broken components that can be tracked down and fixed or replaced.

Risk as Energy to be Contained: Barrier Analysis

Man-made disaster theory has left another theoretical legacy that has made it difficult to think of organizational risk in dynamic, adaptive terms and trace its development over time. Man-made disaster theory, from its very roots, takes for granted that we think of risk in terms of energy – a dangerous build-up of energy, unintended transfers, or uncontrolled releases of energy. This risk needs to be contained, and the most popular way is through a system of barriers: multiple layers whose function it is to stop or inhibit propagations of dangerous and unintended energy transfers. This separates the object-to-be-protected from the source of hazard by a series of defenses (which is a basic notion in the latent failure model). Other countermeasures include preventing the gradual build-up of dangerous energy or improving its recognition (something that very much inspired man-made disaster theory), reducing the amount of energy (for example, reducing the height or composition of the tip), and preventing the uncontrolled release of energy or safely distributing its release.

The conceptualization of risk as energy to be contained or managed has its roots in efforts to understand and control the physical nature of accidents. This also points to the limits of such a conceptualization. It is not necessarily well-suited to explain the organizational and socio-technical factors behind system breakdown, nor equipped with a language that can meaningfully handle processes of gradual adaptation, or the social processes of risk management and human decision-making. The central analogy used for understanding how systems work in these models is a technical system (for which the Newtonian-Cartesian worldview is optimally suited). And the chief strategy for understanding how these work and fail has always been reductionism. That means dismantling the system and looking at the parts that make up the whole. Consistent with Newtonian logic, this approach assumes that we can derive the macro properties of a system (for example, safety) as a straightforward combination or aggregation of the performance of the lower-order components or subsystems that constitute it. Indeed, the assumption is that safety can be increased by guaranteeing the reliability of the individual system components and the layers of defense against component failure so that accidents will not occur.

These assumptions are visible in one of the off-shoots of man-made disaster theory: the Swiss Cheese Model (also known as the latent failure model, or defenses in depth model), which was first published in the late 1980s.10 It preserves the basic features of the risk-as-energy model. The Swiss Cheese Model relies on the sequential, or linear, progression of failure(s) that became popular in the 1930s, particularly in industrial safety applications. There, adverse outcomes were viewed as the conclusion of a sequence of events. It was a simple, linear way of conceptualizing how events interact to produce a bad outcome. According to the sequence of events idea, events preceding the accident happen linearly, in a fixed order, and the accident itself is the last event in the sequence. In a slightly different version, it has been known, too, as the domino model, for its depiction of an accident as the endpoint in a string of falling dominoes.

The protection against failure is to put in barriers. These barriers, or defenses, need to be put in place to separate the object to be protected from the hazard. They are measures or mechanisms that protect against hazards or lessen the consequences of malfunctions or erroneous actions. These defenses come in a variety of forms. They can be engineered (hard) or human (soft); they can consist of interlocks, procedures, double-checks, actual physical barriers or even a line of tape on the floor of the ward (that separates an area with a particular antiseptic routine from other areas, for example). According to Reason, the "best chance of minimizing accidents is by identifying and correcting these delayed action failures (latent failures) before they combine with local triggers to breach or circumvent the system's defenses." This is consistent with ideas about barriers and the containment of energy or the prevention of uncontrolled release of energy.

But defense layers have "holes" in them. An interlock can be bypassed, a procedure can be ignored, a safety valve can begin to leak. An organizational layer of defense, for example, involves such processes as goal setting, organizing, communicating, managing, designing, building, operating, and maintaining. All of these processes are fallible, and produce the latent failures that reside in the system. These latent failures are not normally a problem, but when combined with other factors, they can contribute to an accident sequence. Indeed, according to the latent failure model, accidents happen when all of the layers are penetrated (when all their imperfections or "holes" line up). Incidents, in contrast, happen when the accident progression is stopped by a layer of defense somewhere along the way. The Swiss Cheese Model got its name from the image of multiple layers of defense with holes in them. Only a particular relationship between those holes, however (when they all "line up"), will allow hazard to reach the object that was supposed to be protected.
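The arithmetic implied by this picture can be sketched in a few lines. This is only an illustration of the model's own linear logic, not something the model's authors specify: it assumes that each layer has a known chance of having a hole in the hazard's path and that the layers fail independently of one another. The function name and the numbers are hypothetical.

```python
from math import prod


def breach_probability(hole_probabilities):
    """Chance that a hazard passes every layer of defense.

    Assumption (not from the text): each value is the probability that
    one layer has a hole in the hazard's path, and layers are independent.
    """
    for p in hole_probabilities:
        if not 0.0 <= p <= 1.0:
            raise ValueError("each probability must lie between 0 and 1")
    # All holes must "line up", so the per-layer chances multiply.
    return prod(hole_probabilities)


# Hypothetical values for three layers: interlock, procedure, double-check.
layers = [0.1, 0.05, 0.2]
p_accident = breach_probability(layers)   # hazard reaches the protected object
p_stopped = 1.0 - p_accident              # progression halted by some layer
print(f"accident: {p_accident:.4f}, stopped along the way: {p_stopped:.4f}")
```

The independence assumption carries the whole calculation. It treats safety as a simple aggregation of component performance, which is the reductionist reading of the model described here.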

It is interesting to see how people have, for a long time, modeled undesirable energy as proceeding along a linear trajectory that needs stopping or channeling. To this day, houses in Bali, Indonesia, are typically built according to specifications that keep evil spirits out. Evil spirits are believed to be able to travel in straight lines only, so if all the holes (for example, doors) line up, the spirit can enter the house, but will exit it again at the other end. Evil is thus released. If doors don't line up, evil stops traveling and is contained within the house.11 The purpose of holes in the layers of defense (or doors in the walls) is opposite in these two ideas, of course. In the Swiss Cheese Model, holes that allow evil to travel through are bad. In the Balinese myth, they are good. But the notion of a linear trajectory along which bad influences on a system's health travel is the same. And, in the West, the notion of trajectories in three-dimensional space is of course entirely Newtonian.

As for the Swiss Cheese Model, the imperfection of the layers of defense can sponsor a search for broken parts (errors, communication difficulties, deficient supervision). Together with other theoretical assumptions, man-made disaster theory and its offshoots have thus retained a firmly Newtonian position about risk and people's knowledge of risk. Thinking of danger in terms of uncontrolled energy releases was one. The idea that there is a "real" risk out there, versus an "imagined" risk inside organizations and inside people's heads was another. This latter assumption made it difficult for accident theory to understand how people saw their world at the time and why this made sense to them. These assumptions probably helped keep the brakes on subsequent developments in accident research and would hamper progress on theorizing beyond the broken part. Today, a substantial gulf still lies between man-made disaster theory and its offshoots on the one end, and what is considered complexity- and systems thinking on the other.

Man-made disaster theory argues that (p. 16), "despite the best intentions of all involved, the objective of safely operating technological systems could be subverted by some very familiar and "normal" processes of organizational life."12 Such "subversion" occurs through usual organizational phenomena such as information not being fully appreciated, information not correctly assembled, or information conflicting with prior understandings of risk. Turner noted that people were prone to discount, neglect or not take into discussion relevant information, even when available, if it mismatched prior information, rules or values of the organization.

The problem is that this doesn't really explain how or why people who manage a ward or a service or a hospital are unable to "fully" appreciate available information despite the good intentions of all involved. In the absence of such an explanation, the only prescription for them is to try a little harder, and to realize that safety should be their main concern. To try to imagine how risk builds up and travels through their organization. Indeed, man-made disaster theory offers that managerial and operational activities aimed at preventing a drift into failure both reflect and are promoted by at least the following four features:

❍ Senior management commitment to safety

❍ Shared care and concern for hazards and a willingness to learn and understand how they impact people

❍ Realistic and flexible norms and rules about hazards

❍ Continual reflection on practice through monitoring, analysis and feedback systems.

Through these four aspects, an organization is supposed to be able to continuously monitor weak signals and revise its responses to them. Some high-risk industries apparently succeeded in reaching an end state in which they were quite adept at this. Empirical observations of systems such as air traffic control, power generation and aircraft carriers in the 1980s grew into an entire school of thought – that of high-reliability organizations.

High Reliability Organizations

High reliability theory asks how risks are monitored, evaluated and reduced in organizations; what human actions, what deliberate processes lie behind that and how can they be enhanced? The theory argues that careful, mindful organizational practices can make up for the inevitable limitations on the rationality of individual members. High reliability theory describes the extent and nature of the effort that people at all levels in an organization can engage in to ensure consistently safe operations despite its inherent complexity and risks.13

During a series of empirical studies, high reliability organizational (HRO) researchers found that through leadership safety objectives, the maintenance of relatively closed systems, functional decentralization, the creation of a safety culture, redundancy of equipment and personnel, and systematic learning, organizations could achieve the consistency and stability required to achieve failure-free operations. Some of these findings seemed closely connected to the worlds studied – naval aircraft carriers, for example. There, in a relatively self-contained and isolated, closed system, systematic learning is an automatic by-product of the swift rotations of naval personnel, turning everybody into instructor and trainee, often at the same time. Functional decentralization meant that complex activities (like landing an aircraft and arresting it with the wire at the correct tension) were decomposed into simpler and relatively homogenous tasks, delegated down into small workgroups with substantial autonomy to intervene and stop the entire process independent of rank. High reliability researchers found many forms of redundancy – in technical systems, supplies, even decision-making and management hierarchies, the latter through shadow units and multi-skilling.

When researchers first set out to examine how safety is created and maintained in such complex systems, they focused on errors and other negative indicators, such as incidents, assuming that these were the basic units that people in these organizations used to map the physical and dynamic safety properties of their production technologies, ultimately to control risk. The assumption was wrong: they were not. Operational people, those who work at the sharp end of an organization, hardly defined safety in terms of risk management or error avoidance. Four ingredients kept reappearing, and they form the contours of what has become high reliability theory, or the theory of high reliability organizations.14

Leadership safety objectives are the first ingredient. Without such commitment, there is little point in trying to promote a culture of reliability. The idea is that others in the organization will never be enticed to find safety more important than their leadership. Short-term efficiency, or acute production goals, are openly (and sometimes proudly) sacrificed when chronic safety concerns come into play. Agreement about the core mission of the organization (safety) is sought at every available opportunity, particularly through clear and consistent top-down communication about the importance of safety. Such commitments and communication may not be enough, of course. Other people have pointed out that the distance between loftily stated goals and real action is quite large, and can remain quite large, despite leadership or managerial pledges to the contrary.15

The need for redundancy is the second ingredient. The idea behind redundancy is that it is the only way to build a reliable system out of unreliable parts. Multiple and independent channels of communication and double-checks should, in theory, be able to produce a highly reliable organization. Redundancy in high reliability theory can take two forms: duplication and overlap. In duplication, two different units or people or parts perform the same function, often in real time. Duplication is also possible serially, as in the double-checking of a medication preparation. Overlap is redundancy where units or people or parts have some functional areas in common, but not all. It is obviously a cheaper solution.16
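The claim that redundancy builds a reliable system out of unreliable parts can be made concrete with a small calculation. This is a minimal sketch of the duplication case only, under an assumption the text does not state: the duplicated units fail independently of each other, and the function succeeds if at least one unit does. The function name and figures are hypothetical.

```python
def duplicated_reliability(unit_reliability, copies):
    """Reliability of a function performed by several duplicate units.

    Assumption (not from the text): units fail independently, and the
    function succeeds as long as at least one unit works.
    """
    if not 0.0 <= unit_reliability <= 1.0:
        raise ValueError("unit_reliability must lie between 0 and 1")
    if copies < 1:
        raise ValueError("copies must be at least 1")
    # The function fails only if every copy fails at the same time.
    return 1.0 - (1.0 - unit_reliability) ** copies


# Hypothetical example: a check that catches an error 90% of the time.
for n in (1, 2, 3):
    print(n, round(duplicated_reliability(0.9, n), 6))
```

If the duplicated units share a cause of failure (the same procedure, the same training, the same assumptions), the independence assumption no longer holds and the gain is smaller than this calculation suggests.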

Decentralization, culture and continuity form the third ingredient. High reliability organizations rely on considerable delegation and decentralization of decision authority about safety issues. These organizations don't readily court government or regulatory interference with their activities and instead acknowledge the superiority of local entrepreneurial efforts to improve safety through engineering, procedure or training.17 People inside organizations continually create safety through their evolving practice. In high reliability organizations, active searching and exploration for ways to do things more safely is preferred over passively adapting to regulation or top-down control.

As a result, sharp-end practitioners in high-reliability organizations are entrusted to take appropriate actions in tight situations because they will have been inculcated through rituals, values, exercises and incentives. Having members work in a "total institution," isolated from wider society and inside their own world, seems to contribute to a culture of reliability. This aim is consistent with the maintenance of a relatively closed system. Finally, continuous operations and training, non-stop on-the-job education, a regular throughput of new students or other learners, and challenging operational workloads contribute greatly to reduced error rates and enhanced reliability.

Organizational learning is the fourth ingredient for high reliability organizations. High reliability grows out of incremental learning through trial and error. Things are attempted, new procedures or routines are tested (if carefully so), the effects are duly considered. Smaller dangers are courted in order to understand and forestall larger ones.18 Simulation and imagination (for example, disaster exercises) are important ways of doing so when the costs of failure in the real system are too high. For high reliability theory, such learning does not need to be centrally orchestrated. In fact, the distributed, local nature of learning is what helps new and better ways of doing things emerge, a faster and more operationally grounded way of learning by trial and error.

Challenging the Belief in Continued Safe Operations

Ensuing empirical HRO work, stretching across decades and a multitude of high-hazard, complex domains (aviation, nuclear power, utility grid management, navy) affirmed this picture. Operational safety – how it is created, maintained, discussed, mythologized – is much more than the control of negatives. As Gene Rochlin put it:

The culture of safety that was observed is a dynamic, intersubjectively constructed belief in the possibility of continued operational safety, instantiated by experience with anticipation of events that could have led to serious errors, and complemented by the continuing expectation of future surprise.19

The creation of safety, in other words, involves a belief about the possibility to continue operating safely. This belief is built up and shared among those who do the work every day. It is moderated or even held up in part by the constant preparation for future surprise – preparation for situations that may challenge people's current assumptions about what makes their operation risky or safe. It is a belief punctuated by encounters with risk, but it can become sluggish by overconfidence in past results, blunted by organizational smothering of minority viewpoints, and squelched by acute performance demands or production concerns. Instead it should be a belief that is open to intervention so as to keep it curious, open-minded, complexly sensitized, inviting of doubt, and ambivalent toward the past.20

Note how these HRO commitments try to pull an organization's belief in its own infallibility away from Turner's disaster incubation. The past is no good basis for sustaining a belief in future safety, and listening to only a few channels of information will render the belief narrow and unchallenged. Disaster gets incubated when the organization's belief in continued safe operations is left to grow and solidify. So what can an organization do to continually challenge its own belief and to calibrate it against "reality"? How can any organization reach this desired high-reliability end state, and stay there?
