
Policy and Society 30 (2011) 77–87 www.elsevier.com/locate/polsoc

The importance of failure theories in assessing crisis management: The Columbia space shuttle disaster revisited§

Arjen Boin a,*, Denis Fishbacher-Smith b

a Utrecht University, Netherlands & Louisiana State University, USA
b University of Glasgow, United Kingdom

Available online 13 April 2011

Abstract

An adequate assessment of crisis management failure (and success) requires a validated causal theory. Without such a theory, any assessment of crisis management performance amounts to little more than a "just so" story. This is the key argument of this paper, which describes how hindsight biases and selective use of social science theory gave rise to a suggestive and convincing – but not necessarily correct – assessment of NASA's role in the Columbia space shuttle disaster (1 February 2003). The Columbia Accident Investigation Board (CAIB) identified NASA's organizational culture and safety system as a primary source of failure. The CAIB report reads as a stunning indictment of organizational incompetence: the organization that thrilled the world with the Apollo project had "lost" its safety culture and failed to prevent a preventable disaster. This paper examines the CAIB findings in light of the two dominant theoretical schools that address organizational disasters (normal accident and high reliability theory). It revisits the Columbia shuttle disaster and concludes that the CAIB findings do not sit well with the insights of these schools.

© 2011 Policy and Society Associates (APSS). Elsevier Ltd. All rights reserved.

"The Board believes that the Shuttle Program should have been able to detect the foam trend and more fully appreciate the danger it represented" (CAIB, 2003:189–190).

"So today, we may be not willing to take any risk, but in that case, you can't fly because there is always going to be risk [...] You have got to expect that you are going to have failures in the future" (George Mueller, cited in Logsdon, 1999:26).

1. Introduction: NASA, the Columbia disaster, and the assessment of failure

After a crisis or disaster, a widespread consensus tends to emerge – however briefly – that this should "never happen again" (Boin, McConnell, & 't Hart, 2008). To make sure similar events do not recur, or to identify opportunities for timely interventions, one needs to know how such crises and disasters unfold. What is needed, in other words, is a causal theory of crisis and disaster. Scholars have made substantial progress in developing an understanding of the causes of so-called organizational disasters.

§ We thank Allan McConnell and the three anonymous reviewers for their perceptive comments, which greatly helped to improve this paper.
* Corresponding author.
E-mail addresses: [email protected] (A. Boin), [email protected] (D. Fishbacher-Smith).

1449-4035/$ – see front matter © 2011 Policy and Society Associates (APSS). Elsevier Ltd. All rights reserved.
doi:10.1016/j.polsoc.2011.03.003


Their theories provide a rich framework for describing and explaining why organizations cause, or fail to prevent, crises such as oil spills and explosions. Their theoretical work is increasingly used by public inquiries in their efforts to understand and learn from major disasters. While this increased use underlines the relevance of social scientists and their work, it conceals the actual state of the art in this field: these theories have not been tested against a large (or really any) number of cases. Using untested, or perhaps even faulty, theories to assess the performance of organizations and the people who work in them may have serious consequences. A case in point is the official evaluation of NASA's performance before and during the Columbia space shuttle disaster.

On February 1, 2003, the Columbia space shuttle disintegrated during the final stages of its return flight to earth. The Columbia Accident Investigation Board (CAIB) found that NASA had failed. In fact, the Board reported that this disaster had the same root causes as the Challenger disaster, which had happened 17 years earlier (Mahler, 2009). The Rogers Commission, which studied the Challenger disaster, criticized NASA for not responding adequately to 'internal' warnings about the impending disaster (Presidential Commission on the Space Shuttle Challenger Accident, 1986). It documented flawed decision-making processes, a lack of safety considerations and bureau-political tensions between the various centers that, together, make up NASA and provide the organizational infrastructure that ultimately allows the Shuttle to fly. The CAIB found that many of the very same pathological factors were alive and well some 17 years after the Challenger disaster.

The CAIB arrived at such a resolute and damaging assessment using social science theories. This paper asks whether these theories allow for such definite assessments. We outline the basic tenets of two important schools of thought in organizational disaster research. We then sketch the institutional history of NASA's much-maligned safety system, revisit the Columbia shuttle disaster, and offer an alternative interpretation of the data. The conclusion offers some thoughts on the analysis of these complex disasters and the implications for both theory and practice within accident research and management.

2. Summarizing the findings of the Shuttle disaster investigations

It took the CAIB a little over six months to publish its findings. The Board found that a piece of foam had damaged the shuttle's thermal protection system during the launch, which left the Columbia unprotected from the tremendous heat generated during re-entry into the earth's atmosphere. But the Board went further than uncovering the technical cause of the disaster. The CAIB concluded that NASA's safety system had failed. CAIB arrived at this far-reaching conclusion by applying "high reliability theory" to analyze the causes of this disaster (see Chapter 7 of the CAIB report). The Board explained that NASA suffered from organizational problems. The root causes of these problems, CAIB asserted, dated back to the years preceding the Challenger disaster. CAIB thus pointed to a fundamental problem within the culture of NASA that transcended both accidents.

Let us briefly recapitulate the most important root causes identified by the two shuttle disaster inquiries:

Acceptance of escalated risk: The Rogers Commission (1986) found that NASA operated with what the Commission considered a deeply flawed risk philosophy. This philosophy prevented NASA from properly investigating anomalies that had emerged during previous shuttle flights.
One member of the Rogers Commission, Nobel laureate Richard Feynman, described the core of the problem (as he saw it) in an official appendix to the final report:

"The argument that the same risk was flown before without failure is often accepted as an argument for the safety of accepting it again. Because of this, obvious weaknesses are accepted again, sometimes without a sufficiently serious attempt to remedy them, or to delay a flight because of their continued presence" (Feynman, 1986, Appendix F, p. 1).

The CAIB found the very same philosophy at work: "with no engineering analysis, Shuttle managers used past success as a justification for future flights" (CAIB, 2003:126). This explains, according to CAIB, why NASA had "ignored" the shedding of foam, which had occurred during most of the previous shuttle launches.
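The statistical weakness of the past-success argument is easy to demonstrate. As a stylized illustration (the round numbers are ours, not drawn from the Commission's record): suppose the true per-flight probability of catastrophic failure were as high as p = 1/50. The chance of nevertheless observing the 24 failure-free shuttle flights that preceded Challenger would be

\[ (1 - p)^{24} = (0.98)^{24} \approx 0.62, \]

so a majority of equally risky programs would display exactly the same unblemished record. Conversely, 24 successes with no failures only bound the per-flight failure probability at roughly \(3/24 \approx 0.12\) at the 95% confidence level (the so-called rule of three): past success, by itself, cannot distinguish a safe vehicle from a dangerous one.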


Flawed decision-making: The Rogers Commission criticized NASA's decision-making system, which "did not flag rising doubts" among the workforce with regard to the safety of the shuttle. On the eve of the Challenger launch, engineers at Thiokol (the contractor responsible for the solid rocket boosters and their O-ring seals) had suggested that cold temperatures could undermine the effectiveness of the O-rings. After several rounds of discussion, NASA management decided to proceed with the launch. The issues raised by the Thiokol engineers were dismissed on the grounds that there was insufficient evidence to 'prove' that the risks existed. Similar doubts were raised and dismissed before Columbia's fateful return flight. Several engineers alerted NASA management to the possibility of serious damage to the thermal protection system (after watching launch videos and photographs). After several rounds of consultation, it was decided not to pursue further investigations (such as photographing the shuttle in space). Such an investigation, the CAIB report asserts, could have initiated a life-saving operation.

A broken safety culture: Both commissions were deeply critical of NASA's safety culture. The Rogers Commission noted that NASA had "lost" its safety program; the CAIB speaks of "a broken safety culture." A critical indicator, according to both reports, is NASA's susceptibility to "schedule pressure," which allegedly caused NASA to overlook obvious safety concerns in order to maintain its launch schedules. In the case of Columbia, CAIB observed that the launch date was tightly coupled to the completion schedule of the International Space Station. NASA had to meet these deadlines, CAIB argues, because failure to do so would undercut its legitimacy (and funding).1

The CAIB, in summary, explicitly concluded that NASA had failed to prevent what was judged to be a preventable disaster. CAIB judged that NASA is not a High Reliability Organization, which, presumably, would have prevented both disasters. In the following section, we show that this theory and the way it was applied cannot support the harsh judgment delivered by CAIB. We go one step further and argue that the existing theories of disaster causes do not warrant any type of definite assessment as laid down by CAIB.

1 See McDonald (2005) for a resolute dismissal of this claim.

3. Explaining organizational disaster: Theories of unruly technology and faltering defenses

The study of organizational disaster has developed substantially over the years (Reason, 1990; Sagan, 1993; Smith & Elliot, 2006; Turner, 1976; Turner & Pidgeon, 1997; Weick & Sutcliffe, 2001). It has moved our understanding beyond simple mono-causal explanations – human error or God's wrath – to more complex explanations that involve the interplay between unruly technologies, operator errors, organizational cultures, leadership, and institutional environments (Rosenthal, 1998). Several theoretical "schools" have emerged that purport to explain the causes of organizational disaster. We concentrate on two dominant schools of thought: Normal Accident Theory (NAT) and High Reliability Theory (HRT). The CAIB has made extensive use of the latter, while mostly ignoring the insights of the former.

An important and widely shared idea in the research on socio-technical failures is that it takes "just the right combination of circumstances to produce a catastrophe" (Perrow, 1994:217; cf. Reason, 1990). The right combination would involve a disturbance or glitch in the organization's core technology, which generates unique task demands for operators and managers that confuse them and confound their abilities to respond effectively. The disturbance must slip past the organizational defenses that were designed to prevent it. Wider organizational factors and decision-making processes must compound and conceal the initial problem(s) and allow the problem to escalate.

Normal Accident Theory explains how dangerous technologies can escalate out of control. According to NAT, the causes of organizational breakdown are rooted in the very efforts to build perfect organizations.
It is not the "sick" organization that is most likely to produce a disaster (Miller, 1988); in this perspective, it is the modern and efficient organization that one should fear. To harness the dangerous potential of powerful technologies, an organization must build rational structures and processes. The human capacity to build flawless control mechanisms is, however, limited. Rational structures typically produce unintended consequences, which are not always noticed right away because of the inherent impossibilities (and confusions) that come with information collection, processing, and interpretation. An important premise of NAT is that these unintended consequences make use of the rational organization to propagate efficiently and rapidly; the rational organization – designed to control disturbances – actually magnifies errors and glitches, which can then spin out of control into disaster (Turner & Pidgeon, 1997; cf. Schulman, 1989; Smith, 2005). Disastrous breakdowns require organizing ability (cf. Masuch, 1985):

"Errors are only likely to develop into large-scale accidents or disasters if they occur in an organizational decision hierarchy or power hierarchy at a point at which they are likely to be magnified, and possibly to be compounded with other smaller errors, by the operation of the normal administrative processes. They are most likely to produce large-scale accidents, that is to say, if they are linked with the negentropic tendencies of an organization or of a cluster of organizations, so as to become major 'anti-tasks' which the organization then inadvertently executes" (Turner & Pidgeon, 1997, p. 151).

In his classic Normal Accidents, Perrow (1984/1999) illuminates this argument by introducing the concepts of interactive complexity and tight coupling. The more interactively complex an organization and its core technology become, the harder it is for operators (the people most knowledgeable about the technology) to understand the system and to intervene in a timely manner when it behaves in unexpected ways. Operators are bound to misinterpret small "glitches" and may initiate remedial actions that fuel rather than dampen the emerging disaster. If an organization is also tightly coupled, the misunderstood errors travel rapidly throughout the organizational system (cf. LaPorte, 1975:17). Together, Perrow argues, interactive complexity and tight coupling make for "normal accidents."

This line of thinking suggests an "inexorable systemic logic in industrial societies that makes crises almost inevitable" (Shrivastava, Mitroff, Miller, & Miglani, 1988:286). As Western societies become ever more complex and tightly coupled (examples include globalization and the widespread use of new technologies), we would expect more breakdowns to occur, although this does not appear to be happening (Baer, Heron, Morton, & Ratliff, 2005; Wildavsky, 1988). If anything, it could be argued that modern society has become safer. This prompts a set of intriguing questions: why do some organizations suffer "normal" breakdowns whereas many others do not? Could it be that some organizations have learned to deal with complexity and tight coupling? Have they found ways to discover emerging crises in time? Have they perhaps learned to "de-couple" organizational systems (cf. Nutt, 2004)? Have these organizations learned to become reliable?

The second dominant school of thought, known as HRT, suggests that this might be the case (La Porte, 1994, 1996; LaPorte & Rochlin, 1994; Perrow, 1984; Perrow, 1994; Rochlin, 1996; Sagan, 1993). HRT scholars have identified a set of factors that may help an organization to recognize an impending disaster and to intervene in the early stages of escalation (Rochlin, 1996; Weick & Sutcliffe, 2001). They hypothesize that reliable organizations are set apart from other organizations by a pervasive safety culture, which nurtures a common awareness of potential vulnerabilities ("it can happen to us") and a particular way of working. These organizations institutionalize mechanisms that promote rapid awareness and local resilience, a powerful combination that allows an organization to localize and compartmentalize a developing disaster, while at the same time communicating the nature of the failure (and the actions taken) to other parts of the organization.

HRT scholars also pay explicit attention to the institutional environment of these reliable organizations (Egan, 2011; LaPorte, 2011). They emphasize that organizations can only accomplish a track record of reliability if their political patrons provide them with the necessary means and support. A sense of dread must exist with regard to the potential consequences of disaster. A widely shared understanding is required that such disasters (think of a nuclear explosion) must be prevented at all costs.
This condition would make it hard, at least in theory, for most organizations to reach the status of "high reliability organization."

A crucial question is whether high reliability organizations can be designed. Organization theorists seem to assume at least a partial relation between a well-entrenched organizational culture and consistent management efforts (Perrow, 1986; Pfeffer, 1978; Schein, 1985; Selznick, 1957). Some HRT scholars argue that organization leaders can make their organization more reliable by creating a safety culture and by bringing the organizational structure in line with such a culture (Weick & Sutcliffe, 2001). Other HRT scholars appear more doubtful (LaPorte, 2011; Rochlin, 2011).

Perrow does not believe organizations can prevent breakdowns or control their consequences. His theory is predicated on the assumption that modern technology cannot be understood due to the scale of its complexity and cannot be controlled due to the speed of its interactions; to think we can is what makes it all the more dangerous. NAT, Perrow (1994:213) insists, is "bleakly explicit on this account: while it is better to try hard than not, because of system characteristics trying hard will not be enough" (emphasis added). In other words: whatever NASA does, we should expect an occasional shuttle disaster.

HRT scholars put more faith in trying hard. They point to organizations that have dealt with dangerous technology for decades without any major accidents. They show that these organizations have erected effective defenses against disastrous failure. They remain divided, however, with regard to the question whether these organizational defenses are the outcome of long-term adaptation or the product of design. HRT suggests that NASA would be wise to try hard to prevent shuttle disasters from happening, but it stops short of relating disaster to the organization's culture (Boin & Schulman, 2008; LaPorte & Rochlin, 1994).


We thus have two intriguingly different and seemingly complementary sets of hypotheses (and that is really all they are) with regard to the causes of organizational disaster. The NAT perspective explains a disaster in terms of the inherent human inability to tame dangerous technologies and the role of organizations in allowing failures to escalate into major catastrophic events, despite the control systems that are put into place. Here, a 'disaster' is viewed as the inevitable low-chance, high-impact event, which will eventually materialize as a result of the interaction between human error and technological glitches. The HRT perspective is slightly more optimistic when it comes to the organizational capacity to detect and correct impending known types of disasters. A disaster is, to a degree, the result of the failures of organizational controls to stop the process of escalation.

These theories are still relatively young and untested. A firmly tested theory of organizational disaster simply does not exist. If organization theorists agree on anything, it is that "complete prevention [of organizational disasters] is impossible" (Mitroff, Pauchant, & Shrivastava, 1988:102). But that is where the agreement ends.

3.1. Using untested theories to explain real disasters: The seductive effect of hindsight bias

The use of social science theories by public inquiries to explain crises and disasters appears to be a growing trend. The application of unproven theories – or mere hypotheses – is obviously problematic from an academic point of view. But these theories do help to create a convincing narrative, which is crucially important in the accountability phase of a crisis or disaster. After the disaster, a culprit must be found (Boin et al., 2008).

What makes these untested theories work so nicely in practice is the so-called hindsight bias (Woods, 2005). Once analysts assume that "a broken safety culture" may be the root cause of the disaster at hand, it becomes all too easy to trace that disaster back to a combination of strategic oversight, cost pressures, defective equipment and operator error (see Reason, 1990; Turner, 1976; Turner & Pidgeon, 1997). The relation between these pathological factors and the destructive outcome can easily be made to look self-evident. What is often forgotten, as Perrow (1994) points out, is that the root causes uncovered by public inquiries tend to be present in organizations that did not suffer similar breakdowns under similar circumstances (cf. Boin & Rattray, 2004; Brunk, 2001). In fact, most of the identified "triggers" and "factors" are ubiquitous in the great majority of large organizations. Operator and managerial errors, policy failures, cost pressures, environmental shocks, faulty design, defective equipment – what organization does not suffer from these pathologies at one time or another?
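Perrow's control-group point can be put in numbers. The sketch below is a purely illustrative base-rate calculation (every probability in it is a hypothetical assumption, not an estimate from the CAIB record or any dataset); it shows why finding a "broken safety culture" after a disaster says little by itself:

    # Illustrative base-rate calculation for the hindsight problem discussed above.
    # All numbers are hypothetical assumptions chosen only to make the logic visible.

    p_trait = 0.30                     # assumed share of large organizations an auditor
                                       # would label as having a "broken safety culture"
    p_disaster_given_trait = 0.002     # assumed annual disaster probability with the trait
    p_disaster_given_no_trait = 0.001  # assumed annual disaster probability without it

    # Law of total probability: overall chance of a disaster in a given year
    p_disaster = (p_trait * p_disaster_given_trait
                  + (1 - p_trait) * p_disaster_given_no_trait)

    # Bayes' rule: how often the labelled "root cause" is present once a disaster occurs
    p_trait_given_disaster = p_trait * p_disaster_given_trait / p_disaster

    print(round(p_trait_given_disaster, 2))      # ~0.46: the trait shows up in many post-mortems
    print(round(1 - p_disaster_given_trait, 3))  # 0.998: yet almost every organization with the
                                                 # trait gets through the year without a disaster

Even under assumptions that make the trait twice as dangerous, the overwhelming majority of organizations carrying the same "root cause" never fail; without a comparison group, its presence after the fact is weak evidence of causation.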
The hindsight bias feeds on the widespread idea that an impending disaster is an ontological entity, something "out there," leaving "a repeated trail of early warning signals" (Mitroff et al., 1988:104). Barely audible signals are thought to announce the impending arrival of the catastrophic event. As the CAIB report (p. 184) phrased it: "How could NASA have missed the signals that the foam was sending?" If only organizations would pay attention! Of course, herein lies the problem. Organizations are not entities that can "pay attention." They are a mosaic of elements that interact together and generate fractures within and between such interactions. It is these fractures that generate the problems of communication and 'organizational deafness' that precede any organizational disaster (Smith, 2006).

To select a single unproven theory and apply it with hindsight bias (without using a control group) is, in short, not a solid base for sound analysis and evaluation. The selection and uncritical application of a single theory – in this case HRT – leads to a one-sided conclusion: it was not the unruly technology that comes with "one of the most complex machines ever devised" (CAIB, 2003:14), but "a broken safety culture" that caused this disaster. To demonstrate just how lop-sided this verdict is, we sketch an alternative narrative that hews closer to the NAT perspective and takes account of HRT insights. This is not the story of what "really happened." It is a demonstration of how the same social science theories that were available to CAIB can lead to a fundamentally different assessment.

4. Unruly technology, pressing constraints, and an unforgiving environment: NASA's safety system revisited

CAIB asserts that NASA could have prevented the Columbia disaster, if only it had been a High Reliability Organization. From a Normal Accident Theory perspective, however, such an assertion misses an important point: humans cannot control dangerous technologies through their imperfect organizations. In fact, in their very efforts to build perfect organizations, they will probably create unforeseen vulnerabilities. The best an organization can do, from this perspective, is to minimize known risks and create sufficient capacity to deal with emerging "unknown unknowns."


Launching humans into space and returning them safely is an extremely daunting challenge. Each space flight is an experiment with complex technology in an unforgiving environment. Extensive testing of each and every part of the 'system' is one way to minimize the risks of failure. However, there is never enough time and money to provide for completely reliable tests. Perfection is simply impossible. Nor is it expected, as HRT scholars would point out (Boin & Schulman, 2008). Political reality dictates that if nothing ever flies, NASA will be stripped of its budget in a hurry. Society expects NASA to take responsible risks. This is NASA's ever-present dilemma: safety (extensive testing by an army of specialists) has to be balanced against performance (schedules and budgets).

In the early years of the Apollo project, test failures and huge budget overruns threatened the success of the program (Johnson, 2002:102). NASA administrator Webb realized that drastic change was needed if NASA was to maintain political support and succeed in its lunar mission (Johnson, 2002:130–132). A new way of working was introduced, which turned NASA around "from a loosely organized research team to a tightly run development organization" (Johnson, 2002:142; cf. Murray & Cox, 1989). This way of working would mark NASA's culture up to the day of the Columbia disaster.

One change was the imposition of the "all-up testing" concept. NASA engineers used to test individual parts many times. Once individual parts were put together, more testing would follow. By adding tested parts, a gradual process of development emerged. This time-proven practice had two drawbacks. First, it would take a very long time to do a sufficient number of tests to create a statistical base for risk assessment. Second, the test process could never completely resemble a space environment. The NASA engineers shared a background of learning through failure: firing rockets, watching them explode, determining what went wrong, and redesigning the rocket – until the rocket was perfect. This learning method did not work within the context of manned space flight: once you strap people on top of a rocket, it has to work the first time around.

The all-up testing principle dictated the end of endless testing. The new mantra was an ultra-rational version of engineering logic: "design it right, fabricate it per print, and the component will work" (Murray & Cox, 1989:103). The legendary Von Braun thought the concept "sounded reckless," but he had to admit that the underlying engineering reasoning was "impeccable" (Murray & Cox, 1989:162). This impeccable reasoning held that "on a probability basis alone, there was no way to make 0.999999 claims on the basis of statistical evidence unless the engineers tested the parts millions of times" (Murray & Cox, 1989:101).

Chris Kraft, NASA's first flight director, described the risk that accompanied the all-up testing principle: "We said to ourselves that we have now done everything we know to do. We feel comfortable with all of the unknowns that we went into this program with. We know there may be some unknown unknowns, but we don't know what else to do to make this thing risk-free, so it is time to go" (cited in Logsdon, 1999:23).

In short, NASA rejected the verisimilitude of quantitative risk analysis and simply accepted that every space flight could end in disaster. This philosophy demanded an unwavering commitment to "sound engineering" principles and generated a powerful culture around expertise.
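The arithmetic behind the "no way to make 0.999999 claims" remark can be reconstructed in one line (a back-of-the-envelope sketch, not a calculation reported by Murray and Cox): demonstrating from test outcomes alone that the per-use failure probability p is below 10^-6, with 95% confidence and zero observed failures, requires

\[ (1 - p)^{n} \le 0.05 \quad\Rightarrow\quad n \;\gtrsim\; \frac{\ln 20}{p} \approx \frac{3}{10^{-6}} = 3 \times 10^{6} \]

consecutive successful tests – the "millions of times" the engineers had in mind. With any realistic test program, a 0.999999 reliability figure could only ever rest on engineering judgment, not on statistics.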
The launch procedures reflect this philosophy of calculated risk.

4.1. A philosophy of calculated risk

The process leading up to a launch has always been governed by strict rules. Missions were followed by mission evaluation reports (MERs), which identified anomalies that had occurred during flight. These anomalies had to be dealt with ("closed") before the next flight could take place. All judgments were made on the basis of engineering arguments only. In the words of one famous NASA character ("Mad" Don Arabian): "If anybody does anything technically that's not according to physics, that's bullshitting about something, I will forever be death upon them" (Murray & Cox, 1989:361). This engineering attitude would determine the evaluation and preparation of all manned space flights.

These rules originated in practice; they codified long-term experience. They represented "a set of values" and "took on the elements of a philosophy" (Murray & Cox, 1989:262). The rules thus served several purposes: prescribing best practices, enhancing central control, and protecting the organization and its individuals from external criticism (Johnson, 2002:225). This philosophy was checked by failure and reinforced by success. The spectacular success of the 1969 moon landing proved to many within NASA that the philosophy worked (Johnson, 2002).

Just how effective and resourceful NASA culture had become was demonstrated some time after the successful lunar mission, when Apollo 13 experienced an explosion in space.


Adherence to procedures enabled the engineers to figure out what had happened and what was possible. Yet, it was the capacity to be flexible and to depart from enshrined rules that gave rise to the level of improvisation that in the end saved the day (and the crew). The shared commitment to sound engineering and the institutionalized practice of open communication made it possible to solve this crisis in the nick of time. If High Reliability scholars had been around in the late 1960s, NASA would have been their exemplary organization.

4.2. Classifying emerging problems as acceptable risks

The design and operation of the space shuttles was much more complex than that of the Apollo spacecraft. Through the acceptable risk process, NASA seeks to identify "potential hazards associated with the design of the shuttle" (Vaughan, 1996:81). These hazards – there are many of them, given the experimental nature of the shuttle – have to be addressed before the shuttle can fly. If an identified hazard cannot be eliminated before launch time, NASA has to determine whether such a hazard qualifies as an "acceptable risk." If engineers can provide an acceptable "rationale" – always on the basis of sound engineering – that explains why a risk should be accepted (rather than redesigning the parts that pose the risk), the hazard is officially classified as an acceptable risk. The infamous O-rings that caused the failure of the Challenger shuttle had been subjected to this acceptable risk procedure (Vaughan, 1996).

The Rogers Commission considered this idea of "acceptable risk" essentially unacceptable.2 It failed to recognize, however, that this idea was not born from recklessness (as both commissions suggest). The idea was firmly entrenched in the NASA culture, because it had proved its worth in the Apollo years. It had kept the NASA engineers to a schedule – if the original philosophy of endless testing had been maintained, it is likely that the Apollo missions would never have flown.3 Moreover, it had not resulted in an unacceptable number of deadly disasters – as judged by political stakeholders.

The foam (and the associated tile) problem, which would ultimately cause the demise of Columbia, had been classified as an acceptable risk (CAIB, 2003:123). It was considered a dangerous problem in the early days of the shuttle program. The first shuttle flight – which was carried out by the Columbia in 1981 – necessitated the replacement of 300 tiles. The foam-and-tile issue thus qualifies as one of the oldest in the shuttle catalogue. It was dealt with in the way described above: study the problem, determine the cause, fix the problem and fly the shuttle again. After well over a hundred flights, the NASA engineers thought they understood the problem and deemed the risk acceptable. In hindsight, of course, this assumption proved incorrect.

Both commissions criticized NASA for expanding its risk definition. To NASA engineers, however, there was no feasible alternative. Since it is impossible to prove that a spacecraft will fly safely, safety can only be achieved through careful and controlled experimentation. The implied risk of failure – a "normal" accident – had been accepted since the Apollo years. The accepted risk procedure aimed to minimize, not eliminate, risk on the basis of sound engineering insights. While both commissions paid lip service to the inherently risky nature of space flight, they ignored the practical implications of this insight.

2 Vaughan (1996) suggests that the commission members did not fully grasp the process behind the terminology.

3 A popular saying in NASA during the Apollo era held that "better is the enemy of the good" (Murray & Cox, 1989:175). The search for perfectly "proven" solutions would lead to endless alterations of the design; each alteration to an existing design, in turn, introduces new and unforeseen risks (Vaughan, 1996:116).
4.3. Deciding on emerging risks: The flight readiness review

Each shuttle flight is preceded by a so-called flight readiness review (FRR). This formal review procedure is a bottom-up process designed to identify risks and bring them to the attention of the higher management levels. Because it is impossible and undesirable that top-level administrators review all possible risks and anomalies, the FRR aims to filter out the critical anomalies for senior management review. The review process is funneled through four levels of discussion (from the work floor to headquarters) at which engineers from the involved centers and contractors must agree that "the shuttle is ready to fly and to fly safely" (Vaughan, 1996:82).

The FRR has two basic characteristics, both of which can be traced to the Apollo years. The first characteristic is the openness of the process (Logsdon, 1999). The FRR provides all NASA engineers with at least one opportunity to voice any concerns they may have with regard to flight safety (Vaughan, 1996:89, 256). As one project manager at Marshall told Vaughan (1996:241): "it is inconceivable, impossible, that any one individual or group of individuals would decide to keep a problem from the normal [FRR] process. Too many people are involved. It would take a conspiracy of mammoth proportions." The House Committee on Science and Technology concurred: "The procedure appears to be exceptionally thorough and the scope of the issues that are addressed at the FRRs is sufficient to surface any problems that the contractors of NASA management deem appropriate to surface" (Vaughan, 1996:240).

Second, all discussions are held on the basis of engineering logic; every flight risk and anomaly is assessed against the laws of physics and engineering. Managerial considerations are subjugated to "scientific positivism" – Vaughan (1996:89) speaks of "dispute resolution by numbers." In the FRRs there is no room for "gut feeling" or "observations"; only solid engineering data are admissible (Vaughan, 1996:192).

4.4. NASA's blind spot: Critical decision-making and intuition

Both commissions discovered that NASA engineers had voiced concerns which, if heeded, could have prevented the Challenger and Columbia disasters. On the eve of the Challenger launch, Thiokol engineers had raised doubts with regard to the safety of the O-rings in cold weather. The CAIB report describes how NASA ultimately dismissed the concerns of engineers who suspected that the foam damage was more serious than NASA managers seemed to think. In both cases, NASA did not recklessly ignore warnings (as both commissions allege), but abided by its safety system (the risk procedure and the FRR). But the safety system was not perfect, as we now know. A crucial vulnerability in NASA's safety system – which remained unaddressed and unresolved in the wake of Challenger – eventually sealed the fate of Columbia: NASA had no proper procedures to identify and properly weigh signals of doubt from respected engineers that were not substantiated by engineering data (cf. Dunbar & Garud, 2005).

The foam-caused damage to Columbia was not discovered until day 2 of the mission, after the Intercenter Photo Working Group had studied the film of the launch. The Photo Group formed a Debris Assessment Team (DAT), which was to consider whether the damage would pose a safety issue. The photo material showing the foam hit was widely disseminated throughout NASA and its contractors by e-mail. Both the media and the astronauts on board Columbia knew of the problem. As a result, two types of discussion emerged: the formally bounded discussion in the DAT and the grapevine-type discussion later unearthed by the CAIB. In hindsight, we can clearly see the problem: NASA did not know how to deal with ambiguously communicated "gut feelings" of its own engineers.

Initial assessments that circulated between NASA engineers and contractors did not provide any cause for alarm and "may have contributed to a mindset that [the foam hit] was not a concern" (CAIB, 2003:141). Boeing, for instance, used a software tool ("Crater") to assess potential damage, but the outcome did not give rise to concern.4 The DAT members suspected that damage could have occurred and wanted on-orbit images of Columbia's left wing.
The first informal meeting (on flight day #5) of the DAT resulted in a formal agenda point (for the next day's meeting) discussing a request for on-orbit imaging – no sense of urgency apparently existed at this time. Mission Control was under the impression that the foam strike fell within the experience base and waited for additional information to emerge from the DAT (CAIB, 2003:146). This impression was confirmed by an e-mail from Calvin Schomburg – whom Shuttle Program managers considered an expert on the matter – stating that the hit "should not be a problem" (CAIB, 2003:149). The managerial decision not to pursue imagery created a sense of unease with at least some DAT members, but the lack of hard data – a consequence of having no imagery – made it nearly impossible to jumpstart the discussion.

On day #9 of the flight, the DAT presented its findings to a representative of Mission Control. The DAT engineers "ultimately concluded that their analysis, limited as it was, did not show that a safety-of-flight issue existed" (CAIB, 2003:160).


As a senior engineer wrote to his colleagues two days later: "I believe we left [the shuttle manager] the impression that engineering assessments and cases were all finished and we could state with finality no safety of flight issues or questions remaining" (CAIB, 2003:163).5 The CAIB quite rightly pointed out that many uncertainties were noted in this presentation, but, as we have seen above, NASA culture did not allow for "feelings" and "observations." The DAT never made a convincing case (in terms of engineering logic) that a safety-of-flight issue existed, nor could it explain why the mission schedule would have to be altered in order to get imagery of the shuttle's wing. Moreover, the shuttle managers assumed there was nothing that could be done at this point.6

The CAIB suggests that doubts of respected engineers should give rise to experiments and testing, until safety can be proven. During the Apollo years, however, NASA had learned that this does not work: engineers will tinker, test and experiment forever (for they know that they can never prove the safety of an experimental spacecraft). The system in place had served NASA well: no astronauts had been lost in space until the Challenger explosion. It was this institutional logic and disciplined adherence to proven safety systems, not the loss of a safety culture, that allowed the Challenger and the Columbia to take off.

4 The CAIB discovered that the Crater software was, in effect, not designed to perform this type of analysis, nor were the Boeing engineers performing the analysis sufficiently qualified, which undermines the validity of their findings. The question is whether NASA could or should have known this.

5 By the time he wrote this, Alan Rocha had become convinced himself (as a result of Boeing's analysis) that no real risk existed.

6 Here the CAIB makes an interesting jump in its own rationale. Operating on a "what ... if?" basis, the CAIB asked NASA to provide rescue scenarios had NASA appreciated the level of damage at day #3 (which seems nearly impossible even in hindsight). NASA indicated that it would have been possible to scramble another shuttle (Atlantis) for departure, and launch it in time to rescue the crew. NASA notes that such a mission would have been very risky (most FRRs would have been bypassed, for instance), but the CAIB (2003:174) seemed to prefer this type of double jeopardy, noting that rescue was "challenging but feasible," thus undermining its own safety philosophy expounded in the report.

5. Conclusion: Reconsidering judgments of failure

Making use of social science insights, the CAIB report paints a bleak picture of an organization that cuts safety corners, ignores clear-cut warnings, suppresses whistle-blowing engineers, and does everything it can to beat irresponsible deadlines. The consequences of the CAIB report have been serious for NASA. Not only did the report undermine the legitimacy of NASA and unleash an avalanche of reform proposals, "it questioned the very essence of what the NASA family holds so dear: our 'can-do' attitude and the pride we take in skills to achieve those things once unimagined" (O'Keefe, 2005:xviii).

This article makes two points. First, it shows that the theory on which CAIB leans so heavily does not support the Board's conclusions. The Board applied a highly selective and rather simplified version of HRT to assess NASA's safety system (cf. Boin & Schulman, 2008). Second, this article demonstrates that empirical analysis informed by both NAT and HRT leads to a more ambiguous explanation of the shuttle disasters.

In our assessment of NASA's safety system, we begin by considering the extraordinary challenges of space exploration. NASA has to design spacecraft that can safely transport humans into space and back to earth. This must be accomplished in an unforgiving environment (small glitches tend to have deadly consequences), with tight schedules, limited resources and fluctuating legitimacy. It follows that accidents have to be expected (Perrow, 1984). NASA has achieved many spectacular successes, but it has also suffered a few stunning disasters.

NASA had in place a safety system to prevent these failures from happening. The core elements of this system were the institutional residues of the successes and failures of the Apollo years. To suggest that NASA had 'lost its safety culture' since Apollo is therefore misleading. The essence of NASA's culture had, in fact, not changed.

NASA's safety culture was not perfect. It had a blind spot that – in hindsight – appears to have played a crucial role in the pre-disaster phases of both Challenger and Columbia. In the institutionalization of its safety culture, NASA seems to have lost some of its ability to recognize significant emerging events. This ability is informed by the intuition of engineers who are intimately familiar with their designs and who "feel" that something is going wrong. NASA culture has always scoffed at making judgments based on soft data, but the organization also used to have "institutional recalcitrants" who could do just that: so-called "intuitive engineers" who were respected as brilliant and whose judgments would be heard (Murray & Cox, 1989).

The institutionalized system of safety engineering in NASA has captured many good lessons from past failures, but it does not prescribe an effective way to deal with "emerging indeterminacies" (Dunbar & Garud, 2005). This is where the system may have failed NASA: it did not recognize the deep uncertainty that cannot be captured or explained by sound engineering logic. This system is bound to "miss signals."



The core challenge is to make structural room for indeterminacies, to value the "disconnects" in risky, rational systems (Leveson et al., 2004:17). How this should be done, however, remains a mystery.

The causes of both shuttle disasters had little to do with fiefdoms, schedules or negligence – causes cited by both the Rogers Commission and the CAIB. As both commissions emphasized the wrong causes, the real vulnerabilities remain unaddressed. The recommendations of both commissions did little to solve the weakness of the safety system (without sacrificing its strength). They did force NASA into energy-consuming efforts to "become a learning organization" and work on its "leadership problem." It remains painfully vague how NASA should balance budgets, schedules and safety (which remains the core challenge). The CAIB report may thus undermine a system that has produced great feats without offering a better alternative that works.7

Using unproven theories to arrive at absolute verdicts may not be without adverse consequences. In a politicized environment, official assessments of crisis management performance – delivered by prominent inquiries – carry great weight. Their conclusions and prescriptions become milestones for progress. Whether they are sound, feasible and without unintended consequences usually does not concern public inquiry committees and the politicians that heed their advice.

All this suggests that a degree of caution should be employed in assessments of failure. The most cited theories are still unproven and hindsight biases tend to cloud theoretically informed analysis. Perhaps more importantly, the findings of this paper suggest that we rethink our expectations with regard to truth-seeking inquiries. The truth may be harder to find than our theoretical work promises it to be.

7 In a decade marked by a 40% budget reduction, workforce reductions, forced privatization of selected NASA tasks, and considerable political turmoil, NASA safely flew a large number of successful missions (cf. McCurdy, 2001). Interestingly, the CAIB (2003:184) does note that "the investigation revealed that in most cases the Human Space Flight Program is extremely aggressive in reducing threats to safety." The CAIB (2003:198) further observes that "Flight Readiness Reviews had performed exactly as they were designed, but they could not be expected to replace engineering analysis."

References

Baer, M., Heron, K., Morton, O., & Ratliff, E. (2005). Safe: The race to protect ourselves in a newly dangerous world. New York: HarperCollins.
Boin, R. A., McConnell, A., & 't Hart, P. (Eds.). (2008). Governing after crisis: The politics of investigation, accountability and learning. Cambridge: Cambridge University Press.
Boin, R. A., & Rattray, W. A. R. (2004). Understanding prison riots: Towards a threshold theory. Punishment & Society, 6(1), 47–65.
Boin, R. A., & Schulman, P. (2008). Assessing NASA's safety culture: The limits and possibilities of High Reliability Theory. Public Administration Review, 68(6), 1050–1062.
Brunk, G. G. (2001). Self-organized criticality: A new theory of political behavior and some of its implications. British Journal of Political Science, 31, 427–445.
Columbia Accident Investigation Board. (2003). Columbia accident investigation report. Burlington, Ontario: Apogee Books.
Dunbar, R., & Garud, R. (2005). Data indeterminacy: One NASA, two modes. In W. Starbuck & M. Farjoun (Eds.), Learning from the Columbia accident (pp. 202–219). Oxford: Blackwell.
Egan, M. J. (2011). The normative dimensions of institutional stewardship: High reliability, institutional constancy, public trust and confidence. Journal of Contingencies and Crisis Management, 19(1), 51–58.
Johnson, S. B. (2002). The secret of Apollo: Systems management in American and European space programs. Baltimore: Johns Hopkins University Press.
LaPorte, T. R. (1975). Organized social complexity: Explication of a concept. In T. R. LaPorte (Ed.), Organized social complexity: Challenge to politics and policy (pp. 3–39). Princeton: Princeton University Press.
La Porte, T. R. (1994). A strawman speaks up: Comments on the limits of safety. Journal of Contingencies and Crisis Management, 2(4), 207–211.
La Porte, T. R. (1996). High reliability organizations: Unlikely, demanding and at risk. Journal of Contingencies and Crisis Management, 4(2), 60–71.
LaPorte, T. R. (2011). On vectors and retrospection: Reflections on understanding public organizations. Journal of Contingencies and Crisis Management, 19(1), 59–64.
LaPorte, T. R., & Rochlin, G. (1994). A rejoinder to Perrow. Journal of Contingencies and Crisis Management, 2(4), 221–228.
Leveson, N., Cutcher-Gershenfeld, J., Barrett, B., Brown, A., Carroll, J., & Dulac, N. (2004). Effectively addressing NASA's organizational and safety culture: Insights from systems safety and engineering analysis. Paper presented at the Engineering Systems Division symposium.
Logsdon, J. M. (moderator) (1999). Managing the moon program: Lessons learned from Project Apollo. Proceedings of an oral history workshop, conducted July 21, 1989. Monographs in Aerospace History, Number 14, NASA, Washington, DC.
Mahler, J. (2009). Organizational learning at NASA: The Challenger and Columbia accidents. Washington, DC: Georgetown University Press.
Masuch, M. (1985). Vicious circles in organizations. Administrative Science Quarterly, 30(1), 14–33.



McCurdy, H. E. (2001). Faster, better, cheaper: Low-cost innovation in the U.S. space program. Baltimore: Johns Hopkins University Press.
McDonald, H. (2005). Observations on the Columbia accident. In W. H. Starbuck & M. Farjoun (Eds.), Organization at the limit: Lessons from the Columbia disaster (pp. 336–346). Oxford: Blackwell Publishing.
Miller, D. (1988). Organizational pathology and industrial crisis. Industrial Crisis Quarterly, 2, 65–74.
Mitroff, I. I., Pauchant, T., & Shrivastava, P. (1988). The structure of man-made organizational crises: Conceptual and empirical issues in the development of a general theory of crisis management. Technological Forecasting & Social Change, 33(2), 83–108.
Murray, C., & Cox, C. B. (1989). Apollo: The race to the moon. New York: Simon and Schuster.
Nutt, P. C. (2004). Organizational de-development. Journal of Management Studies, 41(7), 1083–1103.
O'Keefe, S. (2005). Preface. In W. H. Starbuck & M. Farjoun (Eds.), Organization at the limit: Lessons from the Columbia disaster (pp. xvii–xix). Oxford: Blackwell Publishing.
Perrow, C. (1986). Complex organizations: A critical essay. New York: McGraw-Hill.
Perrow, C. (1994). The limits of safety: The enhancement of a theory of accidents. Journal of Contingencies and Crisis Management, 2(4), 212–220.
Perrow, C. (1999). Normal accidents: Living with high-risk technologies. Princeton: Princeton University Press.
Pfeffer, J. (1978). Organizational design. Arlington Heights, IL: AHM.
Presidential Commission on the Space Shuttle Challenger Accident. (1986). Report to the President by the Presidential Commission on the space shuttle Challenger accident. Washington, DC: Government Printing Office.
Reason, J. (1990). Human error. New York: Cambridge University Press.
Rochlin, G. I. (1996). Reliable organizations: Present research and future directions. Journal of Contingencies and Crisis Management, 4(2), 55–59.
Rochlin, G. I. (2011). How to hunt a very reliable organization. Journal of Contingencies and Crisis Management, 19(1), 14–20.
Rosenthal, U. (1998). Future disasters, future definitions. In E. L. Quarantelli (Ed.), What is a disaster? Perspectives on the question (pp. 146–159). London: Routledge.
Sagan, S. D. (1993). The limits of safety: Organizations, accidents, and nuclear weapons. Princeton: Princeton University Press.
Schein, E. H. (1985). Organizational culture and leadership. San Francisco: Jossey-Bass.
Schulman, P. R. (1989). The 'logic' of organizational irrationality. Administration & Society, 21(1), 31–53.
Selznick, P. (1957). Leadership in administration: A sociological interpretation. Berkeley: University of California Press.
Shrivastava, P., Mitroff, I., Miller, D., & Miglani, A. (1988). The anatomy of industrial crisis. Journal of Management Studies, 25, 285–304.
Smith, D. (2005). Dancing around the mysterious forces of chaos: Exploring issues of complexity, knowledge and the management of uncertainty. Clinician in Management, 13(3–4), 115–123.
Smith, D. (2006). The crisis of management: Managing ahead of the curve. In D. Smith & D. Elliott (Eds.), Key readings in crisis management: Systems and structures for prevention and recovery (pp. 301–317). London: Routledge.
Smith, D., & Elliot, D. (Eds.). (2006). Key readings in crisis management: Systems and structures for prevention and recovery. London: Routledge.
Turner, B. A. (1976). The organizational and interorganizational development of disasters. Administrative Science Quarterly, 21(3), 378–397.
Turner, B. A., & Pidgeon, N. F. (1997). Man-made disasters (2nd ed.). Oxford: Heinemann.
Vaughan, D. (1996). The Challenger launch decision: Risky technology, culture and deviance at NASA. Chicago: University of Chicago Press.
Weick, K. E., & Sutcliffe, K. M. (2001). Managing the unexpected: Assuring high performance in an age of complexity. San Francisco: Jossey-Bass.
Wildavsky, A. (1988). Searching for safety. New Brunswick: Transaction.
Woods, D. (2005). Creating foresight: Lessons from enhancing resilience from Columbia. In W. Starbuck & M. Farjoun (Eds.), Organization at the limit: Lessons from the Columbia disaster (pp. 289–308). Oxford: Blackwell Publishing.