Whenever something really bad happens in this high-tech world--or whenever someone wants to make sure it doesn't--there's a good chance that before long, someone will be calling Nancy Leveson. Or wishing they had. And whether the issue involves anything from advanced spacecraft to pharmaceuticals to critical computer systems, chances are the MIT professor of aeronautics and astronautics and engineering systems will have a good answer.
Risk analysis for complex systems like nuclear power plants or space shuttles typically involves analyzing sequences of failures, figuring out every part of the system that might fail and what effects that might have, and putting all those pieces together--essentially, a bottom-up way of looking at things. Leveson, after nearly three decades of working on such problems, has revolutionized the field by developing a new top-down way of analyzing the risks of complex systems, which leads to a more integrated approach to managing the risks.
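For readers unfamiliar with that bottom-up style, the toy sketch below shows its flavor: assign each component a failure probability, then combine the pieces through AND/OR logic to estimate the chance of a top-level failure. It is not drawn from Leveson's work, and the component names and probabilities are invented for illustration.

```python
# Toy illustration of the traditional bottom-up style of risk analysis:
# give individual components failure probabilities, then combine them
# through AND/OR logic, as in a simple fault tree. All numbers are invented.

def p_and(*probs):
    """Probability that all independent events occur (AND gate)."""
    result = 1.0
    for p in probs:
        result *= p
    return result

def p_or(*probs):
    """Probability that at least one independent event occurs (OR gate)."""
    none_occur = 1.0
    for p in probs:
        none_occur *= (1.0 - p)
    return 1.0 - none_occur

# Hypothetical per-mission failure probabilities for individual components.
valve_sticks = 1e-4
sensor_fails = 5e-4
backup_sensor_fails = 5e-4
software_hangs = 1e-5

# Top event: loss of engine control. Modeled as the valve sticking, OR both
# sensors failing, OR the software hanging.
loss_of_control = p_or(valve_sticks,
                       p_and(sensor_fails, backup_sensor_fails),
                       software_hangs)
print(f"Estimated probability of top event: {loss_of_control:.2e}")
```

Even in this miniature version, the limitation Leveson points to is visible: the arithmetic only accounts for component failures it already knows about, and it says nothing about accidents that emerge from unsafe interactions among components that each behave exactly as designed.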
Her advice and analysis have been applied in recent years to helping a presidential commission understand how to prevent the communications failures that led to the space shuttle Columbia accident, helping pharmaceutical companies manage the risks of new drugs being introduced, and helping the Federal Aviation Administration assess new technology for air traffic control. She's also investigated how to make sure that a new missile defense system being developed by the U.S. would not be vulnerable to an accidental launch, and how to reduce the risks of corporate fraud that could damage the economy.
What tends to happen in complex, high-tech systems, Leveson has found, is not so much a random failure of one or two parts of a system, but rather a gradual drift over time from a safe operation to one where the safety margins have eroded and one small problem can throw everything out of kilter.
Sometimes, there is no failure in the traditional sense: Each part of the system did what it was supposed to do, but there was an underlying error in the overall design. That's what happened, for example, in the loss of NASA's Mars Polar Lander a few years ago, after the lander's engineers failed to inform colleagues working on its onboard software of a potentially dangerous source of "noise."
For Leveson, the turning point came in 2000, when she realized that "after about 20 years, nobody was making any progress" in figuring out how to manage the risks of complex systems. "Usually, that means there's something wrong with the underlying assumptions everybody is using."
She realized that the basic component-based approach to assessing risk had prevailed at least since World War II, and "it just didn't apply" to many of the highly computerized technological systems in operation today. "Accidents just occur differently. Risk has changed as the technology has changed." So she started developing her new approach, based on systems theory.
At first, she was afraid that nobody would take her radical new approach seriously. "I thought people would just think I was nuts," she says with a laugh. But when she started applying her new approach to specific cases, such as identifying the potential for inadvertent launch in the new missile defense system, it clearly worked: It identified significant hazardous scenarios that nobody had noticed otherwise.
"We tried it on extremely large, complex systems, and it worked much better than what people do now," she says. "I realized we could solve problems that weren't solvable before."
The new approach to analysis led to a whole new way of dealing with the risk management of complex, socio-technical systems. Instead of looking at the individual components and trying to minimize the chances that each would fail, "what you really want is to enforce safety constraints" on the behavior of the entire system, Leveson says.
"We used to build systems that were simple enough so that you could test everything, and test the interactions," she says. "Now, we're building systems so complex that we can't understand all the possible interactions." While traditional analysis assumes a linear, causal chain of events, accidents in complex systems often unfold through very nonlinear effects, feedback loops and so on.
Leveson calls her new approach STAMP, for System-Theoretic Accident Model and Processes. She has set up a company to apply the approach to a wide variety of systems in different fields, and is finishing a book on it that MIT Press will publish this fall. In the meantime, chapters of the book are available on Leveson's web site (http://sunnyday.mit.edu/book2.pdf).
"Nancy Leveson has developed a control-based modeling approach to systems safety which can be applied to complex networks of hardware and humans," says professor Jeffrey Hoffman, a colleague of Leveson's in MIT's Aeronautics and Astronautics Department. "Her work has elicited considerable interest inside NASA, where safety analysis has traditionally concentrated on the reliability of individual pieces of complex systems."
While NASA is using her new approach to analyze risks in the development of the Orion spacecraft that will replace the shuttle, and in developing a future robotic planetary probe, the Japanese space agency has gone even further: it sent two engineers to work in Leveson's lab for a couple of years to observe how she does her analysis, and it has been applying the lessons learned to its space systems while creating improved tools.
Though her work focuses on disasters, Leveson is upbeat about what she does. Using the old ways, she says, "it was discouraging to have something that only works in a small subset of cases." But with her new approach, she says, "it's very exciting to have something that actually works, and to be able to apply this in the social and organizational realm."
A version of this article appeared in MIT Tech Talk on May 21, 2008.