By George Westerman and Richard Hunter
Because IT risk is now business risk, with business consequences, enterprises must change the way they manage it.
A half century of adopting information technology at an astonishingly rapid rate has created a world in which IT is not just widely present but pervasively, complexly interconnected inside and outside the enterprise. As enterprises’ dependence and interdependence on IT have increased, the consequences of IT risk have increased as well. What is IT risk? It’s the potential for an unplanned event involving a failure or misuse of IT to threaten an enterprise objective—and it is no longer confined to a company’s IT department or data center. An IT risk incident has the potential to produce substantial business consequences that touch a wide range of stakeholders. In short, IT risk matters—now more than ever.
This change in the meaning and importance of IT risk has caught some executives unaware. Every executive at some time has experienced problems with his IT organization and systems, including delays and unexpected costs in development projects, temporary or extended loss of service, data loss or theft, processes made unnecessarily complex by systems interfaces and limitations, inaccurate information from redundant or “buggy” systems, and a myriad of other ills. Executives have generally learned to perceive—and even tolerate—such episodes as regrettably common but relatively limited in their impact on key business metrics. Case studies of companies like Tektronix and Comair, however, demonstrate how such perceptions no longer apply.

Comair, a $780 million subsidiary of Delta Air Lines, experienced a runaway IT risk incident on December 24, 2004, when the company’s crew-scheduling system failed. An airline’s crew-scheduling system is mission-critical. Federal Aviation Administration safety regulations limit the number of hours any aircrew member can work in a 24-hour period. The scheduling system is what ensures compliance with that strictly enforced regulation. Without its scheduling system, an airline does not fly.
Because of the holidays, December is always the busiest month for U.S. airlines. December 2004 was busier than normal because unusually bad weather forced airlines to cancel or reschedule many flights, including 91 percent of all flights between December 22 and December 24. No one at Comair knew that the crew-scheduling system (which had been purchased from an external vendor) was capable of handling a maximum of only 32,000 changes a month. At about 10 p.m. on Christmas Eve, when Comair entered one more flight change, exceeding the monthly capacity, the system abruptly stopped functioning.
Comair technicians realized soon after, to their dismay, that the system could not simply be restarted. The only solution was to reload the entire system from scratch as quickly as possible. The tech team accomplished that task and relaunched the system late on December 25, but by then Comair had problems assembling its widely dispersed crews and aircraft where they were needed. The airline didn’t resume normal operations until December 29.
As the company struggled to recover from the disaster, nearly 200,000 stranded Comair passengers helplessly roamed airport terminals throughout the United States. Airlines were fully booked for the holiday travel season, and there were few alternative flights. Throughout the Christmas holiday, camera crews from local and national television news outlets followed passengers through those terminals, broadcasting travelers’ and Comair’s distress to the American public.
Two weeks after the system failure, the U.S. Secretary of Transportation announced an investigation into the incident. A week later, the company’s president, Randy Rademacher, resigned. In addition to the damage to the company’s reputation, its management, and its customers, Comair’s revenue losses as a direct result of this incident are estimated at about $20 million. In other words, the loss from this single incident was nearly as high as the firm’s entire $25.7 million operating profit for the previous quarter.
The company had planned, and delayed, replacement of the scheduling system several times before it failed. Despite the outcome, these decisions could be defended as rational business decisions. The system had been running for years, and the likelihood of a complete system failure—especially one that resulted from an entirely unsuspected source—was apparently low. That the system failed at a point in time when its failure was maximally damaging to the company and its customers was extremely bad luck but hardly predictable.
But something more was involved than an unfortunate decision to defer an upgrade. Comair lacked a viable plan for the immediate recovery of this mission-critical business process. Its executives failed to plan for such a high-impact failure, however unlikely it seemed. When the software went down, there was no backup system that could be pressed into immediate service, no outsourcer on call and ready to step in, no plan that could keep the company running manually while the system was fixed.
In other words, it wasn’t just the computer system that failed; it was Comair’s process for understanding and managing the business consequences of IT risk. And making sure that an organization’s major corporate risks, IT or otherwise, are managed to an acceptable level is the responsibility of the organization’s senior executives. Perhaps that’s why it was the company’s president, not the CIO, who departed in the wake of the incident.
The Comair case is about the risk of availability. The Tektronix case is about agility: the ability to change rapidly with controlled cost and risk. In the mid-1990s, executives at the $1.8 billion electronics manufacturer learned that their plans to divest a major business unit had hit an unexpected snag. Key financial and manufacturing processes for three Tektronix business units were riddled with undocumented interdependencies between critical systems. Extracting one business’s systems from that tangled mass was like removing a load-bearing wall from a building: it couldn’t be done without major restructuring. Separating the unit would require duplicating nearly every one of Tektronix’s major systems (including the sensitive corporate data they contained), as well as finding technicians to maintain the systems. The difficulty of spinning out a division, with or without its IT systems, brought into focus IT agility risks that had been present for years.
Tektronix arrived at this strategic dilemma gradually. For decades the company’s IT department had extended existing systems, built new stand-alone systems, and written software to link systems as needed. Every new “solution” was an unconscious trade-off of long-term agility in favor of short-term benefits. The problems inherent in this approach weren’t immediately apparent to executives, but they compounded over time, just as it takes time for unplanned, uncontrolled growth in a city to visibly overload roads, schools, sewers, and support services.
By the early 1990s, Tektronix executives knew their IT systems had problems. Changes took much longer to implement than they should have, and far longer than executives would have liked. It was frustratingly difficult to get an integrated view of the company’s customers, products, and orders. Business managers complained that IT support was getting worse and worse, and IT managers knew that the systems were becoming more and more difficult to maintain. Extensive coordination by smart support staff covering for system inadequacies was so frequent that it produced a motto: “Five calls does it all.”