I will never forget watching the TV news on January 28, 1986. That was the day the space shuttle Challenger broke apart only 73 seconds after launch, killing all seven crew members.

I was reminded of that day on February 1, 2003, when the space shuttle Columbia broke apart over Texas during re-entry, again killing all seven crew members.

Although these catastrophic events took place more than a decade apart, they weren't isolated incidents. Rather, these tragedies pointed to serious issues within NASA (see When Safety Leadership Fails: Lessons Learned from Major Disasters for related reading).

Everyone in the organization was asking serious questions, and one of the main tools they used to uncover the answers was a Fault Tree Analysis (FTA). This tool was developed in 1962 at Bell Laboratories, originally to evaluate a U.S. Air Force missile launch control system, and was soon adopted across the aerospace industry. It has since proven useful for risk analysis in many other industries. An FTA can be used to capture the contributing factors for any identified undesirable event.

This ability to flush out risk by reviewing a chain of causal events is what makes Fault Tree Analysis such a powerful tool for driving safety. It should be in everyone's risk assessment toolkit.

What Is a Fault Tree Analysis?

Let's break down this useful tool and have a look at how it works.

A Fault Tree Analysis is a diagram that maps out all the contributing factors that could lead to an undesired event. At the very top of the diagram is the undesirable event, such as the brakes failing on a car. Under the event, you would list any factors that could lead to the car's brakes failing, such as:

  • Faulty master cylinder
  • Low brake fluid
  • Worn brake pads

For each of those potential factors, you would then list anything that could lead to it. For example, low brake fluid might be caused by:

  • Broken pipe
  • Leaking cylinder
  • Loose bleed screw

Once you draw all of these out, you'll have a representation of the event and its contributing factors.
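As a rough illustration, the brake example above can be written down as a small data structure. This is only a sketch in Python: the Event class and the outline helper are names made up for this example, not part of any FTA standard or library.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Event:
    """One node in a fault tree: an event plus the factors that could lead to it."""
    name: str
    causes: List["Event"] = field(default_factory=list)


def outline(event: Event, depth: int = 0) -> List[str]:
    """Return the tree as indented lines, with the top event first."""
    lines = ["  " * depth + event.name]
    for cause in event.causes:
        lines.extend(outline(cause, depth + 1))
    return lines


# The undesirable event sits at the top; contributing factors hang beneath it.
brakes_fail = Event("Brakes fail", [
    Event("Faulty master cylinder"),
    Event("Low brake fluid", [
        Event("Broken pipe"),
        Event("Leaking cylinder"),
        Event("Loose bleed screw"),
    ]),
    Event("Worn brake pads"),
])

print("\n".join(outline(brakes_fail)))
```

Printing the outline gives a text version of the diagram, with each level of indentation representing one step further down the tree.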

To simplify things and make the diagrams easier to read, FTAs use a standard set of icons to represent various gates and events.

The two most commonly used gates are "or" and "and". The "or" gate is used when any one of the factors beneath it could cause the failure by itself, while the "and" gate is used when all of the factors beneath it must occur together to cause the failure. In the case of our braking failure, low brake fluid, a faulty master cylinder, and worn brake pads would all be connected through an "or" gate because any of them on its own could cause the brakes to fail.
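The logic of the two gates can be sketched in a few lines of Python. The function names below are invented for this illustration, and the redundant seal is a hypothetical case added only to show an "and" gate, since the brake example uses "or" gates throughout.

```python
def or_gate(*inputs: bool) -> bool:
    """An "or" gate fires if at least one input event occurs."""
    return any(inputs)


def and_gate(*inputs: bool) -> bool:
    """An "and" gate fires only if every input event occurs."""
    return all(inputs)


def brakes_fail(faulty_master_cylinder: bool,
                low_brake_fluid: bool,
                worn_brake_pads: bool) -> bool:
    # Any one of the three factors is enough, so they feed an "or" gate.
    return or_gate(faulty_master_cylinder, low_brake_fluid, worn_brake_pads)


def redundant_seal_leaks(primary_seal_fails: bool,
                         backup_seal_fails: bool) -> bool:
    # Hypothetical example: the joint leaks only if both seals fail.
    return and_gate(primary_seal_fails, backup_seal_fails)


print(brakes_fail(False, True, False))    # True: low fluid alone is enough
print(redundant_seal_leaks(True, False))  # False: the backup seal still holds
```

The practical difference is that an "and" gate represents redundancy. Every input has to fail before the event above it occurs.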

Using Fault Tree Analysis to Improve Safety

Most people do not realize the lengths engineers go to in trying to ensure a safe design. From creating manufacturing equipment to assembling something as complex as a space shuttle, safety has to be factored into every single step of the design. For each component and all systems involved, risk assessments are performed with the aid of a Failure Mode and Effects Analysis (FMEA) and an FTA. Together, these provide a structured approach to analyzing every component and system that could lead to a failure.

The next time you’re flying in a commercial jet, take a moment to be grateful this process was used to ensure the safety of the jet design and to identify potential problems before they could occur in real life.

Fault Tree Analysis is used to flush out potential system failures in advance, with the goal of eliminating potential failures altogether. It enables a proactive approach to safety, right from the design phase. As more data is gained from testing or product history, you can assign a probability to each event and use those figures to estimate how likely a failure is and how reliable a particular design will be.
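To make the idea of attaching numbers to events more concrete, here is a minimal sketch of how probabilities could be combined through "or" and "and" gates. It assumes the input events are independent of one another, which is a simplification, and every probability in it is made up for illustration rather than taken from real failure data.

```python
from math import prod


def or_probability(probabilities):
    """P(at least one input occurs), assuming independent inputs."""
    return 1.0 - prod(1.0 - p for p in probabilities)


def and_probability(probabilities):
    """P(all inputs occur), assuming independent inputs."""
    return prod(probabilities)


# Made-up probabilities for illustration only (not real failure data).
# Low brake fluid: broken pipe, leaking cylinder, loose bleed screw.
low_brake_fluid = or_probability([0.010, 0.020, 0.005])

# Brakes fail: faulty master cylinder, low brake fluid, worn brake pads.
brakes_fail = or_probability([0.001, low_brake_fluid, 0.030])

print(f"P(low brake fluid) = {low_brake_fluid:.4f}")
print(f"P(brakes fail)     = {brakes_fail:.4f}")
```

Working from the bottom of the tree upward, each gate turns the probabilities of its inputs into the probability of the event above it, until you reach the top event.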

A Fault Tree Analysis can be used effectively for many different potential hazards, from missile guidance system failures to cyberattacks. As you work your way down the fault tree, continually ask yourself "How can this fail?" Once you've answered this question enough times, you'll have a tree that can be read from bottom to top as a step-by-step account of how each hazard comes about. Of course, you won't be using it as a how-to guide for creating hazards, but as a diagram to help you understand them.
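Reading a tree from bottom to top can be sketched as listing every chain that leads from a root cause up to the top event. The example below reuses the brake failure tree and assumes every gate is an "or" gate, so each root cause forms a complete path on its own. The dictionary layout and the failure_paths function are illustrative choices, not a standard representation.

```python
# Each event maps to the factors directly beneath it in the tree.
# Events that do not appear as keys are root causes.
tree = {
    "Brakes fail": ["Faulty master cylinder", "Low brake fluid", "Worn brake pads"],
    "Low brake fluid": ["Broken pipe", "Leaking cylinder", "Loose bleed screw"],
}


def failure_paths(tree, event):
    """Return every path from a root cause up to `event`, bottom first."""
    causes = tree.get(event, [])
    if not causes:
        return [[event]]  # a root cause is the start of its own path
    paths = []
    for cause in causes:
        for path in failure_paths(tree, cause):
            paths.append(path + [event])
    return paths


for path in failure_paths(tree, "Brakes fail"):
    print(" -> ".join(path))
```

Each printed line is one answer to "How can this fail?", starting at a root cause and ending at the top event. A tree containing "and" gates would need paths that combine several root causes, which this sketch does not attempt.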

With enough data from past failures, you can predict the probability of a failure based on time and conditions. By going through the diagram, you'll be able to find potential problems before they occur and put control measures in place to prevent them from happening.
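As a sketch of what a time-based prediction might look like, the example below estimates a failure rate from past history and converts it into a probability of failure over a given period. It assumes a constant failure rate (an exponential model), which is a common simplification rather than something every component follows. The figures are invented for illustration.

```python
import math


def failure_rate(failures: int, operating_hours: float) -> float:
    """Estimate failures per hour from past history."""
    return failures / operating_hours


def probability_of_failure(rate: float, hours: float) -> float:
    """P(at least one failure within `hours`), assuming a constant rate."""
    return 1.0 - math.exp(-rate * hours)


# Invented history for illustration: 4 failures over 200,000 operating hours.
rate = failure_rate(4, 200_000)

for hours in (1_000, 10_000, 50_000):
    print(f"{hours:>6} h: {probability_of_failure(rate, hours):.3f}")
```

Different conditions, such as cold weather or heavy loads, could be handled the same way by estimating a separate rate from the failures recorded under each condition.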

Space Shuttle Challenger Disaster

The space shuttle Challenger broke apart 73 seconds after launch. The investigation found that the right solid rocket booster separated, causing damage to the external tank. This led to the destruction of the shuttle by aerodynamic forces.

From there, investigators needed a Fault Tree Analysis to work out why.

The top item on the fault tree is the solid rocket booster separation. Working down the tree, the cause of the separation was an O-ring joint failure. The O-rings sealed a joint between two segments of the right solid rocket booster. Both the primary and secondary O-rings failed, allowing hot gases and flames to escape and make contact with the external tank, resulting in a structural failure.

Two main factors were uncovered:

  • Technical – The O-ring joint had already been identified as inadequate and a new design was underway. Previous flights had shown that O-ring erosion had taken place, making the secondary O-ring useless.
  • Organizational – Cold temperatures on the morning of the launch had engineers concerned. Ice had formed on the shuttle and it wasn't clear whether the O-rings would perform well in the cold weather. NASA management decided the risk was acceptable and gave the go-ahead for the launch.

This horrible disaster could have been avoided. And with a judicious use of Fault Tree Analyses, future disasters just might be (learn Lessons from 3 of the Worst Workplace Disasters).

Make FTA Part of Your Toolkit

An FTA works best in the design phase. It can help you identify potential risks and implement control measures before there's a chance of any real-life consequences.

It should also be used after an accident or undesirable event to help identify all the contributing factors that led to it.

Take time to familiarize yourself with this method and make use of it when appropriate. If you do, you'll have another structured approach to risk analysis in your toolkit.