Root Cause Analysis

Despite our best efforts, from time to time an application we have built will crash. When that happens, we do our best to restore the broken functionality (or, in the worst case, the entire application) as soon as possible. However, a broken feature is only a symptom of a broader problem. If we focus only on fighting symptoms, we will almost certainly face the same problem again sooner or later. Therefore, once the problem at hand is fixed, we should take the time to uncover the real reason it occurred.

Root Cause Analysis (RCA) is a popular technique for detecting the true reason, the root cause, behind an undesired event.

But before we start looking for root causes, we need to define what a root cause is. There are several definitions, but I like this one the most (from “Root Cause Analysis”, ASQ Quality Press, 1993):

Root cause is that most basic reason for an undesirable condition or problem which, if eliminated or corrected, would have prevented it from existing or occurring.

While this definition holds, we need to be aware that it is usually not as simple as a root cause directly creating a problem. In real life, there is a chain of events which starts with a root cause and ends with the observable problem. A simple example of such a chain could look like this:

  • the website was down (observable problem),
  • the website was down because the timeout limit was exceeded,
  • the system was not able to load all the components needed to generate the website,
  • the payment component took longer to load than the timeout limit for generating the website (root cause).
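
The chain above can be sketched in code as a linked sequence of events, where each event points at the one that caused it. This is only an illustration: the `Event` class, the `caused_by` field and the `trace_to_root` function are hypothetical names made up for this example, not part of any RCA tool.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Event:
    """One link in the chain; `caused_by` points at the event that triggered it."""
    description: str
    caused_by: Optional["Event"] = None


def trace_to_root(problem: Event) -> list:
    """Follow `caused_by` links from the observable problem down to the root cause."""
    chain = []
    current = problem
    while current is not None:
        chain.append(current.description)
        current = current.caused_by
    return chain


# The example chain from the list above, built from the root cause upwards.
slow_payment = Event("payment component took longer to load than the timeout limit")
missing_components = Event("system could not load all components", caused_by=slow_payment)
timeout = Event("timeout limit exceeded", caused_by=missing_components)
site_down = Event("website was down", caused_by=timeout)

chain = trace_to_root(site_down)
for step in chain:
    print(step)
```

The first printed line is the observable problem and the last one is the root cause, so reading the output top to bottom mirrors asking "why?" at each step.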

Moreover, a single problem may have more than one root cause. It might happen that the problem occurred only because two (or more) events happened at the same time. Seriously? Let’s take a look:

  • the website was down (observable problem),
  • the website was down because the timeout limit was exceeded,
  • available resources (RAM and/or CPU) were used to their limits because:
    • there were more requests than usual for this website (root cause), and
    • the automated system backup started at the same time (root cause).

In this example, the website would not have gone down if only one of the root causes had occurred. There would still have been enough resources to generate the website within the timeout limit despite more requests than usual. And there would still have been enough resources to serve standard traffic and run a backup at the same time. However, the combination of those two events at the same time had severe consequences.
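
A small sketch can make the arithmetic behind this example concrete. All numbers below are invented for illustration (think of them as a percentage of available resources), and `site_is_down` is a hypothetical helper, not a real monitoring call.

```python
# Made-up numbers: units are "percent of available RAM/CPU".
CAPACITY = 100
NORMAL_TRAFFIC = 50
EXTRA_TRAFFIC = 30  # more requests than usual
BACKUP = 40         # automated system backup


def site_is_down(extra_traffic: bool, backup_running: bool) -> bool:
    """The site goes down when the total load exceeds the available capacity."""
    load = NORMAL_TRAFFIC
    if extra_traffic:
        load += EXTRA_TRAFFIC
    if backup_running:
        load += BACKUP
    return load > CAPACITY


# Try every combination of the two events.
for extra in (False, True):
    for backup in (False, True):
        print("extra traffic:", extra, "| backup:", backup,
              "| down:", site_is_down(extra, backup))
```

With these assumed numbers, each event alone stays within capacity (80 and 90), and only the combination (120) takes the site down.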

So how do we get to the bottom of our problem? How do we find out what its root cause(s) were? Let’s go.

Identify the problem & collect data

First, determine what has happened. Start with the observed problem which caught your attention. Collect all the information which allows you to prove that it is unexpected behaviour. Also, gather information which is going to be helpful in the further investigation: (1) what the current behaviour is, (2) what the expected behaviour is, (3) how long the problem has existed and (4) what the impact is if the issue is not fixed.

Identify possible cause factors

Based on the collected information, define the sequence of events that resulted in the observed problem. Consider all possible conditions which might have led to the current situation.

Identify root causes

From the defined cause factors, select the subset of those which will always lead to the observed problem under the specified conditions.
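
One way to sketch this selection is a counterfactual check, in line with the definition quoted earlier: a factor counts as a root cause if removing it alone would have prevented the problem. The example below is a rough illustration under stated assumptions: `problem_occurs` is a hypothetical model of the backup scenario, and the factor names (including the unrelated "new theme deployed") are invented.

```python
from typing import Callable, FrozenSet, List


def problem_occurs(factors: FrozenSet[str]) -> bool:
    """Hypothetical model: the site goes down only when both events coincide."""
    return {"traffic spike", "backup running"} <= factors


def root_causes(factors: FrozenSet[str],
                occurs: Callable[[FrozenSet[str]], bool]) -> List[str]:
    """Return the factors whose removal alone would have prevented the problem."""
    if not occurs(factors):
        return []  # nothing to explain: the problem does not occur at all
    return sorted(f for f in factors if not occurs(factors - {f}))


# Cause factors collected in the previous step (one of them is a red herring).
observed = frozenset({"traffic spike", "backup running", "new theme deployed"})
print(root_causes(observed, problem_occurs))
```

Under this model the call prints `['backup running', 'traffic spike']`: the theme deployment is filtered out because the problem would still have occurred without it. In a real investigation the `occurs` check is usually a reproduction attempt or an experiment rather than a function.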

Provide a recommendation

The last step is not to provide a fix, but a recommendation on how to tackle the problem. Fixing the issue might be one of the proposals. However, each proposal should consider the following aspects:

  • how expensive or time-consuming it is to get rid of the root cause
  • how many users are using this functionality/service
  • how many users or how much money we are going to lose if this is not fixed (including contractual penalties)
  • whether we are in breach of a contract or of legislation (do we have any legal responsibility)

Please note that a recommendation does not have to be a single proposal. In fact, it should be a set of options from which we, or the person (committee, company) who requested the RCA, can pick.
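
As a rough sketch, such a set of options can be written down as plain records that capture the aspects listed above, so that whoever decides can compare them side by side. Everything here is hypothetical: the `Option` class, its field names, the three options and all the figures are invented for illustration only.

```python
from dataclasses import dataclass


@dataclass
class Option:
    name: str
    effort_days: float     # how expensive or time-consuming the option is
    users_affected: int    # users still affected if this option is chosen
    expected_loss: float   # money still at risk, including contractual penalties
    legal_exposure: bool   # does a contractual or legal breach remain?


# Invented options for the backup example; none of the figures are real.
options = [
    Option("add server capacity", 5.0, 0, 0.0, False),
    Option("accept the risk", 0.0, 2000, 15000.0, True),
    Option("move backup to off-peak hours", 0.5, 0, 0.0, False),
]

# Present the cheapest options first; the final choice stays with the requester.
ranked = sorted(options, key=lambda option: option.effort_days)
for option in ranked:
    print(option.name, "| effort (days):", option.effort_days,
          "| expected loss:", option.expected_loss,
          "| legal exposure:", option.legal_exposure)
```

Sorting by effort is just one possible view; the same records could equally be ordered by expected loss or filtered by legal exposure.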

One final note. There might be a temptation to ignore the cause of the problem if the effort needed to fix the resulting problem is acceptably low. The temptation is even bigger if the problem appears very rarely. I might agree with this approach, but only under some conditions (the online shop is down and Black Friday is coming). However, to make a smart decision, and we should always make smart or educated decisions, we need to know the root causes (even if the RCA is done after the issue has been fixed). The analysis can serve as a lesson learned that helps us avoid similar problems in the future. It may even happen that by removing one root cause we eliminate many other issues (including some we are not aware of): problems which share the same root cause.