Continuous improvement is a fundamental aspect of our culture here at Kakbima. We recently launched our refactored insurance management platform to help us build better insurance for all industry players. This shift in our technology came with several changes to our Root Cause Analysis (RCA) process, which is intended to help us better understand why things go wrong and how we can improve our services and processes.

Our goal in the analysis is to identify the root cause of an incident, learn from it, find a solution, and work towards preventing the same incident in the future. An incident can be anything that affects our engineering, either internally or on the side of users interacting with our platform: downtime, the release of a partially tested big change, or any other unplanned or emergency event with negative consequences.

There are numerous methods for determining the root cause of an incident; at Kakbima we settled on the 5 Whys, a method popularized by Taiichi Ohno and used by the Toyota Motor Corporation to uncover the cause of manufacturing defects. The 5 Whys simply means repeating the question “Why?” until you can’t go any further and the root causes, if there is more than one, have been identified.

The focus should be on things that are not working, things that are not working as they should, and places where no working structure exists at all. The exact number of Whys is not important, and finding out who is at fault and pointing fingers is not part of the method. When working in a team, trust and openness are very important, as they allow team members to contribute freely to the analysis.

The root cause should be framed clearly so that it can be corrected by completing actions, which often include creating a new process. In a perfect world the actions would prevent the incident from happening again; in the real world that’s not always possible or practical. In that case, consider actions that help you identify the issue earlier so you can react more quickly.

An example

Incident: Users were experiencing slow page load times when getting their insurance policies

1. Why? Requests were hanging in the backend.
2. Why were the requests hanging? The database was experiencing a heavily increased load.
3. Why was the database experiencing a heavily increased load? We released a new feature that required the product details of the policy coverage from a new table, and the query was slow.
4. Why was the query slow? The products table was missing an index.

Actions:

Add an index to the table (Owner: Ben)
Do a dry run of the different table relationships that the query touches (Owner: Laura)
Check whether all the data being queried is actually needed; if not, stop querying what you don’t need (Owner: Nyambura)
Do a code review of the function(s) querying the data to check whether they are optimized well enough (Owner: Kent)
Review the deployment infrastructure to make sure there are no latency issues between the different users and the application. This may grow into a separate investigation. (Owner: Kaleesie)
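To make the "missing index" cause concrete, here is a minimal sketch that uses Python's built-in sqlite3 module as a stand-in database. The products table, its policy_id column, and the index name are assumptions made for illustration; they are not the actual Kakbima schema or database engine. The sketch asks SQLite for its query plan before and after the index is added.

```python
import sqlite3

# Hypothetical schema: table, column and index names are illustrative only.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE products (id INTEGER PRIMARY KEY, policy_id INTEGER, details TEXT)"
)
conn.executemany(
    "INSERT INTO products (policy_id, details) VALUES (?, ?)",
    [(i % 1000, f"coverage {i}") for i in range(10_000)],
)

QUERY = "SELECT details FROM products WHERE policy_id = ?"


def query_plan(connection, sql, params):
    """Return the human-readable detail column of SQLite's query plan."""
    rows = connection.execute("EXPLAIN QUERY PLAN " + sql, params).fetchall()
    # The detail text is the last column of every plan row.
    return " | ".join(row[-1] for row in rows)


# Without an index, SQLite has to scan the whole table for each lookup.
plan_before = query_plan(conn, QUERY, (42,))

# The first action from the analysis: add the missing index.
conn.execute("CREATE INDEX idx_products_policy_id ON products (policy_id)")

# With the index, the plan switches to an index search.
plan_after = query_plan(conn, QUERY, (42,))

print("before:", plan_before)
print("after: ", plan_after)
```

The exact wording of the plan varies between SQLite versions, but the first plan should report a table scan and the second should name the new index. Other databases offer a similar EXPLAIN facility with their own output format.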

Note: You shouldn’t stop when you find the first cause and action. You can branch off at any level and keep going until a workable solution that solves the problem is found.

5a. Why was the index missing from the table? The new feature wasn’t tested under heavy load before release.

Actions:

Add a performance test for the feature (Owner: Carol)
Update the launch checklist to require performance tests for big features (Owner: Chemtai)
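A performance test for the feature could look something like the sketch below. It is only an illustration under stated assumptions: it uses an in-memory SQLite database, a hypothetical products table, and made-up values for the data volume, the number of lookups and the latency budget. A real test would run against a production-like database with realistic data.

```python
import sqlite3
import time

ROWS = 20_000          # assumed data volume for the test
LOOKUPS = 200          # assumed number of policy lookups to time
BUDGET_SECONDS = 1.0   # assumed latency budget, deliberately generous


def build_database(with_index):
    """Create a hypothetical products table filled with ROWS rows."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE products (id INTEGER PRIMARY KEY, policy_id INTEGER, details TEXT)"
    )
    conn.executemany(
        "INSERT INTO products (policy_id, details) VALUES (?, ?)",
        [(i % 2000, f"coverage {i}") for i in range(ROWS)],
    )
    if with_index:
        conn.execute("CREATE INDEX idx_products_policy_id ON products (policy_id)")
    return conn


def time_lookups(conn):
    """Return the seconds taken to fetch product details for LOOKUPS policies."""
    start = time.perf_counter()
    for policy_id in range(LOOKUPS):
        conn.execute(
            "SELECT details FROM products WHERE policy_id = ?", (policy_id,)
        ).fetchall()
    return time.perf_counter() - start


def test_policy_lookup_stays_within_budget():
    elapsed = time_lookups(build_database(with_index=True))
    assert elapsed < BUDGET_SECONDS, f"lookups took {elapsed:.3f}s"


if __name__ == "__main__":
    test_policy_lookup_stays_within_budget()
    slow = time_lookups(build_database(with_index=False))
    fast = time_lookups(build_database(with_index=True))
    print(f"without index: {slow:.3f}s, with index: {fast:.3f}s")
```

Timing-based tests can be noisy on shared build machines, so the budget here is kept loose on purpose. The point is to catch a regression of the size seen in the incident, not to measure small differences.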

Note: You can have more than one cause when answering a particular Why.

5b. Why was the index missing from the table? The new feature’s author and reviewer didn’t have experience in database performance.

Why was the missing index overlooked during the code review? The caretaker list for this part of the code review process did not include an experienced database performance reviewer.

Actions:

Update the caretaker list for schema code (Owner: Balange)
Schedule an engineering brown bag talk on database schema performance (Owner: Brent)

The actions above are a combination of process, code, and communication changes. Each one can be completed and has a specific owner responsible for it, which makes it very clear who has the ball in their court.
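One way to keep that ownership visible is to record the analysis as a small tree of Whys, each carrying its own actions. The sketch below is an assumption about how such a record could be modelled, not a description of our tooling. The class names, the `kind` labels and the classification of each action are illustrative.

```python
from dataclasses import dataclass, field
from typing import Iterator, List


@dataclass
class Action:
    description: str
    owner: str
    kind: str  # assumed labels: "process", "code" or "communication"


@dataclass
class Why:
    question: str
    answer: str
    actions: List[Action] = field(default_factory=list)
    branches: List["Why"] = field(default_factory=list)


def all_actions(why: Why) -> Iterator[Action]:
    """Walk the tree depth-first and yield every action."""
    yield from why.actions
    for branch in why.branches:
        yield from all_actions(branch)


def unowned_actions(why: Why) -> List[Action]:
    """Return the actions that nobody has been assigned to."""
    return [action for action in all_actions(why) if not action.owner.strip()]


# Part of the example from this post; the "kind" of each action is our guess.
root = Why(
    "Why was the query slow?",
    "The products table was missing an index",
    actions=[Action("Add an index to the table", "Ben", "code")],
    branches=[
        Why(
            "Why was the index missing from the table?",
            "The new feature wasn't tested under heavy load before release",
            actions=[
                Action("Add a performance test for the feature", "Carol", "code"),
                Action("Require performance tests in the launch checklist", "Chemtai", "process"),
            ],
        ),
        Why(
            "Why was the index missing from the table?",
            "The author and reviewer lacked database performance experience",
            actions=[
                Action("Update the caretaker list for schema code", "Balange", "process"),
                Action("Schedule a brown bag talk on schema performance", "Brent", "communication"),
            ],
        ),
    ],
)

for action in all_actions(root):
    print(f"{action.owner}: {action.description} [{action.kind}]")

print("actions without an owner:", len(unowned_actions(root)))
```

A check like `unowned_actions` can be run at the end of the meeting so that no action leaves the room without an owner.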

Let’s look at the RCA meeting

These meetings work best when they are run by the team responsible for the feature or change that led to the incident. Affected teams are also invited to help speed up the process. The meetings are run as follows:

1. Pick a facilitator and note taker
2. State the incident
3. Ask the Whys
4. Repeat
5. Identify and assign actions
6. Communicate your learnings

Step 6 above is critical. The actions will prevent this specific problem from happening again, but the real leverage comes from sharing what was learned from this one failure so that similar failures can be prevented in the future.

Learning from failure to improve the Kakbima insurance platform

Sam Wanekeya
