Chaos engineering

"Chaos engineering" is the practice of planning and running failure tests against live production environments to verify high availability and to check that the recovery time objective (RTO) and recovery point objective (RPO) can actually be met. The idea may sound absurd at first and may raise eyebrows in your organization. But when implemented well, with proper planning, it keeps the organization prepared for the failure of any critical component or module. Despite the name, chaos engineering is carefully planned and detailed: it involves a thorough analysis of the organization's whole environment, an understanding of the current RTO and RPO, and tests designed and executed so that the systems face a wide variety of failures. The objective of planning is to cover as many failure scenarios as possible.

An Example

Let's try to understand it with an example. Imagine we have a critical 3-tier web application hosted with high availability. Measures have been taken to make sure that in a crisis the application can be restored to its working state within the RTO. There are multiple instances hosting the databases and the application, with load balancing and auto-failover. On paper, the numbers and the setup look good, but in reality we don't know what kind of problem will come. We might have prepared ourselves to deal with an instance failure or database corruption, but might not have thought of a load balancer failure. And since that was an unforeseen cause of failure, recovering the application's functionality might take longer than the RTO and mean major business impact in reputation, finances, or both.
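To make the example concrete, here is a minimal sketch in Python that compares the recovery time of each failure scenario with the RTO. Everything in it is an assumption made for illustration: the 30-minute RTO, the scenario names, and the recovery times are invented numbers, not measurements from any real system.

```python
from dataclasses import dataclass


@dataclass
class Scenario:
    name: str
    rehearsed: bool          # has the team practiced recovering from this?
    recovery_minutes: float  # measured or estimated time to restore service


def exceeds_rto(scenarios, rto_minutes):
    """Return the names of scenarios whose recovery takes longer than the RTO."""
    return [s.name for s in scenarios if s.recovery_minutes > rto_minutes]


# Hypothetical values for the 3-tier application described above.
RTO_MINUTES = 30
SCENARIOS = [
    Scenario("app instance failure", rehearsed=True, recovery_minutes=5),
    Scenario("database corruption", rehearsed=True, recovery_minutes=25),
    Scenario("load balancer failure", rehearsed=False, recovery_minutes=90),
]

if __name__ == "__main__":
    for name in exceeds_rto(SCENARIOS, RTO_MINUTES):
        print(f"{name}: recovery would exceed the {RTO_MINUTES}-minute RTO")
```

With these made-up numbers, only the unrehearsed load balancer failure is flagged. In practice the recovery times would come from actual tests, which is exactly what chaos engineering is meant to provide.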

Traditional vs. Chaos

Traditionally, failover testing is done during a new implementation or a very big migration. Routine DR testing is done only for critical components in the environment, because the organization or the service provider doesn't want downtime during testing that could impact the business. DR testing is therefore planned for non-critical hours and done only for the modules that we know can recover. Also, most architectures and their components keep changing, either due to upgrades or cost/performance optimization. Unless every component in the environment is tested for failure, we can face problems just like the one in the example above.

Summary

Testing your live environment gives you more confidence in the stability of your application and also keeps the responsible tech teams on their toes (which is not as bad as it sounds). It helps identify the weaknesses and focus areas in production. It also reveals any direct or indirect impact of recent changes in the environment.

Next time we will discuss how to analyze, plan, and execute the "Chaos" tests. Till then, please let me know in the comments below what you think about the article. Any suggestion is much appreciated.

In the previous blog, we discussed what chaos engineering is. In this part, we will go through how to plan and implement it.

While planning the testing, keep the following things in mind:

  • Assume nothing: Just like in any other testing, assume nothing while analyzing the environment. Quite often the part of the design that we assume to be safe is the one that fails, leaving us unprepared.

  • Cover everything: Not so much a point as a checklist for looking at the infrastructure thoroughly. Every component has its failure point; our job is to test what happens when it fails.

  • Do not take anyone's word for it: "Highly available", "low failure rate", and "highly redundant" are claims we come across quite often for services and implementations, more so since the cloud and its services arrived. But these are theoretical figures or SLA terms, and the failures the provider has planned for might not cover all the scenarios that matter to you. Also, a service failure might be well within the SLA range for the provider and still cause us financial or reputational loss as an organization.
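The last point can be checked with simple arithmetic. The sketch below, in Python, converts an availability percentage into the downtime it permits over a period and compares a single outage against our own RTO. The 99.9% figure, the 30-day period, and the 30-minute RTO are assumptions chosen for illustration; real SLAs define their measurement windows and exclusions in their own terms, so treat this only as a rough model.

```python
MINUTES_PER_DAY = 24 * 60


def allowed_downtime_minutes(availability_percent, days=30):
    """Downtime a provider may accumulate in `days` and still meet the SLA."""
    unavailable_fraction = (100.0 - availability_percent) / 100.0
    return unavailable_fraction * days * MINUTES_PER_DAY


def within_sla_but_over_rto(outage_minutes, availability_percent, rto_minutes, days=30):
    """True when a single outage is acceptable to the provider but not to us."""
    budget = allowed_downtime_minutes(availability_percent, days)
    return rto_minutes < outage_minutes <= budget


if __name__ == "__main__":
    # Hypothetical SLA of 99.9% measured over a 30-day month.
    budget = allowed_downtime_minutes(99.9)
    print(f"99.9% over 30 days allows about {budget:.1f} minutes of downtime")
    # A 40-minute outage fits the provider's budget but breaks a 30-minute RTO.
    print(within_sla_but_over_rto(outage_minutes=40, availability_percent=99.9, rto_minutes=30))
```

Under these assumptions the provider can be down for roughly 43 minutes in a month without breaching the SLA, so an outage that is acceptable on their side can still be longer than the recovery time we have promised our own users.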
