Chaos testing, also known as chaos engineering, is a widely used term in the IT industry. In chaos testing, you deliberately cause random and unpredictable failures in different parts of the architecture to see how the system reacts. The results give you clear insight into how to improve the system and which procedures to put in place for catastrophic events. Large organizations such as Facebook, LinkedIn, and Netflix use chaos engineering in their software development life cycle to improve the architecture of their microservices and distributed systems. Read on for the basics of chaos testing, its advantages, and the tools available to test your system.
What is Chaos Testing?
Chaos testing is an approach organizations use to test the integrity of a system by proactively identifying and simulating failures in a given environment before they lead to a bad user experience or unplanned downtime. To see how the production environment reacts to a chaos testing scenario, the IT and DevOps teams need monitoring tools in place before the experiment starts. With this approach, developers and testers can understand how their application or service will respond in real-life scenarios when it faces unexpected issues under pressure and stress.
Chaos testing has changed how many organizations work because of the way software is designed and engineered at large scale today. Unlike practices that test the flexibility and velocity of an application, chaos testing helps you understand how your application copes with unexpected issues in distributed systems. Done well, it supports a high-quality customer experience at massive scale and gives teams the confidence to innovate quickly.
Benefits of Chaos testing:
- Finds critical issues and vulnerabilities early, which limits revenue losses
- Reduces application and system failures
- Supports a high-quality user experience with less disruption and downtime
- Helps you learn about the system and gain confidence in it
History of Chaos Testing
The term chaos engineering (or chaos testing) was coined by Netflix. As one of the top media streaming service providers, Netflix wanted its cloud-based streaming service to be robust enough to handle any unexpected failure without affecting its users. Such failures include data centre downtime, cyber-attacks such as a distributed denial of service (DDoS) attack, failing network segments, or any other type of production failure.
In its early phase, chaos testing at Netflix was simply an experiment with a null and an alternative hypothesis. In this experiment, engineers deliberately injected failures into the cloud data platform to understand how the failures affected the application and how it responded. The idea was to learn what might happen if these failures occurred in real life while millions of users were accessing the service.
To run such experiments, Netflix developed an internal tool called Chaos Monkey, which later grew into a suite known as the Simian Army. It is a collection of open-source cloud testing tools used to test or break applications running in a cloud environment by deliberately creating havoc. By running the Simian Army in its production environment, the Netflix team was able to take instances or entire data centre segments offline, add network delays, and identify security vulnerabilities. The tools are also used to perform application health checks and clean up system resources that are no longer in use.
Chaos Testing Principles
Chaos Testing has five main principles:
- Define a steady state for the working system: The first principle of chaos testing is to define a steady state against which you will measure the resilience of the system. For this, you need to define a control group and a variable (experimental) group. The steady state is the normal, expected behaviour of the applications and services. Pick a few key metrics to track and monitor so that you know what normal looks like.
- Create a hypothesis for the steady-state behaviour: Brainstorm with your team to identify the scenarios in which the system will behave normally and those in which it might fail. Think about the different ways you could disrupt the expected behaviour of a steady system. This helps you predict how the system might react to failure. Once you understand the scenarios, you can modify the system to handle those failures.
- Ensure a minimum blast radius: The blast radius is the number of users affected when the system fails. Whenever the system fails during chaos testing, the number of affected users should be as small as possible. The team still gains insight into the system, and the impact on users stays minimal.
- Test in the production environment: In the beginning, it is acceptable to do all chaos testing in a staging environment, but eventually the experiments need to run in production. Only then can you understand how the system reacts to real-life scenarios when it is disrupted. When you run chaos tests in production, minimizing the blast radius becomes very important.
- Run the experiments automatically and continuously: Running chaos experiments manually would require a lot of effort, so they need to be automated. They also need to run and be monitored continuously so that issues are identified early and developers can quickly make the changes needed to avoid the failures found.
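To make the principles above more concrete, here is a minimal sketch of a single chaos experiment in Python. It runs entirely against a toy in-memory service, so every name in it (`Replica`, `Service`, `success_rate`, `run_experiment`, and the threshold values) is a hypothetical chosen for illustration, not part of any real chaos tool. It measures a steady-state metric, caps the blast radius, injects a failure, and checks the hypothesis.

```python
import random
from dataclasses import dataclass


@dataclass
class Replica:
    name: str
    healthy: bool = True


@dataclass
class Service:
    """Toy stand-in for a load-balanced service (hypothetical, in-memory only)."""
    replicas: list
    retries: int = 2

    def handle_request(self, rng):
        # Try up to retries + 1 randomly chosen replicas; succeed on the first healthy one.
        for _ in range(self.retries + 1):
            if rng.choice(self.replicas).healthy:
                return True
        return False


def success_rate(service, rng, requests=1000):
    """Steady-state metric: fraction of simulated requests served successfully."""
    ok = sum(service.handle_request(rng) for _ in range(requests))
    return ok / requests


def run_experiment(service, rng, max_blast_fraction=0.34, threshold=0.90):
    # 1. Steady state: measure the baseline before injecting anything.
    baseline = success_rate(service, rng)

    # 2. Blast radius: cap how many replicas may be taken down (at least one).
    limit = max(1, int(len(service.replicas) * max_blast_fraction))
    victims = rng.sample(service.replicas, limit)

    # 3. Inject the failure, observe, and always restore afterwards.
    for replica in victims:
        replica.healthy = False
    try:
        during = success_rate(service, rng)
    finally:
        for replica in victims:
            replica.healthy = True

    # 4. Hypothesis: the success rate stays at or above the threshold.
    return {
        "baseline": baseline,
        "during": during,
        "victims": [r.name for r in victims],
        "hypothesis_holds": during >= threshold,
    }


if __name__ == "__main__":
    service = Service([Replica("a"), Replica("b"), Replica("c")])
    print(run_experiment(service, random.Random(42)))
```

In this toy model the retries are what keep the service within its threshold when one replica is down; setting `retries=0` makes the hypothesis fail, which is the kind of weakness a real experiment is meant to expose. Automating the experiment would amount to calling `run_experiment` on a schedule and alerting when `hypothesis_holds` is false.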
Chaos Testing tools
Below are a few important chaos testing tools you should know about:
Chaos Mesh is a chaos testing tool for Kubernetes that injects failures into every layer of the system, including pods, networks, the kernel, and system I/O. It can automatically kill pods running inside Kubernetes so you can see how the cluster reacts to the failure.
Chaoskube is an open-source chaos testing tool that periodically kills random pods inside a Kubernetes cluster. By default, it kills one pod every 10 minutes, chosen at random from any namespace.
Chaos Monkey is the tool Netflix created to perform chaos testing on its own systems. It later became part of a larger suite, the Simian Army. Netflix built it to test the resiliency and recoverability of its AWS infrastructure. It was named Chaos Monkey because it creates destruction like a wild monkey.
Litmus is used for chaos testing of cloud-native applications and services. It helps identify weaknesses and vulnerabilities in applications running on Kubernetes. You package the chaos logic as a Docker image and add it to Litmus, which then runs the defined experiments using its chaos operator and CRDs (CustomResourceDefinitions).
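The periodic random pod killing described for the Kubernetes tools above can be sketched in a few lines of Python. This is only a simulation over an in-memory dictionary: the function names `pick_victim` and `run_rounds`, the `excluded_namespaces` guardrail, and the sample cluster are assumptions made for illustration and do not reflect how any of the listed tools is implemented. In a real cluster, the removal step would be a pod deletion through the Kubernetes API.

```python
import random


def pick_victim(pods_by_namespace, rng, excluded_namespaces=()):
    """Pick one (namespace, pod) pair uniformly at random, skipping excluded namespaces."""
    candidates = [
        (ns, pod)
        for ns, pods in sorted(pods_by_namespace.items())
        if ns not in excluded_namespaces
        for pod in pods
    ]
    return rng.choice(candidates) if candidates else None


def run_rounds(pods_by_namespace, rounds, rng, interval_minutes=10,
               excluded_namespaces=()):
    """Simulate up to `rounds` kill cycles; return a log of (minute, namespace, pod)."""
    log = []
    for i in range(rounds):
        victim = pick_victim(pods_by_namespace, rng, excluded_namespaces)
        if victim is None:
            break  # nothing left that we are allowed to kill
        ns, pod = victim
        pods_by_namespace[ns].remove(pod)  # stand-in for deleting the pod
        log.append(((i + 1) * interval_minutes, ns, pod))
    return log


if __name__ == "__main__":
    cluster = {"shop": ["cart-1", "cart-2"], "auth": ["login-1"]}
    for minute, ns, pod in run_rounds(cluster, rounds=2, rng=random.Random(7)):
        print(f"minute {minute}: killed {ns}/{pod}")
```

Passing a seeded `random.Random` keeps a run reproducible, and `excluded_namespaces` is one simple way to limit the blast radius by keeping critical namespaces out of the candidate pool.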
That covers the basics of chaos testing. Make sure you are very clear about the experiments you want to run in the production environment. Poorly planned chaos testing can have a very negative impact on the user experience. Despite this risk, chaos testing is worth doing on your application because it helps the system become more resilient and recoverable.