Are you ready for chaos engineering? Here's how to weather the storms you're about to create.
You know what’s fun as an engineer, but also terrifying if you’re accountable for service reliability? Breaking your systems, on purpose, and then magically restoring them to working order a few minutes later.
You guessed it: we’re talking about chaos engineering. It’s truly a thrilling part of our engineering practice, but how can we keep both software engineers and service reliability folks happy?
In this article, I'll go over how to know when you're tall enough to ride the chaos engineering rollercoaster, and how to have fun and stay safe at the same time. With that in mind, let's get started. Remember to keep your hands inside the moving vehicle at all times!
Preconditions for chaos engineering
Before you board the chaos engineering car, make sure that you’re adequately prepared, and that you can afford to spend the engineering resources to enter the theme park in the first place. If your organization does not believe that reliability is important, or conversely believes in rigid reliability at all costs, then chaos engineering is not yet for you. There may be far more accessible wins for you and your organization by setting appropriate targets for reliability and being able to meet them.
For instance, if your pager is already on fire from real or phantom incidents, you cannot afford to inject new chaos into your systems. Those systems are chaotic enough already, and will put your engineers at risk of burnout if the situation is allowed to continue. If your product is unreliable, your customers will churn unless you improve reliability before adding new services and features. Focus on those existing issues before you pile future hypotheticals onto your to-fix list.
Conversely, if your service is generally running fine but your organization is insistent upon restrictive change control boards, "100% uptime," and manual QA testing in staging environments, then it will be challenging for you to accumulate the organizational capital to perform live tests in production. However, you can make incremental progress by adding chaos tests to your staging environment. You can do tabletop exercises to rehearse communication patterns and team cooperation without touching your systems. Additionally, you can start setting more realistic expectations about the tradeoffs between reliability and velocity without dialing the production chaos up to eleven.
Let's assume your organization is somewhere in the middle between these two extremes. Things aren't constantly on fire, you can push code without weeks of delay, and you have enough breathing room to start asking, "What's looming next on the horizon?" First, you'll want to quantify the amount of reliability budget you have to work with. Not everyone can or should ride the wildest rollercoaster, and sometimes even the gentlest log ride will feel like a big enough thrill. Knowing your own and your customers' level of queasiness at all times will prevent you from going too far and causing an unfortunate incident.
Engineering your safety margins
Thus, you'll want Service Level Objectives (SLOs) in order to understand where you and your application are, and how much maneuvering room you have. While SLOs originally gained popularity among Google Site Reliability Engineering practitioners, engineers with all sorts of job titles and at all scales find them useful today. Their core idea is to quantify reliability as experienced by your customers, rather than lower level system metrics. Crucially, the quantified reliability is not expected to be 100%. Using a more achievable target recognizes the inherent tradeoffs between what your customers might incrementally perceive and the costs of delivering that additional reliability. The costs are not just in terms of the technology and infrastructure required for redundancy, but also in the risk of burnout for your engineers.
Infrastructure cost and sophistication rise with the expected level of reliability, so the demand and cost curves must converge somewhere short of 100%. This gives you a realistic target to aim for, which you can then track towards and proactively manage. If you wind up with higher than expected reliability, any remaining "error budget" is left for you to spend on innovation. When it comes to knowing how ambitious to get with chaos engineering, examining the amount of unspent error budget gives you an idea of your safety margins. For a more comprehensive discussion, see chapter four of Site Reliability Engineering by Betsy Beyer, Chris Jones, Jennifer Petoff, and Niall Richard Murphy (O'Reilly), or Implementing Service Level Objectives by Alex Hidalgo (O'Reilly).
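To make the idea of unspent error budget concrete, here is a minimal Python sketch. It assumes an event-based SLO (a share of requests must be "good") and, as a simplification, sizes the budget from the traffic seen so far in the window. The class and field names are hypothetical, not taken from any particular SLO tool.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ErrorBudget:
    """Error budget for an event-based SLO over one window (illustrative only)."""

    slo_target: float   # e.g. 0.999 means 99.9% of events must be good
    total_events: int   # eligible events seen so far in the window
    bad_events: int     # events that failed the SLI

    @property
    def allowed_bad_events(self) -> float:
        """How many events may fail before the SLO is violated."""
        return (1.0 - self.slo_target) * self.total_events

    @property
    def remaining_fraction(self) -> float:
        """Share of the budget still unspent; negative once the SLO is blown."""
        allowed = self.allowed_bad_events
        if allowed == 0:
            # No traffic, or a 100% target: be conservative and report no budget.
            return 0.0
        return 1.0 - self.bad_events / allowed


budget = ErrorBudget(slo_target=0.999, total_events=1_000_000, bad_events=250)
print(f"allowed failures: {budget.allowed_bad_events:.0f}")
print(f"budget remaining: {budget.remaining_fraction:.0%}")
```

With a 99.9% target and one million events, about a thousand failures are allowed, so 250 failures leave roughly three quarters of the budget. How much of that remainder you are willing to stake on an experiment is a judgment call for you and your team.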
It’s important to realize that SLOs aren’t the entire picture. They're a great way to find out whether your customers have had too broken an experience recently, but they won't necessarily help you pinpoint what has gone wrong mid-incident or mid-drill. Ironically, monitoring cannot help with debugging chaos engineering experiments. After all, the entire idea is to discover unknown properties of your system, so any metrics or monitors you might set up in advance may be rendered obsolete by what the gremlins actually do when they're released. What you need instead is observability.
Observability allows you to analyze data from your systems on the fly in novel ways, enabling you to understand unknown-unknowns that may arise during chaos engineering (or regular system chaos from unscheduled outages and service degradation). You'll want to be able to revert and debug if any one particular customer is degraded by your experiment. Instead of having to guess as to temporal correlation and causation, you can pin down whether two components interacting in novel ways are genuinely the source of the problem. After all, if you perform an experiment but cannot learn from it, there is no point in performing it. For more on observability, see my article, ‘What is the business case for observability?’ or Observability Engineering by myself, Charity Majors, and George Miranda (O'Reilly).
By now, you hopefully have obtained the organizational buy-in you need, and met the minimum technical maturity standards in order to benefit from doing chaos engineering. You have paid for the ride and are tall enough, so let's get strapped in! Let's learn about the safety equipment and guardrails that keep you secure when riding.
Designing the experiment
As with any rollercoaster, design is a critical part of ensuring safety while maximizing fun and effectiveness. How can you design your experiments to be minimally dangerous while still exciting and thrilling? The first thing is to ensure that you limit the degrees of freedom in the experiment. You don't want your passengers to be able to move around and alter the center of gravity, for instance, so ensure that everyone is firmly buckled down.
In order to prevent unexpected changes from interfering with the experiment, you'll want to understand what normal sources of change exist in your system. Some of these may be ordinary things your experiment should tolerate such as scheduled software changes or deployments. You may want to pause others or schedule your experiment for a time that does not conflict with more disruptive and rarer changes such as database migrations or hardware replacement.
Once you've secured everything that shouldn't move, the next step is to identify what property of the system you want to experiment on. You might be interested in understanding how high you can push CPU utilization, or whether your system's fault tolerance against single instance failures (or, more ambitiously, even whole availability zones) is working correctly.
Next, you'll need to figure out how to dynamically adjust the variables for the purposes of the experiment. For instance, you can constrain CPU by running CPU soaker processes that do no useful work as sidecars to the processes you are attacking. Alternatively, you can route traffic to a smaller subset of nodes than would normally run, or increase the target CPU utilization percentage on the auto scaling group. For simulating availability zone (AZ) failures, you can block access to AZs with load balancing and firewall rules. To test instance failure, you can restart entire instances and observe the automatic recovery processes, both while the instance is down and after it comes back online.
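As an illustration of the CPU soaker idea, here is a minimal Python sketch that launches busy-looping child processes and can stop them on demand. The `CpuSoaker` class is hypothetical. A real sidecar would more likely be a purpose-built stress tool with CPU limits applied by your scheduler, so treat this as a toy for a test machine rather than something to run in production.

```python
import subprocess
import sys

# Each child runs a tight loop that does no useful work and pins roughly one core.
_BUSY_LOOP = "while True: pass"


class CpuSoaker:
    """Starts and stops busy-loop processes (illustrative sketch only)."""

    def __init__(self, workers=1):
        if workers < 1:
            raise ValueError("workers must be >= 1")
        self.workers = workers
        self._procs = []

    def start(self):
        if self._procs:
            raise RuntimeError("soaker is already running")
        self._procs = [
            subprocess.Popen([sys.executable, "-c", _BUSY_LOOP])
            for _ in range(self.workers)
        ]

    def running(self):
        """Number of soaker processes that are still alive."""
        return sum(1 for p in self._procs if p.poll() is None)

    def stop(self, timeout_s=5.0):
        """Ask every soaker to exit, escalating to kill if one does not."""
        for p in self._procs:
            p.terminate()
        for p in self._procs:
            try:
                p.wait(timeout=timeout_s)
            except subprocess.TimeoutExpired:
                p.kill()
                p.wait()
        self._procs = []

    def __enter__(self):
        self.start()
        return self

    def __exit__(self, exc_type, exc, tb):
        self.stop()  # always release the CPU, even if the experiment raised
        return False
```

Used as `with CpuSoaker(workers=2): ...`, the soakers are stopped when the block exits, whether it finishes normally or raises. One worker occupies about one core, so the worker count is the knob for how much CPU you take away.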
No matter the kind of failure or degradation you are simulating, you'll want to make sure that you have the option to revert the experiment at any time and return to normal operations. Big red stop buttons are a critical piece of your safety story whether you're operating an amusement park ride, or a chaos engineering experiment. As with anything, always have a way to return to the previous operating capacity. You want to flirt and dine with chaos but not invite it to stay the night.
For CPU saturation experiments, you'll want to be able to put the cordoned-off nodes back into the load balancing rotation, or stop the CPU soakers. For AZ failure simulation, you can undo the routing changes to restore access to the missing AZ. Unfortunately, instance or pod replacement can't be undone with the wave of a magic wand, so you'll need to wait for the nodes and jobs to come back on their own. Your stop button should, however, at least prevent any further machines from restarting.
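One way to wire up a big red stop button is sketched below in Python. It assumes the fault can be expressed as a pair of `inject` and `revert` callables and that an experiment is a sequence of named steps. All of these names are hypothetical; the point is only that the revert always runs and that no new step starts after the button is pressed.

```python
import threading
from contextlib import contextmanager


class StopButton:
    """A thread-safe flag that any operator or monitor can press."""

    def __init__(self):
        self._event = threading.Event()

    def press(self):
        self._event.set()

    @property
    def pressed(self):
        return self._event.is_set()


@contextmanager
def chaos_experiment(inject, revert):
    """Apply a fault and guarantee the revert runs, even on error or abort.

    Assumption: revert() is safe to call even if inject() only partly applied.
    """
    try:
        inject()
        yield
    finally:
        revert()


def run_steps(steps, stop_button):
    """Run (name, action) steps in order; start no new step once stopped."""
    completed = []
    for name, action in steps:
        if stop_button.pressed:
            break
        action()
        completed.append(name)
    return completed
```

For an instance-restart experiment, each step could restart one machine. Pressing the button cannot bring back a machine that is already rebooting, but it does keep the remaining machines untouched.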
However, another key piece to operating within your safety envelope is having adequate telemetry signals to know the status of all of the moving pieces. While proliferating dashboards is not a good long-term solution for operability, preparing relevant data queries in advance for your known-unknowns will help you keep an eye on things. You were keeping Service Level Indicators and Objectives, right? Make sure that you know when you start having SLI failures, so you can revert before you blow your entire error budget on the experiment. If your SLI involves latency for critical operations, plot a heatmap of the latency so you'll notice if it starts to creep up. Ensure that error rates stay within the bounds you're expecting.
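The "revert before you blow your entire error budget" rule can be sketched as a small guard, shown here in Python. It assumes you decide up front how many SLI failures the experiment is allowed to cause, and that an event counts as good when it returns a non-5xx status within a latency threshold. The threshold, the allotment, and the names are all hypothetical.

```python
from dataclasses import dataclass


def is_good_event(status_code, latency_ms, threshold_ms=500.0):
    """Example SLI: the request succeeded and was fast enough."""
    return status_code < 500 and latency_ms <= threshold_ms


@dataclass
class BudgetGuard:
    """Counts SLI failures during an experiment against a fixed allotment."""

    max_bad_events: int
    bad_events: int = 0
    total_events: int = 0

    def record(self, good):
        """Record one event; return True when it is time to revert."""
        self.total_events += 1
        if not good:
            self.bad_events += 1
        return self.should_abort

    @property
    def should_abort(self):
        return self.bad_events >= self.max_bad_events


guard = BudgetGuard(max_bad_events=2)
for status, latency in [(200, 120.0), (200, 900.0), (503, 50.0)]:
    if guard.record(is_good_event(status, latency)):
        print("allotment spent: revert the experiment")
        break
```

The allotment is deliberately a number you choose before the experiment starts, so that the decision to stop is made calmly in advance rather than mid-incident.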
Remember, you're not just looking for a lack of a negative signal; you want to get data about what you expect to degrade and then recover. If you're testing CPU, check how saturated the CPU actually is (so that a broken CPU soaker doesn't cause you to falsely believe that your system is resilient to CPU saturation when the test was what failed to run correctly). If you're testing that you can survive without an AZ, make sure that AZ actually is properly drained of traffic. Ensure that you're tracking how quickly the cluster rebuild is progressing, if you're testing the recovery of a stateful system.
Learn from the experiment, then make it repeatable
So, you've run your experiment and tested your system. Share some photos from your ride! What made you scream? What was fun but not scary? How can we do better next time? If you treat the chaos engineering experiment as a self-contained incident, hopefully you'll have a chat record of all of the changes being tested and how the humans and systems reacted to them.
If you did your changes gradually, for instance increasing CPU from 65% to 70% to 75%, then you'll be able to identify that first point at which things went south and latency degraded. Going straight to 75% may not necessarily have given you that insight into where exactly the boundary was. You're doing these experiments to identify what the safety margins of your systems are, so remember that going slow enables you to learn more from your experimentation.
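Stepping up gradually can be expressed as a simple loop, sketched here in Python. It assumes you have some way to apply a load level (`apply_level`) and some health check (`is_healthy`); both are hypothetical callables that you would supply. In practice you would also hold each level for a while before judging it, and revert once the boundary is found.

```python
def find_first_degradation(levels, apply_level, is_healthy):
    """Step through levels in increasing order and stop at the first bad one.

    Returns (last_healthy_level, first_unhealthy_level); either may be None.
    """
    last_healthy = None
    for level in levels:
        apply_level(level)
        if not is_healthy():
            return last_healthy, level
        last_healthy = level
    return last_healthy, None


# Toy stand-in for a real system: pretend latency degrades at 75% CPU.
state = {"cpu": 0}
boundary = find_first_degradation(
    levels=[65, 70, 75, 80],
    apply_level=lambda pct: state.update(cpu=pct),
    is_healthy=lambda: state["cpu"] < 75,
)
print(boundary)
```

Because the loop returns as soon as a level is unhealthy, the toy system above is never pushed to 80%. The pair it returns brackets the boundary: the last level that was fine and the first that was not.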
It's important to not just identify at what point things broke, but why. This is where observability can help. When the system reaches its tipping point, what starts to degrade and why? Did your mitigation strategies such as load-shedding and caching work as designed? What unanticipated services might have depended upon the service you were testing?
Once you've done a chaos engineering experiment as a one-off, it's tempting to say you're done, right? Unfortunately, the forces of entropy will work against you and render your previous safety guarantees moot unless you continuously verify the results of your exploration. Thus, after a chaos engineering experiment succeeds once, run it regularly either with a cron job or by calendaring a repeat exercise. This serves two purposes: it ensures that changes to your system don't cause undetected regressions, and it gives the people involved a refresher in their knowledge and skills. As a third benefit, these exercises can be a great on-call training activity.
Pressing the red “Abort!” button on the experiment does not make your experiment a failure, as long as you learned something. A “failed experiment” (or less than ideal experiment) should be retried until you can achieve the desired result. For instance, if you encounter a blocker that prevents you from switching the active region for your service, you can correct it and try again. If you find manual steps that take too long or are error-prone, you can work on automating them out of existence before you repeat the experiment. If it took 15 hours last time, it might only take 10 the next time, as my friend Tom has demonstrated with Stack Overflow's failovers.
Always debrief with your team afterwards, regardless of whether things went exactly according to plan. This enables you to discover the perspectives of everyone on your team. What was obvious to you might not be obvious to another member of your team. Perhaps you'll be able to identify potential improvements from team members who haven't performed the processes before and have less attachment to the existing ones.
A case study: Chaos engineering Kafka
“For a successful technology, reality must take precedence over public relations, for nature cannot be fooled.” – Richard P. Feynman
At Honeycomb, where I'm Field CTO, we operate a mixture of stateful and stateless services to provide our observability tooling to customers. Our stateless services are already prepared for chaos, as we have been running them on AWS Spot for several years now. Spot already terminates our Kubernetes nodes and forces pod rescheduling onto what's available in terms of different instance types and AZs on an ongoing basis (and gives us cost savings for that flexibility in where and how we run).
If we hadn't tested the behavior of our services when preempted and rescheduled, and instead expected them to run continuously for long periods of time, the restart of a host would become an untested emergency. We'd then need to specially plan and run single-instance or single-pod chaos engineering experiments. Adapting to constant rescheduling of our jobs gives us both cost savings and resilience benefits through continuous testing. However, we will at some point need to start doing regular whole-AZ test outages.
Notably, our stateful storage and coordination systems, such as Kafka, ZooKeeper, and our custom-written data engine Retriever, do not run on Spot. They could therefore run for months without interruption from the cloud provider. Reliability is good, right? Not so fast. Rare events, such as instance degradation and replacement, are inevitable but can be hard to plan for. If we accidentally broke the process of bootstrapping a new Kafka, ZooKeeper, or Retriever node, we might not find out until AWS schedules an instance replacement months later.
To avoid this, we periodically schedule the instance replacement of one Kafka and Retriever node in each environment, every single week, during business hours. This ensures that if a regression occurs, it'll be easy to bisect the new fault to the past week of changes, and any firefighting will happen while a majority of employees are available to correct any issues. It's more than just outright breakages of the bootstrap and provisioning code that we can catch by doing chaos engineering though.
Load on a distributed system is highest just after one node has failed, because there are fewer nodes remaining to handle the existing workload from clients, and the surviving nodes must take on the additional responsibility of bringing a replacement node up to date. Thus, the early warning of our chaos experiments has allowed us to discover latent capacity issues by testing replacement during weekday peak hours.
For instance, when there are only two brokers in each of three AZs, taking any one broker out of commission makes the brokers in the two unimpacted AZs work much harder to restore full redundancy, almost to the breaking point! This is counterintuitive, as you'd expect that capacity would only drop by about 17% (one broker in six) from losing a single worker, not 33%. This occurs because we don't store replicated partitions in the same AZ, only in different AZs. Thus, the remaining broker in the same AZ cannot contribute to restoring partition redundancy.
In order to mitigate this, we have increased our cluster size to 9 Kafka brokers (three per AZ), to ensure that there are always 6 brokers available to shoulder the burden of restoring fault tolerance to normal levels. Because there are more brokers per AZ, there are also fewer partitions on each broker. As a result, only 33%, rather than 50%, of the partitions need to be re-replicated when a single broker is lost from an AZ. There are more workers available in the unimpacted AZs as well, to both serve the data so we don't drop a single write, and to send across the bytes to re-populate storage on the fresh new instance.
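The arithmetic behind these numbers can be worked through in a few lines of Python. This sketch rests on assumptions that go beyond what is stated above: data is spread evenly across brokers, the 33% and 50% figures are read as shares of one AZ's partitions, and only brokers in the other AZs can act as sources for re-replication. The function and its field names are hypothetical.

```python
from fractions import Fraction


def single_broker_loss(brokers_per_az, azs=3):
    """Effect of losing one broker, under an even-spread assumption."""
    total = brokers_per_az * azs
    # Replicas live only in other AZs, so same-AZ survivors cannot help.
    helpers = brokers_per_az * (azs - 1)
    return {
        "total_brokers": total,
        "helpers": helpers,
        # Share of the impacted AZ's partitions that lived on the lost broker.
        "az_partition_share": Fraction(1, brokers_per_az),
        # Share of all cluster data each helper must send to the replacement.
        "data_share_per_helper": Fraction(1, total) / helpers,
    }


for per_az in (2, 3):
    result = single_broker_loss(per_az)
    print(per_az, "per AZ:", {k: str(v) for k, v in result.items()})
```

Under these assumptions, moving from two to three brokers per AZ raises the number of helping brokers from 4 to 6, and shrinks each helper's share of the re-replication work from 1/24 to 1/54 of the cluster's data.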
By detecting this capacity issue early before our fault tolerance was pushed beyond its capacity, we were able to mitigate the issue and ensure that we could grow our system without it breaking unexpectedly in the future. Having a solid base of SLOs and observability has enabled us to identify when chaos engineering may benefit our engineers in understanding the limits of our systems, and deliver better reliability for our customers into the future.
Conclusion
Chaos engineering is an important practice for mature engineering teams to proactively manage risk. It is an excellent learning tool for individual members or a team as a whole. Experimentation facilitates team members learning about hidden dependencies they may rely upon to work each day, and enables those who are experienced with those systems to identify key areas of needed improvement in order to make their service more robust and reliable.
However, don't just jump into chaos engineering without a plan – otherwise, you'll just have unbridled chaos. Advance planning helps to ensure teams get the most out of their effort. Set learning goals, find specific questions you want answered, and outline them to your team. Schedule both briefing and debriefing sessions so your team knows not only what the objectives are, but also can then learn from the experience as a group. Replicability is key. A one-off result isn’t as useful as it would seem; you need to be able to reliably weather the chaos storms you generate.