A terrible, horrible, no-good, very bad day at Slack

Outages are awful for users and teams alike. What can we learn from them when they happen?

On May 12, 2020, Slack had its first significant outage in a long time. We published a summary of the incident shortly after, but this story is an interesting one, and we’d like to go into more detail on the technical issues around it.

The user-visible outage began at 4:45pm Pacific time, but the story really begins around 8:30am that morning. Our Database Reliability Engineering team was alerted about a significant load increase in part of our database infrastructure at the same time as our Traffic team received alerts that we were failing some API requests. The increased load on the database was due to a rollout of a configuration change, which triggered a longstanding performance bug. The change was quickly pinpointed and rolled back – it was a feature flag which performed a percentage-based rollout, so this was a fast process. We had some customer impact, but it lasted only for three minutes and most users were still able to send messages successfully throughout this brief morning incident.

One of the incident’s effects was a significant scale-up of our main webapp tier. Our CEO, Stewart Butterfield, has written about some of the impact of the lockdown and stay-at-home orders on Slack usage. As a result of the pandemic, we’ve been running significantly higher numbers of instances in the webapp tier than we were in the long-ago days of February 2020. We autoscale quickly when workers become saturated, which is what happened here: workers were waiting much longer for some database requests to complete, leading to higher utilization. We increased our instance count by 75% during the incident, ending with the highest number of webapp hosts that we’ve ever run.

Everything seemed fine for the next eight hours – until we were alerted that we were serving more HTTP 503 errors than normal. We spun up a new incident response channel and the on-call engineer for the webapp tier manually scaled up the webapp fleet as an initial mitigation. Unusually, this didn’t help at all. We very quickly noticed that a subset of the webapp fleet was under heavy load while the rest of the webapp instances were not. Multiple strands of investigation began, looking into both webapp performance and our load-balancer tier. A few minutes later, we identified the problem.

We use a fleet of HAProxy instances behind a layer 4 load-balancer to distribute requests to the webapp tier. We use Consul for service discovery, and consul-template to render lists of healthy webapp backends that HAProxy should route requests to.

Figure 1: High-level view of Slack’s ingress load-balancing architecture

We don’t render our webapp host list directly into our HAProxy configuration file, however. The reason for this is that updating the host list via the configuration file requires reloading HAProxy. The process of reloading HAProxy involves creating a brand-new HAProxy process while keeping the old one around until it’s finished dealing with in-flight requests. Very frequent reloads could lead to too many running HAProxy processes and poor performance. This constraint is in tension with the goal of autoscaling the webapp tier, which is to get new instances into service as quickly as possible. Therefore, we use HAProxy’s Runtime API to manipulate the HAProxy server state without doing a reload each time a web tier backend comes into or goes out of service. It’s worth noting that HAProxy can integrate with Consul’s DNS interface, but this adds lag due to the DNS TTL and limits the ability to use Consul tags, and managing very large DNS responses often seems to lead to painful edge cases and bugs.

Figure 2: How the set of webapp backends is managed on a single Slack HAProxy server

We define HAProxy server templates in our HAProxy state that are effectively ‘slots’ which our webapp backends can occupy. When a new webapp instance is provisioned, or an old one becomes unhealthy, the Consul service catalog is updated. Consul-template renders a new version of the host list, and a separate program developed at Slack, haproxy-server-state-management, reads that host list and uses the HAProxy Runtime API to update the HAProxy state.
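As a rough illustration of the slot mechanism described above, the sketch below builds the Runtime API commands that could point a slot at a new backend or take a slot out of service. It is not the code of haproxy-server-state-management, which the article does not show. The backend name `webapp`, the slot name `web1`, the address, and the socket path are all hypothetical; only the `set server ... addr ... port` and `set server ... state` commands come from HAProxy's Runtime API.

```python
import socket


def assign_slot_commands(backend: str, slot: str, ip: str, port: int) -> list[str]:
    """Commands that point an existing slot at a new webapp host and enable it."""
    return [
        f"set server {backend}/{slot} addr {ip} port {port}",
        f"set server {backend}/{slot} state ready",
    ]


def free_slot_commands(backend: str, slot: str) -> list[str]:
    """Command that takes a slot out of rotation once its host is gone."""
    return [f"set server {backend}/{slot} state maint"]


def send_command(socket_path: str, command: str) -> str:
    """Send one command to the HAProxy stats socket and return the reply.

    Assumes HAProxy exposes an admin-level Unix stats socket at socket_path
    (a hypothetical path chosen by whoever configures HAProxy).
    """
    with socket.socket(socket.AF_UNIX, socket.SOCK_STREAM) as sock:
        sock.connect(socket_path)
        sock.sendall(command.encode() + b"\n")
        chunks = []
        while True:
            data = sock.recv(4096)
            if not data:
                break
            chunks.append(data)
    return b"".join(chunks).decode()


if __name__ == "__main__":
    # Print the commands rather than sending them; no HAProxy is needed to run this.
    for cmd in assign_slot_commands("webapp", "web1", "10.0.0.12", 8080):
        print(cmd)
```

Because the commands only change runtime state, no reload is involved, which is the property the pipeline relies on.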

We run M parallel pools of HAProxy instances and webapp instances, each pool in its own AWS Availability Zone. HAProxy is configured with N ‘slots’ for webapp backends in each AZ, giving a total of N * M backends that can be routed to across all the AZs. A few months ago, this total was more than ample headroom – we’d never needed to run anything even approaching that number of instances of our webapp tier. However, after the morning’s database incident, we were running slightly more than N * M instances of the webapp. If you think of the HAProxy slots as a giant game of musical chairs, a few of these webapp instances were left without a seat. That wasn’t a problem – we had more than enough serving capacity.

Figure 3: ‘Slots’ in the HAProxy process, with some excess webapp instances that aren’t receiving traffic
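The musical-chairs arithmetic can be made concrete with a small sketch. The AZ names and counts below are invented for illustration; the article does not give the real values of N or M.

```python
def unrouted_instances(instances_per_az: dict[str, int], slots_per_az: int) -> dict[str, int]:
    """For each AZ, count the webapp instances that cannot get a slot."""
    return {
        az: max(0, count - slots_per_az)
        for az, count in instances_per_az.items()
    }


if __name__ == "__main__":
    # Hypothetical numbers: N = 100 slots per AZ, M = 3 AZs, so N * M = 300 slots.
    fleet = {"az-a": 104, "az-b": 98, "az-c": 101}
    print(unrouted_instances(fleet, slots_per_az=100))
```

With these made-up numbers, five instances sit without a seat while the slots still cover ample serving capacity, which is the harmless state the fleet was in after the morning incident.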

However, over the course of the day, a problem developed. The program which synced the host list generated by consul-template with the HAProxy server state had a bug. It always attempted to find a slot for new webapp instances before it freed slots taken up by old webapp instances that were no longer running. This program began to fail and exit early because it was unable to find any empty slots, meaning that the running HAProxy instances weren’t getting their state updated. As the day passed and the webapp autoscaling group scaled up and down, the list of backends in the HAProxy state became more and more stale.
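The ordering problem is easier to see in a toy model. The sketch below is a simplification written for this article, not the actual sync program: slots are a list of hostnames (or `None` when empty), and the function names and host names are hypothetical. One function allocates before freeing, as described above; the other frees first.

```python
class NoFreeSlot(Exception):
    """Raised when a new host needs a slot and none is empty."""


def sync_allocate_first(slots: list, live_hosts: set) -> list:
    """Place new hosts first, then free dead ones (the failing order)."""
    slots = list(slots)
    for host in sorted(live_hosts - set(slots)):
        if None not in slots:
            # Exits before any dead host is ever removed.
            raise NoFreeSlot(host)
        slots[slots.index(None)] = host
    return [h if h in live_hosts else None for h in slots]


def sync_free_first(slots: list, live_hosts: set) -> list:
    """Free slots held by dead hosts first, then place new hosts."""
    slots = [h if h in live_hosts else None for h in slots]
    for host in sorted(live_hosts - set(slots)):
        if None not in slots:
            break  # genuinely full: remaining hosts simply wait for a seat
        slots[slots.index(None)] = host
    return slots


if __name__ == "__main__":
    full_slots = ["web-a", "web-b", "web-c"]   # every slot occupied
    live = {"web-b", "web-c", "web-d"}         # web-a is gone, web-d is new
    try:
        sync_allocate_first(full_slots, live)
    except NoFreeSlot as exc:
        print(f"sync aborted, no slot for {exc}")
    print(sync_free_first(full_slots, live))
```

In this model, once every slot is occupied the allocate-first version aborts on every run, so the dead `web-a` entry is never cleared and the state can only grow staler.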

By 4:45pm Pacific, most HAProxy instances were only able to send requests to the set of webapp backends that had been up since the morning, and this set of older webapp backends was now a minority of the fleet. We do regularly provision new HAProxy instances, so there would have been a few fresh HAProxy instances that had correct configuration, but most of them were more than eight hours old and therefore were stuck with full and stale backend state. The outage was eventually triggered at the end of the business day in the US because that’s when we begin to scale down the webapp tier as traffic drops. Autoscaling will preferentially terminate older instances, so this meant that there were no longer enough older webapp instances remaining in the HAProxy server state to serve demand.

Figure 4: HAProxy state has grown stale over time and references mainly deprovisioned hosts

Once we knew the cause of the failure, it was resolved quickly with a rolling restart of the HAProxy fleet. After the incident was mitigated, the first question we asked ourselves was why our monitoring didn’t catch this problem. We had alerting in place for this precise situation, but unfortunately, it wasn’t working as intended. The broken monitoring hadn’t been noticed, partly because this system ‘just worked’ for a long time and didn’t require any change. The wider HAProxy deployment that this is part of is also relatively static. With a low rate of change, fewer engineers were interacting with the monitoring and alerting infrastructure.
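The article does not describe how the intended alert was built, so the following is only one possible shape for such a check, under stated assumptions: compare the hosts HAProxy holds in its slots against the hosts service discovery reports as live, and alert when too many slots point at hosts that no longer exist. The function names and the 20% threshold are hypothetical.

```python
def stale_slot_fraction(slot_hosts: list, live_hosts: set) -> float:
    """Fraction of occupied slots that reference hosts no longer running."""
    occupied = [h for h in slot_hosts if h is not None]
    if not occupied:
        return 0.0
    stale = sum(1 for h in occupied if h not in live_hosts)
    return stale / len(occupied)


def should_alert(slot_hosts: list, live_hosts: set, threshold: float = 0.2) -> bool:
    """Alert when the stale fraction exceeds an (assumed) threshold."""
    return stale_slot_fraction(slot_hosts, live_hosts) > threshold


if __name__ == "__main__":
    slots = ["web-a", "web-b", "web-c", "web-d"]
    live = {"web-a", "web-e"}  # most slot entries have been deprovisioned
    print(stale_slot_fraction(slots, live), should_alert(slots, live))
```

A check like this is only useful if it is itself exercised regularly, which is the point the paragraph above makes about rarely touched monitoring.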

The reason that we haven’t been doing any significant work on this HAProxy stack is that we’re moving towards Envoy Proxy for all of our ingress load-balancing (we’ve recently moved our websockets traffic onto Envoy). While HAProxy has served us well and reliably for many years, it also has some operational sharp edges, of exactly the kind highlighted by this incident. The complex pipeline we use to manipulate HAProxy server state will be replaced by Envoy’s native integration with an xDS control plane for endpoint discovery. The most recent versions of HAProxy (since the 2.0 release) also solve many of these operational pain points. However, Envoy has been our proxy of choice for our internal service mesh project for some time, and this makes a move to Envoy for our ingress load-balancing attractive. Our initial testing of Envoy + xDS at scale is very exciting and this migration should improve both performance and availability going forward. Our new load-balancing and service discovery architecture is not susceptible to the problem that caused this outage.

We strive to keep Slack available and reliable, and in this case, we failed. We know that Slack is a critical tool for our users, and that is why we aim to learn as much as we can from every incident, whether customer visible or not. We apologize for the inconvenience this outage caused and will continue to use the lessons learned to drive improvements in both our systems and our processes.
