Engineering incidents happen. What’s your plan for when they do?
Broken systems and outages are the stuff of engineering nightmares. But incidents happen, no matter how hard you try to prevent them. The important question is, how can organizations build the resilience to navigate tech emergencies, keep teams calm under pressure, and create sustainable processes to protect engineers from burnout?
As part of our series on carving a modern engineering org out of an enterprise, LeadDev brought together a group of senior engineering leaders to discuss how their organizations respond when things go wrong, what their biggest challenges are relating to incident management, and how they are finding potential solutions.
To open the session, Laura Nolan, Senior Staff Engineer at Slack, gave a presentation on a particularly bad day at Slack in May 2020, when their platform experienced a significant outage. Walking attendees through the story, she shared how the team dealt with the issue through their incident management process, and what they learned from the experience. Here are her key takeaways:
- Incidents happen. With the best will in the world, things can and do go wrong when working with complex systems.
- A solid incident management process is essential. It helps engineering teams solve issues in a more deliberate, sustainable manner, and minimizes the impact on the customer.
- Incident management is most helpful in complex incidents with many people and a lot of pressure.
- Preparation is key. It takes a huge amount of planning to be ready to swing into action. Planning should include mitigations for certain kinds of scenarios, as well as making sure there is an organizational process structure in place for how to manage incidents.
After Laura’s presentation, the group moved into a discussion around their own experiences. They shared the biggest challenges related to incident management and identified the common obstacles or behaviors that slow things down. They also considered how organizations can overcome these challenges, reflecting on effective solutions already in place, and exploring potential new ideas which could help. Here are the key problems and solutions they identified:
1. Engineers knowing how to respond
Problem: Your system is down, and nobody knows (or remembers) the correct steps to flag and fix the issue. Your engineers are panicking – which doesn’t help them to think clearly. How should they respond? Who should they tell? And through which channel?
Solutions: Identify an incident commander. This is the key person in the process who deals with facilitation, coordination, and decision making. Having one person in this role can help to solve issues significantly faster and with less stress, and is especially important for major, multi-layered incidents. It is also good for everyone’s safety, and for the long-term sustainability of running a complex system. Incident commanders should be fully empowered to have the first and last word on everything, from who should be involved, to the details of the technical process. They are the boss, regardless of who else is in the room (including senior leaders).
For the longer term, training is important. Educate your team so that they understand the different roles in an incident management process (e.g. commander and responders); the expectations and behaviors of each role; and of course, which role they will be taking on. Document the process, but don’t forget to practice too; a good way to do this is with game days where you simulate incidents and review your team’s response. Educate people thoroughly and regularly, because when they panic, they forget things!
You can also try rotating the roles within the incident management process so that folks don’t get too comfortable as a commander or responder. This gives everyone an opportunity to get multiple perspectives, makes sure everyone is involved, and protects against complacency.
2. Taking ownership over systems
Problem: Most systems last longer than the tenure of the people who built them. Making sure these systems are properly owned and monitored is a challenge, especially if there is a high turnover of folks working on them. If nobody is working with them regularly, they can’t develop the feel for when things are working and catch when things go wrong.
Solutions: In the immediate term, keep the lights on for legacy systems, and make sure that people with the right knowledge are on the team. Looking ahead, sharing this knowledge is key, so make sure all information is well-documented and try pairing folks who are in the know with new people onboarding to systems.
Attendees also agreed that companies need to recognize the value of legacy systems. By making sure that engineers maintaining old systems receive the same recognition and promotions as those launching shiny new systems, you can motivate your engineers to take ownership of legacy projects, and maximize your chances of somebody catching when something goes wrong.
3. Getting the right people in the room
Problem: Having too many people in the room (real or virtual) creates unhelpful noise. If it’s unclear who needs to be there – and who is in charge – things get overwhelming quickly. There’s also a potential mirror problem when you have too many folks, but not the right ones; sometimes you might need input from marketing, customer success, legal, or product, but they aren’t part of the process.
Solutions: If your incident commander is fully empowered, this shouldn’t be a problem. If they decide someone needs to be on the call, that person joins. If they deem a team not relevant to the incident, they should excuse it. As part of your training, explain to folks that nobody needs to be offended by this – it’s not personal, it’s essential.
That’s not to say you shouldn’t be inclusive and transparent (you should). To do this, you need a good input and output channel concept. For example, you could have one place for people on the inside to discuss the incident (like a Slack channel or Zoom call), and another with read-only information to update everyone else on the status of the incident (like a separate Slack channel).
For the long term, you can also focus on building best practices and etiquette. Teach people to self-select out of an incident process if they feel they aren’t contributing enough. Remind them that this isn’t laziness or abandonment – it’s helpful! And work towards building a respectful, open culture in which folks can ask each other to step back without making each other uncomfortable.
Lastly, make sure that marketing, legal, product, and customer success teams are empowered to use incident tooling, raise incidents, and participate along with engineering. This should be an important part of your training efforts.
4. Feeling empowered to raise incidents
Problem: It can be very difficult for engineers to feel empowered to declare that something has gone wrong. They may feel afraid of being blamed or lack the psychological safety to admit to making a mistake. This can lead to further problems during post-incident follow-ups if folks are unable to honestly reflect on what went wrong.
Solutions: Psychological safety is critical, so reassure your team that they won’t be blamed – and follow through on your promises. Instead of responding with stress, try thanking people when they report an issue. Building a shared community across different departments can also help to foster an open, honest, and communicative culture. And don’t forget to build your engineers’ confidence by recognizing their successes and giving them new career opportunities. If your team is happy, they will feel empowered, and your software and customers will benefit.
The conversation concluded with an agreement that you can’t stop incidents from happening – when working with complex systems, it’s inevitable that something will eventually go wrong. But with the right planning, and an effective incident management process in place, you can make sure that teams know what to do during a crisis. By recognizing the value of legacy system maintenance, you can motivate teams to take ownership and catch incidents before they develop. And by listening to engineers and practicing blamelessness, you can ensure that more incidents are reported at the earliest moment – and perhaps most importantly, that teams don’t burn out in the process.