Uptime matters, but so do your people. At Intercom, keeping our product online and working well at all times is critical to the success of our business.
Out-of-hours on-call is inherently disruptive to your life as an engineer. You need to be ready to respond quickly and competently to an alert that something is broken. This means having a decent Internet connection, a computer, power for the computer, whatever you’re using for 2FA, and your passwords available. We realized that we had ended up with an on-call setup that we weren’t proud of and that had a number of problems to solve. There were too many people on-call at any one time. The quality of alarms and runbooks was inconsistent across teams, and the review processes for new and existing alarms were ad hoc. We decided to address these problems by creating a new virtual team, made up of volunteers, not conscripts, from teams across the engineering organization, that would take over all out-of-hours on-call work. This talk covers the process we applied, the positive impact on our on-call, and the lessons we learned.