Many organizations take time to set and realign on department and company-wide goals at the beginning of every quarter.
As engineering leaders, we’re often handed company objectives (as in Objectives and Key Results, aka OKRs) that demand the on-time delivery of product features without impacting availability or performance. The best way to achieve these ambitious goals is by making sure everyone understands how their work connects back to the overall company vision.
When engineering work and constraints aren’t aligned with business objectives, it leads to miscommunication and misunderstanding during the goal-setting process, which often results in meaningless goals.
So how can we create alignment when it seems engineering and business folks aren’t all speaking the same language?
Service-level objectives (SLOs) have been gaining popularity because they provide a common language between business stakeholders and engineers to ensure they’re aligned on setting and achieving goals. However, getting started involves more work than just setting targets. In this post, I’ll provide some examples of how to (and how not to) use SLOs in setting goals, shared or otherwise.
What are SLOs?
A lot has been written about service-level objectives (SLOs), service-level indicators (SLIs), service-level agreements (SLAs), SLO burn alerts, and more. As a quick refresher:
- SLOs are the negotiated availability agreements you set internally
- SLIs are how you measure SLOs
- SLAs are how those agreements are communicated externally to your customers
SLOs should be expressed as business goals for service reliability; in other words, they measure your service’s customer experience. For example, your business might have a strategic goal expressed as ‘providing the fastest website experience possible.’ The technical translation would then be ‘a user should be able to load our home page and see a result quickly.’ For this example, an appropriate SLI might:
- Identify qualified events by looking for events where request.path = ‘/home’.
- Treat a qualified event as successful when duration_ms < 100.
Meanwhile, the SLO target might be that during any trailing 30-day period, we want 99.9% of all events (as qualified by the SLI) to be successful. Every event slower than the threshold is considered an error and counts against your error budget.
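To make the example concrete, here is a minimal Python sketch of how that SLI and SLO target could be evaluated over a batch of events. The event dictionaries, the function names (is_qualified, is_successful, evaluate_slo), and the shape of the result are assumptions made for illustration, not any particular tool's API. The sketch also assumes the events passed in have already been limited to the trailing 30-day window.

```python
SLO_TARGET = 0.999          # 99.9% of qualified events should succeed
LATENCY_THRESHOLD_MS = 100  # success criterion taken from the SLI above


def is_qualified(event):
    """An event counts toward the SLI only if it is a home page request."""
    return event.get("request.path") == "/home"


def is_successful(event):
    """A qualified event succeeds when it finishes under the threshold."""
    return event["duration_ms"] < LATENCY_THRESHOLD_MS


def evaluate_slo(events, target=SLO_TARGET):
    """Summarize SLO compliance and error budget for a batch of events.

    Assumes `events` only contains events from the SLO window
    (for example, the trailing 30 days).
    """
    qualified = [e for e in events if is_qualified(e)]
    total = len(qualified)
    failed = sum(1 for e in qualified if not is_successful(e))

    # The error budget is the number of qualified events allowed to fail.
    allowed_failures = total * (1 - target)
    compliance = (total - failed) / total if total else 1.0

    return {
        "qualified": total,
        "failed": failed,
        "compliance": compliance,
        "met": compliance >= target,
        "budget_remaining": allowed_failures - failed,
    }
```

With 10,000 qualified events in the window, a 99.9% target leaves room for about 10 slow events. If 4 events miss the threshold, compliance is 99.96% and roughly 6 events of budget remain. Events on other paths are ignored, no matter how slow they are.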
Different parts of your business – and different parts of your stack – will have different targets. There’s more depth and detail to SLOs, but you can see how the example enables explicit agreements with business stakeholders that determine critical paths and necessary investments, allows engineers to clearly understand priorities and how their work impacts business goals, and gives managers the tools needed to set expectations with both groups.
Getting started with SLOs
Many teams can’t wait to implement SLOs because of the order-of-magnitude impact they have on team efficiency by reducing alert fatigue and boosting productivity. So they’re tempted to set a few ‘reasonable-sounding’ (yet still actually arbitrary) targets just to get started down the SLO path. But it takes time to reorient, learn new processes and practices, and build the new team muscle memory necessary to set the right targets or respond when error budgets have been exceeded.
Instead, when first approaching SLOs, I recommend taking a quarter or two to gather data. You should use this time to understand how both planned and unplanned operations impact your SLO error budgets. That, in turn, informs how service availability maps to business performance indicators. With that data in hand, you can then negotiate targets with business stakeholders and engineering leaders. Doing all that with three to five services is a pretty significant goal.
Keep in mind that your SLOs don’t need to be perfectly defined from the get-go – they can always be improved. Identify the services that are most critical for your business outcomes, gather some early data, negotiate as needed, implement your SLOs, and improve them iteratively as you go.
How to gather data for SLOs
There are no universal thresholds for SLO targets. Even if there were, it’s important to do what’s right for your particular situation. SLOs should integrate into your company goals once you establish the right targets for your particular services. When implementing SLOs, it’s important to start by gathering data for a small set of pilot services before committing to any targets baked into your goals.
Here are steps to help you figure out your own SLO starting points.
- First, prioritize your reliability needs by the customer value they provide. If your website and API components currently all share the same availability goal, start there: separate the various functions and rank their reliability needs, because not all of your services will need the same level of availability.
- Next, meet with customer-facing teams to validate those findings against the technical and business functions that have the highest customer expectations for availability. The data you gather in this step informs the possible size of your error budget. Consider any maintenance windows defined in your SLAs, contracts, or terms of service. If you don’t have maintenance windows for your services today, now is the time to change that.
- Then, have a conversation with business stakeholders and product management to set an error budget. You need to be clear on both availability targets and on what happens when those targets are in danger of being missed. For instance, feature development work may need to be paused in favor of work that improves availability when that happens. Define, discuss, and negotiate those various scenarios upfront to avoid unpleasant surprises later. Be explicit in your negotiations, and document commitments in a contract.
- Last, define how your team reports on how your error budget is spent. The error budget is how much time your stakeholders have agreed that your service can be away from its job, so to speak.
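When the error budget is expressed as time, as in the last step, the arithmetic behind it is simple enough to sketch. The Python below is a minimal illustration under stated assumptions: the function names (error_budget, budget_status) are hypothetical, the 30-day window and 99.9% target are example values rather than recommendations, and downtime is assumed to be tracked as a single total for the window.

```python
from datetime import timedelta


def error_budget(target, window=timedelta(days=30)):
    """Return the downtime allowed by an availability target over a window."""
    if not 0 < target < 1:
        raise ValueError("target must be between 0 and 1, exclusive")
    return window * (1 - target)


def budget_status(target, downtime, window=timedelta(days=30)):
    """Return (remaining budget, fraction of budget consumed so far)."""
    budget = error_budget(target, window)
    remaining = budget - downtime
    consumed = downtime / budget  # timedelta / timedelta gives a float
    return remaining, consumed
```

Under these assumptions, a 99.9% target over 30 days allows roughly 43 minutes of unavailability. After 30 minutes of downtime, about 69% of the budget is spent and about 13 minutes remain. A consumed fraction like this is the sort of figure a team could include in its error budget report, and the kind of threshold stakeholders might agree on in advance for pausing feature work.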
SLOs will help you set goals that are meaningful to business stakeholders and engineers. To briefly recap, if you’re new to SLOs, you can figure out your own starting points for SLO targets by prioritizing service reliability needs by their customer value; meeting with customer-facing teams to validate those reliability needs; seeking agreement from business and product stakeholders on an error budget contract; and reporting on how your service’s error budget was used.
Finally, pairing SLOs with observability practices can help your company build competitive advantages, boost customer loyalty, deliver features faster, get better service reliability, and help set more effective company goals.