I am here to tell you a story – the story of what happened when we realized: this product is too big for us to build ourselves.
The choice of build vs. buy can be a difficult one to navigate in your organization. In this article, I will walk you through our journey at Xero from using a struggling, internally-built product to achieving velocity with an external solution. I will share the things that worked well for us, and the things we got a little bit wrong, so that you may have the opportunity to get them right.
When we were using our built-in-house tool, we used to dread production releases. Through being honest with ourselves, understanding our tool’s scalability issues, and investing in external experts to build the product we needed to thrive, we’re now much more comfortable with making bold changes to our architecture at scale.
Xero has 3,000 staff worldwide, with 1,000 of those people in the product team alone. So, on any given day, we’re shipping a lot of software. Let’s go back to 2015, when it would take us around 4–5 hours to release and we managed the process through a giant calendar. A team would look at the calendar and decide where in the week they wanted to book their release slot – but with lots of teams working on a big software product, the calendar easily became congested.
This was a big problem. If you were halfway through a release and your allotted time was running out, you’d have to vacate the pipeline so the next team could have their turn. Then, when one data science team was working on a brand new feature and staring down the prospect of having to wait two weeks to find a slot in the calendar to deploy, they thought: this is not good software engineering. They realized they needed to do something else entirely.
The initial solution
The data science team landed on decoupling their deployment events from their release events to fix the problem; this was the creation of Xero’s first feature management API.
The web application would call out across the network to Xero’s feature API, and it would ask, ‘Hey, can the user see this feature? Yes or no?’ and the feature service would respond back to the web application.
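To make the idea concrete, here is a minimal sketch in Python of that yes/no question. It is an illustration only, not Xero’s actual service: the `FeatureService` class, its method names, and the `new-dashboard` flag are all hypothetical, and the network call is replaced by an in-memory lookup so the example runs on its own.

```python
from dataclasses import dataclass, field
from typing import Dict, Set


@dataclass
class FeatureService:
    """Hypothetical stand-in for a central feature API (not Xero's real service)."""

    # Maps a feature name to the set of user IDs allowed to see it.
    enabled_for: Dict[str, Set[str]] = field(default_factory=dict)

    def can_see(self, feature: str, user_id: str) -> bool:
        """Answer the yes/no question the web application asks."""
        return user_id in self.enabled_for.get(feature, set())

    def release_to(self, feature: str, user_id: str) -> None:
        """Release a feature to one more user without redeploying the app."""
        self.enabled_for.setdefault(feature, set()).add(user_id)


def render_dashboard(service: FeatureService, user_id: str) -> str:
    """The new code path is already deployed; the flag decides who sees it."""
    if service.can_see("new-dashboard", user_id):
        return "new dashboard"
    return "old dashboard"


service = FeatureService()
before = render_dashboard(service, "user-1")   # flag is off for this user
service.release_to("new-dashboard", "user-1")  # a release, not a deployment
after = render_dashboard(service, "user-1")    # same code, new behaviour
```

The point of the sketch is the separation: `render_dashboard` ships once, and the release happens later by changing data in the feature service.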
This was great at first because that web application took 4 hours to release, and Xero’s feature system took 25 minutes to release. This gave us a real boost in productivity. And it became very popular.
Before long, all of the applications Xero looked after in our stack were using this feature management API to do feature toggling at scale, and the feature service ended up under an incredibly high load.
Two interesting things happened as a result of this:
- The availability of our applications was really tightly coupled to the uptime of this feature system. If something went wrong, everyone felt that pain.
- The data science team now had two full-time jobs: ordinary data science work, and managing/maintaining the feature management system.
We had an overworked team and an overworked system. There was only one way this was going to end.
It wasn’t uncommon for the feature system to take over 20 seconds to respond in some cases. This meant a user wasn’t going to be getting the best experience.
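One common way to limit this kind of coupling is to put a timeout and a safe default around the lookup. The sketch below is an assumption for illustration, not something we describe doing at the time: `slow_feature_lookup`, `can_see`, and the timeout values are hypothetical, and the slow service is simulated with a short sleep.

```python
import time
from concurrent.futures import ThreadPoolExecutor, TimeoutError as FutureTimeout

# Shared worker pool for flag lookups.
_pool = ThreadPoolExecutor(max_workers=4)


def slow_feature_lookup(feature, user_id):
    """Hypothetical remote call; the sleep simulates an overloaded service."""
    time.sleep(0.5)
    return True


def can_see(feature, user_id, lookup, timeout_s=0.05, default=False):
    """Ask the feature service, but fall back to a default if it is too slow."""
    future = _pool.submit(lookup, feature, user_id)
    try:
        return future.result(timeout=timeout_s)
    except FutureTimeout:
        # The user gets the safe default instead of waiting on the service.
        return default


started = time.perf_counter()
answer = can_see("new-dashboard", "user-1", slow_feature_lookup)
elapsed = time.perf_counter() - started
```

Here the caller gives up after 50 milliseconds and shows the existing experience, so a slow feature service degrades one feature rather than the whole page.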
What started out with a lot of early enthusiasm gave way, over time, to fear and frustration. Fear, because this feature system had actually blown up a couple of times; we’d had some quite visible incidents that had kept the whole team up at night. Frustration, because of staffing changes: the people who initially worked on this feature system moved on to other roles, internally and externally, and active development on the solution stagnated.
The search for a better tool
One day, one of the engineers I worked with came to me with a problem. He said, ‘Look, we’ve built this new version of our developer documentation and we can deploy this application really quickly – in about 10 minutes. But I don’t want to wait 30 minutes for the Xero feature system to finally release the feature out to customers. It’s slowing me down; it’s a real drag. Is there any other way we can do this? We need a better tool.’
Our built-in-house solution hadn’t kept pace with the innovation happening in the market. If you’re building a modern serverless application that you can deploy in 30 seconds, you don’t want to be waiting 30 minutes for a feature change to roll out the door.
We had certain principles in mind as we were searching for a new feature management solution. The first being: the tools at work should be as good as the tools at home. Tools are getting better all the time; if we want people to come and work on our problems, we need to make sure that the tools we offer are keeping pace with best practices in the industry.
The second came from Rich Archbold’s article, Run less software. Time is critically short; in a SaaS business, you’ve only got time to focus on your unique customer edge. You can’t spend time building tools and all the supporting infrastructure you need. It’s much better to buy those things from the market so you can work at a higher level of abstraction. We want to be in the business of building the house, rather than having to build the hammer, the nail, and the brick as well. This can sometimes cause friction among the most self-reliant of folks, and so, as I did, you may have to go on a little internal engagement campaign. Start the conversation and ask your team, ‘How could we use “buying” as a software delivery strategy?’
The third principle is this: to make big changes, you’ve got to start small. You’re going to have some senior stakeholders in your organization. They will probably have had a technology decision backfire on them – publicly and expensively. So, naturally, they’ll be reluctant and resistant when you come in and say, ‘Hey, we should replace that thing that you’ve just spent all this time and money building with an off-the-shelf solution.’
You need to do your homework and find out if a new solution is actually going to be better for your environment. One approach to a minimum viable product is this: if you were to get a wedding cake made, you’d go to a bakery and try their cupcake. If you liked the cupcake, you’d then take things further and invest in the wedding cake.
This was the approach we took when we reached out to LaunchDarkly and started working on a proof of concept in Xero.
Built vs. bought: the test
The first thing we put LaunchDarkly to work on in Xero was powering our single sign-on solution: Sign on with Xero. But this was also a brand new stack; there was no traffic going through it yet and we hadn’t exposed it to anyone. This meant that it was very low risk for us to try things out in this way (the cupcake).
We then lined up some third-party security testers (a.k.a. professional hackers) to come and have a go at breaking the software. This was a great opportunity for us to validate LaunchDarkly and ensure that we were going to be dealing with the right kind of security concerns.
We did some load testing, and we figured out that this was going to run at production scale for us. We were then able to use the tool to onboard one or two test customers to try out this new feature, and this was very successful for us.
The next thing was that we were really focused on creating a great developer experience for our internal engineers, especially when it came to on-call. Now, if you’ve ever managed an API platform before, you’ll know that you’ll want to have lots of switches and levers you can pull to be able to shed load really quickly when things go bad; it only takes one really badly-behaved integration to cause real headaches. Using LaunchDarkly’s kill switch capability gave us ways to quickly switch off misbehaving integrations. With this tool, engineers are able to identify the misbehaving app, drop their ID into LaunchDarkly, and solve that potentially disastrous problem very quickly – essentially in under a second. That’s a DevOps superpower.
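As a rough sketch of the kill switch idea, here is a small Python example. It does not use LaunchDarkly’s SDK; `KillSwitch`, `handle_request`, the app IDs, and the choice of status code 429 are all hypothetical, and the blocked list lives in memory where a real setup would read it from a flag.

```python
class KillSwitch:
    """Hypothetical in-memory flag holding the IDs of blocked integrations."""

    def __init__(self):
        # App IDs that should be refused at the edge of the API.
        self._blocked_app_ids = set()

    def block(self, app_id):
        """Drop a misbehaving app's ID into the switch."""
        self._blocked_app_ids.add(app_id)

    def unblock(self, app_id):
        """Let the app back in once it behaves again."""
        self._blocked_app_ids.discard(app_id)

    def is_blocked(self, app_id):
        return app_id in self._blocked_app_ids


def handle_request(kill_switch, app_id):
    """Return an HTTP-style status code for an incoming API request."""
    if kill_switch.is_blocked(app_id):
        # Shed load from this integration without touching anyone else.
        return 429
    return 200


switch = KillSwitch()
status_before = handle_request(switch, "app-123")
switch.block("app-123")
status_after = handle_request(switch, "app-123")
status_other = handle_request(switch, "app-456")
```

The check sits in the request path, so changing the list changes behaviour for the next request with no deployment in between.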
The third thing we did was validate some of the claims LaunchDarkly made about performance. Now, remember that we had been really burned by this in the past with our home-built system. So, this was to be the true test of a built product vs. a bought product. One of our teams that looks after a pretty important internal API decided to put all sorts of timers into the application so they could measure how quickly LaunchDarkly was performing for us.
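A minimal sketch of what such a timer might look like in Python follows. The team’s real instrumentation is not shown in this article, so `timed` and `evaluate_flag` are hypothetical, and the flag evaluation is a simple in-memory lookup standing in for whatever call is being measured.

```python
import time
from statistics import mean


def timed(evaluate, *args):
    """Run one evaluation and return (result, elapsed microseconds)."""
    started = time.perf_counter_ns()
    result = evaluate(*args)
    elapsed_us = (time.perf_counter_ns() - started) / 1_000
    return result, elapsed_us


def evaluate_flag(flags, feature, user_id):
    """Hypothetical in-memory evaluation; stands in for any SDK call."""
    return user_id in flags.get(feature, set())


flags = {"new-dashboard": {"user-1"}}
samples = []
for _ in range(1_000):
    result, elapsed_us = timed(evaluate_flag, flags, "new-dashboard", "user-1")
    samples.append(elapsed_us)

# Average and worst case are the two figures worth reporting.
summary = {
    "mean_us": mean(samples),
    "max_us": max(samples),
    "count": len(samples),
}
```

Recording both the mean and the maximum matters: the average tells you what most requests see, while the maximum shows the worst moment a user could hit.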
The results were amazing. We learned that the average feature evaluation time was measured in microseconds. The slowest ever response time that we saw was still under one second. It was such a far cry from the world we were living in before with our previous home-built system.
Making the case
We took all this information, put it into a big business case, and took it to the stakeholders to really fight for taking our relationship with LaunchDarkly further. We also included direct quotes from engineers, e.g., ‘I never want to use the old feature system again. Please let me use a great tool like this.’ This really helped us get the business case across the line.
Reaching escape velocity
I would love to be writing here that we’ve managed to complete migration, that we’ve been able to get rid of our old, bad feature system entirely and move to this great, new world. But you know as well as I do that in an enterprise environment, that’s just not going to happen.
You’ve got far too many teams and roadmaps that would be disrupted by that kind of migration. So, we haven’t completely escaped the old world yet. But I think we are starting to reach an escape velocity. Let’s take a look at some statistics.
12 months since LaunchDarkly
- Flag changes: 1,680 on the legacy system (15%), 9,573 on LaunchDarkly (85%)
- Feature evaluations: 35b on the legacy system (46%), 42b on LaunchDarkly (54%)
We can see here that, 12 months after rolling out LaunchDarkly, it is now responsible for 85% of all feature management activity within Xero. We are also doing five times more feature management than we were with our old system in the previous year. This is a testament to how having really good tools actually helps you to adopt this engineering practice – i.e., if you’re trying to do feature management, having a good solution means you will do more of it.
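The flag-change percentages can be checked with a few lines of Python. One assumption to flag: the statistics do not label which count belongs to which system, so the sketch assigns 9,573 to LaunchDarkly and 1,680 to the legacy system based on the 85% figure quoted above.

```python
# Flag changes over the 12 months (assignment to systems is an assumption).
legacy_changes = 1_680
launchdarkly_changes = 9_573
total_changes = legacy_changes + launchdarkly_changes

# Share of all flag changes handled by each system, as whole percentages.
launchdarkly_share = round(100 * launchdarkly_changes / total_changes)
legacy_share = round(100 * legacy_changes / total_changes)

# How many LaunchDarkly flag changes there were per legacy flag change.
changes_ratio = launchdarkly_changes / legacy_changes
```

Note that `changes_ratio` compares the two systems within the same 12 months, which is a different comparison from the year-on-year growth mentioned above.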
In July 2020, we hit the tipping point with traffic; we were serving more feature evaluations through LaunchDarkly than through our legacy system. Though we haven’t completely escaped it yet, I think we’re well on the way. And what we want to do is keep harnessing this momentum, to do even more amazing things with LaunchDarkly in the future.
Here are the key takeaways I want to share with you from our journey moving from internally built to externally built tools:
- Invest in your developer experience. Tools are continually getting better; if you want people to work on your problems, give them effective tools. This improves psychological safety and retention.
- Run less software. We tried building something ourselves and quickly realized that at that scale, it is really hard to do. Leave it to the experts.
- To make big changes, you’ve got to start small. Find a team and run a pilot program to start showing the value of a solution like an externally-built tool.
- Do your homework. Ease the reservations of stakeholders by measuring and gaining testimonials that will make the investment an easier sell.
- Build momentum. If you’re trying to run away from a horrible, old legacy system – run faster; show the organization all the amazing things you can do with a new solution.
Thinking back to that giant release calendar, it feels like a whole world away. Harnessing and building confidence in an external tool that is built to thrive under complexity enables us to make really bold changes to our platform at scale without fear. So, if your internally built systems feel as though they’re wading through treacle, why not give them, and your engineers, a new lease of life with an external solution and see what amazing things you can achieve?