Why do most migrations fail?
When I first started working as a software engineer at a financial services company, my first assignment wasn’t glamorous: I was asked to migrate a service from a legacy framework to a newer, more modern application framework. It was interesting as a new joiner in that it allowed me to dig into pieces of a large codebase that I didn’t understand well, but I quickly learned that I’d need to change lots of code and do so in a repetitive manner with only slight variance from module to module. It was the kind of mind-numbing work that lent itself to automation, and I was able to speed up my efforts by writing a series of refactoring scripts.
But before releasing my changes into production, I still had to feel confident that I hadn’t broken something unintentionally – the consequences of which were significant for a stock market trading application. Sadly, the application’s test coverage wasn’t great, and it turned out that something did break; we only realized it later, in the form of lost revenue and screaming on the trading floor.
From fast to slooooow
Fast forward to many years later and not much has changed. Engineering teams tend to spend a significant amount of their time migrating from old protocols, frameworks, languages, and infrastructure to newer, shinier equivalents. What I’ve seen in my experience as a software engineer and then as an organizational leader is that teams start off moving quickly, spinning up services, data pipelines, and machine learning (ML) integrations with relative ease and speed. Not only is there a lot of positive energy for the team to deliver on its mission, but there’s also no technical debt that the team has accumulated, and, hopefully, the team opts to use the latest industry standards, frameworks, and libraries.
The greenfield life is a good life! The team delivers on features, focusing on product innovation to support a growing part of the business. But after several months, warning signs begin to appear: a critical library will soon reach end-of-life, while others need to be updated to newer versions. What’s more, several pieces of the infrastructure need to be patched to ensure the application is appropriately hardened. The greenfield team has built and deployed a series of microservices (each month, it pushes out several more) and realizes the compounding effect these changes are having on its overall velocity. Suddenly, the rate of product development grinds to a halt as each of these migrations needs to be accounted for – and done repeatedly across a fleet of components.
This is a common story and one that continues to be repeated in companies of various sizes. What are some best practices to avoid this trap of technical debt and migration hell?
At Spotify, we’ve invested a lot of time in building up a Platform organization that helps create processes, guidance, and tools around migrations. We treat a large-scale migration that would affect our nearly 500 engineering teams as if it were a product launch, giving it the same level of care you would give a go-to-market plan for an externally facing product. This allows us to have a robust strategy for getting to 100% completion. I’ve written about some of the tools and tactics we use on our engineering blog.
But what if you don’t have the benefit of a platform organization to lead the way? What are some guiding principles that can help inform teams who want to strike a balance between moving fast and investing in foundations?
Migrations as learning opportunities
Every time you need to migrate a piece of code or configuration to a newer variant, it’s worth asking the question: how can this be avoided in the future? Here are some questions that can help guide these discussions:
- Are there minimal abstractions that would’ve reduced the cost of this migration? Abstractions are tricky: they can easily turn into half-hearted, poorly maintained layers that hide valuable features from engineers. So it’s worth asking whether there’s an abstraction that would not only reduce the cost of the migration but also improve the developer experience in the long run.
- What are some fleetwide refactoring capabilities that would allow us to automate much of the low-variance work? We’re starting to see a number of tools pop up in this space, and it’s worth looking to see if any of the existing tools could serve your needs or if it makes sense to build your own automation.
- What structures would have avoided the need to update code in multiple places (e.g. monorepos, or actively moving code from user repositories to system repositories)? This exercise is useful because it asks structural questions about the most basic part of code development – how information is organized and structured.
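To make the fleetwide-refactoring idea concrete, here’s a minimal sketch of a codemod-style script. The module names (`legacy_http`, `modern_http`) are purely illustrative assumptions, and the regex approach is a simplification – production codemod tools typically operate on syntax trees rather than text:

```python
import re
from pathlib import Path

# Hypothetical rename: both module names are illustrative, not real libraries.
LEGACY_IMPORT = re.compile(r"^from legacy_http import (\w+)$", re.MULTILINE)

def migrate_source(source: str) -> str:
    """Rewrite imports of the legacy module to its modern replacement."""
    return LEGACY_IMPORT.sub(r"from modern_http import \1", source)

def migrate_tree(root: Path) -> int:
    """Apply the rewrite to every Python file under root; return files changed."""
    changed = 0
    for path in root.rglob("*.py"):
        original = path.read_text()
        updated = migrate_source(original)
        if updated != original:
            path.write_text(updated)
            changed += 1
    return changed
```

The value of even a crude script like this is that the low-variance change is applied identically across the whole fleet, rather than hand-edited module by module.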
Of course, not all migrations are wasted work. If you’re rewriting a component in a new language, it might be best to intentionally not automate this work to give every opportunity to understand the particulars. So, much of this applies to migrations of code or infrastructure that’s largely behind the scenes (and typically should be, for good reason).
Migrations as pathways to better testing
Going back to my first job assignment, the platform I was working on had a suite of unit tests, but test coverage wasn’t good enough. There were some integration tests, but they suffered from a high degree of flakiness. There weren’t any contract tests, so any change at the API level was highly suspect. What’s worse, builds were flaky, and even running the unit tests took nearly an hour.
Instead of investing in a more robust testing platform, I jumped straight to automating the low-variance work. This meant I’d made many small changes across a massive codebase with little confidence in their quality. I now see migrations as opportunities to reconsider a system’s overall testing strategy.
If a new engineer were to join the team, would they be able to upgrade a dependent library with confidence that nothing would be broken post-deployment? Migrations can serve as a wake-up call to look deeper into our approaches to testing to ensure we have appropriate test coverage, consistent and reproducible builds, and the right level of integration testing between systems.
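As one example of the kind of safety net that makes such upgrades less scary, here’s a minimal contract-check sketch. The field names and types are hypothetical stand-ins for whatever your API actually promises to keep stable across dependency upgrades:

```python
# Hypothetical contract: the fields an API response must keep providing
# across dependency upgrades. Names and types are illustrative only.
REQUIRED_FIELDS = {"order_id": str, "symbol": str, "quantity": int}

def check_contract(response: dict) -> list:
    """Return a list of contract violations (empty means the contract holds)."""
    violations = []
    for field, expected_type in REQUIRED_FIELDS.items():
        if field not in response:
            violations.append(f"missing field: {field}")
        elif not isinstance(response[field], expected_type):
            violations.append(f"wrong type for {field}: {type(response[field]).__name__}")
    return violations
```

Run against real responses in CI, a check like this turns “did the library upgrade change our API behavior?” from a guess into an assertion.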
Migrations as pathways to improved reliability
Paradoxically, migrations can also serve as opportunities to examine the overall reliability of a system. I think of a migration as a kind of chaos event. It’s something that’s changing the internals or behavior of a system in a way that we oftentimes don’t fully understand until we roll out to production. As such, it raises important questions about our overall reliability strategy:
- Is there a culture of running disaster recovery tests within the team, interrogating and force-failing various components in the stack?
- Are teams able to safely deploy to production, knowing that canaries with proper analysis will alert on regressions?
- What’s the incident triage process? Are we set up to get alerted when there are issues?
- What’s the observability level of our system? Can we ask meaningful questions to understand the user experience pre- and post-deployment?
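As a sketch of the canary-analysis idea above, here’s a deliberately simple error-rate comparison between a baseline and a canary deployment. The 0.5% tolerance is an arbitrary assumption; real canary analysis tools use statistical tests rather than a fixed threshold:

```python
def canary_regressed(baseline_errors: int, baseline_total: int,
                     canary_errors: int, canary_total: int,
                     tolerance: float = 0.005) -> bool:
    """Flag the canary if its error rate exceeds the baseline's by more
    than `tolerance`. The max(..., 1) guards against division by zero
    when a deployment has served no traffic yet."""
    baseline_rate = baseline_errors / max(baseline_total, 1)
    canary_rate = canary_errors / max(canary_total, 1)
    return canary_rate - baseline_rate > tolerance
```

Even this crude comparison captures the core idea: a migration rolls out first to a small slice of traffic, and an automated check – not a human watching dashboards – decides whether it’s safe to proceed.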
This is all to say that migrations can serve as learning and discovery opportunities. They can help us understand whether the choices we’ve made in building our system optimize not only for the short term but also for the long-term health of our products. Specifically, they can help us examine our testing strategy, our culture of reliability, and the architectural choices around sustainable infrastructure. So, the next time there’s a need to upgrade a framework, a library, or something else in your system, I invite you to treat it as a learning opportunity: how can this migration improve the health of my system, my team, and our engineering culture?