Early Sunday, March 1, I received a Slack notification from my director. Could I be available for a meeting later that morning to address suggestions for scaling our virtual visits platform to handle ten times the load within a month?
We work for Providence Health and Services, a large non-profit healthcare system that serves the West Coast. Not long before that meeting, the first known US death from COVID-19 had been announced in Washington State. Providence was assembling its resources to prepare for a rapid increase in patients. Our product, DexCare, was central to that response, especially the on-demand virtual visit component.
Our systems had been performing well under relatively low but steadily increasing load. We were enabling our healthcare providers to see around 70 virtual visits a day over a 12-hour operating shift, and we were working with a second health system to spin up the same services for them.
Our busiest day prior to COVID-19 was 120 visit requests. We would triple that by the end of the week. We would multiply the load by ten in just over two weeks.
Understanding your system
To scale your system, you need to understand how it works and who uses it.
Our virtual visit platform has three main user groups: patients, providers, and support staff.
The main components for patients are the visit intake and virtual visit UI, both of which are available on web, iOS, and Android.
Providers use the same web UI for virtual visits. Providers and support staff, who we call caregivers, have the caregiver portal for visit information. Providers use the portal to prepare for and start visits, and staff use it to coordinate with providers, as well as chat with patients outside of the actual visit.
The backend consists of several microservices running in a Kubernetes cluster. Third-party systems include the electronic health record (EHR) system, databases, and browser and mobile push notifications.
We were in the midst of an effort to simplify our overall architecture. This included reducing the number of services and separating some functionality into new services. We were also in the middle of replacing our custom-built WebRTC platform with a third-party offering.
Use data to identify your biggest bottlenecks
Our biggest concern around scaling was making sure caregivers could easily keep track of their patients.
We focused first on how we had built multi-chat, the feature that lets support staff communicate with multiple patients who are waiting in the queue to be seen by a provider. We suspected we would overtax the underlying signaling services, and we began to think through possible improvements to the implementation.
Meanwhile, we had a second team creating load tests: something we had always wanted but hadn’t yet prioritized.
Our QA team already had a suite of automated tests that simulated most of the patient and provider workflow. With some scripts and several laptops, we set up our first load test within a few hours to see if we could handle 100 concurrent visits.
The results were not as disastrous as we feared, but they showed we had a lot of work to do. They also showed that the multi-chat system would scale just fine. Our real issue was the way we fetched the visit and patient data that the caregiver portal renders to help our caregivers keep track of patients.
With that information, we were able to stop worrying about the chat component and focus on the data load performance. Now that we had reproducible load tests, we could validate our fixes. We were able to reduce both the latency of key calls and the total number of calls needed.
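To illustrate the idea of a reproducible load test, here is a minimal Python sketch. It is not the harness described above: `simulate_visit` is a hypothetical stand-in for one scripted patient and provider workflow, and it only sleeps for a short random time so that the example runs anywhere. A real test would drive the UI or call the backend instead.

```python
import asyncio
import random
import statistics
import time


async def simulate_visit(visit_id: int) -> float:
    """Hypothetical stand-in for one scripted visit workflow.

    Returns how long the simulated visit took, in seconds.
    """
    started = time.perf_counter()
    # Assumption: a short random sleep replaces the real UI or API calls.
    await asyncio.sleep(random.uniform(0.01, 0.05))
    return time.perf_counter() - started


async def run_load_test(concurrent_visits: int) -> dict:
    """Start all visits at once and summarize their latencies."""
    latencies = await asyncio.gather(
        *(simulate_visit(i) for i in range(concurrent_visits))
    )
    ordered = sorted(latencies)
    return {
        "visits": len(ordered),
        "median_s": statistics.median(ordered),
        # Nearest-rank style approximation of the 95th percentile.
        "p95_s": ordered[int(0.95 * (len(ordered) - 1))],
        "max_s": ordered[-1],
    }


if __name__ == "__main__":
    print(asyncio.run(run_load_test(100)))
```

Running the same script before and after a change gives comparable numbers, which is what makes it possible to say whether a fix actually reduced latency.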
We now continue to use load tests to identify bottlenecks and to validate our changes to make sure we’re improving our system and not introducing defects.
Monitor the health of the whole system
Having monitors is great. But they’re of no use if you are not looking at them.
Using the load tests to identify our biggest bottlenecks, we were able to quickly turn around several fixes to our platform that allowed us to handle the rapidly growing demand.
Then one morning, we started getting reports of things just not working quite right. Intermittent errors were being reported by caregivers and patients, but no single pattern emerged. We checked the dashboards we had created and could see one of our critical services was thrashing about quite a bit. Latency was climbing, causing downstream calls to time out.
Finally it occurred to us to check the monitors for our service's database. The CPU had been climbing to 100% and staying there. It had actually been spiking off and on for the past few days, and it had now hit a critical level where it couldn’t handle the load.
We were able to increase the size of the database server, and with just a small amount of downtime, the system was running smoothly.
We now continually review what monitors we have in place to make sure we’re watching all the parts of our system.
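The difference between a brief CPU spike and sustained saturation can be made explicit in an alert rule. The following Python sketch is an assumption about how such a rule might look, not a description of our monitoring setup. The function name, the threshold, and the window size are all hypothetical.

```python
from collections import deque
from typing import Deque, Iterable, List


def sustained_high_cpu(
    samples: Iterable[float], threshold: float = 90.0, window: int = 3
) -> List[int]:
    """Return the sample indexes at which the last `window` readings
    were all at or above `threshold` percent."""
    recent: Deque[float] = deque(maxlen=window)
    alerts: List[int] = []
    for index, value in enumerate(samples):
        recent.append(value)
        if len(recent) == window and min(recent) >= threshold:
            alerts.append(index)
    return alerts


# Isolated spikes (95, 97) do not alert; the sustained climb at the end does.
cpu_percent = [40, 95, 50, 97, 60, 92, 96, 99, 100, 100, 100]
print(sustained_high_cpu(cpu_percent))  # [7, 8, 9, 10]
```

A rule like this only fires once the problem is already serious. Counting the isolated spikes over several days would be a natural complement, since those were the early warning in our case.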
Pay off your tech debt now
Often the critical fix to your scaling issue is that tech debt ticket that’s been lingering.
One of the factors compounding our database problems was that we hadn’t configured our service to take advantage of our highly available cloud database. Our system is read-intensive, but we weren’t actually using the reader; instead we were overtaxing our writer while our reader sat idle.
We’d long known we needed to modify our configuration to take advantage of the read replica; it was a story on our backlog that we had never prioritized.
Modifying the configuration was simple. But then our load tests started showing an intermittent error. The code would complain that a row didn’t exist, when we knew it had to.
In delaying the implementation of what we all agreed was a standard practice, we’d allowed a race condition to creep in. Our code would write a row to a table, and then soon after it would read that same row from the database and fail if it wasn’t there. Sometimes that read would happen before replication had propagated to the reader. We had to delay releasing the configuration change to take advantage of replication while we tracked down and fixed the code where this happened.
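The race is easy to reproduce in miniature. The Python sketch below uses in-memory dictionaries as a toy writer and a lagging reader; `LaggingReplica`, `VisitStore`, and their methods are hypothetical names invented for this illustration. It also shows one common way to handle a read that immediately follows a write, which is to fall back to the writer when the reader has no row yet. This is one option among several and is not necessarily the fix we applied.

```python
from typing import Dict, Optional


class LaggingReplica:
    """Toy reader that only sees writes after `replicate()` is called."""

    def __init__(self) -> None:
        self.rows: Dict[int, str] = {}
        self.pending: Dict[int, str] = {}

    def replicate(self) -> None:
        """Apply all pending writes, simulating replication catching up."""
        self.rows.update(self.pending)
        self.pending.clear()


class VisitStore:
    """Hypothetical data-access layer with one writer and one reader."""

    def __init__(self) -> None:
        self.writer: Dict[int, str] = {}
        self.reader = LaggingReplica()

    def write(self, visit_id: int, status: str) -> None:
        self.writer[visit_id] = status
        # The row is queued for the reader but not yet visible there.
        self.reader.pending[visit_id] = status

    def read_from_reader(self, visit_id: int) -> Optional[str]:
        return self.reader.rows.get(visit_id)

    def read_after_write(self, visit_id: int) -> Optional[str]:
        """Try the reader first, then fall back to the writer on a miss."""
        row = self.read_from_reader(visit_id)
        return row if row is not None else self.writer.get(visit_id)


store = VisitStore()
store.write(1, "waiting")
print(store.read_from_reader(1))  # None: replication has not caught up
print(store.read_after_write(1))  # waiting
store.reader.replicate()
print(store.read_from_reader(1))  # waiting
```

Note that this fallback only covers rows that are missing on the reader. It does not detect a row that exists there with stale values, so code that must see its own updates may need to read from the writer directly.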
We now have a better understanding as a team of how to code for database replication, and better data to evaluate the relative importance of other lingering tech debt. And we have the ability to horizontally scale our database by adding new readers when load increases.
The impact of non-technical constraints
One of the biggest constraints to scaling our system wasn’t our software. It was our people. Even more than our software, our caregivers were tested by the influx of patients.
A single provider can only see so many patients in a single shift. Before a provider can be added, they must be trained and licensed. For virtual healthcare in the US, providers must be credentialed in the state in which the patient is physically located at the time they are receiving the care.
While we were working to make our software scale, our health system was working to train new virtual providers and get their credentials in place so that they could join the provider pool.
In the short term, this reduced the maximum load we had to be able to support, but it exposed other areas in need of improvement.
Adding new caregivers to our system was highly manual. The plan to create self-service tools was additional tech debt and feature work that had not yet been prioritized.
We also needed a better way to control the influx of patients. As an on-demand system, we would accept as many visits as could be requested, and manage those users in queues based on the state where the patient was requesting care. But queues were growing beyond what the available providers could see by the end of their shift. Meanwhile, patients were frustrated by long wait times. We would have to end our hours of operation early to cap the queue.
To better manage this, we added the ability for caregivers to mark a state as ‘busy’ so they could slow down the intake of new patients throughout the day and avoid overlong queues. We also improved the features around wait time calculation so patients had a better expectation of how long they would have to wait.
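As a rough illustration of how queue length, provider count, and a ‘busy’ flag interact, consider the Python sketch below. Everything in it is an assumption made for the example: the `StateQueue` fields, the fixed average visit length, the idea that ‘busy’ pauses intake entirely rather than slowing it, and the rule that a new visit must fit within the remaining shift. It is not our production wait time calculation.

```python
import math
from dataclasses import dataclass


@dataclass
class StateQueue:
    state: str
    waiting: int              # patients currently in the queue
    providers: int            # providers credentialed for this state, on shift
    avg_visit_minutes: float  # assumed average visit length
    busy: bool = False        # set by caregivers to hold back intake


def estimated_wait_minutes(queue: StateQueue) -> float:
    """Rough wait for a newly arriving patient.

    Assumes every provider starts a new visit now and all visits take the
    average length, so the patient waits for the full rounds ahead of them.
    """
    if queue.providers == 0:
        return math.inf
    rounds_ahead = queue.waiting // queue.providers
    return rounds_ahead * queue.avg_visit_minutes


def accepts_new_visits(queue: StateQueue, minutes_left_in_shift: float) -> bool:
    """Accept a visit only if the state is not busy and the visit can
    finish before the shift ends (both rules are assumptions)."""
    if queue.busy:
        return False
    finish = estimated_wait_minutes(queue) + queue.avg_visit_minutes
    return finish <= minutes_left_in_shift


wa = StateQueue(state="WA", waiting=12, providers=4, avg_visit_minutes=15)
print(estimated_wait_minutes(wa))   # 45
print(accepts_new_visits(wa, 120))  # True
print(accepts_new_visits(wa, 30))   # False
```

Even a crude estimate like this makes the constraint visible: adding providers shortens the wait for everyone in that state, while the busy flag protects the queue when no more providers are available.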
Be kind to one another
While our fellow caregivers and software systems were being pushed to the brink, we were all having to live in a changed world.
One of my least favorite emails to receive (and send) was one that cancelled long-anticipated vacations. Schools were closed, concerts cancelled, restaurants shut down. We went from optional work-from-home to it being mandatory.
Our response as a team was to find new ways to interact. Our biweekly team happy hour became a weekly Zoom. We encouraged each other to actually turn on our cameras during meetings so we could see each other. We enjoyed the inevitable appearance of pets and children. Our “Where Am I” channel started to include dog walks and just time away from the keyboard. And we started encouraging each other to take time off, even if it was just to be home, and to stay completely off work communication.
The first big wave of virtual visit requests started to slow down mid-March. A second spike started at the end of June. We revisited many of the lessons we learned the first time and were able to respond to new bottlenecks more rapidly, though we still have work to do to support our patients, caregivers, and health system.
Scaling a complex system under stress is a hard thing to do. In our case, it helped considerably to know that what we were doing directly helped patients who were seeking healthcare during the pandemic. But that added its own levels of stress, and we worried we weren’t doing enough, fast enough. During this time we made sure to support each other and celebrate every achievement, which fueled our motivation and helped us maintain our high performance.
I’m proud to be part of a team that was able to help so many people in the pandemic, and a huge part of that was being able to recognize and respond to the moment-by-moment strains on our system.
I’ll take the lessons we learned to better inform our future technical decisions. I don’t ever again want to scale in the face of a global pandemic, but I want to be prepared if I have to.