Have you ever tried to replace the engine of an airplane while it was in flight? If you literally have, you've lived an amazing life; I mean it metaphorically. I want to share how we have standardized the process for replacing database backends in high-throughput, zero-downtime production environments, and how Eriksen, a small data marshaling library we built, helped us execute it multiple times.
We built SparkPost on top of our powerful on-premises MTA, Momentum. At the time, this meant carrying over the databases Momentum was built to use. As our user base grew, we started to stretch the capabilities of these technologies, at least as they were configured. We were sticking with them out of convenience, not because they were the right tool for the job. Our main pain point was Apache Cassandra, which we'd spread across several of our microservices. While Cassandra is a great technology, several of our use cases simply weren't a good fit for it.
Problem in the Process
As we embarked on our journey to remove our reliance on Cassandra, we questioned how we did database cutovers. Our existing process wasn't very elegant. Once the new database was set up and all relevant code changes had been written, we would announce a maintenance window and put the API in read-only mode. For the duration of the cutover, any non-GET requests to the API would return errors. Since we had no canary deployments, we had no idea how the new backend would perform in the production environment. So on top of the planned downtime, there was a high risk of problems arising after the cutover. We needed a solution to mitigate those risks and remove the need for maintenance windows. Companies rely on our service to run their business and expect it to always be working.
All of our node.js APIs follow an MVC pattern, and we wanted the ability to deploy changes incrementally, model by model. We wanted to be able to monitor the new data source, and once we were comfortable with its performance and weren't observing errors, we'd cut over. If we started seeing errors while writing to the new database, we still wanted the current database to work. Lastly, we wanted a way to roll back to the old database with no data loss. What we needed was a dual-write, single-read solution. The team on this project at the time couldn't find a library available via npm. It may exist and our Google-fu was subpar, but we decided to roll our own. Enter Eriksen, which marshals calls to different data models. Marshall Eriksen, get it?
Eriksen allows a developer to define two models: a primary and a secondary data store. This "supermodel" is then used in controller code. The only rule is that the function signatures have to match so Eriksen can proxy calls to both. Each model can then handle its database's specific idiosyncrasies. Eriksen relies on the primary model to decide whether a request succeeds or fails. The secondary model receives the same reads and writes, but Eriksen swallows its errors after logging them. So errors from the secondary model never fail the request, but are recorded so they can be debugged.
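The core of that proxying behavior can be sketched in a few lines of plain Node.js. Note this is an illustrative reimplementation of the dual-write, single-read pattern, not Eriksen's actual API; see the linked demo for the real thing:

```javascript
// Sketch of the dual-write, single-read pattern (not Eriksen's real API).
// Both models must expose the same function signatures.
function makeSupermodel(primary, secondary, log = console.error) {
  const supermodel = {};
  for (const name of Object.keys(primary)) {
    supermodel[name] = async (...args) => {
      // Fire the secondary call in parallel, but log and swallow its
      // errors so they can never fail the request.
      Promise.resolve()
        .then(() => secondary[name](...args))
        .catch((err) => log(`secondary ${name} failed: ${err.message}`));
      // Only the primary model decides whether the request succeeds.
      return primary[name](...args);
    };
  }
  return supermodel;
}
```

Controllers call the supermodel exactly as they would a single model, which is what makes the eventual primary/secondary swap invisible to the rest of the codebase.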
Here’s an example: https://github.com/SparkPost/eriksen/blob/master/demo/index.js
We have now swapped the backends of some of our most-used microservices: authorization, account management, user management, subaccount management, webhooks, and IP management. Eriksen allowed us to do this with no impact to our users.
Here’s the rough plan we developed for database cutovers:
1. Write the new model and instantiate it as the secondary source with Eriksen.
2. Ship it with the secondary model disabled.
3. Backfill data to the new database.
4. Enable the secondary model.
5. Monitor the secondary model.
6. Fix bugs as they appear and return to step 5 until performance is satisfactory.
7. Change secondary to primary and vice versa.
8. Once there are no issues with the new primary, disable the secondary model (originally the primary model).
9. Keep the secondary model code around until we're confident we won't need to roll back, then remove it.
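Most of the steps above reduce to configuration flips rather than code changes. Here's a hedged sketch of how a service might wire that up; the config keys and backend names are illustrative, not Eriksen's actual configuration:

```javascript
// Hypothetical config-driven wiring for the cutover plan above.
// Flipping these values and redeploying walks through the steps
// without touching controller code. All names are illustrative.
const config = {
  primary: 'cassandra',      // step 7: swap this with `secondary`
  secondary: 'newBackend',   // step 1: the new model, shipped early
  secondaryEnabled: false    // steps 2 and 4: off at first, then on
};

// Pick the active models for this deploy from a registry of models.
function selectModels(models, cfg) {
  return {
    primary: models[cfg.primary],
    secondary: cfg.secondaryEnabled ? models[cfg.secondary] : null
  };
}
```

Driving the plan through configuration keeps each step a small, reviewable, and quickly reversible change, which is exactly what you want when a rollback may be needed.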
There you go! This library has had an enormous impact on the way we write and ship software. Let us know what you think! Stay tuned for more real-world examples. If the community finds this library as useful as we have, PRs are welcome and we can help make this even better!