In Part 1 of this series I reviewed our initial agile adoption and move to the cloud. Read on to learn how we adopted continuous delivery and deployment automation to become more nimble.
Following the well-received beta of SparkPost, we realized we needed to reorient the broader engineering team towards the cloud. We had ambitious goals for the official SparkPost launch in early 2015. Many features, including self-service billing and compliance measures (to keep the spammers and phishers out), were on our to-do list. We also targeted additional client libraries and had to make important improvements to performance, scalability, and usability.
To move faster we had to tackle the challenge of reliably and frequently deploying changes to the production environment. While some of our microservices were well suited to continuous deployment, the Momentum software was not. The challenges included lengthy build times and a regression test suite that ran overnight, with numerous flaky test cases that slowed us down. We were also starting from a homegrown utility, written in Perl, that performed installations and upgrades. We had designed this utility for our on-premises customers, who installed and upgraded software very infrequently, and it proved clunky for our use case.
To tackle these problems head on, we decided to fully embrace the continuous delivery model and committed to two short-term objectives: to automate the deployment of any change to a UAT environment within 1 hour, and to deploy Momentum to the SparkPost production environment twice a week.
At this time we switched all of the engineering teams over to Kanban and incorporated the lessons learned by the initial SparkPost beta team.
Over the next few months, this concerted effort to adopt continuous delivery produced a number of dramatic results. One was a deliberate switch in who was responsible for software deployments, which led to a decrease in deployment times and unintended service interruptions. Rather than the developers handing software and instructions to the operations team, the development team took over this responsibility while still getting valuable assistance from operations. To solve our deployment problem we created a new cross-functional “Deployment Team” which included members from each dev team and operations.
The Deployment Team experimented with several approaches and tools before choosing Bamboo and Ansible to automate the deployment of database, code, and configuration changes. Within a short period of time the team had automated the nascent build and deployment pipelines for each service. We removed any long running test suites from the critical path, and we incorporated automated upgrade, smoke tests, and rollback scripts. The on-premises installer script was finally obsolete.
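The upgrade, smoke test, and rollback sequence can be sketched roughly as follows. This is a minimal illustration in Python, not the actual Ansible or Bamboo automation; the function name `deploy_with_rollback` and its three callable parameters are hypothetical stand-ins for whatever scripts perform each step.

```python
from typing import Callable, List


def deploy_with_rollback(
    upgrade: Callable[[], None],
    smoke_test: Callable[[], bool],
    rollback: Callable[[], None],
) -> bool:
    """Run an upgrade, verify it with a smoke test, and roll back on failure.

    Returns True if the new version stays in place, False if it was rolled back.
    All three callables are hypothetical placeholders for real scripts.
    """
    try:
        upgrade()
        healthy = smoke_test()
    except Exception:
        # Treat any error during upgrade or verification as a failed deployment.
        healthy = False
    if not healthy:
        rollback()
    return healthy


if __name__ == "__main__":
    log: List[str] = []
    ok = deploy_with_rollback(
        upgrade=lambda: log.append("upgrade"),
        smoke_test=lambda: False,  # simulate a failing smoke test
        rollback=lambda: log.append("rollback"),
    )
    print(ok, log)
```

The point of the design is that a failed smoke test never leaves the environment on the new version, which is what makes deploying during business hours tolerable.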
We had a reasonably good continuous delivery and deployment pipeline in place by the time of the GA launch in April 2015, and we were deploying several times a week during business hours. This included not just the many lightweight microservices but also the Momentum platform.
Another big, positive result was the dramatic reduction in our cycle time. In 2014 our cycle time averaged around 8 days across all issues, but within a few months it had dropped to 6 days in 2015. Even more striking, average cycle times for user stories dropped from 22 days to less than 10 days. This happened even though we moved the goalposts on the definition of done from “verified in UAT” to “verified in production”. We were pleased to discover that our reduced cycle times brought greater velocity and improved quality, with all teams getting a lot more done, faster and better.
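To put those numbers in relative terms, here is a small worked calculation in Python. The helper `percent_reduction` is purely illustrative; the inputs are the averages quoted above.

```python
def percent_reduction(before_days: float, after_days: float) -> float:
    """Relative drop in average cycle time, as a percentage of the starting value."""
    return (before_days - after_days) / before_days * 100.0


# All issues: 8 days down to 6 days.
print(round(percent_reduction(8, 6), 1))    # 25.0
# User stories: 22 days down to (at most) 10 days.
print(round(percent_reduction(22, 10), 1))  # 54.5
```

Because user stories dropped to less than 10 days, the second figure is a lower bound: the reduction was more than 54.5%.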
As an important enabler of these improvements, we adopted an MVF (minimum viable feature) approach. It clearly identified the customer need but let the development teams drive the solutions incrementally, with a focus on delivering quickly. This eliminated a lot of the upfront requirements analysis and technical design.
We learned to listen more to our developer user community and took advantage of our shorter development cycle times to quickly deliver fixes and improvements that users wanted.
Over time the development teams gradually evolved their processes to fully incorporate unit, acceptance, and performance testing, and we eliminated the separate QA function. Some of the QA team members transitioned into development and some moved into the Deployment Team.
Around this time we discontinued our traditional Project Management Office (PMO), which had centrally controlled all development projects. We decentralized responsibility for delivery to the individual development team managers, embedding Product Owners directly within those teams. This further reduced overhead and increased agility.
Part 3 of this series will focus on the lessons we learned as our service rapidly grew in 2016 and share some of what we have in store for this year. If you have any questions or comments about our devops journey, please don’t hesitate to connect with us on Twitter, and we are always hiring.
VP Engineering and Cloud Operations
Some two years ago, a small team met in a conference room to discuss building a self-service offering on top of Momentum, the world’s best email delivery platform. Since then, SparkPost has gone from an idea to a developer-focused service with an automated release cycle built on a culture of testing and constant iteration. So we figured it was time to share what we’ve learned and how we handle continuous integration and deployment.
Why We’ve Embraced Continuous Integration and Deployment
Before we dive into the how, you need to know why we’ve embraced continuous integration and deployment. We have 20 components that make up our service and we routinely deploy 15-20 times a week. In fact, deploying frequently allows us to focus on creating a better experience for our users iteratively. Since we can deploy small changes to any component of our service independently, we can respond quickly based on what we learn from our community. We’ve found that releasing discrete pieces of functionality for specific components lowers the risk of deployments because we can quickly verify the work and move on.
Testing is at the core of being able to continuously deploy features. The testing culture at SparkPost gives us the confidence to deploy at will. We don’t have an enforced or preferred method of testing like BDD or TDD. We have a simple rule – write tests to cover the functionality you are building. Every engineer writes unit, functional, and smoke tests using mocha, nock, sinon, and protractor. Each type of test is critical to the deployment pipeline.
For our private projects, we orchestrate the deployment pipeline with Atlassian Bamboo. We have three types of plans chained together: test, package, and deploy. During the test plan, we clone both our automation scripts repository, which houses all our continuous integration bash scripts, and the component we’re working on (e.g. our metrics API). Bamboo then runs the unit and functional tests for that component, generating test results and coverage reports. Upon a successful build, the packaging plan is triggered, generating any necessary RPM packages and uploading them to a yum repo. Once packaging is complete, it triggers the deployment of the package. Deploy plans are responsible for installing or upgrading the component and any related configuration using Ansible, running smoke tests using protractor, and, if necessary, rolling back to a previous version.
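The chaining of test, package, and deploy plans can be illustrated with a short Python sketch. This is not Bamboo configuration and makes no claim about Bamboo's API; `run_chain` and the lambda steps are hypothetical, and the only behavior modeled is that each plan runs only if the previous one succeeded.

```python
from typing import Callable, List, Tuple

# A plan is a name plus a step that reports success (True) or failure (False).
Plan = Tuple[str, Callable[[], bool]]


def run_chain(plans: List[Plan]) -> List[str]:
    """Run plans in order and stop at the first failure.

    Returns the names of the plans that completed successfully.
    """
    completed: List[str] = []
    for name, step in plans:
        if not step():
            break  # a failed plan never triggers the next one
        completed.append(name)
    return completed


if __name__ == "__main__":
    # Hypothetical steps: tests pass, packaging fails, so deploy never runs.
    chain: List[Plan] = [
        ("test", lambda: True),
        ("package", lambda: False),
        ("deploy", lambda: True),
    ]
    print(run_chain(chain))
```

With the sample steps shown, only the test plan completes, so nothing is packaged or deployed from a build that did not get through the earlier stages.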
Open source work, like our client libraries, Slack bots, and API documentation, is run through TravisCI. Check out the .travis.yml files for our Python library, PHP library, API docs, and developer hub to see what they do.
Slack and Additional Ways We Use Automation
You most likely know about our obsession with Slack by now. We use it for both manual and automated notifications related to deploying features. Before we merge or deploy code, we announce the component and the environment it will be going to. Each merge then triggers the deployment pipeline outlined above: merges to develop branches trigger deployments to UAT, and merges to master (hotfixes or develop branch promotions) trigger deployments to staging. Deployments to production are push-button, which allows for proper communication and timed releases of features. Bamboo sends email notifications upon successful plan builds, at the start of a deployment, and on the success or failure of a deployment. These emails go to an internal address, where a process consumes them and posts a message in Slack.
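The branch-to-environment rules above boil down to a small lookup, sketched here in Python. The names `AUTO_DEPLOY_TARGETS` and `target_environment` are hypothetical, as are the lowercase environment labels; only the develop-to-UAT and master-to-staging mapping comes from the text.

```python
from typing import Optional

# Merges that deploy automatically. Production is deliberately absent:
# those deployments are push-button, not triggered by a merge.
AUTO_DEPLOY_TARGETS = {
    "develop": "uat",
    "master": "staging",
}


def target_environment(merged_branch: str) -> Optional[str]:
    """Return the environment a merge deploys to automatically, or None."""
    return AUTO_DEPLOY_TARGETS.get(merged_branch)


if __name__ == "__main__":
    for branch in ("develop", "master", "feature/new-report"):
        print(branch, "->", target_environment(branch))
```

Keeping production out of the automatic mapping is what leaves room for the announcements and timed releases described above.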
Some additional ways we use automation include:
- Deploying the Web UI
- Deploying Momentum, our core platform written using C and Lua
- Testing and upgrading Node.js
- Making nginx configuration changes
- Deploying one of our 18 APIs
- Pushing customer acquisition and community data into Keen.io dashboards
- Deploying cron jobs that run cleanup tasks and reports
- Deploying Fauxmentum, our internal tool for generating test data against our various environments
Continuous integration and deployment are vital parts of SparkPost’s ability to listen and respond to what our community members ask for. By sharing some of our experience and process, we hope we’ve given you some insight that will help you improve your own ability to build, test, and deliver features. If you’d like to see some of our pipeline in action, you can sign up for an account here. Also, feel free to join our community Slack channel and chat with us about your experiences with SparkPost. We’d love to hear from you!
—Rich Leland, Director of Growth Engineering