In Part 1 of this series I reviewed our initial agile adoption and move to the cloud. Read on to learn how we adopted continuous delivery and deployment automation to become more nimble.
Following the well-received beta of SparkPost, we realized we needed to reorient the broader engineering team towards the cloud. We had ambitious goals for the official SparkPost launch in early 2015. Many features, including self-service billing and compliance measures (to keep the spammers and phishers out), were on our to-do list. We also targeted additional client libraries and had to make important improvements to performance, scalability, and usability.
To move faster, we had to solve the problem of deploying changes to the production environment reliably and frequently. While some of our microservices lent themselves readily to continuous deployment, the Momentum software did not. We faced lengthy build times and an overnight regression test suite full of flaky test cases, both of which slowed us down. We were also starting from a homegrown installation utility, written in Perl, that performed installations and upgrades. We had designed this utility for our on-premises customers, who installed and upgraded software very infrequently, and it proved clunky for our use case.
To tackle these problems head-on, we decided to fully embrace the continuous delivery model and committed to two short-term objectives: to automate the deployment of any change to a UAT environment within one hour, and to deploy Momentum to the SparkPost production environment twice a week.
At this time we switched all of the engineering teams over to Kanban and applied the lessons learned by the initial SparkPost beta team.
Over the next few months, this concerted effort to adopt continuous delivery produced a number of dramatic results. One was a deliberate shift in who was responsible for software deployments, which in turn reduced deployment times and unintended service interruptions. Rather than developers handing software and instructions to the operations team, the development team took over this responsibility while still getting valuable assistance from operations. To solve our deployment problem, we created a new cross-functional "Deployment Team" that included members from each dev team and from operations.
The Deployment Team experimented with several approaches and tools before choosing Bamboo and Ansible to automate the deployment of database, code, and configuration changes. Within a short time, the team had automated the nascent build and deployment pipelines for each service. We removed long-running test suites from the critical path and added automated upgrade, smoke test, and rollback scripts. The on-premises installer script was finally obsolete.
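The upgrade, smoke test, and rollback steps described above follow a common pattern: apply the change, verify it, and revert automatically if verification fails. The sketch below illustrates that control flow in Python under stated assumptions; it is not our actual Bamboo or Ansible configuration, and the function name `deploy_with_rollback` and its callable parameters are hypothetical stand-ins for whatever scripts a pipeline would invoke.

```python
from typing import Callable


def deploy_with_rollback(
    deploy: Callable[[], None],
    smoke_test: Callable[[], bool],
    rollback: Callable[[], None],
) -> bool:
    """Hypothetical sketch: upgrade, verify with smoke tests, roll back on failure.

    Returns True if the new release stays in place, False if it was rolled back.
    """
    try:
        # Apply the upgrade (e.g. database, code, and configuration changes).
        deploy()
        # Quick post-deploy checks that the service is actually healthy.
        healthy = bool(smoke_test())
    except Exception:
        # A crash during the upgrade or the checks counts as a failed deploy.
        healthy = False
    if not healthy:
        # Restore the previous release so users are not left on a broken build.
        rollback()
    return healthy
```

Treating an exception and a failed smoke test the same way keeps the rule simple: any deploy that cannot prove itself healthy is reverted, which is what makes frequent, business-hours deployments low risk.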
By the time of the GA launch in April 2015 we had a reasonably good continuous delivery and deployment pipeline, and we were deploying several times a week during business hours, not just the many lightweight microservices but also the Momentum platform.
Another big, positive result was a dramatic reduction in our cycle time. In 2014 our cycle time averaged around 8 days across all issues; within a few months it dropped, averaging 6 days for 2015. More striking still, average cycle times for user stories dropped from 22 days to less than 10 days. This happened even though we had moved the goalposts on the definition of done from "verified in UAT" to "verified in production". We were pleased to discover that our shorter cycle times brought greater velocity and improved quality, with every team getting a lot more done, faster and better.
An important enabler of these improvements was our adoption of an MVF (minimum viable feature) approach. It clearly identified the customer need but let the development teams drive the solution incrementally, focusing on delivering quickly and eliminating much of the upfront requirements analysis and technical design.
We learned to listen more to our developer user community and took advantage of our shorter development cycle times to quickly deliver fixes and improvements that users wanted.
Over time, the development teams gradually evolved their processes to fully incorporate unit, acceptance, and performance testing, and we eliminated the separate QA function. Some of the QA team members transitioned into development and some moved into the Deployment Team.
Around this time we discontinued our traditional Project Management Office (PMO), which had centrally controlled all development projects. We decentralized responsibility for delivery to the individual development team managers and embedded Product Owners directly within those teams. This further reduced overhead and increased agility.
Part 3 of this series will focus on the lessons we learned as our service grew rapidly in 2016 and share some of what we have in store for this year. If you have any questions or comments about our DevOps journey, please don't hesitate to connect with us on Twitter. And we are always hiring.
VP Engineering and Cloud Operations