In Part 1 of this series I reviewed our initial agile adoption and move to the cloud. Then in Part 2 I reviewed how we adopted continuous delivery and deployment automation to become more nimble. In this final part of the series I share lessons we learned as our service rapidly grew in 2016 and share some of what we have in store for this year.
3rd Generation Infrastructure
We learned many valuable lessons operating email infrastructure at scale in AWS and by mid 2015 had completed the move to our third generation of infrastructure. Now we properly leverage Amazon’s VPCs, security groups, ELBs, EC2 instance types, CloudFormation provisioning instead of Terraform, and EBS and ephemeral storage, and we gained even more service resilience by using clusters spanning multiple availability zones. We switched from Nagios to Circonus for monitoring. An outbound email proxy now separates the management of outbound IP addresses from the MTAs, which allows us to easily add more MTAs independently of the number of IP addresses we send email with.
During this time we formalized on-call schedules across not just the operations team but also the development teams with the understanding that everyone shares responsibility for the health of our production environment. This was increasingly important since most changes were deployed to production using automated deployment pipelines built by the dev teams. Our Deliverability team and Technical Account Managers also use Opsgenie for on-call rotations. Besides shortening the resolution time for production issues, this approach empowered the development teams to make the necessary improvements to minimize and often eliminate the source of production issues, resulting in a much more reliable service.
We improved a number of important processes including Change Management (CM) and Root Cause Analysis (RCA). Our CM procedures cover all deployments and changes to our production environments. Change Management helps prevent negative impact to customers by enforcing a thorough testing and review process for all production changes. This approach has greatly improved transparency and risk management for us and reduced the number of off-hour fire-drills. Not all CM tickets are the same; we account for differences between standard and emergent changes and we do not require separate CM tickets for changes deployed through our automated deployment pipelines.
Our RCA process helps us properly identify the root cause of customer impacting events and follows the “5 whys” approach. We don’t use RCAs to place personal blame but focus on the corrective actions instead – technology, process, or training – to ensure we do not fail the same way twice.
It’s important we optimize our time to find and fix a bug in production rather than slow things down too much in a futile effort to prevent all bugs with testing. We use our continuous delivery and deployment processes to quickly fix and deploy a patch confidently.
With so many customer facing changes made by different teams, we need effective internal communication without introducing unnecessary blocking dependencies. A core group from Product Management, Product Owners, Support, and technical team leads holds a “scrum of scrums” each week to ensure there is sufficient awareness of coming changes.
To further help spread awareness of changes, we automatically post a daily summary of “customer impacting” JIRA tickets to an internal Slack channel, and one of our Slack bots automatically posts any major changes there throughout the day. Slack is a fantastic tool and we use it extensively throughout the company for team, topic, and interpersonal communication.
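As a rough illustration of the daily summary idea (not the actual bot), the sketch below filters tickets and builds a JSON payload with a `text` field, the shape Slack incoming webhooks accept. The ticket fields (`key`, `summary`, `labels`) and the `customer-impacting` label are assumptions made for the example, and the step that fetches tickets from JIRA and posts to Slack is left out.

```python
import json

# Hypothetical label marking tickets that affect customers.
CUSTOMER_IMPACTING = "customer-impacting"


def build_daily_summary(tickets, day):
    """Return a JSON payload summarizing customer-impacting tickets for one day."""
    impacting = [t for t in tickets if CUSTOMER_IMPACTING in t.get("labels", [])]
    if not impacting:
        text = f"No customer-impacting changes for {day}."
    else:
        lines = [f"Customer-impacting changes for {day}:"]
        # One line per ticket, sorted by key so the order is stable.
        for ticket in sorted(impacting, key=lambda t: t["key"]):
            lines.append(f"- {ticket['key']}: {ticket['summary']}")
        text = "\n".join(lines)
    return json.dumps({"text": text})


# Made-up tickets for demonstration only.
tickets = [
    {"key": "API-102", "summary": "Add subaccount filter to metrics",
     "labels": ["customer-impacting"]},
    {"key": "OPS-7", "summary": "Rotate internal logs", "labels": []},
]
payload = build_daily_summary(tickets, "2016-06-01")
print(payload)
```

Only the ticket carrying the label ends up in the message; a day with no matching tickets still produces a short "nothing to report" line so the channel shows the job ran.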
By early 2016 when MailChimp announced they were discontinuing their developer oriented standalone Mandrill transactional email service, we were in an excellent position to fill that gap for developers and offer them a much needed alternative.
As these developers came over to SparkPost in droves we were able to easily scale our platform and quickly add features that were in high demand, including subaccounts and dedicated IP pools.
Additionally, we ramped up our Developer Relations team to support this influx of developers. This team is grounded in the SparkPost Engineering department. The team’s mission is to support developers through client libraries, tools, content and even *gasp* direct human interaction. We love to interact with our developer community at various events, hackathons and on our community Slack team. You can find our upcoming events on developers.sparkpost.com.
In the summer of 2016 we collapsed Tech Ops into Engineering to improve collaboration and efficiency. At this time we created a new Site Reliability Engineering (SRE) team that incorporated a number of functions into a single cross-functional team within Engineering. This finally broke down any remaining walls between development and operations. This new team fully embraces the “infrastructure as code” approach and has oversight of cloud infrastructure, deployments, upgrades, monitoring, and auto-scaling while promoting discipline, safety, and a positive customer experience.
Due to the efforts of this team we made significant improvements in our availability and reductions in customer impacting events. We also built out more comprehensive and actionable monitoring and alerting which has improved the overall customer experience – and boosted the morale of the team.
The Road Ahead
Besides lots of great features on the roadmap, we are rolling out the fourth generation of our AWS infrastructure. This latest iteration includes further decoupling of service tiers, improved automation and monitoring, replacing or augmenting some traditional technologies with AWS services such as SQS and DynamoDB, advances in auto-scaling of service tiers, and improvements in our outbound email proxy technology. We have some API performance improvements in the works as well, which users will love. We will complete the final transition of application configuration management from Puppet to Ansible and fill in any system configuration management gaps with Puppet. Early experiments with Amazon’s container services should begin to make their way into production. All of these advances will help make SparkPost APIs even more reliable, scalable, and fast.
This year we will likely move some services to continuous deployment. This means eliminating the manual button pushing for the final production deployment step. There is a little more work to do on our automated smoke tests and roll-back scripts before we are ready and we are exploring options such as canary releases.
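One way to frame a canary release, sketched under assumptions: send a small share of traffic to the new version, then promote it only if its error rate is not meaningfully worse than the current version's. The function name, the thresholds, and the three outcomes below are hypothetical choices for illustration, not a description of a finished process.

```python
def canary_decision(baseline_errors, baseline_total, canary_errors, canary_total,
                    max_increase=0.01, min_requests=100):
    """Return 'wait', 'promote', or 'rollback' for a canary release.

    max_increase is the largest tolerated rise in error rate (absolute),
    and min_requests is the traffic needed before judging the canary.
    Both defaults are illustrative assumptions.
    """
    if canary_total < min_requests:
        return "wait"  # Not enough canary traffic to judge yet.
    baseline_rate = baseline_errors / baseline_total if baseline_total else 0.0
    canary_rate = canary_errors / canary_total
    if canary_rate - baseline_rate > max_increase:
        return "rollback"
    return "promote"


print(canary_decision(10, 10000, 0, 50))     # too little traffic so far
print(canary_decision(10, 10000, 1, 1000))   # error rates match
print(canary_decision(10, 10000, 50, 1000))  # canary is clearly worse
```

A decision like this is what would replace the manual button push: the automated smoke tests gate the first step, and the canary comparison gates the full rollout.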
Through disciplined devops practices and a lot of hard work over the past few years, the SparkPost team has achieved tremendous success. But we realize this is a journey and there is always more to do as we continue to scale up and move faster.
VP Engineering and Cloud Operations
In Part 1 of this series I reviewed our initial agile adoption and move to the cloud. Read on to learn how we adopted continuous delivery and deployment automation to become more nimble.
Following the well received beta of SparkPost we realized we needed to reorient the broader engineering team towards the cloud. We had ambitious goals for the official SparkPost launch in early 2015. Many features including self-service billing and compliance measures (to keep the spammers and phishers out) were on our to-do list. We also targeted additional client libraries and had to make important improvements to performance, scalability, and usability.
To move faster we had to tackle the challenge of reliably and frequently deploying changes to the production environment. While some of our microservices were more suitable to move towards continuous deployment, the Momentum software was not. Some challenges we encountered included lengthy build times and a regression test suite that ran overnight with numerous flaky test cases which slowed us down. We also started from a home grown installation utility written in Perl to perform installation and upgrades. We had designed this utility for our on-premises customers who installed and upgraded software very infrequently and it proved clunky for our use case.
To tackle these problems head on we decided to fully embrace the continuous delivery model and committed to tackling two short term objectives: to automate the deployment of any change to a UAT environment within 1 hour and to deploy Momentum to the SparkPost production environment twice a week.
At this time we switched all of the engineering teams over to Kanban and incorporated all the learnings from the initial SparkPost beta team.
During the next few months there were a number of dramatic results to come out of this concerted effort to adopt continuous delivery. One change was a deliberate switch in who was responsible for doing software deployments and a resulting decrease in deployment times and unintended service interruptions. Rather than the developers providing software and instructions to the operations team, the development team took over this responsibility while still getting valuable assistance from the operations team. To solve our deployments problem we created a new cross-functional “Deployment Team” which included members from each dev team and operations.
The Deployment Team experimented with several approaches and tools before choosing Bamboo and Ansible to automate the deployment of database, code, and configuration changes. Within a short period of time the team had automated the nascent build and deployment pipelines for each service. We removed any long running test suites from the critical path, and we incorporated automated upgrade, smoke tests, and rollback scripts. The on-premises installer script was finally obsolete.
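The upgrade, smoke test, and rollback steps fit together roughly as in the sketch below. It is a simplified illustration, not the Bamboo and Ansible automation itself: `install` and `smoke_test` are hypothetical callables standing in for the real upgrade and test scripts, and the version numbers are made up.

```python
def deploy_with_rollback(current_version, new_version, install, smoke_test):
    """Install new_version; if the smoke tests fail, reinstall current_version.

    install(version) performs the upgrade and smoke_test() returns True on
    success. Both are placeholders for real scripts. Returns the version
    left running.
    """
    install(new_version)
    if smoke_test():
        return new_version
    # Smoke tests failed: put the previous version back.
    install(current_version)
    return current_version


# Simulated run: record each install and force the smoke tests to fail.
installed = []
running = deploy_with_rollback("1.4.0", "1.5.0", installed.append, lambda: False)
print(running, installed)
```

Keeping the smoke tests short is what makes this practical; the long-running suites stay off the critical path, as described above.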
We achieved a reasonably good continuous delivery and deployment pipeline by the time of the GA launch in April 2015 and we were deploying several times a week during business hours, including not just the many lightweight microservices but also the Momentum platform.
Another big and positive result was the dramatic reduction in our cycle time. In 2014 our cycle time averaged around 8 days for all issues but within a few months this dropped to 6 days for 2015. Even more stunning, average cycle times for user stories dropped from 22 days to less than 10 days. This was even after moving the goalposts on the definition of done from “verified in UAT” to “verified in production”. We were pleased to discover that our reduced cycle times resulted in greater velocity and improved quality, with all teams getting a lot more done faster and better.
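To make the cycle time figures concrete, here is a small sketch of the arithmetic. It assumes cycle time means the elapsed days between starting work on an issue and marking it done, which is one common definition; the sample dates are invented for the example.

```python
from datetime import date


def average_cycle_time(issues):
    """Mean number of days between the start and done dates of each issue."""
    durations = [(done - started).days for started, done in issues]
    return sum(durations) / len(durations)


def percent_reduction(before, after):
    """Relative drop from before to after, as a percentage."""
    return (before - after) / before * 100


# Invented (started, done) pairs for demonstration only.
issues = [
    (date(2015, 3, 2), date(2015, 3, 6)),   # 4 days
    (date(2015, 3, 3), date(2015, 3, 11)),  # 8 days
    (date(2015, 3, 9), date(2015, 3, 15)),  # 6 days
]
print(average_cycle_time(issues))  # 6.0
print(percent_reduction(8, 6))     # 25.0
```

By the same arithmetic, the drop from 8 to 6 days for all issues is a 25% reduction, and going from 22 days to under 10 for user stories is a reduction of more than 54%.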
As an important enabler to these improvements we adopted an MVF (minimum viable feature) approach that clearly identified the customer need but let the development teams drive the solutions in an incremental way focusing on delivering quickly, eliminating a lot of the upfront requirements analysis and technical design.
We learned to listen more to our developer user community and took advantage of our shorter development cycle times to quickly deliver fixes and improvements that users wanted.
Over time the development teams gradually evolved their processes to fully incorporate unit, acceptance, and performance testing and we eliminated the separate QA function. Some of the QA team members transitioned into development and some moved into the Deployment Team.
Around this time we discontinued our traditional Project Management Office (PMO) which had centrally controlled all development projects. We decentralized responsibility for delivery to the individual development team managers, embedding Product Owners directly within those teams. This helped further reduce overhead and increased agility.
Part 3 of this series will focus on the lessons we learned as our service rapidly grew in 2016 and share some of what we have in store for this year. If you have any questions or comments about our devops journey please don’t hesitate to connect with us on Twitter – and we are always hiring.
From On-Prem to Continuous Deployment
These days, we have our heads in the clouds. Or the cloud, rather, as we engineer a fast-moving cloud service and deploy software dozens of times each week. SparkPost helps customers send billions of emails each month using our cloud APIs.
But that has not always been the case.
SparkPost began as Message Systems, which for over seven years specialized in an on-premises commercial software product called Momentum. If you’re unfamiliar with this legendary messaging platform, Momentum powers the email infrastructure of many large senders including Twitter, LinkedIn, Comcast and our own cloud email delivery service SparkPost. Our journey to the cloud has taken us from quarterly release cycles of Momentum to continuous delivery of SparkPost. We safely and quickly deploy new features and fixes to production as soon as they are ready. And now we are ready to gradually remove that final manual push button deployment step for most services and transition more fully to continuous deployment.
We decided to provide a cloud service to take advantage of the growth and transition to cloud computing in the email market. It was a fantastic opportunity for us to bring the power of Momentum to a broader developer audience. Our cloud service has brought the company meteoric growth in 2016. We have now completed our transformation to a cloud-first engineering team and company, which required a significant infrastructure and organization evolution.
Here’s what worked and didn’t work on our devops journey towards continuous deployment.
Meanwhile, the core Momentum development team was also building RESTful APIs to support templates and message generation. All of this new functionality was in support of Momentum v4, the next major release of our on-premises product, and would prove to be an excellent foundational API-first architecture on which to build our cloud service when the time came. The core Momentum team gradually adopted a more agile workflow, which was no small feat considering the size and maturity of this code base. What a huge improvement over the prior days of development throwing code over the wall to QA. However, the build and test cycles were still measured in days and weeks.
To the Cloud!
Our Managed Cloud service launched mid-year of 2014, essentially Momentum hosted in AWS. We did this under the assumption that our Tech Ops team could build out a customer environment in AWS and then install Momentum, just like any on-premise customers would. We targeted this offering at our traditional enterprise customers who did not want to operate the Momentum email infrastructure themselves. The newly formed Tech Ops team consisted of former Support and Remote Management team members and was separate from Engineering at the time. With little initial AWS experience they did a great job building what was our first generation of AWS infrastructure.
We chose AWS because it allowed us to get going quickly and provided a lot of flexibility not available to us if we had decided to build out our own data center. Nevertheless, we borrowed heavily from a normal data center approach, especially when it came to networking, since that is what our team had the most experience with. Additionally, we treated our managed cloud business as just another customer, albeit an important one, without making any fundamental changes to the underlying product or to how we build, ship, and deploy it. As we rapidly added more features it wasn’t long before we realized that this approach was problematic. There were disconnects between the dev teams and the operations team resulting in inefficiencies. The traditional on-premises installation and upgrade methods were not compatible with a rapidly changing cloud service.
A Startup Within a Startup
Meanwhile, that same summer we formed a small team focused on delivering the beta release of our as-yet-unnamed public cloud service targeted at developers. This team included a handful of application developers, along with a few engineers from the Momentum and Tech Ops teams. We took the approach of “a startup within a startup” to ensure focus on the mission and avoid distraction or blockers from the core on-prem enterprise business. This team built out our second generation AWS environment based on lessons learned from Managed Cloud. Collaboration improved between development and operations. Now developers deployed code on their own (manually) and provided more guidance on the infrastructure.
To bring this new service to life, with the help of the awesome UX service provider Intridea, the app dev team designed and built a new Web UI. The team followed a very lightweight Kanban process with very little overhead. By September we settled on the name “SparkPost” and began to sign up beta users while we readied things for our official beta launch at our user conference later that year.
Part 2 of this series will focus on how we adopted continuous delivery and deployment automation to become more nimble. If you have any questions or comments about our DevOps journey please don’t hesitate to connect with us on Twitter – and we are always hiring.
Some two years ago, a small team met in a conference room to discuss building a self-service offering on top of Momentum, the world’s best email delivery platform. Since then, SparkPost has gone from an idea to a developer-focused service with an automated release cycle built on a culture of testing and constant iteration. So we figured it was time to share what we’ve learned and how we handle continuous integration and deployment.
Why We’ve Embraced Continuous Integration and Deployment
Before we dive into the how, you need to know why we’ve embraced continuous integration and deployment. We have 20 components that make up our service and we routinely deploy 15-20 times a week. In fact, deploying frequently allows us to focus on creating a better experience for our users iteratively. Since we can deploy small changes to any component of our service independently, we can respond quickly based on what we learn from our community. We’ve found that releasing discrete pieces of functionality for specific components lowers the risk of deployments because we can quickly verify the work and move on.
Testing is at the core of being able to continuously deploy features. The testing culture at SparkPost gives us the confidence to deploy at will. We don’t have an enforced or preferred method of testing like BDD or TDD. We have a simple rule – write tests to cover the functionality you are building. Every engineer writes unit, functional, and smoke tests using mocha, nock, sinon, and protractor. Each type of test is critical to the deployment pipeline.
Our deployment pipeline orchestration is done using Atlassian Bamboo for our private projects. We have three types of plans chained together: test, package, and deploy. During the test plan, we clone two things: our automation scripts, where we house all our continuous integration bash scripts, and the component we’re working on (e.g. our metrics API). Bamboo then runs the unit and functional tests for that component, generating test results and coverage reports. Upon a successful build, the packaging plan is triggered, generating any necessary RPM packages and uploading them to a yum repo. Once the packaging is complete, it triggers the deployment of the package. Deploy plans are responsible for installing/upgrading the component and any related configuration using Ansible, running smoke tests using protractor, and, if necessary, rolling back to a previous version.
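The chaining of test, package, and deploy plans boils down to "run each stage only if the previous one succeeded". Bamboo handles this for us through plan triggers, so the sketch below is only an illustration of that logic in plain Python; `run_pipeline` and the stand-in stage callables are hypothetical and do not reflect Bamboo's configuration.

```python
def run_pipeline(stages):
    """Run (name, callable) stages in order and stop at the first failure.

    Each callable returns True on success. Returns the names of the stages
    that completed successfully.
    """
    completed = []
    for name, stage in stages:
        if not stage():
            break  # A failed stage prevents every later stage from running.
        completed.append(name)
    return completed


# Stand-in stages; real ones would run tests, build RPMs, and call Ansible.
stages = [
    ("test", lambda: True),
    ("package", lambda: True),
    ("deploy", lambda: True),
]
print(run_pipeline(stages))  # ['test', 'package', 'deploy']
```

If the test stage fails, nothing is packaged or deployed, which is the property that lets a merge kick off the whole chain safely.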
Open source work, like our client libraries, Slack bots, and API documentation, is run through TravisCI. Check out the .travis.yml files for our Python library, PHP library, API docs, and developer hub to see what they do.
Slack and Additional Ways We Use Automation
You most likely know about our obsession with Slack by now. We use it for both manual and automated notifications related to deploying features. Before we merge/deploy code, we announce the component and the environment it will be going to. Merges to develop branches trigger deployments to UAT. Merges to master (hotfixes or develop branch promotions) trigger deployments to staging. Deployments to production are push button to allow for proper communication and timed releases of features. Once code is merged, the merge triggers the deployment pipeline outlined above. Bamboo sends email notifications upon successful plan builds, the start of a deployment, and the success or failure of a deployment. These emails are sent to an internal address and consumed by a process that posts a message in Slack.
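The branch rules above amount to a small lookup. The sketch below is illustrative only: the function name and the environment strings are assumptions, while the branch-to-environment mapping follows the description above.

```python
def target_environment(branch):
    """Map a merged branch to the environment it deploys to automatically.

    Returns None when a merge triggers no automatic deployment. Production
    is never returned because it stays a manual push-button step.
    """
    if branch == "develop":
        return "uat"
    if branch == "master":
        return "staging"
    return None


for branch in ("develop", "master", "feature/new-report"):
    print(branch, "->", target_environment(branch))
```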
Some additional ways we use automation include:
- Deploying the Web UI
- Deploying Momentum, our core platform written using C and Lua
- Testing and upgrading Node.js
- Making nginx configuration changes
- Deploying one of our 18 APIs
- Pushing customer acquisition and community data into Keen.io dashboards
- Deploying cron jobs that run cleanup tasks and reports
- Deploying Fauxmentum, our internal tool for generating test data against our various environments
Continuous integration and deployment are vital parts of SparkPost’s ability to listen and respond to what our community members ask for. To sum up, we hope that we’ve given you some insight that will help you improve upon your own ability to build, test, and deliver features by sharing some of our experience and process. If you’d like to see some of our pipeline in action then you can sign up for an account here. Also, feel free to join our community Slack channel, and chat with us about your experiences with SparkPost. We’d love to hear from you!
—Rich Leland, Director of Growth Engineering