In Part 1 of this series, I reviewed our initial agile adoption and move to the cloud. In Part 2, I reviewed how we adopted continuous delivery and deployment automation to become more nimble. In this final part, I share lessons we learned as our service grew rapidly in 2016, along with some of what we have in store for this year.
3rd Generation Infrastructure
We learned many valuable lessons operating email infrastructure at scale in AWS, and by mid-2015 we had completed the move to our third generation of infrastructure. Now we properly leverage Amazon’s VPCs, security groups, ELBs, EC2 instance types, CloudFormation provisioning instead of Terraform, EBS and ephemeral storage, and even more service resilience using clusters spanning multiple availability zones. We switched from Nagios to Circonus for monitoring. An outbound email proxy now separates the management of outbound IP addresses from the MTAs, which allows us to add more MTAs independently of the number of IP addresses we send email from.
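To illustrate why this decoupling helps, here is a minimal sketch of a proxy that owns the outbound IP pools while MTAs simply ask it for a source address. It is not our actual proxy: the class name, the pool name, the round-robin selection, and the documentation-range IP addresses are all assumptions made for the example.

```python
from itertools import cycle


class OutboundProxy:
    """Hypothetical proxy that owns outbound IP pools on behalf of all MTAs."""

    def __init__(self, pools):
        # pools maps a pool name to a list of source IPs (illustrative values).
        self._pools = {name: cycle(ips) for name, ips in pools.items()}

    def pick_source_ip(self, pool):
        # Round-robin is an assumption; a real proxy could weigh
        # reputation, warm-up state, or per-customer dedicated pools.
        return next(self._pools[pool])


proxy = OutboundProxy({"shared": ["192.0.2.10", "192.0.2.11"]})

# Any number of MTAs can use the same two IPs; adding an MTA
# requires no change to the pool definition.
mtas = ["mta-1", "mta-2", "mta-3"]
assignments = [(mta, proxy.pick_source_ip("shared")) for mta in mtas]
```

The point of the sketch is only that the number of MTAs and the number of sending IPs vary independently, because the mapping lives in one place.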
During this time we formalized on-call schedules across not just the operations team but also the development teams, with the understanding that everyone shares responsibility for the health of our production environment. This was increasingly important since most changes were deployed to production using automated deployment pipelines built by the dev teams. Our Deliverability team and Technical Account Managers also use Opsgenie for on-call rotations. Besides shortening the resolution time for production issues, this approach empowered the development teams to make the improvements needed to minimize, and often eliminate, the sources of production issues, resulting in a much more reliable service.
We improved a number of important processes including Change Management (CM) and Root Cause Analysis (RCA). Our CM procedures cover all deployments and changes to our production environments. Change Management helps prevent negative impact to customers by enforcing a thorough testing and review process for all production changes. This approach has greatly improved transparency and risk management for us and reduced the number of off-hour fire-drills. Not all CM tickets are the same; we account for differences between standard and emergent changes and we do not require separate CM tickets for changes deployed through our automated deployment pipelines.
Our RCA process helps us properly identify the root cause of customer-impacting events and follows the “5 whys” approach. We don’t use RCAs to place personal blame but focus on the corrective actions instead – technology, process, or training – to ensure we do not fail the same way twice.
It’s important that we optimize for the time it takes to find and fix a bug in production, rather than slowing things down in a futile effort to prevent every bug through testing. Our continuous delivery and deployment processes let us fix and deploy a patch quickly and confidently.
With so many customer-facing changes made by different teams, we need effective internal communication without introducing unnecessary blocking dependencies. A core group from Product Management, Product Owners, Support, and technical team leads holds a “scrum of scrums” each week to ensure there is sufficient awareness of coming changes.
To further spread awareness of changes, we automatically post a daily summary of “customer impacting” JIRA tickets to an internal Slack channel, and any major changes throughout the day are automatically posted there by one of our Slack bots. Slack is a fantastic tool and we use it extensively throughout the company for team, topic, and interpersonal communication.
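As a rough illustration of the daily summary, the sketch below turns a list of tickets into a message payload with a single "text" field, which is the basic shape Slack incoming webhooks accept. The Ticket fields, the function name, and the sample ticket keys are hypothetical; fetching tickets from JIRA and posting to Slack are deliberately left out.

```python
from dataclasses import dataclass


@dataclass
class Ticket:
    # Hypothetical fields; a real JIRA issue carries many more.
    key: str
    team: str
    summary: str


def build_daily_summary(tickets, day):
    """Build a Slack-style payload summarizing customer-impacting tickets."""
    if not tickets:
        return {"text": f"No customer-impacting changes for {day}."}
    lines = [f"Customer-impacting changes for {day}:"]
    for ticket in tickets:
        lines.append(f"- {ticket.key} ({ticket.team}): {ticket.summary}")
    return {"text": "\n".join(lines)}


payload = build_daily_summary(
    [
        Ticket("EX-101", "API", "Add subaccounts endpoint"),
        Ticket("EX-102", "UI", "New dedicated IP pool screen"),
    ],
    "2016-06-01",
)
```

A bot along these lines would serialize the payload as JSON and send it to the channel's webhook URL on a schedule.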
By early 2016, when MailChimp announced it was discontinuing its developer-oriented standalone Mandrill transactional email service, we were in an excellent position to fill that gap for developers.
As these developers came over to SparkPost in droves, we were able to easily scale our platform and quickly add features that were in high demand, including subaccounts and dedicated IP pools.
Additionally, we ramped up our Developer Relations team to support this influx of developers. This team is grounded in the SparkPost Engineering department. The team’s mission is to support developers through client libraries, tools, content and even *gasp* direct human interaction. We love to interact with our developer community at various events, hackathons and on our community Slack team. You can find our upcoming events on developers.sparkpost.com.
In the summer of 2016 we collapsed Tech Ops into Engineering to improve collaboration and efficiency. At this time we created a new Site Reliability Engineering (SRE) team that incorporated a number of functions into a single cross-functional team within Engineering. This finally broke down any remaining walls between development and operations. This new team fully embraces the “infrastructure as code” approach and has oversight of cloud infrastructure, deployments, upgrades, monitoring, and auto-scaling while promoting discipline, safety, and a positive customer experience.
Thanks to the efforts of this team, we significantly improved our availability and reduced the number of customer-impacting events. We also built out more comprehensive and actionable monitoring and alerting, which has improved the overall customer experience – and boosted the morale of the team.
The Road Ahead
Besides lots of great features on the roadmap, we are rolling out the fourth generation of our AWS infrastructure. This latest iteration includes further decoupling of service tiers, improved automation and monitoring, replacing or augmenting some traditional technologies with AWS services such as SQS and DynamoDB, advances in auto-scaling of service tiers, and improvements in our outbound email proxy technology. We have some API performance improvements in the works as well, which users will love. We will complete the final transition of application configuration management from Puppet to Ansible and fill in any system configuration management gaps with Puppet. Early experiments with Amazon’s container services should begin to make their way into production. All of these advances will help make SparkPost APIs more reliable, more scalable, and faster.
This year we will likely move some services to continuous deployment. This means eliminating the manual button pushing for the final production deployment step. There is a little more work to do on our automated smoke tests and rollback scripts before we are ready, and we are exploring options such as canary releases.
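To make the canary idea concrete, here is a minimal sketch of the kind of gate that could replace the manual button push: it compares a canary's error rate with the baseline fleet and returns a decision. This is an illustration, not our pipeline; the function name, the Metrics type, and the thresholds (500 requests, 0.5 percentage points) are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Metrics:
    requests: int
    errors: int

    @property
    def error_rate(self):
        # Avoid dividing by zero when no traffic has been seen yet.
        return self.errors / self.requests if self.requests else 0.0


def canary_decision(baseline, canary, smoke_tests_passed,
                    min_requests=500, tolerance=0.005):
    """Return 'promote', 'rollback', or 'wait' for a canary release."""
    if not smoke_tests_passed:
        return "rollback"  # failed smoke tests always trigger a rollback
    if canary.requests < min_requests:
        return "wait"  # not enough traffic to judge the canary yet
    if canary.error_rate > baseline.error_rate + tolerance:
        return "rollback"  # canary is measurably worse than baseline
    return "promote"


decision = canary_decision(Metrics(10000, 20), Metrics(1000, 3), True)
```

With these sample numbers the baseline error rate is 0.2% and the canary's is 0.3%, which is within the assumed tolerance, so the canary would be promoted. A "rollback" result is where the automated rollback scripts would take over.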
Through disciplined devops practices and a lot of hard work over the past few years, the SparkPost team has achieved tremendous success. But we realize this is a journey and there is always more to do as we continue to scale up and move faster.
VP Engineering and Cloud Operations