What’s in Our Technical Operations Stack?
To follow-up our Email Infrastructure Stack post, here’s insight into what we use to provision, manage, and monitor the systems underneath that email infrastructure, our technical operations stack.
Like any properly agile development shop, we iterate a lot and our technical operations stack is no exception. We’re on our third generation of the stack, and nothing is set in stone. We’ll review several technologies we’ve progressed through and where we’re at currently. All our previous technologies were good fits at the time, but they’re not necessarily the right ones to use forever. Odds are we’ll also migrate past most of our current tools as our needs evolve. However, this is where we’ve been and where we are today.
Provisioning Generation 1: Installing in the Cloud
SparkPost as a company has been around for 15 years, providing the on-premise Momentum MTA product. The first request our engineering and ops teams got for what would become SparkPost was essentially, “give us a Momentum installation in the cloud.” Our goal then, as now, was to enable developers to use Momentum to send their mail through a cloud product. However, we didn’t want to waste time making guesses about what the specific needs would be in the cloud. We wanted to get something out there, and improve on it as we went. So the first iteration was simply a basic installation of our on-premise product in the cloud. We also added application and API layers to allow for multiple cloud customers to make use of that email infrastructure.
To achieve this, we spun up a basic set of instances in Amazon’s cloud. Our needs were basic and AWS’s API and CLI interfaces are fairly robust. Therefore, we automated our provisioning with basic bash scripts to create subnets, instances, security groups, and other AWS objects as needed. This method works well for creating a few assets here and there. Unfortunately, it doesn’t take advantage of everything AWS offers and doesn’t scale across a large team managing many assets.
Provisioning Generation 2: Embracing the Cloud
As our user base grew and our needs evolved, we began to re-architect our deployments in a method more appropriate for the cloud. We separated database and other components into their own dedicated tiers and partitioned some portions of the install from others via distinct VPCs.
To support this we needed a provisioning tool that could manage a full view of the environment and the objects inside of it. We evaluated several options here, including AWS Cloud Formation, and ultimately settled on Terraform from HashiCorp. Similar to AWS Cloud Formation and other tools, Terraform allows defining a full AWS environment, with all of its dependent objects, and then apply that configuration to have them all created and connected as defined. We use it to create entire VPCs, complete with peering connections, network routes, security groups, and instances.
Provisioning Generation 3: Living the Cloud
Two years into the life of SparkPost as a public cloud offering, we have enough understanding of our common use cases and expected growth to more fully embrace other cloud technologies to handle provisioning for us.
Our ops and engineering teams have put together a number of Ansible playbooks and Python scripts to further automate provisioning of AWS assets and software deployments. Through these automations we have decreased our time to deploy new assets and upgrade existing ones.
We are further separating our product components into dedicated tiers and configuring those individual tiers as AWS auto scaling groups. We are using Packer, also by HashiCorp, to further standardize our machine images for each tier. AWS CloudWatch monitoring of AWS SQS queues connected to our mail flows and other key traffic indicators drive the auto scaling of clusters. Once complete, this model will allow our deployment footprint to grow and shrink for each part of our technical operations stack as required by current workload demand.
Of course, provisioning of instances is only the first step. Once we launch them, we need to apply configuration, install applications, and handle maintenance.
Since our original launch, we have been primarily using Puppet for automated configuration management. Puppet is one of the best options around for maintaining configuration across a large collection of instances without the kinds of inconsistencies that can otherwise become common when several people are working on them. Also, the ability to insist on specific versions of packages means we know every instance will be kept in sync with the rest and not drift behind the state of the art. Between Puppet’s facter tool and Ruby templating engine there’s very little dynamic configuration that can’t be readily handled.
As reliable as Puppet is at maintaining a given state, it’s not necessarily the best tool for everything related to managing our instances. For changes we want to make at a specific point in time across an entire cluster with a high degree of control, we’ve found Ansible to be a much better fit. Over time, more of our deployments, upgrades, and even our creation and management of AWS objects have moved under Ansible. (We’ll be discussing that in a future article dedicated to our deployment model.)
Once instances are up and running with the right software and configuration, the real ops work starts. We need to monitor services and queues, and handle any potential impacts, preferably before they actually cause an issue.
We started out with Nagios and Check_MK, but in an effort to more fully embrace cloud technologies, we transitioned to Circonus to monitor our services. We actively monitor tens of thousands of data points in our cloud environment to ensure that our infrastructure and software is performing up to our standards. Circonus allows us to access all of our raw data for analysis. The raw data is invaluable in troubleshooting, post-mortem, and future planning activities.
Our on-call personnel receive alerts through OpsGenie when they detect a potential problem in our environment. The on-call personnel can then quickly assess the situation through the monitoring and AWS tools. Afterwards, it can take action to start remediation and repair. In less than 30 minutes, a team spanning multiple time zones will be on RingCentral and working towards a resolution.
Improvements to monitoring, data analysis, and remediation of issues are an on-going effort to enhance our visibility into our environment and reduce the potential of a customer impacting event. Automated remediation is the future of monitoring. We are currently exploring AWS SNS and Lambda as potential tools to help us realize that vision.
Hosting email in the cloud presents unique challenges. It’s inherently a push protocol, not a pull protocol. Subscribers who receive messages from us asked to receive them. Yet, unlike a video streaming service or news site they don’t ask for and then receive content immediately. They subscribe, then continue to receive messages from that subscription until they unsubscribe. If something goes wrong and they don’t receive a message, they may not immediately realize there’s a problem. Combined with the need to ensure users are only receiving email they requested and the related anti-spam and anti-phishing technologies, we must be constantly aware of the state of our delivery flows to a variety of service providers and the reputations of each of our IPs and customer domains with each of those providers. This is in addition to all the normal monitoring that our services are up, functional, and attempting to deliver email.
Fortunately, email was designed to be a robust protocol. While temporary delays and deferrals in delivery are not ideal, they are not the end of the world. And of course, we designed Momentum to deliver email, and that’s where it shines. Detailing everything Momentum does internally to handle fault tolerance and making sure email is delivered is well beyond the scope of this article. Our ops team knows we can rely on it as long as it’s running, patched, and properly configured.
Running in the cloud, we have to deal with additional abstractions on our network stack. We also can’t rely on the same kind of direct access to our external IP address bindings we would have in an on-premise installation. We rely heavily on AWS’ Elastic Load Balancer technology and a Momentum outgoing proxy module developed by our engineering team specifically to address these challenges.
We also use ELBs to manage load balancing and failover for our other tiers, both externally-facing and internal-only services. Keeping the services available to our customers hasn’t been a major problem for SparkPost. Especially because we combine the auto scaling discussed previously and an operations staff that is always on the ball. We currently deliver billions of messages a month. Despite that, we rarely had an outage or service degradation that resulted in the loss of even a single message.
Each of these topics have plenty more detail to them. The bottom line, is that our ops team loves to deliver your email. Moreover, we’re excited to make that more reliable and cost effective with each generation of new cloud technology. For more information on our technical operations stack, or other questions, check out our DevHub. Also feel free to hit us up on Slack or Twitter if you’d like to chat.
–Jeremy & Nick