A DNS Performance Incident

At SparkPost, we’re building an email delivery service with high performance, scalability, and reliability. We’ve made those qualities key design objectives, and they’re core to how we engineer and operate our service. In fact, we literally guarantee our service level and burst rates for our Enterprise service level customers.

However, we sometimes encounter technical limitations or operating conditions that have a negative impact on our performance. We recently experienced a challenging situation like this. On May 24, problems with our DNS infrastructure’s interaction with AWS’ network stack resulted in errors, delays, and slow system performance for some of our customers.

When events like this happen, we do everything we can to make things right. We also commit to our customers to be open and transparent about what happened and what we learn.

In this post, I’ll discuss what happened, and what we learned from that incident. But I’d like to begin by saying we accept responsibility for the problem and its impact on our customers.

We know our customers depend on reliable email delivery to support their business, security, and operational needs. We take it seriously when we don’t deliver the level of service our customers expect. I’m very sorry for that, as is our entire team.

Extreme DNS Usage on AWS Network Hits a Limit

Why did this slowdown happen? Our team quickly realized that routine DNS queries from our service were not being answered at a reasonable rate. We traced the issue to the DNS infrastructure we operate on the Amazon Web Services (AWS) platform. Initially we attempted to address query performance by increasing DNS server capacity by 500%, but that did not resolve the situation, and we continued to experience an unexplained and severe throttling. We then repointed DNS services for the vast majority of our customers at local nameservers in each AWS network segment, which were not experiencing performance issues. This is not the AWS-recommended long-term approach for our DNS volume, but we coordinated it with AWS as an interim measure that allowed us to restore service fully for all customers about five hours after the incident began.

I’ve written before about how critical DNS infrastructure is to email delivery, and the ways in which DNS issues can expose bugs or unexpected limits in cloud networking and hosting. In short, email makes extraordinarily heavy use of DNS, and SparkPost makes more use of DNS than nearly any other AWS customer. As such, DNS has an outsized impact on the overall performance of our service.

In this case, the root cause of the degraded DNS performance was another undocumented, practical limit in the AWS network stack. The limit was triggered under a specific set of circumstances, of which actual traffic load was only one component.

Limits like these are to be expected in any network infrastructure. But one area where the cloud provides unique challenges is troubleshooting them when the network stack is itself an abstraction, and the traffic interactions are much more of a shared responsibility than they would be in a traditional environment.

Diagnosing this problem during the incident was difficult for us and the AWS support team alike. It ultimately required several days of effort and assistance from the AWS engineering team after the fact to recreate the issue in a test environment and then identify the root cause.

Working with the AWS Team

Technology stacks aside, we know how much our customers benefit from the expertise of our technical and service teams who understand email at scale inside and out. We actually mean it when we say, “our technology makes it possible—our people make the difference.”

That’s also been true working with Amazon. The AWS team has been essential throughout the process of identifying and resolving the DNS performance problem that affected our service last week. SparkPost’s site reliability engineering team worked closely with our AWS counterparts until we clearly understood the root cause.

Here are some of the things we’ve learned about working together on this kind of problem solving:

  • Your AWS technical account manager is your ally. Take advantage of your account team. They’re advocates and guides to navigate AWS internal resources and services. They can reach out to product managers and internal engineering resources within AWS. They can hunt down internal information not readily available in online docs. And they really understand how urgent issues like the one we encountered can be to business operations. If a support ticket or other issue is not getting the attention it deserves, don’t hesitate to push harder.
  • Educate AWS on your unique use cases. Ensure that the AWS account team—especially your TAM team and solution architect—are involved in as much of your daily workflow as possible. This way, they can learn your needs first hand and represent them inside of AWS. That’s a really important part of keeping the number of unexpected surprises to a minimum.
  • Run systematic tests and generate data to help AWS troubleshoot. The Amazon team is going to investigate the situation on their end, and of course they have great tools and visibility at the platform layer to do that. But they can’t replicate your setup, especially when you’ve built highly specialized and complex services like ours. Running systematic tests and providing the data to the AWS team will provide them with invaluable information that can help to isolate an unknown problem to a particular element of the platform infrastructure. And they can monitor things on their end during these tests to gain additional insight into the issue.
  • Make it easy for engineers on both teams to collaborate directly. Though your account team is critical, they also know when letting AWS’ engineers and your engineers work together directly will save time and improve communication. It’s to your advantage to make that as easy as possible. Inviting the AWS team into a shared Slack channel, for example, is a great way to work together in real-time—and to document the interactions to help further troubleshooting and reproduce context in the future. Make use of other collaboration tools such as Google docs for sharing findings and plans. Bring the AWS team onto your operations bridge line during incidents and use conference calls for regular check-ins following an incident.
  • Understand that you’re in it together. AWS is a great technical stack for building cloud-native services. But one of the things we’ve come to appreciate about Amazon is how openly they work through hard problems when a specialized service like SparkPost pushes the AWS infrastructure into some edge cases. Their team has supported us in understanding root causes, developing solutions, and ultimately taking their learnings back to help AWS itself continue to evolve.

The AWS network and platform is a key part of SparkPost’s cloud architecture. We’ve developed some great knowledge about leveraging AWS from a technical perspective. We’ve also come to realize how important support from the AWS team can be when working to resolve issues in the infrastructure when they do arise.

Looking Ahead

In the coming weeks, we will write more in detail about the DNS architecture changes we are currently rolling out. They’re an important step towards increasing the resilience of our infrastructure.

Whether you’re building for the AWS network yourself, or a SparkPost customer who relies on our cloud infrastructure, I hope this explanation of what we’ve learned has been helpful. And of course, please reach out to me or any of the SparkPost team if you’d like to discuss last week’s incident.

—Chris McFadden
VP Engineering and Cloud Operations
@cristoirmac

SparkPost today is synonymous with the concept of a cloud MTA. But you might not know how deep our expertise with MTAs runs. For more than a decade, the SparkPost team has been building the technology that powers some of the most demanding deployments of enterprise MTAs in the world. In fact, more than 25% of the world’s non-spam mail is sent using our MTAs every day.

Those are impressive figures to be sure. So when we say we’re proud that SparkPost has become the world’s fastest-growing email delivery service, we know that one reason for the trust given to us is the credibility that comes from having installations of our Momentum and PowerMTA software deployed in the data centers of the largest Email Service Providers (ESPs) and other high-volume senders such as LinkedIn and Twitter.

As CTO of SparkPost, my team and I also have faced the sizable challenge—albeit a rewarding one—of migrating complex, highly optimized software like MTAs to a modern cloud architecture. Our team’s experience developing and managing high-performance email infrastructure has been a major part of why SparkPost has been successful with that transformation, but so too has been our vision of what a “cloud native” service really entails.

A few years ago, our team and many of our customers recognized that the cloud promised the ability to deliver the performance of our best-in-class messaging with dramatically improved economics and business flexibility. We understood that not only would it be more cost-effective for our customers to get started, but that it also would reduce the ongoing burden on their resources in areas like server maintenance, software maintenance, and deliverability analysis and resolution.

To get there, we knew we needed to do it the right way. Standing up servers in a data center wasn’t an option—because traditional data center models would limit our scalability, reliability, and operational flexibility in all the same ways our customers were trying to avoid!

That’s a big part of why we selected Amazon Web Services (AWS) to provide SparkPost’s underlying infrastructure. Platforms such as AWS, Microsoft Azure, Heroku, and others have many great qualities, but building a cloud-native messaging solution is conceptually a lot more than taking an MTA and installing it on a virtual machine in the sky.

There are times when architecting for the cloud necessarily embodies contradictory requirements. Just consider these architectural challenges of bringing something like an MTA into the cloud, for example:

  • Scaling Stateful Systems in the Cloud. One of the primary lures of deploying within a cloud provider is the ability to take advantage of push-button server deployments and auto-scaling. For the majority of AWS customers this is very straightforward; most of them deploy web-based applications of some form, following well-established patterns for creating a stateful application using stateless web servers. A mail server, however, is inherently stateful; it implements a store-and-forward messaging protocol delivering to tens of thousands of unique endpoints. In practice, some messages may need to be queued for extended periods of time (minutes, hours, or even days) during normal operation. Thus, like a database, it is significantly harder to scale in the cloud, since typical load-driven scale-up/scale-down logic can’t be applied (see the sketch after this list).
  • Limitless Limitations. Cloud infrastructure like AWS doesn’t magically change the laws of physics—even if it does make them a lot easier to manage. Still, every service has a limit, whether published or not. These limits not only affect what instance types you deploy on, but how you have to architect your solution to ensure that it scales in every direction. From published limits on how many IPs per instance you can allocate for sending, to unpublished DNS limitations, every AWS limit needs to be reviewed and planned for (and you have to be ready for the unexpected through monitoring and fault-tolerant architecture).
  • IP Reputation Management. A further complication both in general cloud email deployments, but especially in auto-scaling, is managing the dynamic allocation of sending resources without having to warm up new IPs. You need the ability to dynamically coordinate message routing across all your MTAs and to decouple the MTA processing a message from the IP assignment/management logic.
  • It Takes a Village. Moving to the cloud is not just a technology hurdle—it took the right people to make sure our customers were successful. We had to bring in expertise in engineering, security, operations, deliverability, and customer care to ensure the success of our customers in a scalable cloud-driven environment.
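As a concrete illustration of the store-and-forward scaling problem in the first bullet above, here is a minimal sketch of one way a node might refuse scale-in until its queues drain. It assumes an EC2 Auto Scaling group and a hypothetical local_queue_depth() helper; it is one way to approach the problem, not a description of how SparkPost actually does it.

```python
# Sketch: keep load-driven scale-in from terminating an MTA node with queued mail.
# ASG_NAME, INSTANCE_ID, and local_queue_depth() are hypothetical placeholders.
import boto3

ASG_NAME = "mta-tier"
INSTANCE_ID = "i-0123456789abcdef0"

autoscaling = boto3.client("autoscaling")

def local_queue_depth() -> int:
    # Hypothetical helper: ask the local MTA how many messages it still holds.
    return 0  # replace with a real check against the mail queues

def update_scale_in_protection() -> None:
    # Protect this instance whenever it still holds undelivered mail,
    # and release the protection once its queues have drained.
    autoscaling.set_instance_protection(
        InstanceIds=[INSTANCE_ID],
        AutoScalingGroupName=ASG_NAME,
        ProtectedFromScaleIn=local_queue_depth() > 0,
    )
```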

As I noted earlier, building and deploying a true cloud MTA is a lot more complex than putting our software up on a virtual server. But the end results show why services like SparkPost are so important to how businesses consume technology today.

The cloud can make even the most complex systems feel deceptively simple, which allows the technical and business benefits to be front and center. But if you’re a software engineer or architect building for the cloud, you understand how important solving these complex needs really is to achieving that.

So, if you’re building services like ours, I’m interested in hearing about your experiences and what you’ve run into as you’ve developed for the cloud. Ping me on Twitter, or leave a comment below.

—George Schlossnagle

How We Tracked Down Unusual DNS Failures in AWS

We’ve built SparkPost around the idea that a cloud service like ours needs to be cloud-native itself. That’s not just posturing. It’s our cloud architecture that underpins the scalability, elasticity, and reliability that are core aspects of the SparkPost service. Those qualities are major reasons we’ve built our infrastructure atop Amazon Web Services (AWS)—and it’s why we can offer our customers service level and burst rate guarantees unmatched by anyone else in the business.

But we don’t pretend that we’re never challenged by unexpected bugs or limits of available technology. We ran into something like this last Friday, and that incident led to intermittent slowness in our service and delivery delays for some of our customers.

First let me say, the issue was resolved that same day. Moreover, no email or related data was lost. However, if delivery of your emails was slowed because of this issue, please accept my apology (in fact, an apology from our entire team). We know you count on us, and it’s frustrating when we’re not performing at the level you expect.

Some companies are tempted to brush issues like a service degradation under the rug and hope no one notices. You may have experienced that with services you’ve used in the past. I know I have. But that’s not how we like to do business.

I wanted to write about this incident for another reason as well: we learned something really interesting and valuable about our AWS cloud architecture. Teams building other cloud services might be interested in learning about it.

TL;DR

We ran into undocumented practical limits of the EC2 instances we were using for our primary DNS cluster. Sizing cloud instances based on traditional specs (processor, memory, etc.) usually works just as you’d expect, but sometimes that traditional hardware model doesn’t apply. That’s especially true in atypical use cases where aggregate limits can come into play—and there are times you run headlong into those scenarios without warning.

We hit such a limit on Friday when our DNS query volume created a network usage pattern for which our instance type wasn’t prepared. However, because that limit wasn’t obvious from the docs or standard metrics available, we didn’t know we’d hit it. What we observed was a very high rate of DNS failures, which in turn led to intermittent delays at different points in our architecture.
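If you run a similar workload and want to see this kind of failure yourself, here is a minimal sketch of a probe, using the dnspython library with a placeholder nameserver and test domains, that samples lookup latency and failure rate from inside an instance. It isn’t our monitoring code; it just illustrates the signal we were watching.

```python
# Sketch: sample DNS lookup latency and failure rate from an instance.
# Requires the dnspython package; the nameserver and domains are placeholders.
import time
import dns.resolver
import dns.exception

resolver = dns.resolver.Resolver()
resolver.nameservers = ["10.0.0.2"]   # placeholder: your VPC or cluster resolver
resolver.lifetime = 2.0               # treat anything slower than 2s as a failure

def probe(domains, qtype="MX"):
    failures, latencies = 0, []
    for domain in domains:
        start = time.monotonic()
        try:
            resolver.resolve(domain, qtype)
            latencies.append(time.monotonic() - start)
        except dns.exception.DNSException:
            failures += 1
    return {
        "failure_rate": failures / len(domains),
        "avg_latency_ms": 1000 * sum(latencies) / len(latencies) if latencies else None,
    }

print(probe(["example.com", "example.org", "example.net"]))
```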

Digging Deeper into DNS

Why is our DNS usage special? Well, it has a lot to do with the way email works, compared to the content model for which AWS was originally designed. Web-based content delivery makes heavy use of what might be considered classic inbound “pull” scenarios: a client requests data, be it HTML, video streams, or anything else, from the cloud. But the use cases for messaging service providers like SparkPost are exceptions to the usual AWS scenario. In our case, we do a lot of outbound pushing of traffic: specifically, email (and other message types like SMS or mobile push notifications). And that push-style traffic relies heavily on DNS.

If you’re familiar with DNS, you may know that it’s generally fairly lightweight data. To request a given HTML page, you first have to ask where that page can be found on the Internet, but that request is a fraction of the size of the content you retrieve.

Email, however, makes exceptionally heavy use of DNS to look up delivery domains—for example, SparkPost sends many billions of emails to over 1 million unique domains every month. For every email we deliver, we have to make a minimum of two DNS lookups, and the use of DNS TXT records for anti-phishing technologies like SPF and DKIM means DNS is also required to receive mail. Add to that our more traditional use of AWS API services for our apps, and it’s hard to exaggerate how important DNS is to our infrastructure.
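To make that concrete, here is a rough sketch, again using dnspython with an illustrative domain and DKIM selector, of the lookup chain behind a single delivery: the MX lookup, the A lookup for the preferred exchange, and the TXT records that receivers consult for SPF and DKIM.

```python
# Sketch: the DNS lookups behind delivering one message to one domain.
# Uses dnspython; the domain and DKIM selector are illustrative only.
import dns.resolver

def delivery_lookups(domain: str, dkim_selector: str = "selector1"):
    # 1) Find the receiving mail servers for the domain.
    mx_records = sorted(dns.resolver.resolve(domain, "MX"),
                        key=lambda rr: rr.preference)
    preferred_mx = str(mx_records[0].exchange)

    # 2) Resolve the preferred exchange to an address we can connect to.
    addresses = [str(rr) for rr in dns.resolver.resolve(preferred_mx, "A")]

    # 3) Receivers also hit DNS for authentication: SPF and DKIM live in TXT records.
    spf_txt = [str(rr) for rr in dns.resolver.resolve(domain, "TXT")]
    dkim_record = f"{dkim_selector}._domainkey.{domain}"

    return {"mx": preferred_mx, "a": addresses, "spf_txt": spf_txt, "dkim_record": dkim_record}

# Example: delivery_lookups("example.com")
```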

All of this means we ran into an unusual condition in which our growing volume of outbound messages created a DNS traffic volume that hit an aggregate network throughput limit on instance types that otherwise seemed to have sufficient resources to service that load. And as denial-of-service attacks on the Dyn DNS infrastructure last year demonstrated, when DNS breaks, everything breaks. (That’s something anyone who builds systems that rely on DNS already knows painfully well.)

The sudden DNS issues triggered a response by our operations and reliability engineering teams to identify the problem. They teamed with our partners at Amazon to escalate on the AWS operations side. Working together, we identified the cause and a solution: we deployed a cluster of larger nameserver instances, chosen with network capacity in mind, that could fulfill our DNS needs without hitting throughput redlines. Fortunately, because all this was within AWS, we could spin up the new instances and even resize existing instances very quickly. DNS resumed normal behavior, lookup failures ceased, and we (and outbound message delivery) were back on track.

To mitigate against this specific issue in the future, we’re also making DNS architecture changes to better insulate our core components from the impact of encounters with similar, unexpected thresholds. We’re also working with the Amazon team to determine appropriate monitoring models that will give us adequate warning to head off a similar incident before it affects any of our customers.

AWS and the Cloud’s Silver Lining

I don’t want to sugarcoat the impact of this incident on our customers. But our ability to identify the underlying issue as an unexpected interaction of our use case with the AWS infrastructure—and then find a resolution to it in very short order—has a lot to do with how we built SparkPost, and our great relationship with the Amazon team.

SparkPost’s superb operations corps, our Site Reliability Engineering (SRE) team, and our principal technical architects work with Amazon every day. The strengths of AWS’ infrastructure have given us a real leg up in optimizing SparkPost’s architecture for the cloud. Working so closely with AWS over the past two years has also taught us a lot about spinning up AWS infrastructure and moving quickly, and we have the benefit of deep support from the AWS team.

If we had to work around a similar limitation in a traditional data center model, something like this could take days or even weeks to fully resolve. That agility and responsiveness are just two of the reasons we’ve staked our business on the cloud and AWS. Together, the kind of cloud expertise our companies share is hard to come by. Amazon has been a great business partner to us, and we’re really proud of what we’ve done with the AWS stack.

SparkPost is the first email delivery service that was built for the cloud from the start. We send more email from a true cloud platform than anyone, and sometimes that means entering uncharted territory. It’s a fundamental truth of computer science that you don’t know what challenges occur at scale until you hit them. We found one on AWS, but our rapid response is a great example of the flexibility the cloud makes possible. It’s also our commitment to our customers.

Whether you’re building your own infrastructure on AWS, or a SparkPost customer who takes advantage of ours, I hope this explanation of what happened last Friday, and how we resolved it, has been useful.

—Chris McFadden
VP Engineering and Cloud Operations
@cristoirmac


There are many ways to obtain metadata about your transmissions sent via SparkPost. We built a robust reporting system with over 40 different metrics to help you optimize your email deliverability. At first, we attempted to send metadata to our customers via carrier pigeons to meet customer demand for a push-based event system. We soon discovered that the JSON the birds delivered was not as clean as customers wanted. That’s when we decided to build a scalable Webhooks infrastructure using more modern technologies.

Event Hose

Like our reporting, the webhook infrastructure at SparkPost begins with what we call our Event Hose. This piece of the Momentum platform generates the raw JSON data that will eventually reach your webhook endpoint. As Bob detailed in his Reporting blog post, after every message generation, bounce event, delivery, etc., Momentum logs a robust JSON object describing every quantifiable detail (we found unquantifiable details didn’t fit into the JSON format very well) of the event that occurred.

Each of these JSON event payloads is loaded into an AMQP-based RabbitMQ exchange. The exchange fans the messages out to the appropriate queues, including the queue that will hold your webhooks traffic. We currently use RabbitMQ as a key part of our application’s infrastructure stack to queue and reliably deliver messages. We use a persistent queue to ensure that RabbitMQ holds each message until it’s delivered to your consumer. In addition, the system we’ve built is ready to handle failures, downtime, and retries.
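As a rough illustration of that flow, here is a sketch using the pika client, with placeholder exchange, queue, and event fields rather than our actual names, that publishes one event durably to a fanout exchange bound to a webhooks queue:

```python
# Sketch: publish a JSON event to a RabbitMQ fanout exchange with persistence.
# Uses the pika client; exchange/queue names and the event shape are placeholders.
import json
import pika

connection = pika.BlockingConnection(pika.ConnectionParameters(host="localhost"))
channel = connection.channel()

# Durable exchange and queue so events survive a broker restart.
channel.exchange_declare(exchange="events", exchange_type="fanout", durable=True)
channel.queue_declare(queue="webhooks", durable=True)
channel.queue_bind(queue="webhooks", exchange="events")

event = {"type": "delivery", "recipient": "user@example.com", "timestamp": 1496160000}

channel.basic_publish(
    exchange="events",
    routing_key="",  # ignored by fanout exchanges
    body=json.dumps(event),
    properties=pika.BasicProperties(delivery_mode=2),  # mark the message persistent
)
connection.close()
```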

Webhooks ETL

Between RabbitMQ and your consumer, we have an ETL process that will create batches of these JSON events for each webhook you have created. We believe in the “eat your own dogfood” philosophy for our infrastructure. So our webhooks ETL process will call out to our public webhooks API to find out where to send your batches. Additional headers or authentication data may be added to the POST request. Then the batch is on its way to your consumer.
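In spirit, the delivery step looks something like the sketch below. The endpoint URL, header names, and token are placeholders, not our exact implementation; the key point is that only an HTTP 200 counts as success.

```python
# Sketch: POST a batch of JSON events to a webhook consumer and report success.
# URL, headers, and token are placeholders; a 200 response means the batch was accepted.
import requests

def post_batch(endpoint_url: str, batch: list, auth_token: str, batch_id: str) -> bool:
    headers = {
        "Content-Type": "application/json",
        "Authorization": auth_token,   # whatever auth the webhook was configured with
        "X-Batch-ID": batch_id,        # placeholder header for correlating retries
    }
    try:
        response = requests.post(endpoint_url, json=batch, headers=headers, timeout=10)
    except requests.RequestException:
        return False  # network errors and timeouts count as failures
    return response.status_code == 200
```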

If your webhooks consumer endpoint responds to the POST request in a timely manner with an HTTP 200 response, the ETL process will acknowledge and remove the batch of messages from RabbitMQ. If the batch fails to POST to your consumer for any reason (timeout, 500 server error, etc.), it will be added to a RabbitMQ delayed queue. This queue holds the batch for a certain amount of time; we retry batches using an increasing backoff strategy based on how many times delivery has been attempted. After the holding time has elapsed, the ETL process will receive the already-processed batch and send it to your endpoint again. This retry process is repeated until either your consumer has accepted the batch with a 200 response or the maximum number of retries has been reached.
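The retry schedule itself can be as simple as the sketch below, which computes an increasing delay from the attempt count and republishes the batch with a per-message TTL so it re-emerges from a holding queue later. This is one common RabbitMQ pattern; the queue names, delays, and retry cap are illustrative rather than our exact configuration.

```python
# Sketch: requeue a failed batch with an increasing delay based on attempt count.
# Assumes a "webhook_batches_delayed" queue that dead-letters back to the live queue;
# names, delays, and the retry cap are illustrative.
import json
import pika

BASE_DELAY_SECONDS = 60
MAX_DELAY_SECONDS = 3600
MAX_ATTEMPTS = 10

def retry_delay(attempts: int) -> int:
    # Exponential backoff, capped so retries never wait more than an hour.
    return min(BASE_DELAY_SECONDS * (2 ** attempts), MAX_DELAY_SECONDS)

def requeue_batch(channel, batch: dict, attempts: int) -> bool:
    if attempts >= MAX_ATTEMPTS:
        return False  # give up; record the failure for the status API instead
    channel.basic_publish(
        exchange="",
        routing_key="webhook_batches_delayed",
        body=json.dumps({"batch": batch, "attempts": attempts + 1}),
        properties=pika.BasicProperties(
            delivery_mode=2,
            expiration=str(retry_delay(attempts) * 1000),  # per-message TTL in ms
        ),
    )
    return True
```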

As each batch is attempted, the ETL also sends updates to the webhook API with status data about each batch. We keep track of the consumer’s failure code, number of retries and batch ID. If your webhook is having problems accepting batches, you can access this status data via the webhook API. You can also access it through the UI by clicking “View Details” in your webhook’s batch status row.

Conclusion

Webhooks are an extremely useful part of the SparkPost infrastructure stack. They allow customers to receive event-level metadata on all of their transmissions in a push model. While we’re operating on RabbitMQ today, we’re always looking at more modern cloud-based message queueing technologies, such as SQS, to see what can best help us meet our customers’ needs.

If you’d like to see webhooks in action, try creating a webhook for your SparkPost account. As always, if you have any questions or would simply like to chat, swing by the SparkPost community Slack.

–Jason Sorensen, Lead Data Scientist



At SparkPost we send over 25% of all non-spam email, but how do we account for all of those messages? Humans! Our well-equipped team calculates over 40 metrics and gives you the ability to dissect them in any way. Well, that was our first stab at this capability, but we quickly realized it was too much work for them to do. Joking aside, scalable reporting is a challenging problem because we have to process millions of events per minute and ensure they can be queried in a timely manner. I hope to explain in some detail how we’ve solved this problem.

Sending email is great, and we tell you we’re the best email service out there. However, if you’re like me, you need some hard evidence of this claim. You may not realize it, but there’s a lot that happens when sending an email: Momentum (our email engine) receives a message (injection), the ISP receives it (delivery), the recipient’s mailbox is full (soft bounce), and so on. Seeing this from a 10,000-foot view is great for reporting back to your boss, but you’ll also need to diagnose problems by inspecting the lifecycle of a single message. We have solutions for both, and the architectural decisions we made allow us to easily add new uses of this data. We implemented a strategy called ETL (extract, transform, and load). This pattern is what powers Metrics, Message Events, Webhooks, and Suppressions. I’ll walk you through this process, starting with the Event Hose.

Event Hose

The event hose is responsible for keeping track of all the different events that can occur during the course of sending messages (injection, delivery, bounce, rejection, etc.). It logs these events as they occur, in JSON, to an exchange in RabbitMQ. By offloading the queuing of these events from Momentum, it provides good separation and allows our operations team to scale this cluster of servers independently. If you are familiar with the pub-sub model, the event hose acts as the publisher. However, without subscribers, these messages would float into the ether. What are our subscribers and how do they work, you ask? Read on!

Metrics ETL

The Metrics ETL is the first of a collection of subscribers in this stack. It is a clustered Node.js process that binds to a RabbitMQ queue. This process receives messages from the queue as they are emitted, transforms the data to adhere to our reporting schema, and loads it in batches into a database called Vertica.
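Conceptually, the consumer buffers events from the queue and flushes them to Vertica in batches. The sketch below shows the idea in Python with made-up queue, table, and column names; our real ETL is the clustered Node.js process described above.

```python
# Sketch: consume events from RabbitMQ, buffer them, and load batches into Vertica.
# Queue name, table, columns, and credentials are illustrative placeholders.
import json
import pika
import vertica_python

BATCH_SIZE = 1000
buffer = []

vertica = vertica_python.connect(host="vertica.internal", port=5433,
                                 user="etl", password="...", database="metrics")

def flush(channel, last_delivery_tag):
    rows = [(e["timestamp"], e["type"], e["recipient_domain"]) for e in buffer]
    cursor = vertica.cursor()
    cursor.executemany(
        "INSERT INTO email_events (event_time, event_type, recipient_domain) "
        "VALUES (%s, %s, %s)", rows)
    vertica.commit()
    # Acknowledge everything up to the last message in the batch at once.
    channel.basic_ack(delivery_tag=last_delivery_tag, multiple=True)
    buffer.clear()

def on_message(channel, method, properties, body):
    buffer.append(json.loads(body))
    if len(buffer) >= BATCH_SIZE:
        flush(channel, method.delivery_tag)

connection = pika.BlockingConnection(pika.ConnectionParameters("localhost"))
channel = connection.channel()
channel.basic_qos(prefetch_count=BATCH_SIZE)
channel.basic_consume(queue="metrics", on_message_callback=on_message)
channel.start_consuming()
```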

Message Events ETL

Like the Metrics ETL, this is a clustered Node.js process. However, it binds to a different queue and has its own independent control over processing the data. It also loads into Vertica, but into a more flexible schema called a flex table. Think of it as MongoDB on steroids. As mentioned in the intro, there are other uses of this data that I won’t get into today: webhooks and suppressions have their own processes and logic for handling it.

Vertica

We spent a great deal of time vetting analytic database solutions to fit our many different use cases. The big advantage of Vertica is projections, similar to materialized views, which let us store raw event data and model very complex queries for all the different ad-hoc drill-downs and groupings (domain, time series, sending pool, etc.). Lastly, it is horizontally scalable and allows us to easily add new nodes as our load and data set increase.

HTTP API Layer

As the processes described above load the data, users can retrieve it from several API endpoints. As explained in How SparkPost Built the Best Email API for Developers, we use RESTful web APIs. The Metrics API provides a variety of endpoints enabling you to retrieve a summary of the data, data grouped by a specific qualifier, or data by event type. This means I can see statistics about my emails at an aggregate level, over a time range, grouped by domain, subaccount, IP pool, sending domain, and so on. The capabilities are extensive, and our users love the different dissections of the data they can retrieve in almost real time. We retain this information for six months to allow for trending of the data over time. If you need the data longer than that, we encourage you to set up a webhook and load the data into your own business intelligence tools.
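For example, pulling an aggregate deliverability summary over a time window looks roughly like the sketch below. The endpoint, parameters, and metric names are illustrative; check the SparkPost API documentation for the exact options your account supports.

```python
# Sketch: query the SparkPost Metrics API for aggregate deliverability counts.
# Dates, metric names, and the API key are placeholders; see the API docs for details.
import requests

API_KEY = "YOUR_SPARKPOST_API_KEY"

response = requests.get(
    "https://api.sparkpost.com/api/v1/metrics/deliverability",
    headers={"Authorization": API_KEY},
    params={
        "from": "2017-06-01T00:00",
        "to": "2017-06-07T00:00",
        "metrics": "count_targeted,count_delivered,count_bounce",
    },
    timeout=30,
)
response.raise_for_status()
print(response.json()["results"])
```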

The Message Events API allows a user to search the raw events the event hose logs above. These events are retained for 10 days and are intended for more immediate debugging of your messages (push, email, etc.).

Web User Interface

We built SparkPost for the developer first, but we understand that not all of our users are technical. We provide a Reports UI that allows a user to drill down by many different facets, such as recipient domain, sending domain, IP pool, and campaign. It is built using the same APIs mentioned above.

Conclusion

I hope this sheds some light on how SparkPost processes a large amount of data and makes it available to you. We’re also currently working on re-architecting everything I just talked about. We’ve learned a lot over the first 18 months of SparkPost, especially about managing many different tiers of our own infrastructure. I’ve personally spent many hours triaging and fighting fires around RabbitMQ and Vertica. We have decided to leverage a service-based message queue, SQS, and are starting to investigate service-based alternatives to Vertica. I plan on writing a follow-up later in the year, so stay tuned!

Our knowledgeable staff also uses this data to ensure you’re making the best decisions when sending your messages with SparkPost. I encourage you to start using these APIs and the WebUI to start digging into how your messages are performing. It can also be crucial if you get stuck in one of those Lumbergh moments and have to provide an email report to your boss by 5pm on a Friday, or need to dig into why an ISP is bouncing your email. We’ve also seen great uses of our APIs in hackathon projects. So get creative and let us help you build something awesome.

–Bob Evans, Director of Engineering


What’s in Our Technical Operations Stack?

To follow up on our Email Infrastructure Stack post, here’s some insight into what we use to provision, manage, and monitor the systems underneath that email infrastructure: our technical operations stack.

Like any properly agile development shop, we iterate a lot and our technical operations stack is no exception. We’re on our third generation of the stack, and nothing is set in stone. We’ll review several technologies we’ve progressed through and where we’re at currently. All our previous technologies were good fits at the time, but they’re not necessarily the right ones to use forever. Odds are we’ll also migrate past most of our current tools as our needs evolve. However, this is where we’ve been and where we are today.

Provisioning Generation 1: Installing in the Cloud

SparkPost as a company has been around for 15 years, providing the on-premise Momentum MTA product. The first request our engineering and ops teams got for what would become SparkPost was essentially, “give us a Momentum installation in the cloud.” Our goal then, as now, was to enable developers to use Momentum to send their mail through a cloud product. However, we didn’t want to waste time making guesses about what the specific needs would be in the cloud. We wanted to get something out there, and improve on it as we went. So the first iteration was simply a basic installation of our on-premise product in the cloud. We also added application and API layers to allow for multiple cloud customers to make use of that email infrastructure.

To achieve this, we spun up a basic set of instances in Amazon’s cloud. Our needs were basic and AWS’s API and CLI interfaces are fairly robust. Therefore, we automated our provisioning with basic bash scripts to create subnets, instances, security groups, and other AWS objects as needed. This method works well for creating a few assets here and there. Unfortunately, it doesn’t take advantage of everything AWS offers and doesn’t scale across a large team managing many assets.
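Those first-generation scripts amounted to little more than direct calls against the EC2 API. In Python terms (a boto3 sketch with placeholder CIDRs, AMI, and IDs rather than our original bash), the pattern looked roughly like this:

```python
# Sketch: first-generation provisioning as direct EC2 API calls (placeholder values).
import boto3

ec2 = boto3.client("ec2", region_name="us-east-1")

# Carve a subnet out of an existing VPC.
subnet = ec2.create_subnet(VpcId="vpc-0123456789abcdef0", CidrBlock="10.0.1.0/24")

# Create a security group and allow inbound SMTP submission.
sg = ec2.create_security_group(
    GroupName="mta-tier", Description="MTA instances",
    VpcId="vpc-0123456789abcdef0")
ec2.authorize_security_group_ingress(
    GroupId=sg["GroupId"], IpProtocol="tcp", FromPort=587, ToPort=587,
    CidrIp="0.0.0.0/0")

# Launch an instance into that subnet with the new security group.
ec2.run_instances(
    ImageId="ami-0123456789abcdef0", InstanceType="c4.xlarge",
    MinCount=1, MaxCount=1,
    SubnetId=subnet["Subnet"]["SubnetId"],
    SecurityGroupIds=[sg["GroupId"]],
)
```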

Provisioning Generation 2: Embracing the Cloud

As our user base grew and our needs evolved, we began to re-architect our deployments in a method more appropriate for the cloud. We separated database and other components into their own dedicated tiers and partitioned some portions of the install from others via distinct VPCs.

To support this, we needed a provisioning tool that could manage a full view of the environment and the objects inside of it. We evaluated several options, including AWS CloudFormation, and ultimately settled on Terraform from HashiCorp. Similar to CloudFormation and other tools, Terraform allows us to define a full AWS environment, with all of its dependent objects, and then apply that configuration to have everything created and connected as defined. We use it to create entire VPCs, complete with peering connections, network routes, security groups, and instances.

Provisioning Generation 3: Living the Cloud

Two years into the life of SparkPost as a public cloud offering, we have enough understanding of our common use cases and expected growth to more fully embrace other cloud technologies to handle provisioning for us.

Our ops and engineering teams have put together a number of Ansible playbooks and Python scripts to further automate provisioning of AWS assets and software deployments. Through these automations we have decreased our time to deploy new assets and upgrade existing ones.

We are further separating our product components into dedicated tiers and configuring those individual tiers as AWS auto scaling groups. We are using Packer, also by HashiCorp, to further standardize our machine images for each tier. CloudWatch monitoring of the AWS SQS queues connected to our mail flows, along with other key traffic indicators, drives the auto scaling of these clusters. Once complete, this model will allow our deployment footprint to grow and shrink for each part of our technical operations stack as required by current workload demand.
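As a rough sketch of that wiring (placeholder names and thresholds rather than our production values), a queue-depth alarm attached to an Auto Scaling policy looks something like this:

```python
# Sketch: scale a worker tier on SQS queue depth via a CloudWatch alarm (placeholder values).
import boto3

autoscaling = boto3.client("autoscaling")
cloudwatch = boto3.client("cloudwatch")

# A simple "add one instance" policy on the worker tier's Auto Scaling group.
policy = autoscaling.put_scaling_policy(
    AutoScalingGroupName="delivery-workers",
    PolicyName="scale-out-on-queue-depth",
    AdjustmentType="ChangeInCapacity",
    ScalingAdjustment=1,
    Cooldown=300,
)

# Trigger that policy when the queue backs up for five consecutive minutes.
cloudwatch.put_metric_alarm(
    AlarmName="delivery-queue-backlog",
    Namespace="AWS/SQS",
    MetricName="ApproximateNumberOfMessagesVisible",
    Dimensions=[{"Name": "QueueName", "Value": "outbound-mail"}],
    Statistic="Average",
    Period=60,
    EvaluationPeriods=5,
    Threshold=1000,
    ComparisonOperator="GreaterThanThreshold",
    AlarmActions=[policy["PolicyARN"]],
)
```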

Configuration Management

Of course, provisioning of instances is only the first step. Once we launch them, we need to apply configuration, install applications, and handle maintenance.

Since our original launch, we have primarily used Puppet for automated configuration management. Puppet is one of the best options around for maintaining configuration across a large collection of instances without the kinds of inconsistencies that can otherwise become common when several people are working on them. Also, the ability to insist on specific versions of packages means we know every instance will be kept in sync with the rest and not drift behind the state of the art. Between Puppet’s Facter tool and its Ruby templating engine, there’s very little dynamic configuration that can’t be readily handled.

As reliable as Puppet is at maintaining a given state, it’s not necessarily the best tool for everything related to managing our instances. For changes we want to make at a specific point in time across an entire cluster with a high degree of control, we’ve found Ansible to be a much better fit. Over time, more of our deployments, upgrades, and even our creation and management of AWS objects have moved under Ansible. (We’ll be discussing that in a future article dedicated to our deployment model.)

Monitoring

Once instances are up and running with the right software and configuration, the real ops work starts. We need to monitor services and queues, and handle any potential impacts, preferably before they actually cause an issue.

We started out with Nagios and Check_MK, but in an effort to more fully embrace cloud technologies, we transitioned to Circonus to monitor our services. We actively monitor tens of thousands of data points in our cloud environment to ensure that our infrastructure and software are performing up to our standards. Circonus allows us to access all of our raw data for analysis. That raw data is invaluable in troubleshooting, post-mortem, and future planning activities.

Our on-call personnel receive alerts through OpsGenie when our monitoring detects a potential problem in our environment. They can then quickly assess the situation through the monitoring and AWS tools and take action to start remediation and repair. In less than 30 minutes, a team spanning multiple time zones can be on RingCentral and working towards a resolution.

Improvements to monitoring, data analysis, and remediation of issues are an ongoing effort to enhance our visibility into our environment and reduce the potential for a customer-impacting event. Automated remediation is the future of monitoring, and we are currently exploring AWS SNS and Lambda as potential tools to help us realize that vision.

Fault Tolerance

Hosting email in the cloud presents unique challenges. Email is inherently a push protocol, not a pull protocol. Subscribers who receive messages from us asked to receive them, but unlike users of a video streaming service or news site, they don’t request content and then receive it immediately. They subscribe, then continue to receive messages from that subscription until they unsubscribe. If something goes wrong and they don’t receive a message, they may not immediately realize there’s a problem. Combined with the need to ensure users only receive email they requested, and the related anti-spam and anti-phishing technologies, this means we must be constantly aware of the state of our delivery flows to a variety of service providers and the reputation of each of our IPs and customer domains with each of those providers. This is in addition to all the normal monitoring that our services are up, functional, and attempting to deliver email.

Fortunately, email was designed to be a robust protocol. While temporary delays and deferrals in delivery are not ideal, they are not the end of the world. And of course, we designed Momentum to deliver email, and that’s where it shines. Detailing everything Momentum does internally to handle fault tolerance and make sure email is delivered is well beyond the scope of this article. Our ops team knows we can rely on it as long as it’s running, patched, and properly configured.

Failover

Running in the cloud, we have to deal with additional abstractions on our network stack. We also can’t rely on the same kind of direct access to our external IP address bindings we would have in an on-premise installation. We rely heavily on AWS’ Elastic Load Balancer technology and a Momentum outgoing proxy module developed by our engineering team specifically to address these challenges.

We also use ELBs to manage load balancing and failover for our other tiers, both externally-facing and internal-only services. Keeping these services available to our customers hasn’t been a major problem for SparkPost, especially because we combine the auto scaling discussed previously with an operations staff that is always on the ball. We currently deliver billions of messages a month, and despite that volume, we have rarely had an outage or service degradation that resulted in the loss of even a single message.

Conclusion

Each of these topics has plenty more detail to it. The bottom line is that our ops team loves to deliver your email, and we’re excited to make that more reliable and cost-effective with each generation of new cloud technology. For more information on our technical operations stack, or other questions, check out our DevHub. Also feel free to hit us up on Slack or Twitter if you’d like to chat.

–Jeremy & Nick