A DNS Performance Incident
At SparkPost, we’re building an email delivery service with high performance, scalability, and reliability. We’ve made those qualities key design objectives, and they’re core to how we engineer and operate our service. In fact, we literally guarantee our service level and burst rates for our Enterprise service level customers.
However, we sometimes encounter technical limitations or operating conditions that have a negative impact on our performance. We recently experienced a challenging situation like this. On May 24, problems with our DNS infrastructure’s interaction with AWS’ network stack resulted in errors, delays, and slow system performance for some of our customers.
When events like this happen, we do everything we can to make things right. We also commit to our customers to be open and transparent about what happened and what we learn.
In this post, I’ll discuss what happened, and what we learned from that incident. But I’d like to begin by saying we accept responsibility for the problem and its impact on our customers.
We know our customers depend on reliable email delivery to support their business, security, and operational needs. We take it seriously when we don’t deliver the level of service our customers expect. I’m very sorry for that, as is our entire team.
Extreme DNS Usage on AWS Network Hits a Limit
Why did this slowdown happen? Our team quickly realized that routine DNS queries from our service were not being answered at a reasonable rate. We traced the issue to the DNS infrastructure we operate on the Amazon Web Services (AWS) platform. Initially we attempted to address query performance by increasing DNS server capacity by 500%, but that did not resolve the situation, and we continued to experience an unexplained and severe throttling. We then repointed DNS services for the vast majority of our customers at local nameservers in each AWS network segment, which were not experiencing performance issues. This is not the AWS-recommended long-term approach for our DNS volume, but we coordinated it with AWS as an interim measure that allowed us to restore service fully for all customers about five hours after the incident began.
I’ve written before about how critical DNS infrastructure is to email delivery, and the ways in which DNS issues can expose bugs or unexpected limits in cloud networking and hosting. In short, email makes extraordinarily heavy use of DNS, and SparkPost makes more use of DNS than nearly any other AWS customer. As such, DNS has an outsized impact on the overall performance of our service.
In this case, the root cause of the degraded DNS performance was another undocumented, practical limit in the AWS network stack. The limit was triggered under a specific set of circumstances, of which actual traffic load was only one component.
Limits like these are to be expected in any network infrastructure. But one area where the cloud provides unique challenges is troubleshooting them when the network stack is itself an abstraction, and the traffic interactions are much more of a shared responsibility than they would be in a traditional environment.
Diagnosing this problem during the incident was difficult for us and the AWS support team alike. It ultimately required several days of effort and assistance from the AWS engineering team after the fact to recreate the issue in a test environment and then identify the root cause.
Working with the AWS Team
Technology stacks aside, we know how much our customers benefit from the expertise of our technical and service teams who understand email at scale inside and out. We actually mean it when we say, “our technology makes it possible—our people make the difference.”
That’s also been true working with Amazon. The AWS team has been essential throughout the process of identifying and resolving the DNS performance problem that affected our service last week. SparkPost’s site reliability engineering team worked closely with our AWS counterparts until we clearly understood the root cause.
Here are some of the things we’ve learned about working together on this kind of problem solving:
- Your AWS technical account manager is your ally. Take advantage of your account team. They’re advocates and guides to navigate AWS internal resources and services. They can reach out to product managers and internal engineering resources within AWS. They can hunt down internal information not readily available in online docs. And they really understand how urgent issues like the one we encountered can be to business operations. If a support ticket or other issue is not getting the attention it deserves don’t hesitate to push harder.
- Educate AWS on your unique use cases. Ensure that the AWS account team—especially your TAM team and solution architect—are involved in as much of your daily workflow as possible. This way, they can learn your needs first hand and represent them inside of AWS. That’s a really important part of keeping the number of unexpected surprises to a minimum.
- Run systematic tests and generate data to help AWS troubleshoot. The Amazon team is going to investigate the situation on their end, and of course they have great tools and visibility at the platform layer to do that. But they can’t replicate your setup, especially when you’ve built highly specialized and complex services like ours. Running systematic tests and providing the data to the AWS team will provide them with invaluable information that can help to isolate an unknown problem to a particular element of the platform infrastructure. And they can monitor things on their end during these tests to gain additional insight into the issue.
- Make it easy for engineers on both teams to collaborate directly. Though your account team is critical, they also know when letting AWS’ engineers and your engineers work together directly will save time and improve communication. It’s to your advantage to make that as easy as possible. Inviting the AWS team into a shared Slack channel, for example, is a great way to work together in real-time—and to document the interactions to help further troubleshooting and reproduce context in the future. Make use of other collaboration tools such as Google docs for sharing findings and plans. Bring the AWS team onto your operations bridge line during incidents and use conference calls for regular check-ins following an incident.
- Understand that you’re in it together. AWS is a great technical stack for building cloud-native services. But one of the things we’ve come to appreciate about Amazon is how openly they work through hard problems when a specialized service like SparkPost pushes the AWS infrastructure into some edge cases. Their team has supported us in understanding root causes, developing solutions, and ultimately taking their learnings back to help AWS itself continue to evolve.
The AWS network and platform is a key part of SparkPost’s cloud architecture. We’ve developed some great knowledge about leveraging AWS from a technical perspective. We’ve also come to realize how important support from the AWS team can be when working to resolve issues in the infrastructure when they do arise.
In the coming weeks, we will write more in detail about the DNS architecture changes we are currently rolling out. They’re an important step towards increasing the resilience of our infrastructure.
Whether you’re building for the AWS network yourself, or a SparkPost customer who relies on our cloud infrastructure, I hope this explanation of what we’ve learned has been helpful. And of course, please reach out to me or any of the SparkPost team if you’d like to discuss last week’s incident.
VP Engineering and Cloud Operations
Configuration Management and Provisioning
At SparkPost our specific needs have changed over time, along with our understanding of what is the best tool for the job we do. Two areas in particular where this has evolved significantly for us include configuration management and provisioning. After some trial and error we eventually settled on CloudFormation and a mix of Puppet and Ansible. Hopefully what we learned from our experiences can help you select the right tools for your environment.
Puppet vs Ansible
A few years ago we were using Puppet for both OS and Application configuration management. Over time we discovered some challenges with this. First of all, Puppet runs as a scheduled update, not on demand. Secondly, Puppet has no way to test and then roll back if there is a problem. And finally, Puppet is not very accessible to the developers at SparkPost and not very suitable to their use cases – it’s an OPS tool.
We explored using Ansible for application level configuration. While Puppet maintains state, Ansible is good at transforming state. Ansible has fundamentally different behavior to Puppet. We can run Ansible on demand. We also like its flow control so we can test and rollback immediately. Most importantly, our application developers can maintain Ansible playbooks in concert with code and database changes. Overall it integrates very well with our Continuous Integration and Deployment pipeline.
Our next iteration was to gradually split the configuration management problem between the pieces managed by our Ops team with Puppet and pieces managed by the Development teams using Ansible. While this approach “worked”, we found it hard to understand what the actual running config would be, since Puppet would override some config stanzas managed by Ansible. These overlapping responsibilities between Puppet and Ansible were messy and error prone.
Puppet + Ansible
To resolve these problems, we decided to standardize by using Puppet only for the OS-level stuff. This includes system tuning, mounted volumes, local users, authentication, etc. In contrast, we now use Ansible exclusively for all application software deployment and configuration. This greatly simplifies things since we are now using each tool for the purpose it is intended and well suited. A key catalyst to this break-through was our organizational changes, specifically when we broke down the divisions between development and technical operations. Now there is a single Engineering organization at SparkPost, which has helped us overcome organizational silos that had contributed to a suboptimal approach.
Another area where we have evolved our use of tools is cloud provisioning. We first started in AWS by building out all of our provisioning scripts by hand. This worked well until we had to quickly scale out.
We needed better automation. We chose Terraform as the tool to provision resources in AWS since it is a vendor agnostic tool. Terraform uses a layer of abstraction which was initially very attractive but over time became problematic for our use case. First of all, Terraform keeps local state files that need to be distributed or available every time a change is made. Second, Terraform lacked native AWS support at the time for some important aspects of our infrastructure. This required extra code to implement workarounds outside of Terraform. Needless to say, this negated many of the promised benefits of using Terraform. Finally, we simply did not need the complexity and layers of abstraction that come along with Terraform.
We eventually went down the path of creating a Python tool suite that uses CloudFormation for AWS. By this time, CloudFormation was a very mature tool. We wanted to simplify our provisioning by using built in AWS CloudFormation features. We also started leveraging CloudFormation as the source of truth for our infrastructure. This approach results in significantly less code than we needed with Terraform. Additionally, we are able to handle provisioning failures with far less effort since we had removed the extra abstraction layers and workarounds.
Learning to Keep It Simple
After experimenting with various configuration management and provisioning tools and approaches, we learned the lesson of how important it is to not get too attached to any particular tool. We needed to be flexible and make changes in our tool selection and use as our understanding of the specific use cases evolved. Also planning too far ahead and picking tools you might need in the future without practical need in the near term can end up wasting a lot of time. We also experienced first hand the pitfalls of letting organizational structure drive selection and use of tools. These lessons learned helped us come up with a simpler, more modular approach that takes advantage of the technologies most aligned with our specific needs.
You can learn more about our DevOps journey or if you have any questions about cloud provisioning and configuration management in AWS, don’t hesitate to connect on Twitter. Also our SRE team is always hiring.
VP Engineering and Cloud Operations
Many thanks to John Peacock and Leonard Lawton on our SRE team for their input to this blog post.
SparkPost today is synonymous with the concept of a cloud MTA. But you might not know how deep our expertise with MTAs runs. For more than a decade, the SparkPost team has been building the technology that powers some of the most demanding deployments of enterprise MTAs in the world. In fact, more than 25% of the world’s non-spam mail is sent using our MTAs every day.
Those are impressive figures to be sure. So when we say we’re proud that SparkPost has become the world’s fastest-growing email delivery service, we know that one reason for the trust given to us is the credibility that comes from having installations of our Momentum and PowerMTA software deployed in the data centers of the largest Email Service Providers (ESPs) and other high-volume senders such as LinkedIn and Twitter.
As CTO of SparkPost, my team and I also have faced the sizable challenge—albeit a rewarding one—of migrating complex, highly optimized software like MTAs to a modern cloud architecture. Our team’s experience developing and managing high-performance email infrastructure has been a major part of why SparkPost has been successful with that transformation, but so too has been our vision of what a “cloud native” service really entails.
A few years ago, our team and many of our customers recognized that the cloud promised the ability to deliver the performance of our best-in-class messaging with dramatically improved economics and business flexibility. We understood that not only would it be more cost-effective for our customers to get started, but that it also would reduce the ongoing burden on their resources in areas like server maintenance, software maintenance, and deliverability analysis and resolution.
To get there, we knew we needed to do it the right way. Standing up servers in a data center wasn’t an option—because traditional data center models would limit our scalability, reliability, and operational flexibility in all the same ways our customers were trying to avoid!
That’s a big part of why we selected Amazon Web Services (AWS) to provide SparkPost’s underlying infrastructure. Platforms such as AWS, Microsoft Azure, Heroku, and others have many great qualities, but building a cloud-native messaging solution is conceptually a lot more than taking an MTA and installing it on a virtual machine in the sky.
There are times when architecting for the cloud necessarily embodies contradictory requirements. Just consider these architectural challenges of bringing something like an MTA into the cloud, for example:
- Scaling Stateful Systems in the Cloud. One of the primary lures of deploying within a cloud provider is the ability to take advantage of push-button server deployments and auto-scaling. For the majority of AWS customers this is very straightforward; most of them deploy web-based applications of some form, following well established patterns for creating a stateful application using stateless web servers. A mail server, however, is inherently stateful; it implements a store-and-forward messaging protocol delivering to tens of thousands of unique endpoints. In practice some messages may need to be queued for extended periods of time (minutes/hours/days) during normal operation. Thus, like a database, it is significantly harder to handle scaling in the cloud, since typical load-driven scale-up/scale-down logic can’t be applied.
- Limitless Limitations. Cloud infrastructure like AWS doesn’t magically change the laws of physics—even if it does make them a lot easier to manage. Still, every service has a limit, whether published or not. These limits not only affect what instance types you deploy on, but how you have to architect your solution to ensure that it scales in every direction. From published limits on how many IPs per instance you can allocate for sending, to unpublished DNS limitations, every AWS limit needs to be reviewed and planned for (and you have to be ready for the unexpected through monitoring and fault-tolerant architecture).
- IP Reputation Management. A further complication both in general cloud email deployments, but especially in auto-scaling, is managing the dynamic allocation of sending resources without having to warm up new IPs. You need the ability to dynamically coordinate message routing across all your MTAs and to decouple the MTA processing a message from the IP assignment/management logic.
- It Takes a Village. Moving to the cloud is not just a technology hurdle—it took the right people to make sure our customers were successful. We had to bring in expertise in engineering, security, operations, deliverability, and customer care to ensure the success of our customers in a scalable cloud-driven environment.
As I noted earlier, building and deploying a true cloud MTA is a lot more complex than putting our software up on a virtual server. But the end results show why services like SparkPost are so important to how businesses consume technology today.
The cloud can make even the most complex systems feel deceptively simple—which allows the technical and business benefits to be front and center. But if you’re a software engineer or architect building for the cloud, you understand how important solving these complex needs really are to achieve that.
So, if you’re building services like ours, I’m interested in hearing about your experiences and what you’ve run into as you’ve developed for the cloud. Ping me on Twitter, or leave a comment below.
How We Tracked Down Unusual DNS Failures in AWS
We’ve built SparkPost around the idea that a cloud service like ours needs to be cloud-native itself. That’s not just posturing. It’s our cloud architecture that underpins the scalability, elasticity, and reliability that are core aspects of the SparkPost service. Those qualities are major reasons we’ve built our infrastructure atop Amazon Web Services (AWS)—and it’s why we can offer our customers service level and burst rate guarantees unmatched by anyone else in the business.
But we don’t pretend that we’re never challenged by unexpected bugs or limits of available technology. We ran into something like this last Friday, and that incident led to intermittent slowness in our service and delivery delays for some of our customers.
First let me say, the issue was resolved that same day. Moreover, no email or related data was lost. However, if delivery of your emails was slowed because of this issue, please accept my apology (in fact, an apology from our entire team). We know you count on us, and it’s frustrating when we’re not performing at the level you expect.
Some companies are tempted to brush issues like a service degradation under the rug and hope no one notices. You may have experienced that with services you’ve used in the past. I know I have. But that’s not how we like to do business.
I wanted to write about this incident for another reason as well: we learned something really interesting and valuable about our AWS cloud architecture. Teams building other cloud services might be interested in learning about it.
We ran into undocumented practical limits of the EC2 instances we were using for our primary DNS cluster. Sizing cloud instances based on traditional specs (processor, memory, etc.) usually works just as you’d expect, but sometimes that traditional hardware model doesn’t apply. That’s especially true in atypical use cases where aggregate limits can come into play—and there are times you run headlong into those scenarios without warning.
We hit such a limit on Friday when our DNS query volume created a network usage pattern for which our instance type wasn’t prepared. However, because that limit wasn’t obvious from the docs or standard metrics available, we didn’t know we’d hit it. What we observed was a very high rate of DNS failures, which in turn led to intermittent delays at different points in our architecture.
Digging Deeper into DNS
Why is our DNS usage special? Well, it has a lot to do with the way email works, compared to the content model for which AWS was originally designed. Web-based content delivery makes heavy use of what might be considered classic inbound “pull” scenarios: a client requests data, be it HTML, video streams, or anything else, from the cloud. But the use cases for messaging service providers like SparkPost are exceptions to the usual AWS scenario. In our case, we do a lot of outbound pushing of traffic: specifically, email (and other message types like SMS or mobile push notifications). And that push-style traffic relies heavily on DNS.
If you’re familiar with DNS, you may know that it’s generally fairly lightweight data. To request a given HTML page, you first have to ask where that page can be found on the Internet, but that request is a fraction of the size of the content you retrieve.
Email, however, makes exceptionally heavy use of DNS to look up delivery domains—for example, SparkPost sends many billions of emails to over 1 million unique domains every month. For every email we deliver, we have to make a minimum of two DNS lookups, and the use of DNS “txt” records for anti-phishing technologies like SPF and DKIM means DNS also is required to receive mail. Add to that our more traditional use of AWS API services for our apps, and it’s hard to exaggerate how important DNS is to our infrastructure.
All of this means we ran into an unusual condition in which our growing volume of outbound messages created a DNS traffic volume that hit an aggregate network throughput limit on instance types that otherwise seemed to have sufficient resources to service that load. And as denial-of-service attacks on the Dyn DNS infrastructure last year demonstrated, when DNS breaks, everything breaks. (That’s something anyone who builds systems that rely on DNS already knows painfully well.)
The sudden DNS issues triggered a response by our operations and reliability engineering teams to identify the problem. They teamed with our partners at Amazon to escalate on the AWS operations side. Working together, we identified the cause and a solution. We deployed a cluster of larger capacity nameservers with a greater focus on network capacity that could fulfill our DNS needs without running into the redlines for throughput. Fortunately, because all this was within AWS, we could spin up the new instances and even resize existing instances very quickly. DNS resumed normal behavior, lookup failures ceased, and we (and the outbound message delivery) were back on track.
To mitigate against this specific issue in the future, we’re also making DNS architecture changes to better insulate our core components from the impact of encounters with similar, unexpected thresholds. We’re also working with the Amazon team to determine appropriate monitoring models that will give us adequate warning to head off a similar incident before it affects any of our customers.
AWS and the Cloud’s Silver Lining
I don’t want to sugarcoat the impact of this incident on our customers. But our ability to identify the underlying issue as an unexpected interaction of our use case with the AWS infrastructure—and then find a resolution to it in very short order—has a lot to do with how we built SparkPost, and our great relationship with the Amazon team.
SparkPost’s superb operations corps, our Site Reliability Engineering (SRE) team, and our principal technical architects work with Amazon every day. The strengths of AWS’ infrastructure has given us a real leg up optimizing SparkPost’s architecture for the cloud. Working so closely with AWS over the past two years also has taught us a lot about spinning up AWS infrastructure and running quickly, and we also have the benefit of deep support from the AWS team.
If we had to work around a similar limitation in a traditional data center model, something like this could take days or even weeks to fully resolve. That agility and responsiveness are just two of the reasons we’ve staked our business on the cloud and AWS. Together, the kind of cloud expertise our companies share is hard to come by. Amazon has been a great business partner to us, and we’re really proud of what we’ve done with the AWS stack.
SparkPost is the first email delivery service that was built for the cloud from the start. We send more email from a true cloud platform than anyone, and sometimes that means entering uncharted territory. It’s a fundamental truth of computer science that you don’t know what challenges occur at scale until you hit them. We found one on AWS, but our rapid response is a great example of the flexibility the cloud makes possible. It’s also our commitment to our customers.
Whether you’re building your own infrastructure on AWS, or a SparkPost customer who takes advantage of ours, I hope this explanation of what happened last Friday, and how we resolved it, has been useful.
VP Engineering and Cloud Operations
SparkPost has been in the email infrastructure game for 15 years, better known as Message Systems and Port25. Over that time, our software—the MTA that powers our cloud service—has run in the data centers of the world’s top Email Service Providers (ESP), as well as other high-volume senders such as LinkedIn and Twitter.
Why We Chose AWS
When planning the launch of our email API and SMTP service in 2014, we decided not to build out our own data center and instead to build on top of Amazon Web Services (AWS). This choice has given us a lot of flexibility and enabled us to grow our service rapidly. Since we’re not splitting our focus between our core business and the details of operating a data center, we can keep our technical staff focused on building new features and improving the overall customer experience. In early January 2014 we were in beta and handling a few million messages a month. Eighteen months later we are handling tens of billions of messages a month for not just the tens of thousands of active users on our SparkPost cloud offering, but also for very large SparkPost customers including Pinterest, Zillow, and Careerbuilder.
Our #1 goal is to be the leader in cloud email infrastructure. We have no desire to be experts in data centers, much the same way our customers do not want to be experts in running email infrastructure at scale. When building out a data center, by definition, you can only solve the problems you’re facing at that point in time. It can be very difficult to make the adjustments you need as your business and technical needs change rapidly over subsequent years.
In Good Company
The trend in the last few years has been increasingly for companies to leverage public cloud infrastructure instead of building out and maintaining data centers. As told in the Wall Street Journal, in 2011 Zynga spent $100M to move off of AWS and build their own data center. However, by mid 2015 they reversed course and decided to move back to AWS as part of a $100M cost cutting effort. “There’s a lot of places that are not strategic for us to have scale and we think not appropriate, like running our own data centers,” Zynga CEO Mark Pincus told investors on a conference call. “We’re going to let Amazon do that.”
When looking at the costs between using public cloud infrastructure and building out your own it is hard to get a true apples to apples comparison. It can be easy to calculate the hard costs of each but the real challenge is calculating all of the soft costs. Running your own data center requires a large number of highly skilled staff responsible for hardware and networking, redundancy and disaster recovery, data center security, database administration, and hardware lifecycle management. With a public cloud infrastructure these things are taken care of behind the scenes, allowing your engineering team to focus on building out your service.
We let the thousands of experts at AWS manage the infrastructure in a much more reliable, secure, and cost effective way than we would. For example, we use Amazon’s Availability Zones to easily distribute our servers across their fiber linked data centers which provides huge high availability gains that would be hard and very expensive to accomplish on our own. And the level of proactive security measures employed by Amazon’s team is a tremendous value that helps us sleep better at night.
The Real Advantages
However, a server to server cost comparison is missing the point of using the public cloud. The real advantages are not in immediate hardware cost savings. When using the cloud, you only pay for what you use. The flip side is you can get what you need as soon as you need it. We have found this to be very useful in making our capacity planning so much simpler. As we need more capacity we add it just in time. No purchase orders required, no waiting for servers to be shipped and then racked. We can use the AWS console or APIs to spin up new servers or databases in minutes, including automatically using Amazon’s autoscaling capabilities.
This increased speed in provisioning more computing resources was critical to us when Mandrill announced that they were moving away from the transactional email space and recommended SparkPost as an alternative. We had to add capacity to our platform several times over the course of 2016 to keep up with growing demand.
Netflix provides some good insights into their multi-year transition to AWS they completed earlier this year. As Yury Izrailevsky, VP Cloud and Platform Engineering for Netflix notes, they realized significant improvements in service availability along with more efficiency due to the economies of scale and improved utilization rates available in the on-demand public cloud. We’ve seen similar service availability benefits from the underlying stability and high availability features of AWS and the improved utilization efficiencies.
Focus on Strengths
We can reduce our costs even further by purchasing Reserved Instances (RI) on one and three year terms. For capacity we know we will need on an ongoing basis, this significantly reduces our baseline costs. It also maintains the flexibility of on-demand pricing for when we need to spin up a new virtual server to handle peak workloads.
Shifting focus from data centers to building the leading cloud email service is one of the best decisions we’ve made. It greatly reduced our time to market and continues to help fuel our ongoing agility in meeting customer needs. We have found AWS to be a great partner with fantastic technology and services. I encourage you to read more about how we use AWS in our blog post about our Technical Operations stack. If you have any questions about our experience at AWS please reach out on Twitter or our community Slack channel.
VP Engineering, SparkPost