A DNS Performance Incident
At SparkPost, we’re building an email delivery service with high performance, scalability, and reliability. We’ve made those qualities key design objectives, and they’re core to how we engineer and operate our service. In fact, we literally guarantee our service level and burst rates for our Enterprise service level customers.
However, we sometimes encounter technical limitations or operating conditions that have a negative impact on our performance. We recently experienced a challenging situation like this. On May 24, problems with our DNS infrastructure’s interaction with AWS’ network stack resulted in errors, delays, and slow system performance for some of our customers.
When events like this happen, we do everything we can to make things right. We also commit to our customers to be open and transparent about what happened and what we learn.
In this post, I’ll discuss what happened, and what we learned from that incident. But I’d like to begin by saying we accept responsibility for the problem and its impact on our customers.
We know our customers depend on reliable email delivery to support their business, security, and operational needs. We take it seriously when we don’t deliver the level of service our customers expect. I’m very sorry for that, as is our entire team.
Extreme DNS Usage on AWS Network Hits a Limit
Why did this slowdown happen? Our team quickly realized that routine DNS queries from our service were not being answered at a reasonable rate. We traced the issue to the DNS infrastructure we operate on the Amazon Web Services (AWS) platform. Initially we attempted to address query performance by increasing DNS server capacity by 500%, but that did not resolve the situation, and we continued to experience an unexplained and severe throttling. We then repointed DNS services for the vast majority of our customers at local nameservers in each AWS network segment, which were not experiencing performance issues. This is not the AWS-recommended long-term approach for our DNS volume, but we coordinated it with AWS as an interim measure that allowed us to restore service fully for all customers about five hours after the incident began.
I’ve written before about how critical DNS infrastructure is to email delivery, and the ways in which DNS issues can expose bugs or unexpected limits in cloud networking and hosting. In short, email makes extraordinarily heavy use of DNS, and SparkPost makes more use of DNS than nearly any other AWS customer. As such, DNS has an outsized impact on the overall performance of our service.
In this case, the root cause of the degraded DNS performance was another undocumented, practical limit in the AWS network stack. The limit was triggered under a specific set of circumstances, of which actual traffic load was only one component.
Limits like these are to be expected in any network infrastructure. But one area where the cloud provides unique challenges is troubleshooting them when the network stack is itself an abstraction, and the traffic interactions are much more of a shared responsibility than they would be in a traditional environment.
Diagnosing this problem during the incident was difficult for us and the AWS support team alike. It ultimately required several days of effort and assistance from the AWS engineering team after the fact to recreate the issue in a test environment and then identify the root cause.
Working with the AWS Team
Technology stacks aside, we know how much our customers benefit from the expertise of our technical and service teams who understand email at scale inside and out. We actually mean it when we say, “our technology makes it possible—our people make the difference.”
That’s also been true working with Amazon. The AWS team has been essential throughout the process of identifying and resolving the DNS performance problem that affected our service last week. SparkPost’s site reliability engineering team worked closely with our AWS counterparts until we clearly understood the root cause.
Here are some of the things we’ve learned about working together on this kind of problem solving:
- Your AWS technical account manager is your ally. Take advantage of your account team. They’re advocates and guides who can help you navigate AWS internal resources and services. They can reach out to product managers and internal engineering resources within AWS. They can hunt down internal information not readily available in online docs. And they really understand how urgent issues like the one we encountered can be to business operations. If a support ticket or other issue is not getting the attention it deserves, don’t hesitate to push harder.
- Educate AWS on your unique use cases. Ensure that the AWS account team, especially your TAM and solution architect, is involved in as much of your daily workflow as possible. This way, they can learn your needs firsthand and represent them inside of AWS. That’s a really important part of keeping surprises to a minimum.
- Run systematic tests and generate data to help AWS troubleshoot. The Amazon team is going to investigate the situation on their end, and of course they have great tools and visibility at the platform layer to do that. But they can’t replicate your setup, especially when you’ve built highly specialized and complex services like ours. Running systematic tests and providing the data to the AWS team will provide them with invaluable information that can help to isolate an unknown problem to a particular element of the platform infrastructure. And they can monitor things on their end during these tests to gain additional insight into the issue.
- Make it easy for engineers on both teams to collaborate directly. Though your account team is critical, they also know when letting AWS’ engineers and your engineers work together directly will save time and improve communication. It’s to your advantage to make that as easy as possible. Inviting the AWS team into a shared Slack channel, for example, is a great way to work together in real-time—and to document the interactions to help further troubleshooting and reproduce context in the future. Make use of other collaboration tools such as Google docs for sharing findings and plans. Bring the AWS team onto your operations bridge line during incidents and use conference calls for regular check-ins following an incident.
- Understand that you’re in it together. AWS is a great technical stack for building cloud-native services. But one of the things we’ve come to appreciate about Amazon is how openly they work through hard problems when a specialized service like SparkPost pushes the AWS infrastructure into some edge cases. Their team has supported us in understanding root causes, developing solutions, and ultimately taking their learnings back to help AWS itself continue to evolve.
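The systematic tests mentioned above can start out very simply: resolve a fixed list of hostnames on a schedule and record failures and latency. Here is a minimal Python sketch of that idea. It is illustrative only and is not SparkPost's actual tooling. The names `ProbeResult`, `run_probe`, and `summarize` are hypothetical, and the default resolver goes through the operating system via `socket.getaddrinfo`, so it measures whichever nameservers the host is configured to use.

```python
import socket
import statistics
import time
from dataclasses import dataclass
from typing import Callable, Iterable, List


@dataclass
class ProbeResult:
    hostname: str
    ok: bool
    latency_ms: float


def system_resolve(hostname: str) -> None:
    """Resolve through the OS resolver; raises socket.gaierror (an OSError) on failure."""
    socket.getaddrinfo(hostname, None)


def run_probe(hostnames: Iterable[str],
              resolve: Callable[[str], None] = system_resolve) -> List[ProbeResult]:
    """Resolve each hostname once, recording success and wall-clock latency."""
    results = []
    for name in hostnames:
        start = time.perf_counter()
        try:
            resolve(name)
            ok = True
        except OSError:
            ok = False
        elapsed_ms = (time.perf_counter() - start) * 1000.0
        results.append(ProbeResult(name, ok, elapsed_ms))
    return results


def summarize(results: List[ProbeResult]) -> dict:
    """Reduce one probe run to the numbers worth sharing with a support team."""
    total = len(results)
    failures = sum(1 for r in results if not r.ok)
    latencies = [r.latency_ms for r in results if r.ok]
    return {
        "total": total,
        "failures": failures,
        "failure_rate": failures / total if total else 0.0,
        "median_latency_ms": statistics.median(latencies) if latencies else None,
    }
```

Calling `summarize(run_probe(list_of_hostnames))` from several instances at fixed intervals, and sharing the timestamped summaries, gives both teams a common data set to line up against platform-side metrics. The `resolve` parameter is injectable so the same harness can be pointed at a different resolver implementation or exercised with a stub.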
The AWS network and platform is a key part of SparkPost’s cloud architecture. We’ve developed some great knowledge about leveraging AWS from a technical perspective. We’ve also come to realize how important support from the AWS team can be when working to resolve issues in the infrastructure when they do arise.
In the coming weeks, we will write in more detail about the DNS architecture changes we are currently rolling out. They’re an important step towards increasing the resilience of our infrastructure.
Whether you’re building for the AWS network yourself, or a SparkPost customer who relies on our cloud infrastructure, I hope this explanation of what we’ve learned has been helpful. And of course, please reach out to me or any of the SparkPost team if you’d like to discuss last week’s incident.
VP Engineering and Cloud Operations
How We Tracked Down Unusual DNS Failures in AWS
We’ve built SparkPost around the idea that a cloud service like ours needs to be cloud-native itself. That’s not just posturing. It’s our cloud architecture that underpins the scalability, elasticity, and reliability that are core aspects of the SparkPost service. Those qualities are major reasons we’ve built our infrastructure atop Amazon Web Services (AWS)—and it’s why we can offer our customers service level and burst rate guarantees unmatched by anyone else in the business.
But we don’t pretend that we’re never challenged by unexpected bugs or limits of available technology. We ran into something like this last Friday, and that incident led to intermittent slowness in our service and delivery delays for some of our customers.
First let me say, the issue was resolved that same day. Moreover, no email or related data was lost. However, if delivery of your emails was slowed because of this issue, please accept my apology (in fact, an apology from our entire team). We know you count on us, and it’s frustrating when we’re not performing at the level you expect.
Some companies are tempted to brush issues like a service degradation under the rug and hope no one notices. You may have experienced that with services you’ve used in the past. I know I have. But that’s not how we like to do business.
I wanted to write about this incident for another reason as well: we learned something really interesting and valuable about our AWS cloud architecture. Teams building other cloud services might be interested in learning about it.
We ran into undocumented practical limits of the EC2 instances we were using for our primary DNS cluster. Sizing cloud instances based on traditional specs (processor, memory, etc.) usually works just as you’d expect, but sometimes that traditional hardware model doesn’t apply. That’s especially true in atypical use cases where aggregate limits can come into play—and there are times you run headlong into those scenarios without warning.
We hit such a limit on Friday when our DNS query volume created a network usage pattern for which our instance type wasn’t prepared. However, because that limit wasn’t obvious from the docs or standard metrics available, we didn’t know we’d hit it. What we observed was a very high rate of DNS failures, which in turn led to intermittent delays at different points in our architecture.
Digging Deeper into DNS
Why is our DNS usage special? Well, it has a lot to do with the way email works, compared to the content model for which AWS was originally designed. Web-based content delivery makes heavy use of what might be considered classic inbound “pull” scenarios: a client requests data, be it HTML, video streams, or anything else, from the cloud. But the use cases for messaging service providers like SparkPost are exceptions to the usual AWS scenario. In our case, we do a lot of outbound pushing of traffic: specifically, email (and other message types like SMS or mobile push notifications). And that push-style traffic relies heavily on DNS.
If you’re familiar with DNS, you may know that it’s generally fairly lightweight data. To request a given HTML page, you first have to ask where that page can be found on the Internet, but that request is a fraction of the size of the content you retrieve.
Email, however, makes exceptionally heavy use of DNS to look up delivery domains. For example, SparkPost sends many billions of emails to over 1 million unique domains every month. For every email we deliver, we have to make a minimum of two DNS lookups, and the use of DNS TXT records for anti-phishing technologies like SPF and DKIM means DNS is also required to receive mail. Add to that our more traditional use of AWS API services for our apps, and it’s hard to exaggerate how important DNS is to our infrastructure.
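To get a feel for the scale, here is a back-of-the-envelope calculation in Python. The two-lookups-per-email minimum comes from the paragraph above. The monthly volume of 10 billion is purely a hypothetical stand-in for "many billions", and the function name is made up for this sketch. It also ignores caching, retries, and inbound lookups, so treat the result as a rough order of magnitude rather than a measurement.

```python
SECONDS_PER_MONTH = 30 * 24 * 60 * 60  # 2,592,000, assuming a 30-day month


def average_lookups_per_second(emails_per_month: int, lookups_per_email: int = 2) -> float:
    """Average DNS lookup rate implied by a monthly send volume."""
    return emails_per_month * lookups_per_email / SECONDS_PER_MONTH


# Hypothetical volume: 10 billion emails per month (an assumption, not a published figure).
rate = average_lookups_per_second(10_000_000_000)
print(f"{rate:,.0f} lookups per second on average")  # about 7,716
```

Real traffic is bursty rather than evenly spread, so peak rates would sit well above any monthly average like this one.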
All of this means we ran into an unusual condition in which our growing volume of outbound messages created a DNS traffic volume that hit an aggregate network throughput limit on instance types that otherwise seemed to have sufficient resources to service that load. And as denial-of-service attacks on the Dyn DNS infrastructure last year demonstrated, when DNS breaks, everything breaks. (That’s something anyone who builds systems that rely on DNS already knows painfully well.)
The sudden DNS issues triggered a response by our operations and reliability engineering teams to identify the problem. They teamed with our partners at Amazon to escalate on the AWS operations side. Working together, we identified the cause and a solution. We deployed a cluster of larger capacity nameservers with a greater focus on network capacity that could fulfill our DNS needs without running into the redlines for throughput. Fortunately, because all this was within AWS, we could spin up the new instances and even resize existing instances very quickly. DNS resumed normal behavior, lookup failures ceased, and we (and the outbound message delivery) were back on track.
To guard against this specific issue in the future, we’re making DNS architecture changes to better insulate our core components from the impact of similar, unexpected thresholds. We’re also working with the Amazon team to determine appropriate monitoring models that will give us adequate warning to head off a similar incident before it affects any of our customers.
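One simple monitoring model of this kind tracks the DNS failure rate over a sliding window of recent lookups and raises a flag when it crosses a threshold. The Python sketch below is illustrative only. The class name, the window size, and the 5% threshold are all assumptions, not the monitoring that SparkPost and AWS settled on.

```python
from collections import deque


class FailureRateMonitor:
    """Tracks the failure rate over the most recent `window_size` lookups."""

    def __init__(self, window_size: int = 1000, threshold: float = 0.05):
        # deque with maxlen drops the oldest outcome automatically when full.
        self.outcomes = deque(maxlen=window_size)
        self.threshold = threshold

    def record(self, ok: bool) -> None:
        """Record the outcome of one lookup (True for success, False for failure)."""
        self.outcomes.append(ok)

    def failure_rate(self) -> float:
        if not self.outcomes:
            return 0.0
        return self.outcomes.count(False) / len(self.outcomes)

    def should_alert(self) -> bool:
        # Only alert once the window is full, to avoid noise from a handful of samples.
        window_full = len(self.outcomes) == self.outcomes.maxlen
        return window_full and self.failure_rate() > self.threshold
```

A failure-rate signal like this watches the symptom that was visible during the incident (failed lookups) rather than the underlying limit, which matters when the limit itself does not show up in standard metrics.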
AWS and the Cloud’s Silver Lining
I don’t want to sugarcoat the impact of this incident on our customers. But our ability to identify the underlying issue as an unexpected interaction of our use case with the AWS infrastructure—and then find a resolution to it in very short order—has a lot to do with how we built SparkPost, and our great relationship with the Amazon team.
SparkPost’s superb operations corps, our Site Reliability Engineering (SRE) team, and our principal technical architects work with Amazon every day. The strengths of AWS’ infrastructure have given us a real leg up in optimizing SparkPost’s architecture for the cloud. Working so closely with AWS over the past two years has also taught us a lot about getting AWS infrastructure up and running quickly, and we have the benefit of deep support from the AWS team.
If we had to work around a similar limitation in a traditional data center model, something like this could take days or even weeks to fully resolve. That agility and responsiveness are just two of the reasons we’ve staked our business on the cloud and AWS. Together, the kind of cloud expertise our companies share is hard to come by. Amazon has been a great business partner to us, and we’re really proud of what we’ve done with the AWS stack.
SparkPost is the first email delivery service that was built for the cloud from the start. We send more email from a true cloud platform than anyone, and sometimes that means entering uncharted territory. It’s a fundamental truth of computer science that you don’t know what challenges occur at scale until you hit them. We found one on AWS, but our rapid response is a great example of the flexibility the cloud makes possible. It’s also our commitment to our customers.
Whether you’re building your own infrastructure on AWS, or a SparkPost customer who takes advantage of ours, I hope this explanation of what happened last Friday, and how we resolved it, has been useful.
VP Engineering and Cloud Operations
What It’s Like to Scale a Fast-Growing Cloud Service
Building a scalable email infrastructure takes a lot of thought, testing, and planning. For many businesses, the thought of scaling from thousands of emails a day to hundreds of millions per day is daunting. Rather than building their own data center and email infrastructure, many companies are turning to providers who offer email as a service.
Our VP of Engineering, Chris McFadden, recently sat down with Software Engineering Daily podcast host, Jeffrey Meyerson, to talk about what it’s like to architect and scale an email service that grows with its customer base.
Building a Scalable Email Infrastructure
In this podcast, Chris talks about the layers of infrastructure that a message passes through before it makes it into the recipient’s inbox. He also goes into depth on how SparkPost’s infrastructure is set up, why we’re built on AWS, and the importance of micro-services in our deployment process.
Meyerson delves into what it’s like to be a developer using email as a service, and which conditions and configuration points in the API a developer would want to interact with. Hear firsthand how we made the decision to move to the cloud and the thought process behind developing our API. You’ll also gain insights into how we handle burst rates and the tooling around our compliance and deliverability.
Whether you’re a developer, responsible for scalable email infrastructure at your company, or interested in what’s in other people’s tech stacks, you’ll want to listen to this podcast.
Listen now: Software Engineering Daily Podcast
Inbox Marketer is one of Canada’s leading email marketing agencies, serving Fortune 500 and SMB clients from its headquarters in Toronto. We sat down recently with Inbox Marketer’s Chief Privacy Officer and Deliverability Manager Matthew Vernhout to discuss what kinds of challenges his team faces as a fast-growing services provider in the highly competitive digital marketing space, and how Message Systems is helping them maintain vigorous growth.
Among its key business challenges: scaling email volumes to keep up with the growth of its customer base. Inbox Marketer’s rapid growth required a scalable infrastructure, but its incumbent MTA couldn’t scale without significant investments in hardware and code development. Momentum from Message Systems helped solve that challenge in short order.
Read the whole case study.
When you’re a wildly successful business – especially an online business – wildly enormous messaging loads come with the territory. To handle the kind of digital messaging traffic that fast-growing businesses need to cope with (we’re talking millions per day and beyond) you need industrial-strength messaging infrastructure that can scale on demand. The problem is, most small businesses start off with the least expensive messaging infrastructure they can find, which usually means open source MTA technology.
Avoid the Open Source Trap
To meet increasing volumes with open source, you need to incrementally add hardware and staff, which gets expensive fast. Worse still, open source provides poor visibility into your messaging operations. There’s no way to capture data for triggered messaging, and no easy way to parse email disposition data to manage your sender reputation with ISPs. This lack of capabilities prevents you from using business rules and logic to create the kinds of email operations that drive engagement in today’s communications environment.
Messaging Infrastructure that Scales On Demand
Adopting an on-premises messaging solution like Momentum from Message Systems is not only a far more cost-effective approach, but it also gives you the power, scalability and flexibility to grow your business into a fully realized global brand. It’s no coincidence that top cloud, social media and e-business leaders like Facebook, Match.com, LinkedIn, PayPal, Groupon and so many other online innovators are Message Systems customers. Companies that adopt Message Systems right from the point of launch, or very early in their development, gain enormous advantages through superior scalability, deliverability and overall engagement rates. Here are some of them:
Meet Growing Volume Requirements
Carrier-grade performance means you can easily handle higher volumes of email, mobile and social messages – millions per hour on each server – as your business grows and demand increases.
Increase Your Agility
Scriptable policy engine and API toolkit make it easy to create fresh new message-based offerings, like triggered notifications, that drive traffic and increase engagement.
Make Scalability a Business Advantage
With the ability to scale and handle fast-growing message volumes, messaging becomes a business advantage you can use to outpace competitors and create barriers to entry for potential adversaries.
If you’re interested in learning more about why the ability to scale on demand is so important to your business, have a look at our free eBook, The High Cost of Free Messaging Software. Learn how open source MTAs can be hurting, rather than helping, your business.
Read the original TechTarget article here.
A company’s road to success is often a bumpy one. Surviving as a small scrappy start-up and moving on to the big leagues involves a series of choices and investments, ones that could either take you to the top… or hamstring your growth over the long term. With limited resources at your disposal, the decision to buy, or perhaps more importantly, not to buy, can either give you nightmares, or take your company’s growth to its next stage. From a digital messaging perspective, being far-sighted enough to invest in a messaging platform that can scale with your business can save you a lot of future heartache when you realize that your open source or cheap MTA is crippling your growth.
One such company that made the bold decision to invest in their future was Infusionsoft, an all-in-one online solution for contact management, CRM, marketing automation and e-commerce. Infusionsoft recognized that providing their customers with the best messaging platform available to deliver targeted mail was crucial to their continued success. And in 2008, they placed their bets on Message Systems’ Momentum.
James Thompson, messaging operations director for Infusionsoft, made the call based on three critical criteria: redundancy, scalability and uptime. In times of crisis, James knew he needed to be able to depend on a system that would not go down, and cost Infusionsoft its hard-won reputation and customers.
Infusionsoft originally ran Postfix, an open source MTA, and Thompson had short-listed Port25 and Message Systems as replacements. Ultimately, he went with Message Systems because of its user-friendly software, the value it places on support, and its advanced analytics data.
With the switch, Momentum began to pay for itself immediately:
“What we actually found is that once we implemented Message Systems primarily for their email delivery technology, we were able to reduce those seven machines that were running Postfix down to basically two email servers.”
“Before we were pushing about 10 million with seven servers, we pushed over 98 million in the same time period of about a week, on those exact same two servers. So the scalability of the system has been awesome for us.”
For anyone looking at investing in the future, Thompson advises them to think of the big picture.
“If redundancy and uptime are a big issue for anybody else looking for this service or anything similar, enterprise support is definitely a must in this world.”
And with Message Systems, an enterprise-level messaging platform and enterprise-level support are definitely what you will get.
With Message Systems in place to safeguard their reputation since 2008, Infusionsoft has been able to grow from strength to strength and establish themselves as a recognized leader in the industry – and we look forward to many more years of successful partnership with them.
Read the original TechTarget article here.
Interested in finding out more about why Infusionsoft chose Message Systems over cheaper alternatives like Port25? Read the case study for more details on the value that they saw, and got, with Momentum.