A DNS Performance Incident
At SparkPost, we’re building an email delivery service with high performance, scalability, and reliability. We’ve made those qualities key design objectives, and they’re core to how we engineer and operate our service. In fact, we literally guarantee our service level and burst rates for our Enterprise service level customers.
However, we sometimes encounter technical limitations or operating conditions that have a negative impact on our performance. We recently experienced a challenging situation like this. On May 24, problems with our DNS infrastructure’s interaction with AWS’ network stack resulted in errors, delays, and slow system performance for some of our customers.
When events like this happen, we do everything we can to make things right. We also commit to our customers to be open and transparent about what happened and what we learn.
In this post, I’ll discuss what happened, and what we learned from that incident. But I’d like to begin by saying we accept responsibility for the problem and its impact on our customers.
We know our customers depend on reliable email delivery to support their business, security, and operational needs. We take it seriously when we don’t deliver the level of service our customers expect. I’m very sorry for that, as is our entire team.
Extreme DNS Usage on AWS Network Hits a Limit
Why did this slowdown happen? Our team quickly realized that routine DNS queries from our service were not being answered at a reasonable rate. We traced the issue to the DNS infrastructure we operate on the Amazon Web Services (AWS) platform. Initially we attempted to address query performance by increasing DNS server capacity by 500%, but that did not resolve the situation, and we continued to experience unexplained, severe throttling. We then repointed DNS services for the vast majority of our customers at local nameservers in each AWS network segment, which were not experiencing performance issues. This is not the AWS-recommended long-term approach for our DNS volume, but we coordinated it with AWS as an interim measure that allowed us to restore service fully for all customers about five hours after the incident began.
I’ve written before about how critical DNS infrastructure is to email delivery, and the ways in which DNS issues can expose bugs or unexpected limits in cloud networking and hosting. In short, email makes extraordinarily heavy use of DNS, and SparkPost makes more use of DNS than nearly any other AWS customer. As such, DNS has an outsized impact on the overall performance of our service.
In this case, the root cause of the degraded DNS performance was another undocumented, practical limit in the AWS network stack. The limit was triggered under a specific set of circumstances, of which actual traffic load was only one component.
Limits like these are to be expected in any network infrastructure. But one area where the cloud provides unique challenges is troubleshooting them when the network stack is itself an abstraction, and the traffic interactions are much more of a shared responsibility than they would be in a traditional environment.
Diagnosing this problem during the incident was difficult for us and the AWS support team alike. It ultimately required several days of effort and assistance from the AWS engineering team after the fact to recreate the issue in a test environment and then identify the root cause.
Working with the AWS Team
Technology stacks aside, we know how much our customers benefit from the expertise of our technical and service teams who understand email at scale inside and out. We actually mean it when we say, “our technology makes it possible—our people make the difference.”
That’s also been true working with Amazon. The AWS team has been essential throughout the process of identifying and resolving the DNS performance problem that affected our service last week. SparkPost’s site reliability engineering team worked closely with our AWS counterparts until we clearly understood the root cause.
Here are some of the things we’ve learned about working together on this kind of problem solving:
- Your AWS technical account manager is your ally. Take advantage of your account team. They’re advocates and guides to navigate AWS internal resources and services. They can reach out to product managers and internal engineering resources within AWS. They can hunt down internal information not readily available in online docs. And they really understand how urgent issues like the one we encountered can be to business operations. If a support ticket or other issue is not getting the attention it deserves, don’t hesitate to push harder.
- Educate AWS on your unique use cases. Ensure that the AWS account team, especially your technical account manager (TAM) and solution architect, is involved in as much of your daily workflow as possible. This way, they can learn your needs first hand and represent them inside of AWS. That’s a really important part of keeping surprises to a minimum.
- Run systematic tests and generate data to help AWS troubleshoot. The Amazon team is going to investigate the situation on their end, and of course they have great tools and visibility at the platform layer to do that. But they can’t replicate your setup, especially when you’ve built highly specialized and complex services like ours. Running systematic tests and providing the data to the AWS team will provide them with invaluable information that can help to isolate an unknown problem to a particular element of the platform infrastructure. And they can monitor things on their end during these tests to gain additional insight into the issue.
- Make it easy for engineers on both teams to collaborate directly. Though your account team is critical, they also know when letting AWS’ engineers and your engineers work together directly will save time and improve communication. It’s to your advantage to make that as easy as possible. Inviting the AWS team into a shared Slack channel, for example, is a great way to work together in real time, and it documents the interactions to help further troubleshooting and reproduce context in the future. Make use of other collaboration tools such as Google Docs for sharing findings and plans. Bring the AWS team onto your operations bridge line during incidents and use conference calls for regular check-ins following an incident.
- Understand that you’re in it together. AWS is a great technical stack for building cloud-native services. But one of the things we’ve come to appreciate about Amazon is how openly they work through hard problems when a specialized service like SparkPost pushes the AWS infrastructure into some edge cases. Their team has supported us in understanding root causes, developing solutions, and ultimately taking their learnings back to help AWS itself continue to evolve.
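As an illustration of the "run systematic tests and generate data" point above, here is a minimal sketch of a DNS probe that resolves a list of hostnames and summarizes failures and latency. It is not the tooling we used during the incident; the function name probe_dns, the injectable resolve and clock parameters, and the shape of the summary are all assumptions made for this example.

```python
import socket
import statistics
import time
from typing import Callable, Iterable


def probe_dns(
    hostnames: Iterable[str],
    resolve: Callable[[str], object] = socket.gethostbyname,
    clock: Callable[[], float] = time.perf_counter,
) -> dict:
    """Resolve each hostname once and summarize failures and latency.

    `resolve` and `clock` are injectable (an assumption of this sketch) so the
    same probe can run against a stub resolver in a test environment.
    """
    latencies_ms = []
    failures = []
    for name in hostnames:
        started = clock()
        try:
            resolve(name)
        except OSError as exc:  # socket.gaierror is a subclass of OSError
            failures.append((name, str(exc)))
            continue
        latencies_ms.append((clock() - started) * 1000.0)

    attempted = len(latencies_ms) + len(failures)
    return {
        "attempted": attempted,
        "failed": len(failures),
        "failure_rate": len(failures) / attempted if attempted else 0.0,
        "median_ms": statistics.median(latencies_ms) if latencies_ms else None,
        "max_ms": max(latencies_ms) if latencies_ms else None,
        "failures": failures,
    }
```

Calling probe_dns(["example.com", "example.org"]) from each affected host on a fixed schedule, and sharing the resulting summaries, gives both teams the same numbers to compare against what AWS observes at the platform layer.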
The AWS network and platform are key parts of SparkPost’s cloud architecture. We’ve developed some great knowledge about leveraging AWS from a technical perspective. We’ve also come to realize how important support from the AWS team can be when working to resolve issues in the infrastructure when they do arise.
In the coming weeks, we will write in more detail about the DNS architecture changes we are currently rolling out. They’re an important step towards increasing the resilience of our infrastructure.
Whether you’re building for the AWS network yourself or you’re a SparkPost customer who relies on our cloud infrastructure, I hope this explanation of what we’ve learned has been helpful. And of course, please reach out to me or any of the SparkPost team if you’d like to discuss last week’s incident.
VP Engineering and Cloud Operations
How We Tracked Down Unusual DNS Failures in AWS
We’ve built SparkPost around the idea that a cloud service like ours needs to be cloud-native itself. That’s not just posturing. It’s our cloud architecture that underpins the scalability, elasticity, and reliability that are core aspects of the SparkPost service. Those qualities are major reasons we’ve built our infrastructure atop Amazon Web Services (AWS)—and it’s why we can offer our customers service level and burst rate guarantees unmatched by anyone else in the business.
But we don’t pretend that we’re never challenged by unexpected bugs or limits of available technology. We ran into something like this last Friday, and that incident led to intermittent slowness in our service and delivery delays for some of our customers.
First let me say, the issue was resolved that same day. Moreover, no email or related data was lost. However, if delivery of your emails was slowed because of this issue, please accept my apology (in fact, an apology from our entire team). We know you count on us, and it’s frustrating when we’re not performing at the level you expect.
Some companies are tempted to brush issues like a service degradation under the rug and hope no one notices. You may have experienced that with services you’ve used in the past. I know I have. But that’s not how we like to do business.
I wanted to write about this incident for another reason as well: we learned something really interesting and valuable about our AWS cloud architecture. Teams building other cloud services might be interested in learning about it.
We ran into undocumented practical limits of the EC2 instances we were using for our primary DNS cluster. Sizing cloud instances based on traditional specs (processor, memory, etc.) usually works just as you’d expect, but sometimes that traditional hardware model doesn’t apply. That’s especially true in atypical use cases where aggregate limits can come into play—and there are times you run headlong into those scenarios without warning.
We hit such a limit on Friday when our DNS query volume created a network usage pattern for which our instance type wasn’t prepared. However, because that limit wasn’t obvious from the docs or standard metrics available, we didn’t know we’d hit it. What we observed was a very high rate of DNS failures, which in turn led to intermittent delays at different points in our architecture.
Digging Deeper into DNS
Why is our DNS usage special? Well, it has a lot to do with the way email works, compared to the content model for which AWS was originally designed. Web-based content delivery makes heavy use of what might be considered classic inbound “pull” scenarios: a client requests data, be it HTML, video streams, or anything else, from the cloud. But the use cases for messaging service providers like SparkPost are exceptions to the usual AWS scenario. In our case, we do a lot of outbound pushing of traffic: specifically, email (and other message types like SMS or mobile push notifications). And that push-style traffic relies heavily on DNS.
If you’re familiar with DNS, you may know that it’s generally fairly lightweight data. To request a given HTML page, you first have to ask where that page can be found on the Internet, but that request is a fraction of the size of the content you retrieve.
Email, however, makes exceptionally heavy use of DNS to look up delivery domains. For example, SparkPost sends many billions of emails to over 1 million unique domains every month. For every email we deliver, we have to make a minimum of two DNS lookups, and the use of DNS TXT records for anti-phishing technologies like SPF and DKIM means DNS is also required to receive mail. Add to that our more traditional use of AWS API services for our apps, and it’s hard to exaggerate how important DNS is to our infrastructure.
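To give a feel for the arithmetic, here is a back-of-the-envelope sketch of the average DNS query rate implied by "at least two lookups per email." The monthly volume in the usage note is a hypothetical input, not a SparkPost figure, and the function name and the 30-day month are assumptions. The result is a floor: it ignores retries, SPF and DKIM lookups, traffic peaks, and resolver caching.

```python
SECONDS_PER_DAY = 86_400


def average_dns_qps(
    emails_per_month: float,
    lookups_per_email: int = 2,   # the minimum stated above
    days_per_month: int = 30,     # assumption: a flat 30-day month
) -> float:
    """Average DNS queries per second implied by a monthly email volume."""
    total_lookups = emails_per_month * lookups_per_email
    return total_lookups / (days_per_month * SECONDS_PER_DAY)
```

With a hypothetical 10 billion emails in a month, average_dns_qps(10e9) comes out to roughly 7,700 queries per second, sustained around the clock, before any bursts are taken into account.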
All of this means we ran into an unusual condition in which our growing volume of outbound messages created a DNS traffic volume that hit an aggregate network throughput limit on instance types that otherwise seemed to have sufficient resources to service that load. And as denial-of-service attacks on the Dyn DNS infrastructure last year demonstrated, when DNS breaks, everything breaks. (That’s something anyone who builds systems that rely on DNS already knows painfully well.)
The sudden DNS issues triggered a response by our operations and reliability engineering teams to identify the problem. They teamed with our partners at Amazon to escalate on the AWS operations side. Working together, we identified the cause and a solution. We deployed a cluster of larger capacity nameservers with a greater focus on network capacity that could fulfill our DNS needs without running into the redlines for throughput. Fortunately, because all this was within AWS, we could spin up the new instances and even resize existing instances very quickly. DNS resumed normal behavior, lookup failures ceased, and we (and the outbound message delivery) were back on track.
To guard against this specific issue in the future, we’re also making DNS architecture changes to better insulate our core components from the impact of encounters with similar, unexpected thresholds. We’re also working with the Amazon team to determine appropriate monitoring models that will give us adequate warning to head off a similar incident before it affects any of our customers.
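One simple form such monitoring could take is a sliding-window failure-rate check on DNS lookups. The sketch below is illustrative only: the class name, the thresholds, and the window size are assumptions, not the monitoring model we are working out with Amazon.

```python
from collections import deque


class DnsFailureMonitor:
    """Track recent DNS lookup outcomes and flag an elevated failure rate."""

    def __init__(self, window_seconds=60.0, max_failure_rate=0.05, min_samples=20):
        # All three defaults are hypothetical; tune them to real traffic.
        self.window_seconds = window_seconds
        self.max_failure_rate = max_failure_rate
        self.min_samples = min_samples
        self._events = deque()  # (timestamp, succeeded) pairs, oldest first

    def record(self, timestamp, succeeded):
        """Add one lookup outcome and drop outcomes older than the window."""
        self._events.append((timestamp, succeeded))
        cutoff = timestamp - self.window_seconds
        while self._events and self._events[0][0] < cutoff:
            self._events.popleft()

    def failure_rate(self):
        """Fraction of lookups in the current window that failed."""
        if not self._events:
            return 0.0
        failed = sum(1 for _, ok in self._events if not ok)
        return failed / len(self._events)

    def should_alert(self):
        """True when enough samples exist and the failure rate is too high."""
        return (
            len(self._events) >= self.min_samples
            and self.failure_rate() > self.max_failure_rate
        )
```

The min_samples guard is there so that a single failed lookup during a quiet period does not page anyone; the alert only fires once the window holds enough data to be meaningful.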
AWS and the Cloud’s Silver Lining
I don’t want to sugarcoat the impact of this incident on our customers. But our ability to identify the underlying issue as an unexpected interaction of our use case with the AWS infrastructure—and then find a resolution to it in very short order—has a lot to do with how we built SparkPost, and our great relationship with the Amazon team.
SparkPost’s superb operations corps, our Site Reliability Engineering (SRE) team, and our principal technical architects work with Amazon every day. The strengths of AWS’ infrastructure have given us a real leg up in optimizing SparkPost’s architecture for the cloud. Working so closely with AWS over the past two years also has taught us a lot about spinning up AWS infrastructure and running quickly, and we also have the benefit of deep support from the AWS team.
If we had to work around a similar limitation in a traditional data center model, something like this could take days or even weeks to fully resolve. That agility and responsiveness are just two of the reasons we’ve staked our business on the cloud and AWS. Together, the kind of cloud expertise our companies share is hard to come by. Amazon has been a great business partner to us, and we’re really proud of what we’ve done with the AWS stack.
SparkPost is the first email delivery service that was built for the cloud from the start. We send more email from a true cloud platform than anyone, and sometimes that means entering uncharted territory. It’s a fundamental truth of computer science that you don’t know what challenges occur at scale until you hit them. We found one on AWS, but our rapid response is a great example of the flexibility the cloud makes possible. It’s also our commitment to our customers.
Whether you’re building your own infrastructure on AWS or you’re a SparkPost customer who takes advantage of ours, I hope this explanation of what happened last Friday, and how we resolved it, has been useful.
VP Engineering and Cloud Operations
An overview of the email API
With apologies to all who appreciate the particular genius of the 1967 film The Graduate, “I want to say one word to you. Just one word. Are you listening? APIs.”
These days, if you belly up to any bar (or spirits-free meet-up, if you’re so inclined) frequented by tech industry types, chances are you’re going to hear certain buzzwords. I can almost guarantee that “the cloud” and “API” are going to be among them. Sure, roll your eyes. I’ve been there, and I’ve been among those that smiled at a recent Internet meme that declared, “There is no cloud; it’s just someone else’s computer.”
But, when I really think about how tech and its use in the real world has evolved over the past decade, I stop dismissing those words as mere jargon. After all, almost every digital service I use today actually lives “in the cloud.” The bits that make up this blog actually live in the cloud, not a specific server my team maintains. So do the emails I get from readers and the music I listen to while writing. So do most of the web sites I visit and do business with. Unlike the early days of the Internet, when I first began working in this industry, server outages are all but a memory. In fact, the notion of a discrete “server” has all but been eliminated for most applications.
How did this happen? A lot of things made it possible, but the evolution of APIs is key. APIs (“application programming interfaces”) are the fundamental method by which all the virtual infrastructure embodied by the cloud is interconnected. The cloud could not exist without them. Even among business users, “API” has become a ubiquitous part of today’s technology vernacular. But how APIs actually work, and their key role in the cloud revolution, are sometimes taken for granted. We assume they just work.
A simple analogy is with the power outlets in the wall of your office or home. These receptacles provide a standardized interface to connect an appliance to the power network. Simple, right? Yes, it is. But look a little more closely, and that standard gets a little more complicated. If you have an older house, some of your outlets might not support the grounding function of three-pronged plugs or the modern, polarized version of two-prong plugs. Moreover, your electric clothes dryer may require a special, oversized plug that connects to circuits that pull more power. Of course, if you’re traveling to the UK, you’ll need a special adapter to plug in your laptop, because it’s a different standard over there. Oh, heck, might as well bring the whole adapter kit with seven types of plugs to accommodate the other countries you’ll be visiting, too. And there are all kinds of extra connection standards for industrial applications that go well beyond what you or I encounter in our everyday experience.
APIs have a similar quality. They are the standard way for one piece of software to plug into—to invoke the functionality of—another piece of software. APIs connect disparate systems, services, and technologies. They are, in short, what makes the virtual infrastructure of the cloud possible. And email APIs are how any app or service can add email without reinventing the wheel.
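To make "plugging in" concrete, here is a rough sketch of what adding email through an HTTP API can look like from the app's side. The endpoint URL, the JSON field names, and the bearer-token header are hypothetical placeholders, not the API of SparkPost or any other specific provider; check your provider's documentation for the real request shape.

```python
import json
import urllib.request

# Hypothetical endpoint, for illustration only.
API_URL = "https://api.example.com/v1/messages"


def build_send_request(api_key, sender, recipient, subject, text):
    """Build (but do not send) an HTTP request asking an email API to send a message."""
    payload = {
        "from": sender,
        "to": [recipient],
        "subject": subject,
        "text": text,
    }
    return urllib.request.Request(
        API_URL,
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Authorization": f"Bearer {api_key}",  # assumed auth scheme
            "Content-Type": "application/json",
        },
        method="POST",
    )
```

Passing the result to urllib.request.urlopen() would actually send it. The point is that the app describes the message in a few lines and leaves queuing, DNS lookups, and delivery to the service on the other side of the interface.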
However, APIs historically were highly idiosyncratic, with very little standardization among platforms. They were clumsy to code, difficult to invoke, and often poorly performing with limited scalability. It’s not a surprise, then, that programmers and IT decision-makers alike often treated them as an afterthought to their overall technology implementation and were loath to rely upon them in mission-critical contexts. In light of the constraints of both hardware resources and API performance, most developers chose to keep everything under one roof in a monolithic codebase optimized for a specific hardware environment.
So what changed that moved APIs from little-loved feature of monolithic applications to the all-but-invisible linchpin of the modern cloud? Three major developments are responsible for the upending of this dynamic and making today’s architecture possible:
- The rise of the Internet as a ubiquitous web that connects nearly every computer (and other electronic devices, whether televisions, phones, refrigerators, thermostats, inventory control systems, or factory equipment) removes one historical constraint: “always on” connectivity.
- The exponential growth in the performance and capacity of computer hardware and storage devices (mirrored by plummeting costs) removes another limit: economies of scale.
- A codification (both formal and de facto) of several design patterns and best practices for describing, invoking, and transmitting information among diverse software systems provides the final piece of the puzzle: API standardization.
Together, these forces have enabled massively scaled cloud platforms. In turn, these platform-as-a-service offerings form the basis of virtualized computing stacks for countless applications across every industry and consumer market. And, yes, a good email API makes it possible to add email to nearly any app and to have some confidence that it just works.
OK, so what does that mean in the real world? Well, it means on the subway home from work, I can open my mobile app and expect to see the exact same data I touched at the office. It means the data I entered in one app get distributed to several other systems without requiring duplicate entry or manual synchronization of records. It means the store I just visited doesn’t need to worry about hosting their own infrastructure just to email me a purchase receipt. It all just works, and it’s all in the cloud.
That’s pretty amazing, if you really think about it. And that’s why an email API matters.
How is your day touched by APIs and the cloud? I’d love to hear about it!
By the way, want to learn a little more about the role of email APIs and how they’ve changed the way businesses use email? Read “Email Evolved: Why the Cloud and Modern APIs Matter for the Future of Data-Driven Marketing.” And, if you’re wondering why a good email API and cloud architecture make a difference, check out the blog post “Why ESPs Struggle to Deliver Data-Driven Email.” (Spoiler: it’s because ESPs aren’t really API-driven.)
Continuing on my New Year’s resolution to share what I’ve learned and put those learnings into practice, I thought I’d dig into the subject of security. One of the things that I learned was that security is very important for everyone, but particularly for customers who are moving away from hosting their own infrastructure and entrusting their assets to a cloud provider. The learning is clear, but putting it into practice is the next step.
As it happens, a large number of features that we’ve rolled out over the past six months were, in fact, security related. These include:
- Adding a maximum number of log-in attempts before the system times out.
- Two-Factor Authentication.
- Whitelisting API Keys allowed to inject messages.
- Implementing OAuth2 for Webhooks.
- Adding an option for Single Sign On (SSO) on SparkPost Elite accounts.
- Adding role-based access controls, more specifically a Reporting-Only role.
And those are just the customer-facing ones. We were looking at our overall cloud email security practices, even before hiring Steven Murray, our CISO. And he’s making changes — features, internal functionality, processes — to make sure security continues to be a high priority. For example, we’ve instituted intrusion detection to make sure we’re keeping our systems locked down.
Here are the things we recommend our customers do to improve cloud email security when using SparkPost:
- Use strong passwords!
- Make sure every user enables Two-Factor Authentication when accessing the SparkPost account. This is the single biggest deterrent to attempts to hack into your account, and it’s easy to do.
- Assign roles to your users. If all they’re doing is looking at reports, then make them a Reporting-Only user.
- Make sure to change the password on any shared accounts on a regular basis.
- Set up your engagement tracking domains as HTTPS (Elite accounts).
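For readers curious why Two-Factor Authentication is such a strong deterrent, here is a minimal sketch of time-based one-time passwords (TOTP, RFC 6238), one common second factor. This post does not say which mechanism SparkPost uses, so treat TOTP here as an assumption chosen for illustration; the function names and the one-step drift allowance are also assumptions.

```python
import hashlib
import hmac
import struct
import time


def hotp(secret: bytes, counter: int, digits: int = 6) -> str:
    """HMAC-based one-time password (RFC 4226) for a given counter value."""
    digest = hmac.new(secret, struct.pack(">Q", counter), hashlib.sha1).digest()
    offset = digest[-1] & 0x0F  # dynamic truncation: low nibble picks the offset
    code = struct.unpack(">I", digest[offset:offset + 4])[0] & 0x7FFFFFFF
    return str(code % 10 ** digits).zfill(digits)


def totp(secret: bytes, at=None, step: int = 30, digits: int = 6) -> str:
    """Time-based one-time password: HOTP over the current 30-second step."""
    now = time.time() if at is None else at
    return hotp(secret, int(now // step), digits)


def verify_totp(secret: bytes, submitted: str, at=None, step: int = 30,
                drift_steps: int = 1) -> bool:
    """Accept a code from the current step or an adjacent one (clock drift)."""
    now = time.time() if at is None else at
    for delta in range(-drift_steps, drift_steps + 1):
        candidate = totp(secret, now + delta * step, step)
        if hmac.compare_digest(candidate, submitted):
            return True
    return False
```

Because the code changes every 30 seconds and is derived from a secret that never leaves the user's device and the server, a stolen password alone is not enough to log in.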
Looking ahead, we will be adding support for more Single Sign On identity providers, adding rotation of DKIM keys, and continually looking at how we store and access data without impacting performance.
What are your most pressing security concerns?
Jay Henderson, general manager of IBM’s Marketing Cloud business, was a featured speaker at SparkPost’s annual Insight user conference. In his talk, he delivered a great overview of how marketing is adapting to new technology and business models.
His highlights include:
- Noisy marketing chatter is cheap. And ineffective. The number of campaigns and emails and messages we encounter has soared. The right content is critical to breaking through the clutter.
- Mobile technology is ubiquitous. Mobile marketing isn’t a niche—in fact, it’s the most mainstream medium possible. Email and other marketing tools need to be mobile-first, not mobile-maybe.
- Cross-channel experiences are not optional. See above—customers use mobile all the time. But then, they walk into a store. Or switch to a web site at their desk. Does the experience you offer follow effortlessly, or at all?
- Marketers must manage their portfolio of marketing technologies (the martech stack) strategically. In a period of innovation, it’s easy to wind up with a jumble of technologies that barely hold together. Balancing streamlined technology investments against the pursuit of competitive advantage and differentiation is a perennial challenge for all of us.
- Successful marketers cultivate a culture that bridges technology and creativity. Better-performing marketing teams collaborate with their technology providers and have the operational resources to make the most of martech.
I’ve included the slides from Jay’s keynote below.
Jay’s presentation stuck with me in the subsequent weeks, and I kept thinking about the implications of these changes for my profession. What does it mean to be a marketer today?
In most regards, there’s never been a better time to be a marketer. The technology’s awesome and has made all kinds of things possible that we couldn’t do before. The business models of growing businesses, be they cloud-based services or real-world retail, are increasingly dependent upon marketing expertise to understand customer behavior and engagement. The portion of marketing budgets that is considered strategic rather than discretionary is increasing. The amount of “marketing” that we all experience is going up, up, and up.
Yep. Good times.
But why, then, do I sometimes sense some unspoken anxiety from my peers at professional conferences and meetups? Why does it seem that a lot of us are “faking it ’til we make it?” Hey, no shame. I have those feelings from time to time, too. I think a lot of it is just human nature in the midst of change. Some of us welcome it, and some of us fear it, but there’s no denying that the marketing worldview has changed in lots of meaningful ways.
Maybe you’ve heard the comment author William Gibson made at the dawn of the modern Internet era: “The future is already here—it’s just not very evenly distributed.” And perhaps, at first, the different facets of digital and data-driven marketing seemed like specialized skills. But this change has been underway for a long time, and it’s now become so obvious that it’s impossible to miss.
Over the two-plus decades since Gibson made his quip about the future and the web ushered in commercial use of the Internet, several technology stacks have arisen (and sometimes fallen) to enable an ever more intimate (read: data-driven) customer marketing relationship.
So, for marketers, the future indeed already is here. It’s all about the long-term shift in how businesses think about connecting with customers. Conceptually, one of the biggest changes has been an explicit reorientation away from a relatively static notion of marketing—for example, that a customer’s decision to buy is based on a lightning strike of the right combination of the four P’s of product, price, promotion, and place. Instead, most of us today understand that a customer’s experience with a brand or a company really occurs in many steps over time. That experience over time is the “customer journey” that we strive to perfectly fulfill.
But one thing that sometimes gets lost in this acknowledgement of the primacy of the customer journey is that we marketers have been on a journey of our own. Sometimes that journey requires a fresh perspective on our craft. Other times, it means becoming facile with new technologies. And throughout, it depends upon bringing more “science” and empirical decision-making to our creative “art” of communication. But above all, it means not standing still.
By the way, do check out Jay Henderson’s presentation that I embedded at the beginning of this post. His ideas definitely are worth your time.
Do you agree? What are the changes you’ve experienced as a marketer? I’d love to hear from you!
Check out more from Insight 2015:
- Global Data Privacy Insight from 4 International Email Experts
- #SendLikeABoss – The Best of Insight 2015
Every day when I pick up my kids from school, I ask them what they’ve learned that day. They proceed to tell me what they did—in class, after school, what they had for lunch, who they played with at recess. But getting them to articulate what they learned is a lot harder. So in the spirit of setting an example, I thought I’d report on what I learned this year as a product manager for SparkPost.
First, let me back up: it’s been a year of remarkable change and growth for our company. We made the leap from our origins as an established, packaged software vendor to a software-as-a-service operation. We architected an entirely virtualized, cloud-based infrastructure. We built and launched our core SparkPost offering. We expanded upon that foundation to introduce the SparkPost Elite service with dedicated instances and service level agreements to suit the world’s most demanding senders. We built out a world-class operations, deliverability, and customer success team. And, we changed our brand from Message Systems to SparkPost to better reflect all of these changes.
But those are things that we did. What did we learn? Here are four lessons about doing business in the cloud that really hit home for me this year.
- Offering a cloud service means more than engineering a technology stack. It requires a deep understanding of how customers actually integrate technology into their business processes. It also means publicly staking a claim with the right product/market fit and countering a new group of competitors. All in the open.
- Another key lesson for us at SparkPost has been just how critically important it is to reduce friction throughout the customer lifecycle, from selling to onboarding to daily ease-of-use and support. In plain language: the cloud means we need, more than ever, to make it easy for customers to do business with us. In our market of high-volume, high-value email, we want to make it drop-dead easy for legitimate senders, while freezing out spammers and phishers. Ultimately, dealing with the bad guys in the email world is where our rock-star compliance and deliverability teams give us a real competitive advantage. But as a product manager, I can assure you that it takes a lot of attention to detail to get that balance just right.
- The cloud changes everything, including the business model. If you’ve spent any time in the traditional software industry, you know how big, perpetual license deals are the name of the game. But there’s a reason why the business model for cloud businesses is called “software-as-a-service.” Services aren’t a one-and-done deal; instead, our accountants report recurring revenue as the primary metric for our shareholders. For customers, that’s good news: lower up-front capital expenditures, more bite-sized spending, and a real incentive for the company (that’s us) to keep customers happy and earn that recurring revenue.
- And this brings me to the thing I think about every day of the year. Of course I want to develop a product that has the most compelling features in the industry. Of course I want to see my product beat out competitors on the biggest deals. But the discipline that the recurring revenue model enforces on us means that customer retention (and that really means customer satisfaction) is simply crucial. To be frank, the same simplicity that makes the cloud so compelling also makes it pretty easy for a customer to switch to a new service provider. So, that means that I am always working to make SparkPost better-performing, easier to use, cost-effective, and a step ahead of my competitors in all the ways that matter to our customers, including email deliverability.
That last lesson is the most important thing any company needs to remember, and doing business in the cloud simply makes it all the more obvious. So, what I learned in 2015 (and will keep focused on for 2016) really is a reminder of what my colleagues and I have always believed: keeping our customers happy is the key to our success. It’s not the technology, and it’s not the marketing, or anything else except you. So a heartfelt thank you from all of us at SparkPost and from me personally. I’m looking forward to an awesome 2016.
—Irina, Cloud Queen 👸
Message Systems CEO Phillip Merrick unveils the company’s new SparkPost email services offerings at Insight, the Message Systems User Conference in San Diego, November 12th, 2014.