At SparkPost we take pride in providing the highest-quality technical solutions to the world’s most demanding senders. It’s a position we’ve been grateful to hold since the beginning, when we started out building the infrastructure software that powers most of the world’s ESPs. Now that we’re in the service provider business ourselves, we’ve found new technical challenges to overcome.

Recently, a customer approached us with a challenging ask. One of the world’s largest news providers, they need to deliver breaking news to a subscriber base of more than 25 million people. The nature of breaking news being what it is, it is critical for them to reach their entire subscriber base in as short a time as possible, so that the news comes from them before it is picked up and retweeted or resent by others.

When this customer first signed on with us, they wanted an SLA supporting 1 million messages per minute. Recently though, with the pace of current events around the world picking up, they came back looking to double that guaranteed rate. To support a guaranteed rate in excess of 2 million messages per minute, we needed to go back to the drawing board on how we handle scalability inside the SparkPost platform. In the end, we beat our 2m/m mark and were able to sustain injections from a single customer at over 2m/m in our production environment while normal customer production workloads were being handled concurrently.

In approaching this scalability work, we focused on collecting better telemetry inside our service components and prioritizing some autoscaling changes. If you appreciate highly technical details, here is how we approached this process: 

Telemetry

A key to any performance engineering problem is understanding exactly where your challenges are. We had two areas within the SparkPost infrastructure where we were relatively blind: low-level network stats and disk I/O. With low-level network stats, we didn’t even know this was a gap until we started seeing backups in the chain from our Transmissions API microservice (which receives messages from customers) to our internal AWS Application Load Balancers (ALBs) to the MTA servers that handle final delivery.

Under high single-customer throughput we saw high API latency between the ALBs and the MTAs that was hard to track down. AWS ALBs are managed services with limited introspection ability, so we needed to rely on telemetry from both sides of our application to determine the cause. We put together a set of data collection scripts that scraped stats from all our Elastic Container Service (ECS) instances on both sides, which finally gave us enough visibility into the low-level TCP stats to escalate to AWS with sufficient granularity to get them to address some internal ALB issues.
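We haven’t published those collection scripts, but the gist is simple: scrape the kernel’s TCP counters on each instance and ship them somewhere you can graph them. Here is a stripped-down sketch of that idea in Python; the counter names come straight from /proc, while the push endpoint and payload format are placeholders rather than our actual ingestion API.

```python
#!/usr/bin/env python3
"""Collect low-level TCP counters from /proc and ship them to a metrics endpoint.

A minimal sketch: the field names come from the kernel's /proc files, but the
push URL and payload shape are illustrative placeholders.
"""
import json
import time
import urllib.request

PROC_FILES = ["/proc/net/snmp", "/proc/net/netstat"]
INTERESTING = {"RetransSegs", "InErrs", "OutRsts",
               "ListenDrops", "ListenOverflows", "TCPLostRetransmit"}
PUSH_URL = "http://metrics.internal:8080/ingest"   # hypothetical collector endpoint


def read_tcp_counters():
    """Parse the header/value line pairs used by /proc/net/snmp and /proc/net/netstat."""
    counters = {}
    for path in PROC_FILES:
        with open(path) as f:
            lines = f.read().splitlines()
        for header, values in zip(lines[::2], lines[1::2]):
            proto = header.split(":")[0]
            names = header.split()[1:]
            nums = values.split()[1:]
            for name, num in zip(names, nums):
                if name in INTERESTING:
                    counters[f"{proto}.{name}"] = int(num)
    return counters


def push(counters):
    """POST a JSON snapshot; the ingest format here is an assumption for this sketch."""
    body = json.dumps({"ts": int(time.time()), "tcp": counters}).encode()
    req = urllib.request.Request(PUSH_URL, data=body,
                                 headers={"Content-Type": "application/json"})
    urllib.request.urlopen(req, timeout=5)


if __name__ == "__main__":
    while True:
        push(read_tcp_counters())
        time.sleep(10)   # scrape interval
```

Counters like retransmitted segments and listen queue drops are exactly the kind of signal that is invisible in standard load balancer metrics but jumps out once you graph both sides of the connection.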

The second area where we didn’t have enough telemetry data was disk I/O. We firmly believe in not taking any shortcuts in our guarantee that if we accept a message for delivery, we will never lose it. That requires persisting every message to disk as part of reception, as well as guaranteeing the accuracy and completeness of logging, which makes our MTAs a demanding disk consumer. The AWS-provided data on this is fine for many purposes, but when you are as heavily reliant on disk performance as SparkPost is, sometimes the available data from AWS is not enough. EBS volumes have very strict throughput and IOPS caps, which can heavily degrade your application if you start bumping up against those limits as you scale. This led us to develop wrappers around iostat that we can push into our monitoring platform for better visibility around throughput and IOPS limits. We did eventually increase our EBS volume sizes to gain better IOPS and performance, since we were bumping against these limits and seeing higher-than-desired API latency on our MTAs.
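For illustration, here is roughly what a thin iostat wrapper can look like. The column names are sysstat’s extended-output names and vary slightly between versions, and the EBS limits in the sketch are placeholders for whatever your volumes are actually provisioned at.

```python
#!/usr/bin/env python3
"""Wrap iostat and emit per-device IOPS / throughput so they can be graphed
against EBS volume limits.

A minimal sketch: column names vary between sysstat versions, and the limits
below are placeholders, not real provisioned values.
"""
import subprocess

# Hypothetical provisioned limits for the volume we care about.
EBS_LIMITS = {"nvme1n1": {"iops": 3000, "throughput_kb_s": 125000}}


def sample(interval=10):
    """Run one extended iostat sample and return {device: {column: value}}."""
    out = subprocess.run(
        ["iostat", "-dxk", str(interval), "2"],   # 2 reports; the second covers the interval
        capture_output=True, text=True, check=True,
    ).stdout
    blocks = out.strip().split("\n\n")
    stats, columns = {}, []
    for line in blocks[-1].splitlines():          # last block = most recent interval
        parts = line.split()
        if parts and parts[0].startswith("Device"):
            columns = parts[1:]
        elif columns and parts:
            stats[parts[0]] = dict(zip(columns, map(float, parts[1:])))
    return stats


if __name__ == "__main__":
    for dev, row in sample().items():
        iops = row.get("r/s", 0) + row.get("w/s", 0)
        kb_s = row.get("rkB/s", 0) + row.get("wkB/s", 0)
        limit = EBS_LIMITS.get(dev)
        line = f"{dev}: {iops:.0f} IOPS, {kb_s:.0f} kB/s"
        if limit:
            line += f" ({iops / limit['iops']:.0%} of provisioned IOPS)"
        print(line)
```

In practice you would push these numbers into your monitoring platform rather than print them, so you can alert well before a volume saturates.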

Autoscaling

API latency is a continual challenge for cloud service providers. We strive to keep our API latencies as low as possible, so that our customers can use our services cost-effectively without continually increasing their client resources. This applies equally to minimizing latency between internal components of our service. We comprehensively measure both internal and external services, including ALB and target group response times and internal API request times, which we log into Circonus for historical time series analysis.
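As a sketch of that kind of pipeline, the snippet below pulls the ALB’s p99 TargetResponseTime from CloudWatch and forwards it to a Circonus HTTPTrap check. The load balancer dimension and trap URL are placeholders you would swap for your own, and the shape of the forwarded JSON is an assumption for this sketch.

```python
#!/usr/bin/env python3
"""Pull ALB target response times from CloudWatch and forward them to Circonus.

A minimal sketch: the load balancer dimension and the HTTPTrap URL are
placeholders; TargetResponseTime in the AWS/ApplicationELB namespace is the
standard CloudWatch metric.
"""
import datetime
import json
import urllib.request

import boto3

ALB_DIMENSION = "app/my-alb/0123456789abcdef"   # placeholder ALB identifier
TRAP_URL = "https://trap.noit.circonus.net/module/httptrap/CHECK_UUID/SECRET"  # placeholder


def fetch_p99(minutes=5):
    """Return per-minute p99 TargetResponseTime datapoints for the last few minutes."""
    cw = boto3.client("cloudwatch")
    now = datetime.datetime.utcnow()
    resp = cw.get_metric_statistics(
        Namespace="AWS/ApplicationELB",
        MetricName="TargetResponseTime",
        Dimensions=[{"Name": "LoadBalancer", "Value": ALB_DIMENSION}],
        StartTime=now - datetime.timedelta(minutes=minutes),
        EndTime=now,
        Period=60,
        ExtendedStatistics=["p99"],
    )
    return sorted(resp["Datapoints"], key=lambda d: d["Timestamp"])


def push_to_circonus(datapoints):
    """Send the latest p99 (converted to milliseconds) to an HTTPTrap check."""
    if not datapoints:
        return
    latest_ms = datapoints[-1]["ExtendedStatistics"]["p99"] * 1000.0
    body = json.dumps({"alb_target_response_p99_ms": latest_ms}).encode()
    req = urllib.request.Request(TRAP_URL, data=body, method="PUT",
                                 headers={"Content-Type": "application/json"})
    urllib.request.urlopen(req, timeout=5)


if __name__ == "__main__":
    push_to_circonus(fetch_p99())
```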

SparkPost has done a lot to lower its API latency on Transmissions API requests, including re-architecting that API into an autoscaling microservice container fleet. We measure API latency for our services from the time we receive the first byte of data from the customer until the full response leaves our network. This means client-specific factors like transit time and message/payload size are reflected in our latency numbers. We feel this produces the most accurate reflection of actual customer experience, which is the ultimate goal of our optimizations.

This work accelerated our timeline for autoscaling many of our downstream systems. Our externally facing endpoints already autoscale, but it was helpful to implement predictive, preemptive scaling to smoothly handle very short, high-throughput bursts. To further prevent slowdowns and outages on critical services while we continue to improve our autoscaling, we also implemented back-pressure within internal systems, including our MTAs. This allows our MTA servers to gracefully and intentionally degrade performance to shift load onto better-performing nodes. Autoscaling store-and-forward MTAs, which are inherently stateful, is a complex and interesting topic, and we’ll tackle how we approach that in a future post.
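To make the back-pressure idea concrete, here is a toy illustration (not our actual MTA code) of the pattern: watch the local spool depth and tempfail a growing fraction of new injections as it climbs, so that load drifts toward healthier nodes instead of hitting a hard wall. The spool path and thresholds are made up for the example.

```python
"""Toy illustration of queue-depth back-pressure on an injection path.

Not an MTA implementation; just the shape of the idea: shed load gradually
between a soft and a hard limit using SMTP-style temporary failures.
"""
import os
import random

SPOOL_DIR = "/var/spool/mta"      # hypothetical spool location
SOFT_LIMIT = 50_000               # start shedding load here
HARD_LIMIT = 100_000              # refuse everything here


def spool_depth():
    """Cheap proxy for queue depth: number of spooled message files."""
    return len(os.listdir(SPOOL_DIR))


def accept_injection():
    """Return an (SMTP-style code, message) for a new injection attempt."""
    depth = spool_depth()
    if depth >= HARD_LIMIT:
        return 452, "4.3.1 queue full, try another node"
    if depth >= SOFT_LIMIT:
        # Shed a growing fraction of traffic as we approach the hard limit.
        shed_probability = (depth - SOFT_LIMIT) / (HARD_LIMIT - SOFT_LIMIT)
        if random.random() < shed_probability:
            return 451, "4.3.2 deferring due to local load"
    return 250, "accepted"
```

The gradual ramp is the point: a node that sheds 10% of new work at 60% full recovers far more gracefully than one that accepts everything until it falls over.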

How We Test

So how can customers leverage this work to achieve better performance on SparkPost? While every customer workload is slightly different, there are lessons to be learned from how we test our own performance. When we run tests for single-recipient mails (i.e., where every recipient has their own unique payload), we get the best results with a highly concurrent injector. This helps mitigate unavoidable per-message latencies like network transit times. Our REST service will outperform our SMTP endpoints, if for no other reason than that a REST submission is one request and one response, whereas SMTP is a very ‘chatty’ protocol with multiple exchanges to send a single message.
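As a starting point, here is a minimal sketch of a highly concurrent single-recipient REST injector built on asyncio and aiohttp. The API key, sending address, recipient list, and concurrency level are placeholders you would tune to your own account and workload.

```python
#!/usr/bin/env python3
"""A minimal sketch of a highly concurrent single-recipient REST injector.

Posts to the SparkPost Transmissions endpoint; the key, sending domain,
recipients, and concurrency level are placeholders for this example.
"""
import asyncio
import os

import aiohttp

API_URL = "https://api.sparkpost.com/api/v1/transmissions"
API_KEY = os.environ["SPARKPOST_API_KEY"]
CONCURRENCY = 64                     # concurrent in-flight requests per injector process


def build_payload(recipient):
    """One unique payload per recipient (the single-recipient case described above)."""
    return {
        "recipients": [{"address": recipient}],
        "content": {
            "from": "news@example.com",        # placeholder sending address
            "subject": "Breaking news",
            "text": f"Personalized alert for {recipient}",
        },
    }


async def inject(session, sem, recipient):
    async with sem:
        async with session.post(API_URL, json=build_payload(recipient)) as resp:
            return resp.status


async def main(recipients):
    sem = asyncio.Semaphore(CONCURRENCY)
    async with aiohttp.ClientSession(headers={"Authorization": API_KEY}) as session:
        statuses = await asyncio.gather(*(inject(session, sem, r) for r in recipients))
    print(f"accepted: {sum(1 for s in statuses if s == 200)}/{len(statuses)}")


if __name__ == "__main__":
    asyncio.run(main([f"user{i}@example.com" for i in range(10_000)]))
```

Keeping many requests in flight at once is what hides the per-request round-trip time; a single-threaded, one-at-a-time injector will never get close to its SLA regardless of how fast the API responds.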

Message latency is also highly dependent on payload size. Part of this is that bigger payloads require more processing time, but at the TCP level a larger payload requires more TCP segments, which means more round trips, multiplying the impact of unavoidable network transit times.
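A quick back-of-envelope shows the shape of that effect. Assuming a 1460-byte MSS and a slow-start congestion window that begins at 10 segments and doubles each round trip (real connections vary), segment count and round trips grow with payload size roughly like this:

```python
"""Back-of-envelope: how payload size turns into extra round trips.

Assumes MSS = 1460 bytes and an initial congestion window of 10 segments that
doubles per round trip; real connections differ, but the trend is the point.
"""
import math

MSS = 1460          # bytes per TCP segment
INIT_CWND = 10      # segments


def round_trips(payload_bytes):
    segments = math.ceil(payload_bytes / MSS)
    rtts, cwnd, sent = 0, INIT_CWND, 0
    while sent < segments:
        sent += cwnd
        cwnd *= 2
        rtts += 1
    return segments, rtts


for size in (2_000, 20_000, 200_000):
    segs, rtts = round_trips(size)
    print(f"{size:>7} byte payload -> {segs:>4} segments -> {rtts} round trip(s)")
```

A 2 KB payload fits in a single round trip, while a 200 KB payload needs several, and each extra round trip is another full network transit time added to that message’s latency.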

For high throughput, we typically recommend as much concurrency as possible. When we test for maximum performance, we use wrk/wrk2 and configure each instance to run between 30 and 75 sessions with 64 concurrent connections each. During some of our high-volume testing, this led to around 25k connections to SparkPost, reaching volumes of 2.3 million messages per minute. As we scale further, we take that injector setup and spread it across multiple nodes. To better simulate real customer workloads, we have also set up these injectors in other cloud providers and regions to mimic customer infrastructure as closely as possible. This highly concurrent and flexible setup has produced solid results.
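For a flavor of what that harness looks like, here is a simplified sketch that launches a batch of wrk2 sessions with those connection counts. The target URL, request rate, and the Lua script that builds the POST body are placeholders, and depending on how it is installed the binary may be named wrk rather than wrk2.

```python
#!/usr/bin/env python3
"""Launch a batch of wrk2 sessions against an injection endpoint.

A simplified sketch of the harness idea: many sessions of 64 connections each.
The endpoint, rate, duration, and post.lua script are placeholders.
"""
import os
import subprocess

TARGET = "https://api.sparkpost.com/api/v1/transmissions"   # injection endpoint
API_KEY = os.environ.get("SPARKPOST_API_KEY", "")
SESSIONS = 30            # wrk2 processes on this node (30 to 75 in our tests)
CONNECTIONS = 64         # concurrent connections per session
RATE = 1200              # requests/sec per session; tune to the SLA being tested
DURATION = "5m"

procs = [
    subprocess.Popen([
        "wrk2",                          # may be installed as plain "wrk"
        "-t", "4",                       # worker threads per session
        "-c", str(CONNECTIONS),
        "-d", DURATION,
        "-R", str(RATE),
        "-s", "post.lua",                # Lua script that builds the POST payload
        "-H", f"Authorization: {API_KEY}",
        TARGET,
    ])
    for _ in range(SESSIONS)
]

for p in procs:
    p.wait()
```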

While you would of course never use wrk2 as your actual injector, the architecture does translate well to a highly performant injection process.

We’re really proud of the work we’ve done to support massive burst SLAs for our customers. Taking on exciting and ambitious projects like these not only helps our customers achieve their goals, but also helps us understand and improve our network for all customers.

~ Nathan