There are many ways to obtain metadata about your transmissions sent via SparkPost. We built a robust reporting system with over 40 different metrics to help you optimize your email deliverability. At first, we attempted to send metadata to our customers via carrier pigeons to meet customer demand for a push-based event system. We soon discovered that the JSON the birds delivered was not as clean as customers wanted. That’s when we decided to build a scalable Webhooks infrastructure using more modern technologies.
Like our reporting, the webhook infrastructure at SparkPost begins with what we call our Event Hose. This piece of the Momentum platform generates the raw JSON data that will eventually reach your webhook endpoint. As Bob detailed in his Reporting blog post, after every message generation, bounce event, delivery, etc., Momentum logs a robust JSON object describing every quantifiable detail of the event that occurred (we found unquantifiable details didn’t fit into the JSON format very well).
Each of these JSON event payloads is loaded into an AMQP-based RabbitMQ exchange. This exchange fans the messages out to the desired queues, including the queue that holds your webhooks traffic. We currently use RabbitMQ as a key part of our application’s infrastructure stack to queue and reliably deliver messages. We use a persistent queue to ensure that RabbitMQ holds each message until it’s delivered to your consumer. In addition, the system we’ve built is ready to handle failures, downtime, and retries.
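To make the fan-out idea concrete, here is a toy, in-memory sketch in Python. It is not a RabbitMQ client and not our production code; it assumes fanout-style routing, and the queue names and event fields are made up for illustration.

```python
from collections import deque


class FanoutExchange:
    """Toy in-memory stand-in for a fanout exchange (not a RabbitMQ client)."""

    def __init__(self):
        self.queues = {}

    def bind(self, queue_name):
        # Create the queue if it does not exist yet.
        self.queues.setdefault(queue_name, deque())

    def publish(self, message):
        # Every bound queue receives its own copy of the message.
        for queue in self.queues.values():
            queue.append(dict(message))


exchange = FanoutExchange()
for name in ("metrics", "message_events", "webhooks"):  # hypothetical queue names
    exchange.bind(name)

exchange.publish({"type": "delivery", "message_id": "abc123"})  # made-up event
```

Each subscriber then drains its own queue at its own pace, which is why a slow webhook consumer does not hold up the other consumers in this model.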
Between RabbitMQ and your consumer, we have an ETL process that will create batches of these JSON events for each webhook you have created. We believe in the “eat your own dogfood” philosophy for our infrastructure. So our webhooks ETL process will call out to our public webhooks API to find out where to send your batches. Additional headers or authentication data may be added to the POST request. Then the batch is on its way to your consumer.
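As a rough illustration of the batching step, the sketch below groups a stream of events into fixed-size batches and serializes one batch as a JSON POST body. The batch size, the event fields, and the function name `make_batches` are assumptions for the example, not details of the actual ETL process.

```python
import json
from itertools import islice


def make_batches(events, batch_size):
    """Yield lists of at most batch_size events, preserving order."""
    iterator = iter(events)
    while True:
        batch = list(islice(iterator, batch_size))
        if not batch:
            return
        yield batch


events = [{"type": "delivery", "id": n} for n in range(5)]  # made-up events
batches = list(make_batches(events, batch_size=2))

# Each batch would become the JSON body of one POST to the webhook endpoint.
body = json.dumps(batches[0])
```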
If your webhooks consumer endpoint responds to the POST request in a timely manner with an HTTP 200 response, then the ETL process will acknowledge and remove the batch of messages from RabbitMQ. If the batch fails to POST to your consumer for any reason (timeout, 500 server error, etc.), it will be added to a RabbitMQ delayed queue. This queue will hold the batch for a certain amount of time (we retry batches using an increasing backoff strategy based on how many times the batch has been attempted). After the holding time has elapsed, the ETL process will receive the already-processed batch to send to your endpoint again. This retry process is repeated until either your consumer has accepted the batch with a 200 response, or the maximum number of retries has been reached.
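The retry loop described above can be sketched roughly as follows. This is a minimal illustration, assuming a doubling delay with a cap; the base delay, cap, retry limit, and function names are all hypothetical and are not the values our ETL uses. The sketch records the hold times instead of actually waiting, where a real system would park the batch in a delayed queue.

```python
def backoff_delay(attempt, base_seconds=60, cap_seconds=3600):
    """Delay before retry number `attempt` (1-based): doubles each time, capped."""
    return min(cap_seconds, base_seconds * 2 ** (attempt - 1))


def deliver_with_retries(post_batch, batch, max_retries=5):
    """Try the first POST plus up to max_retries retries.

    post_batch is a caller-supplied function that returns an HTTP status code.
    Returns (delivered, delays), where delays lists the hold time applied
    before each retry.
    """
    delays = []
    for attempt in range(max_retries + 1):
        if attempt > 0:
            delays.append(backoff_delay(attempt))
        try:
            status = post_batch(batch)
        except Exception:
            status = None  # timeouts and connection errors count as failures
        if status == 200:
            return True, delays
    return False, delays


# A consumer that fails twice and then accepts the batch.
responses = iter([500, 503, 200])
delivered, delays = deliver_with_retries(lambda b: next(responses), ["event"])
```

In this run the batch is accepted on the third attempt, after hold times of 60 and 120 seconds.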
As each batch is attempted, the ETL also sends updates to the webhook API with status data about each batch. We keep track of the consumer’s failure code, number of retries and batch ID. If your webhook is having problems accepting batches, you can access this status data via the webhook API. You can also access it through the UI by clicking “View Details” in your webhook’s batch status row.
Webhooks are an extremely useful part of the SparkPost infrastructure stack. They allow customers to receive event-level metadata on all of their transmissions in a push model. While we’re operating on RabbitMQ today, we’re always looking at more modern cloud-based message queueing technologies, such as SQS, to see what can best help us meet our customers’ needs.
–Jason Sorensen, Lead Data Scientist
At SparkPost we send over 25% of all non-spam email, but how do we account for all of those messages? Humans! Our well-equipped team calculates over 40 metrics and gives you the ability to dissect them in any way. Well, that was our first stab at this capability, but we quickly realized it was too much work for them to do. Joking aside, scalable reporting is a challenging problem because we have to process millions of events per minute and ensure they can be queried in a timely manner. I hope to explain in some detail how we’ve solved this problem.
Sending email is great and we tell you we’re the best email service out there. However, if you’re like me, you need some hard evidence of this claim. You may not realize it but there’s a lot that happens when sending an email: Momentum (our email engine) receives a message (injection), the ISP receives it (delivery), the recipient’s mailbox is full (soft bounce), and so on. Seeing this at the 10,000-foot level is great for reporting back to your boss, but you’ll also need to diagnose problems by inspecting the lifecycle of one message. We have solutions for both. And the architectural decisions we made allow us to easily add different uses of this data. We implemented a strategy called ETL (extract, transform and load). This pattern is what powers Metrics, Message Events, Webhooks and Suppressions. I’ll walk you through this process starting with the Event Hose.
The event hose is responsible for keeping track of all the different events that can occur during the course of sending messages (injection, delivery, bounce, rejection, etc.). It logs these events in JSON, as they occur, to an exchange in RabbitMQ. Offloading the queuing of these events from Momentum provides a good separation and allows our operations team to scale this cluster of servers independently. If you are familiar with the pub-sub model, the event hose acts as the publisher. However, without subscribers, these messages would float into the ether. What are our subscribers and how do they work, you ask? Read on!
The Metrics ETL is the first of a collection of subscribers in this stack. It is a clustered Node.js process that binds to a RabbitMQ queue. This process receives messages from the queue as they are emitted and batches them up by transforming the data to adhere to the schema within a database called Vertica.
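The transform step can be pictured as flattening each JSON event into a row whose columns match a fixed table schema. The sketch below is in Python for brevity (the real process is Node.js), and the column names and sample events are hypothetical, not the actual Vertica schema.

```python
COLUMNS = ("timestamp", "type", "recipient_domain", "campaign_id")  # hypothetical schema


def to_row(event):
    """Flatten one JSON event into a tuple ordered like COLUMNS.

    Fields missing from the event become None; extra fields are dropped.
    """
    return tuple(event.get(column) for column in COLUMNS)


def to_rows(events):
    """Transform a batch of events into rows ready for a bulk load."""
    return [to_row(event) for event in events]


batch = [  # made-up events
    {"timestamp": 1460000000, "type": "delivery", "recipient_domain": "example.com",
     "campaign_id": "spring", "extra": "ignored"},
    {"timestamp": 1460000005, "type": "bounce", "recipient_domain": "example.org"},
]
rows = to_rows(batch)
```

Loading many rows at once, instead of one insert per event, is the usual reason to batch before writing to an analytic database.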
Message Events ETL
Like the Metrics ETL, this is a clustered Node.js process. However, it binds to a different queue and has its own independent control over processing the data. It also loads into Vertica but into a more flexible schema called a flex table. Think of it as MongoDB on steroids. As mentioned in the intro, there are other uses of this data that I will not get into today. Webhooks and suppressions each have their own processes and logic for handling this data.
We spent a great deal of time vetting analytic database solutions to fit our many different use cases. A big advantage of Vertica is projections, which are similar to materialized views. They provide the ability to store raw event data and model very complex queries for all the different ad-hoc drill-downs and groupings (domain, time series, sending pool, etc.). Lastly, Vertica is horizontally scalable and allows us to easily add new nodes as our load and data set increase.
HTTP API Layer
As the processes I described above load the data, users can retrieve it from several API endpoints. As explained in How SparkPost Built the Best Email API for Developers, we use RESTful web APIs. The Metrics API provides a variety of endpoints enabling you to retrieve a summary of the data, data grouped by a specific qualifier, or data by event type. This means I can see statistics about my emails at an aggregate level, by a time range, grouped by domain, subaccount, IP pool, sending domain, etc. The capabilities are extensive and our users love the different dissections of the data they can retrieve in almost real time. We retain this information for six months to allow for trending of the data over time. If you need the data longer than that, we encourage you to set up a webhook and load the data into your own business intelligence tools.
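As a hedged illustration of a grouped query, the sketch below only builds a request URL and does not send it. The host, the `/api/v1/metrics/deliverability` path, the `from` and `metrics` parameters, and the metric names are written from memory of the public API and should be treated as assumptions; check the current API reference before relying on them. The helper name `metrics_url` is hypothetical.

```python
from urllib.parse import urlencode


def metrics_url(base_url, group_by=None, **params):
    """Build a Metrics API query URL; group_by appends a qualifier such as 'domain'."""
    path = "/api/v1/metrics/deliverability"  # assumed path, verify before use
    if group_by:
        path += "/" + group_by
    return base_url.rstrip("/") + path + "?" + urlencode(params)


url = metrics_url(
    "https://api.sparkpost.com",  # assumed host
    group_by="domain",
    # "from" is a Python keyword, so the parameters are passed as a dict.
    **{"from": "2016-01-01T00:00", "metrics": "count_injected,count_delivered"},
)
```

A real request would also need your API key in an authorization header, which is left out here.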
The Message Events API allows a user to search the raw events logged by the event hose, as described above. These events are retained for 10 days, and the API is intended for more immediate debugging of your messages (push, email, etc.).
Web User Interface
We built SparkPost for the developer first, but we understand that not all of our users are technical. We provide a Reports UI that allows a user to drill down by many different facets, such as recipient domain, sending domain, IP pool and campaign. It is built using the same APIs mentioned above.
I hope this shed some light on how SparkPost processes the large amount of data and makes it available to you. We’re also currently working on re-architecting everything I just talked about. We’ve learned a lot over the first 18 months of SparkPost, especially managing many different tiers of our own infrastructure. I’ve personally spent many hours triaging and fighting fires around RabbitMQ and Vertica. We have decided to leverage a service-based message queue in SQS, and are starting to investigate service-based alternatives to Vertica. I plan on writing a follow-up to this later in the year, so stay tuned!
Our knowledgeable staff also uses this data to ensure you’re making the best decisions when sending your messages with SparkPost. I encourage you to start using these APIs and the WebUI to start digging into how your messages are performing. It can also be crucial if you get stuck in one of those Lumbergh moments and have to provide an email report to your boss by 5pm on a Friday, or need to dig into why an ISP is bouncing your email. We’ve also seen great uses of our APIs in hackathon projects. So get creative and let us help you build something awesome.
–Bob Evans, Director of Engineering
What’s in Our Technical Operations Stack?
To follow up on our Email Infrastructure Stack post, here’s some insight into what we use to provision, manage, and monitor the systems underneath that email infrastructure: our technical operations stack.
Like any properly agile development shop, we iterate a lot and our technical operations stack is no exception. We’re on our third generation of the stack, and nothing is set in stone. We’ll review several technologies we’ve progressed through and where we’re at currently. All our previous technologies were good fits at the time, but they’re not necessarily the right ones to use forever. Odds are we’ll also migrate past most of our current tools as our needs evolve. However, this is where we’ve been and where we are today.
Provisioning Generation 1: Installing in the Cloud
SparkPost as a company has been around for 15 years, providing the on-premise Momentum MTA product. The first request our engineering and ops teams got for what would become SparkPost was essentially, “give us a Momentum installation in the cloud.” Our goal then, as now, was to enable developers to use Momentum to send their mail through a cloud product. However, we didn’t want to waste time making guesses about what the specific needs would be in the cloud. We wanted to get something out there, and improve on it as we went. So the first iteration was simply a basic installation of our on-premise product in the cloud. We also added application and API layers to allow for multiple cloud customers to make use of that email infrastructure.
To achieve this, we spun up a basic set of instances in Amazon’s cloud. Our needs were basic and AWS’s API and CLI interfaces are fairly robust. Therefore, we automated our provisioning with basic bash scripts to create subnets, instances, security groups, and other AWS objects as needed. This method works well for creating a few assets here and there. Unfortunately, it doesn’t take advantage of everything AWS offers and doesn’t scale across a large team managing many assets.
Provisioning Generation 2: Embracing the Cloud
As our user base grew and our needs evolved, we began to re-architect our deployments in a method more appropriate for the cloud. We separated database and other components into their own dedicated tiers and partitioned some portions of the install from others via distinct VPCs.
To support this, we needed a provisioning tool that could manage a full view of the environment and the objects inside of it. We evaluated several options here, including AWS CloudFormation, and ultimately settled on Terraform from HashiCorp. Similar to AWS CloudFormation and other tools, Terraform lets us define a full AWS environment, with all of its dependent objects, and then apply that configuration to have them all created and connected as defined. We use it to create entire VPCs, complete with peering connections, network routes, security groups, and instances.
Provisioning Generation 3: Living the Cloud
Two years into the life of SparkPost as a public cloud offering, we have enough understanding of our common use cases and expected growth to more fully embrace other cloud technologies to handle provisioning for us.
Our ops and engineering teams have put together a number of Ansible playbooks and Python scripts to further automate provisioning of AWS assets and software deployments. Through these automations we have decreased our time to deploy new assets and upgrade existing ones.
We are further separating our product components into dedicated tiers and configuring those individual tiers as AWS auto scaling groups. We are using Packer, also by HashiCorp, to further standardize our machine images for each tier. AWS CloudWatch monitoring of the AWS SQS queues connected to our mail flows, along with other key traffic indicators, drives the auto scaling of clusters. Once complete, this model will allow our deployment footprint to grow and shrink for each part of our technical operations stack as required by current workload demand.
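The arithmetic behind queue-driven scaling can be sketched as a small function. This only illustrates the idea; in practice the decision is made by AWS auto scaling policies reacting to CloudWatch alarms, and the per-instance capacity and size limits below are invented numbers.

```python
import math


def desired_instances(queue_depth, per_instance=1000, minimum=2, maximum=20):
    """Instances needed so that each handles at most per_instance queued messages.

    The result is clamped to [minimum, maximum], mirroring the size limits
    an auto scaling group would enforce. All defaults are hypothetical.
    """
    needed = math.ceil(queue_depth / per_instance)
    return max(minimum, min(maximum, needed))


quiet = desired_instances(0)          # never drops below the minimum
busy = desired_instances(4500)        # grows with the backlog
flood = desired_instances(1_000_000)  # never exceeds the maximum
```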
Of course, provisioning of instances is only the first step. Once we launch them, we need to apply configuration, install applications, and handle maintenance.
Since our original launch, we have been primarily using Puppet for automated configuration management. Puppet is one of the best options around for maintaining configuration across a large collection of instances without the kinds of inconsistencies that can otherwise become common when several people are working on them. Also, the ability to insist on specific versions of packages means we know every instance will be kept in sync with the rest and not drift behind the state of the art. Between Puppet’s Facter tool and Ruby templating engine, there’s very little dynamic configuration that can’t be readily handled.
As reliable as Puppet is at maintaining a given state, it’s not necessarily the best tool for everything related to managing our instances. For changes we want to make at a specific point in time across an entire cluster with a high degree of control, we’ve found Ansible to be a much better fit. Over time, more of our deployments, upgrades, and even our creation and management of AWS objects have moved under Ansible. (We’ll be discussing that in a future article dedicated to our deployment model.)
Once instances are up and running with the right software and configuration, the real ops work starts. We need to monitor services and queues, and handle any potential impacts, preferably before they actually cause an issue.
We started out with Nagios and Check_MK, but in an effort to more fully embrace cloud technologies, we transitioned to Circonus to monitor our services. We actively monitor tens of thousands of data points in our cloud environment to ensure that our infrastructure and software is performing up to our standards. Circonus allows us to access all of our raw data for analysis. The raw data is invaluable in troubleshooting, post-mortem, and future planning activities.
Our on-call personnel receive alerts through OpsGenie when our monitoring detects a potential problem in our environment. They can then quickly assess the situation through the monitoring and AWS tools, and take action to start remediation and repair. In less than 30 minutes, a team spanning multiple time zones will be on RingCentral and working towards a resolution.
Improvements to monitoring, data analysis, and remediation of issues are an ongoing effort to enhance our visibility into our environment and reduce the potential for a customer-impacting event. Automated remediation is the future of monitoring. We are currently exploring AWS SNS and Lambda as potential tools to help us realize that vision.
Hosting email in the cloud presents unique challenges. It’s inherently a push protocol, not a pull protocol. Subscribers who receive messages from us asked to receive them. Yet, unlike a video streaming service or news site they don’t ask for and then receive content immediately. They subscribe, then continue to receive messages from that subscription until they unsubscribe. If something goes wrong and they don’t receive a message, they may not immediately realize there’s a problem. Combined with the need to ensure users are only receiving email they requested and the related anti-spam and anti-phishing technologies, we must be constantly aware of the state of our delivery flows to a variety of service providers and the reputations of each of our IPs and customer domains with each of those providers. This is in addition to all the normal monitoring that our services are up, functional, and attempting to deliver email.
Fortunately, email was designed to be a robust protocol. While temporary delays and deferrals in delivery are not ideal, they are not the end of the world. And of course, we designed Momentum to deliver email, and that’s where it shines. Detailing everything Momentum does internally to handle fault tolerance and making sure email is delivered is well beyond the scope of this article. Our ops team knows we can rely on it as long as it’s running, patched, and properly configured.
Running in the cloud, we have to deal with additional abstractions on our network stack. We also can’t rely on the same kind of direct access to our external IP address bindings we would have in an on-premise installation. We rely heavily on AWS’ Elastic Load Balancer technology and a Momentum outgoing proxy module developed by our engineering team specifically to address these challenges.
We also use ELBs to manage load balancing and failover for our other tiers, both externally-facing and internal-only services. Keeping the services available to our customers hasn’t been a major problem for SparkPost, especially because we combine the auto scaling discussed previously with an operations staff that is always on the ball. We currently deliver billions of messages a month. Despite that, we have rarely had an outage or service degradation that resulted in the loss of even a single message.
Each of these topics has plenty more detail to it. The bottom line is that our ops team loves to deliver your email. Moreover, we’re excited to make that more reliable and cost effective with each generation of new cloud technology. For more information on our technical operations stack, or other questions, check out our DevHub. Also feel free to hit us up on Slack or Twitter if you’d like to chat.
–Jeremy & Nick