At SparkPost we send over 25% of all non-spam email, but how do we account for all of those messages? Humans! Our well-equipped team calculates over 40 metrics and gives you the ability to dissect it in any way. Well, that was our first stab at this capability but we quickly realized it was too much work for them to do. Joking aside, scalable reporting is a challenging problem because we have to process millions of events per minute and ensure they can be queried within a timely manner. I hope to explain in some detail how we’ve solved this problem.
Sending email is great and we tell you we’re the best email service out there. However, if you’re like me, you need some hard evidence of this claim. You may not realize it but there’s a lot that happens when sending an email: Momentum (our email engine) receives a message (injection), the ISP receives it (delivery), the recipient’s mailbox is full (soft bounce), and so on. Seeing this at the 10,000-foot glance is great for reporting back to your boss, but you’ll also need to diagnose problems by inspecting the lifecycle of one message. We have solutions for both. And the architectural designs we made allow us to easily add different uses of this data. We implemented a strategy called ETL (extract, transform and load). This pattern is what powers Metrics, Message Events, Webhooks and Suppressions. I’ll walk you through this process starting with the Event Hose.
The event hose is responsible for keeping track of all the different events that can occur during the course of sending messages (injection, delivery, bounce, rejection, etc.). It logs these events as they occur in JSON to an exchange in RabbitMQ. By offloading the queuing of these events from Momentum, it provides a good separation and allows our operations team to scale this cluster of servers independently. If you are familiar with the pub-sub model, the event hose acts as the publisher. However, without subscribers, these messages would float into the ether. What are our subscribers and how do they work you ask? Read on!
The Metrics ETL is the first of a collection of subscribers in this stack. It is a clustered Node.js process that binds to a RabbitMQ queue. This process receives messages from the queue as they are emitted and batches them up by transforming the data to adhere to the schema within a database called Vertica.
Message Events ETL
Like the Metrics ETL this is a clustered Node.js process. However, it binds to a different queue and has its own independent control over processing the data. It also loads into Vertica but into a more flexible schema called a flex table. Think of it as MongoDB on steroids. As mentioned in the intro, there are other uses of this data that I will not get into today. If you use webhooks or suppressions, it has different processes and logic to process this data.
We spent a great deal of time vetting analytic database solutions to fit our many different use cases. The big advantage are projections, similar to a materialized views, which provide the ability to store raw event data and model very complex queries for all the different ad-hoc drill downs and groupings (domain, time series, sending pool, etc). Lastly, it is horizontally scalable and allows us to easily add new nodes as our load and data set increases.
HTTP API Layer
As the processes I described above loads the data, users can retrieve data from several API endpoints– we use RESTful web APIs. The Metrics API provides a variety of endpoints enabling you to retrieve a summary of the data, data grouped by a specific qualifier, or data by event type. This means I can see statistics about my emails at an aggregate level, by a time range, grouped by domain, subaccount, IP pool, sending domain, etc. The capabilities are extensive and our users love the different dissections of the data they can retrieve in almost real-time. We retain this information for six month to allow for trending of the data over time. If you need the data longer than that, we encourage you to set up a webhook and load the data into their respective business intelligence tools.
The Message Events API allows a user to search on the raw events the event hose logs above. The retention of these events is 10 days, and is intended for more immediate debugging of your messages (push, email, etc).
Web User Interface
We built SparkPost for the developer first but we understand that not all of our users are technical. We provide a Reports UI that allows a user to drill down by many different facets like: a recipient domain, sending domain, ip pool and campaign. It is built using the same APIs mentioned above.
I hope this shed some light on how SparkPost processes the large amount of data and makes it available to you. We’re also currently working on re-architecting everything I just talked about. We’ve learned a lot over the first 18 months of SparkPost, especially managing many different tiers of our own infrastructure. I’ve personally spent many hours triaging and fighting fires around RabbitMQ and Vertica. We have decided to leverage a service based message queue in SQS, and are starting to investigate service-based alternatives to Vertica. I plan on writing a follow up to this later in the year, so stay tuned!
Our knowledgeable staff also uses this data to ensure you’re making the best decisions when sending your messages with SparkPost. I encourage you to start using these APIs and the WebUI to start digging into how your messages are performing. It can also be crucial if you get stuck in one of those Lumbergh moments and have to provide an email report to your boss by 5pm on a Friday, or need to dig into why an ISP is bouncing your email. We’ve also seen great uses of our APIs in hackathon projects. So get creative and let us help you build something awesome.
–Bob Evans, Director of Engineering