There are many ways to obtain metadata about your transmissions sent via SparkPost. We built a robust reporting system with over 40 different metrics to help you optimize your email deliverability. At first, we attempted to send metadata to our customers via carrier pigeons to meet customer demand for a push-based event system. We soon discovered that the JSON the birds delivered was not as clean as customers wanted. That’s when we decided to build a scalable Webhooks infrastructure using more modern technologies.
Like our reporting, the webhook infrastructure at SparkPost begins with what we call our Event Hose. This piece of the Momentum platform generates the raw JSON data that will eventually reach your webhook endpoint. As Bob detailed in his Reporting blogpost, after every message generation, bounce event, delivery, etc., Momentum logs a robust JSON object describing every quantifiable detail (we found unquantifiable details didn’t fit into the JSON format very well) of the event that occurred.
Each of these JSON event payloads are loaded into an amqp-based RabbitMQ exchange. This exchange will fan the messages out to the desired queue, including the queue which will hold your webhooks traffic. We currently use RabbitMQ as a key part of our application’s infrastructure stack to queue and reliably deliver message. We use a persistent queue to ensure that RabbitMQ holds each message until it’s delivered to your consumer. In addition, the system we’ve built is ready to handle failures, downtime, and retries.
Between RabbitMQ and your consumer, we have an ETL process that will create batches of these JSON events for each webhook you have created. We believe in the “eat your own dogfood” philosophy for our infrastructure. So our webhooks ETL process will call out to our public webhooks API to find out where to send your batches. Additional headers or authentication data may be added to the POST request. Then the batch is on its way to your consumer.
If your webhooks consumer endpoint responds to the POST request in a timely manner with an HTTP 200 response, then the ETL process will acknowledge and remove the batch of messages from RabbitMQ. If the batch fails to POST to your consumer for any reason (Timeout, 500 server error, etc), it will be added to a RabbitMQ delayed queue. This queue will hold the batch for a certain amount of time (we retry batches using an increasing backoff strategy based on how many times it has been attempted). After the holding time has elapsed, the ETL process will receive the already-processed batch to send to your endpoint again. This retry process is repeated until either your consumer has accepted the batch with a 200 response, or the maximum number of retries has been reached.
As each batch is attempted, the ETL also sends updates to the webhook API with status data about each batch. We keep track of the consumer’s failure code, number of retries and batch ID. If your webhook is having problems accepting batches, you can access this status data via the webhook API. You can also access it through the UI by clicking “View Details” in your webhook’s batch status row.
Webhooks are an extremely useful part of the SparkPost infrastructure stack. They allow customers to receive event-level metadata on all of their transmissions in a push model. While we’re operating on RabbitMQ today, we’re always looking at more modern cloud-based message queueing technologies, such as SQS, to see what can best help us meet our customers’ needs.
–Jason Sorensen, Lead Data Scientist