How can Webhooks be easier, and searching event data (AKA Message Events) maybe even greater? We’ll try to answer in this post and open source some code along the way.

Shouting “Show me the data!” will earn you funny looks from most people, but not from us here at SparkPost. We are all about the data, both internally as we decide what to build, and externally when we’re delivering event data to you via Webhooks or Message Events.

Tom Cruise may actually want to see the money, but for our customers, data is king. Many of them make heavy use of our Webhooks (push model) to receive batches of event data via HTTP POST. Others prefer to use our Message Events endpoint, which is a pull model – you’re querying the same events, although data retention is limited to 10 days, as of this writing.

Now I don’t know about you, but whenever I hear that something is limited, the first thing I want to do is find a way around that limitation. The second thing is to show other people how I did it. In this post, I’m going to show you how to bypass our Message Events data retention limit by rolling your own low-cost queryable event database.

Building Blocks of a Service

The vision here is to ingest batches of event data, delivered by SparkPost’s Webhooks, and then be able to query that data, ideally for free. At least for cheap. Luckily, there are published best practices for doing the first part. One way to keep costs down (at least initially) is to use the AWS free tier, which is the way we’ll go in this post.

First, I’ll walk through the services I ended up using, and then briefly discuss what else I tried along the way, and why that didn’t make the cut. Almost everything in this system is defined and deployed using CloudFormation, along with pieces from the AWS Serverless Application Model (SAM). Under the hood, this uses API Gateway as an HTTP listener, and Node.js Lambda functions to “do stuff” when requests are received or in response to other interesting events. More on that later.

According to the best practices linked above, we need to return 200 OK ASAP, before doing any processing of the request body, where the event data is. So we’ll run a Lambda to extract the event data and batch id from the HTTP request and save it to S3. At this point, we’re capturing the data but can’t-do a whole lot with it just yet.

Databases and Event Data

There are all sorts of options out there when it comes to databases. I chose RDS PostgreSQL since it’s a (somewhat) managed service that’s eligible for the AWS free tier. Also, I’m already familiar with it, and had some automatic partitioning code lying around that would be better as open source.

Now seems like a good time to talk about what didn’t make the cut, especially since there were so many interesting options to choose from. The first database-y thing I considered was Athena, which would let us query directly against S3. Right out of the gate, unfortunately, there’s a snag: Athena isn’t eligible for the free tier, it’s priced based on the amount of data scanned by each query. We get a raw JSON feed from the Webhook, so optimizing the storage of that data to be cost-effective to the query would be its own project.

Another database I didn’t use is Dynamo, which would have been super convenient since AWS SAM bakes in support for it. Event data in combination with the types of queries the system needed to support isn’t a great fit for Dynamo though since it doesn’t allow the number of secondary indexes we’d need in order to efficiently support the wide range of queries that Message Events provides. Dynamo would definitely have been the low-stress option. Using RDS meant I had to poke around a bit more in AWS networking land than I had planned to.

Connecting the Data Dots

Our event data is stored in S3, and we’ve chosen a database. Triggers aren’t just for databases, thankfully, and S3 lets you configure Lambda functions to run for various types of events. We’ll fire our next Lambda when a file is created in the bucket that our Webhook listener writes to. It’ll read the batch of event data, and load it into our database, which closes the loop. We’re now asynchronously loading event data sent via Webhook into our database.

The only missing piece now is a way to search for specific types of events. We can implement this using AWS SAM as well, which gives us some nice shortcuts. This last Lambda is essentially a translator between query parameters and SQL. There are quite a few options for query builders in Node, and I picked Squel.js, which was a good balance between simplicity, dependencies, and features.

This system now achieves what it set out to – we’re storing event data provided via Webhook, following best practices, and can query the data using a familiar interface. And if you need to, it’s straightforward to customize by updating the query_events Lambda to add new ways to pull out the data you need, and indexes can be added to the database to make those custom queries faster.

Why Tho, and What Next?

SparkPost sends a lot of data along with our events. For example, transmission metadata lets our customers include things like their own internal user id with each email. Event data such as opens and clicks will now include that user id, making it easier to tie things together.

Because every customer uses features like metadata differently, it’s nigh impossible for us to give everyone exactly the type of search options they’d like. Running your own event database means you’re free to implement custom search parameters. Many of our larger customers already have systems like this, whether it’s a third party tool or something they built themselves. This project aims to lower the barriers to entry, so anyone with a moderate level of familiarity with AWS and the command line can operate their own event database more easily.

There are a few things I’d like to do next, for example, setting up authentication on the various endpoints, since as things are now, they’re open to the public. I discuss a solution to this in the repo, since exposing your customer’s email addresses to the public is a no-no.

I’d also like to perform some volume testing on this system. The free tier RDS database in this setup has 20GB of storage, I’m curious to see how quickly that would fill up. It would also be nice to complete the CloudFormation conversion. Currently, the database is managed separately from the CF stack, and creating the required tables and stored procedures requires punching a hole through the firewall, er, security group. It would be nice to standardize and automate that step as well, instead of requiring mouse clicks in the AWS console.

Thanks for reading! Give us a shout on Twitter, and star, fork or submit a PR on Github if you enjoyed the post. We’d love to hear about what you build!

– Dave Gray, Principal Software Engineer


We love it when developers use SparkPost webhooks to build awesome responsive services. Webhooks are great when you need real-time feedback on what your customers are doing with their messages. They work on a “push” model – you create a microservice to handle the event stream.

Did you know that SparkPost also supports a “pull” model Message Events API that enables you to download your event data for up to ten days afterwards? This can be particularly useful in situations such as:

  • You’re finding it difficult to create and maintain a production-ready microservice. For example, your corporate IT policy might make it difficult for you to have open ports permanently listening;
  • You’re familiar with batch type operations and running periodic workloads, so you don’t need real-time message events;
  • You’re a convinced webhooks fan, but you’re investigating issues with your almost-working webhooks receiver microservice, and want a reference copy of those events to compare.

If this sounds like your situation, you’re in the right place! Now let’s walk through setting up a really simple tool to get those events.

Design goals

Let’s start by setting out the requirements for this project, then translate them into design goals for the tool:

  • You want it easy to customize without programming.
  • SparkPost events are a rich source of data, but some event-types and event properties might not be relevant to you. Being selective gives smaller output file sizes, which is a good thing, right?
  • Speaking of output files, you want event data in the commonly-used csv  file format. While programmers love JSON, CSV is easier for non-technical users (and results in smaller files).
  • You want to set up your SparkPost account credentials and other basic information once and once only, without having to redo them each time it’s used. Having to remember that stuff is boring.
  • You need flexibility on the event date/time ranges of interest.
  • You want to set up your local time-zone once, and then work in that zone, not converting values manually to UTC time. Of course, if you really want to work in UTC, because your other server logs are all UTC, then “make it so.”
  • Provide some meaningful comfort reporting on your screen. Extracting millions of events could take some time to run. I want to know it’s working.

Events, dear programmer, events …

Firstly, you’ll need Python 3 and git installed and working on your system.  For Linux, a simple procedure can be found in our previous blog post. It’s really this easy:

For other platforms, this is a good starting point to get the latest Python download; there are many good tutorials out there on how to install.

Then get the sparkyEvents code from Github using:

We’re the knights who say “.ini”

Set up a sparkpost.ini  file as per the example in the Github README file here.

Replace <YOUR API KEY> with a shrubbery your specific, private API key.

Host is only needed for SparkPost Enterprise service usage; you can omit for

Events is a list, as per SparkPost Event Types; omit the line, or assign it blank, to select all event types.

Properties can be any of the SparkPost Event Properties. Definitions can split over lines using indentation, as per Python .ini file structure, which is handy as there are nearly sixty different properties. You can select just those properties you want, rather than everything; this keeps the output file to just the information you want.

Timezone can be configured to suit your locale. It’s used by SparkPost to interpret the event time range from_time and to_time that you give in command-line parameters. If you leave this blank, SparkPost will default to using UTC.

If you run the tool without any command-line parameters, it prints usage:

from_time and to_time are inclusive, so for example if you want a full day of events, use time T00:00 to T23:59.

Here’s a typical run of the tool, extracting just over 18 million events. This run took a little over two hours to complete.

That’s it! You’re ready to use the tool now. Want to take a peek inside the code? Keep reading!

Inside the code

Getting events via the SparkPost API

The SparkPost Python library doesn’t yet have built-in support for the message-events endpoint. In practice the Python requests library is all we need. It provides inbuilt abstractions for handling JSON data, response status codes etc and is generally a thing of beauty.

One thing we need to take care of here is that the message-events endpoint is rate-limited. If we make too many requests, SparkPost replies with a 429 response code. We play nicely using the following function, which sleeps for a set time, then retries:

Practically, when using event batches of 10000 I didn’t experience any rate-limiting responses even on a fast client. I had to deliberately set smaller batch sizes during testing, so you may not see rate-limiting occur for you in practice.

Selecting the Event Properties

SparkPost’s events have nearly sixty possible properties. Users may not want all of them, so let’s select those via the sparkpost.ini file. As with other Python projects, the excellent ConfigParser library does most of the work here. It supports a nice multi-line feature:

“Values can also span multiple lines, as long as they are indented deeper than the first line of the value.”

We can read the properties (applying a sensible default if it’s absent), remove any newline or carriage-return characters, and convert to a Python list in just three lines:

Writing to file

The Python csv library enables us to create the output file, complete with the required header row field names, based on the fList we’ve just read:

Using the DictWriter class, data is automatically matched to the field names in the output file, and written in the expected order on each line. restval="  ensures we emit blanks for absent data, since not all events have every property. extrasaction=ignore ensures that we skip extra data we don’t want.

That’s pretty much everything of note. The tool is less than 150 lines of actual code.

You’re the Master of Events!

So that’s it! You can now download squillions of events from SparkPost, and can customize the output files you’re getting. You’re now the master of events!

—Steve Tuck, Senior Messaging Engineer

Big Rewards Blog Footer

SP_UseCases_ThumbnailThis past week, SparkPost released its long-anticipated enhancement to our API: individual recipient-level Message Events. While it has been possible to get individual recipient-level data via webhooks since we launched back in April, this feature provides users with an additional mechanism to get message events data. This new API endpoint provides our customers a way to pull data – on demand – for use in internal reporting, triggering message generation, or other programmatic processing of event data.

Some highlights:

* You can get all events or specific events, such as bounces, deliveries, or clicks

* You can filter the data by date range, subject line, campaign, or just about any other field

* Message events data is available for 10 days. Of course, rolled up data is available via our metrics endpoints or the UI reports for much longer.

So what kinds of can you do with the message events?

For transactional messages:

Let’s look at a customer service example. A customer calls your support line because they didn’t receive an important notification, such as an alert or a receipt or a password reset. You can build a simple UI that allows your customer service rep to search for an individual recipient to determine if he received the message. The API call would look something like this:

curl -XGET -H ‘Authorization: <API_KEY_HERE>’

If the recipient received the message — and maybe even opened it, the customer service rep can provide a time/date to help them find it. Inboxes are pretty full these days — it’s easy to lose messages.

For marketing messages:

Use the data to build a customer profile database — who’s engaged (opened / clicked within 90 days), who isn’t, and use that to drive additional messaging. You could further break it down by campaign — are certain users more responsive to certain campaigns than others? This is the foundation for segmentation analysis. 

Here’s a sample call to get opened or clicked within a specific day in the Central Time Zone (because, why not?)

curl -XGET -H ‘Authorization: <your key>,click&from=2015-09-02T00:00&to=2015-09-02T23:59&timezone=America/Chicago.

Take it even further and tie it to existing data you already have about the recipient, such as purchase history or website data, for a richer profile. 

How will you use message events? Tell us in the comments.

-By Irina Doliov, Cloud Queen
Follow on Twitter: @idoliov

New API Feature_Message EventsWe just released a new feature in our API: Message Events, which allows users to pull down recipient-level data. There has been a lot of demand for such an API, and I’m very proud to say implementing this was not as hard as you may think. Why? Because SparkPost is built using a microservice architecture and we are continually building additional microservices.  All of the event data is written to a message queue, and we create Node.js ETL processes that interface with a specific queue and implement specific business logic.  The advantage of this is we can reuse our event stream data for many different services: Metrics, Message Events, Event Webhooks, and Suppression Lists.

It was also very exciting to use a relatively new feature in Vertica called Flex Tables.  Think of it as MongoDB, but an enterprise quality version.  It did have to be tuned to support such high throughput, but since we had previous knowledge of Vertica for the Metrics API, it was straightforward and attainable in a few days. 

Also, we have built our first API that supports pagination. We are leveraging HATEOS (Hypertext As The Engine Of Application State) to build this feature. The intention is you can build applications that can chain REST calls. In the case of pagination, by default we return the most recent 1000 results in a given page. You can use the query parameter per_page to change this to a maximum of 10,000.  The data that is returned from the API will give you the results as well as total_count and links:

It allows you to programatically page through results by using the rel of first, previous, next, last. As you make subsequent requests the links returned in the response payload will dynamically change:

The primary use case is you can build a Web UI that can support dynamic paging, keeping the hard part of tracking number of pages/results out of your application and simply show/hide elements based on what is returned in the response payload.  At SparkPost we will leverage this in the future by building out our own Web UI.

One internal debate we had was the best query parameters to use: per_page/page and limit/offset.  Sure, limit/offset is the implementation behind the scenes but we wanted to provide a better user experience by translating limit/offset to more human readable/consumable names.

With the addition of Message Events to our API, we’ve added the power of searching for message event data based on recipients, campaigns, and more. Take a look at the documentation and as always, let us know what you think at

Learn more about microservices, Node.js,  ETL and HATEOAS here:,_transform,_load

-Bob Evans, Manager, Application Engineering