How can Webhooks be easier to work with, and searching event data (AKA Message Events) even better? We’ll try to answer that in this post and open source some code along the way.

Shouting “Show me the data!” will earn you funny looks from most people, but not from us here at SparkPost. We are all about the data, both internally as we decide what to build, and externally when we’re delivering event data to you via Webhooks or Message Events.

Tom Cruise may actually want to see the money, but for our customers, data is king. Many of them make heavy use of our Webhooks (push model) to receive batches of event data via HTTP POST. Others prefer to use our Message Events endpoint, which is a pull model – you’re querying the same events, although data retention is limited to 10 days, as of this writing.

Now I don’t know about you, but whenever I hear that something is limited, the first thing I want to do is find a way around that limitation. The second thing is to show other people how I did it. In this post, I’m going to show you how to bypass our Message Events data retention limit by rolling your own low-cost queryable event database.

Building Blocks of a Service

The vision here is to ingest batches of event data, delivered by SparkPost’s Webhooks, and then be able to query that data, ideally for free. At least for cheap. Luckily, there are published best practices for doing the first part. One way to keep costs down (at least initially) is to use the AWS free tier, which is the way we’ll go in this post.

First, I’ll walk through the services I ended up using, and then briefly discuss what else I tried along the way, and why that didn’t make the cut. Almost everything in this system is defined and deployed using CloudFormation, along with pieces from the AWS Serverless Application Model (SAM). Under the hood, this uses API Gateway as an HTTP listener, and Node.js Lambda functions to “do stuff” when requests are received or in response to other interesting events. More on that later.

According to the best practices linked above, we need to return 200 OK as quickly as possible, before doing any processing of the request body, which is where the event data lives. So we’ll run a Lambda to extract the event data and batch ID from the HTTP request and save it to S3. At this point, we’re capturing the data but can’t do a whole lot with it just yet.
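For concreteness, here’s a minimal sketch of what that listener Lambda could look like. It assumes the AWS SDK for Node.js, an EVENTS_BUCKET environment variable wired up by the CloudFormation template, and SparkPost’s batch ID header; the code in the repo is the real reference.

```javascript
// Minimal webhook listener sketch: stash the raw batch in S3 and get out fast.
const AWS = require('aws-sdk');
const s3 = new AWS.S3();

exports.handler = async (event) => {
  // SparkPost includes a batch ID header on each webhook POST; header casing
  // can vary behind API Gateway, so check both and fall back to a timestamp.
  const headers = event.headers || {};
  const batchId = headers['X-MessageSystems-Batch-ID']
    || headers['x-messagesystems-batch-id']
    || `batch-${Date.now()}`;

  // Store the request body untouched; parsing happens later, asynchronously.
  await s3.putObject({
    Bucket: process.env.EVENTS_BUCKET, // assumed to be set by the CF template
    Key: `incoming/${batchId}.json`,
    Body: event.body,
    ContentType: 'application/json'
  }).promise();

  // Respond 200 right away so SparkPost marks the batch as delivered.
  return { statusCode: 200, body: '{"ok":true}' };
};
```

Keeping this handler as thin as possible is what lets us answer quickly; all of the interesting work happens downstream.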

Databases and Event Data

There are all sorts of options out there when it comes to databases. I chose RDS PostgreSQL since it’s a (somewhat) managed service that’s eligible for the AWS free tier. Also, I’m already familiar with it, and had some automatic partitioning code lying around that would be better as open source.

Now seems like a good time to talk about what didn’t make the cut, especially since there were so many interesting options to choose from. The first database-y thing I considered was Athena, which would let us query directly against S3. Right out of the gate, unfortunately, there’s a snag: Athena isn’t eligible for the free tier, since it’s priced based on the amount of data scanned by each query. We get a raw JSON feed from the Webhook, so optimizing how that data is stored to make queries cost-effective would be a project of its own.

Another database I didn’t use is Dynamo, which would have been super convenient since AWS SAM bakes in support for it. Event data isn’t a great fit for Dynamo, though: it doesn’t allow enough secondary indexes to efficiently support the wide range of queries that Message Events provides. Dynamo would definitely have been the low-stress option. Using RDS meant I had to poke around a bit more in AWS networking land than I had planned to.

Connecting the Data Dots

Our event data is stored in S3, and we’ve chosen a database. Triggers aren’t just for databases, thankfully, and S3 lets you configure Lambda functions to run for various types of events. We’ll fire our next Lambda when a file is created in the bucket that our Webhook listener writes to. It’ll read the batch of event data, and load it into our database, which closes the loop. We’re now asynchronously loading event data sent via Webhook into our database.
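Sketched out, that loader might look something like the following. The table and column names are illustrative rather than the project’s actual schema, and the pg client is assumed to pick up its connection details from the usual PG* environment variables.

```javascript
// Loader sketch: fires on s3:ObjectCreated, reads the batch, inserts rows.
const AWS = require('aws-sdk');
const { Client } = require('pg');
const s3 = new AWS.S3();

exports.handler = async (event) => {
  // The S3 notification tells us which object was just written.
  const { bucket, object } = event.Records[0].s3;
  const obj = await s3.getObject({
    Bucket: bucket.name,
    Key: decodeURIComponent(object.key.replace(/\+/g, ' '))
  }).promise();

  // A webhook batch is a JSON array of wrappers like {"msys": {"message_event": {...}}}.
  const batch = JSON.parse(obj.Body.toString('utf8'));

  const client = new Client(); // connection details from PG* env vars (assumed)
  await client.connect();
  try {
    for (const wrapper of batch) {
      const evt = wrapper.msys[Object.keys(wrapper.msys)[0]];
      // Illustrative schema: one row per event, with the raw JSON kept alongside.
      await client.query(
        'INSERT INTO events (event_id, type, timestamp, raw) VALUES ($1, $2, to_timestamp($3), $4)',
        [evt.event_id, evt.type, evt.timestamp, JSON.stringify(evt)]
      );
    }
  } finally {
    await client.end();
  }
};
```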

The only missing piece now is a way to search for specific types of events. We can implement this using AWS SAM as well, which gives us some nice shortcuts. This last Lambda is essentially a translator between query parameters and SQL. There are quite a few options for query builders in Node, and I picked Squel.js, which was a good balance between simplicity, dependencies, and features.
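As a rough sketch, translating query parameters into SQL with Squel might look like this. The parameter names, columns, and limit here are assumptions for illustration, not the exact interface the repo exposes.

```javascript
// query_events sketch: map whitelisted query string parameters to a SELECT.
const squel = require('squel').useFlavour('postgres');
const { Client } = require('pg');

exports.handler = async (event) => {
  const params = event.queryStringParameters || {};

  let query = squel.select()
    .from('events')
    .order('timestamp', false) // newest first
    .limit(100);

  // Only recognized parameters become WHERE clauses; everything else is ignored.
  if (params.events) {
    query = query.where('type IN ?', params.events.split(','));
  }
  if (params.from) {
    query = query.where('timestamp >= ?', params.from);
  }
  if (params.to) {
    query = query.where('timestamp <= ?', params.to);
  }

  // toParam() gives us parameterized SQL ($1, $2, ...) plus the bound values.
  const { text, values } = query.toParam();

  const client = new Client();
  await client.connect();
  const result = await client.query(text, values);
  await client.end();

  return { statusCode: 200, body: JSON.stringify({ results: result.rows }) };
};
```

Because Squel produces parameterized queries, values from the query string reach Postgres as bind parameters rather than being spliced into the SQL string.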

This system now achieves what it set out to do: we’re storing event data delivered via Webhook, following best practices, and can query that data using a familiar interface. If you need to customize it, it’s straightforward to update the query_events Lambda to add new ways to pull out the data you need, and to add indexes to the database to make those custom queries faster.

Why Tho, and What Next?

SparkPost sends a lot of data along with our events. For example, transmission metadata lets our customers include things like their own internal user id with each email. Event data such as opens and clicks will now include that user id, making it easier to tie things together.
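For example, a transmission that tags each recipient with an internal user ID might look roughly like this (the field values are made up):

```json
{
  "recipients": [
    {
      "address": { "email": "user@example.com" },
      "metadata": { "internal_user_id": "u-12345" }
    }
  ],
  "content": { "template_id": "welcome-email" }
}
```

The open and click events generated for that recipient carry the same metadata back with them, so joining engagement data to your own user records becomes a simple lookup.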

Because every customer uses features like metadata differently, it’s nigh impossible for us to give everyone exactly the type of search options they’d like. Running your own event database means you’re free to implement custom search parameters. Many of our larger customers already have systems like this, whether it’s a third party tool or something they built themselves. This project aims to lower the barriers to entry so that anyone with a moderate level of familiarity with AWS and the command line can run their own event database.

There are a few things I’d like to do next, starting with setting up authentication on the various endpoints, since as things stand they’re open to the public. I discuss a solution to this in the repo, because exposing your customers’ email addresses to the public is a no-no.

I’d also like to perform some volume testing on this system. The free tier RDS database in this setup has 20GB of storage, and I’m curious to see how quickly that would fill up. I’d also like to complete the CloudFormation conversion. Currently, the database is managed separately from the CF stack, and creating the required tables and stored procedures requires punching a hole through the firewall, er, security group. It would be nice to standardize and automate that step as well, instead of requiring mouse clicks in the AWS console.

Thanks for reading! Give us a shout on Twitter, and star, fork, or submit a PR on GitHub if you enjoyed the post. We’d love to hear about what you build!

– Dave Gray, Principal Software Engineer

 

Configuration Management and Provisioning

At SparkPost, our specific needs have changed over time, along with our understanding of the best tool for the job. Two areas where this has evolved significantly for us are configuration management and provisioning. After some trial and error, we eventually settled on CloudFormation and a mix of Puppet and Ansible. Hopefully what we learned from our experiences can help you select the right tools for your environment.

Puppet vs Ansible

A few years ago we were using Puppet for both OS and application configuration management. Over time we discovered some challenges with this. First of all, Puppet runs as a scheduled update, not on demand. Secondly, Puppet has no way to test a change and then roll back if there is a problem. And finally, Puppet is not very accessible to the developers at SparkPost and not well suited to their use cases: it’s an Ops tool.

We explored using Ansible for application-level configuration. While Puppet maintains state, Ansible is good at transforming state, so its behavior is fundamentally different from Puppet’s. We can run Ansible on demand, and we like its flow control, which lets us test and roll back immediately. Most importantly, our application developers can maintain Ansible playbooks in concert with code and database changes. Overall, it integrates very well with our Continuous Integration and Deployment pipeline.

Blurred Lines

Our next iteration was to gradually split the configuration management problem between the pieces managed by our Ops team with Puppet and pieces managed by the Development teams using Ansible. While this approach “worked”, we found it hard to understand what the actual running config would be, since Puppet would override some config stanzas managed by Ansible. These overlapping responsibilities between Puppet and Ansible were messy and error prone.

Puppet + Ansible

To resolve these problems, we decided to standardize by using Puppet only for the OS-level stuff: system tuning, mounted volumes, local users, authentication, etc. In contrast, we now use Ansible exclusively for all application software deployment and configuration. This greatly simplifies things, since we are now using each tool for the purpose it is intended for and well suited to. A key catalyst for this breakthrough was our organizational changes, specifically when we broke down the divisions between development and technical operations. Now there is a single Engineering organization at SparkPost, which has helped us overcome the organizational silos that had contributed to a suboptimal approach.

Terraform

Another area where we have evolved our use of tools is cloud provisioning. We first started in AWS by building out all of our provisioning scripts by hand. This worked well until we had to quickly scale out.

We needed better automation. We chose Terraform to provision resources in AWS since it is vendor agnostic. Terraform uses a layer of abstraction that was initially very attractive but over time became problematic for our use case. First of all, Terraform keeps local state files that need to be distributed or available every time a change is made. Second, Terraform lacked native AWS support at the time for some important aspects of our infrastructure, which required extra code to implement workarounds outside of Terraform. Needless to say, this negated many of the promised benefits of using Terraform. Finally, we simply did not need the complexity and layers of abstraction that come along with it.

AWS CloudFormation

We eventually went down the path of creating a Python tool suite that uses CloudFormation for AWS. By this time, CloudFormation was a very mature tool, and we wanted to simplify our provisioning by using built-in AWS CloudFormation features. We also started leveraging CloudFormation as the source of truth for our infrastructure. This approach resulted in significantly less code than we needed with Terraform. Additionally, we are able to handle provisioning failures with far less effort since we removed the extra abstraction layers and workarounds.

Learning to Keep It Simple

After experimenting with various configuration management and provisioning tools and approaches, we learned how important it is not to get too attached to any particular tool. We needed to be flexible and change our tool selection as our understanding of the specific use cases evolved. Planning too far ahead, and picking tools you might need in the future but have no practical need for in the near term, can also waste a lot of time. We also experienced first hand the pitfalls of letting organizational structure drive the selection and use of tools. These lessons helped us come up with a simpler, more modular approach that takes advantage of the technologies most aligned with our specific needs.

You can learn more about our DevOps journey, and if you have any questions about cloud provisioning and configuration management in AWS, don’t hesitate to connect on Twitter. Also, our SRE team is always hiring.

 

Chris McFadden
VP Engineering and Cloud Operations

Many thanks to John Peacock and Leonard Lawton on our SRE team for their input to this blog post.