This is My Code
It was an exciting experience to be part of the newly designed addition to the This is My Architecture series from AWS, which highlights specific code segments that use AWS integrations. The video interview was a great addition to our time at AWS re:Invent in Las Vegas last year.
Code snippets can be extremely hard to parse, especially without proper context. The code demonstrated in this video is part of a larger module file. While setting up the shoot, we had to reduce the number of lines displayed so that the code would be legible on the projection behind us.
To provide some additional context for this video: we have an ETL process that streams very verbose email delivery event data into a Redis data store. Storing this data lets us perform lookups when “out-of-band” events come in, such as email opens and clicks, and enrich these sparse events (essentially just a message ID) with the verbose data from the delivery event. This enables our customers to run rich analytic queries against events without having to do these complicated correlations on the data themselves.
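The enrichment step described above can be pictured with a minimal sketch. It is not the code from the video: a plain dictionary stands in for the Redis store, and the function name `enrich` and the field names (`message_id`, `type`, `recipient`) are assumptions made for illustration.

```python
def enrich(event, delivery_store):
    """Merge a sparse out-of-band event with its stored delivery event.

    `delivery_store` is any mapping from message ID to the verbose
    delivery record (a dict here, standing in for Redis).
    """
    delivery = delivery_store.get(event["message_id"])
    if delivery is None:
        # No delivery record available: pass the sparse event through unchanged.
        return dict(event)
    # Start from the verbose delivery data; the sparse event's own fields
    # (such as its type) take precedence where both define a key.
    return {**delivery, **event}
```

With a stored delivery record for a message, an open event carrying only the message ID and its type comes back with the recipient and any other delivery fields attached.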
As mentioned in the video, Redis is fast but not a cost-effective storage system for large amounts of data. Since email open and click events tend to happen within a short time window of the delivery event, we looked at which time windows would give the best trade-off between the cost of storing data in Redis and in S3, and found that keeping between 8 and 12 hours of data in Redis (with long-term storage in S3) worked best.
The Lambda-cron job highlighted in this video is responsible for moving the data from Redis to S3. Data is pulled out of Redis one minute at a time and stored to S3. To optimize S3 retrieval times, we enforce a maximum size for each S3 file and chunk each minute's worth of data accordingly. This data is then stored in S3 with file names that can be used to identify the file containing an event in a single lookup.
Briefly explained, we achieve this by sorting the events by ID in memory. Each chunk includes its maximum ID as part of the file name. When we look up an event by ID, we simply open the file whose maximum ID is the smallest one greater than or equal to our event ID. We found that doing an S3 list on the directory can take as long as the S3 GET operation for our data. We work around this by storing the S3 file structure in our Redis cluster, as the relative size of that data is minuscule.
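The chunking and lookup scheme can be sketched roughly as follows. This is an illustration under stated assumptions, not the module shown in the video: events are dictionaries with an `id` field, their size is measured as newline-delimited JSON, and the function names `chunk_events` and `find_chunk` are hypothetical. A sorted in-memory list stands in for the file index that the post keeps in Redis.

```python
import json
from bisect import bisect_left


def chunk_events(events, max_bytes):
    """Sort events by ID and split them into chunks of at most max_bytes.

    Size is estimated as newline-delimited JSON (an assumption). A single
    event larger than max_bytes still gets a chunk of its own.
    """
    chunks, current, current_size = [], [], 0
    for event in sorted(events, key=lambda e: e["id"]):
        size = len(json.dumps(event).encode("utf-8")) + 1  # +1 for the newline
        if current and current_size + size > max_bytes:
            chunks.append(current)
            current, current_size = [], 0
        current.append(event)
        current_size += size
    if current:
        chunks.append(current)
    return chunks


def find_chunk(sorted_max_ids, event_id):
    """Return the maximum ID naming the chunk that would hold event_id.

    `sorted_max_ids` holds the last (largest) ID of each chunk in ascending
    order. The match is the smallest entry >= event_id, or None if event_id
    is larger than every stored ID.
    """
    i = bisect_left(sorted_max_ids, event_id)
    return sorted_max_ids[i] if i < len(sorted_max_ids) else None
```

Because the chunks are built from sorted events, the last event in each chunk carries that chunk's maximum ID, so the index is simply `[chunk[-1]["id"] for chunk in chunks]`. A binary search over that list picks the file directly, which is what makes an S3 list call unnecessary.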
You can view the entire interview below.
Lead Data Scientist