SparkPost prides itself on having the best email deliverability team in the industry. A large part of our success comes from ensuring our team members have deep insight into our customers’ sending and deliverability, especially any unexpected changes in key customer metrics or in how ISPs handle a customer’s mail.
In support of this, the data science team has worked on various anomaly detection approaches to surface anomalies in our Premier and Enterprise customers’ sending behavior, acceptance rates at ISPs, and engagement rates at ISPs.
Acceptance Rate Anomalies
Typical approaches to anomaly detection look for observations that deviate from the rest of the data, often using a mean and standard deviation approach or the interquartile range (IQR) method. These approaches work well for data that is roughly Gaussian or, in the case of IQR, possibly skewed. They are less suited to data that is highly time-dependent and presents challenges around long-term trends (drift) and periodic patterns (seasonality).
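As a rough illustration of the IQR method mentioned above, here is a minimal Python sketch. The function name, the sample values, and the conventional multiplier of 1.5 are assumptions for the example, not details of our production system.

```python
from statistics import quantiles

def iqr_outliers(values, k=1.5):
    """Return the values lying outside [Q1 - k*IQR, Q3 + k*IQR].

    k=1.5 is the common textbook multiplier, used here as an assumption.
    """
    # quantiles(..., n=4) returns the three quartile cut points Q1, Q2, Q3
    q1, _, q3 = quantiles(values, n=4)
    iqr = q3 - q1
    lower, upper = q1 - k * iqr, q3 + k * iqr
    return [v for v in values if v < lower or v > upper]

# Illustrative data only: one value sits far from the rest
sample = [10, 12, 11, 13, 12, 11, 10, 12, 50]
flagged = iqr_outliers(sample)
```

Because the fences are built from quartiles rather than the mean, a single extreme value does not shift them much, which is why the method tolerates skewed data.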
Acceptance rates are a good example of data that works well with a typical anomaly detection approach. Anomalies are straightforward to detect because a customer’s acceptance rate typically stays consistent over time. When the rate falls more than three standard deviations from its historical mean, it can be flagged as an anomaly, as in the example in Figure 1.
Checking acceptance rates frequently across our large customer base also lets us see when anomalies appear at many customers for the same ISP at the same time, which indicates a larger global delivery issue.
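The three-standard-deviation check described above can be sketched in a few lines of Python. This is a minimal illustration: the function name and the sample rates are hypothetical, and how much history to use is left open.

```python
from statistics import mean, stdev

def is_acceptance_anomaly(history, current, n_sigmas=3.0):
    """Flag `current` if it lies more than n_sigmas standard deviations
    from the mean of the historical acceptance rates."""
    mu = mean(history)
    sigma = stdev(history)  # sample standard deviation; needs >= 2 points
    if sigma == 0:
        # Perfectly flat history: any different value counts as a deviation
        return current != mu
    return abs(current - mu) > n_sigmas * sigma

# Illustrative rates only
history = [0.98, 0.97, 0.99, 0.98, 0.975, 0.985]
dropped = is_acceptance_anomaly(history, 0.80)
steady = is_acceptance_anomaly(history, 0.98)
```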
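One way to picture the cross-customer check is to group the anomalies from a single evaluation window by ISP and count distinct customers. The sketch below assumes anomalies arrive as (customer, ISP) pairs and uses a made-up threshold of three customers; both are illustrative choices.

```python
from collections import defaultdict

def widespread_isp_issues(anomalies, min_customers=3):
    """Return ISPs with anomalies at `min_customers` or more distinct
    customers in the same window. The threshold is a hypothetical default."""
    customers_by_isp = defaultdict(set)
    for customer_id, isp in anomalies:
        customers_by_isp[isp].add(customer_id)
    return sorted(
        isp for isp, customers in customers_by_isp.items()
        if len(customers) >= min_customers
    )

# Illustrative anomalies from one evaluation window
window = [
    ("cust-a", "isp-1"), ("cust-b", "isp-1"), ("cust-c", "isp-1"),
    ("cust-a", "isp-2"), ("cust-a", "isp-1"),  # repeat customer counted once
]
issues = widespread_isp_issues(window)
```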
However, many of the other data points we monitor do not fit as cleanly into a mean and standard deviation approach. Our research had to account for seasonality and regularly changing customer behavior while keeping the approach generalizable across customers. We experimented with several available anomaly detection algorithms and settled on the approaches detailed below.
Sending Behavior Anomalies
Detecting sending behavior anomalies is an interesting challenge because sending behavior varies so much across our customers’ mail streams. Generally, each customer has unique, yet patterned, sending behavior that follows a daily, weekly, or monthly cadence. When a customer’s traffic pattern deviates from this, it’s often indicative of a change on their side – maybe something has changed within their sending application, or perhaps there’s a strategy change. We pride ourselves on working hand-in-glove with customers, so being aware of changes in their environment – whether intentional or not – is important to us.
There is no one-size-fits-all approach for our customers, given the variety of their sending behavior and how regularly it evolves. What is normal versus anomalous depends heavily on the specific customer and on the day and time. In addition to finding a generalizable approach, we needed a solution that could account for sending patterns that change over time. It needed to be an online algorithm that could update itself and stay current. We settled on the robust random cut forest (RRCF) algorithm to detect anomalies like the example in Figure 2.
RRCF is an unsupervised algorithm that can be used to detect anomalous data points in otherwise patterned streaming data. The algorithm is a collection of trees, each holding a set number of data observations. As a new data point is inserted into each tree, it is scored on how much it changes the depth of the tree. Those scores are averaged across the trees to provide an anomaly score.
The algorithm works well for our purposes because we were able to apply it to all of our customers’ mail streams regardless of their sending patterns, and the parameters did not require individual tuning per customer. It also works as an online model: each customer’s model is updated as it sees new data, so we do not need to worry about it becoming outdated when sending behaviors change.
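To give a feel for the idea, here is a deliberately simplified Python sketch. It is not RRCF itself: instead of RRCF's displacement-based score, it measures how quickly random cuts isolate a new point from samples of a sliding window, so a shallower depth means a more unusual point. The class name, parameters, and data are all hypothetical.

```python
import random
from collections import deque
from statistics import mean

def isolation_depth(point, others, rng, max_depth=32):
    """Number of random axis-aligned cuts needed to separate `point`
    from `others` (a list of equal-length tuples)."""
    depth = 0
    while others and depth < max_depth:
        pts = others + [point]
        dims = list(range(len(point)))
        lo = [min(p[d] for p in pts) for d in dims]
        hi = [max(p[d] for p in pts) for d in dims]
        spans = [h - l for l, h in zip(lo, hi)]
        if sum(spans) == 0:  # only exact duplicates remain
            break
        # Choose the cut dimension in proportion to the bounding-box side
        d = rng.choices(dims, weights=spans)[0]
        cut = rng.uniform(lo[d], hi[d])
        side = point[d] <= cut
        # Keep only the points that fall on the same side as `point`
        others = [p for p in others if (p[d] <= cut) == side]
        depth += 1
    return depth

class StreamingCutForest:
    """Sliding-window scorer; each 'tree' is a fresh random subsample."""

    def __init__(self, num_trees=40, sample_size=64, window=256, seed=0):
        self.num_trees = num_trees
        self.sample_size = sample_size
        self.window = deque(maxlen=window)  # old points drop off automatically
        self.rng = random.Random(seed)

    def score(self, point):
        """Mean isolation depth; lower means more anomalous."""
        history = list(self.window)
        if len(history) < 2:
            return None
        k = min(self.sample_size, len(history))
        return mean(
            isolation_depth(point, self.rng.sample(history, k), self.rng)
            for _ in range(self.num_trees)
        )

    def update(self, point):
        """Score the point against recent history, then learn from it."""
        result = self.score(point)
        self.window.append(point)
        return result
```

The bounded window is what keeps this sketch "online": as new observations arrive, the oldest ones fall out, so the notion of normal follows the stream.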
Engagement Rate Anomalies
An equally interesting challenge was detecting engagement anomalies. We were interested in engagement rate as a proxy for whether an ISP is quarantining or spam filtering mail, but the expected noisiness of engagement rate data presented a challenge. Early on we realized that we did not get much value out of alerting on one-off engagement anomalies, which is what we had prioritized detecting in the other metrics. Engagement rates tend to be noisy given changes in content and recipients, but for our purposes, that noisiness is generally not indicative of a larger problem. Rather, we found that we cared more about sustained changes in the engagement rate that reflect an underlying shift in ISP behavior toward the mail stream, as in the example in Figure 3. We settled on change point detection to identify these shifts.
Change point detection splits time series data into segments and detects when the statistical properties of the data change. We experimented with various unsupervised change point detection methodologies to identify an approach that worked well across all our customers, both for abrupt but sustained changes and for more gradual shifts.
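As a simple illustration of offline change point detection, the sketch below looks for the single split that best divides a series into two segments with different means. It is only one of many possible methods and is not necessarily the one we use; the function name, the minimum segment length, and the 0.5 threshold are assumptions.

```python
def detect_mean_shift(series, min_segment=5, min_explained=0.5):
    """Return the index where the mean most plausibly shifts, or None.

    A split is accepted only if it removes at least `min_explained` of the
    series' total squared deviation (a hypothetical threshold).
    """
    n = len(series)
    if n < 2 * min_segment:
        return None
    total = sum(series)
    total_sq = sum(x * x for x in series)
    cost_full = total_sq - total * total / n  # squared error around one mean
    if cost_full <= 1e-12:  # effectively constant series
        return None
    best_idx, best_cost = None, cost_full
    prefix = prefix_sq = 0.0
    for i in range(1, n):
        prefix += series[i - 1]
        prefix_sq += series[i - 1] ** 2
        if i < min_segment or n - i < min_segment:
            continue
        # Squared error when each side is modelled by its own mean
        left = prefix_sq - prefix * prefix / i
        right = (total_sq - prefix_sq) - (total - prefix) ** 2 / (n - i)
        if left + right < best_cost:
            best_idx, best_cost = i, left + right
    if best_idx is None:
        return None
    explained = (cost_full - best_cost) / cost_full
    return best_idx if explained >= min_explained else None

# Illustrative engagement rates: noisy around 0.30, then around 0.15
before = [0.30, 0.31, 0.29, 0.30, 0.32, 0.28, 0.30, 0.31, 0.29, 0.30]
after = [0.15, 0.16, 0.14, 0.15, 0.17, 0.13, 0.15, 0.16, 0.14, 0.15]
shift_at = detect_mean_shift(before + after)
no_shift = detect_mean_shift(before + before)
```

Requiring a minimum segment length on both sides is what keeps a one-off noisy value from being reported as a sustained shift.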
An additional tricky aspect of engagement rate monitoring is that it cannot be measured at the same short-term frequency as acceptance rate and sending behavior. We often do not have a complete view of an engagement rate until a couple of days later. To handle this, we calculate the rate after enough time has passed for recipients to engage at a level that follows the same pattern as the eventual engagement rate. This works well with offline change point detection because, with each new data point, we can recheck for change points across recent weeks and pick up on more gradual changes that we did not identify initially.
While acceptance rates, sending behavior, and engagement rates are all customer metrics we monitor over time, they present different challenges and require different approaches to anomaly detection. To ensure the anomalies detected are valuable, for each customer we check each metric at the appropriate facet level, for example by ISP and IP Pool/Sending Domain. We use Airflow pipelines to evaluate each metric at regular, frequent intervals. If an anomaly is detected, the customer success team member is alerted through Slack and can investigate, ensuring they have deep insight into our customers’ sending and deliverability.