Editor’s Note: This post is co-authored by Ezra Freedman, Christina Jaworsky, and Eric Silberstein.
When we launched AI-powered Customer Lifetime Value in June, we said there were two things we would work on next. The first was connecting Customer Lifetime Value (CLV) predictions to our segmentation engine. That’s now live and you can read about it here.
The second was improving our mathematical CLV prediction model, and that work went live last week. In this post, we'll show the new model in action, discuss how we validated it, and demonstrate improvements over our previous academic model. With access to such a large volume of e-commerce data, we were able to develop a model tailored to predicting purchasing behavior in an e-commerce setting, and it now out-predicts published academic models of customer behavior.
Klaviyo computes a churn risk prediction for each customer. (We use the terms “churn risk” and “probability of churn” interchangeably.) The example customer below is currently predicted to have a 21% risk of churn. The colored bar shows what the model would have predicted on each day in the past, all the way back to the customer’s first purchase in March 2016.
New model in action
Let’s dive in with an extreme example that illustrates limitations in our old model from the academic literature that we’ve now solved. We’ll look at Phones Forever (anonymized name), an online store for cell phone accessories. They are fortunate to have many repeat, low churn-risk customers. However, like most businesses, they also have a ton of one-and-done, high churn-risk customers. The new churn model is able to differentiate between these two types of customers with much higher accuracy. In the academic model, churn prediction increased too slowly over time. Churn prediction started at around 20%, and even after 15 months without a purchase, it had only crept up to 25%.
The new model is better at modeling customers with one purchase in their history. Using the new Klaviyo model for the same customer, we see they initially have a medium risk of churn that grows to a 96% churn risk prediction after 15 months of not making a purchase. Realistic estimates of churn probability over time for one-time customers differentiate the new Klaviyo model from the models implemented in the academic literature.
The Klaviyo model predictions also improve our prediction quality for repeat customers. The new model learns when one-and-done customers transition to being repeat customers:
The customer above is classified as medium-high risk for churn after their first purchase. After their second, risk decreases a little. After their third, they are thoroughly medium. After each of their remaining purchases, they start as low risk and gradually increase as the time since their last purchase grows.
Here’s another example:
Each time a purchase is made, churn probability decreases. During long gaps between purchases, churn probability increases over time. The model can identify when it has been too long since a loyal customer’s last purchase and shows their churn risk go from low to high. In the example above, it’s been about nine months since the customer’s last purchase, so it is very likely they have churned.
Let’s look at a loyal customer who places fewer, less frequent orders. For the customer below, the same general behavior is observed. But the model learns the longer purchase cycle, so churn risk rises more slowly between purchases. It has also been about nine months since this customer’s last purchase, yet their churn probability increased at a slower rate. The churn risk is 85%, 10 percentage points lower than the customer above. Nine months is a shorter gap for this customer, who typically waits 118 days between orders, than for the customer above, who waits 42 days between orders.
How do we know the new Klaviyo model performs better than the academic model?
We tested all of our models on e-commerce data from Klaviyo customers. For each company, we constructed a set of training and validation customers. We randomly assigned 80% of customers to the training group. The remaining 20% were used in the validation set. We withheld one year of data from training and used our model to predict what would happen during that year. Then, we compared the predictions to what actually happened.
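The split described above can be sketched like this. This is a simplified illustration, not Klaviyo's actual pipeline; the function names and the order-record shape are assumptions:

```python
import random
from datetime import datetime

def split_customers(customer_ids, train_frac=0.8, seed=42):
    """Randomly assign customers: 80% to training, 20% to validation."""
    rng = random.Random(seed)
    ids = list(customer_ids)
    rng.shuffle(ids)
    cut = int(len(ids) * train_frac)
    return ids[:cut], ids[cut:]

def holdout_split(orders, cutoff):
    """Withhold the final year of data: orders before `cutoff` are visible
    to the model; orders on or after `cutoff` are held-out ground truth
    that predictions are compared against."""
    visible = [o for o in orders if o["date"] < cutoff]
    held_out = [o for o in orders if o["date"] >= cutoff]
    return visible, held_out

# Example: split 100 customers, then hold out orders from 2016 onward.
train_ids, validation_ids = split_customers(range(100))
orders = [{"date": datetime(2015, 3, 1)}, {"date": datetime(2016, 6, 1)}]
visible, held_out = holdout_split(orders, datetime(2016, 1, 1))
```

The same split is constructed independently for each company, so every business is validated against its own customers.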
Evaluating the accuracy of probability of churn for a single individual is impossible. When a customer is assigned a probability of churn of 75%, we can never tell if that prediction was accurate or not because they either do or don’t return to make a purchase. However, if we have 100 customers with a probability of churn of 75% and 75 never return, our prediction is likely accurate. If 95% never return, the prediction is inaccurate. By grouping our customers by their probability of churn, we can measure accuracy by comparing the number of customers expected to churn to the number who actually churned in each group.
Binning customers by probability of churn shows where the new Klaviyo model outdid the academic model. Below, we show the churn categories and predictions for Best Bag Bargains (anonymized name), a company that sells handbags. The Klaviyo model is much better at identifying customers who have 90%+ probability of churn. The academic model is overly optimistic – assigning a medium 40-70% probability of churn to a large number of customers. When we compare these predictions to reality, we see that 88-97% of these medium risk customers churn, showing the academic model cannot differentiate between medium risk customers and high risk customers. In contrast, the Klaviyo model assigns a much smaller subset of customers to the 40-70% group and the prediction is more accurate: 58-71% of these medium risk customers churn. The Klaviyo model correctly identifies that most customers should be assigned an 80-100% churn rate.
To compare model performance, we needed to put a single number on how well or poorly the different models did at predicting churn probability. We binned customers by their predicted churn rate. Customers were separated into 10 groups of churn probability: a 0-10% chance group, a 10-20% chance group, continuing all the way to a 90-100% chance group. Then, we counted how many of each group made a purchase during the holdout period. A well performing model will predict well in each group: the 0-10% group should have a 5% churn rate, the 10-20% group should have a 15% churn rate, all the way to the 90-100% group, which should have a 95% churn rate. Well performing models have low misclassification rates across all bins.
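The binning step can be sketched as follows. This is a minimal illustration of the scheme, assuming hypothetical data shapes (a dict of predicted probabilities and a set of churned customer IDs), not Klaviyo's implementation:

```python
def bin_customers(predictions, churned, n_bins=10):
    """Group customers into churn-probability bins (0-10%, 10-20%, ...)
    and tally, per bin: customer count, expected churners (sum of
    predicted probabilities), and actual churners.

    predictions: dict of customer_id -> predicted churn probability
    churned:     set of customer_ids with no purchase in the holdout year
    """
    bins = [{"n": 0, "expected": 0.0, "actual": 0} for _ in range(n_bins)]
    for cid, p in predictions.items():
        i = min(int(p * n_bins), n_bins - 1)  # p = 1.0 falls in the top bin
        bins[i]["n"] += 1
        bins[i]["expected"] += p
        bins[i]["actual"] += cid in churned
    return bins

# Example: one low-risk customer who returned, two high-risk who churned.
bins = bin_customers({1: 0.05, 2: 0.95, 3: 0.92}, churned={2, 3})
```

Comparing `expected` to `actual` within each bin is exactly the accuracy check described above: a well-calibrated model's expected and actual churn counts stay close in every bin.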
To count misclassifications, we evaluate how “confused” the models are for each prediction bin. Since we are measuring a probability, we don’t know exactly how many false positives and negatives there are in our predictions, so instead we find bounds on our predictions. The lower bound is the least possible rate of confusion between the prediction and the performance. The upper bound is the maximum mismatch possible. We show a visual calculation and formulas for this in the FAQ below. In the example above, the new Klaviyo model confuses between 3% and 22% of customers (40-246). This is a huge improvement over the academic model, which confuses between 36% and 53% of customers (406-587). Even in the worst possible case, the Klaviyo model beats the academic model by a significant margin. And Best Bag Bargains isn’t an exception: company after company showed massive improvements in accuracy with the new Klaviyo model.
Across our 700 test companies, the lower bound on confusion is consistently smaller for the Klaviyo model:
In this plot, we show the confusion scores for 700 randomly chosen companies. The score on the x-axis is the fraction of customers put in the wrong churn bin. For example, a score of 0.2 means that 20% of customers were misclassified. The y-axis shows the number of companies with this score. The academic model has a long tail of high confusion scores, with some companies seeing as much as 80% of their customers misclassified. The new Klaviyo model has a confusion score of less than 10% for almost all companies. Overall, more than half of companies will see at least a twofold decrease in confusion scores. The new Klaviyo model is very good at correctly assigning churn probabilities to customers. Our model will allow you to identify high, medium, and low probability of churn customers. By differentiating between these different types of customers, our new model lets you target your customers differently based on how likely they are to return.
FAQ about using churn probability
After exporting my data, I made a list of customers with churn probability between 80-90%. How many customers should I expect to see make another purchase?
The average churn probability in that list will be around 85%, so about 15% of the customers in this segment should return to make another purchase.
I see that a customer has an 87% chance of churn and yet they are expected to make 3 purchases in the next year. How is that possible?
Churn probability only predicts the likelihood the customer will not come back. The expected number of purchases shows the expected value of the number of purchases the customer will make. To calculate expected value, we take the summation of the number of purchases times the probability of that number of purchases. So, even though this customer has an 87% chance of never making a purchase again, if they have a 13% chance of making 23 purchases in the future, that adds up to an expected value of 3 purchases. (.87*0 + .13*23 = 3)
Most of my customers have a 99% chance of churn. How can I use this?
If your business has a lot of one-and-done customers, your customers have a high probability of churn. Your lower probability of churn customers are likely your most valuable customers. Churn risk predictions let you identify the small segment of low churn-risk customers even if most of your customers never return. The model also learns from your data – so if your business changes over time and can recapture more customers, the model will update and learn from your new data to predict churn accurately for your customers.
FAQ about evaluating models
(We hope you’ll find these questions interesting. You don’t need to know any of this stuff to use our model!)
Can you explain the upper and lower bounds in more detail?
When we calculate the confusion score for each bin, we find the maximum and minimum number of correct predictions possible given the expected churn and the actual churn. Take a look at this diagram:
We predict 85% churn for 100 customers. We actually see a 90% churn rate. In the best case, we have 5% error, the difference between how many customers we expected to churn and how many churned, absolute_value(#predicted_to_churn – #actual_churn) = absolute_value(85-90) = 5. In the worst case, we have the maximum possible mismatch between our prediction and what actually happened. All our customers who we predict to return don’t and our customers who return all were predicted to churn. The upper bound, the maximum mismatch possible, is #total – absolute_value(#actual_didn’t_churn – #predicted_to_churn) = 100 – absolute_value(10-85) = 100-75 = 25.
Why don’t you show the graph for the upper bound?
The upper bound is a less meaningful bound on accuracy. The upper bound is large for churn predictions between 25-75% even if we see the same percentage of churn in the actual data. We’ll explain visually for the most extreme example, 50% churn:
Unlike the 85% prediction, where most customers are expected to behave the same way, a 50% prediction splits customers evenly between churning and returning. This means there are more possible mismatches, so in the worst case we see more error.
We plot the upper bounds below. The Klaviyo model still beats the academic model. However, these results are less meaningful because the upper bound for both models is very large for churn predictions around 50%.
If you have ideas or want to share your thoughts about our work, our data science team would love to hear from you. Please send feedback to email@example.com.