Markov models vs. Naive Bayes: How to scale propensity models for in-app behavior and retention

Marketers love Markov models. Here's why we chose Naive Bayes instead.

Schaun Wheeler

The world of marketing has a bit of a love affair with Markov models. Attribution is a really important topic in marketing, and Markov models have allowed marketers to move past first-touch/last-touch attribution and other similarly simplistic heuristics. So when people find out Aampe uses propensity models to enable more intelligent targeting, they often ask if we’re using a Markov model.

We’re not. This post explains why.

Different models for different purposes

We do a lot of modeling at Aampe, so let’s be clear about our propensity modeling in particular - what it is and what it isn’t. The core of Aampe's AI is a contextual bandit that has been modified to handle counterfactual signals. (If you want more information on what that means, I gave a talk at Fifth Elephant that explains it.) The bandit predicts user preferences for particular messaging choices based on parallelized experiments and adaptation. The bandit is a learning-by-doing model. That’s important, because most machine learning models don’t learn by doing - they’re only as good as the data you feed into them. Aampe’s core learning infrastructure actively generates the data the model needs to perform well. That’s much more than propensity modeling.

Propensity models are standard, bread-and-butter, learning-by-observing machine learning. We look at users' past behavior and predict the probability that they're going to do some future behavior of interest. These models are important because they allow our users to interact with the learning-by-doing parts of our system by setting context and conditions on their messaging experiments.

Say you’re interested in getting users on an e-commerce app to add something to their cart. There are typically three situations a user could be in (as far as the cart is concerned) at any given time:

  1. They already added to cart and then purchased. You might want to recommend something based on that purchase, or maybe make them feel good about the purchase they’ve already made.
  2. They added to cart but haven’t purchased. This is a typical “abandoned cart” scenario. You might encourage or incentivize them to complete the purchase.
  3. They haven’t added anything to their cart. This is the realm of general marketing or lifecycle messaging, just trying to figure out something that might interest them.

A good propensity model that predicts add-to-cart creates a fourth scenario:

  4. They haven’t added anything to their cart, but they’re doing other things on the app that indicate that an add-to-cart would make a lot of sense in the near future. You can message them knowing that they’re “in market” for the behavior you want to encourage.
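
To make this concrete, here's a minimal sketch of how those four scenarios might be expressed as targeting logic. The function name, fields, and threshold are hypothetical, invented for illustration - this isn't Aampe's actual API:

```python
def messaging_scenario(has_purchased: bool, has_cart_items: bool,
                       add_to_cart_propensity: float) -> str:
    """Map a user's cart state (plus a propensity score) to one of the
    four messaging scenarios above. The 0.7 threshold is illustrative."""
    if has_purchased:
        return "post-purchase"       # scenario 1: recommend or reinforce
    if has_cart_items:
        return "abandoned-cart"      # scenario 2: encourage completion
    if add_to_cart_propensity >= 0.7:
        return "in-market"           # scenario 4: propensity says act soon
    return "general-lifecycle"       # scenario 3: broad interest-finding

print(messaging_scenario(False, False, 0.85))  # -> "in-market"
```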

We create propensity models for any measurable app behavior - carts, purchases, profile management, wishlist curation, social media sharing, subscriptions, etc. One of our most common propensity models is retention propensity: looking at users now and predicting whether they're still going to be active on the app in a month. The first three scenarios are based on past actions. Propensity-driven messaging is based on predicted future actions. That’s what makes it powerful.

We could make these predictions using a Markov model if we wanted to. We choose not to, because we don't believe it's the best way to solve the problem. Instead, we use a Naive Bayes approach. 

Here’s an analogy to explain the difference between Markov and Naive Bayes

Say we were looking at a person in Seattle and wanted to predict whether they were going to travel to London. A Markov approach would calculate the probability of traveling from each airport to every other airport - so there'd be a Seattle-to-Denver probability, a Denver-to-Chicago probability, a Chicago-to-Barcelona probability, a Barcelona-to-London probability, and so on. We could use those "transition probabilities" to simulate lots of different travel patterns: start the model at Seattle, then randomly pick a destination (weighted by transition probability, so a destination with a higher transition probability would have a higher chance of being picked). Then, from that new destination, pick the next destination. Then the next. Do that for lots of iterations and see how often you end up in London. That's your probability of traveling from Seattle to London. In a Markov model approach, the focus is on the current state or location of the "traveler": the model considers the present state and makes predictions based on the probabilities of transitioning from the current state to the next state.
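
Here's what that simulation looks like in code. The airports and transition probabilities below are invented for illustration - in a real model they'd be estimated from observed travel data:

```python
import random

# Toy transition probabilities between airports (invented numbers).
# Each row sums to 1.
transitions = {
    "Seattle":   {"Denver": 0.5, "Chicago": 0.4, "London": 0.1},
    "Denver":    {"Seattle": 0.3, "Chicago": 0.5, "London": 0.2},
    "Chicago":   {"Denver": 0.2, "Barcelona": 0.5, "London": 0.3},
    "Barcelona": {"Chicago": 0.4, "London": 0.6},
}

def reaches_london(start: str, max_hops: int = 5) -> bool:
    """Random walk from `start`; report whether we hit London within max_hops."""
    state = start
    for _ in range(max_hops):
        destinations, probs = zip(*transitions[state].items())
        state = random.choices(destinations, weights=probs)[0]
        if state == "London":
            return True
    return False

n = 100_000
estimate = sum(reaches_london("Seattle") for _ in range(n)) / n
print(f"P(reach London within 5 hops) ~ {estimate:.3f}")
```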

Now let's go back to that guy in Seattle and try a different approach. Look at that traveler's history. Where are all of the places he's been in the last month? What's the probability that someone who traveled to each of those places traveled to London later? The probability for each place is going to be pretty small by itself, but we can treat them all as independent (not because they're actually independent, but just for the sake of convenience). So he traveled to Denver once. That carries a small probability of a later London trip - add that evidence to the total. He traveled to New York City. That carries a higher probability - add it in. Hey, it looks like he actually traveled to London once a few weeks ago. That's a high probability - add it to the rest. By "naively" treating all of those past destinations as independent and just adding up the evidence from each one (multiplying the probabilities, which is the same as adding their logarithms), we can arrive at a probability that the guy is going to travel to London in the near future. In a Naive Bayes approach, we look at individual histories, and the model constructs a story about the future based on those histories.
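
And here's the "naive adding-up" in code. In practice the addition happens in log space (adding log-probabilities is equivalent to multiplying raw probabilities); every number below is invented for illustration:

```python
import math

# P(visited this city | later flew to London) vs. P(visited | did not).
# These conditionals would normally be estimated from data; the numbers
# here are invented for illustration.
likelihoods = {                 # (P(city | London), P(city | no London))
    "Denver":   (0.05, 0.04),
    "New York": (0.30, 0.10),
    "London":   (0.60, 0.05),
}
prior_london = 0.02             # assumed base rate of traveling to London

log_odds = math.log(prior_london / (1 - prior_london))
for city in ["Denver", "New York", "London"]:   # the traveler's history
    p_yes, p_no = likelihoods[city]
    log_odds += math.log(p_yes / p_no)          # independence lets us just add

prob = 1 / (1 + math.exp(-log_odds))
print(f"P(travels to London soon) ~ {prob:.2f}")
```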

So both Markov models and Naive Bayes look at history. Markov distills those histories into general rules about the different “states” (locations, in our analogy) a user can be in, and makes you simulate travel across those states to get your final probability predictions. Naive Bayes uses those histories to calculate conditional probabilities of reaching the end state, and then combines all those probabilities into a final prediction.

Our retention propensity models rely on Naive Bayes because of the type of problem our model is trying to solve and the type of data available to solve it: the data is sparse, high-dimensional, and imbalanced. Those are important concepts, so let's unpack them.

Why sparse data is a problem

Data is sparse if there is a lot of missing data - just because one user has a value for x doesn't mean another user is also going to have a value for x. Using our traveler example: not everyone goes to Denver. Not everyone goes to Chicago. There are going to be tons of cases where transition probabilities for different states are based on very different travelers' histories.

Markov models can have a difficult time handling sparsity. That's because they rely on the assumption that the probability of moving to any state depends only on the previous state. Data sparsity, combined with this assumption, can lead to several problems:

  • Inaccurate Estimation: If you're missing a lot of data, you can't model transitions accurately, which leads to poor overall performance.
  • Unreliable State Sequences: The paths the simulation takes can vary a lot depending on which rare states the path happens to hit.
  • Overfitting: The model may miss many possibilities entirely just because they're not common, making predictions that look more confident than they deserve to be.
  • Sensitivity to Initial Conditions: Markov models are, in general, sensitive to initial conditions. Starting at a rare state can lead to very different outcomes than starting at a common state.
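
Here's a toy illustration of the estimation problem. With sparse data, any transition nobody in your sample happened to take gets estimated as a hard zero, so the simulation can never route through it:

```python
from collections import Counter, defaultdict

# Invented observations. Note that nobody in this tiny sample happened
# to fly Denver -> London, even though such trips certainly occur.
trips = [("Seattle", "Denver"), ("Seattle", "Chicago"),
         ("Denver", "Chicago"), ("Chicago", "London")]

counts = defaultdict(Counter)
for origin, dest in trips:
    counts[origin][dest] += 1

def transition_prob(origin: str, dest: str) -> float:
    """Maximum-likelihood estimate: observed count / total departures."""
    total = sum(counts[origin].values())
    return counts[origin][dest] / total if total else 0.0

print(transition_prob("Denver", "London"))  # 0.0 - "impossible," says the model
```

Laplace smoothing can paper over the zeros, but with hundreds of states and thin data you end up mostly smoothing rather than estimating.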

Why high-dimensional data is a problem

Data is high-dimensional if there are a lot of different "features" being used to make predictions. Using our traveler example: there are a whole lot of airports. There's a huge number of paths a traveler could take to get from Seattle to London (or between any other pair of destinations). High-dimensionality can cause overfitting and sensitivity to initial conditions, just like data sparsity does, and high-dimensionality often causes data sparsity directly. There are other challenges:

  • Increased Data Requirements: The more states you have, the more data you need to accurately model transitions between states.
  • Computational Complexity: High-dimensional Markov models are more demanding. A large transition matrix requires a lot more memory and processing power. It's expensive.
  • Increased Sensitivity to Noise: High-dimensional Markov models are more susceptible to noise in the data. Small perturbations or measurement errors propagate throughout the rest of the system. It's a mess.
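
The memory cost alone is easy to see with back-of-the-envelope arithmetic. A dense transition matrix has one entry per pair of states, so it grows with the square of the state count (assuming 8-byte floats):

```python
# Dense transition matrix: n_states x n_states entries, 8 bytes per float64.
for n_states in (100, 1_000, 10_000):
    mb = 8 * n_states ** 2 / 1e6
    print(f"{n_states:>6} states -> {mb:>8.2f} MB")

# Output:
#    100 states ->     0.08 MB
#   1000 states ->     8.00 MB
#  10000 states ->   800.00 MB
```

And that's only a first-order model. Condition on the last two states instead of one, and the effective state count squares again.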

Our models deal in app events - everything from completing a purchase to viewing a product to removing an item from a wish list. In the case of many of our customers, that's hundreds of different events the model needs to take into account, and often we create hundreds more by pre-processing the event names into semantic components that can catch relationships between events. That is very high-dimensional data. And, of course, many users don't do many of the events. The data is sparse. Using a Markov model to handle this kind of situation just isn't a good idea.

Why imbalanced data is a problem

“Imbalanced data” is the machine learning term for when one outcome is much, much more likely than the other. In our case, for retention modeling: a user is more likely to churn than to retain. That’s just the plain truth.

Markov models become unstable when faced with greatly imbalanced outcomes. If state A is visited much more frequently, the transitions between states may primarily involve state A. This can lead to a lack of data on transitions involving state B, making it challenging to accurately estimate transition probabilities for the rare state. As a result, the model may have limited predictive power for the rare state. A Markov model's performance will be disproportionately influenced by the more common state, potentially leading to skewed predictions and a limited ability to detect or understand the rare state's behavior. Many ML algorithms have a hard time handling imbalanced datasets because the algorithm learns quickly that it can get good accuracy by just predicting the most common outcome, which isn't a great solution in most cases.
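
The accuracy trap is easy to demonstrate with toy numbers. With a 95/5 split, a "model" that always predicts the common outcome scores 95% accuracy while telling you nothing about the users you actually care about:

```python
# Toy imbalanced outcome: only 5% of users retain.
labels = [True] * 50 + [False] * 950

# A degenerate "model" that always predicts the majority class (churn).
predictions = [False] * len(labels)

accuracy = sum(p == y for p, y in zip(predictions, labels)) / len(labels)
recall = sum(p and y for p, y in zip(predictions, labels)) / sum(labels)
print(f"accuracy = {accuracy:.0%}, recall on retained users = {recall:.0%}")
# accuracy = 95%, recall on retained users = 0%
```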

Naive Bayes does well where Markov doesn’t

Markov models require us to estimate the transition probabilities between all states. With Naive Bayes, on the other hand, we count up how many times each user performs each app event over a span of 30 days, then we skip ahead (by default, another 30 days), and then we look at the window after that (the next 30 days) to see if they come back to the app. That's the outcome our model optimizes for. Imagine a big list of users labelled TRUE if they retain and FALSE if they churn - our model tries to learn rules that predict the correct label.
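
Here's a minimal sketch of that windowing scheme, assuming an events table with user_id, event_name, and timestamp columns. The column names and helper function are hypothetical, not Aampe's actual pipeline:

```python
import pandas as pd

def build_training_data(events: pd.DataFrame, anchor: pd.Timestamp):
    """Features: per-user event counts in the 30 days before `anchor`.
    Label: any activity in the 30-day window starting 30 days after `anchor`."""
    day = pd.Timedelta(days=1)
    feature_window = events[(events.timestamp >= anchor - 30 * day)
                            & (events.timestamp < anchor)]
    label_window = events[(events.timestamp >= anchor + 30 * day)
                          & (events.timestamp < anchor + 60 * day)]

    # One row per user, one column per event name, cells = counts.
    X = (feature_window.groupby(["user_id", "event_name"]).size()
         .unstack(fill_value=0))
    y = X.index.isin(label_window["user_id"].unique())  # True = retained
    return X, y
```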

Naive Bayes works very well with sparse data, and can handle extremely high-dimensional data without breaking a sweat. Also, the specific flavor of the Naive Bayes family of classifiers we use, Complement Naive Bayes, is uniquely powerful in its ability to handle imbalanced datasets. Instead of focusing on the most common outcome, it concentrates on the minority outcome (users who retain, in the case of our retention model): it calculates the probability of each feature (an app-activity count) appearing in the non-retained group, then flips those probabilities to favor the retained group.
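
Complement Naive Bayes ships with scikit-learn, so a minimal version of the model is only a few lines. This sketch reuses the hypothetical X and y from the windowing example above:

```python
import pandas as pd
from sklearn.naive_bayes import ComplementNB

model = ComplementNB()
model.fit(X, y)  # X: per-user event counts; y: True if retained

# Per-user retention propensity. predict_proba's columns follow
# model.classes_, which is [False, True] here, so column 1 is P(retained).
propensity = model.predict_proba(X)[:, 1]

# A rough view of which events carry the most weight for the retained
# class: feature_log_prob_ has shape (n_classes, n_features).
weights = pd.Series(model.feature_log_prob_[1], index=X.columns)
print(weights.sort_values(ascending=False).head(10))
```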

So the output of the model is a prediction between 0% and 100% indicating the estimated probability that a user will be retained over the period of a month. We can then use that model to enable messaging decisions (such as the decision to trigger specific messages) based on the user's propensity to retain (or buy, or add to cart, or curate their wishlist, or any other behavior measured through app events). We can also output probabilities for each individual feature directly, to give insights into which app events most impact retention.

Tool choice matters because data is your only window into your customers’ lives and preferences

The choice between Markov models and Naive Bayes isn't just a matter of personal preference. It's about selecting the right tool for the job. We opted for Naive Bayes because it handles exactly the kind of data our customers face: sparse, high-dimensional, and imbalanced. Our propensity models give our customers data-driven options for engaging their users in unique ways, optimizing messaging decisions while providing valuable insights into user behavior. In the world of customer engagement, the ability to adapt to customers' lives depends on adapting to the specific characteristics of the data that represents those lives.
