Solving model class imbalance through class construction based on coverage changepoints

Class imbalance is a common problem in modeling that can't always be solved through sampling or reweighting.

Schaun Wheeler

This post is about a geeky model-training issue that has very substantial business impact.

It’s about conversion events

If you have an app, there's lots of things you want your users to do, but for most apps there's just one or two things that you really want your users to do.

These are conversion events. Depending on the type of app, the conversion event could be:

  • Complete an order
  • Play a game
  • Add money to a wallet
  • Take out a micro-loan
  • Complete a shift
  • Complete a lesson
  • Watch a video

If you're hyper-focused just on engagement, a conversion event could just be visiting a particular kind of page. 

The point is:

  1. There's usually just one point - maybe a small handful at most - at which you get real, tangible, business-impact value from your users.
  2. Compared to all the rest of the activity on your app, those impactful events are really rare.

Event rarity is a modeling challenge as well as a business challenge

The core of Aampe's reinforcement learning system is a model that estimates the probability that each individual user on your app will respond positively when presented with a particular messaging choice.

  • A “messaging choice” is any way the timing, copy, channel, or other aspect of the message sent to that user differs from messages sent to other users.
  • “Respond positively” means the user moves meaningfully along the path toward a conversion event.

That requires us to feed information into the model about messages we sent, and whether those messages resulted in a success or failure.
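
Concretely, every training example pairs one messaging choice with the label it earned. Here is a minimal sketch of what such a record might look like - the field names below are illustrative assumptions, not our actual schema:

```python
# Illustrative only: these field names are assumptions about what a labeled
# training record could contain, not the actual production schema.
from dataclasses import dataclass

@dataclass
class MessageOutcome:
    user_id: str
    channel: str        # e.g. "push" or "email"
    copy_variant: str   # which copy was sent
    send_hour: int      # timing component of the messaging choice
    success: bool       # did the user move toward the conversion event afterward?

# One training example: a messaging choice plus its success/failure label.
example = MessageOutcome(
    user_id="u_123",
    channel="push",
    copy_variant="cart_reminder",
    send_hour=18,
    success=True,
)
```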

If we define success only as "the user did the conversion event", we end up with a severe class-imbalance problem. If the model's training data is 98% failure and 2% success, sure, there are ways to oversample or undersample or re-weight to technically balance the classes, but that 2% just doesn't have a whole lot of variation in it to stretch it that far.

On the other hand, if we define success very broadly - just doing anything on the app - we might not run into the inverse problem, where failure cases are wildly underrepresented (though we have seen that happen for some apps), but defining success that broadly also means the model won't really focus on and optimize for the things you most value as a business.

So we developed a way to balance the need for good positive-label coverage with the need to focus the model on the conversion event. Here's how it works:

  1. For each event in the app, calculate the probability that a user who does that event will do the conversion event within a reasonably close timeframe. For most of our customers, we look to see if the conversion happened within 24 hours of the previous event. For some customers, we lower that to as little as four hours. Sort the events from highest probability (the conversion event itself) to lowest.
  2. Calculate the “coverage” you would get - the percentage of total messages that would be labeled as a success - if you defined success as a user doing the highest-probability event. (We simulate random message sends to do the estimation.) Then look at how that coverage would change if success was defined as a user doing either of the top two events. Then the top three. And so on. Your coverage will increase with the addition of each event, but some events will cause it to increase more than others.
  3. Calculate the percent change in coverage that you achieve by adding in each subsequent event. If you plot these changes, you'll be able to easily identify changepoints - spikes in the coverage growth.
  4. Sort all changepoints from biggest to smallest. There will be a few events that create a lot of change, and most will create little if any. Use your preferred knee-detection algorithm (which could just be your eyes, if you want) to choose a cutoff.

Now use that new set of events to calculate the coverage you'll get from defining success as the top 1, top 2, top 3, etc. events.
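
To make the procedure easier to follow, here's a rough sketch in Python. The event-log schema (user_id, event_name, timestamp columns), the number of simulated sends per user, and the simple threshold used in place of formal knee detection are all illustrative assumptions, not our production implementation:

```python
# Rough sketch of the procedure above. Assumes `events` is a DataFrame with
# columns user_id, event_name, and timestamp (pandas datetime dtype).
import numpy as np
import pandas as pd


def event_to_conversion_probability(events: pd.DataFrame,
                                    conversion_event: str,
                                    window_hours: int = 24) -> pd.Series:
    """Step 1: P(conversion within the window | user just did this event)."""
    window = pd.Timedelta(hours=window_hours)
    conv_times = (events[events["event_name"] == conversion_event]
                  .groupby("user_id")["timestamp"].apply(list))

    def followed_by_conversion(row) -> bool:
        # The conversion event itself trivially scores 1.0 here.
        times = conv_times.get(row["user_id"], [])
        return any(row["timestamp"] <= t <= row["timestamp"] + window for t in times)

    flagged = events.apply(followed_by_conversion, axis=1)
    # Fraction of each event's occurrences that were followed by a conversion.
    return flagged.groupby(events["event_name"]).mean().sort_values(ascending=False)


def simulate_coverage(events: pd.DataFrame,
                      success_events: set,
                      n_sends_per_user: int = 5,
                      followup_hours: int = 24,
                      seed: int = 0) -> float:
    """Step 2: fraction of simulated random sends that would be labeled a success."""
    rng = np.random.default_rng(seed)
    window = pd.Timedelta(hours=followup_hours)
    start, end = events["timestamp"].min(), events["timestamp"].max()
    horizon_seconds = (end - start).total_seconds()

    success_times = {user: ts.sort_values()
                     for user, ts in (events[events["event_name"].isin(success_events)]
                                      .groupby("user_id")["timestamp"])}
    hits, total = 0, 0
    for user in events["user_id"].unique():
        for offset in rng.uniform(0, horizon_seconds, size=n_sends_per_user):
            send_time = start + pd.Timedelta(seconds=float(offset))
            times = success_times.get(user)
            if times is not None:
                hits += int(((times > send_time) & (times <= send_time + window)).any())
            total += 1
    return hits / total


def choose_success_events(events: pd.DataFrame,
                          conversion_event: str,
                          min_gain_pct: float = 5.0):
    """Steps 2-4: grow the success set, track coverage gains, cut at the knee."""
    probs = event_to_conversion_probability(events, conversion_event)
    ordered = list(probs.index)  # the conversion event itself sorts first
    # Brute-force for clarity: one coverage simulation per cumulative event set.
    coverages = [simulate_coverage(events, set(ordered[:k]))
                 for k in range(1, len(ordered) + 1)]

    # Step 3: percent change in coverage from adding each subsequent event.
    keep = [ordered[0]]  # always keep the conversion event itself
    for event, prev_cov, cov in zip(ordered[1:], coverages, coverages[1:]):
        gain_pct = 100.0 * (cov - prev_cov) / max(prev_cov, 1e-9)
        # Step 4 stand-in: a threshold instead of a formal knee detector.
        if gain_pct >= min_gain_pct:
            keep.append(event)
    return keep, probs, coverages
```

The last step described above - recomputing coverage for the top 1, top 2, top 3, etc. of the chosen events - is just the coverage simulation run again over the kept list instead of the full ordered list. A production version would also vectorize the per-row scans and use a real knee-detection step, but the shape of the computation is the same: probabilities first, then coverage, then changepoints, then a cutoff.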

To make things more concrete: we have an e-commerce customer with 365 distinct events, 309 of which have a non-zero probability of leading to the "order_completed" conversion event. We ran those 309 events through the above procedure.

Here's what we got:

[Chart: positive-class coverage as each successive event is added to the success definition]

If we'd used only the "order_completed" event to define our success class for our model, our positive class would have made up only around 3% of our training data. Our customer called out add_to_cart as another event of interest, but that only gets us up to 10%. By including all 8 events that our procedure found, we get up to almost 30% coverage using only events that have at least a 40% chance of leading to a conversion event within 24 hours.

A class imbalance of 30% positive / 70% negative can easily be handled by an ML classifier through established sampling and re-weighting corrections.
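
For instance, with an off-the-shelf classifier, built-in class weighting is typically all that's needed at that ratio. Here's a minimal sketch using scikit-learn, with synthetic data standing in for message features and success labels:

```python
# Sketch: at roughly 30% positive / 70% negative, standard class weighting in
# an off-the-shelf classifier is usually enough. The synthetic data here just
# stands in for message features (X) and success labels (y).
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=10_000, n_features=20,
                           weights=[0.7, 0.3], random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, stratify=y, random_state=0)

# "balanced" re-weights each class inversely to its frequency, so the 30%
# positive class isn't drowned out by the 70% negative class.
model = LogisticRegression(class_weight="balanced", max_iter=1000)
model.fit(X_train, y_train)

print(f"Positive rate in training data: {y_train.mean():.2f}")
print(f"Held-out accuracy: {model.score(X_test, y_test):.2f}")
```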

We could have picked events that had a higher probability, but they wouldn't have raised our coverage substantially. We could have picked events that raised our coverage to as high as 45%, but those events had very little correspondence to an eventual conversion. This procedure balances the need for coverage with the need for focus, and allows our model to learn faster and better.
