The most fun I have as a data scientist is when I learn about something that solves a problem I’ve had for a while. One of those problems I’ve run into in several jobs now is figuring out what I’ll call “distributedness.”
Say we’ve got two users on our food delivery app: Jones and Fletcher. Jones and Fletcher are both avid foodies, but their tastes are different. Where Fletcher often samples new restaurants, Jones tends to order from the same cafe every time she chooses to treat herself. These are obviously very different behaviors - one user distributes her engagement across lots of categories, the other engages with only one.
It’s not hard to tell Jones apart from Fletcher here - you can count the number of unique restaurants in their event histories. But it’s trickier when you have a case like a third user we’ll call Bonnie. Bonnie has a couple of favorite restaurants that she orders from frequently, but she’s also inclined to try new things. She and Fletcher might conceivably have the same distinct count of restaurants in their event histories, but perhaps Fletcher only ever visits a restaurant once, whereas Bonnie goes back to her favorites several times even as she tries new ones. (And let’s face it, most people are Bonnies!)
These are different kinds of customer, with different messaging strategy implications! Maybe Fletcher’s having trouble finding a restaurant she likes and needs a bump towards a highly-rated restaurant she hasn’t tried before. Maybe Bonnie’s showing us her preference for spicy food, alongside her enjoyment of trying new things, and could benefit from a recommendation for Marvel’s Hot Wings.
Whatever messaging strategy we take, we need to first find a way to characterize how much a user distributes their activity across categories.
I’ve tried a few different statistical approaches to this general problem in the past, none of which quite captured what I wanted.
The thing that eventually worked for me was Gini impurity.
Wikipedia defines Gini impurity as measuring "how often a randomly chosen element of a set would be incorrectly labeled if it were labeled randomly and independently according to the distribution of labels in the set. It reaches its minimum (zero) when all cases in the node fall into a single target category." Formally, we can define the Gini impurity of dataset D as this:
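$$\text{Gini}(D) = 1 - \sum_{i=1}^{k} p_i^2$$

where $p_i$ is the proportion of elements in $D$ that fall into class $i$, and $k$ is the number of classes.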
...which means: the Gini impurity is the sum of the squared probabilities of falling into any given class, subtracted from 1.
You might have seen this term in walkthroughs of how a decision tree creates decision rules. Decision trees are often grown by finding splits that minimize the weighted Gini impurity of the child nodes. In other words, at every split in a tree, we’re trying to sort the data into two boxes where all (or at least most of) the data in either box belongs to just one of the classes we’re trying to predict.
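For the curious, the quantity a tree is minimizing at each split of a node $D$ into children $D_L$ and $D_R$ is the size-weighted average of the children’s impurities:

$$G_{\text{split}} = \frac{|D_L|}{|D|}\,\text{Gini}(D_L) + \frac{|D_R|}{|D|}\,\text{Gini}(D_R)$$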
We’re not using Gini impurity to help us solve a classification problem here, though. We’re turning it on its head somewhat.
We’re treating each subject’s category frequencies as a dataset and calculating a Gini impurity for each subject, to see how distributed each subject’s activity is across categories.
Returning to our restaurant app for demonstration purposes, we can calculate the Gini impurity for each user like this:
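Here’s a minimal sketch in BigQuery SQL - the order_events table and its user_name and restaurant columns are placeholders for whatever your event log actually looks like:

WITH restaurant_counts AS (
    -- how many times each user has ordered from each restaurant
    -- (order_events, user_name, and restaurant are placeholder names)
    SELECT
        user_name
        , restaurant
        , COUNT(1) AS n_orders
    FROM order_events
    GROUP BY 1, 2
), squared_probs AS (
    -- squared empirical probability of a user choosing any given restaurant
    SELECT
        user_name
        , POW(SAFE_DIVIDE(n_orders, SUM(n_orders) OVER(PARTITION BY user_name)), 2) AS psq
    FROM restaurant_counts
)
SELECT
    user_name
    , 1 - SUM(psq) AS gini_impurity
FROM squared_probs
GROUP BY 1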
(Note that in practice, the empirical probability for going to a restaurant you’ve never been to is 0. So we only need to tally frequencies for restaurants the users have been to.)
How do our users shape up?
I’m mostly happy with this outcome. Jones is clearly the lowest in “distributedness”, with a Gini impurity of 0 - which is intuitively correct, since she only ever goes to one restaurant. Fletcher, who tries a new restaurant every time she orders, has the highest score. Fletcher and Bonnie, despite having the same number of unique restaurants in their history, are differentiable - Bonnie’s concentration on the spicy offerings shows up in her having a lower score than Fletcher.
A quick caveat for thinking about this with small numbers of categories
So we’re done, right? Awesome! We have a ranking metric that gives us some insight into our user behavior! What more do we need to capture?
Well…there’s a reason we’ve historically used Gini as something we try to minimize, rather than maximize. (Like in a decision tree, where we’re aiming at low impurity/“distributedness” in the two nodes resulting from a split - we want to find a split that best separates the training data into their classes.)
One clue to this is in Fletcher’s Gini impurity. Her perfectly even chance of visiting any one of the four restaurants is 25%; her Gini impurity is 1 - (4 × .25²) = 1 - .25 = .75. This is not a coincidence:
When a user’s engagement is distributed perfectly evenly over the categories, their Gini impurity varies according to the number of distinct categories:
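If every one of a user’s n categories gets an equal share, each probability is 1/n, so the sum of squared probabilities is n × (1/n)² = 1/n, and the Gini impurity is simply 1 − 1/n:

n = 2 → 0.500
n = 3 → 0.667
n = 4 → 0.750
n = 6 → 0.833
n = 10 → 0.900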
Basically, the Gini impurity of a highly pure dataset approaches 0. But a highly impure dataset can vary in its Gini impurity, depending on how many distinct categories are in it. The differences are especially stark among highly impure datasets with low numbers of categories; once you get into higher numbers of categories, the effect tapers off.
This might have real implications for characterizing behavior. Is someone who went to ten different restaurants once (Gini = 0.9) different in kind from someone who went to six different restaurants once (Gini = 0.833)? From someone who went to four (Gini = 0.750)? Essentially, when category counts are low, the fewer distinct categories a user has, the purer even a perfectly evenly-distributed user will look.
You might say yes, they’re different enough in kind that it’s okay; or you might be fine with the “having little signal” users getting lumped in with the “not very distributed” users. Or you might want to transform the Gini impurity somewhat to handle your few-category dataset.
But that’s for another blog post.
Back to the show!
How has this actually mattered to Aampe?
The core infrastructure of Aampe is a reinforcement learning algorithm applied to user experience. That means that when we’re sending a message (or otherwise influencing user experience), we want to optimally balance exploring the space of what a user might prefer with exploiting what we already know they prefer. Our agents find messaging formulas they think users will be responsive to, and send them at times they think the users will like (with some exploration built in once in a while, just in case a better context can be found).
So having something like the Gini impurity in our toolkit can help us introspect the choices the model makes. Suppose we calculated the Gini impurity for our contacts with respect to how frequently they were sent messages built from certain kinds of formulas. Like this:
WITH formulawise_info AS (
    -- get the basic counts of how many times someone was sent a message for a particular formula, from which we will derive the gini
    SELECT
        contact_id
        , formula_name
        , COUNT(1) AS n_msgs
    FROM messages
    GROUP BY 1, 2
), pre_gini AS (
    -- calculate the gini impurity (part 1...because you can't use analytic functions in an aggregate)
    SELECT
        contact_id
        -- what's the squared probability of a person getting any given formula?
        , POW(SAFE_DIVIDE(n_msgs, SUM(n_msgs) OVER(PARTITION BY contact_id)), 2) AS psq
    FROM formulawise_info
), gini AS (
    -- calculate the gini impurity (part 2)
    SELECT
        contact_id
        , 1 - SUM(psq) AS gini_impurity
    FROM pre_gini
    GROUP BY 1
)
SELECT
    contact_id
    , gini_impurity
    , AVG(message_freshness_score) AS avg_freshness
    , AVG(copy_score) AS avg_copy_score
FROM gini
JOIN messages USING(contact_id)
GROUP BY 1, 2
""")
We see a few neat patterns here between how “distributed” our contacts are across formulas and the copy they receive.
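To eyeball these patterns, you can group contacts into Gini buckets and average the scores per bucket. Here’s a minimal sketch, assuming the query above has been saved as a contact_gini table - the table name and the tenth-wide buckets are stand-ins, not necessarily how we built our own view:

SELECT
    -- bucket contacts by Gini impurity, rounded to the nearest tenth
    ROUND(gini_impurity, 1) AS gini_bucket
    , COUNT(1) AS n_contacts
    , AVG(avg_freshness) AS bucket_avg_freshness
    , AVG(avg_copy_score) AS bucket_avg_copy_score
FROM contact_gini
GROUP BY 1
ORDER BY 1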
We track the freshness of a message - briefly, this is a metric of how similar a message’s verbiage is to previous messages. (Higher is better; people get bored looking at the same thing over and over.) We can see that messages are freshest for the contacts with the lowest Gini scores. This is most likely an artifact of contacts just starting out in the system: having fewer messages leads to having fewer distinct categories, which we know depresses the Gini when counts are very low, and at the same time, everything you see is fresh when you’ve only received a few messages. You’ll also see that the lowest Gini bucket also has the lowest copy score, which is consistent with the “few-messages” theory - the AI doesn’t know what brand new contacts like yet.
So let’s leave the first bucket aside for now. The rest of what we see is a clear cluster of low-freshness, high-copy-score users and a cluster of higher-freshness, lower-copy-score users. You may have intuited this already, but a user’s Gini impurity with respect to the formulas they’ve been sent is related to this in a big way.
What we’re seeing is:
- a high-Gini group, who are very “distributed” across formulas - the AI seems to still be exploring the formula space for these users. This is reflected in fresher messages (of course it’s fresh, it’s a whole new formula!) and lower copy score (it’s just not sure about any formula for this guy!).
- a low-Gini group, who are less “distributed” across formulas, because the AI has found what they liked and is generally exploiting it. They’ve got less fresh messages (because the AI is perseverating on the successful formulas, meaning the verbiage is going to get old faster), and better copy scores (the AI is pretty dang sure about this formula for this guy!).
The value Aampe adds is personalization at scale, which means that the agents responsible for this are doing some extraordinarily computationally complex work. I’m always wishing for new windows into our agents, to help us understand their decision-making and improve on it.
I guess you could say Gini granted my wish?