Jul 5, 2024
Schaun Wheeler

Scalable Event-Based Clustering for User Segmentation

There are generally limited options for clustering large numbers of records. A typical Aampe customer has up to 400 app events ("product viewed", "purchase completed", etc.) instrumented, sometimes for as many as 100 million end users. Clustering a few hundred features is simple, even with 100 million records - something like mini-batch k-means can do the trick.

Clustering records (grouping those 100 million users based on complete-profile similarity) gets more complicated.

Many clustering algorithms need to compute pairwise distances...even with 1 million records, that's no joke. That's why my current favorite clustering approach for event data involves:

  1. An event-event distance matrix based on temporal proximities.

  2. Hierarchical clustering to group events.

  3. Assignment of users to event clusters based on event activation.
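To make the scale argument concrete, here is a quick back-of-the-envelope sketch. It only counts unordered pairs, using the 1 million records and 400 events mentioned above; the helper name `n_pairs` is made up for illustration.

```python
def n_pairs(n: int) -> int:
    """Number of unordered pairs among n items: n * (n - 1) / 2."""
    return n * (n - 1) // 2

user_pairs = n_pairs(1_000_000)  # pairwise distances between 1 million users
event_pairs = n_pairs(400)       # pairwise distances between 400 events

print(f"{user_pairs:,}")   # 499,999,500,000
print(f"{event_pairs:,}")  # 79,800
```

The event-event matrix stays the same size no matter how many users there are, which is what makes step 1 cheap compared with a user-user matrix.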

These days, I calculate the distance matrix as a modified Jaccard distance with temporal bounds: I count toward the intersection any two events that happen within 24 hours, but not within 1 hour, of each other. I use that lower bound to filter out event relationships that are just part of an automated workflow. For example, checkout started, payment selected, address selected, purchase, and checkout completed tend to be very tightly connected because once you start a checkout, you're likely to do the rest of the flow. I omit those relationships because I'm more interested in event relationships across sessions.
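One way this could look in code is sketched below. The post doesn't spell out the exact formula, so this is an assumption: the intersection is the number of users with at least one occurrence of each event more than 1 hour and at most 24 hours apart, and the union is the number of users who did either event at all. The function names and the toy log are hypothetical.

```python
from collections import defaultdict
from itertools import combinations

HOUR = 3600
LOWER, UPPER = 1 * HOUR, 24 * HOUR  # temporal bounds from the text, in seconds

def linked(times_a, times_b, lower=LOWER, upper=UPPER):
    """True if some occurrence of A and some occurrence of B are
    more than `lower` and at most `upper` seconds apart."""
    return any(lower < abs(ta - tb) <= upper for ta in times_a for tb in times_b)

def event_distances(log):
    """log: iterable of (user_id, event_name, unix_seconds).
    Returns {(event_a, event_b): distance}, with event_a < event_b alphabetically."""
    by_user = defaultdict(lambda: defaultdict(list))
    for user, event, ts in log:
        by_user[user][event].append(ts)

    users_with = defaultdict(set)  # event -> users who did it at all
    both = defaultdict(int)        # (a, b) -> users with a qualifying A/B pair
    for user, events in by_user.items():
        for event in events:
            users_with[event].add(user)
        for a, b in combinations(sorted(events), 2):
            if linked(events[a], events[b]):
                both[(a, b)] += 1

    dist = {}
    for a, b in combinations(sorted(users_with), 2):
        union = len(users_with[a] | users_with[b])
        dist[(a, b)] = 1.0 - both[(a, b)] / union  # assumed Jaccard-style distance
    return dist

# Toy log (made-up data): u1 completes checkout 5 minutes after starting it.
log = [
    ("u1", "checkout started", 0),
    ("u1", "checkout completed", 5 * 60),
    ("u1", "product viewed", 3 * HOUR),
    ("u2", "product viewed", 0),
    ("u2", "checkout started", 2 * HOUR),
]
dist = event_distances(log)
print(dist[("checkout completed", "checkout started")])  # 1.0
print(dist[("checkout started", "product viewed")])      # 0.0
```

The two checkout events end up maximally distant even though they always occur together, because the 5-minute gap falls under the 1-hour lower bound. A production version would need something smarter than the all-pairs timestamp comparison inside `linked`.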

I've always preferred hierarchical clustering because it's versatile and introspectable.

I can choose however many clusters I need - and switch to a different number of clusters quickly and even interactively if I want - but I can also easily look at higher-level clusters to aid in interpretation.
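A minimal pure-Python sketch of that workflow follows. It uses average linkage, which is an assumption (the post doesn't name a linkage method), and the event names and distances are invented. The point is that the merge history is computed once, and any number of clusters can then be read off by replaying it.

```python
def linkage(names, dist):
    """Average-linkage agglomerative clustering.
    names: list of labels; dist(a, b): distance between two labels.
    Returns the merges in order, each as a pair of frozensets."""
    clusters = [frozenset([n]) for n in names]
    merges = []

    def avg(c1, c2):
        return sum(dist(a, b) for a in c1 for b in c2) / (len(c1) * len(c2))

    while len(clusters) > 1:
        # Pick the closest pair of current clusters.
        i, j = min(
            ((i, j) for i in range(len(clusters)) for j in range(i + 1, len(clusters))),
            key=lambda ij: avg(clusters[ij[0]], clusters[ij[1]]),
        )
        merges.append((clusters[i], clusters[j]))
        merged = clusters[i] | clusters[j]
        clusters = [c for k, c in enumerate(clusters) if k not in (i, j)] + [merged]
    return merges

def cut(names, merges, k):
    """Replay the merge history until only k clusters remain."""
    clusters = {frozenset([n]) for n in names}
    for a, b in merges:
        if len(clusters) <= k:
            break
        clusters -= {a, b}
        clusters.add(a | b)
    return clusters

# Hypothetical event-event distances.
names = ["product viewed", "added to wishlist", "checkout started", "purchase completed"]
pairs = {
    frozenset(("product viewed", "added to wishlist")): 0.2,
    frozenset(("checkout started", "purchase completed")): 0.3,
    frozenset(("product viewed", "checkout started")): 0.8,
    frozenset(("product viewed", "purchase completed")): 0.9,
    frozenset(("added to wishlist", "checkout started")): 0.85,
    frozenset(("added to wishlist", "purchase completed")): 0.9,
}
merges = linkage(names, lambda a, b: pairs[frozenset((a, b))])

three = cut(names, merges, 3)  # finer view
two = cut(names, merges, 2)    # higher-level view, no re-clustering needed
print(two)
```

With a few hundred events, a library implementation would be the practical choice; the hand-rolled version here is only meant to show why changing the number of clusters is cheap.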

After that, I assign users to clusters based on the number of times they've done the events that make up each cluster. I put those event counts through a tf-idf transformation to prevent clusters with very common events from looking like they are more popular. Of course, this means that each user will be a member of several clusters. The thing is: that's true of most clustering approaches. It's just that most clustering approaches hide those multiple associations - they pick cluster centroids, measure each record's distance from each centroid, and assign the cluster of the closest centroid. Here, the full ranking stays visible, so I can also get each user's second-best, third-best, etc. cluster if I need them.
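The assignment step might be sketched like this. The exact tf-idf variant isn't given in the post, so the weighting below (term frequency as a share of the user's events, with a smoothed idf) is an assumption, and the cluster names, users, and counts are invented.

```python
import math
from collections import Counter

def rank_clusters(user_event_counts, event_to_cluster):
    """user_event_counts: {user: {event: count}}; event_to_cluster: {event: cluster}.
    Returns {user: [(cluster, score), ...]}, best cluster first."""
    # Roll event counts up to cluster counts for each user.
    cluster_counts = {}
    for user, counts in user_event_counts.items():
        rolled = Counter()
        for event, n in counts.items():
            rolled[event_to_cluster[event]] += n
        cluster_counts[user] = rolled

    # Document frequency: how many users are active in each cluster.
    n_users = len(cluster_counts)
    df = Counter(c for rolled in cluster_counts.values() for c in rolled)

    ranked = {}
    for user, rolled in cluster_counts.items():
        total = sum(rolled.values())
        scores = {
            # tf * smoothed idf (one common variant, assumed here)
            c: (n / total) * (math.log((1 + n_users) / (1 + df[c])) + 1)
            for c, n in rolled.items()
        }
        ranked[user] = sorted(scores.items(), key=lambda kv: kv[1], reverse=True)
    return ranked

event_to_cluster = {
    "product viewed": "browsing",
    "search performed": "browsing",
    "purchase completed": "buying",
    "review written": "advocacy",
}
user_event_counts = {
    "u1": {"product viewed": 10, "purchase completed": 1},
    "u2": {"product viewed": 8, "search performed": 4},
    "u3": {"product viewed": 6, "review written": 4},
}
ranked = rank_clusters(user_event_counts, event_to_cluster)
print(ranked["u3"])  # "advocacy" ranks above "browsing" for u3
```

User u3 has more raw browsing events than advocacy events, but because every user browses, the weighting puts "advocacy" first. The second entry in each list is the user's second-best cluster.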

All of this gives clusters with intuitive definitions and flexible membership criteria to aid introspection (you can see a typical basic output in the table above). Also, those intuitive definitions help when you calculate success metrics (click rates, purchase rates, etc.) for each cluster - you can easily rank clusters based on the success metrics, identify intervention possibilities based on cluster events, and target those interventions based on cluster membership.
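As a rough illustration of ranking clusters by a success metric, the sketch below computes a purchase rate for each cluster from each user's best cluster. The data and names are made up, and using only the best cluster is a simplification.

```python
from collections import defaultdict

def rate_by_cluster(primary_cluster, converted):
    """primary_cluster: {user: cluster}; converted: set of users who hit the metric.
    Returns [(cluster, rate, n_users), ...] sorted by rate, highest first."""
    totals, hits = defaultdict(int), defaultdict(int)
    for user, cluster in primary_cluster.items():
        totals[cluster] += 1
        hits[cluster] += user in converted  # bool counts as 0 or 1
    return sorted(
        ((c, hits[c] / totals[c], totals[c]) for c in totals),
        key=lambda row: row[1],
        reverse=True,
    )

primary_cluster = {
    "u1": "browsing", "u2": "browsing",
    "u3": "advocacy",
    "u4": "buying", "u5": "buying",
}
purchased = {"u3", "u4"}
ranking = rate_by_cluster(primary_cluster, purchased)
for cluster, rate, n in ranking:
    print(f"{cluster}: {rate:.0%} of {n} users")
```

Keeping the user count next to each rate helps avoid over-reading a high rate from a very small cluster.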
