Scalable Event-Based Clustering for User Segmentation
Schaun Wheeler

There are generally limited options for clustering large numbers of records. A typical Aampe customer has up to 400 app events ("product viewed", "purchase completed", etc.) instrumented, for sometimes as many as 100 million end users. Clustering a few hundred features is simple, even with 100 million records - something like mini-batch k-means can do the trick.

Clustering records (grouping those 100 million users based on complete-profile similarity) gets more complicated.

Many clustering algorithms need to compute pairwise distances between records - a cost that grows quadratically, so even 1 million records means roughly half a trillion pairs. That's why my current favorite clustering approach for event data involves:

  1. An event-event distance matrix based on temporal proximities.
  2. Hierarchical clustering to group events.
  3. Assignment of users to event clusters based on event activation.

These days, I calculate the distance matrix as a modified Jaccard distance by picking temporal bounds - I count toward the intersection any two events that happen within 24 hours but not within 1 hour of each other. The lower bound filters out event relationships that are just part of an automated workflow - for example, checkout started, payment selected, address selected, purchase, and checkout completed tend to be very tightly connected, because once you start a checkout, you're likely to do the rest of the flow. I omit those relationships because I'm more interested in event relationships across sessions.
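The distance computation can be sketched as follows. The column names, and the choice of total occurrences of either event as the Jaccard union, are my assumptions here - and the brute-force pairwise scan illustrates the definition rather than a production implementation:

```python
from itertools import combinations

import pandas as pd

def event_distance_matrix(df, lower="1h", upper="24h"):
    """Modified Jaccard distance between event types.

    A pair of occurrences counts toward the intersection when the two
    events fire (for the same user) within `upper` of each other but
    not within `lower`. Using total occurrences of either event as the
    union is an assumption made for this sketch.
    """
    lo, hi = pd.Timedelta(lower), pd.Timedelta(upper)
    events = sorted(df["event"].unique())
    totals = df["event"].value_counts()
    # timestamps per (user, event), so comparisons stay within a user
    ts = df.groupby(["user", "event"])["timestamp"].apply(list).to_dict()
    dist = pd.DataFrame(1.0, index=events, columns=events)
    for a, b in combinations(events, 2):
        inter = 0
        for user in df["user"].unique():
            for t1 in ts.get((user, a), []):
                for t2 in ts.get((user, b), []):
                    # inside the 24h window, outside the 1h floor
                    if lo < abs(t1 - t2) <= hi:
                        inter += 1
        d = 1.0 - inter / (totals[a] + totals[b])
        dist.loc[a, b] = dist.loc[b, a] = d
    for e in events:
        dist.loc[e, e] = 0.0
    return dist
```

Because the matrix is event-by-event rather than user-by-user, it stays small (at most 400 × 400) no matter how many users there are.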

I've always preferred hierarchical clustering because it's versatile and introspectable.

I can choose however many clusters I need - and switch to a different number of clusters quickly and even interactively if I want - but I can also easily look at higher-level clusters to aid in interpretation.

After that, I assign users to clusters based on the number of times they've done the events that make up each cluster. I put those event counts through a tf-idf transformation so that clusters built from very common events don't look more popular than they really are. Of course, this means each user will be a member of several clusters. The thing is: that's true of most clustering approaches. It's just that most clustering approaches hide those multiple associations - they pick cluster centroids, measure each record's distance from each centroid, and assign each record to the cluster of its closest centroid. With this approach, I can also get each user's second-best, third-best, etc. cluster if I need them.
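A minimal sketch of that membership step - the smoothed idf formula and the column names are illustrative assumptions on my part:

```python
import numpy as np
import pandas as pd

def assign_users(df, event_cluster, top_k=3):
    """Score each user against each event cluster via tf-idf.

    `event_cluster` maps event name -> cluster id. The idf term keeps
    clusters built from very common events from dominating every
    user's profile; the smoothed formula below is one common variant.
    """
    clustered = df.assign(cluster=df["event"].map(event_cluster))
    counts = clustered.pivot_table(index="user", columns="cluster",
                                   values="event", aggfunc="count",
                                   fill_value=0)
    tf = counts.div(counts.sum(axis=1), axis=0)  # share of user's activity
    n_users = len(counts)
    # smoothed inverse document frequency, computed over users
    idf = np.log((1 + n_users) / (1 + counts.gt(0).sum(axis=0))) + 1
    scores = tf * idf
    # best, second-best, ... cluster for each user
    return scores.apply(
        lambda row: list(row.sort_values(ascending=False).head(top_k).index),
        axis=1)
```

The returned ranking is the "flexible membership" in action: position 0 is the user's primary cluster, and the runners-up are there whenever you need them.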

All of this gives clusters with intuitive definitions and flexible membership criteria to aid introspection (you can see a typical basic output in the table above). Also, those intuitive definitions help when you calculate success metrics (click rates, purchase rates, etc.) for each cluster - you can easily rank clusters based on the success metrics, identify intervention possibilities based on cluster events, and target those interventions based on cluster membership.
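The ranking step itself is just a grouped mean once each user carries a cluster label - the column names here ("purchased", etc.) are illustrative, not an actual schema:

```python
import pandas as pd

def rank_clusters(membership, outcomes, metric="purchased"):
    """Rank clusters by a per-user success metric.

    `membership` has one row per user with that user's (primary)
    cluster; `outcomes` has one row per user with 0/1 success flags.
    """
    merged = outcomes.merge(membership, on="user")
    return merged.groupby("cluster")[metric].mean().sort_values(ascending=False)
```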

(Attached PDF: Clustering user profiles using event proximity and hierarchical clustering)
