Jul 5, 2024
Schaun Wheeler

Scalable Event-Based Clustering for User Segmentation

There are generally limited options for clustering large numbers of records. A typical Aampe customer has up to 400 app events ("product viewed", "purchase completed", etc.) instrumented, sometimes for as many as 100 million end users. Clustering a few hundred features is simple, even with 100 million records - something like mini-batch k-means can do the trick.

Clustering records (grouping those 100 million users based on complete-profile similarity) gets more complicated.

Many clustering algorithms need to compute pairwise distances between records - even with 1 million records, that's roughly 500 billion pairs, which is no joke. That's why my current favorite clustering approach for event data involves:

  1. An event-event distance matrix based on temporal proximities.

  2. Hierarchical clustering to group events.

  3. Assignment of users to event clusters based on event activation.

These days, I calculate the distance matrix as a modified Jaccard distance with temporal bounds: I count the intersection of any two events that happen within 24 hours of each other, but not within 1 hour. The lower bound filters out event relationships that are just part of an automated workflow - for example, checkout started, payment selected, address selected, purchase, and checkout completed tend to be very tightly connected, because once you start a checkout you're likely to do the rest of the flow. I omit those relationships because I'm more interested in event relationships across sessions.
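
Here's roughly what that step looks like - a minimal sketch, not Aampe's production implementation. It assumes an `events` DataFrame with `user_id`, `event_name`, and `event_ts` (datetime) columns, and it interprets the "intersection" as users for whom the two events co-occur inside the 1-to-24-hour window; the pairwise loop is fine because there are only a few hundred distinct events.

```python
# Minimal sketch of a temporal-proximity distance matrix between event types.
# Assumes `events` has columns: user_id, event_name, event_ts (datetime64).
import itertools

import numpy as np
import pandas as pd

LOWER = np.timedelta64(1, "h")   # drop tightly-coupled workflow steps
UPPER = np.timedelta64(24, "h")  # temporal proximity window


def event_distance_matrix(events: pd.DataFrame) -> pd.DataFrame:
    names = sorted(events["event_name"].unique())
    # Set of users who performed each event at least once.
    users_by_event = events.groupby("event_name")["user_id"].agg(set)
    dist = pd.DataFrame(1.0, index=names, columns=names)

    # Loop over event pairs; with a few hundred events this is cheap,
    # so the code favors clarity over speed.
    for a, b in itertools.combinations(names, 2):
        shared = users_by_event[a] & users_by_event[b]
        union = len(users_by_event[a] | users_by_event[b])
        both = events[events["event_name"].isin([a, b])
                      & events["user_id"].isin(shared)]
        inter = 0
        for _, grp in both.groupby("user_id"):
            ts_a = grp.loc[grp["event_name"] == a, "event_ts"].to_numpy()
            ts_b = grp.loc[grp["event_name"] == b, "event_ts"].to_numpy()
            gaps = np.abs(ts_a[:, None] - ts_b[None, :])
            # Count the user if any occurrence pair falls in the (1h, 24h] window.
            if ((gaps > LOWER) & (gaps <= UPPER)).any():
                inter += 1
        dist.loc[a, b] = dist.loc[b, a] = (1.0 - inter / union) if union else 1.0

    for name in names:
        dist.loc[name, name] = 0.0
    return dist
```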

I've always preferred hierarchical clustering because it's versatile and introspectable.

I can choose however many clusters I need - and switch to a different number of clusters quickly and even interactively if I want - but I can also easily look at higher-level clusters to aid in interpretation.
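
For the clustering step itself, here's a sketch using SciPy's agglomerative tools on the precomputed distance matrix. Average linkage and the cluster count are illustrative choices, not fixed parts of the approach - the point is that cutting the same tree at a different number of clusters is cheap.

```python
# Sketch: hierarchical clustering of events from a precomputed distance matrix.
import pandas as pd
from scipy.cluster.hierarchy import fcluster, linkage
from scipy.spatial.distance import squareform


def cluster_events(dist: pd.DataFrame, n_clusters: int = 12) -> pd.Series:
    # linkage() wants a condensed (upper-triangle) distance vector.
    condensed = squareform(dist.to_numpy(), checks=False)
    tree = linkage(condensed, method="average")
    # Re-cutting the tree at a different n_clusters reuses the same linkage,
    # which is what makes the cluster count easy to change interactively.
    labels = fcluster(tree, t=n_clusters, criterion="maxclust")
    return pd.Series(labels, index=dist.index, name="event_cluster")
```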

After that, I assign users to clusters based on the number of times they've done the events that make up each cluster. I put those event counts through a tf-idf transformation to prevent clusters with very common events from looking like they are more popular. Of course, this means that each user will be a member of several clusters. The thing is: that's true of most clustering approaches. It's just that most clustering approaches hide those multiple associations - they pick cluster centroids, measure each record's distance from each centroid, and assign the cluster of the closest centroid. But I can also get each user's second-best, third-best, etc. cluster if I need them.
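
A sketch of that assignment step, assuming a user-by-event count matrix and using scikit-learn's TfidfTransformer; summing tf-idf weights within each cluster is an illustrative scoring choice, not a claim about the exact production aggregation.

```python
# Sketch: score users against event clusters via tf-idf-weighted event counts.
import pandas as pd
from sklearn.feature_extraction.text import TfidfTransformer


def score_users(event_counts: pd.DataFrame, event_clusters: pd.Series) -> pd.DataFrame:
    """event_counts: rows = users, columns = event names, values = raw counts.
    event_clusters: event name -> cluster label (output of the step above)."""
    weights = pd.DataFrame(
        TfidfTransformer().fit_transform(event_counts).toarray(),
        index=event_counts.index,
        columns=event_counts.columns,
    )
    # Sum each user's tf-idf weight over the events in each cluster.
    scores = weights.T.groupby(event_clusters).sum().T
    # Keep the best cluster handy, but the full score matrix preserves the
    # second-best, third-best, etc. memberships as well.
    scores["top_cluster"] = scores.idxmax(axis=1)
    return scores
```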

All of this gives clusters with intuitive definitions and flexible membership criteria to aid introspection (you can see a typical basic output in the table above). Also, those intuitive definitions help when you calculate success metrics (click rates, purchase rates, etc.) for each cluster - you can easily rank clusters based on the success metrics, identify intervention possibilities based on cluster events, and target those interventions based on cluster membership.
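
As a sketch of that last ranking step, assuming you have a per-user 0/1 success flag (the `outcomes` series here is hypothetical) and using each user's top cluster as the membership rule:

```python
# Sketch: rank clusters by a success metric such as purchase rate.
import pandas as pd


def rank_clusters(scores: pd.DataFrame, outcomes: pd.Series) -> pd.Series:
    """scores: output of score_users (includes a top_cluster column).
    outcomes: user_id -> 0/1 success flag, e.g. purchased in the last 30 days."""
    merged = scores[["top_cluster"]].join(outcomes.rename("success"))
    # Success rate per cluster, highest first - a quick way to spot which
    # clusters are worth targeting with an intervention.
    return merged.groupby("top_cluster")["success"].mean().sort_values(ascending=False)
```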
