People sometimes ask whether our system is a kind of multi-armed bandit. It’s not. But that’s not a bad place to start if you want a familiar reference point.
Our semantic-associative agents use the same basic intuition: take actions, observe outcomes, and update preferences. But two key differences make this something else entirely:
Multi-dimensional action space
In a typical bandit problem, the agent chooses from a flat set of discrete actions—pull arm A, B, or C. Each action is assumed to be atomic and independent. Even when bandits are extended into contextual or combinatorial forms, they still often treat each action as a point in a single, unified decision space.
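For concreteness, here is what that flat setting looks like as code: a minimal epsilon-greedy bandit over three atomic arms with running-average value estimates. The arm names, reward probabilities, and update rule are purely illustrative, not anything from our system.

```python
import random

# A minimal sketch of the flat bandit setting: a fixed set of atomic arms,
# an epsilon-greedy policy, and running-average value estimates per arm.
arms = ["A", "B", "C"]
counts = {a: 0 for a in arms}
values = {a: 0.0 for a in arms}
epsilon = 0.1
true_rates = {"A": 0.2, "B": 0.5, "C": 0.3}  # simulated environment, illustrative only

def choose_arm():
    # Explore with probability epsilon, otherwise exploit the current best estimate.
    if random.random() < epsilon:
        return random.choice(arms)
    return max(arms, key=lambda a: values[a])

def update(arm, reward):
    # Incremental running average of rewards observed for this arm.
    counts[arm] += 1
    values[arm] += (reward - values[arm]) / counts[arm]

for _ in range(1000):
    arm = choose_arm()
    reward = 1.0 if random.random() < true_rates[arm] else 0.0
    update(arm, reward)

print(max(arms, key=lambda a: values[a]))  # converges toward "B" in this toy setup
```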
Real-world decision-making—especially in applications like customer engagement—isn’t like that. You’re not just choosing “an action.” You’re selecting a profile made up of choices across several intersecting dimensions: time of day, day of week, message channel, content theme, offer type, incentive level, tone, subject line, etc. Each of these is its own action set, and the agent must learn how these dimensions interact—both with each other and with user behavior. The task isn’t just to find the best arm, but to learn a combinatorial space of micro-preferences and then select a coherent, deliverable action bundle that fits that user.
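To make the contrast concrete, here is a rough sketch of a profile-style action space. The dimension names, per-option scores, pairwise interaction terms, and brute-force enumeration are all assumptions made for illustration; a real space would be larger and would need search or sampling rather than enumeration.

```python
from itertools import product

# Hypothetical decision dimensions; each is its own action set.
dimensions = {
    "send_hour": ["morning", "evening"],
    "channel": ["email", "push"],
    "theme": ["new_arrivals", "loyalty"],
    "incentive": ["none", "10_percent"],
}

# Learned micro-preferences for one user: a score per (dimension, option),
# plus pairwise interaction terms between options from different dimensions.
option_scores = {("send_hour", "evening"): 0.4, ("channel", "push"): 0.2}
interaction_scores = {(("channel", "push"), ("send_hour", "evening")): 0.3}

def score_profile(profile):
    # Sum per-option scores and any learned cross-dimension interactions.
    score = sum(option_scores.get((d, o), 0.0) for d, o in profile.items())
    opts = list(profile.items())
    for i in range(len(opts)):
        for j in range(i + 1, len(opts)):
            score += interaction_scores.get((opts[i], opts[j]), 0.0)
            score += interaction_scores.get((opts[j], opts[i]), 0.0)
    return score

def best_profile():
    # Enumerate the (small) combinatorial space and pick a coherent bundle.
    candidates = [dict(zip(dimensions, combo)) for combo in product(*dimensions.values())]
    return max(candidates, key=score_profile)

print(best_profile())
```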
Non-ergodic learning
Most bandit systems assume some form of ergodicity—the idea that statistical insights gained from one user’s behavior can generalize to another. In an ergodic system, learning can be pooled: we assume that averages across time and averages across the population converge. That makes for efficient learning, especially when individual data is sparse.
But user behavior in domains like messaging or content interaction is not ergodic. People differ—not just in preferences, but in responsiveness, habits, intent, timing, and attention. Treating these differences as noise and trying to learn a global average flattens signal that actually matters. Our agents treat each user as their own environment. They don’t generalize across users. They build up individualized models based solely on that user’s interaction history, which lets them preserve and act on genuine behavioral variance instead of averaging it away.
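A minimal sketch of that per-user stance, assuming a hypothetical UserAgent class: each user gets an independent model fed only by that user's interaction history, and no statistics are pooled or shared across users.

```python
from collections import defaultdict

class UserAgent:
    """Holds preference estimates for exactly one user."""

    def __init__(self):
        self.counts = defaultdict(int)
        self.values = defaultdict(float)

    def update(self, action_key, reward):
        # Running average over this user's own outcomes only.
        self.counts[action_key] += 1
        self.values[action_key] += (reward - self.values[action_key]) / self.counts[action_key]

    def preference(self, action_key):
        return self.values[action_key]

# One agent per user; nothing learned for user_a influences user_b.
agents = defaultdict(UserAgent)
agents["user_a"].update(("channel", "push"), 1.0)
agents["user_b"].update(("channel", "push"), 0.0)
assert agents["user_a"].preference(("channel", "push")) != agents["user_b"].preference(("channel", "push"))
```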
So while it’s tempting to think of this as a fancy bandit setup, that framing misses what’s actually happening. It’s not a variant—it’s a structurally different approach. Bandits are a good metaphor to start with, but the differences are architectural, not cosmetic.
And to be clear: none of this depends on LLMs. An LLM is just an actor—it takes context and produces plausible outputs. Our learning agents run upstream of that. They’re responsible for producing the right context in the first place, based on what they’ve learned about how a particular user responds to different combinations of actions. That context can then drive the LLM, or be used to select from a content library indexed to the same profile space.
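Here is a sketch of that separation, with hypothetical names throughout: the learned per-user profile becomes either a set of constraints for an LLM prompt or a filter over a content library indexed on the same dimensions. Neither path is prescribed by the agent itself.

```python
def build_context(user_id):
    # Stand-in for the learned per-user model; in practice these values would
    # come from that user's individualized preference estimates.
    return {"send_hour": "evening", "channel": "push", "theme": "loyalty", "tone": "casual"}

def to_prompt(context):
    # Option 1: hand the learned context to an LLM as generation constraints.
    return (
        f"Write a {context['tone']} {context['channel']} message on the "
        f"'{context['theme']}' theme, suitable for an {context['send_hour']} send."
    )

# Option 2: a pre-approved content library indexed to the same profile space.
content_library = [
    {"theme": "loyalty", "tone": "casual", "channel": "push", "body": "Thanks for sticking with us..."},
    {"theme": "new_arrivals", "tone": "formal", "channel": "email", "body": "Introducing our latest..."},
]

def select_content(context):
    # Keep entries whose indexed fields match the learned profile.
    matches = [c for c in content_library
               if all(c.get(k) == v for k, v in context.items() if k in c)]
    return matches[0] if matches else None

ctx = build_context("user_a")
print(to_prompt(ctx))
print(select_content(ctx))
```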