Contextual bandits and how learning systems actually behave

Jan 14, 2026
Schaun Wheeler

Contextual bandits are a great idea in general. It’s just that they’re usually implemented with several specific, poor design decisions.

Contextual bandits are one of the workhorses of modern decisioning. They show up in marketing, recommendations, ranking, and experimentation systems because they solve a simple problem: you have several things you could do, you do not know which is best yet, and you only get partial feedback.

That broad usefulness has an odd side effect. People talk about “contextual bandits” as if they are a product category, when they are really a family of decision rules. A lot of systems that feel very different share the same bandit skeleton. A lot of systems that get marketed with bandit language behave nothing alike once you look at what they learn and how they keep that learning.

Aampe belongs in the bandit lineage. It also diverges in the places that determine whether the system behaves like aggregate optimization or like individualized learning.

Bandits in one paragraph

A contextual bandit is a policy that repeats a loop: it observes context, chooses an action from a set of possible actions, later observes some outcome, and finally updates how it will choose next time.

That is the whole concept. Everything that matters in practice sits inside the words “context,” “action,” “outcome,” and “update.”

In many production deployments inside marketing tools, those terms get filled in like this.

  • Context: a feature vector built from user attributes and recent behavior.

  • Action: a concrete choice such as channel, time, frequency, offer, or a specific creative.

  • Outcome: a sparse event such as click, conversion, or revenue within a window.

  • Update: a rule that shifts traffic toward what increases expected reward for users with similar features.

This design delivers steady improvement at the population level. It allocates attention toward options that win on average for a slice of users defined by the feature set.
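
To make the loop concrete, here is a minimal sketch of that production-style setup, assuming made-up bucket names, a three-channel action set, and a toy click model: context is reduced to a coarse bucket, the action is a channel, the outcome is a sparse click, and the update is epsilon-greedy over per-bucket mean reward.

    import random
    from collections import defaultdict

    ACTIONS = ["push", "email", "in_app"]
    EPSILON = 0.1  # fraction of decisions reserved for exploration

    counts = defaultdict(int)     # (bucket, action) -> times tried
    rewards = defaultdict(float)  # (bucket, action) -> summed observed reward

    def choose(bucket):
        """Observe context (the bucket) and choose an action."""
        if random.random() < EPSILON:
            return random.choice(ACTIONS)  # explore
        # Exploit: pick the action with the best observed mean reward for this bucket.
        return max(ACTIONS, key=lambda a: rewards[(bucket, a)] / counts[(bucket, a)]
                   if counts[(bucket, a)] else 0.0)

    def update(bucket, action, reward):
        """Shift future traffic toward what paid off for users in the same bucket."""
        counts[(bucket, action)] += 1
        rewards[(bucket, action)] += reward

    def simulate_click(bucket, action):
        """Toy outcome model: 'new' users favor push, 'dormant' users favor email."""
        best = {"new": "push", "dormant": "email"}[bucket]
        return 1.0 if random.random() < (0.12 if action == best else 0.04) else 0.0

    for _ in range(10_000):
        bucket = random.choice(["new", "dormant"])  # observe context
        action = choose(bucket)                     # choose an action
        reward = simulate_click(bucket, action)     # observe an outcome
        update(bucket, action, reward)              # update the policy

Nothing in this loop remembers any individual. It only remembers buckets, which is the point the next section turns on.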

Do contextual bandits “personalize”?

Conditioning on user features is often called personalization. That label is technically defensible and practically misleading.

The stronger claim people usually mean is individual learning: the system figures out what works for a specific person, retains that knowledge, and uses it even when the population-level optimum points elsewhere. A standard contextual bandit policy does not guarantee that behavior because it is solving a different objective. It is maximizing expected reward. If option B wins for most users in a context bucket, the policy will drift toward B even if a particular user has repeatedly responded better to A. That is what it means to optimize expected value under uncertainty.
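
A toy example with invented numbers makes the drift visible. Suppose option B wins for the bucket as a whole, while one user (call them user_42) has repeatedly responded better to A. A policy that scores arms with bucket-level estimates keeps choosing B for that user:

    # Illustrative numbers only.
    bucket_history = {
        "A": {"sends": 1000, "clicks": 40},  # 4% across the bucket
        "B": {"sends": 1000, "clicks": 70},  # 7% across the bucket
    }
    user_42_history = {
        "A": {"sends": 6, "clicks": 3},      # 50% for this particular user
        "B": {"sends": 6, "clicks": 0},      # 0% for this particular user
    }

    def mean_reward(history, arm):
        return history[arm]["clicks"] / history[arm]["sends"]

    # Bucket-level scoring sends user_42 option B; their own history points to A.
    print(max(bucket_history, key=lambda a: mean_reward(bucket_history, a)))    # B
    print(max(user_42_history, key=lambda a: mean_reward(user_42_history, a)))  # A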

That distinction is not academic. It changes the feel of the system. Aggregate optimization makes decisions that look reasonable in dashboards but can be off-putting, and therefore counterproductive in the long run, at the individual level.

“Creative” and other dead-end dimensions

A lot of systems try to keep the action space manageable by decomposing decisions into dimensions: channel, time, offer, creative. This is a good instinct, but it only works if the dimensions carry transferable information.

“Creative” usually does not. Treating each asset as an atomic arm forces the learner to relearn every time copy refreshes, a new variant launches, localization changes, or personalization templates shift. The model is always chasing a moving target - it can never accumulate stable knowledge because the arms are not stable. A system that can generalize needs a different representation of action.
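
A small illustration, with hypothetical creative IDs and invented counts: when learning is keyed to atomic arms, a routine copy refresh orphans the accumulated evidence and the replacement starts cold.

    # Weeks of traffic accumulated against an arm ID...
    arm_stats = {"creative_17": {"sends": 5000, "clicks": 300}}

    # ...then the catalog rotates: creative_17 retired, creative_24 shipped.
    live_creatives = ["creative_24"]

    for creative in live_creatives:
        stats = arm_stats.get(creative, {"sends": 0, "clicks": 0})
        print(creative, stats)  # creative_24 {'sends': 0, 'clicks': 0} -- cold start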

Aampe’s core move is to learn over meaning, per person. Aampe agents still make decisions under uncertainty and still balance exploration against exploitation, as most bandit architectures do. The difference is what the system is trying to learn.

  • The durable object is the person.

  • The actions are represented in a semantic space.

  • The update loop is designed to attribute incremental change, not just record what happened afterward.

Those three choices force different mechanics.

Per-person persistence across goals

Many decisioning systems scope learning to a use case. Onboarding has one learner. Winback has another. Monetization has a third. Even if they share infrastructure, their learned preferences often do not transfer cleanly when the objective changes.

Aampe agents treat the individual as the persistent unit. Onboarding, retention, monetization, and reactivation are different phases, not different brains. Evidence accumulates about the same user across time. When the goal changes, the system does not start from scratch. It starts from the user model it already built.
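
A minimal sketch of the difference, using hypothetical keys rather than Aampe’s actual storage: keying learners by use case splits evidence about a person into silos, while keying by the person means a goal change swaps the objective, not the accumulated evidence.

    # Common pattern: one learner per use case, so the same person is learned twice.
    siloed = {("onboarding", "user_42"): {"casual_tone": 0.8}}
    print(siloed.get(("winback", "user_42"), {}))  # {} -- winback starts from scratch

    # Person as the persistent unit: every phase reads and extends the same record.
    per_person = {"user_42": {"casual_tone": 0.8}}       # learned during onboarding
    per_person["user_42"]["discount_framing"] = 0.3      # added during winback
    print(per_person["user_42"])  # {'casual_tone': 0.8, 'discount_framing': 0.3}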

Meaning-level actions that transfer

Instead of learning “Creative 17 beats Creative 23,” Aampe learns over attributes such as value proposition, incentive depth, framing, tone, urgency, and product emphasis. New copy lands in that same space.

If you ship a new message that uses the same value proposition and tone as a previously effective message, the system can treat it as a close cousin on day one. It does not require weeks of traffic to discover that the user tends to respond to that kind of framing. If you ship a message with a different tone, the system can explore it deliberately, using what it already knows about that person’s tolerance for novelty.

This is why “creative” is not a meaningful dimension. It is an inventory label. The meaning is what transfers.
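
A sketch of what that buys, with invented attribute names and weights: if actions are represented by their attributes, a per-person model can score a brand-new message on day one, and a message with an unfamiliar tone shows up as a candidate for deliberate exploration rather than a blank slate.

    # What this user has responded to in the past, as weights over message attributes.
    user_weights = {"free_shipping_value_prop": 0.9, "friendly_tone": 0.6, "high_urgency": -0.4}

    def score(message_attributes, weights):
        return sum(weights.get(attr, 0.0) for attr in message_attributes)

    proven = {"free_shipping_value_prop", "friendly_tone"}            # known performer
    fresh = {"free_shipping_value_prop", "friendly_tone", "emoji"}    # shipped today
    unfamiliar = {"high_urgency", "formal_tone"}                      # new framing

    print(score(proven, user_weights))      # 1.5
    print(score(fresh, user_weights))       # 1.5 on day one -- the meaning transferred
    print(score(unfamiliar, user_weights))  # -0.4 -- explore deliberately, not blindly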

Updates that aim at incremental impact

Most lifecycle systems learn from events that happen after a message. That encourages correlation-based learning: users who got message X later did Y.

Aampe agents push the update loop toward episodic judgment. Each send is treated as an episode with a baseline expectation for that user. The system asks whether behavior shifted relative to what that user was likely to do anyway. It uses small behaviors as signal because those behaviors help establish the baseline and the deviation, even when a conversion is rare.

This is not a claim of perfect causality, of course. It’s a claim about what the system optimizes for: estimating incremental lift per person rather than simply chasing whichever option precedes conversions more often.
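
A minimal sketch of an episode-level update, assuming simple inputs (the user’s recent daily activity and the message’s attributes); the estimator is deliberately crude, because the point is which quantity gets credited: deviation from that user’s own baseline.

    def baseline_rate(recent_daily_events):
        # What this user was likely to do anyway, estimated from their own history.
        return sum(recent_daily_events) / len(recent_daily_events)

    def episode_lift(recent_daily_events, events_after_send):
        # How far behavior moved relative to that baseline in this episode.
        return events_after_send - baseline_rate(recent_daily_events)

    def credit_attributes(preferences, message_attributes, lift, learning_rate=0.1):
        # Small behaviors count as signal: the shift, not just a conversion, updates prefs.
        for attr in message_attributes:
            preferences[attr] = preferences.get(attr, 0.0) + learning_rate * lift
        return preferences

    prefs = {}
    recent = [2, 3, 1, 2, 2]                          # sessions per day before the send
    lift = episode_lift(recent, events_after_send=5)  # 5 sessions the day after: lift = 3.0
    credit_attributes(prefs, {"discount_framing", "playful_tone"}, lift)
    print(prefs)  # {'discount_framing': 0.3, 'playful_tone': 0.3} (order may vary)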

Borrowing population signal without collapsing into averages

Per-person learning starts sparse. Aampe uses population information as priors so early decisions are sensible. As evidence accumulates for an individual, personal parameters take over. The population is scaffolding. (This is known in the literature as a “hierarchical adaptive contextual bandit.” Most contextual bandits on the market aren’t that; HACBs are much harder to design.)
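
One simple way to picture the handoff, using a basic shrinkage estimator rather than Aampe’s actual model: the population rate acts like pseudo-observations, so early estimates lean on the prior and personal evidence takes over as it accumulates.

    def blended_rate(personal_successes, personal_trials, population_rate, prior_strength=20):
        # prior_strength behaves like pseudo-observations borrowed from the population.
        return ((population_rate * prior_strength + personal_successes)
                / (prior_strength + personal_trials))

    population_rate = 0.05  # 5% of similar users respond to this kind of message

    print(blended_rate(1, 2, population_rate))      # ~0.091: mostly the prior
    print(blended_rate(30, 60, population_rate))    # ~0.388: personal evidence taking over
    print(blended_rate(300, 600, population_rate))  # ~0.486: effectively this user's own rate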

Population tendencies are where many contextual bandit deployments stop. They remain anchored to context buckets because that is where data density lives and where operational complexity stays manageable. We specifically architected Aampe agents to move past that stage.

What to take away

Aampe is not a contextual bandit. The agents are built on core ideas from the bandit literature. They adapt over time like any bandit system, but they behave differently from most bandit configurations because they learn over meaning, store that learning per person, and update based on incremental impact.

Once you define bandits as decision rules rather than as a product category, the landscape becomes easier to reason about. Does the system learn stable things about options in aggregate, or stable things about people that carry forward when options and objectives change? That’s the core question to ask.