LLM "alignment" refers to reducing the gap between LLM output and user expectations. I came across a nice technical overview (LLMs from scratch) comparing the two main methods for LLM alignment: Reinforcement Learning from Human Feedback (RLHF) and Direct Preference Optimization (DPO).
My conclusion: both methods show how talk about AI alignment is really missing the boat.
The difference between RLHF and DPO is technical rather than substantial. RLHF trains a separate reward model from human feedback about which outputs are better, then fine-tunes the LLM against that reward model; DPO skips the reward model and tunes the LLM directly on the preference pairs. Both methods take individual user preferences as input, but output a model that tries to cater to the average preferences of the whole user base.
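To make the "technical rather than substantial" point concrete, here is a minimal sketch of the DPO objective in Python with PyTorch. The per-sequence log-probabilities and the toy numbers are assumptions for illustration, not code from the overview linked above.

```python
# A minimal sketch of the DPO loss, assuming per-sequence log-probabilities
# have already been computed for the "chosen" and "rejected" responses under
# both the policy being trained and a frozen reference model.
import torch
import torch.nn.functional as F

def dpo_loss(policy_chosen_logps, policy_rejected_logps,
             ref_chosen_logps, ref_rejected_logps, beta=0.1):
    """Direct Preference Optimization loss for a batch of preference pairs."""
    # How much more the policy favors the chosen response than the reference does
    chosen_ratio = policy_chosen_logps - ref_chosen_logps
    # The same quantity for the rejected response
    rejected_ratio = policy_rejected_logps - ref_rejected_logps
    # Widen the margin between the two ratios, scaled by beta
    return -F.logsigmoid(beta * (chosen_ratio - rejected_ratio)).mean()

# Toy usage with made-up log-probabilities for a batch of two preference pairs
loss = dpo_loss(torch.tensor([-12.0, -9.5]), torch.tensor([-14.0, -11.0]),
                torch.tensor([-12.5, -9.8]), torch.tensor([-13.5, -10.5]))
print(loss)
```

Notice that nothing in the loss identifies whose preferences these were; every pair is just another gradient step toward the average.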
In human cognition, "procedural" memory is the mechanism behind automatic actions like riding a bike, typing, or speaking. That's why people can run their mouths without really saying anything—our brain can string together words coherently just by remembering what words tend to follow other words. LLMs mimic procedural memory and only procedural memory. That, in a nutshell, is the alignment problem.
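As a toy illustration of that "what tends to follow what" kind of memory, here is a bigram babbler. The tiny corpus is made up, and a real LLM learns vastly richer statistics over tokens, but the mechanism is the same in spirit.

```python
# A toy sketch of "procedural" next-word memory: a bigram count model stands
# in for the far larger statistics an LLM learns during pretraining.
from collections import defaultdict
import random

corpus = "the cat sat on the mat and the cat slept on the mat".split()

# Remember which words tended to follow which
follows = defaultdict(list)
for prev, nxt in zip(corpus, corpus[1:]):
    follows[prev].append(nxt)

def babble(start, length=8):
    """String words together purely from what tended to follow what."""
    out = [start]
    for _ in range(length):
        options = follows.get(out[-1])
        if not options:
            break
        out.append(random.choice(options))
    return " ".join(out)

print(babble("the"))
```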
Procedural memory is different from other kinds of implicit memory: associative (Pavlov's dogs), non-associative (people who live near train tracks learn to ignore the sound of passing trains), and priming (repeated exposure to a brand's advertising makes you more willing to give that brand a try).
Neither RLHF nor DPO does anything to represent those other kinds of memory.
Behavioral agents, on the other hand, remember (see the sketch after this list):
- which value propositions tend to precede a purchase (associative)
- which messages are syntactically similar, and therefore prone to habituation (non-associative)
- which actions tend to feed into a later purchase, and are therefore early indicators of purchase intent (priming)
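Here is a hypothetical sketch of what such a memory might look like. Every name in it (BehavioralMemory, record_purchase, and so on) is invented for illustration rather than taken from any particular agent framework.

```python
# A hypothetical store for the three non-procedural memory types a behavioral
# agent might track; all names here are invented for illustration.
from collections import Counter
from dataclasses import dataclass, field

@dataclass
class BehavioralMemory:
    # Associative: value propositions seen shortly before a purchase
    pre_purchase_props: Counter = field(default_factory=Counter)
    # Non-associative: how often near-identical messages were repeated (habituation risk)
    message_repeats: Counter = field(default_factory=Counter)
    # Priming: actions that tended to precede later purchase behavior
    purchase_precursors: Counter = field(default_factory=Counter)

    def record_message(self, message_template):
        self.message_repeats[message_template] += 1      # non-associative

    def record_purchase(self, recent_props, recent_actions):
        for prop in recent_props:
            self.pre_purchase_props[prop] += 1           # associative
        for action in recent_actions:
            self.purchase_precursors[action] += 1        # priming

mem = BehavioralMemory()
mem.record_message("10% off today")
mem.record_purchase(recent_props=["free returns"], recent_actions=["viewed reviews"])
print(mem.pre_purchase_props.most_common(1))
```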
LLMs can ingest session-specific background information. OpenAI calls this a "system" prompt—the user never sees it, but the LLM acts on it.
Agentic learning produces tagged weights that represent lessons learned, and these tagged weights can be fed into a system prompt to give an LLM information not available from procedural memory. This also allows the LLM to treat users as individuals rather than as ingredients in an aggregate.
When a user asks an LLM what key features to look for when buying a new laptop, a primed LLM doesn't have to fall back on the lesson that users in general prefer a user-friendly response to a technical one; it can know that this particular user, in this particular session, prefers a technical response, even if everyone else in the world prefers a user-friendly one.
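As a rough sketch of the mechanics, the tagged lessons can simply be serialized into the system message of an ordinary chat request. The lesson store and its tags below are hypothetical; the message format follows the standard system/user chat structure.

```python
# A hedged sketch of feeding agent-learned, per-user lessons into a system
# prompt. The lesson store and its tags are hypothetical.
lessons = {
    "user_123": {
        "response_style": "technical",   # learned for this user, this session
        "early_purchase_indicator": "reads spec sheets before asking",
    },
}

def build_messages(user_id, user_question):
    tagged = lessons.get(user_id, {})
    system_prompt = (
        "You are a shopping assistant. Session-specific lessons: "
        + "; ".join(f"{tag}={value}" for tag, value in tagged.items())
    )
    return [
        {"role": "system", "content": system_prompt},  # the user never sees this
        {"role": "user", "content": user_question},
    ]

print(build_messages("user_123", "What key features should I look for in a new laptop?"))
```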
Better LLM training can't solve the alignment problem because LLMs mimic only one out of many kinds of human cognitive capabilities. We get closer to alignment when we mimic more kinds of memory.