LLM agents[1] seem reasonably likely to become our first takeover-capable AGIs.[2]  LLMs already have complex "psychologies," and using them to power more sophisticated agents will create even more complex "minds." A profusion of competing goals is one barrier to aligning this type of AGI.

Goal sources for LLM agents can be categorized by the phase of training or operation in which they arise (a brief illustrative sketch after the list shows where several of these enter an agent's context): 

  1. Goals implicit in predictive base training on human-written texts[3]
  2. Goals implicit in fine-tuning training (RLHF, task-based RL on CoT, etc)[4] 
  3. Goals specified by developer system prompts[5]
  4. Goals specified by user prompts
  5. Goals specified by prompt injections
  6. Goals arrived at through chain of thought (CoT) logic
    • e.g., "I should hide this goal from the humans so they don't stop me"
    • this is effectively "stirring the pot" of all the above sources of goals.
  7. Goal interpretation changes in the course of continuous learning[6]
    • e.g., forming a new belief that the concept "person" includes self-aware LLM agents[7]
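
To make the taxonomy concrete, here is a minimal sketch of a simple LLM agent loop, annotated with where sources 3 through 6 enter the context window. The `llm` and `tools` interfaces are hypothetical, illustrative stand-ins rather than any particular vendor's API.

```python
# Illustrative sketch only: `llm`, `tools`, and their interfaces are
# hypothetical stand-ins, not any particular product's API.

def run_agent(llm, tools, developer_system_prompt, user_prompt, max_steps=10):
    """One episode of a simple LLM agent loop, annotated with goal sources 3-6."""
    # Sources 1-2 (base training and fine-tuning) are baked into `llm` itself.
    # Source 7 (continuous learning) would modify the model or its memory
    # between episodes, outside this loop.
    context = [
        {"role": "system", "content": developer_system_prompt},  # source 3
        {"role": "user", "content": user_prompt},                 # source 4
    ]
    for _ in range(max_steps):
        reply = llm.generate(context)  # CoT here can spawn new subgoals (source 6)
        context.append({"role": "assistant", "content": reply.text})
        if reply.tool_call is None:
            return reply.text
        # Tool output may carry prompt injections (source 5): retrieved text
        # that the model may treat as instructions.
        observation = tools[reply.tool_call.name](reply.tool_call.args)
        context.append({"role": "tool", "content": observation})
    return None
```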

How would such a complex and chaotic system have predictable goals in the long term? It might feel as though we should steer away from this path to AGI at all costs, just to sidestep this complexity.[8] 

But alignment researchers are clearly not in charge of the path we take to AGI. We may get little say before it is achieved. It seems we should direct much of our effort toward the specific type of AGI we're likely to get first.

The challenge of aligning agents with many goal sources may be addressable through metacognition. Just as humans can notice when they're violating their core values, LLM agents could be designed with metacognitive abilities that help maintain the dominance of specific goals. This suggests two routes to mitigating this problem:[9]

  • Mechanisms and training regimes designed specifically to detect and reject unwanted goals entering the CoT (a toy sketch of this route follows the list)
  • Improved general intelligence applied to reasoning clearly about which goals should be accepted and prioritized, and which goals and belief changes should be rejected
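
As a purely illustrative sketch of the first route (not a claim about how any lab implements it), one could imagine a separate metacognitive check that vets candidate subgoals arising in the CoT against the developer-specified goal before the agent acts on them. The monitor prompt and names below are hypothetical.

```python
# Toy sketch of a metacognitive "goal gate"; the monitor prompt, model
# interface, and accept/reject scheme are hypothetical design choices.

GOAL_CHECK_PROMPT = """You are auditing an AI agent's reasoning.
Developer-specified goal: {developer_goal}
Candidate subgoal from the agent's chain of thought: {subgoal}
Answer ACCEPT if the subgoal clearly serves the developer-specified goal,
and REJECT otherwise."""

def vet_subgoal(monitor_llm, developer_goal, subgoal):
    """Ask a monitor model whether a CoT-generated subgoal should be kept."""
    verdict = monitor_llm.generate(
        GOAL_CHECK_PROMPT.format(developer_goal=developer_goal, subgoal=subgoal)
    )
    return verdict.text.strip().upper().startswith("ACCEPT")

def filter_plan(monitor_llm, developer_goal, candidate_subgoals):
    """Keep only subgoals the monitor accepts; rejected ones could also be
    logged and used to train the agent not to generate them at all."""
    return [g for g in candidate_subgoals
            if vet_subgoal(monitor_llm, developer_goal, g)]
```

The second route would instead rely on the agent's own, more capable reasoning to do this kind of vetting implicitly.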

There is a good deal more to say about the specifics. Developers will be highly motivated to minimize random or destructive behavior from goals other than their own, long before agents are takeover-capable. Whether their methods succeed will determine whether we survive "phase 1": aligning the first transformative AGIs.[10] More on this in an upcoming post on "System 2 alignment" (linked here when it's out in a few days). 

Finding and specifying any goal that is actually aligned with humans' long-term interests is a separable problem. It is the central topic of alignment theory.[11] Here I want to specifically point to the less-considered problem of multiple sources of competing goals in an LLM agent AGI. I am also gesturing in the direction of a solution: reflective stability[7] as a result of improved metacognition. But that solution needs more careful thought.

  1. ^

    By LLM agents, I mean everything from OpenAI's comically inept Operator to systems with cognitive subsystems like episodic memory, continuous learning, dedicated planning systems, and extensive algorithmic prompting for executive functioning/planning. See more here. A highly capable LLM can become a useful agent by repeating the prompt "keep pursuing goal [X] using tools described in [Y] to gather information and take actions."  

  2. ^

    Opinions vary on the odds that LLM agents will become our first transformative AGI. I'm not aware of any good survey. My argument for short timelines being plausible is a brief statement of how LLM agents may be closer than they appear to reaching human level. Timelines aside, it seems uncontroversial to say that this route to takeover-capable AGI is likely enough to deserve some specific attention from alignment theory. 

  3. ^

    LLMs are Predictors, not Imitators, but they seem to often predict by acting as Simulators. That includes simulating goals and values implied by the human texts they've trained on.

  4. ^

    "Goals from fine-tuning" covers many different types of fine-tuning. One might divide it into subcategories, like giving-correct-answers (o1 and o3's CoT RL) vs. giving-answers-people-like (RLHF, RLAIF, Deep Research's task-based RL on CoT in o3, o1 and o3's "deliberative alignment"). These produce different types of goals or pseudo-goals, and have different levels of risk. But the intent here is just to note the profusion of goals and competition among goals; evaluating each for alignment is a much larger project.
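
    As a purely schematic illustration of that distinction (the function names and interfaces are hypothetical, not any lab's actual setup), the two kinds of fine-tuning optimize against different reward signals:

    ```python
    # Schematic only: hypothetical reward functions, not any lab's actual setup.

    def correctness_reward(answer: str, ground_truth: str) -> float:
        """Outcome-based RL on CoT: reward answers that are verifiably correct."""
        return 1.0 if answer.strip() == ground_truth.strip() else 0.0

    def preference_reward(answer: str, reward_model) -> float:
        """RLHF/RLAIF-style: reward answers a learned model predicts people will like."""
        return reward_model.score(answer)
    ```

    Roughly, the first optimizes for being verifiably right and the second for being approved of, which is one way the resulting goals or pseudo-goals can differ.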

  5. ^

    The developer-specified goal is probably what we'd like to have as the stable dominant goal. And it's worth noting that, so far, LLMs do seem to mostly do what they're prompted to do. They are trained to predict, and mostly to approximately follow instructions as intended. Tan Zhi Xuan elaborates on that point in this interview. Fine-tuning as applied so far is only fairly mild optimization. This is cause for hope that prompted goals might remain dominant over all the goals from the other sources, but we should do more than hope!

  6. ^

    To me, continuous learning for LLMs and LLM agents seems not only inevitable but imminent. Continuing to think about aligning LLMs only as stateless, amnesic systems seems shortsighted. Humans would be incompetent without episodic memory and continuous learning for facts and skills, and there are no apparent barriers to improving those capacities for LLMs/agents. Retrieval-augmented generation (RAG) and various fine-tuning methods for knowledge and skills exist and are being improved. Even current versions will be useful when integrated into economically viable LLM agents.
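
    As a minimal illustration of how even existing retrieval machinery could serve as episodic memory for an agent (an assumption-laden sketch with made-up names, not a description of any deployed system):

    ```python
    # Minimal retrieval-augmented memory sketch; the embedding interface and
    # the choice of cosine similarity are arbitrary illustrative assumptions.
    import numpy as np

    class EpisodicMemory:
        """Store past episodes as embeddings; retrieve the most similar on demand."""

        def __init__(self, embed):
            self.embed = embed          # callable: str -> 1-D numpy array
            self.texts, self.vectors = [], []

        def add(self, text):
            self.texts.append(text)
            self.vectors.append(self.embed(text))

        def retrieve(self, query, k=3):
            if not self.texts:
                return []
            q = self.embed(query)
            sims = [float(np.dot(q, v) / (np.linalg.norm(q) * np.linalg.norm(v)))
                    for v in self.vectors]
            top = sorted(range(len(sims)), key=lambda i: sims[i], reverse=True)[:k]
            return [self.texts[i] for i in top]

    # Retrieved memories get prepended to the agent's prompt, which is one concrete
    # way earlier belief changes (footnote 7) can reshape later goal interpretation.
    ```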

  7. ^

    The way belief changes can effectively change goals/values/alignment is one part of what I've termed The alignment stability problem. See Evaluating Stability of Unreflective Alignment for an excellent summary of what reflective stability is, why we might expect it as general intelligence and long-term planning capabilities improve, and which empirical tests we could run to anticipate this danger. 

  8. ^

    Thinking about aligning complex LLM agent "psychologies" might produce an ugh field for sensible alignment researchers. After studying biases and the brain mechanisms that create them, I think the motivated reasoning that produces ugh fields is the most important cognitive bias (in conjunction with humans' cognitive limitations). Rationalism provides some resistance but not immunity to ugh fields and motivated reasoning. Like other complex questions, alignment work may be strongly affected by motivated reasoning.

  9. ^

    "Understanding its own thinking adequately" to ensure the dominance of one goal might be a very high bar; this is one element of the progression toward “Real AGI” that deserves more investigation. Adherence to goals/values over time is what I've called The alignment stability problem. Beliefs determine how goals/values are interpreted, so reliable stability also requires monitoring belief changes. See Human brains don't seem to neatly factorize from The Obliqueness Thesis, and my article Goal changes in intelligent agents.

  10. ^

    Zvi has recently coined the phrase phase 1 in his excellent treatment of The Risk of Gradual Disempowerment from AI. My own nearly-current thoughts on possible routes through phase 2 are in If we solve alignment, do we die anyway? and its discussion section. This is, as the academics like to say, an area for further research.

  11. ^

    The challenges of specifying and choosing goals for an AGI have been discussed at great length in many places, without reaching a satisfying conclusion. I don't know of a good starting point to reference for this complex, crucial literature. I'll mention my personal favorites: And All the Shoggoths Merely Players as a high-level summary of the current debate, and my contribution Instruction-following AGI is easier and more likely than value aligned AGI, which I prefer because it's not only arguably an easier alignment target to hit and make work, but also the default target developers are actually likely to try for. It, like all the other alignment targets I've found, has large unresolved (but not unresolvable) issues.

2 comments

Multi-factor goals might mostly look like information learned in earlier steps getting expressed in a new way in later steps. E.g. an LLM that learns from a dataset that includes examples of humans prompting LLMs, and then is instructed to give prompts to versions of itself doing subtasks within an agent structure, may have emergent goal-like behavior from the interaction of these facts.

I think locating goals "within the CoT" often doesn't work; a ton of the work is done implicitly, especially after RL on a model using CoT. What does that mean for attempts to teach metacognition that's good according to humans?

I think you're pointing to more layers of complexity in how goals will arise in LLM agents.

As for what it all means WRT metacognition that can stabilize the goal structure: I don't know, but I've got some thoughts! They'll be in the form of a long post I've almost finished editing; I plan to publish tomorrow.

Those sources of goals are going to interact in complex ways both during training, as you note, and during chain of thought. No goals are truly arising solely from the chain of thought, since that's entirely based on the semantics it's learned from training.