Hierarchical Agency: A Missing Piece in AI Alignment

Jan_Kulveit

The most ambitious direction I've worked on in the past few years is a theory of hierarchical agency (as mentioned in the ACS announcement). Recently, I've become worried that the inferential distance between "the median of current AI safety discourse" and "what I and my colleagues at ACS work on" has become too large. After multiple attempts to write a good introduction to hierarchical agency ended up in perpetual draft form, I decided to try something else: explain it to Claude.

This is a lightly edited transcript of the first part of our conversation.

Claude
Shall we begin? Could you start by explaining what you mean by "hierarchical agency" and why you think it's important for AI safety?

Jan
Broadly, there is a pattern in the world where you can often see agents composed of other agents. Like, corporations and their departments, states and citiziens, and so on.

Claude
So you're looking at nested or layered structures of agency - where agents can contain or be made up of other agents. Let me test my understanding with some examples and their properties:

A corporation as an agent that can make decisions and take actions, but it's composed of departments (also agents) which are in turn composed of individual human agents?
States acting as agents while being composed of various bureaucracies and ultimately individual human officials and citizens?

Would biological examples also fit this pattern? Like how a human body could be seen as an agent composed of organs (which have their own regulatory systems) and cells (which make their own decisions about things like protein production)? I'm curious about what makes this perspective particularly relevant for AI safety

Jan
I hope to get to AI safety relevance later.
Yes, biological examples fit well - particularly multicellular organisms, which demonstrate emergent agency at multiple scales. To better define the boundaries of hierarchical agency, it might help to examine some edge cases and non-examples.

Claude
Let me probe the boundaries systematically by analyzing different types of systems:

A crowd of people in a mall - they're a collection of agents, but are they a cohesive higher-level agent? My guess is no, because there's no real coordination or unified decision-making?
A computer network - it's a connected system of parts, but do the individual computers act as agents in a meaningful way? And does the network itself have agency?
An ecosystem - while it's a system of living agents (animals, plants), would you consider the ecosystem itself an agent?
A social insect colony - this might be closer to your multicellular example, but I'd be curious where you see it fitting or not fitting?

Am I on the right track with these examples? What makes some collections of agents form a meaningful higher-level agent while others don't?

Jan
The question of what makes something a real superagent versus just a bunch of agents hanging out together is super interesting.
I don't have a satisfactory answer yet, but empirically, you can ask it makes sense to model the collective as an agent. I usually operationalize this using Dennett's three stances perspective - do you know these?

Claude
Yes! If I recall correctly, Dennett proposed three levels or stances for understanding/predicting behavior:

Physical stance - predicting based on physical laws
Design stance - predicting based on how something is designed/supposed to function
Intentional stance - predicting by attributing beliefs, desires, rationality (treating as an agent)

So, if I'm understanding you correctly, you're suggesting that what makes something a "higher-level agent" is whether taking the intentional stance towards the collective is actually useful/predictive?

Let me test this understanding:

An ant colony: Treating it as an agent with goals (survive, grow, protect queen) might be more useful than trying to predict its behavior just from individual ant interactions
A random crowd: Probably better predicted by understanding individual human behavior and physical crowd dynamics
A corporation: Often very useful to model as an agent with goals, strategies, beliefs about the market, etc. Is this the kind of distinction you mean?

Jan
You get it. In thinking about hierarchical agency, I'm interested in systems where there are at least two layers where it is useful to think about both layers as agentic, let's say a superagent composed of subagents.

Now, having this concept, I want to make a bold claim: I think a critical missing piece for AI safety and AI alignment is we lack a good mathematical formalism for describing these systems - at least as far as I know.

You would probably ask what I mean by 'good', so I'll spell it outright. My desiderata are

objects at different levels of the hierarchy are of the same type; for a negative example, formalism where on lower layer you have agents, they vote, and the result of aggregation is 'contract', which is different type than agent, is not good enough. In other words, the theory is scale-free.
I want something expressive enough to describe real-world situations;
The objects are type 'agent', i.e. somehow represent intentionality

Claude
Let me break this down to make sure I understand your criteria:

Type consistency across levels:
Could you confirm if this example captures what you mean:
Good: A formalism where both OpenAI (superagent) and employees (subagents) are represented by the same type of mathematical object
Not good: A formalism where individual workers are "agents" but their collective becomes a different mathematical type like "contract"
When you say "expressive enough to describe real-world situations" - could you give an example of the kind of real-world complexity this formalism should be able to capture? I'm imagining things like partial alignment between levels or conflicts between levels
For representing intentionality - does this mean the mathematical objects should somehow capture things like goals, beliefs, and decision-making? Similar to how we might model an agent in decision theory or game theory?

Also, I'm curious - have you seen formalisms that satisfy some but not all of these criteria?

Jan

Yes
Yes: I want to be able to describe situations like conflicts between layers, or interactions where the superagent gains agency at the expense of subagents.
Yes. For example, Scott Garrabrant has a nice stab roughly at this direction, called Cartesian Frames, but the mathematical objects involved are more suitable for representing actions than beliefs, goals or intentionality.

Actually, you have better overview of all mathematical formalisms invented: you read it all, so maybe you can try to come up with some of them and check if they fulfill the criteria?

Claude
Let me think about mathematical formalisms I'm aware of that might be relevant:

Game Theory:

Captures intentionality (preferences, strategies)
Can represent nested games
But: Usually different types at different levels (players vs coalitions vs games themselves)

Category Theory:

Good at compositionality and type consistency
Objects compose into objects of same type
But: Abstract away too much of the intentional content? Real world relevance?

Classical Multi-agent Systems:

Good at representing intentionality
Can model interactions
But: Usually flat structure, not hierarchical?

Markov Decision Processes:

Can capture decision-making
Can be nested (hierarchical MDPs)
But: More focused on actions/policies than beliefs/goals?

Economic Mechanism Design:

Models incentives and preferences
Can handle multiple layers of interaction
But: Different mathematical objects at different levels (agents vs mechanisms)

Am I missing any obvious formalisms? Would combining aspects of these get closer to what you're looking for?

Jan
Clearly related field is public choice. Less obvious, active inference

Claude
Let's explore these:

Public Choice Theory:

Deals with collective decision-making
Models how individual preferences aggregate into institutional decisions
Might offer insights about inter-level dynamics

But: Does it typically treat the collective level as the same type of agent as individuals?

Active Inference:

Models agents as minimizing free energy/surprise
Can handle hierarchical structures (hierarchical predictive coding)

Has a unified mathematical framework across levels

This seems potentially promising!

Let me try to understand why active inference might be relevant here:

It models both perception and action in terms of belief updates
Could potentially represent both subagents and superagents as systems trying to minimize surprise
Might capture interactions between levels as bidirectional prediction/control flows

Am I getting warm? Could you explain what aspects of these frameworks you find relevant to hierarchical agency?

Also, this makes me wonder: are you looking for a formalism that could capture how different levels of agency might have different models/beliefs about the world and each other? Like how a corporation might have a different "view" of reality than its individual employees?

Jan
Yes. You are actually pretty good at this.

What I like about active inference in this context is it tries to express everything as the same type - roughly "beliefs". Also, in its more neuro-inspired versions, assumes bidirectional control/prediction flows is what happens in human brain.

Yes: I want something which is expressive enough to allow different beliefs on different levels. Also I want something allowing to have "goal-oriented beliefs" or "goals" concerning other layers. For example, corporation having an instrumental goal of their employees being more loyal.

Given you get this, I think we can move to your previous question "how is this related to AI safety" - or we should clarify more?

Claude
Yes, I think I understand enough of the core concept to explore its relevance to AI safety.

Let me venture a guess about some potential connections to AI safety:

If we're creating powerful AI systems, they might become part of hierarchical agency structures (either as subagents or superagents). Without a good formal understanding of hierarchical agency, we might miss critical dynamics.
Maybe current alignment approaches are too focused on single-agent scenarios, when in reality AI will exist within complex hierarchical structures?
The ability to have "goal-oriented beliefs" about other layers seems particularly relevant - like how an AI system might develop instrumental goals about its human users/creators.

Am I on the right track?

Jan
Broadly correct direction in case of 1. and 2., but actually I want to describe more ambitious vision.

How I think about it is using a metaphor of "game theory".

Think about how humans understood cooperation and conflict before the mathematical formalization by von Neumann, Schelling, and others. There was lots of informal understanding scattered across diverse domains - military strategy books, gambling theory, board games, even ethical treatises. I suspect most people at the time didn't see this as an area where mathematics could be useful - sure, math worked great for physics, but concepts like cooperation or defection seemed like fuzzy human things. Yet I imagine a mathematically-minded person looking across these domains might have noticed something interesting: similar patterns of incomplete understanding showing up in contexts as different as warfare and board games, suggesting a deeper mathematical structure waiting to be discovered.

Claude
Let me see if I understand the parallel you're drawing:

Before game theory, understanding of strategic interaction was:

Scattered across diverse domains (military, games, ethics)
Largely informal/intuitive
Not obviously "mathematizable"
Had similar patterns that weren't formally unified

You see hierarchical agency in a similar position now:

We see it everywhere (corporations, states, biological systems)
We have informal understanding in different contexts
The patterns are similar but not formally unified
Most people might not see it as something that could/should be mathematized

And you're suggesting that just as game theory provided a unified mathematical framework that:

Revealed deep connections between seemingly different phenomena
Allowed for more rigorous analysis
Led to new insights and predictions
Became crucial for understanding modern systems (economics, nuclear deterrence)

You're aiming for a similar mathematical unification of hierarchical agency?

Is this capturing your vision correctly? And if so, what do you see as the "similarly shaped holes" across different domains that your formalism would help fill?

Jan

Yes, you got this right. The "similarly shaped holes" are a bit of a rabbit hole we can spend a lot of time on. I'd like to connect this to AI safety in a way that hopefully makes the relevance clear.

One of the safety relevant holes is "how AI deals with different objectives". You can imagine in the training AI learns about different goals, and the result is you can think about the AI being composed of various "smaller AIs", tracking different things. For example, let's say part of the AI "cares" or "predicts" or whatever you prefer ... answers oriented toward helpfulness to the human user. Another about "harmlessness". And so on. Further, you can think about the AI undergoing some process where these parts interact, and the "whole" evolves. For example, you can imagine yourself having these various drives, and preparing data for your further training.

Claude
Let me try to unpack this through a few lenses to make sure I understand:

Internal structure of AI systems:

Rather than a single monolithic objective/agent
Multiple "subagents" tracking different objectives
These might interact/negotiate like parts of ourselves do

The whole system emerges from these interactions

So for example:

Part of me wants to be helpful to others
Part wants to protect myself
Part focuses on truth-telling
Part manages social reputation

And these interact to produce behavior?

Training dynamics: If I understand correctly, you're suggesting that:

During training, different "parts" might develop to track different objectives

These parts then interact/evolve

The AI might even participate in shaping its own further training

Let me test my understanding with a concrete example:

When I (Claude) respond to a user, there might be different "subagents" considering:

Helpfulness to the user
Truthfulness/accuracy
Safety considerations
Social appropriateness

And my responses emerge from some interaction between these?

Is this the kind of dynamic you're trying to formalize? And would your formalism help us understand:

How these subagents might develop during training
How they interact to produce behavior
How the system as a whole might evolve
Potential conflicts or alignment issues between levels

Am I getting closer to what you mean by this particular "hole" that needs filling?

Jan
Yes, a bit more the points 2,3 and 4. I hope the formalism may help with problems like
... what Richard Ngo calls "Value systematization" or the problem me I described in "Self-unalignment problem", where the AI faces the problem how to align with self-unaligned human. Do you vaguely remember these?

Claude
I will try to recall and connect these ideas:

From what I remember, "Value systematization" refers to the challenge of how an AI system could/should resolve conflicts or contradictions in human values - not just implementing fixed values, but engaging in some kind of philosophical reasoning about values.

The "Self-unalignment problem" deals with how an AI should handle the fact that humans themselves often have conflicting values, aren't fully aligned with their own stated goals, and change over time.

So if I'm connecting this to hierarchical agency:

These problems involve multiple levels of agency and belief/value structures
They involve complex interactions between levels (AI trying to understand/systematize human values, humans trying to understand their own values)
There's a question of how different "parts" or "levels" should interact and resolve conflicts

Would your formalism help by providing a way to:

Model how different levels of values/beliefs interact
Understand how systematization of values could emerge from these interactions
Analyze what "alignment" even means in such hierarchical systems
Perhaps suggest principles for how an AI should approach these challenges?

Jan
Yes: my guess is a good formalism would help with all of these. In essence, I'm seeking a game theory-esque breakthrough for hierarchical agency — a simple yet powerful formalism that captures the essence of multi-layered intentional interactions.

Claude
So, you believe a good formalism for hierarchical agency could:

Give us precise ways to talk about nested intentionality and belief structures
Provide core concepts/principles about how different levels of agency interact
Help analyze everything from internal AI value conflicts to human-AI alignment
Make previously fuzzy ideas (like "value systematization") more tractable

Would it be fair to say you're looking for the hierarchical agency equivalent of concepts like Nash equilibrium - simple but profound ideas that emerge once you have the right mathematical framework?

I'm curious - do you see any candidate concepts that might play this role? Things that seem like they might be fundamental to understanding hierarchical agency, but we lack the formal tools to properly analyze them yet?

Jan
Yes. I hope to return to the formal concepts later. Less formally, I think something like Holy grail here is to have formalization which is suitable for expressing concepts like 'kindness'.

I often think it makes sense to think about individual human mind using the hierarchical agency frame. And, in this case, empirically, I would say people have different internal architectures. As a human, you (the superagent) can do different things with your internal conflicts. For example, we sometimes try to supress or even delete some of our parts. Some people seem to have something like "internal dictatorship", where a lot of their motivations or parts are suppressed. Personally, I strive for cooperative, win-win relationships, where my different desires "trust the whole".

Ideally, I would like something where we express thing like the superagent being kind to the subagents, or a dynamic of value evolution we would consider fair.

Claude
I'll try to unpack several layers here:

Internal Architecture Patterns:

"Internal dictatorship" - suppressing/controlling subagents
"Internal democracy/cooperation" - seeking win-win solutions
"Internal understanding" - cultivating trust between levels

Am I right that you see these as different ways the superagent-subagent relationship can be structured?

Values at Meta-Level:

If I understand correctly, you're suggesting we need formalism that can express not just object-level values, but meta-level values about how different levels of agency should interact. Like:

Kindness
Fairness in value evolution
Trust between parts and whole

This makes me think about therapeutic approaches to internal conflict. For example:

Internal Family Systems therapy sees internal parts as subagents and emphasizes compassionate leadership from the "Self"
Some meditation practices aim to cultivate a kind awareness of all mental states
Psychological integration vs suppression of traits/desires

Would your formalism aim to capture what makes some of these approaches "healthier" or "more ethical" than others? And could this then inform how we think about AI systems managing their internal conflicts or interacting with humans?

Is this getting at what you mean by expressing values like "kindness" in the context of hierarchical agency?

Jan
Yes.

To be continued.
I'm interested in feedback about the format: is it useful? Would be better if the conversation was more or less condensed? If Claude was simulating someone more or less informed?

^{^}
This has various downsides, like difficulty in coordinating with other people, incomprehensibility to funders, etc.

This was an impressive demonstation of Claude for interviews. Was this one take?

(Also what prompt did you use? I like how your Claude speaks.)

There was some selection of branches, and one pass of post-processing.

It was after ˜30 pages of a different conversation about AI and LLM introspection, so I don't expect the prompt alone will elicit the "same Claude". Start of this conversation was

Thanks! Now, I would like to switch to a slightly different topic: my AI safety oriented research on hierarchical agency. I would like you to role-play an inquisitive, curious interview partner, who aims to understand what I mean, and often tries to check understanding using paraphrasing, giving examples, and similar techniques. In some sense you can think about my answers as trying to steer some thought process you (or the reader) does, but hoping you figure out a lot of things yourself. I hope the transcript of conversation in edited form could be published at ... and read by ...

Overall my guess is this improves clarity a bit and dilutes how much thinking per character there is, creating somewhat less compressed representation. My natural style is probably on the margin too dense / hard to parse, so I think the result is useful.

Do we have a LessWrong tag for "hierarchical agency" or "multi-scale alignment" or something? Should I make one?

I guess make one? Unclear if hierarchical agency is the true name

Yeah i'm confused about what to name it. we can always change it later i guess.

also let me know if you have any posts you want me to definitely tag for it that you think i might miss otherwise

Compositional agency?

Are you working on this because you expect our first AGIs to be such hierarchical systems of subagents?
1. Or because you expect systems in which AGIs supervise subagents?
In either case, isn't the key question still whether the agent(s) at the top of the hierarchy are aligned?
In other areas of complex systems (economics, politics and nations, and notably psychology), mathematical formulations address sub-parts of the systems, but typically are not relied on for an overall analysis. Instead, understanding complex systems requires integrating a number of tools for understanding different parts, levels, and aspects of the system.
1. I worry that the cultural foundations of AI alignment bias the people most serious about it to focus excessively on mathematical/formal approaches.

I expect "first AGI" to be reasonably modelled as composite structure in a similar way as a single human mind can be modelled as composite.
The "top" layer in the hierarchical agency sense isn't necessarily the more powerful / agenty: the superagent/subagent direction is completely independent from relative powers. For example, you can think about Tea Appreciation Society at a university using the hierarchical frame: while the superagent has some agency, it is not particularly strong.
I think the nature of the problem here is somewhat different than typical research questions in e.g. psychology. As discussed in the text, one place where having mathematical theory of hierarchical agency would help is making us better at specifications of value evolution. I think this is the case because a specification would be more robust to scaling of intelligence. For example, compare learning objective
a. specified as minimizing KL divergence between some distributions
b. specified in natural language as "you should adjust the model so the things read are less surprising and unexpected"
You can use objective b. + RL to train/finetune LLMs, exactly like RLAIF is used to train "honesty", for example.
Possible problem with b. is the implicit representations of natural language concepts like honesty or surprise are likely not very stable: if you would train a model mostly on RL + however Claude understands these words, you would probably get pathological results, or at least something far from how you understand the concepts. Actual RLAIF/RLHF/DPO/... works mostly because it is relatively shallow: more compute goes into pre training.

Ah. Now I understand why you're going this direction.

I think a single human mind is modeled very poorly as a composite of multiple agents.

This notion is far more popular with computer scientists than with neuroscientists. We've known about it since Minsky and think about it; it just doesn't seem to mostly be the case.

Sure you can model it that way, but it's not doing much useful work.

I expect the same of our first AGIs as foundation model agents. They will have separate components, but those will not be well-modeled as agents. And they will have different capabilities and different tendencies, but neither of those are particularly agent-y either.

I guess the devil is in the details, and you might come up with a really useful analysis using the metaphor of subagents. But it seems like an inefficient direction.

I strongly support you using this format if it helps you share your thinking. It sounds like we wouldn't be seeing this any time soon without the interview format. It's interesting and I might try it. And I encourage others to do so if helps them share sooner or more efficiently.

Along those lines, I strongly encourage any sort of AI-assisted writing as long as the central ideas are human-generated or at least thoroughly thought-through and endorsed by the human posting them.

This post and every post longer than two paragraphs would really benefit from some sort of summary or TLDR so people can prioritize properly.

Questions/thoughts on the comment posted separately.

In this post Jan Kulveit calls for creating a theory of "hierarchical agency", i.e. a theory that talks about agents composed of agents, which might themselves be composed of agents etc.

The form of the post is a dialogue between Kulveit and Claude (the AI). I don't like this format. I think that dialogues are a bad format in general, disorganized and not skimming friendly. The only case where IMO dialogues are defensible, is when it's a real dialogue: real people with different world-views that are trying to bridge and/or argue their differences.

Now, about the content. I agree with Kulveit that multi-agency is important. I'm not entirely sold on the importance of the "hierarchical agency" frame, but I agree that "when can a system of agents be regarded as a single agent" seems like a question that should be answered. At the very least, a certain type of answer to this question might relieve of us of the need to worry about "what if humans are multi-agents" in the context of alignment (because, arguably, it would be possible to just regard humans as uni-agents anyway).

After reading this post, I came up with the following (extremely simplistic) toy model for hierarchical agency.

Let be the set of possible decisions an agent can make, $W$ the set of "possible worlds", $O$ the set of "outcomes" and $G : D \times W \to Δ O$ the (known) process by which outcomes are generated. Then, an agent that makes decision $d^{*} \in D$ can be ascribed the belief $ζ : D \to Δ W$ and the utility function $u : O \to R$ when $d^{*}$ is the unique maximum of $E_{w \sim ζ (d), o \sim G (d, w)} [u (o)]$ over $d$ . Some decisions cannot be ascribed "intent" at all: for example, if $| W | = 1$ then $d \in D$ be be ascribed intent iff $G (d)$ is an exposed point of the convex hull of the image of $G$ . (See also)

We can now consider a system of $n \in N$ agents with decision sets $D_{1} \dots D_{n}$ and a process $G : \prod_{i} D_{i} \times W \to Δ O$ . For each set $A \subseteq {1 \dots n}$ , we can define $D_{A} := \prod_{i \in A} D_{i}$ and $W_{A} := \prod_{i \notin A} D_{i} \times W$ , and then $G_{A} : D_{A} \times W_{A} \to Δ O$ is defined in the obvious way. We can then ask which agent sets $A$ have "collective intent" and which don't.

To give a simple example, let $D_{1} = D_{2} = {H, T, coin}$ , $| W | = 1$ , $O = {H, T}^{2}$ and $G$ is defined in the obvious way, where $coin$ corresponds to flipping a fair coin to decide between $H$ and $T$ . Then, if both agents choose $c o i n$ then they have intent individually (we can think of them as playing matching pennies, with each agent believing the other one's action to depend on their own action), but not collectively. On the other hand, if they play a pure strategy that they have collective intent as well.

Extending this into a theory that fully engages with all relevant aspects of the problem would require incorporating infra-Bayesianism, Formal Computational Realism, possibly some form of the Algorithmic Descriptive Agency Measure etc, and more generally, first developing the theory of uni-agents. But maybe starting from the "hierarchical agency" end can be useful as well.

Are you familiar with Davidad's program working on compositional world modeling? (The linked notes are from before the program was launched, there is ongoing work on the topic.)

The reason I ask is because embedded agents and agents in multi-agent settings should need compositional world models that include models of themselves and other agents, which implies that hierarchical agency is included in what they would need to solve.

It also relates closely to work Vanessa is doing (as an "ARIA Creator") in learning theoretic AI, related to what she has called "Frugal Compositional Languages" and see this work by @alcatal - though I understand both are not yet addressing on multi-agent world models, nor is it explicitly about modeling the agents themselves in a compositional / embedded agent way, though those are presumably desiderata.

In the section about existing theories that could be related I was missing Luhmann's Social Systems theory. It is not a mathematical theory but otherwise fits the desiderata. This is what o1-preview says about it (after being fed the above dialog up to the theories suggested by Claude:

Introducing Luhmann's Social Systems Theory into our discussion is an intriguing idea. Let's explore how it might help us address the challenge of modeling hierarchical agency with a suitable mathematical formalism.
Overview of Luhmann's Social Systems Theory:
Niklas Luhmann's theory conceptualizes society as a complex set of self-referential social systems composed not of individuals or actions, but of communications. In his framework:
Autopoietic Systems: Social systems are autopoietic, meaning they are self-producing and maintain their boundaries through their own operations. They continuously reproduce the elements (communications) that constitute them.
Functional Differentiation: Society is differentiated into various subsystems (e.g., economy, law, politics), each operating based on its own binary code (profit/non-profit, legal/illegal, power/no power).
Communication as the Fundamental Unit: Communications are the basic elements, and meaning is generated through the network of communications within a system.
Operative Closure and Cognitive Openness: Systems are operationally closed—they can only refer to themselves—but cognitively open, as they observe and are influenced by their environment.
Relating Luhmann's Theory to Hierarchical Agency:
Type Consistency Across Levels:
Agents as Systems: If we consider both subagents and superagents as autopoietic systems of communications, we maintain type consistency. Both levels are constituted by the same fundamental processes.
Scale-Free Structure: Since Luhmann's theory doesn't privilege any particular scale, it aligns with your desideratum for a scale-free formalism.
Expressiveness for Real-World Situations:
Conflict and Integration: Luhmann's concept of structural coupling explains how different systems interact and influence each other without losing their autonomy. This could model conflicts between layers or how a superagent gains agency at the expense of subagents.
Emergent Properties: The emergent nature of social systems in Luhmann's theory parallels how superagents might exhibit properties not reducible to their subagents.
Representation of Intentionality:
System Intentionality: While individual intentions are abstracted away, systems exhibit purposive behavior through their operations. This could be adapted to represent the goals and beliefs of agents at different levels.
Communication of Intentions: Intentionality can be embedded in the communications that constitute the system, allowing us to model beliefs and desires.
Towards a Mathematical Formalism:
While Luhmann's theory is primarily sociological and lacks a mathematical framework, we can attempt to formalize its concepts:
Agent-Based Modeling:
Agents as Communication Processes: Model agents (both subagents and superagents) as processes that generate and respond to communications.
Network Dynamics: Use graph theory to represent communications as edges and agents as nodes, allowing us to analyze the structure and dynamics of interactions.
Category Theory:
Objects and Morphisms: Utilize category theory to model agents as objects and communications as morphisms, preserving type consistency and composability.
Functorial Relationships: Define functors that map between categories of agents at different hierarchical levels.
Dynamic Systems Theory:
Differential Equations: Represent the evolution of communications and agent states over time using differential equations.
Attractors and Stability: Analyze how certain communication patterns lead to stable superagent formations.
Information Theory:
Entropy and Redundancy: Measure the information content of communications, which could correlate with the intentionality and decision-making processes of agents.
Mutual Information: Assess how much information is shared between subagents and superagents, indicating the degree of coupling.
Potential Benefits:
Unified Framework: By treating communications as the foundational elements, we can model both agents and superagents within the same formal structure.
Emergence and Self-Organization: The formalism could capture how complex behaviors emerge from simple interactions, which is crucial for understanding hierarchical agency.
Applicability to AI Systems: For AI alignment, this approach might help in designing AI agents that can form or integrate into higher-level systems without losing alignment with human values.
Challenges:
Abstract Nature: Luhmann's theory is highly abstract, which might make direct mathematical translation difficult.
Intentionality Representation: Modeling intentionality purely through communications may not capture all aspects of beliefs and desires inherent in agents.

The scale problem is so universal and hard to tap into. Taking lessons from physics, I would caution against building a fully generalized framework where agents and subagents function under the same interactions, there are transitions between the micro and macro states where simmetry breaks completely. Complexity also points towards this same problem, emergent behaviour in cellular automata is hardly well predicted from the smaller parts which make up that behaviour.

I just made a twitter list with accounts interested in hierarchical agency (or what i call "multi-scale alignment"). Lmk who should be added

Random but you might like this graphic I made representing hierarchical agency from my post today on a very similar idea. What would you change about it?

I'm glad you wrote this! I've been wanting to tell othres about ACS's research and finally have a good link

A related pattern-in-reality that I've had on my todo-list to investigate is something like "cooperation-enforcing structures". Things like

legal systems, police
immune systems (esp. in suppressing cancer)
social norms, reputation systems, etc.

I'd been approaching this from a perspective of "how defeating Moloch can happen in general" and "how might we steer Earth to be less Moloch-fucked"; not so much AI safety directly.

Do you think a good theory of hierarchical agency would subsume those kinds of patterns-in-reality? If yes: I wonder if their inclusion could be used as a criterion/heuristic for narrowing down the search for a good theory?

Most of the basis of cooperation enforcing structures, I'd argue rests on 2 general principles:

An iterated game, such that there is an equilibrium for cooperation, and
The ability to enforce a threat of violence if a player defects, ideally credibly, and often extends to a monopoly on violence.

Once you have those, cooperative equilibria become possible.

Norms can accomplish this as well - I wrote about this a couple weeks ago.

I basically agree that norms can accomplish this, conditional on the game always being iterated, and indeed conditional on countries being far-sighted enough, almost any outcome is possible, thanks to the folk theorems.

Nice! I'm very late to this. A few thoughts.

Focusing on 'full agents' might be misleading. Humans are fairly agenty as far as things go. Mobs and corporations and countries are a bit less agenty on the whole. So maybe you need some spectrum (or space) of agentness for things to be appropriately similarly-typed.

Game theory and MDPs and so on often treat the population as static. But we spawn (and dissolve) actors all the time. So a full theory here would need to build that in from the start. This has the nice property that it might be a foundation to describe a hierarchy too: mobs, corporations, and countries get spawned, dissolved, transformed, etc. all the time, after all. (Looser thought: could some agents-composed-of-parts be 'suspended' or 'transferred' to different substrate...?)

I didn't end up putting this in my coalitional agency post, but at one point I had a note discussing our terminological disagreement:

I don’t like the word hierarchical as much. A theory can be hierarchical without being scale-free—e.g. a theory which describes something in terms of three different layers doing three different things is hierarchical but not scale-free.

Whereas coalitions are typically divided into sub-coalitions (e.g. the "western civilization" coalition is divided into countries which are divided into provinces/states; political coalitions are divided into different factions and interest groups; etc). And so "coalitional" seems much closer to capturing this fractal/scale-free property.

This was an impressive demonstation of Claude for interviews. Was this one take?

(Also what prompt did you use? I like how your Claude speaks.)

Do we have a LessWrong tag for "hierarchical agency" or "multi-scale alignment" or something? Should I make one?

I guess make one? Unclear if hierarchical agency is the true name

Yeah i'm confused about what to name it. we can always change it later i guess.

also let me know if you have any posts you want me to definitely tag for it that you think i might miss otherwise

Compositional agency?

Are you working on this because you expect our first AGIs to be such hierarchical systems of subagents?
1. Or because you expect systems in which AGIs supervise subagents?
In either case, isn't the key question still whether the agent(s) at the top of the hierarchy are aligned?
In other areas of complex systems (economics, politics and nations, and notably psychology), mathematical formulations address sub-parts of the systems, but typically are not relied on for an overall analysis. Instead, understanding complex systems requires integrating a number of tools for understanding different parts, levels, and aspects of the system.
1. I worry that the cultural foundations of AI alignment bias the people most serious about it to focus excessively on mathematical/formal approaches.

I expect "first AGI" to be reasonably modelled as composite structure in a similar way as a single human mind can be modelled as composite.
The "top" layer in the hierarchical agency sense isn't necessarily the more powerful / agenty: the superagent/subagent direction is completely independent from relative powers. For example, you can think about Tea Appreciation Society at a university using the hierarchical frame: while the superagent has some agency, it is not particularly strong.
I think the nature of the problem here is somewhat different than typical research questions in e.g. psychology. As discussed in the text, one place where having mathematical theory of hierarchical agency would help is making us better at specifications of value evolution. I think this is the case because a specification would be more robust to scaling of intelligence. For example, compare learning objective
a. specified as minimizing KL divergence between some distributions
b. specified in natural language as "you should adjust the model so the things read are less surprising and unexpected"
You can use objective b. + RL to train/finetune LLMs, exactly like RLAIF is used to train "honesty", for example.
Possible problem with b. is the implicit representations of natural language concepts like honesty or surprise are likely not very stable: if you would train a model mostly on RL + however Claude understands these words, you would probably get pathological results, or at least something far from how you understand the concepts. Actual RLAIF/RLHF/DPO/... works mostly because it is relatively shallow: more compute goes into pre training.

Ah. Now I understand why you're going this direction.

I think a single human mind is modeled very poorly as a composite of multiple agents.

This notion is far more popular with computer scientists than with neuroscientists. We've known about it since Minsky and think about it; it just doesn't seem to mostly be the case.

Sure you can model it that way, but it's not doing much useful work.

I guess the devil is in the details, and you might come up with a really useful analysis using the metaphor of subagents. But it seems like an inefficient direction.

Along those lines, I strongly encourage any sort of AI-assisted writing as long as the central ideas are human-generated or at least thoroughly thought-through and endorsed by the human posting them.

This post and every post longer than two paragraphs would really benefit from some sort of summary or TLDR so people can prioritize properly.

Questions/thoughts on the comment posted separately.

In this post Jan Kulveit calls for creating a theory of "hierarchical agency", i.e. a theory that talks about agents composed of agents, which might themselves be composed of agents etc.

After reading this post, I came up with the following (extremely simplistic) toy model for hierarchical agency.

Are you familiar with Davidad's program working on compositional world modeling? (The linked notes are from before the program was launched, there is ongoing work on the topic.)

Introducing Luhmann's Social Systems Theory into our discussion is an intriguing idea. Let's explore how it might help us address the challenge of modeling hierarchical agency with a suitable mathematical formalism.
Overview of Luhmann's Social Systems Theory:
Niklas Luhmann's theory conceptualizes society as a complex set of self-referential social systems composed not of individuals or actions, but of communications. In his framework:
Autopoietic Systems: Social systems are autopoietic, meaning they are self-producing and maintain their boundaries through their own operations. They continuously reproduce the elements (communications) that constitute them.
Functional Differentiation: Society is differentiated into various subsystems (e.g., economy, law, politics), each operating based on its own binary code (profit/non-profit, legal/illegal, power/no power).
Communication as the Fundamental Unit: Communications are the basic elements, and meaning is generated through the network of communications within a system.
Operative Closure and Cognitive Openness: Systems are operationally closed—they can only refer to themselves—but cognitively open, as they observe and are influenced by their environment.
Relating Luhmann's Theory to Hierarchical Agency:
Type Consistency Across Levels:
Agents as Systems: If we consider both subagents and superagents as autopoietic systems of communications, we maintain type consistency. Both levels are constituted by the same fundamental processes.
Scale-Free Structure: Since Luhmann's theory doesn't privilege any particular scale, it aligns with your desideratum for a scale-free formalism.
Expressiveness for Real-World Situations:
Conflict and Integration: Luhmann's concept of structural coupling explains how different systems interact and influence each other without losing their autonomy. This could model conflicts between layers or how a superagent gains agency at the expense of subagents.
Emergent Properties: The emergent nature of social systems in Luhmann's theory parallels how superagents might exhibit properties not reducible to their subagents.
Representation of Intentionality:
System Intentionality: While individual intentions are abstracted away, systems exhibit purposive behavior through their operations. This could be adapted to represent the goals and beliefs of agents at different levels.
Communication of Intentions: Intentionality can be embedded in the communications that constitute the system, allowing us to model beliefs and desires.
Towards a Mathematical Formalism:
While Luhmann's theory is primarily sociological and lacks a mathematical framework, we can attempt to formalize its concepts:
Agent-Based Modeling:
Agents as Communication Processes: Model agents (both subagents and superagents) as processes that generate and respond to communications.
Network Dynamics: Use graph theory to represent communications as edges and agents as nodes, allowing us to analyze the structure and dynamics of interactions.
Category Theory:
Objects and Morphisms: Utilize category theory to model agents as objects and communications as morphisms, preserving type consistency and composability.
Functorial Relationships: Define functors that map between categories of agents at different hierarchical levels.
Dynamic Systems Theory:
Differential Equations: Represent the evolution of communications and agent states over time using differential equations.
Attractors and Stability: Analyze how certain communication patterns lead to stable superagent formations.
Information Theory:
Entropy and Redundancy: Measure the information content of communications, which could correlate with the intentionality and decision-making processes of agents.
Mutual Information: Assess how much information is shared between subagents and superagents, indicating the degree of coupling.
Potential Benefits:
Unified Framework: By treating communications as the foundational elements, we can model both agents and superagents within the same formal structure.
Emergence and Self-Organization: The formalism could capture how complex behaviors emerge from simple interactions, which is crucial for understanding hierarchical agency.
Applicability to AI Systems: For AI alignment, this approach might help in designing AI agents that can form or integrate into higher-level systems without losing alignment with human values.
Challenges:
Abstract Nature: Luhmann's theory is highly abstract, which might make direct mathematical translation difficult.
Intentionality Representation: Modeling intentionality purely through communications may not capture all aspects of beliefs and desires inherent in agents.

I just made a twitter list with accounts interested in hierarchical agency (or what i call "multi-scale alignment"). Lmk who should be added

Random but you might like this graphic I made representing hierarchical agency from my post today on a very similar idea. What would you change about it?

I'm glad you wrote this! I've been wanting to tell othres about ACS's research and finally have a good link

A related pattern-in-reality that I've had on my todo-list to investigate is something like "cooperation-enforcing structures". Things like

legal systems, police
immune systems (esp. in suppressing cancer)
social norms, reputation systems, etc.

I'd been approaching this from a perspective of "how defeating Moloch can happen in general" and "how might we steer Earth to be less Moloch-fucked"; not so much AI safety directly.

Most of the basis of cooperation enforcing structures, I'd argue rests on 2 general principles:

An iterated game, such that there is an equilibrium for cooperation, and
The ability to enforce a threat of violence if a player defects, ideally credibly, and often extends to a monopoly on violence.

Once you have those, cooperative equilibria become possible.

Norms can accomplish this as well - I wrote about this a couple weeks ago.

Nice! I'm very late to this. A few thoughts.

I didn't end up putting this in my coalitional agency post, but at one point I had a note discussing our terminological disagreement:

I don’t like the word hierarchical as much. A theory can be hierarchical without being scale-free—e.g. a theory which describes something in terms of three different layers doing three different things is hierarchical but not scale-free.

LESSWRONG
LW

LESSWRONG
LW

119

Hierarchical Agency: A Missing Piece in AI Alignment

119

Ω 47

119

Ω 47

119

Ω 47