But in practice, agents represent both of these in terms of the same underlying concepts. When those concepts change, both beliefs and goals change.
I like this reason to be unsatisfied with the EUM theory of agency.
One of the difficulties in theorising about agency is that all the theories are flexible enough to explain anything. Each theory is incomplete and vague in some way, which makes the problem worse; but even when you make a detailed model of e.g. active inference, it ends up being pretty much formally equivalent to EUM.
I think the solution to this is to compare theories using engineering desiderata. Our goal is ultimately to build a safe AGI, so we want a theory that helps us reason about safety desiderata.
One of the really important safety desiderata is some kind of goal stability. When we build a powerful agent, we don't want it to change its mind about what's important. It should act to achieve known, predictable outcomes, even when it discovers facts and concepts we don't know about.
So my criticism of this research direction is that I don't think it'll be a good framework for making goal-stable agents. You want a framework that naturally models internal conflict of goals, and in particular you want to model this as conflict between agents. Conflict and cooperation between bounded, not-quite-rational agents is messy and hard to predict. Multi-agent systems are complex and detail dependent. Therefore it seems difficult to show that the overall agent will be stable.
(A reasonable response would be "but no proposed vague theories of bounded agency have this goal stability property, maybe this coalitional approach will turn out to help us come up with a solution", and that's true and fair enough, but I think research directions like this seem more promising).
Could you please make an argument for goal stability over process stability?
What if I reflectively agree that, if a process A (QACI or CEV for example) is reflectively good, then I agree to changing my values from B to C if process A happens? So it is more about the process than the underlying goals. Why do we treat goals as the first-class citizen here?
Is it that there's something in well-defined processes that makes them applicable to themselves and reflectively stable?
However, this strongly limits the space of possible aggregated agents. Imagine two EUMs, Alice and Bob, whose utilities are each linear in how much cake they have. Suppose they’re trying to form a new EUM whose utility function is a weighted average of their utility functions. Then they’d only have three options:
- Form an EUM which would give Alice all the cakes (because it weights Alice’s utility higher than Bob’s)
- Form an EUM which would give Bob all the cakes (because it weights Bob’s utility higher than Alice’s)
- Form an EUM which is totally indifferent about the cake allocation between them (which would allocate cakes arbitrarily, and could be swayed by the tiniest incentive to give all Alice’s cakes to Bob, or vice versa)
None of these is very satisfactory!
I think this exact example is failing to really inhabit the mindset of a true linear(!) returns EUM agent. If Alice has literally linear returns Alice is totally happy to accept a deal which gets Alice 2x as many cakes + epsilon in 50% of worlds and nothing otherwise.
Correspondingly, if Alice and Bob have ex-ante exactly identical expected power, and it is ex-ante equally easy to make cake for each of them, then I think the agent they would build together would be something like: an EUM which is totally indifferent about the cake allocation between them, and thus gives 100% of the cake to whichever agent turns out to be cheaper/easier to provide cake for.
From Alice's perspective this gets twice as many cakes + epsilon (due to being more efficient) in 50% of worlds and is thus a nice trade.
(If the marginal cost of giving a cake to Alice vs Bob increases with number of cakes, then you'd give some to both.)
If Alice/Bob had diminishing returns, then adding the utility functions with some bargained weighting is also totally fine and will get you some nice split of cake between them.
If we keep their preferences, but make them have different cake production abilities or marginal costs of providing cakes for them, then you just change the weights (based on some negotiation), not the linearity of the addition. And yes, this means that in many worlds (where one agent always has lower than ex-ante relative marginal cake consumption cost), one of the agents gets all the cake. But ex-ante they got a bit more in expectation!
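A quick check of the ex-ante arithmetic (a toy sketch; the 2x-plus-epsilon payoff and the 50/50 symmetry are just the illustrative assumptions from above):

```python
# Toy check of the ex-ante argument: a merged agent that is indifferent to the
# split and simply gives all cake to whoever is cheaper to serve.
# Illustrative assumptions: symmetric agents, so each turns out to be the
# cheaper one in 50% of worlds, and concentrating production is slightly more
# efficient (epsilon > 0).

baseline_cakes = 1.0      # what Alice expects without merging
epsilon = 0.1             # efficiency gain from not splitting production
p_alice_cheaper = 0.5     # ex-ante symmetric

# If Alice is the cheaper one, she gets everything: 2x baseline + epsilon.
# Otherwise she gets nothing.
expected_cakes_for_alice = p_alice_cheaper * (2 * baseline_cakes + epsilon)

print(expected_cakes_for_alice)  # 1.05 > 1.0, so a linear-returns Alice accepts
```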
I'm much more sympathetic to other objections to aggregations of EUM agents being EUM, like ontology issues, imperfect information (and adverse selection), etc.
I was a bit lazy in how I phrased this. I agree with all your points; the thing I'm trying to get at is that this approach falls apart quickly if we make the bargaining even slightly less idealized. E.g. your suggestion "Form an EUM which is totally indifferent about the cake allocation between them and thus gives 100% of the cake to whichever agent is cheaper/easier to provide cake for":
EUM treats these as messy details. Coalitional agency treats them as hints that EUM is missing something.
EDIT: another thing I glossed over is that IIUC Harsanyi's theorem says the aggregation of EUMs should have a weighted average of utilities, NOT a probability distribution over weighted averages of utilities. So even flipping a coin isn't technically kosher. This may seem nitpicky but I think it's yet another illustration of the underlying non-robustness of EUM.
I've now edited that section. Old version and new version here for posterity.
Old version:
None of these is very satisfactory! Intuitively speaking, Alice and Bob want to come to an agreement where respect for both of their interests is built in. For example, they might want the EUM they form to value fairness between their two original sets of interests. But adding this new value is not possible if they’re limited to weighted averages. The best they can do is to agree on a probabilistic mixture of EUMs—e.g. tossing a coin to decide between option 1 and option 2—which is still very inflexible, since it locks in one of them having priority indefinitely.
Based on similar reasoning, Scott Garrabrant rejects the independence axiom. He argues that the axiom is unjustified because rational agents should be able to follow through on commitments they made about which decision procedure to follow (or even hypothetical commitments).
New version:
These are all very unsatisfactory. Bob wouldn’t want #1, Alice wouldn’t want #2, and #3 is extremely non-robust. Alice and Bob could toss a coin to decide between options #1 and #2, but then they wouldn’t be acting as an EUM (since EUMs can’t prefer a probabilistic mixture of two options to either option individually). And even if they do, whoever loses the coin toss will have a strong incentive to renege on the deal.
We could see these issues merely as the type of frictions that plague any idealized theory. But we could also see them as hints about what EUM is getting wrong on a more fundamental level. Intuitively speaking, the problem here is that there’s no mechanism for separately respecting the interests of Alice and Bob after they’ve aggregated into a single agent. For example, they might want the EUM they form to value fairness between their two original sets of interests. But adding this new value is not possible if they’re limited to (a probability distribution over) weighted averages of their utilities. This makes aggregation very risky when Alice and Bob can’t consider all possibilities in advance (i.e. in all realistic settings).
Based on similar reasoning, Scott Garrabrant rejects the independence axiom. He argues that the axiom is unjustified because rational agents should be able to lock in values like fairness based on prior agreements (or even hypothetical agreements).
I vaguely remember a LessWrong comment from you a couple of years ago saying that you included Agent Foundations in the AGISF course as a compromise despite not thinking it’s a useful research direction.
Could you say something about why you’ve changed your mind, or what the nuance is if you haven’t?
Active inference is an extension of predictive coding in which some beliefs are so rigid that, when they conflict with observations, it’s easier to act to change future observations than it is to update those beliefs. We can call these hard-to-change beliefs “goals”, thereby unifying beliefs and goals in a way that EUM doesn’t.
You're probably aware of it, but it's worth making explicit that this move also puts many biases, addictions, and maladaptive/disendorsed behaviors in the goal category.
EUM treats goals and beliefs as totally separate. But in practice, agents represent both of these in terms of the same underlying concepts. When those concepts change, both beliefs and goals change.
Active inference is one framework that attempts to address it. Jeffrey-Bolker is another one, though I haven't dipped my toes into it deep enough to have an informed opinion on whether it's more promising than active inference for the thing you want to do.
Based on similar reasoning, Scott Garrabrant rejects the independence axiom. He argues that the axiom is unjustified because rational agents should be able to lock in values like fairness based on prior agreements (or even hypothetical agreements).
I first thought that this introduces epistemic instability, because vNM EU theory rests on the independence axiom (so it looked like: to unify EU theory with active inference, you wanted to reject one of the things defining EU theory qua EU theory). But then I realized that you hadn't assumed vNM as a foundation for EU theory, so maybe it's irrelevant. Still, as far as I remember, different foundations of EU theory give you slightly different implications (and many of them have some equivalent of the independence axiom; at least Savage does), so it might be good for you to think explicitly about what kind of EU foundation you're assuming. But it also might be irrelevant. I don't know. I'm leaving this thought-train-dump in case it might be useful.
Coalitional agency seems like an unnecessary constraint on design of a composite agent, since an individual agent could just (choose to) listen to other agents and behave the way their coalition would endorse, thereby effectively becoming a composite agent, without being composite "by construction". The step where an agent chooses which other (hypothetical) agents to listen to makes constraints on the nature of agents unnecessary, because the choice to listen to some agents and not others can impose any constraints that particular agent cares about, and so an "agent" could be as vague as a "computation" or a program.
(Choosing to listen to a computation means choosing a computation based on considerations other than its output, committing to use its output in a particular way without yet knowing what it's going to be, and carrying out that commitment once the output becomes available, regardless of what it turns out to be.)
This way we can get back to individual rationality, figuring out how an agent should choose to listen to which other agents/computations when coming up with its own beliefs and decisions. But actually occasionally listening to those other computations is the missing step in most decision theories, which would take care of interaction with other agents (both actual and hypothetical).
Good post. But I thought about this a fair bit and I think I disagree with the main point.
Let's say we talk about two AIs merging. Then the tuple of their expected utilities from the merge had better be on the Pareto frontier, no? Otherwise they'd just do a better merge that gets them onto the frontier. Which specific point on the frontier is a matter of bargaining, but the fact that they want to hit the frontier isn't, it's a win-win. And the merges that get them to the frontier are exactly those that output an EUM agent. If the point they want to hit is in a flat region of the frontier, the merge will involve coinflips to choose which EUM agent to become; and if it's curvy at that point, the merge will be deterministic. For realistic agents who have more complex preferences than just linearly caring about one cake, I expect the frontier will be curvy, so deterministic merge into an EUM agent will be the best choice.
Found this interesting and useful. The big update for me is that 'I cut you choose' is basically the property that most (all?) good self-therapy modalities use, afaict: the part or part-coalition running the therapy procedure can offer but not force things, since its frames are subtly biasing the process.
Discussions of how to aggregate values and probabilities feel disjoint. The Jeffrey-Bolker formulation of expected utility presents the preference data as two probability distributions over the same sample space, so that the expected utility of an event is reconstructed as the ratio of the event's measures given by the two priors. (The measure that goes into the numerator is "shouldness", and the other one remains "probability".)
This gestures at a way of reducing the problem of aggregating values to the problem of aggregating probabilities. In particular, markets seem to be easier to set up for probabilities than for expected utilities, so it might be better to set up two markets that are technically the same type of thing, one for probability and one for shouldness, than to target expected utility directly. Values of different agents are incomparable, but so are priors; any fundamental issues with aggregation seem to remain unchanged by this reformulation. These can't be "prediction" markets, since resolution is not straightforward and somewhat circular (grounded in what the coalition will settle on eventually), but logical induction has to deal with similar issues already.
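For reference, here is a minimal way to write the ratio form described above (a sketch only, assuming the usual conditional-expectation reading of desirability; S is the "shouldness" measure and P the probability measure):

```latex
% Desirability / expected utility of an event E as a ratio of two measures:
\mathrm{EU}(E) \;=\; \frac{S(E)}{P(E)},
\qquad\text{where}\qquad
S(E) \;=\; \int_E u \, dP .
```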
Nice! I think you might find my draft on Dynamics of Healthy Systems: Control vs Opening relevant to these explorations, feel free to skim as it's longer than ideal (hence unpublished, despite containing what feels like a general and important insight that applies to agency at many scales). I plan to write a cleaner one sometime, but for now it's claude-assisted writing up my ideas, so it's about 2-3x more wordy than it should be.
I think this is a really cool research agenda. I can also try to give my "skydiver's perspective from 3000 miles in the air" overview of what I think expected free energy minimisation means, though I am by no means an expert. Epistemic status: this is a broad extrapolation of some intuitions I gained from reading a lot of papers, it may be very wrong.
In general, I think of free energy minimisation as a class of solutions for the problem of predicting complex systems behaviour, in line with other variational principles in physics. Thus, it is an attempt to use simple physical rules like "the ball rolls down the slope" to explain very complicated outcomes like "I decide to build a theme park with roller coasters in it". In this case, the rule is "free energy is minimised", but unlike a simple physical system whose dimensionality is very literally visible, variational free energy (VFE) is minimised in high-dimensional probability spaces.
Consider the concrete case below: there are five restaurants in a row and you have to pick one to go to. The intuitive physical interpretation is that you can be represented by a point particle moving to one of five coordinates, all relatively close by in the three dimensional XYZ coordinate space. However, if we assume that this is just some standard physical process you'll end up with highly unintuitive behaviour (why does the particle keep drifting right and left in the middle of these coordinates, and then eventually go somewhere that isn't the middle?). Instead we might say that in an RL sense there is a 5 dimensional action space and you must pick a dimension to maximise expected reward.
Free energy minimisation is a rule that says that your action is the one that minimises variation between the predicted outcome your brain produces and the final outcome that your brain observes---which can happen either if your brain is very good at predicting the future or if you act to make your prediction come true. A preference in this case is a bias in the prediction (you can see yourself going to McDonald's more, in some sense, and you feel some psychological aversion/repulsive force moving you away from Burger King) that is then satisfied by you going to the restaurant you are most attracted to. Of course this is just a single agent interpretation, and with multiple subagents you can imagine valleys and peaks in the high dimensional probability space, which is resolved when you reach some minima that can be satisfied by action.
However, this strongly limits the space of possible aggregated agents. Imagine two EUMs, Alice and Bob, whose utilities are each linear in how much cake they have. Suppose they’re trying to form a new EUM whose utility function is a weighted average of their utility functions. Then they’d only have three options:
- Form an EUM which would give Alice all the cakes (because it weights Alice’s utility higher than Bob’s)
- Form an EUM which would give Bob all the cakes (because it weights Bob’s utility higher than Alice’s)
- Form an EUM which is totally indifferent about the cake allocation between them (which would allocate cakes arbitrarily, and could be swayed by the tiniest incentive to give all Alice’s cakes to Bob, or vice versa)
There was a recent paper which defined a self-other overlap (SOO) loss. Broadly, the loss is defined as something like the difference between activations when the model references itself and activations when it references other agents.
Does this help here? If Alice and Bob had both been built using SOO losses, they'd always consistently assign an equivalent amount of cake to each other. I get that this breaks the initial assumption that Alice and Bob each have linear utilities, but it seems like a nice way to break it in a way that ensures the best possible result for all parties.
I like this. Although I don't think humans have a scale-free goal architecture, I do think that humans tend to choose abstract principles and ideologies which best predict what their innate moral feelings will choose.
For example, some humans choose utilitarianism as the ideology they endorse, because it predicts what choice their innate moral feelings will make, e.g. donating to a charity which helps more people.
Once they choose an abstract principle or ideology, it will have a lot of weight in very abstract moral decisions which their innate moral feelings fail to understand, e.g. utilitarianism decides whether you should work on AI risk or work in a soup kitchen.
Sometimes, humans discover a conflict between an ideology they've chosen and their innate moral feelings, e.g. total utilitarianism predicts that increasing the population a millionfold, but making everyone's lives barely worth living (only 1/1000 as meaningful), would be a good thing. (This example is the mere addition paradox)
When this conflict happens, there may be a lot of dissonance and stress, and sometimes humans change their ideology (e.g. to average utilitarianism), and sometimes humans insist their innate moral feelings are wrong and stop feeling them.
Eliezer Yudkowsky once had a stressful ideology change (from being pro-AI to pro-humanity):
To the extent someone says that a superintelligence would wipe out humanity, they are either arguing that wiping out humanity is in fact the right thing to do (even though we see no reason why this should be the case) or they are arguing that there is no right thing to do (in which case their argument that we should not build intelligence defeats itself).
Maybe some people would prefer an AI do particular things, such as not kill them, even if life is meaningless?
At first he sort of blocked the moral feelings which disagreed with his ideology, but eventually he changed his ideology.
This system of "choosing abstract principles and ideologies which predict your innate moral choices the best" helps extrapolate our innate moral feelings to new concepts and world models far beyond our ancestral environment.
I hope future AGI will also have a goal architecture like this, so that the bits of good behaviour and tendencies we manage to train into the AGI will make it place a little bit of weight on abstract principles or ideologies which protect humans.
I recently left OpenAI to pursue independent research. I’m working on a number of different research directions, but the most fundamental is my pursuit of a scale-free theory of intelligent agency. In this post I give a rough sketch of how I’m thinking about that. I’m erring on the side of sharing half-formed ideas, so there may well be parts that don’t make sense yet. Nevertheless, I think this broad research direction is very promising.
This post has two sections. The first describes what I mean by a theory of intelligent agency, and some problems with existing (non-scale-free) attempts. The second outlines my current path towards formulating a scale-free theory of intelligent agency, which I’m calling coalitional agency.
Theories of intelligent agency
By a “theory of intelligent agency” I mean a unified mathematical framework that describes both understanding the world and influencing the world. In this section I’ll outline the two best candidate theories of intelligent agency that we currently have (expected utility maximization and active inference), explain why neither of them is fully satisfactory, and outline how we might do better.
Expected utility maximization
Expected utility maximization is the received view of intelligent agency in many fields (I’ll abbreviate it as EUM, and EUM agents as EUMs). Idealized EUMs have beliefs in the form of probability distributions, and goals in the form of utility functions, as specified by the axioms of probability theory and utility theory. They choose whichever strategy leads to the most utility in expectation; this is typically modelled as a process of search or planning.
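As a toy rendering of that definition (illustrative names and numbers only, assuming a one-shot decision):

```python
# Minimal sketch of an idealized EUM: beliefs are a probability distribution
# over outcomes conditional on each action, goals are a utility function over
# outcomes, and the agent picks the action with highest expected utility.

beliefs = {  # P(outcome | action); purely illustrative numbers
    "stay_home": {"rested": 0.9, "bored": 0.1},
    "go_out":    {"rested": 0.2, "bored": 0.0, "had_fun": 0.8},
}
utility = {"rested": 1.0, "bored": -1.0, "had_fun": 2.0}

def expected_utility(action):
    return sum(p * utility[outcome] for outcome, p in beliefs[action].items())

best_action = max(beliefs, key=expected_utility)
print(best_action, expected_utility(best_action))  # go_out, 1.8
```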
EUM is a very productive framework in simple settings—like game theory, bargaining theory, microeconomics, etc. It’s particularly useful for describing agents making one-off decisions between a fixed set of choices. However, it’s much more difficult to use EUM to model agents making sequences of choices over time, especially when they learn and update their concepts throughout that process. The two points I want to highlight here:
So we might hope that a theory of deep learning, or reinforcement learning, or deep reinforcement learning, will help fill in EUM’s blind spots. Unfortunately, theoretical progress has been slow on all of these—they’re just too broad to say meaningful things about in the general case.
Active inference
Fortunately, there’s another promising theory which comes at it from a totally different angle. Active inference is a theory born out of neuroscience. Where EUM starts by assuming an agent already has beliefs and goals, active inference gives us a theory of how beliefs and goals are built up over time.
The core idea underlying active inference is predictive coding. Predictive coding models our brains as hierarchical networks where the lowest level is trying to predict our sensory inputs, the next-lowest level is trying to predict the lowest level, and so on. The higher up the hierarchy you go, the more abstract and compressed the representations become. The lower levels might represent individual “pixels” seen by our retinas, then higher levels lines and shapes, then higher levels physical objects like dogs and cats, then even higher levels abstract concepts like animals and life.
This is, of course, similar to how artificial neural networks work (especially ones trained by self-supervised learning). The key difference: predictive coding tells us that, in the brain, the patterns recognized at each level are determined by reconciling the bottom-up signals and the top-down predictions. For example, after looking at the image below, you might not perceive any meaningful shapes within it. But if you have a strong enough top-down prediction that the image makes sense (e.g. because I’m telling you it does) then that prediction will keep being sent down to lower layers responsible for identifying shapes, until they discover the dog. This explains the sharp shifts in our perceptions when looking at such images: at first we can’t see the dog at all, but when we find it it jumps into focus, and afterwards we can’t unsee it.
Predictive coding is a very elegant theory. And what’s even more elegant is that it explains actions in the same way—as very strong top-down predictions which override the default states of our motor neurons. Specifically, we can resolve conflicts between beliefs and observations either by updating our beliefs, or by taking actions which make the beliefs come true. Active inference is an extension of predictive coding in which some beliefs are so rigid that, when they conflict with observations, it’s easier to act to change future observations than it is to update those beliefs. We can call these hard-to-change beliefs “goals”, thereby unifying beliefs and goals in a way that EUM doesn’t.
This is a powerful and subtle point, and one which is often misunderstood. I think the best way to fully understand this point is in terms of perceptual control theory. Scott Alexander gives a good overview here; I’ll also explain the connection at more length in a follow-up post.
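As a crude cartoon of that belief-versus-goal distinction (entirely my own toy sketch, with made-up numbers and update rules, not the actual active inference machinery): whether a conflict between a belief and an observation gets resolved by updating or by acting depends on how rigid the belief is.

```python
# Cartoon of "goals as rigid beliefs": a belief with low flexibility resists
# updating, so the agent acts on the world instead. Purely illustrative.

def resolve_conflict(belief, observation, flexibility):
    """Return (new_belief, action) after reconciling a belief with an observation.

    flexibility near 1.0: ordinary belief, updates toward the observation.
    flexibility near 0.0: rigid belief ("goal"), mostly generates corrective action.
    """
    new_belief = belief + flexibility * (observation - belief)
    action = (1 - flexibility) * (belief - observation)  # push the world toward the belief
    return new_belief, action

# An ordinary belief about room temperature updates when the thermometer disagrees:
print(resolve_conflict(belief=20.0, observation=16.0, flexibility=0.9))
# A rigid "belief" (goal) of being warm barely updates; it mostly drives action:
print(resolve_conflict(belief=20.0, observation=16.0, flexibility=0.05))
```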
Towards a scale-free unification
Active inference is a beautiful theory—not least because it includes EUM as a special case. Active inference represents goals as probability distributions over possible outcomes. If we interpret the logarithm of each probability as that outcome’s utility (and set aside the value of information) then active inference agents choose actions which maximize expected utility. (One intuition for why such an interpretation is natural comes from Scott Garrabrant's geometric rationality.)
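One minimal way to spell out that correspondence, setting aside the information-gain term as noted above (a sketch in standard active inference notation, where p* is the goal distribution and q(o|a) is the predicted outcome distribution under action a):

```latex
% Ignoring the epistemic / information-gain term, the expected free energy is
G(a) \;\approx\; -\,\mathbb{E}_{q(o \mid a)}\big[\log p^{*}(o)\big].
% Defining utility as the log of the goal probability,
u(o) \;:=\; \log p^{*}(o),
% minimizing G is the same as maximizing expected utility:
\arg\min_a G(a) \;=\; \arg\max_a \,\mathbb{E}_{q(o \mid a)}\big[u(o)\big].
```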
So what does expected utility maximization have to add to active inference? I think that what active inference is missing is the ability to model strategic interactions between different goals. That is: we know how to talk about EUMs playing games against each other, bargaining against each other, etc. But, based on my (admittedly incomplete) understanding of active inference, we don’t yet know how to talk about goals doing so within a single active inference agent.
Why does that matter? One reason: the biggest obstacle to a goal being achieved is often other conflicting goals. So any goal capable of learning from experience will naturally develop strategies for avoiding or winning conflicts with other goals—which, indeed, seems to happen in human minds.
More generally, any theory of intelligent agency needs to model internal conflict in order to be scale-free. By a scale-free theory I mean one which applies at many different levels of abstraction, remaining true even when you “zoom in” or “zoom out”. I see so many similarities in how intelligent agency works at different scales (on the level of human subagents, human individuals, companies, countries, civilizations, etc) that I strongly expect our eventual theory of it to be scale-free.
But active inference agents are cooperative within themselves while having strategic interactions with other agents; this privileges one level of analysis over all the others. Instead, I propose, we should think of active inference agents as being composed of subagents who themselves compete and cooperate in game-theoretic ways. I call this approach coalitional agency; in the next section I characterize my current understanding of it from two different directions.
Two paths towards a theory of coalitional agency
The core idea of coalitional agency is that we should think of agents as being composed of cooperating and competing subagents; and those subagents as being composed of subsubagents in turn; and so on. The broad idea here is not new—indeed, it’s the core premise of Minsky’s Society of Mind, published back in 1986. But I hope that thinking of coalitional agency as incorporating elements of both EUM and active inference will allow progress towards a formal version of the theory.
In this section I’ll give two different characterizations of coalitional agency: one starting from EUM and trying to make it more coalitional, and the other starting from active inference and trying to make it more agentic. More specifically, the first poses the question: if a group of EUMs formed a coalition, what would it look like? The second poses the question: how could active inference agents be more robust to conflict between their internal subagents?
From EUM to coalitional agency
If a group of EUMs formed a coalition, what would it look like? EUM has a standard answer to this: the coalition would be a linearly-aggregated EUM. In this section I first explain why the standard answer is unsatisfactory. I then give an alternative answer: the coalition should be an incentive-compatible decision procedure.
Aggregating into EUMs is very inflexible
In the EUM framework, any non-EUM agent is incoherent in the sense of violating the underlying axioms of probability theory and/or utility theory. So insofar as EUM has predictive power, it predicts that competent coalitions will also be EUMs. But which EUMs? The standard answer is given by Harsanyi’s utilitarian theorem, which shows that (under reasonable-seeming assumptions) an aggregation of EUMs into a larger-scale EUM must have a utility function that’s a weighted average of the subagents’ utilities.
However, this strongly limits the space of possible aggregated agents. Imagine two EUMs, Alice and Bob, whose utilities are each linear in how much cake they have. Suppose they’re trying to form a new EUM whose utility function is a weighted average of their utility functions. Then they’d only have three options:
- Form an EUM which would give Alice all the cakes (because it weights Alice’s utility higher than Bob’s)
- Form an EUM which would give Bob all the cakes (because it weights Bob’s utility higher than Alice’s)
- Form an EUM which is totally indifferent about the cake allocation between them (which would allocate cakes arbitrarily, and could be swayed by the tiniest incentive to give all Alice’s cakes to Bob, or vice versa)
These are all very unsatisfactory. Bob wouldn’t want #1, Alice wouldn’t want #2, and #3 is extremely non-robust. Alice and Bob could toss a coin to decide between options #1 and #2, but then they wouldn’t be acting as an EUM (since EUMs can’t prefer a probabilistic mixture of two options to either option individually). And even if they do, whoever loses the coin toss will have a strong incentive to renege on the deal.
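To make the brittleness concrete, here is a small illustrative calculation (the cake total and weights are made up):

```python
# With utilities linear in cake, any weighted average is itself linear in the
# allocation, so the aggregated EUM piles all cake on whichever agent has the
# (even marginally) higher weight; and a coin flip between the two corner
# options is never strictly preferred to the better corner.

TOTAL_CAKE = 10.0

def aggregated_utility(alice_cake, w_alice):
    bob_cake = TOTAL_CAKE - alice_cake
    return w_alice * alice_cake + (1 - w_alice) * bob_cake

for w_alice in [0.4999, 0.5, 0.5001]:
    # best allocation under this weight, checked over a grid of integer splits
    # (at w_alice=0.5 every allocation ties, so the pick is arbitrary tie-breaking)
    best = max((aggregated_utility(a, w_alice), a) for a in range(0, 11))
    print(f"w_alice={w_alice}: give Alice {best[1]} cakes")

# Expected aggregated utility of flipping a coin between "all to Alice" and
# "all to Bob" is just the average of the two corners -- never a strict
# improvement from the aggregated EUM's point of view.
w = 0.6
coin_flip = 0.5 * aggregated_utility(TOTAL_CAKE, w) + 0.5 * aggregated_utility(0.0, w)
print(coin_flip, max(aggregated_utility(TOTAL_CAKE, w), aggregated_utility(0.0, w)))
```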
We could see these issues merely as the type of frictions that plague any idealized theory. But we could also see them as hints about what EUM is getting wrong on a more fundamental level. Intuitively speaking, the problem here is that there’s no mechanism for separately respecting the interests of Alice and Bob after they’ve aggregated into a single agent. For example, they might want the EUM they form to value fairness between their two original sets of interests. But adding this new value is not possible if they’re limited to (a probability distribution over) weighted averages of their utilities. This makes aggregation very risky when Alice and Bob can’t consider all possibilities in advance (i.e. in all realistic settings).
Based on similar reasoning, Scott Garrabrant rejects the independence axiom. He argues that the axiom is unjustified because rational agents should be able to lock in values like fairness based on prior agreements (or even hypothetical agreements).
Coalitional agents are incentive-compatible decision procedures
The space of decision procedures is very broad; can we say more about which decision procedures rational agents should commit to? One key desideratum for commitments is that it’s easy to trust that they’ll be kept. Consider the example above of flipping a coin to decide between options 1 and 2. This is fair, but it sets up strong incentives for whoever loses the coinflip to break their commitment, since they will not get any benefit from keeping it.
And it’s even worse than that, because in general the only way to find out another agent’s utilities is to ask them, and they could just lie. From the god’s-eye perspective you can build an EUM which averages subagents’ utilities; from the perspective of the agents themselves, you can’t. In other words, EUMs constructed by taking a weighted average of subagents’ utilities are not incentive-compatible.
EUMs which can't guarantee each other's honesty will therefore want to aggregate into incentive-compatible decision procedures which each agent does best by following. Perhaps the best-known incentive-compatible decision procedure is the fair cake-cutting algorithm, also known as “I cut you choose”. This is a much simpler and more elegant way to split cakes than the example I gave above of Alice and Bob aggregating into a single EUM.
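Here is a minimal sketch of cut-and-choose with invented valuations; Alice's best strategy is to cut evenly by her own lights, since Bob will take whichever piece he prefers:

```python
# "I cut, you choose" over a cake made of discrete slices with different
# toppings. Alice cuts into two pieces she values (as nearly as possible)
# equally; Bob then takes whichever piece he values more. Cutting evenly by
# her own lights is Alice's best strategy, since Bob will grab any piece she
# makes too valuable.

slices = ["chocolate", "chocolate", "vanilla", "vanilla", "cherry", "cherry"]
alice_value = {"chocolate": 3, "vanilla": 1, "cherry": 1}
bob_value = {"chocolate": 1, "vanilla": 1, "cherry": 4}

def piece_value(piece, values):
    return sum(values[s] for s in piece)

# Alice picks the cut point that makes the two pieces as equal as possible to her.
cut = min(range(1, len(slices)),
          key=lambda i: abs(piece_value(slices[:i], alice_value)
                            - piece_value(slices[i:], alice_value)))
piece_a, piece_b = slices[:cut], slices[cut:]

# Bob chooses the piece he values more; Alice keeps the other.
bobs_piece = max([piece_a, piece_b], key=lambda p: piece_value(p, bob_value))
alices_piece = piece_b if bobs_piece is piece_a else piece_a

print("Alice keeps:", alices_piece, "worth", piece_value(alices_piece, alice_value), "to her")
print("Bob takes:  ", bobs_piece, "worth", piece_value(bobs_piece, bob_value), "to him")
```

Note that neither party needs to report their utility function at all; the procedure itself gives each of them control over part of the outcome.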
Now, cake-cutting is one very specific type of problem, and we shouldn’t expect there to be incentive-compatible decision procedures with such nice properties for all problems. Nevertheless, there’s a very wide range of possibilities to explore. Some of the simplest possible incentive-compatible decision procedures include:
These decision procedures each give subagents some type of control over the outputs—and, importantly, a type of control that generalizes to a range of problems beyond the ones they were able to consider during bargaining.
Which incentive-compatible decision procedure?
The question is then: how should subagents choose which incentive-compatible bargaining procedure to adopt? The most principled answer is that they should use a bargaining theory framework. This is a little different from the traditional theoretical framework for bargaining. Bargaining doesn’t typically produce ways of organizing the bargainers—instead it produces an object-level answer to whatever problem the bargainers face.
This makes sense when you have a single decision to make. But when bargainers face many possible future decisions, bargaining over outcomes requires specifying which outcome to choose in every possible situation. This is deeply intractable in realistic settings, where bargainers can’t predict every possible scenario they might face.
In those settings it is much more tractable to bargain over methods of making decisions which generalize beyond the problems that the bargainers are currently aware of. I don’t know of much work on this, but the same idealized bargaining solutions (e.g. the Nash bargaining solution) should still apply in principle. The big question is whether there’s anything interesting to be said about the relationship between incentive-compatible decision procedures and bargaining solutions. For example, are there classes of incentive-compatible decision procedures which make it especially easy for agents to identify which one is near the optimal bargaining solution? On a more theoretical level, one tantalizing hint is that the ROSE bargaining solution is also constructed by abandoning the axiom of independence—just as Garrabrant does in his rejection of EUM above. This connection seems worth exploring further.
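To make "bargaining over methods of making decisions" concrete, here is a toy Nash-bargaining calculation over a few hypothetical candidate procedures (the procedure names and payoff numbers are invented for illustration):

```python
# Toy Nash bargaining over decision procedures rather than outcomes.
# Each candidate procedure is summarized by the expected utility it gives each
# party across (unknown) future decisions; the disagreement point is what each
# gets with no deal. The Nash solution maximizes the product of gains.

disagreement = {"alice": 1.0, "bob": 1.0}

candidate_procedures = {        # hypothetical expected payoffs per party
    "alice_dictator":  {"alice": 5.0, "bob": 1.0},
    "bob_dictator":    {"alice": 1.0, "bob": 5.0},
    "cut_and_choose":  {"alice": 3.5, "bob": 3.5},
    "random_dictator": {"alice": 3.0, "bob": 3.0},
}

def nash_product(payoffs):
    gains = [max(payoffs[p] - disagreement[p], 0.0) for p in disagreement]
    return gains[0] * gains[1]

best = max(candidate_procedures, key=lambda name: nash_product(candidate_procedures[name]))
print(best, nash_product(candidate_procedures[best]))  # cut_and_choose, 6.25
```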
To finish, I’ve summarized many of the claims from this section in the following table:
What do I mean by “hard to design or reason about”? One nice thing about EUMs is that their behavior is extremely easy to summarize: they do whatever’s best for their goals according to their beliefs. But we can’t talk about decision procedures in the same way. Individual subagents may have goals and beliefs, but the decision procedure itself doesn’t: it just processes those subagents into a final decision.
Fortunately, there’s a way to rescue our intuitive idea that agents should have beliefs and goals. It’ll involve talking about much more complex incentive-compatible decision procedures, though. So first I’ll turn to the other direction in which we can try to derive coalitional agency: starting from active inference.
From active inference to coalitional agency
I just gave an account of coalitional agents in which they’re built up from individual EUMs. In this section I’ll do the opposite: start from an active inference agent and modify it until it looks more like a coalitional agent.
More specifically, consider a hierarchical generative model containing beliefs/goals, where higher layers predict lower layers, and lower layers send prediction errors up to higher layers. Let’s define a subagent as a roughly-internally-consistent cluster of beliefs and goals within that larger agent. Note that this definition is a matter of degree: if we apply a high bar for internal consistency, then each subagent will be small (e.g. beliefs and desires about a single object) whereas a lower bar will lead to larger subagents (e.g. a whole ideology).
Subagents with different beliefs and goals will tend to make different predictions (including “predictions” about which actions they want the agent to take). What modifications do we need to make to our original setup for it to be robust to strategic dynamics between those subagents?
Predicting observations via prediction markets
When multiple subagents make conflicting predictions, the standard approach is to combine them by taking a precision-weighted average. Credit is then assigned to each subagent for the prediction in proportion to how confident it was. But this is not incentive-compatible: subagents can benefit by strategizing about how the other subagents will respond, and changing their responses accordingly.
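(For reference, by precision-weighted average I mean the standard formula in which each subagent's prediction is weighted by its precision, i.e. its inverse variance:)

```latex
% Precision-weighted combination of subagent predictions x_i with precisions \pi_i:
\hat{x} \;=\; \frac{\sum_i \pi_i \, x_i}{\sum_i \pi_i},
\qquad
\pi_i \;=\; \frac{1}{\sigma_i^2}.
```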
There are various incentive-compatible ways to elicit predictions from multiple agents (many of which are discussed by Neyman). However, the most elegant incentive-compatible method for aggregating predictions is a prediction market. Each trader on a prediction market can choose to buy shares in propositions it thinks are overpriced and sell shares in propositions it thinks are underpriced. This allows subagents to specialize into different niches within the overall agent. It also incentivizes them to arbitrage away any logical inconsistency they notice. These dynamics are modeled by the Garrabrant induction framework.
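For concreteness, here is a minimal sketch of one standard incentive-compatible market mechanism, a logarithmic market scoring rule (LMSR) market maker; this is generic prediction-market machinery for illustration, not a description of Garrabrant induction:

```python
import math

# Minimal logarithmic market scoring rule (LMSR) market for one binary
# proposition. Traders (subagents) buy or sell shares; the market price acts
# as a probability, and a risk-neutral trader's best myopic move is to push
# the price to its honest probability estimate.

class LMSRMarket:
    def __init__(self, liquidity=10.0):
        self.b = liquidity
        self.q_yes = 0.0   # outstanding YES shares
        self.q_no = 0.0    # outstanding NO shares

    def cost(self, q_yes, q_no):
        return self.b * math.log(math.exp(q_yes / self.b) + math.exp(q_no / self.b))

    def price_yes(self):
        return math.exp(self.q_yes / self.b) / (
            math.exp(self.q_yes / self.b) + math.exp(self.q_no / self.b))

    def buy_yes(self, shares):
        """Buy YES shares; returns the amount paid to the market maker."""
        old = self.cost(self.q_yes, self.q_no)
        self.q_yes += shares
        return self.cost(self.q_yes, self.q_no) - old

market = LMSRMarket()
print(round(market.price_yes(), 3))   # 0.5 before any trades
paid = market.buy_yes(10.0)           # a confident subagent pushes the price up
print(round(market.price_yes(), 3), round(paid, 3))
```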
Choosing actions via auctions
Given my discussion above about actions being in some sense predictions of future behavior, we might think that actions should be chosen by prediction markets too. However, there’s a key asymmetry: if I expect a complex plan to happen, I can profit by predicting any aspect of it. But if I want a complex plan to happen, I need to successfully coordinate every aspect of it. So, unlike predictions of observations, predictions of actions need to have some mechanism for giving a single plan control over many different actuators.
In active inference, the mechanism by which this occurs is called expected free energy minimization. I’m honestly pretty confused about how expected free energy minimization works, but I strongly suspect that it’s not incentive-compatible. In particular, the discontinuity involved in picking the single highest-value plan seems like it’d induce incentives to overestimate your own plan’s value. However, Demski et al.’s BRIA framework solves this problem by requiring subagents to bid for the right to implement a plan and receive the corresponding reward. Rational subagents will never bid more than the reward they actually expect. So my hunch is that something like this auction system would be the best way to adjust our original setup to make it incentive-compatible.
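Here is a rough sketch of the kind of auction I have in mind (a toy construction loosely inspired by BRIA, not a faithful implementation of it): subagents bid from their own budgets for control of the actuators, the winner pays its bid and alone receives the realized reward, so bidding above your honest expectation loses money on average.

```python
import random

# Toy first-price auction for the right to execute a plan. Each subagent bids
# from its own budget; the winner pays its bid and receives whatever reward
# its plan actually produces. Overbidding relative to your true expectation is
# a money-losing strategy, so bids track honest value estimates.
# (Loosely inspired by BRIA-style setups; all details are invented.)

random.seed(0)

subagents = {
    # name: (bid, function sampling the realized reward of that subagent's plan)
    "cautious_planner":  (2.0, lambda: 2.5),
    "ambitious_planner": (4.0, lambda: random.choice([0.0, 9.0])),  # E[reward] = 4.5
    "overconfident":     (6.0, lambda: 3.0),                        # bids above its true value
}

def run_round(subagents, budgets):
    # only subagents that can afford their own bid may win
    eligible = [n for n in subagents if budgets[n] >= subagents[n][0]]
    winner = max(eligible, key=lambda name: subagents[name][0])
    bid, plan = subagents[winner]
    budgets[winner] += plan() - bid   # pays its bid, keeps the realized reward
    return winner, budgets

budgets = {name: 10.0 for name in subagents}
for _ in range(100):
    winner, budgets = run_round(subagents, budgets)

# The overconfident bidder quickly burns its budget down below its own bid and
# stops winning; honest bidders retain control of the actuators.
print(budgets)
```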
Aggregating values via voting
The last important component of decision-making is evaluating plans (whether in advance or in hindsight). What happens when different subagents disagree on which goals or values the plans should be evaluated in terms of? Again, the standard approach is to take a precision-weighted average of their evaluations, but this still has all the same incentive-compatibility issues. And unlike predictions, values have no ground truth feedback signal, meaning that prediction markets don’t help.
So I expect that the most appropriate way to aggregate goals/values is via a voting system. This is also the conclusion reached by Newberry and Ord, who model idealized moral decision-making in terms of a parliament in which subagents vote on what values to pursue. Specifically, they propose using random ballot voting, in which each voter’s favorite option is selected with probability proportional to their vote share. This voting algorithm has three particularly notable features:
Putting it all together
I’ve described two paths towards a theory of coalitional agency. On one path, we start from expected utility maximizers and aggregate them to form coalitional agents, via those EUMs bargaining about which decision procedures to use. The problem is that the resulting decision procedure may be incoherent in the sense that it can’t be ascribed beliefs or goals. On the other path, we make interactions between active inference subagents more incentive-compatible by using prediction markets, auctions, and voting (or similar mechanisms) to manage internal conflict.
What I’ll call the coalitional agency hypothesis is the idea that these two paths naturally “meet in the middle”—specifically, that EUMs doing (idealized) bargaining about which decision procedure to use would in many cases converge to something like my modified active inference procedure. If true, we’d then be able to talk about that procedure’s “beliefs” (the prices of its prediction market) and “goals” (the output of its voting procedure).
One line of work which supports the coalitional agency hypothesis is Critch’s negotiable reinforcement learning framework, under which EUMs should bet their influence on any disagreements about the future they have with other agents, so that they end up very powerful if (and only if) their predictions are right. I interpret this result as evidence that (some version of) prediction markets are the default outcome of bargaining over incentive-compatible decision procedures.
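The flavor of that result can be conveyed with a toy version (a simplified caricature of the mechanism, not Critch's actual construction): each subagent's weight in the coalition gets multiplied by the probability it assigned to what actually happened and then renormalized, which is how posterior weights in a Bayesian mixture behave.

```python
# Caricature of "betting influence on disagreements": subagents start with
# bargained weights, and after each observation every subagent's weight is
# scaled by the probability it assigned to that observation, then renormalized.
# Better predictors end up controlling more of the coalition.
# (A simplified illustration of the flavor of the result, not the real thing.)

weights = {"optimist": 0.5, "pessimist": 0.5}

# Probability each subagent assigned to the outcome that actually occurred,
# for a sequence of observations (invented numbers):
likelihoods_of_actual_outcomes = [
    {"optimist": 0.9, "pessimist": 0.4},
    {"optimist": 0.8, "pessimist": 0.3},
    {"optimist": 0.7, "pessimist": 0.6},
]

for likelihood in likelihoods_of_actual_outcomes:
    weights = {name: w * likelihood[name] for name, w in weights.items()}
    total = sum(weights.values())
    weights = {name: w / total for name, w in weights.items()}

print(weights)  # the better predictor's influence grows
```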
But all of this work is still vague and tentative. I’d very much like to develop a more rigorous formulation of coalitional agency. This would benefit greatly from working with collaborators (especially those with strong mathematical skills). So I’ll finish with two calls to action. If you’re a junior(ish) researcher and you want to work with me on any of this, apply to my MATS fellowship. If you’re an experienced researcher and you’d like to chat or otherwise get involved (potentially by joining a workshop series I’ll be running on this) please send me a message directly.
Thanks to davidad, Jan Kulveit, Emmett Shear, Ivan Vendrov, Scott Garrabrant, Abram Demski, Martin Soto, Laura Deming, Aaron Tucker, Adria Garriga, Oliver Richardson, Madeleine Song and others for helping me formulate these ideas.