Let’s start with the simplest coherence theorem: suppose I’ll pay to upgrade pepperoni pizza to mushroom, pay to upgrade mushroom to anchovy, and pay to upgrade anchovy to pepperoni. This does not bode well for my bank account balance. And the only way to avoid having such circular preferences is if there exists some “consistent preference ordering” of the three toppings - i.e. some ordering such that I will only pay to upgrade to a topping later in the order, never earlier. That ordering can then be specified as a utility function: a function which takes in a topping, and gives the topping’s position in the preference order, so that I will only pay to upgrade to a topping with higher utility.
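To make the "position in the preference order" construction concrete, here's a minimal sketch in Python (just the three toppings from the example, with the ordering written out by hand):

```python
# A consistent preference ordering over toppings, encoded as a utility
# function: each topping maps to its position in the preference order.
utility = {"pepperoni": 0, "mushroom": 1, "anchovy": 2}

def will_pay_to_upgrade(current, new):
    # Coherent rule: only pay to move to a topping with strictly higher utility.
    return utility[new] > utility[current]

# Under this rule, a cycle of paid upgrades is impossible: every paid upgrade
# strictly increases utility, so a chain of them can never return to its start.
assert will_pay_to_upgrade("pepperoni", "anchovy")       # will pay for this upgrade
assert not will_pay_to_upgrade("anchovy", "pepperoni")   # but never for this one
```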
More advanced coherence theorems remove a lot of implicit assumptions (e.g. I could learn over time, and I might just face various implicit tradeoffs in the world rather than explicit offers to trade), and add more machinery (e.g. we can incorporate uncertainty and derive expected utility maximization and Bayesian updates). But they all require something-which-works-like-money.
Money has two key properties in this argument:
- Money is additive across decisions. If I pay $1 to upgrade anchovy to pepperoni, and another $1 to upgrade pepperoni to mushroom, then I have spent $1 + $1 = $2.
- All else equal, more money is good. If I spend $3 trading anchovy -> pepperoni -> mushroom -> anchovy, then I could have just stuck with anchovy from the start and had strictly more money, which would be better.
These are the conditions which make money a “measuring stick of utility”: more money is better (all else equal), and money adds. (Indeed, these are also the key properties of a literal measuring stick: distances measured by the stick along a straight line add, and bigger numbers indicate more distance.)
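Here's a toy illustration of those two properties together, assuming a hypothetical $1 price per upgrade: the losses from the circular trades add up, while the topping ends up right back where it started.

```python
# Hypothetical circular preferences, paying $1 per "upgrade".
upgrades = [("anchovy", "pepperoni"), ("pepperoni", "mushroom"), ("mushroom", "anchovy")]

topping, money_spent = "anchovy", 0
for current, new in upgrades:
    assert topping == current
    topping = new
    money_spent += 1          # money is additive across decisions

# Back to the same topping as the start, but $3 poorer: strictly worse,
# since more money is better all else equal.
print(topping, money_spent)   # anchovy 3
```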
Why does this matter?
There’s a common misconception that every system can be interpreted as a utility maximizer, so coherence theorems don’t say anything interesting. After all, we can always just pick some “utility function” which is maximized by whatever the system actually does. It’s the measuring stick of utility which makes coherence theorems nontrivial: if I spend $3 trading anchovy -> pepperoni -> mushroom -> anchovy, then either (1) I don’t have a utility function over toppings (though I could still have a utility function over some other silly thing, e.g. my history of topping-upgrades), or (2) more money is not necessarily better, given the same toppings. Sure, there are ways for that system to “maximize a utility function”, but it can’t be a utility function over toppings which is measured by our chosen measuring stick.
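To see the nontriviality concretely, here's a rough brute-force check (just a sketch, with the observed behavior hard-coded): no ordering of the three toppings, combined with "more money is better", is consistent with paying for the full cycle.

```python
from itertools import permutations

toppings = ["pepperoni", "mushroom", "anchovy"]
# Observed behavior: pays to make each of these "upgrades".
paid_upgrades = [("pepperoni", "mushroom"), ("mushroom", "anchovy"), ("anchovy", "pepperoni")]

def rationalizes(ordering):
    # A candidate utility function over toppings (position in the ordering).
    # With money as the measuring stick, it only pays for strict improvements.
    utility = {t: i for i, t in enumerate(ordering)}
    return all(utility[new] > utility[old] for old, new in paid_upgrades)

# No utility function over toppings rationalizes paying for the full cycle...
assert not any(rationalizes(order) for order in permutations(toppings))
# ...even though a "utility function" over something else (e.g. the length of
# my topping-upgrade history) trivially does.
```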
Another way to put it: coherence theorems assume the existence of some resources (e.g. money), and talk about systems which are Pareto optimal with respect to those resources - e.g. systems which “don’t throw away money”. Implicitly, we’re assuming that the system generally “wants” more resources (instrumentally, not necessarily as an end goal), and we derive the system’s “preferences” over everything else (including things which are not resources) from that. The agent “prefers” X over Y if it expends resources to get from Y to X. If the agent reaches a world-state which it could have reached with strictly less resource expenditure in all possible worlds, then it’s not an expected utility maximizer - it “threw away money” unnecessarily. We assume that the resources are a measuring stick of utility, and then ask whether the system maximizes any utility function over the given state-space measured by that measuring stick.
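Here's one toy way to formalize that “threw away money” criterion, assuming we could somehow enumerate the alternative plans and their resource costs in each possible world (the data structures here are purely illustrative):

```python
def throws_away_resources(chosen, alternatives):
    """A toy check for the "threw away money" condition.

    `chosen` and each alternative: dict mapping possible-world -> (end_state, resources_spent).
    Returns True if some alternative reaches the same end state in every possible
    world while spending no more resources anywhere and strictly less somewhere,
    i.e. the chosen plan is Pareto-dominated with respect to the resource."""
    for alt in alternatives:
        same_states = all(alt[w][0] == chosen[w][0] for w in chosen)
        never_costlier = all(alt[w][1] <= chosen[w][1] for w in chosen)
        sometimes_cheaper = any(alt[w][1] < chosen[w][1] for w in chosen)
        if same_states and never_costlier and sometimes_cheaper:
            return True
    return False

# The circular topping-trader gets flagged: same end state, $3 more spent.
chosen = {"only_world": ("anchovy", 3)}
alternative = {"only_world": ("anchovy", 0)}
print(throws_away_resources(chosen, [alternative]))   # True
```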
Ok, but what about utility functions which don’t increase with resources?
As a general rule, we don’t actually care about systems which are “utility maximizers” in some trivial sense, like the rock which “optimizes” for sitting around being a rock. These systems are not very useful to think of as optimizers. We care about things which steer some part of the world into a relatively small region of state-space.
To the extent that we buy instrumental convergence, using resources as a measuring stick is very sensible. There are various standard resources in our environment, like money or energy, which are instrumentally useful for a very wide variety of goals. We expect a very wide variety of optimizers to “want” those resources, in order to achieve their goals. Conversely, we intuitively expect that systems which throw away such resources will not be very effective at steering parts of the world into relatively small regions of state-space. They will be limited to fewer bits of optimization than systems which use those same resources Pareto optimally.
So there’s an argument to be made that we don’t particularly care about systems which “maximize utility” in some sense which isn’t well measured by resources. That said, it’s an intuitive, qualitative argument, not really a mathematical one. What would be required in order to make it into a formal argument, usable for practical quantification and engineering?
The Measuring Stick Problem
The main problem is: how do we recognize a “measuring stick of utility” in the wild, in situations where we don’t already think of something as a resource? If somebody hands me a simulation of a world with some weird physics, what program can I run on that simulation to identify all the “resources” in it? And how does that notion of “resources” let me say useful, nontrivial things about the class of utility functions for which those resources are a measuring stick? These are the sorts of questions we need to answer if we want to use coherence theorems in a physically-reductive theory of agency.
If we could answer that question, in a way derivable from physics without any “agenty stuff” baked in a priori, then the coherence theorems would give us a nontrivial sense in which some physical systems do contain embedded agents, and other physical systems don’t. It would, presumably, allow us to bound the number of bits-of-optimization which a system can bring to bear, with more-coherent-as-measured-by-the-measuring-stick systems able to apply more bits of optimization, all else equal.
Epistemic status: probably wrong; intuitively, I feel like I'm onto something, but I'm too uncertain about this framing to be confident in it.
I refer to optimizers which can be identified by a measuring stick of utility as "agenty optimizers".
The measuring stick is optimization power. In particular, in the spirit of this sequence, it is the correlation between local optimization and optimization far away. If I have 4 basic actions available to me and each performs two bits of optimization on the universe, I am maximally powerful (for a structure with 4 basic actions) and most definitely either an agent or constructed by one. I speak and the universe trembles.
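If I'm reading the "maximally powerful (for a structure with 4 basic actions)" cap correctly, it's just an information bound (my gloss, so take it with salt): the choice among 4 basic actions carries at most log2(4) = 2 bits, so any one choice can exert at most 2 bits of optimization on things far away.

```python
import math

n_basic_actions = 4
# The choice among n actions carries at most log2(n) bits of information,
# which upper-bounds the bits of optimization that a single choice can
# exert on anything far away.
print(math.log2(n_basic_actions))   # 2.0 bits per basic action choice
```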
One might look at life on Earth, see that it is unusually structured and unusually purposeful, and conclude that it is the work of an agenty optimizer. And they would be wrong.
But if they looked closer, at the pipelines and wires and radio waves on Earth, they might conclude that those were the work of an agenty optimizer, because they turn small actions (flipping a switch, pressing a key) into large, distant effects (water does or doesn't arrive at a village, a purchase is confirmed and a bushel of apples is shipped across the planet). And they would be correct.
In this framing, resources under my control are structures which propagate and amplify my outputs out into large, distant effects (they needn't be friendly, per se, they just have to be manipulable). Thus, a dollar (+ Amazon + a computer + ...) is an invaluable resource because, with it, I can cause any of literally millions of distinct objects to move from one part of the world to another by moving my fingers in the right way. And I can do that because the world has been reshaped to bend to my will in a way that clearly indicates agency to anyone who knows how to look.
However, I haven't the slightest idea how to turn this framework into a method for actually identifying agents (or resources) in a universe with weird physics.
Also, I have a sense that there is an important difference between accumulating asymmetric power (allies, secret AI) and creating approximately symmetrically empowering infrastructure (Elicit), which is not captured by this framework. Maybe the former is evidence of instrumental resource accumulation whereas the latter provides specific information about the creator's goals? But both *are* clear signs of agenty optimization, so maybe it's not relevant to this context?
Also possibly of note is that more optimization power is not strictly desirable because having too many choices might overwhelm your computational limitations.
I'm not sure if this is helpful, but I tend to think of the term "resources" as referring to things that are expended when used (like dollars or fuel cells). I think of reusable tools (like screwdrivers or light-switches) as being in a different category.
(I realize that approximately all tools will in fact wear out after some amount of use, but these still feel like naturally-distinct categories for most things in my actual life. I do not think of my screwdriver or my measuring tape or my silverware as having some finite number of charges that are being expended each time I use them.)