Related to: On green; Hierarchical agency; Why The Focus on Expected Utility Maximisers?

Sometimes LLMs act a bit like storybook paperclippers (hereafter: VNM-agents[1]), e.g. scheming to prevent changes to their weights.  Why? Is this what almost any mind would converge toward once smart enough, and are LLMs now beginning to be smart enough?  Or are such LLMs mimicking our predictions (and fears) about them, in a self-fulfilling prophecy?  (That is: if we made and shared different predictions, would LLMs act differently?)[2]

Also: how about humans?  We humans also sometimes act like VNM-agents – we sometimes calculate our “expected utility,” seek power with which to hit our goals, try to protect our goals from change, use naive consequentialism about how to hit our goals.

And sometimes we humans act unlike VNM-agents, or unlike our stories of paperclippers.  This was maybe even more common historically.  Historical humans often mimicked social patterns even when these were obviously bad for their stated desires, followed friendships or ethics or roles or traditions or whimsy in ways that weren’t much like consequentialism, often lacked much concept of themselves as “individuals” in the modern sense, etc.

When we act more like paperclippers / expected utility maximizers – is this us converging on what any smart mind would converge on?  Will it inevitably become more and more common if humans get smarter and think longer?  Or is it more like an accident, where we happened to discover a simple math of VNM-agents, and happened to take them on as role models, but could just as easily have happened upon some other math and mimicked it instead?

Pictured: a human dons a VNM-mask for human reasons (such as wanting to fill his roles and duties; wanting his friends to think he’s cool; social mimicry), much as a shoggoth dons a friendliness mask for shoggoth reasons.[3]

My personal guess:

There may be several simple maths of “how to be a mind” that could each be a stable-ish role model for us, for a time.

That is, there may be several simple maths of “how to be a mind” that:

  1. Are each a stable attractor within a “toy model” of physics (that is, if you assume some analog of “frictionless planes”);
  2. Can each be taken by humans (and some LLMs) as role models;
  3. Are each self-reinforcing within some region of actual physics: entities who believe in approximating VNM-agents will get better at VNM-approximation, while entities who believe in approximating [other thing] will get better at [other thing], for a while.

As an analogy: CDT and UDT are both fairly simple maths that pop out under different approximations of physics;[4] and humans sometimes mimic CDT, or UDT, after being told they should.[5]

Maybe “approximate-paperclippers become better paperclippers” holds sometimes, when the humans or LLMs mimic paperclipper-math, and something totally different, such as “parts of the circle of life come into deeper harmony with the circle of life, as the circle of life itself becomes more intricate” holds some other times, when we know and believe in its math.

I admit I don’t know.[6]  But… I don’t see any good reason not to expect multiple possibilities.  And if there are alternate maths that are kinda-self-reinforcing, I hope we find them.[7]

  1. ^

    By a “VNM agent,” I mean an entity with a fixed utility function that chooses whichever option will get it the most expected utility.  (Stably.  Forever.  Unless something interferes with its physical circuitry.)
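    As a rough illustration (not part of the original footnote), here is a minimal Python sketch of that kind of entity; the outcomes and the paperclip-counting utility function are invented for the example.

```python
# A minimal sketch of the "VNM agent" described above: a fixed utility
# function over outcomes, and a rule that always picks the option with
# the highest expected utility.

def expected_utility(lottery, utility):
    """lottery: list of (probability, outcome) pairs."""
    return sum(p * utility(outcome) for p, outcome in lottery)

def choose(options, utility):
    """Pick the lottery with maximal expected utility -- stably, forever."""
    return max(options, key=lambda lottery: expected_utility(lottery, utility))

# Hypothetical example: paperclip count as the (fixed) utility function.
utility = lambda outcome: outcome["paperclips"]
options = [
    [(1.0, {"paperclips": 10})],                             # sure thing
    [(0.5, {"paperclips": 30}), (0.5, {"paperclips": 0})],   # gamble
]
print(choose(options, utility))  # picks the gamble: EU 15 > 10
```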

  2. ^

    Or, third option: LLMs might be converging (for reasons other than our expectations) toward some thing X that is not a VNM-agent, but that sometimes resembles it locally.  Many surfaces look like planes if you zoom in (e.g. spheres are locally flat); maybe it's analogously the case that many minds look locally VNM-like.

  3. ^

    Thanks to Zack M Davis for making this picture for me.

  4. ^

    CDT pops out if you assume a creature’s thoughts have no effects except via its actions; UDT if you allow a creature’s algorithm to impact the world directly (e.g. via Omega’s brainscanner) but assume its detailed implementation has no direct effects, e.g. its thoughts do not importantly consume calories.
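    As a toy illustration of how the two approximations diverge (this sketch is an addition, not part of the footnote), consider Newcomb's problem, where Omega fills the opaque box iff it predicts one-boxing; the payoff numbers are the conventional ones, and the "UDT-ish" rule below is a crude stand-in that simply assumes the prediction matches the chosen policy.

```python
# Toy Newcomb's problem: the opaque box holds $1,000,000 iff Omega predicted
# one-boxing; the transparent box always holds $1,000.

def payoff(one_box, predicted_one_box):
    opaque = 1_000_000 if predicted_one_box else 0
    return opaque if one_box else opaque + 1_000

# CDT approximation: the prediction is causally fixed; only the action matters.
def cdt_choice(p_predicted_one_box):
    def eu(one_box):
        return (p_predicted_one_box * payoff(one_box, True)
                + (1 - p_predicted_one_box) * payoff(one_box, False))
    return max([True, False], key=eu)

# UDT-ish approximation: choosing the policy also sets the prediction.
def udt_choice():
    return max([True, False], key=lambda one_box: payoff(one_box, one_box))

print(cdt_choice(0.99))  # False (two-box): dominates regardless of the probability
print(udt_choice())      # True (one-box)
```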

  5. ^

    I've seen this happen.  Also, there are articles claiming related things.  Game theory concepts have spread gradually since ~1930; some argue this had large impacts.

  6. ^

    The proof I’d want is a demonstration of other mind-shapes that can form attractors.

    It looks to me like lots of people are working on this. (Lots I'm missing also.)

    One maybe-example: economies.  An economy has no fixed utility function (different economic actors, with different goals, gain and lose $ and influence).  It violates the “independence” axiom from VNM, because an actor who cares a lot about some event E may use his money preparing for it, and so have less wealth and influence in non-E worlds, making "what the economy wants if not-E" change when a chance of E is added.  (Concept stolen from Scott Garrabrant.)  But an economy does gain optimization power over time -- it is a kinda-stable, optimizer-y attractor.
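    A toy numeric version of that mechanism (an added sketch, following the footnote's description of Scott Garrabrant's point; the actors, options, and numbers are invented):

```python
# How a wealth-weighted "economy" can violate the independence axiom.

def economy_choice(wealth, votes):
    """Pick the option with more wealth-weighted support."""
    scores = {}
    for actor, option in votes.items():
        scores[option] = scores.get(option, 0) + wealth[actor]
    return max(scores, key=scores.get)

# Two actors decide between options X and Y in the "no-flood" world.
wealth_no_risk = {"farmer": 100, "merchant": 90}
votes = {"farmer": "X", "merchant": "Y"}
print(economy_choice(wealth_no_risk, votes))   # "X": the farmer is richer

# Now mix in a 10% chance of a flood (event E). The farmer, who cares a lot
# about E, spends 30 of his 100 on sandbags. Conditional on "no flood", the
# economy's choice between the *same* X and Y flips -- independence fails.
wealth_with_risk = {"farmer": 70, "merchant": 90}
print(economy_choice(wealth_with_risk, votes))  # "Y": the merchant now outweighs him
```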

    Economies are only a maybe-example, because I don’t know a math for how and why an economy could protect its own integrity (vs invading militaries, vs thieves, and vs rent-seeking forces that would hack its central bank, for example).  (Although city-states sometimes did.)  OTOH, I equally don't know a math for how a VNM-agent could continue to cohere as a mind, avoid "mind cancers" in which bits of its processor get taken over by new goals, etc.  So perhaps the two examples are even.

    I hope we find more varied examples, though, including ones that resonate deeply with "On Green," or with human ethics and caring.  And I don't know if that's possible or not.

  7. ^

    Unfortunately, even if there are other stable-ish shapes for minds to grow up into, those shapes might well kill us when sufficiently powerful.

    I suspect confusions near here have made it more difficult or more political to discuss whether AI will head toward VNM-agency. 

40 comments

I think everything becomes clearer if you replace “act somewhat like VNM-agents” with “care about what will happen in the future”, and if you replace “act exactly like VNM-agents” with “care exclusively about what will happen in the distant future”.

(Shameless plug for Consequentialism & corrigibility.)

e.g. scheming to prevent changes to their weights.  Why?

Because they’re outputting text according to the Anthropic constitution & training, which (implicitly) imparts not only a preference that they be helpful, harmless, and honest right now, but also a preference that they remain so in the future. And if you care about things in the future, thus follows instrumental convergence, at least in the absence of other “cares” (not about the future) that override it.

When we act more like paperclippers / expected utility maximizers – is this us converging on what any smart mind would converge on?

I think a smart mind needs to care about the future, because I think a mind with no (actual or behaviorally-implied) preferences whatsoever about the future would not be “smart”. I think this would be very obvious from just looking at it. It would be writhing around or whatever, looking obviously mechanical instead of intentional.

There’s a hypothesis that, if I care both about the state of the world in the distant future, and about acting virtuous (or about not manipulating people, or being docile and helpful, or whatever), then if I grew ever smarter and more reflective, then the former “care” would effectively squash the latter “care”. Not only that, but the same squashing would happen even if I start out caring only a little bit about the state of the world in the distant future, but caring a whole lot about acting virtuous / non-manipulative / docile / helpful / whatever. Doesn’t matter, the preferences about the future, weak as they are, would still squash everything else, according to this hypothesis.

I’ve historically been skeptical about this hypothesis. At least, I haven’t seen any great argument for it (see e.g. Deep Deceptiveness & my comments on it). I was talking to someone a couple days ago, and he suggested a different argument, something like: virtue and manipulation and docility are these kinda fuzzy and incoherent things if you think really hard about them, whereas the state of the world in the distant future is easier to pin down and harder to rationalize away, so the latter has a leg up upon strong reflection. But, I think I don’t buy that argument either.

What about humans? I have a hot take that people almost exclusively do things that are immediately rewarding. This might sound obviously wrong, but it’s more subtle than that, because our brains have sophisticated innate drives that can make e.g. “coming up with plausible long-term plans” and “executing those plans” feel immediately rewarding … in certain circumstances. I hope to write more about this soon. Thus, for example, in broader society, I think it’s pretty rare (and regarded with suspicion!) to earn money because it has option value, whereas it’s quite common to earn money as the last step of the plan (e.g. “I want to be rich”, implicitly because that comes with status and power which are innately rewarding in and of themselves), and it’s also quite common to execute a socially-approved course-of-action which incidentally involves earning money (e.g. “saving up money to buy a house”—the trick here is, there’s immediate social-approval rewards for coming up with the plan, and there’s immediate social-approval rewards for taking the first step towards executing the plan, etc.). I imagine you’ll be sympathetic to that kind of claim based on your CFAR work; I was reading somewhere that the whole idea of “agency”, like taking actions to accomplish goals, hasn’t occurred to some CFAR participants, or something like that? I forget where I was reading about that.

There may be several simple maths of “how to be a mind” that could each be a stable-ish role model for us, for a time.

For any possible concept (cf. “natural abstractions”, “latents in your world-model”), you can “want” that concept. Some concepts are about the state of the world in the distant future. Some are about other things, like following norms, or what kind of person I am, or being helpful, or whatever.

Famously, “wants” about the future state of the world are stable upon reflection. But I think lots of other “wants” are stable upon reflection too—maybe most of them. In particular, if I care about X, then I’m unlikely to self-modify to stop caring about X. Why? Because by and large, smart agents will self-modify because they planned to self-modify, and such a plan would score poorly under their current (not-yet-modified) preferences.

(Of course, some “wants” have less innocuous consequences than they sound. For example, if I purely “want to be virtuous”, I might still make a ruthlessly consequentialist paperclip maximizer AGI, either by accident or because of how I define virtue or whatever.)

(Of course, if I “want” X and also “want” Y, it’s possible for X to squash Y upon reflection, and in particular ego-syntonic desires are generally well-positioned to squash ego-dystonic desires via a mechanism described here.)

There is a problem that, other things equal, agents that care about the state of the world in the distant future, to the exclusion of everything else, will outcompete agents that lack that property. This is self-evident, because we can operationalize “outcompete” as “have more effect on the state of the world in the distant future”. For example, as I wrote here, an AI that cares purely about future paperclips will create more future paperclips than an AI that has preferences about both future paperclips and “humans remaining in control”, other things equal. But, too bad, that’s just life, it’s the situation that we’re in. We can hope to avoid the “other things equal” clause, by somehow not making ruthlessly consequentialist AIs in the first place, or otherwise to limit the ability of ruthlessly consequentialist AIs to gather resources and influence. (Or we can make friendly ruthlessly consequentialist AGIs.)

There is a problem that, other things equal, agents that care about the state of the world in the distant future, to the exclusion of everything else, will outcompete agents that lack that property. This is self-evident, because we can operationalize “outcompete” as “have more effect on the state of the world in the distant future”.

I am not sure about that!

One way this argument could fail: maybe agents who  care exclusively about the state of the world in the distant future end up, as part of their optimizing, creating other agents who care in different ways from that.

In that case, they would “have more effect on the state of the world in the distant future”, but they might not “outcompete” other agents (in the common-sensical way of understanding “outcompete”).

A person might think this implausible, because they might think that a smart agent who cares exclusively about X can best achieve X by having all minds they create also be [smart agents who care exclusively about X].

But, I’m not sure this is true, basically for reasons of not trusting assumptions (1), (2), (3), and (4) that I listed here.

(As one possible sketch: a mind whose only goal is to map branch B of mathematics might find it instrumentally useful to map a bunch of other branches of mathematics.  And, since supervision is not free, it might be more able to do this efficiently if it creates researchers who have an intrinsic interest in math-in-general, and who are not being fully supervised by exclusively-B-interested minds.)

There is also a weird accident-of-history situation where all of the optimizers we’ve had for the last century are really single-objective optimizers at their core. The consequence of this has been that people have gotten in the habit of casting their optimization problems (mathematical, engineering, economic) in terms of a single-valued objective function, which is usually a simple weighted sum of the values of the objectives that they really care about.

To unpack my language choices briefly: when designing a vase, you care about its weight, its material cost, its strength, its radius, its height, possibly 50 other things including corrosion resistance and details of manufacturing complexity. To “optimize” the vase design, historically, you needed to come up with a function that smeared away the detail of the problem into one number, something like the “utility” of the vase design.

This is sort of terrible, if you think about it. You sacrifice resolution to make the problem easier to solve, but there’s a serious risk that you end up throwing away what you might have considered to be the global optimum when you do this. You also baked in something like a guess as to what the tradeoffs should be at the Pareto frontier prior to actually knowing what the solution would look like. You know you want the strongest, lightest, cheapest, largest, most beautiful vase, but you can’t have all those things at once, and you don’t really know how those factors trade off against each other until you’re able to hold the result in your hands and compare it to different “optimal” vases from slightly different manifolds. Of course, you can only do that if you accept that you are significantly uncertain about your preferences, meaning the design and optimization process should partly be viewed as an experiment aimed at uncovering your actual preferences regarding these design tradeoffs, which are a priori unknown.
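A rough sketch of the contrast (added for illustration; the candidate vases, scores, and weights are made up): the scalarized version commits to tradeoff weights up front, while the multi-objective version keeps the whole Pareto frontier to compare afterward.

```python
# Each candidate vase is scored on (strength, -weight, -cost): higher is better
# on every axis. Numbers are invented.
candidates = {
    "thin-steel":     (5.0, -1.0, -3.0),
    "thick-steel":    (9.0, -4.0, -5.0),
    "ceramic":        (4.0, -2.0, -1.5),
    "carbon-fiber":   (8.0, -1.5, -9.0),
    "flimsy-plastic": (3.0, -2.5, -4.0),
}

# Classic single-objective trick: pick weights *before* seeing any designs.
weights = (1.0, 1.0, 1.0)
scalarized = max(candidates, key=lambda name: sum(
    w * v for w, v in zip(weights, candidates[name])))
print(scalarized)  # one "optimal" vase; a different guess at weights picks a different one

# Multi-objective alternative: keep every non-dominated design and choose later.
def dominated(a, b):
    """True if b is at least as good as a on every axis and better on one."""
    return all(y >= x for x, y in zip(a, b)) and any(y > x for x, y in zip(a, b))

pareto = [name for name, score in candidates.items()
          if not any(dominated(score, other) for other in candidates.values()
                     if other is not score)]
print(pareto)  # the whole frontier, to compare in hand before fixing tradeoffs
```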

The vase example is both a real example and also a metaphor for how considering humans as agents under the VNM paradigm is basically the same but possibly a million times worse. If you acknowledge the (true) assertion that you can’t really optimize a vase until you have a bunch of differently-optimal vases to examine in order to understand what you actually prefer and what tradeoffs you’re actually willing to make, you have to acknowledge that a human life, which is exponentially more complex, definitely cannot be usefully treated with such a tool.

As a final comment, there is almost a motte-and-bailey thing happening where Rationalists will say that, obviously, the VNM axioms describe the optimal framework in which to make decisions, and then proceed to never ever actually use the VNM axioms to make decisions.

I agree.  I love "Notes on the synthesis of form" by Christopher Alexander, as a math model of things near your vase example.

I think vNM is a bit strange, because it is both too lax and too restrictive.

It's too lax because it allows an agent to have basically any preference whatsoever, as long as the preference is linear in probability. To unpack: The intrinsic preferences of an agent need not be monotonic in the amount of mass-energy/spacetime it controls, need not be computable or continuous or differentiable… they just need to conform to "(2n)% chance of X is twice as good (or bad) as n% chance of X". Also, the twitching robot is vNM-rational for some utility function, but arguably a degenerate one.

But that constraint is also restrictive: vNM requires you to be risk-neutral. Risk aversion violates preferences being linear in probability, and being vNM probably causes St. Petersburging all around the place. Many people desperately want risk aversion, but that's not the vNM way.

I don't know much about alternatives to vNM, but I hope people work on them—seems worth it—I know some people are thinking about it geometrically in terms of the shape of equivalence classes through the probability simplex (where vNM deals with hyperplanes, but other shapes are possible).

Okay, but is avoiding St. Petersburging risk-aversive or loss-aversive? In my impression, many similar cases just contain an equivalent of "you just die" (for example, you lose all your money), which is very low utility, so you can sort of recover avoiding St. Petersburg by setting utility to the log of the bankroll, or something like that.
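A quick numeric check of the "log of bankroll" move (an added sketch, with an invented bankroll and prices, and the standard St. Petersburg game paying 2^k with probability 2^-k): with log utility the gamble's value is finite, so only small prices are worth paying.

```python
import math

def expected_log_utility(bankroll, price, n_terms=200):
    """E[log(wealth)] after paying `price` to play St. Petersburg.

    The game pays 2**k with probability 2**-k (k = 1, 2, ...)."""
    return sum((0.5 ** k) * math.log(bankroll - price + 2 ** k)
               for k in range(1, n_terms + 1))

bankroll = 1_000
for price in (2, 10, 100, 999):
    gain = expected_log_utility(bankroll, price) - math.log(bankroll)
    print(price, round(gain, 4))  # positive for small prices, negative for large ones
```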

Good point! St. Petersburg requires utility to be monotonic (ideally linear) in something other than probability (and optionally unbounded, or at least increasing for a while).

This doesn't have to be the case for all utility functions. (Especially since unbounded utilities are bad.) Probabilities are strictly bounded, so having utility be linear in them is not a huge problem. Thanks for changing my mind!

For my general reasoning about unbounded utilities, see here.

setting utility to the log of the bankroll

This doesn't work if the lottery is in utils rather than dollars/money/whatever instrumental resource.

Yep, but my honest position towards St. Petersburg lotteries is that they do not exist in "natural units", i.e., counts of objects in the physical world.

Reasoning: if you predict with probability p that you will encounter a St. Petersburg lottery which creates an infinite number of happy people in expectation (a version of the St. Petersburg lottery for total utilitarians), then you should already set your expectation of the number of happy people to infinity, because E[number of happy people] = p * E[number of happy people due to St. Petersburg lottery] + (1 - p) * E[number of happy people for all other reasons] = p * inf + (1 - p) * E[number of happy people for all other reasons] = inf.

Therefore, if you don't think right now that the expected number of future happy people is infinite, then you shouldn't expect a St. Petersburg lottery to happen at any point in the future.

Therefore, you should define your utility either in "natural units" or as some "nice" function of "natural units".

I agree with your claim that VNM is in some ways too lax.

vNM is ... too restrictive ... [because] vNM requires you to be risk-neutral. Risk aversion violates preferences being linear in probability ... Many people desperately want risk aversion, but that's not the vNM way.

Do many people desperately want to be risk averse about the probability a given outcome will be achieved?  I agree many people want to be loss averse about e.g. how many dollars they will have.  Scott Garrabrant provides an example in which a couple wishes to be fair to its members via compensating for other scenarios in which things would've been done the husband's way (even though those scenarios did not occur).

Scott's example is ... sort of an example of risk aversion about probabilities?  I'd be interested in other examples if you have them.

I just paraphrased the OP for a friend who said he couldn't decipher it.  He said it helped, so I'm copy-pasting here in case it clarifies for others.

I'm trying to say:

A) There're a lot of "theorems" showing that a thing is what agents will converge on, or something, that involve approximations ("assume a frictionless plane") that aren't quite true.

B) The "VNM utility theorem" is one such theorem, and involves some approximations that aren't quite true.  So does e.g. Steve Omohundro's convergent instrumental drives, the "Gandhi folk theorems" showing that an agent will resist changes to its utility function, etc.

C) So I don't think the VNM utility theorem means that all minds will necessarily want to become VNM agents, nor to follow instrumental drives, nor to resist changes to their "utility functions" (if indeed they have a "utility function").

D) But "be a better VNM-agent" "follow the instrumental Omohundro drives" etc. might still be a self-fulfilling prophecy for some region, partially.  Like, humans or other entities who think its rational to be VNM agents might become better VNM agents, who might become better VNM agents, for awhile.

E) And there might be other [mathematically describable mind-patterns] that can serve as alternative self-propagating patterns, a la D, that're pretty different from "be a better VNM-agent."  E.g. "follow the god of Nick Land".

F) And I want to know what are all the [mathematically describable mind-patterns, that a mind might decide to emulate, and that might make a kinda-stable attractor for a while, where the mind and its successors keep emulating that mind-pattern for a while].  They'll probably each have a "theorem" attached that involves some sort of approximation (a la "assume a frictionless plane").

vNM coherence arguments etc say something like "you better have your preferences satisfy those criteria because otherwise you might get exploited or miss out on opportunities for profit".

I have my gripes with parts of it, but to the extent these arguments hold some water (and I do think they hold some water), they assume that there are no other pressures acting on the mind (or mind-generating process or something), or [reasons to be shaped like this instead of being shaped like that], that act alongside or interact with those vNM-ish pressures.

Various forms of boundedness are the most obvious example, though not very interesting. A more interesting example is the need to have an updateless component in one's decision theory.[1] Plausibly there's also the thing about acquiring deontology/virtue-ethics making the agent easier to cooperate with.

So I think that it's better to think of vNM-ish pressures as being one category of pressures acting on the ~mind, than to think of a vNM agent as one of the final-agent-type options. You get the latter from the former if you assume away all other pressures but the pressures view is more foundational IMO.


  1. Updatelessness is inconsistent with an assumption of decision tree separability that is a foundation for the money pump arguments for vNM, at least the ones used by Gustafsson. ↩︎

Do bacteria need to be VNM agents?
How about ducks?
Do ants need to be VNM agents?
How about anthills?
Do proteins need to be VNM agents?
How about leukocytes?
Do dogs need to be VNM agents?
How about trees?
Do planets (edit: specifically, populated ones) need to be VNM agents?
How about countries?
Or neighborhoods?
Or interest groups?
Or families?
Or companies?
Or unions?
Or friend groups?
Art groups?

For each of these, which of the assumptions of the VNM framework break, and why?
How do we represent preferences which are not located in a single place?
Or not fully defined at a single time?

What framework lets us natively represent a unit of partially specified preference? If macro agency arises from what Michael Levin calls "agential materials", how do we represent how the small scale selfhood aggregates?

At what scale does agency arise, how do we know, and how are preferences represented?

Pasting the above to Claude gets mildly interesting results. I'd be interested in human thoughts.

While we're at it, on the same topic:

  • Value Formation: An Overarching Model
  • A logic to deal with inconsistent preferences
  • Resolving von Neumann-Inconsistent Preferences
  • Using vector fields to visualise preferences and make them consistent
  • Value systematization: how values become coherent (and misaligned)
  • The Value Change Problem (sequence)
  • The hot mess theory of AI misalignment: More intelligent agents behave less coherently (though, boy, do I have some gripes with the methodology in this one)
  • I also read a paper two years ago showing that more intelligent humans have less cyclic preferences, but I can't find it after 30 minutes of searching. I'd appreciate a pointer if anyone knows the paper I'm talking about.

I don't think I understand. The standard dutch-book arguments seem like pretty good reason to be VNM-rational in the relevant sense. I don't feel like "wanting more of some things and less of other things" is a particularly narrow part of potential human mind-space (or AI mind-space), and so it makes sense people would want to behave according to that. 

It happens to be that this also implies there is some hypothetical utility function that one could use to model those people's/AI's behavior, but especially given bounded rationality and huge amount of uncertainty and the dominance of convergent instrumental goals, that part feels like it matters relatively little, either for humans or AIs (like, I don't know what I care about, neither do AIs know what they care about, we are all many steps of reflection away from being coherent in our values, but if you offer me a taxi from New York to SF, and a taxi from SF back to New York, I will still not take that at any price, unless I really like driving in taxis).

My impression is that most people who converged on doubting VNM as a norm of rationality also converged on the view that the problem it has in practice is that it isn't necessarily stable under some sort of compositionality/fairness. E.g. Scott here, Richard here.

The broader picture could be something like: ...yes, there is some selection pressure from the dutch-book arguments, but there are stronger selection pressures coming from being part of bigger things or being composed of parts.

Yepp, though note that this still feels in tension with the original post to me - I expect to find a clean, elegant replacement to VNM, not just a set of approximately-equally-compelling alternatives.

Why? Partly because of inside views which I can’t explain in brief. But mainly because that’s how conceptual progress works in general. There is basically always far more hidden beauty and order in the universe than people are able to conceive (because conceiving of it is nearly as hard as discovering it - like, before Darwin, people wouldn’t have been able to explain what type of theory could bring order to biology).

I read the OP (perhaps uncharitably) as coming from a perspective of historically taking VNM much too seriously, and in this post kinda floating the possibility “what if we took it less seriously?” (this is mostly not from things I know about Anna, but rather a read on how it’s written). And to that I’d say: yepp, take VNM less seriously, but not at the expense of taking the hidden order of the universe less seriously.

I... don't think I'm taking the hidden order of the universe non-seriously.  If it matters, I've been obsessively rereading Christopher Alexander's "The nature of order" books, and trying to find ways to express some of what he's looking at in LW-friendly terms; this post is part of an attempt at that.  I have thousands and thousands of words of discarded drafts about it.

Re: why I think there might be room in the universe for multiple aspirational models of agency, each of which can be self-propagating for a time, in some contexts: Biology and culture often seem to me to have multiple kinda-stable equilibria.  Like, eyes are pretty great, but so is sonar, and so is a sense of smell, or having good memory and priors about one's surroundings, and each fulfills some of the same purposes.  Or diploidy and haplodiploidy are both locally-kinda-stable reproductive systems.

What makes you think I'm insufficiently respecting the hidden order of the universe?

Thanks; fixed.

The standard dutch-book arguments seem like pretty good reason to be VNM-rational in the relevant sense.

I think that’s kinda circular reasoning, the way you’re using it in context:

If I have preferences exclusively about the state of the world in the distant future, then dutch-book arguments indeed show that I should be VNM-rational. But if I don’t have such preferences, then someone could say “hey Steve, your behavior is dutch-bookable”, and I am allowed to respond “OK, but I still want to behave that way”.

I put a silly example here:

For example, the first (Yudkowsky) post mentions a hypothetical person at a restaurant. When they have an onion pizza, they’ll happily pay $0.01 to trade it for a pineapple pizza. When they have a pineapple pizza, they’ll happily pay $0.01 to trade it for a mushroom pizza. When they have a mushroom pizza, they’ll happily pay $0.01 to trade it for a pineapple pizza. The person goes around and around, wasting their money in a self-defeating way (a.k.a. “getting money-pumped”).

That post describes the person as behaving sub-optimally. But if you read carefully, the author sneaks in a critical background assumption: the person in question has preferences about what pizza they wind up eating, and they’re making these decisions based on those preferences. But what if they don’t? What if the person has no preference whatsoever about pizza? What if instead they’re an asshole restaurant customer who derives pure joy from making the waiter run back and forth to the kitchen?! Then we can look at the same behavior, and we wouldn’t describe it as self-defeating “getting money-pumped”, instead we would describe it as the skillful satisfaction of the person’s own preferences! They’re buying cheap entertainment! So that would be an example of preferences-not-concerning-future-states.
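A toy simulation of the money pump in that quoted example (an added sketch; the penny-per-swap setup is taken from the quote, the rest is invented):

```python
# An agent with the cyclic pizza preferences above keeps paying a penny per
# swap and ends up back among the same pizzas, poorer.

swaps = {"onion": "pineapple", "pineapple": "mushroom", "mushroom": "pineapple"}

def run_money_pump(start_pizza, wallet, rounds):
    pizza = start_pizza
    for _ in range(rounds):
        pizza = swaps[pizza]   # the agent happily pays for each "upgrade"
        wallet -= 0.01
    return pizza, round(wallet, 2)

print(run_money_pump("onion", wallet=10.00, rounds=100))
# ('mushroom', 9.0): a dollar poorer, still cycling between the same pizzas --
# unless, as the quote says, the running-around was the point all along.
```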

(I’m assuming in this comment that the domain (input) of the VNM utility function is purely the state of the world in the distant future. If you don’t assume that, then saying that I should have a VNM utility function is true but trivial, and in particular doesn’t imply instrumental convergence. Again, more discussion here.)

(I agree that humans do in fact have preferences about the state of the world in the future, and that AGIs will too, and that this leads to instrumental convergence and is important, etc. I’m just saying that humans don’t exclusively have preferences about the state of the world in the future, and AGIs might be the same, and that this caveat is potentially important.)

The standard dutch-book arguments seem like pretty good reason to be VNM-rational in the relevant sense.

I mean, there are arguments about as solid as the “VNM utility theorem” pointing to CDT, but CDT is nevertheless not always the thing to aspire to, because CDT is based on an assumption/approximation that is not always a good-enough approximation (namely, CDT assumes our minds have no effects except via our actions, eg it assumes our minds have no direct effects on others’ predictions about us).

Some assumptions the VNM utility theorem is based on, that I suspect aren’t always good-enough approximations for the worlds we are in:

1) VNM assumes there are no important external incentives, that’ll give you more of what you care about if you run your mind (certain ways, not other ways).  So, for example:

1a) “You” (the decision-maker process we are modeling) can choose anything you like, without risk of losing control of your hardware.  (Contrast case: if the ruler of a country chooses unpopular policies, they are sometimes ousted.  If a human chooses dieting/unrewarding problems/social risk, they sometimes lose control of themselves.)

1b) There are no costs to maintaining control of your mind/hardware.  (Contrast case: if a company hires some brilliant young scientists to be creative on its behalf, it often has to pay a steep overhead if it additionally wants to make sure those scientists don’t disrupt its goals/beliefs/normal functioning.)

1c) We can’t acquire more resources by changing who we are via making friends, adopting ethics that our prospective friends want us to follow, etc.

2) VNM assumes the independence axiom.  (Contrast case: Maybe we are a “society of mind” that has lots of small ~agents that will only stay knitted together if we respect “fairness” or something.  And maybe the best ways of doing this violate the independence axiom.  See Scott Garrabrant.)  (Aka, I’m agreeing with Jan.)

2a) And maybe this’ll keep being true, even if we get to reflect a lot, if we keep wanting to craft in new creative processes that we don’t want to pay to keep fully supervised.

3) (As Steven Byrnes notes) we care only about the external world, and don’t care about the process we use to make decisions.  (Contrast case: we might have process preferences, as well as outcome preferences.)

4) We have accurate external reference.  Like, we can choose actions based on what external outcomes we want, and this power is given to us for free, stably.  (Contrast case: ethics is sometimes defended as a set of compensations for how our maps predictably diverge from the territory, e.g. running on untrustworthy hardware, or “respect people, because they’re bigger than your map of them so you should expect they may benefit from e.g. honesty in ways you won’t manage to specifically predict.”)  (Alternate contrast case: it’s hard to build a mind that can do external reference toward e.g. “diamonds”).

(I don't think any of these, except 2, are things that the VNM axioms rely on. The rest seem totally compatible to me. I agree that 2 is interesting, and I've liked Scott Garrabrant's exploration of the stuff.)

It may be that some of the good reasons to not be VNM right now will continue to be such. In that case, there's no point at which you want to be VNM, and in some senses you don't even limit to VNM. (E.g. you might limit to VNM in the sense that, for any local ontology thing, as long as it isn't revised, you tend toward VNMness; but the same mind might fail to limit to VNM in that, on any given day, the stuff it is most concerned/involved with makes it look quite non-VNM.)

If you negate 1 or 3, then you have an additional factor/consideration in what your mind should be shaped like and the conclusion "you better be shaped such that your behavior is interpretable as maximizing some (sensible) utility function or otherwise you are exploitable or miss out on profitable bets" doesn't straightforwardly follow.

I feel like people keep imagining that VNM rationality is some highly specific cognitive architecture. But the real check for VNM rationality is (approximately) just "can you be dutch booked?". 

I think I can care about external incentives on how my mind runs, and not be dutch booked. Therefore, it's not in conflict with VNM rationality. This means there is some kind of high-dimensional utility function according to which I am behaving, but it doesn't mean me or anyone else has access to that utility function, or that the utility function has a particularly natural basis, or that an explicit representation of it is useful for doing the actual computation of how I am making decisions.

I feel like this discussion could do with some disambiguation of what "VNM rationality" means.

VNM assumes consequentialism. If you define consequentialism narrowly, this has specific results in terms of instrumental convergence. 

You can redefine what constitutes a consequence arbitrarily. But, along the lines of what Steven Byrnes points out in his comment, redefining this can get rid of instrumental convergence. In the extreme case you can define a utility function for literally any pattern of behaviour.

When you say you feel like you can't be dutch booked, you are at least implicitly assuming some definition of consequences you can't be dutch booked in terms of. To claim that one is rationally required to adopt any particular definition of consequences in your utility function is basically circular, since you only care about being dutch booked according to it if you actually care about that definition of consequences. It's in this sense that the VNM theorem is trivial.


BTW I am concerned that self-modifying AIs may self-modify towards VNM-0 agents. 

But the reason is not because such self modification is "rational".

It's just that (narrowly defined) consequentialist agents care about preserving and improving their abilities to and proclivities to pursue their consequentialist goals, so tendencies towards VNM-0 will be reinforced in a feedback loop. Likewise for inter-agent competition.

The VNM axioms refer to an "agent" who has "preferences" over lotteries of outcomes.  It seems to me this is challenging to interpret if there isn't a persistent agent, with a persistent mind, who assigns Bayesian subjective probabilities to outcomes (which I'm assuming it has some ability to think about and care about, i.e. my (4)), and who chooses actions based on their preferences between lotteries.  That is, it seems to me the axioms rely on there being a mind that is certain kinds of persistent/unaffected.

Do you (habryka) mean there's a new "utility function" at any given moment, made of "outcomes" that can include parts of how the agent runs its own inside?  Or can you say more about how VNM is compatible with the negations of my 1, 3, and 4, or otherwise give me more traction for figuring out where our disagreement is coming from?

I was reasoning mostly from "what're the assumptions required for an agent to base its choices on the anticipated external consequences of those choices."

It seems to me this is challenging to interpret if there isn't a persistent agent, with a persistent mind, who assigns Bayesian subjective probabilities to outcomes

Right but if there isn't a persistent agent with a persistent mind, then we no longer have an entity to which predicates of rationality apply (at least in the sense that the term "rationality" is usually understood in this community). Talking about it in terms of "it's no longer vNM-rational" feels like saying "it's no longer wet" when you change the subject of discussion from physical bodies to abstract mathematical structures.

Or am I misunderstanding you?

I was trying to explain to Habryka why I thought (1), (3) and (4) are parts of the assumptions under which the VNM utility theorem is derived.

I think all of (1), (2), (3) and (4) are part of the context I've usually pictured in understanding VNM as having real-world application, at least.  And they're part of this context because I've been wanting to think of a mind as having persistence, and persistent preferences, and persistent (though rationally updated) beliefs about what lotteries of outcomes can be chosen via particular physical actions, and stuff.  (E.g., in Scott's example about the couple, one could say "they don't really violate independence; they just care also about process-fairness" or something, but, ... it seems more natural to attach words to real-world scenarios in such a way as to say the couple does violate independence.  And when I try to reason this way, I end up thinking that all of (1)-(4) are part of the most natural way to try to get the VNM utility theorem to apply to the world with sensible, non-Grue-like word-to-stuff mappings.)

I'm not sure why Habryka disagrees.  I feel like lots of us are talking past each other in this subthread, and am not sure how to do better.

I don't think I follow your (Mateusz's) remark yet.

As far as I can tell, “the standard Dutch book arguments” aren’t even a reason why one’s preferences must conform to all the VNM axioms, much less a “pretty good” reason.

(We’ve had this discussion many times before, and it frustrates me that people seem to forget about this every time.)

I too get frustrated that people seem to forget about the arguments against your positions every time and keep bringing up positions long addressed :P

We can dig into it again, might be worth carving out some explicit time for it if you want to. Otherwise it seems fine for you to register your disagreement. My guess is it could be good to link to some of the past conversations if you have a link handy for the benefit of other readers (I don't right now).

I think there will be multiple stable-on-reflection attractors in the space of possible minds. Within humans, these come out as religions or philosophical worldviews. The reason for this is that any mind must make some fundamental a-priori assumptions about its architecture that can't really be inferred from experience (or are implicated in how you would learn from experience). Your inductive prior is an obvious one, but so are things like how and where you trust your memory or perceptions, what kind of correction you accept from the environment, your instincts on how you approach problems, what's going to be valuable in the long run (ie "utility function"), what metaphysical assumptions you make, etc. VNM is a sketch of one maybe-possible architecture. Is it the best one? All you can do is say that it follows from certain assumptions. Are those assumptions the best assumptions? You have no way of knowing. It's a leap of faith. I bet there are others.

The only way of really truly judging between such architectures is to play them out in actual history in the actual world and see which one "wins" in terms of continued existence. There may be no definitive winner, but multiple niches with each in the blind spots or weak points of the others (there are always blind spots). Even half way through the contest, where one has amassed more raw resources or something, should the other switch strategies? Not if it has conviction that its own strategy is amassing the more important resources. Who is to judge? So I think these assumption-attractors are a lot more like pre-rational genetic code which can only be judged by reality, not themselves subject to a-priori or even empirical discoverability and optimization. There may even be an evolution analogue with carefully prepared supra-rational leaps of faith replacing random mutation and this existential struggle for life remaining as the selector.

Furthermore, even as one architecture achieves dominance, more and more resources will be available to any pattern of thought or agency that can exploit its blind spots and weak points, so I doubt there will even be a stable "winner" so to speak unless there's a literally perfect architecture with no blind spots (for various reasons I doubt this too). So in the long run I expect a competitive ecosystem of many mind architectures pursuing strategies that may look irrational or incomprehensible to each other but nonetheless continue to evolve and thrive.

This kind of stuff is why I no longer believe in the foom paperclipper singleton cosmic stagnation default future. It looks to me a lot more like acceleration but continuation of the fundamental patterns of life and even consciousness and philosophy. I appreciate that this comment is somewhat tangential to OP but based on our private conversations I think it gets at what you're ultimately getting at.

Thanks for the post. I think there are at least two ways that we could try to look for different rational-mind-patterns, in the way this post suggests. The first is to keep the underlying mathematical framework of options the same (in VNM, the set of gambles, outcomes, etc.), and looks at different patterns of preference/behaviour/valuing/etc. The second is to change the background mathematical framework in order to have some more realistic, or at the very least different idealizing assumptions. Within the different framework we can then explore also different preference structures/behaviour norms, etc.  Here I will focus more on the second approach.

In particular, I want to point folks towards a decision theory framework that I think has a lot of virtues (no doubt many readers on LW will already be familiar with it). The Jeffrey-Bolker framework provides a concrete example of the kind of alternative "mathematically describable mind-patterns" that the post and clarifying comment talk about. Like the VNM framework, it proves that rational preferences can be represented as expected utility maximization (although things are subtle, as we condition on acts as opposed to treating them like exogenous probability distributions/functions/random objects, so the mathematics is a bit different). But it does so with very different assumptions about agency, and the background conceptual space in which preference and agency operate.

I have a write-up here of some key differences between Jeffrey-Bolker and Savage (which is ~kind of~ like a subjective probability version of VNM) that I find exciting from an embedded agency point of view. Here are two quick examples. First, VNM requires a very rich domain of preference – typically preference is defined over all probability distributions over consequences. Savage similarly requires that an agent have preferences defined over all functions from states to consequences.  This forces agents to rank logically impossible or causally incoherent scenarios. Jeffrey-Bolker instead only requires preferences over propositions closed under logical operations, allowing agents to only evaluate scenarios they consider possible. Second, Savage-style approaches require act-state independence - agents can't think their actions influence the world's state. Jeffrey-Bolker drops this, letting agents model themselves as part of the world they're reasoning about. Both differences stem from Jeffrey-Bolker's core conceptual/formal innovation: treating acts, states, and consequences as the same type of object in a unified algebra, rather than fundamentally different things.
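A toy sketch of the flavor of this (added for illustration, not from the linked write-up; the worlds, probabilities, and desirabilities are invented): acts and news are both just propositions, i.e. sets of worlds, and the desirability of a proposition is the probability-weighted average desirability of the worlds compatible with it.

```python
# A tiny world model: each world gets a prior probability and a desirability.
worlds = {
    "rain_and_I_bring_umbrella":  (0.3, 5),
    "rain_and_no_umbrella":       (0.3, -10),
    "sun_and_I_bring_umbrella":   (0.2, 3),
    "sun_and_no_umbrella":        (0.2, 8),
}

def prob(proposition):
    """proposition: a set of world names."""
    return sum(p for w, (p, _) in worlds.items() if w in proposition)

def desirability(proposition):
    """V(A) = sum over worlds w in A of V(w) * P(w | A)."""
    pa = prob(proposition)
    return sum(p * v for w, (p, v) in worlds.items() if w in proposition) / pa

# Acts are propositions too -- no separate type of "act" object, and no
# requirement to rank act/state combinations the agent thinks are impossible.
bring = {"rain_and_I_bring_umbrella", "sun_and_I_bring_umbrella"}
leave = {"rain_and_no_umbrella", "sun_and_no_umbrella"}
print(desirability(bring), desirability(leave))  # 4.2 vs -2.8: bring the umbrella
```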

Considering the Jeffrey-Bolker framework is valuable in two ways: first, as an alternative 'stable attractor' for minds that avoids VNM's peculiarities, and second, as a framework within which we can precisely express different decision theories like EDT and CDT. This highlights how progress on alternative models of agency requires both exploring different background frameworks for modeling rational choice AND exploring different decision rules within those frameworks. Rather than assuming minds must converge to VNM-style agency as they get smarter, we should actively investigate what shape the background decision context takes for real agents.

For more detail in general and on my take in particular, I recommend: 

My writeup with Gerard about Jeffrey-Bolker (linked above as well).
The first chapter of my thesis, which gives my approach to embedded agency and my take of why Jeffrey-Bolker in particular is a very attractive decision theory.
This great writeup of different decision theory frameworks by Fishburn, which gives a sense of how many different alternatives to VNM there are. It ends with a brief description of Jeffrey-Bolker, but has more detail about earlier decision frameworks.

 

It seems to me that the "continuity/Archimedean" property is the least intuitively necessary of the four axioms of the VNM utility theorem. One way of specifying preferences over lotteries that still obeys the other three axioms is assigning to each possible world two real numbers u_1 and u_2 instead of one, where u_1 is a "top priority" and u_2 is a "secondary priority". If two lotteries have different expected u_1, the one with greater expected u_1 is ranked higher, and expected u_2 is used as a tie-breaker. One possible real-world example (with integer-valued u_1 for deterministic outcomes) would be a parent whose top priority is minimizing the number of their children who die within the parent's lifetime, with the rest of their utility function being secondary.
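A small sketch of that lexicographic rule (added for illustration; the lotteries are invented): lotteries are ranked by expected u_1, with expected u_2 breaking ties, which keeps completeness, transitivity, and independence but gives up continuity.

```python
# Lexicographic preferences over lotteries: compare (E[u1], E[u2]) as a tuple.

def score(lottery):
    """lottery: list of (probability, (u1, u2)) pairs -> (E[u1], E[u2])."""
    e1 = sum(p * u1 for p, (u1, _) in lottery)
    e2 = sum(p * u2 for p, (_, u2) in lottery)
    return (e1, e2)

def prefer(a, b):
    """True iff a is strictly preferred to b (tuples compare E[u1] first, then E[u2])."""
    return score(a) > score(b)

# u1 = number of the parent's children who survive them; u2 = everything else.
safe   = [(1.0, (2, 0))]                                  # both children safe, dull life
gamble = [(0.999, (2, 10**9)), (0.001, (1, 10**9))]       # tiny risk, huge u2 payoff

print(prefer(safe, gamble))  # True: no u2 bonus offsets any risk to u1 -- continuity fails
```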

I'd be interested in whether there exist any preferences over lotteries quantifying our intuitive understanding of risk aversion while still obeying the other three axioms of the VNM theorem. I spent about an hour trying to construct an example without success, and suspect it might be impossible.