Crossposted from the AI Alignment Forum. May contain more technical jargon than usual.

We would like a mathematical theory which characterises the intuitive notion of ‘optimisation’.

Before Shannon introduced his mathematical theory of communication, the concept of ‘information’ was vague and informal. Then Shannon devised several standardised measures of it, with useful properties and clear operational interpretations. It turned out the concept of information is universal, in that it can be quantified on a consistent scale (bits) across various contexts.

Could something similar be true for ‘optimisation’?

In this post we review a few proposed ways to measure optimisation power and play around with them a bit.

Our general setup will be that we’re choosing actions from a set $A$ to achieve an outcome in a set $\Omega$. We have some beliefs about how our actions will affect outcomes, in the form of a probability distribution $P_a$ over $\Omega$ for each $a \in A$. We also have some preferences, either in the form of a preference ordering $\preceq$ over outcomes, or a utility function $u: \Omega \to \mathbb{R}$ over outcomes. We will also assume there is some default distribution $P_\emptyset$, which we can interpret either as our beliefs about the outcome if we don’t act, or if we take some default action[1].

Yudkowsky

Yudkowsky’s proposed definition just makes use of a preference ordering $\preceq$ over $\Omega$. To measure the optimisation power of some outcome $\omega$, we count up all the outcomes which are at least as good as $\omega$, and divide that number by the total number of possible outcomes. It’s nice to take a negative log to turn this fraction into a number of bits: $$OP(\omega) = -\log_2 \frac{|\{\omega' \in \Omega : \omega' \succeq \omega\}|}{|\Omega|}.$$ If I achieve the second best outcome out of eight, that's $-\log_2 \frac{2}{8} = 2$ bits of optimisation power.
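To make the counting version concrete, here is a minimal sketch in Python (not from the post; the function name and the worst-to-best list representation are my own) that reproduces the second-best-of-eight example:

```python
import math

def yudkowsky_op(outcomes_worst_to_best, achieved_index):
    """Bits of optimisation power for achieving the outcome at `achieved_index`,
    where outcomes are listed from worst to best under the preference ordering."""
    n_total = len(outcomes_worst_to_best)
    n_at_least_as_good = n_total - achieved_index  # the achieved outcome and everything above it
    return -math.log2(n_at_least_as_good / n_total)

# Second best outcome out of eight: 2 of the 8 outcomes are at least this good.
print(yudkowsky_op(list(range(8)), achieved_index=6))  # -> 2.0 bits
```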

If the outcome space is infinite, then we can't count the number of outcomes at least as good as the one we got, so we need a measure to integrate over. If we make use of our default probability distribution here, the resulting quantity has a nice interpretation: $\int_\Omega \mathbb{1}[\omega' \succeq \omega] \, dP_\emptyset(\omega')$ is just $P_\emptyset(\omega' \succeq \omega)$, the default probability of doing as well as we did. Since we're always assuming we've got a default distribution, we might as well define OP like this even in the finite-domain case. Again we’ll take a log to get $$OP(\omega) = -\log_2 P_\emptyset(\omega' \succeq \omega).$$ Now 2 bits of optimisation power means the default probability of doing this well was $\frac{1}{4}$.

So far we’ve just been thinking about the optimisation power of achieving a specific outcome. We can define the optimisation power of an action $a$ as the expected optimisation power under the distribution it induces over outcomes: $$OP(a) = \mathbb{E}_{\omega \sim P_a}\left[OP(\omega)\right].$$

The above definitions just make use of a preference ordering. If we do have a utility function then we’d like our definitions to make use of that too. Intuitively, achieving the second best outcome out of three should constitute more optimisation power in a case where it’s almost as good as the first and much better than the third, compared to a case where it’s only slightly better than the third and much less good than the first[2].
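A sketch of both distribution-based definitions, assuming a finite outcome space and a rank function that encodes the preference ordering (the function names and toy numbers below are illustrative, not from the post):

```python
import math

def op_outcome(omega, p_default, rank):
    """OP(omega) = -log2 of the default probability of doing at least as well as omega."""
    p_at_least_as_good = sum(p for o, p in p_default.items() if rank(o) >= rank(omega))
    return -math.log2(p_at_least_as_good)

def op_action(p_action, p_default, rank):
    """OP(a) = expected OP(omega) under the outcome distribution the action induces."""
    return sum(p * op_outcome(o, p_default, rank) for o, p in p_action.items() if p > 0)

# Toy example: three outcomes ranked bad < ok < good.
rank = {"bad": 0, "ok": 1, "good": 2}.get
p_default = {"bad": 0.7, "ok": 0.2, "good": 0.1}
p_action = {"bad": 0.1, "ok": 0.3, "good": 0.6}
print(op_outcome("good", p_default, rank))   # -log2(0.1) ≈ 3.32 bits
print(op_action(p_action, p_default, rank))  # ≈ 2.5 bits in expectation
```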

Analogously to how we previously asked ‘what fraction of the default probability mass is on outcomes at least as good as this one?’ we could try to ask ‘what fraction of the default expected utility comes from outcomes at least as good as this one?’.

But making use of utility functions in the above definition is tricky. Recall that utility functions originate from the Von Neumann-Morgenstern theorem, which says that if an agent choosing between probabilistic mixtures of options satisfies some weak rationality criteria then it acts as if it maximises expected utility according to a utility function $u$. The utility function produced by the VNM-theorem is only defined up to positive affine transformations, meaning that the utility function $u'(x) = a u(x) + b$, for any $a > 0$ and $b \in \mathbb{R}$, equally well represents the tradeoffs between outcomes the agent is willing to take. Another way of looking at this is that only ratios of differences in utility are real. Therefore any definition of OP which changes when you multiply your utilities by a positive scalar or add a constant is very questionable. Alex Altair ran into this problem when trying to define OP in terms of the rate of change of utility.

One way we considered to get around this is to measure the utility of every outcome relative to the worst one. Let $u_{\min} = \min_{\omega' \in \Omega} u(\omega')$ and define $$OP(\omega) = -\log_2 \frac{\mathbb{E}_{\omega' \sim P_\emptyset}\left[(u(\omega') - u_{\min}) \cdot \mathbb{1}[\omega' \succeq \omega]\right]}{\mathbb{E}_{\omega' \sim P_\emptyset}\left[u(\omega') - u_{\min}\right]}.$$

In words, this asks 'what fraction of the default extra expected utility on top of the minimum comes from outcomes at least this good?'.

This is invariant to translating and rescaling utility. Unfortunately it has a problem of its own - it’s sensitive to our choice of $\Omega$. By adding some made up element to $\Omega$ with large negative utility and zero probability of occurring, we can make OP arbitrarily low. In that case basically all of the default relative expected utility comes from avoiding the worst outcome, which is guaranteed, so you don’t get any credit for optimising.
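Here is one way to put the definition above into code and check the affine-invariance claim numerically. The formalisation (weighting each outcome's default probability by its utility above the minimum) is my reading of the verbal definition, and the names and toy numbers are made up:

```python
import math

def op_relative_to_worst(omega, p_default, u):
    """-log2 of the fraction of the default extra expected utility (above the minimum)
    that comes from outcomes with at least as much utility as omega."""
    u_min = min(u[o] for o in p_default)
    total_extra = sum(p * (u[o] - u_min) for o, p in p_default.items())
    extra_at_least = sum(p * (u[o] - u_min) for o, p in p_default.items() if u[o] >= u[omega])
    return -math.log2(extra_at_least / total_extra)

p_default = {"bad": 0.5, "ok": 0.3, "good": 0.2}
u = {"bad": 0.0, "ok": 1.0, "good": 2.0}
u_affine = {o: 3 * v + 7 for o, v in u.items()}  # positive affine transform of u

print(op_relative_to_worst("good", p_default, u))         # ≈ 0.81 bits
print(op_relative_to_worst("good", p_default, u_affine))  # same value: invariant
```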

Wentworth

An alternative approach to measuring optimisation in bits comes from John’s Utility Maximisation = Description Length Minimisation. In the language of our setup, John shows that you can take any utility function $u$[3], and map it to a probability distribution $P_u$, such that maximising expected utility with respect to $u$ is equivalent to minimising cross-entropy with respect to $P_u$.

At first we weren’t sure of the precise sense of equivalence John meant, but one way to state it is that the interval scale over actions which you obtain by considering $\mathbb{E}_{\omega \sim P_a}[u(\omega)]$ is precisely the same as the interval scale you get by considering the negative cross-entropy $-H(P_a, P_u)$. This will become very explicit with a bit of rejigging.

John defines $P_u$ by taking a softmax: $P_u(\omega) = \frac{e^{u(\omega)}}{\sum_{\omega'} e^{u(\omega')}}$. But we’ll keep working in base 2 and write $P_u(\omega) = \frac{2^{u(\omega)}}{\sum_{\omega'} 2^{u(\omega')}}$ instead[4].

So we get that $$-H(P_a, P_u) = \sum_{\omega} P_a(\omega) \log_2 P_u(\omega) = \mathbb{E}_{\omega \sim P_a}[u(\omega)] - \log_2 \sum_{\omega'} 2^{u(\omega')}.$$

The term on the right doesn't depend on $a$ or $\omega$, so what this means is that $-H(P_a, P_u) = \mathbb{E}_{\omega \sim P_a}[u(\omega)] - c$ for some constant $c$. In other words, this switch from maximising expected utility to minimising cross-entropy is just like translating your utility function by a constant and carrying on as you were.
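A quick numerical check of this identity, with toy utilities of my own choosing: the negative cross-entropy against the base-2 softmax distribution differs from the expected utility by the same constant for every action distribution.

```python
import math

u = {"w1": 0.0, "w2": 1.5, "w3": 3.0}
Z = sum(2 ** v for v in u.values())          # the action-independent normaliser
P_u = {w: 2 ** v / Z for w, v in u.items()}  # base-2 softmax of the utility function

def neg_cross_entropy(P_a):
    return sum(p * math.log2(P_u[w]) for w, p in P_a.items() if p > 0)

def expected_u(P_a):
    return sum(p * u[w] for w, p in P_a.items())

for P_a in ({"w1": 0.5, "w2": 0.3, "w3": 0.2}, {"w1": 0.0, "w2": 0.1, "w3": 0.9}):
    # The two printed numbers agree for every P_a: -H(P_a, P_u) = E_{P_a}[u] - log2(Z).
    print(neg_cross_entropy(P_a), expected_u(P_a) - math.log2(Z))
```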

Since utility functions are only defined up to translation anyway, this means the sense of equivalence is as strong as can be, but it does dispel the impression that this is a useful way to measure optimisation power in bits. The number of extra 'bits' of optimisation power that some action $a$ has compared to another action $a'$ is precisely the number of extra expected utilons - and that means that scaling your utility function scales your optimisation power by the same amount.

Takeaways

We would like a measure of the amount of optimisation taking place which makes use of utility functions, which is invariant under positive affine transformations of those utility functions, and which isn't too sensitive to strange things like the utility of the worst outcome. The search continues.


  1. Or even before we act. ↩︎

  2. See this Stuart Armstrong post for related discussion. ↩︎

  3. Where $\Omega$ is finite. ↩︎

  4. As an aside: notice that multiplying the utility function by a positive constant will change the resulting softmax distributions to be more sharply peaked at the maxima. ↩︎

Comments

You might be interested in some of my open drafts about optimization;

One distinction that I pretty strongly hold as carving nature at its joint is (what I call) optimization vs agents. Optimization has no concept of a utility function, and it's just about the state going up an ordering. Agents are the thing that has a utility function, which they need for picking actions with probabilistic outcomes.

Aha - mm that's quite interesting. As gears says, I'd be curious what, to you, are the defining parts of agents that imply that generic optimisation processes like natural selection and gradient descent aren't agents while humans and animals are.

Is it about action / counterfactuality?

EDIT: If I take your perspective seriously that optimisation only talks about preference orderings not utility functions then maybe the supposed deficiency of the Yudkowsky definition is not so big.

 

We could then define analogs of the entropy as

$$H_{\succeq}(P) = -\mathbb{E}_{\omega \sim P}\left[\log_2 P(\omega' \succeq \omega)\right]$$

and cross-entropy as

$$H_{\succeq}(P, Q) = -\mathbb{E}_{\omega \sim P}\left[\log_2 Q(\omega' \succeq \omega)\right],$$

which for $P = P_a$ and $Q = P_\emptyset$ is the same as $OP(a)$ defined above.

We can then also consider a relative entropy / optimisation measure

$$D_{\succeq}(P \,\|\, Q) = H_{\succeq}(P, Q) - H_{\succeq}(P),$$

as well as the reverse where we flip $P$ and $Q$.

Given two variables $X, Y$ on $\Omega$, where $\Omega$ splits as $\Omega_X \times \Omega_Y$ with the natural induced preference ordering, we have the $\succeq$-relevant mutual information.

mmmm this is quite nice actually. 

I'd still like to understand better why utility functions are intrinsically about agents while preference orderings are about optimisation. This isn't totally apparent to me. 

My best guess about the core difference between optimization and agency is the thing I said above about, "a utility function, which they need for picking actions with probabilistic outcomes".

An agent wants to move the state up an ordering (its optimization criterion). But an agent also has enough modelling ability to know that any given action has (some approximation of) a probability distribution over outcomes. (Maybe this is what you mean by "counterfactuality".) Let's say you've got a toy model where your ordering over states is A < B < C < D < E and you're starting out in state C. The only way to decide between [a 30% chance of B + a 70% chance of D] and [a 40% chance of A + a 60% chance of E] is to decide on some numerical measure for how much better E is than D, et cetera.
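A toy illustration of this point, with made-up cardinal utilities: two utility functions that agree with the ordering A < B < C < D < E can disagree about which lottery is better, so the ordering alone can't settle the choice.

```python
def expected_utility(lottery, u):
    return sum(p * u[x] for x, p in lottery.items())

lottery1 = {"B": 0.3, "D": 0.7}
lottery2 = {"A": 0.4, "E": 0.6}

u_flat  = {"A": 0, "B": 1, "C": 2, "D": 3, "E": 3.5}  # E only slightly better than D
u_spiky = {"A": 0, "B": 1, "C": 2, "D": 3, "E": 10}   # E much better than D

for u in (u_flat, u_spiky):
    print(expected_utility(lottery1, u), expected_utility(lottery2, u))
# u_flat:  2.4 vs 2.1 -> prefer lottery 1;  u_spiky: 2.4 vs 6.0 -> prefer lottery 2
```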

Gradient descent doesn't have to do this at all. It just looks at the gradient and is like, number go down? Great, we go in the down direction. Similarly, natural selection isn't doing this either. It's just generating a bunch of random mutations and then some of them die.

(I'm not totally confident that one couldn't somehow show some way in which these scenarios can be mathematically described as calculating an expected utility. But I haven't needed to pull in these ideas for deconfusing myself about optimization.)

Uhm two comments/questions on this.

  1. Why do you need to decide between those probability distributions? You only need to get one action (or distribution thereof) out. You can do it without deciding, eg by taking their average and sampling. On the other hand vNM tells us utility is being assigned if your choice satisfies some conditions, but vNM = agency is a complicated position to hold.

  2. We know that at some level every physical system is doing gradient descent or a variational version thereof. So depending on the scale you model a system, you would assign different degrees of agency?

By the way gradient descent is a form of local utility minimization, and by tweaking the meaning of 'local' one can get many other things (evolution, Bayesian inference, RL, 'games', etc).

Isn't gradient descent agentic over the parameters being optimized, according to the "moved by reasons" definition in discovering agents?

I mean, that makes sense according to their definition, I think I'm just defining the word differently. Personally I think defining "agent" such that gradient descent is an agent seems pretty off from the colloquial use of the word agent.

I would be interested to see a sketch of how you mathematize "agent" such that gradient descent could be said to not have a utility function. Best as I can tell, "having a utility function" is a noninteresting property that everything has - a sort of panagentism implied by trivial utility functions. Though nontriviality of utility functions might be able to define what you're talking about, and I can imagine there being some nontriviality definitions that do exclude gradient descent over boxed parameters, eg that there's no time $t$ where the utility function becomes indifferent. Any utility function that only cares about the weights becomes indifferent in finite time, I think? so this should exclude the "just sit here being a table" utility function. Although, perhaps this is insufficiently defined because I haven't specified what physical mechanism to extract as the preference ordering in some cases in which case there could totally be agents. I'd be curious how you try to define this sort of thing, anyway.

(that is to say that for a utility function $u$ such that there are no worldlines $w_1$ and $w_2$ that diverge at $t$ such that $u(w_1) = u(w_2)$; call that constraint 1, "never being indifferent between timelines". though that version of the constraint might demand that the utility function never be indifferent to anything at all, so perhaps a weaker constraint might be that there be no time where the utility function is indifferent to all possible worldlines; ie, if constraint 1 is no worldlines that diverge and yet get the same order position, constraint 2 is at all times $t$ there are at least one unique $w_1$ and $w_2$ that diverge at $t$ such that $u(w_1) \neq u(w_2)$. diverge being defined as $w_1(t') = w_2(t')$ for early times $t' < t$ and $w_1(t') \neq w_2(t')$ for some $t' \geq t$.)

Nice, I'd read the first but didn't realise there were more. I'll digest later.

I think agents vs optimisation is definitely reality-carving, but not sure I see the point about utility functions and preference orderings. I assume the idea is that an optimisation process just moves the world towards states, but an agent tries to move the world towards certain states i.e. chooses actions based on how much they move the world towards certain states, so it makes sense to quantify how much of a weighting each state gets in its decision-making. But it's not obvious to me that there's not a meaningful way to assign weightings to states for an optimisation process too - for example if a ball rolling down a hill gets stuck in the large hole twice as often as it gets stuck in the medium hole and ten times as often as the small hole, maybe it makes sense to quantify this with something like a utility function. Although defining a utility function based on the typical behaviour of the system and then trying to measure its optimisation power against it gets a bit circular.

Anyway, the dynamical systems approach seems good. Have you stopped working on it?

Mostly it's that I've found that, while trying to understand optimization, I've never needed to put "weights" on the ordering. (Of course, you always could map your ordering onto a monotonically increasing function.)

I think the concept of "trying" mostly dissolves under the kind of scrutiny I'm trying to apply. Or rather, to well-define "trying", you need a whole bunch of additional machinery that just makes it a different thing than (my concept of) optimization, and that's not what I'm studying yet.

I've also been working entirely in deterministic settings, so there's no sense of "how often" a thing happens, just a single trajectory. (This also differentiates my thing from Flint's.)

I haven't stopped working on the overall project. I do seem to have stopped writing and editing that particular sequence, though. I'm considering totally changing the way I present the concept (such that the current Intro post would be more like a middle-post) so I decided to just pull the trigger on publishing the current state of it. I'm also trying to get more actual formal results, which is more about stuff from the end of that sequence. But I'm pretty behind on formal training, so I'm also trying to generally catch up on math.

Another component of these which I'm usually skeptical of: the existence of a prior over actions/outcomes, and measuring optimization power via how narrow a target has been hit. This seems weird to me, because it suggests a sense in which having more knowledge about the inner workings of an agent should make its outputs objectively less optimized towards a goal, or that writing an optimization algorithm yourself makes it no longer an optimization algorithm because you know where it's going to land. I understand that various people have various fixes, but they all feel pretty ad-hoc to me.

I think you might be missing the intended meaning of Yudkowsky's measure. It's intended to be a direct application of Bayes. That means it's necessarily in the context of a prior. His measure of optimization is: under the belief that there is no optimization process present, how surprised would you be by how far up the state ordering the state ended up? And if there is an optimization process, then it will end up surprisingly far up relative to that. The stronger it is, the more surprising. We're not saying you do believe there is no optimizer. But then if you condition on there being an optimizer, and ask "how surprising is it that the optimizer does optimization?" then of course the surprise disappears. It's not that having more knowledge makes it objectively less of an optimizer, it's that it makes it subjectively less surprising.

But then if you know everything, then nothing ever will be an optimizer!

Like, I'll also be less surprised, but the definition using priors seems non-fundamental in the same way that a definition using priors of atoms would be non-fundamental. I imagine the following dialogue

Me: What is an atom?

You: You can tell whether or not an atom is happening by ranking the worlds based on how much the hypothesis 'an atom is happening' predicts those worlds.

Me: Ok, something feels off about that definition. What about, how can I tell which parts of the world are atoms?

You: You can tell which parts of the world are atoms by computing how well the hypothesis 'an atom is happening' predicts the world after you replace various sections of the world with random noise. The smallest section of the world which when randomized reduces the probability of the hypothesis 'an atom is happening' to zero is what we call an atom.

Me: That still seems weird, and I don't actually know if you're going to be able to develop an atomic theory based off that definition.

Okay, that's an interesting comparison. Maybe this will help; Yudkowsky's measure of optimization is a measure, like of how much it's happening, rather than the definition. Then the definition is "when a system's state moves up an ordering". Analogously, objects have length, and you can tell "how much of an object" there is by how long it is. And if there's no object, then it will have zero length. But that doesn't make the definition of "object" be "a thing that has length". Does that make sense?

aysja:

Yudkowsky’s measure still feels weird to me in ways that don’t seem to apply to length, in the sense that length feels much more to me like a measure of territory-shaped things, and Yudkowsky’s measure of optimization power seems much more map-shaped (which I think Garrett did a good job of explicating). Here’s how I would phrase it:

Yudkowsky wants to measure optimization power relative to a utility function: take the rank of the state you’re in, take the total number of all states that have equal or greater rank, and then divide that by the total number of possible states. There are two weird things about this measure, in my opinion. The first is that it’s behaviorist (what I think Garrett was getting at about distinguishing between atom and non-atom worlds). The second is that it seems like a tricky problem to coherently talk about “all possible states.” 

So, like, let’s say that we have two buttons next to each other. Press one button and get the world that maxes out your utility function. Press the other and, I don’t know, you get a taco. According to Yudkowsky’s measure, pressing one of these buttons is evidence of vastly more optimization power than the other even though, intuitively, these seem about “equally hard” from the agent's perspective. 

This is what I mean about it being “behaviorist”—with this measure you only care about which world state attains (and how well that state ranks), but not how you got to that state. It seems clear to me that both of these are relevant in measuring optimization power. Like, conditioned on certain environments some things become vastly easier or harder. Getting a taco is easy in Berkeley, getting a taco is hard in a desert. And if your valuation of taco utility doesn’t change, then your optimization power can end up being largely a function of your environment, and that feels… a bit weird? 

On the flip side, it’s also weird that it can vary so much based on the utility function. If someone is maximally happy watching TV at home all of the time, I feel hesitant to say that they have a ton of optimization power? 

The thing that feels lacking in both of these cases, to me, is the ability to talk about how hard these goals are to achieve in reality (as a function of agent and environment). Because the difficulty of achieving the same world state can vary dramatically based on the environment and the agent. Grabbing a water bottle is trivial if there is one next to me, grabbing one if I have to construct it out of thermodynamic equilibrium is vastly harder. And importantly, the difference here isn’t in my utility function, but in how the environment shapes the difficulty of my goals, and in my ability as an agent to do these different things. I would like to say that the former uses less optimization power than the latter, and that this is in part a function of the territory.

You can perhaps rescue this by using a non-uniform prior over “all possible states,” and talk about how many bits it takes to move from that distribution to the distribution we want. So like, when I’m in the desert, the state “have a taco” is less likely than when I’m in Berkeley, therefore it takes more optimization power to get there. But then we run into some other problems.

The first is what Garrett points out, that probabilities are map things, and it’s a bit… weird for our measure of a (presumably) territory thing to be dependent on them. It’s the same sort of trickiness that I don’t feel we’ve properly sorted out in thermodynamics—namely, that if we take the existence of macrostates to be reflections of our uncertainty (as Jaynes does), then it seems we are stuck saying something to the effect of “ice cubes melt because we become more uncertain of their state,” which seems… wrong. 

The second is that I claim that figuring out the “default” distribution is the entire problem, basically. Like, how do I know that a taco appearing in the desert is less likely than it is in Berkeley? How do I know what grabbing a bottle is more likely when there is a bottle rather than an equilibrium soup? Constructing the “correct” distribution, to the extent that makes sense, over the default outcomes seems to me to be the entire problem of figuring out what makes some tasks easier or harder, which is close to what we were trying to measure in the first place. 

I do expect there is a way to talk about the correct default distribution, but that it’s tricky, and that part of why it’s so tricky is because it’s a function of both map and territory shaped things. In any case, I don’t think you get a sensible measure of optimization or other agency-terms if you can’t talk about them as things-in-the-territory (which neither of these measures really do); I’d really like to be able to. I also agree that an explanation (or measure) of atoms as Garrett laid out is unsatisfying; I feel unsatisfied here too, for similar reasons. 

Small note: Yudkowsky's definition is about a preference order, not a utility function. Indeed, this was half the reason we did the project in the first place!

The first is what Garrett points out, that probabilities are map things, and it’s a bit… weird for our measure of a (presumably) territory thing to be dependent on them. It’s the same sort of trickiness that I don’t feel we’ve properly sorted out in thermodynamics—namely, that if we take the existence of macrostates to be reflections of our uncertainty (as Jaynes does), then it seems we are stuck saying something to the effect of “ice cubes melt because we become more uncertain of their state,” which seems… wrong.

For this part, my answer is Kolmogorov complexity. An ice cube has lower K-complexity than the same amount of liquid water, which is a fact about the territory and not our maps. (And if a state has lower K-complexity, it's more knowable; you can observe fewer bits, and predict more of the state.)

One of my ongoing threads is trying to extend this to optimization. I think a system is being objectively optimized if the state's K-complexity is being reduced. But I'm still working through the math.

Yeah... so these are reasonable thoughts of the kind that I thought through a bunch when working on this project, and I do think they're resolvable, but to do so I'd basically be writing out my optimization sequence.

I agree with Alexander below though, a key part of optimization is that it is not about utility functions, it is only about a preference ordering. Utility functions are about choosing between lotteries, which is a thing that agents do, whereas optimization is just about going up an ordering. Optimization is a thing that a whole system does, which is why there's no agent/environment distinction. Sometimes, only a part of the system is responsible for the optimization, and in that case you can start to talk about separating them, and then you can ask questions about what that part would do if it were placed in other environments.


Hm, I'm not sure this problem comes up.

Say I've built a room-tidying robot, and I want to measure its optimisation power. The room can be in two states: tidy or untidy. A natural choice of default distribution is my beliefs about how tidy the room will be if I don't put the robot in it. Let's assume I'm pretty knowledgeable and I'm extremely confident that in that case the room will be untidy: $P_\emptyset(\text{untidy}) = 0.9995$ and $P_\emptyset(\text{tidy}) = 0.0005$ (we do have to avoid probabilities of 0, but that's standard in a Bayesian context). But really I do put the robot in and it gets the room tidy, for an optimisation power of $-\log_2(0.0005) \approx 11$ bits.
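A quick check of the arithmetic, taking the default probabilities as given above:

```python
import math
print(-math.log2(0.0005))  # ≈ 10.97, i.e. about 11 bits
```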

That 11 bits doesn't come from any uncertainty on my part about the optimisation process, although it does depend on my uncertainty about what would happen in the counterfactual world where I don't put the robot in the room. But becoming more confident that the room would be untidy in that world makes me see the robot as more of an optimiser.

Unlike in information theory, these bits aren't measuring a resolution of uncertainty, but a difference between the world and a counterfactual.

aysja:

I don't see the difference between "resolution of uncertainty" and "difference between the world and a counterfactual." To my mind, resolution of uncertainty is reducing the space of counterfactuals, e.g., if I'm not sure whether you'll say yes or no, then you saying "yes" reduces my uncertainty by one bit, because there were two counterfactuals. 

I think what Garrett is gesturing at here is more like "There is just one way the world goes, the robot cleans the room or it doesn't. If I had all the information about the world, I would see the robot does clean the room, i.e., I would have no uncertainty about this, and therefore there is no relevant counterfactual. It's not as if the robot could have not cleaned the room, I know it doesn't. In other words, as I gain information about the world, the distance between counterfactual worlds and actual worlds grows smaller, and then so does... the optimization power? That's weird." 

Like, we want to talk about optimization power here as "moving the world more into your preference ordering, relative to some baseline" but the baseline is made out of counterfactuals, and those live in the mind. So we end up saying something in the vicinity of optimization power being a function of maps, which seems weird to me.  

The above formulas rely on comparing the actual world to a fixed counterfactual baseline. Gaining more information about the actual world might make the distance between the counterfactual baseline and the actual world grow smaller, but it also might make it grow bigger, so it's not the case that the optimisation power goes to zero as my uncertainty about the world decreases. You can play with the formulas and see.

But maybe your objection is not so much that the formulas actually spit out zero, but that if I become very confident about what the world is like, it stops being coherent to imagine it being different? This would be a general argument against using counterfactuals to define anything. I'm not convinced of it, but if you like you can purge all talk of imagining the world being different, and just say that measuring optimisation power requires a controlled experiment: set up the messy room, record what happens when you put the robot in it, set the room up the same, and record what happens with no robot.

Another way of looking at this is that only ratios of differences in utility are real.

I suspect real-world systems replace argmax with something like softmax, and in that case the absolute scale of the utility function becomes meaningful too (representing the scale at which it even bothers optimizing).

Yes might very well be. But how do we find the scale parameter? Where does it come from?

which is invariant under positive affine transformations of those utility functions

I expect this is not possible. If you measure optimization relative to a utility function, then having a universal scale would mean you could compare two agents with different utilities and say which one optimizes its utility better in an absolute sense. Take two identical retargetable agents, give agent 1 an easy task and agent 2 an impossibly difficult task. How would your ideal measure of optimization compare them? Should it say they are as good as each other because they are, apart from the utility functions, identical? Or should it say agent 1 is better because it reaches its goal?

I think that, if you want to use utilities, you have to restrict yourself to comparisons between agents with the same utility function.

Probably the easy utility function makes agent 1 have more optimisation power. I agree this means comparisons between different utility functions can be unfair, but not sure why that rules out a measure which is invariant under positive affine transformations of a particular utility function?

why that rules out a measure which is invariant under positive affine transformations of a particular utility function?

Ah, ok. I assumed you wanted the invariance to make cross-utility comparisons. From your question I infer you want it instead for within-utility comparisons.

The reason I assumed the former is that I don't consider invariance useful for within-utility comparisons. Like utility is a scale that compares preferences, you want a scale on which to rate retargetable agents optimizing the same utility. The particular units should not matter.

In the case of utility, you can renormalize it in $[0, 1]$ if it's bounded from above and below. An important example where this is not possible is the log scoring rule for predictions, i.e., an agent that outputs probabilities for unobserved events, with utility given by the log of the probability it assigned to the event that occurs, will output its actual probability assignments. You can define "actual probability assignments" operatively as the probabilities the agent would use to choose the optimal action for any utility function, assuming the agent is retargetable. Does this suggest something about the feasibility of standardizing any optimization score? Maybe it's not possible in general either? Maybe it's possible under useful restrictions on the utility? For example, of the little I know about Infra-Bayesianism, there's that they use utilities in $[0, 1]$.
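A small numerical illustration (my own toy setup, not the commenter's) of the log-scoring property invoked here: for a binary event, the reported probability that maximises expected log score is the agent's actual credence.

```python
import math

def expected_log_score(reported_p, believed_p):
    """Expected log2 score for a binary event the agent believes occurs with probability believed_p."""
    return believed_p * math.log2(reported_p) + (1 - believed_p) * math.log2(1 - reported_p)

believed_p = 0.7
best_score, best_report = max(
    (expected_log_score(q / 100, believed_p), q / 100) for q in range(1, 100)
)
print(best_report)  # 0.7 - honest reporting maximises the expected score
```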

Your last paragraph about retargetability sounds quite interesting. Do you have a reference for this story?

No, I'm making thoughts up as I argue.

I'm not sure if with "paragraph about retargetability" you are attaching a label to the paragraph or expressing specific care about "retargetability". I'll assume the latter.

I used the term "retargetable agents" to mean "agents with a defined utility-swap-operation", because in general an agent may not be defined in a way that makes it clear what it means to "change its utility". So, whenever I invoke comparisons of different utilities on the "same" agent, I want a way to mark this important requirement. I think the term "retargetable agent" is a good choice, I found it in TurnTrout's sequence, and I think I'm not misusing it even though I use it to mean something a bit different.

Even without cross-utility comparisons, when talking above about different agents with the same utility function, I preferred to say "retargetable agents", because: what does it mean to say that an agent has a certain utility function, if the agent is not also a perfect Bayesian inductor? If I'm talking about measures of optimization, I probably want to compare "dumb" agents with "smart" agents, and not only in the sense of having "dumb" or "smart" priors. So when I contemplate the dumb agent failing in getting more utility in a way that I can devise but it can't, shouldn't I consider it not a utility maximizer? If I want to say that an algorithm is maximizing utility, but stopping short of perfection, at some point I need to be specific about what kind of algorithm I'm talking about. It seems to me that a convenient agnostic thing I could do is considering a class of algorithms which have a "slot" for the utility function.

Related: when Yudkowsky talks about utility maximizers, he doesn't just say "the superintelligence is a utility maximizer", he says "the superintelligence is efficient relative to you, fact from which you can make some inferences and not others, etc."

I feel like the "obvious" thing to do is to ask how rare (in bits) the post-optimization EV is according to the pre-optimization distribution. Like, suppose that pre-optimization my probability distribution over utilities I'd get is normally distributed, and after optimizing my EV is +1 standard deviation. Probability of doing that well or better is 0.158, which in bits is 2.65 bits.
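Checking those numbers with scipy's normal survival function (assuming scipy is available):

```python
import math
from scipy.stats import norm

p = norm.sf(1.0)         # P(outcome >= mean + 1 sd) under a normal prior ≈ 0.1587
print(p, -math.log2(p))  # ≈ 0.159, ≈ 2.65 bits
```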

Seems indifferent to affine transformation of the utility function, adding irrelevant states, splitting/merging states, etc. What are some bad things about this method?

This is the same as Eliezer's definition, no? It only keeps information about the order induced by the utility function.

Oh, whoops.

Unfortunately it has a problem of its own - it’s sensitive to our choice of $\Omega$. By adding some made up element to $\Omega$ with large negative utility and zero probability of occurring, we can make OP arbitrarily low. In that case basically all of the default relative expected utility comes from avoiding the worst outcome, which is guaranteed, so you don’t get any credit for optimising.

 

What if we measure the utility of an outcome relative not to the worst one but to the status quo, i.e., the outcome that would happen if we did nothing/took null action?

In that case, adding or subtracting outcomes to/from $\Omega$ doesn't change $OP(\omega)$ for outcomes that were already in $\Omega$, as long as the default outcome also remains in $\Omega$.

Obviously, this means that $OP(\omega)$ for any $\omega$ depends on the choice of default outcome. But I think it's OK? If I have $1000 and increase my wealth to $1,000,000, then I think I "deserve" being assigned more optimization power than if I had $1,000,000 and did nothing, even if the absolute utility I get from having $1,000,000 is the same.

We’re already comparing to the default outcome in that we’re asking “what fraction of the default expected utility minus the worst comes from outcomes at least this good?”.

I think you’re proposing to replace “the worst” with “the default”, in which case we end up dividing by zero.

We could pick some other new reference point other than the worst, but different from the default expected utility. (But that does introduce the possibility of negative OP and still has sensitivity issues.)

It seems to me that an optimization measure should take into account "what would have happened anyway". Like, suppose your list of outcomes is:

A) the laws of physics continue unbroken

B) the laws of physics are broken in a first way

C) the laws of physics are broken in a second way

...

add a zillion ways to break the laws of physics

Obviously anything ordering the list of outcomes in this way is super good at optimization! (/s)

Therefore, I think that any measure of optimization should be relative to some sort of counterfactual prior on the outcomes - which might be, e.g. what would happen if you had no optimizer, or based on some predefined distribution over what could be in place of the optimizer in question. I think you should carefully keep track of what counterfactual prior you are measuring relative to.

My intuition here is apparently the opposite of Garrett Baker's. But, I have two things to say on this:

  • depending on the situation, you might be able to have some kind of "objective" counterfactual prior - like maybe you know that all outcomes are equally possible and can use the uniform prior.
  • I really think you can't do without this, and if you try doing without this, you're just basically going to end up with something equivalent to some choice of counterfactual prior, while pretending it is "objective" - like Bayes vs. frequentism.

If you also want the definition to be based on utility rather than on ordinal rankings, I don't think it makes sense to measure it in bits. If you want it to take into account utility, you're going to want some kind of smooth relation to the utility numbers which is in general going to be incompatible with the bits idea because your distributions of outcomes in utility could have any sort of weird shape.

So, I suggest, if you do want a utility-based definition, to just throw out the bits idea entirely and just use utility and a counterfactual prior and make it linear in utility.


A couple of suggestions for measuring optimization relative to a utility function and a counterfactual prior:

  1. The number of standard deviations of improvement in utility obtained relative to the counterfactual prior. This might be a more useful measure for "weak" optimizers where the whole distribution is relevant. Requires that the standard deviation of u under the counterfactual prior is well defined.
  2. If u is bounded above, you can have an optimization measure linear in utility that maxes at 1 if you obtain maximum utility (and optimization is 0 for the counterfactual prior of course). This might be a more relevant measure for "strong" optimizers that we expect to get close to maximum utility, but requires that u is bounded above, and that the counterfactual prior obtains a well-defined expected utility (i.e. not minus infinity).
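A sketch of both suggestions as functions; the interfaces are my own guesses, since the comment doesn't specify one (suggestion 1 takes samples of utility under the counterfactual prior, suggestion 2 takes its expected utility and the maximum attainable utility):

```python
import statistics

def op_in_std_devs(achieved_eu, prior_utility_samples):
    """Suggestion 1: improvement over the counterfactual prior, in standard deviations."""
    mu = statistics.mean(prior_utility_samples)
    sigma = statistics.stdev(prior_utility_samples)
    return (achieved_eu - mu) / sigma

def op_fraction_of_max(achieved_eu, prior_expected_u, u_max):
    """Suggestion 2: linear in utility, 0 at the counterfactual prior's expectation, 1 at the maximum."""
    return (achieved_eu - prior_expected_u) / (u_max - prior_expected_u)

print(op_in_std_devs(5.0, [1.0, 2.0, 3.0, 4.0]))  # ≈ 1.94 standard deviations of improvement
print(op_fraction_of_max(5.0, 2.5, 10.0))         # ≈ 0.33 of the way from prior EU to max utility
```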

What about defining the optimisation power of $\omega$ in terms of the outcomes that have utility at least as great as the utility of $\omega$?

Let $S_u(\omega)$ be the set of outcomes with utility at least $u(\omega)$ according to the utility function $u$:

$$S_u(\omega) = \{\omega' \in \Omega : u(\omega') \geq u(\omega)\}.$$

The set $S_u(\omega)$ is invariant under translation and positive rescaling of the utility function $u$, and we define the optimisation power of the outcome $\omega$ according to the utility function $u$ as:

$$OP_u(\omega) = -\log_2 P_\emptyset\left(S_u(\omega)\right).$$

This does not suffer from comparing w.r.t. a worst case and seems to satisfy the same intuition as the original OP definition while referring to some utility function.

This is in fact the same measure as the original optimisation power measure, with the order given by the utility function.

Yeah this is the expectation of the Yudkowsky measure I think?

Right, I got confused because I thought your problem was about trying to define a measure of optimisation power - for example analogous to the Yudkowsky measure - that also refers to a utility function while being invariant under scaling and translation, but this is different from asking

"what fraction of the default expected utility comes from outcomes at least as good as this one?’"