I was actually going to write something like this up, but you beat me to it! My idea was pretty similar. The main difference is that in my setting, a utility function only cares about outcomes (rather than intermediate events such as ). Here's most of what I had written so far (feel free to just skim it, given that it's mostly the same thing):
I am interested in the question of when we can substitute changes in beliefs for changes in utility function and vice versa.
Here is a simple setting of utility maximization. There are possible worlds, 2 actions, and outcomes. An agent has some prior over possible worlds, and some utility function over outcomes. The agent can be exposed to different independent experiments. In an experiment, the agent does not know the possible world , but indirectly observes and thereby sees an observation with probability . Then the agent chooses action . Subsequently, some outcome will occur (stochastically) based on the possible world and action, and the agent values this outcome according to .
Let be the "transition difference matrix" defined as It is not unreasonable to believe that can be determined a priori: if possible worlds are (say) probabilistic Turing machines accepting an action as input and returning an outcome as output, then indeed is known a priori.
Let stand for the elementwise product between vectors and , and let stand for the elementwise quotient. Also, define the likelihood to be a vector with . The agent's posterior distribution over the possible world after seeing evidence will then be proportional to , using Bayes' rule. We can compute the expected utility difference between actions in an experiment as
The agent's behavior across experiments is entirely determined by the function . Let us rewrite this as: So in fact, the behavior is determined by the vector . This vector has entries, and entry is equal to the prior probability of possible world times the expected utility difference (between actions 1 and 0) if we are in possible world . We could call this vector the policy of .
Now we can ask two interesting questions:
For question 1:
We can find satisfying this if and only if assigns probability 0 to any outcome that does and is right-invertible. That is, must not have a smaller support than , and we must be able to span with the by-action differences in outcome distributions for each possible world. I expect that will be right-invertible in most practical cases (there will be more possible worlds than outcomes).
For question 2: this will not be true in general. If for some possible world , has a different sign from (that is, and recommend different actions in possible world ), then the policies must be different, unless . So we are not able to replace changes in utility function with changes in beliefs, in general. There is a way to do this by making some possible worlds observationally equivalent, but I haven't worked through all the details.
A putative new idea for AI control; index here.
This post is a synthesis of some of the ideas from utility indifference and false miracles, in an easier-to-follow format that illustrates better what's going on.
Utility scaling
Suppose you have an AI with a utility u and a probability estimate P. There is a certain event X which the AI cannot affect. You wish to change the AI's estimate of the probability of X by, say, doubling the odds ratio P(X):P(¬X). However, since it is dangerous to give an AI false beliefs (they may not be stable, for one), you instead want to make the AI behave as if it were a u-maximiser with doubled odds ratio.
Assume that the AI is currently deciding between two actions, α and ω. The expected utility of action α decomposes as:
u(α) = P(X)u(α|X) + P(¬X)u(α|¬X).
The utility of action ω is defined similarly, and the expected gain (or loss) of utility by choosing α over ω is:
u(α)-u(ω) = P(X)(u(α|X)-u(ω|X)) + P(¬X)(u(α|¬X)-u(ω|¬X)).
If we were to double the odds ratio, the expected utility gain becomes:
u(α)-u(ω) = (2P(X)(u(α|X)-u(ω|X)) + P(¬X)(u(α|¬X)-u(ω|¬X)))/Ω, (1)
for some normalisation constant Ω = 2P(X)+P(¬X), independent of α and ω.
We can reproduce exactly the same effect by instead replacing u with u', such that
u'( |X)=2u( |X)
u'( |¬X)=u( |¬X)
Then:
u'(α)-u'(ω) = P(X)(u'(α|X)-u'(ω|X)) + P(¬X)(u'(α|¬X)-u'(ω|¬X)),
= 2P(X)(u(α|X)-u(ω|X)) + P(¬X)(u(α|¬X)-u(ω|¬X)). (2)
This is equation (1) multiplied by the positive constant Ω, which does not change which action is preferred. Thus we can accomplish, via utility manipulation, exactly the same effect on the AI's behaviour as by changing its probability estimates.
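As a quick numerical check of (1) and (2), here is a minimal Python sketch. The function names and the numbers are illustrative assumptions, not part of the original argument; `gain_x` stands for u(α|X)-u(ω|X) and `gain_not_x` for u(α|¬X)-u(ω|¬X).

```python
import math

def gain_with_scaled_odds(p_x, gain_x, gain_not_x, odds_factor=2.0):
    """Equation (1): expected gain of alpha over omega after multiplying
    the odds ratio P(X):P(not X) by odds_factor, with u unchanged."""
    omega = odds_factor * p_x + (1.0 - p_x)  # normalisation constant
    return (odds_factor * p_x * gain_x + (1.0 - p_x) * gain_not_x) / omega

def gain_with_scaled_utility(p_x, gain_x, gain_not_x, scale=2.0):
    """Equation (2): expected gain under the original P, with u' = scale * u
    in X worlds and u' = u in not-X worlds."""
    return p_x * (scale * gain_x) + (1.0 - p_x) * gain_not_x

# Hypothetical numbers, chosen so that doubling the odds flips the decision.
p_x, gain_x, gain_not_x = 0.25, 3.0, -1.5
original = p_x * gain_x + (1.0 - p_x) * gain_not_x
via_beliefs = gain_with_scaled_odds(p_x, gain_x, gain_not_x)
via_utility = gain_with_scaled_utility(p_x, gain_x, gain_not_x)
omega = 2.0 * p_x + (1.0 - p_x)

# The two gains differ only by the positive factor omega, so they agree in sign.
print(original < 0, via_beliefs > 0, via_utility > 0)
print(math.isclose(via_utility, omega * via_beliefs))
```

With these numbers the original agent prefers ω, while both the agent with doubled odds and the agent with rescaled utility prefer α.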
Notice that we could also have defined
u'( |X)=u( |X)
u'( |¬X)=(1/2)u( |¬X)
This is just the same u', scaled.
The utility indifference and false miracles approaches were just special cases of this, where the odds ratio was sent to infinity/zero by multiplying by zero. But the general result is that one can start with an AI with utility/probability estimate pair (u,P) and map it to an AI with pair (u',P) which behaves similarly to (u,P'). Changes in probability can be replicated as changes in utility.
Utility translating
In the previous section, we multiplied certain utilities by two. In doing so, we implicitly used the zero point of u. But utility is invariant under translation, so this zero point is not actually anything significant.
It turns out that we don't need to care about this: any zero will do. What matters is simply that the spread between options is doubled in the X world but not in the ¬X one.
But that relies on the AI being unable to affect the probability of X and ¬X itself. If the AI has an action that will increase (or decrease) P(X), then it becomes very important where we set the zero before multiplying. Setting the zero in a different place is isomorphic with adding a constant to the X world and not the ¬X world (or vice versa). Obviously this will greatly affect the AI's preferences between X and ¬X.
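The role of the zero point can be illustrated with a small Python sketch. Everything here is an assumption made for illustration: the helper `scaled_expected_utility`, the parameter `zero_shift` (a constant added to u before the doubling in X worlds), and all the numbers.

```python
import math

def scaled_expected_utility(p_x, u_x, u_not_x, zero_shift=0.0, scale=2.0):
    """Expected u' of one action, where u is first translated by zero_shift
    and then multiplied by scale in X worlds only."""
    return p_x * scale * (u_x + zero_shift) + (1.0 - p_x) * (u_not_x + zero_shift)

def gain(alpha, omega, zero_shift):
    """Expected u' gain of alpha over omega; each action is (p_x, u_x, u_not_x)."""
    return (scaled_expected_utility(*alpha, zero_shift=zero_shift)
            - scaled_expected_utility(*omega, zero_shift=zero_shift))

# Case 1: the AI cannot affect X (both actions have P(X) = 0.5).
fixed_alpha, fixed_omega = (0.5, 1.0, 0.0), (0.5, 0.0, 1.5)
fixed_gain_zero = gain(fixed_alpha, fixed_omega, zero_shift=0.0)
fixed_gain_shifted = gain(fixed_alpha, fixed_omega, zero_shift=10.0)

# Case 2: alpha raises P(X) to 0.8, omega leaves it at 0.2;
# the conditional utilities are the same for both actions.
moving_alpha, moving_omega = (0.8, 1.0, 3.0), (0.2, 1.0, 3.0)
moving_gain_zero = gain(moving_alpha, moving_omega, zero_shift=0.0)
moving_gain_shifted = gain(moving_alpha, moving_omega, zero_shift=5.0)

print(math.isclose(fixed_gain_zero, fixed_gain_shifted))  # zero point irrelevant
print(moving_gain_zero < 0 < moving_gain_shifted)         # zero point flips the choice
```

In the second case the shift adds zero_shift × (P(X|α) - P(X|ω)) to the gain, which is why moving the zero is enough to reverse the preference in this toy setup.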
One way of avoiding the AI affecting X is to set this constant so that u'(X)=u'(¬X), in expectation. Then the AI has no preferences between the two situations, and will not seek to boost one over the other. However, note that u(X) is an expected utility calculation. Therefore:
Choosing the constant so that u'(X)=u'(¬X) requires accessing the AI's probability estimate P for various worlds; it cannot be done from outside, by multiplying the utility, as the previous approach could.
Even if u'(X)=u'(¬X), this does not mean that u'(X|Y)=u'(¬X|Y) for every event Y that could happen before X does. Simple example: X is a coin flip, and Y is the bet of someone on that coin flip, someone the AI doesn't like.
This explains all the complexity of the utility indifference approach, which is essentially trying to decompose possible universes (and adding constants to particular subsets of universes) to ensure that u'(X|Y)=u'(¬X|Y) for any Y that could happen before X does.
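The coin-flip example can be made concrete with a toy Python sketch. The payoffs are assumptions chosen for illustration: X is "the coin lands heads", Y is "the disliked person bets on heads", the bet is taken to be equally likely to be on either side, and the AI's utility is simply minus the bettor's winnings.

```python
# Hypothetical payoffs: the AI gets -1 if the disliked bettor wins, +1 if they lose.
def u_prime(coin, bet):
    return -1.0 if coin == bet else 1.0

# Assumed prior over the bet, before the AI has observed it.
P_BET = {"heads": 0.5, "tails": 0.5}

def expected_u_given_coin(coin):
    """Expected u' given the coin result, averaging over the unobserved bet."""
    return sum(p * u_prime(coin, bet) for bet, p in P_BET.items())

u_x = expected_u_given_coin("heads")          # u'(X)
u_not_x = expected_u_given_coin("tails")      # u'(not X)
u_x_given_y = u_prime("heads", "heads")       # u'(X | Y): bettor wins
u_not_x_given_y = u_prime("tails", "heads")   # u'(not X | Y): bettor loses

print(u_x == u_not_x)                 # indifferent before seeing the bet
print(u_x_given_y < u_not_x_given_y)  # prefers not-X once the bet is known
```

Before the bet is observed the AI is indifferent between X and ¬X, but once Y has happened it strictly prefers ¬X, so equalising u'(X) and u'(¬X) in expectation is not enough on its own.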