A putative new idea for AI control; index here.

Many designs for creating AGIs (such as OpenCog) rely on the AGI deducing moral values as it develops. This is a form of value loading (or value learning), in which the AGI updates its values through various methods, generally including feedback from trusted human sources. It is closely analogous to how human infants (approximately) integrate the values of their society.

The great challenge of this approach is that it relies upon an AGI that already has an interim system of values being both able and willing to update that system correctly. Generally speaking, humans are unwilling to update their values easily, and we would want our AGIs to be similar: values that are too unstable aren't values at all.

So the aim is to clearly separate the conditions under which the AGI should keep its values stable from the conditions under which it should allow them to vary. This will generally be done by specifying criteria for the variation ("only when talking with Mr and Mrs Programmer"). But, as always with AGIs, unless we program those criteria perfectly (hint: we won't), the AGI will be motivated to interpret them differently from how we would expect. It will, as a natural consequence of its programming, attempt to manipulate the value-updating rules according to its current values.

How could it do that? A very powerful AGI could take the time-honoured route of "take control of your reward channel", either by threatening humans into giving it the moral answers it wants, or by replacing humans with "humans" (constructs that pass the programmed requirements for being human, according to the AGI's programming, but aren't actually human in practice) willing to give it those answers. A weaker AGI could instead use social manipulation and leading questions to achieve the morality it desires. Even more subtly, it could tweak its internal architecture and updating process so that it updates values in its preferred direction (even something as simple as choosing the order in which to process evidence). This will be hard to detect, as a smart AGI may have a much clearer picture of how its updating process will play out in practice than its programmers do.

The problems with value loading have been cast as the various "Cake or Death" problems. We have some idea of what criteria we need for safe value loading, but as yet we have no candidate for such a system. This post will attempt to construct one.

 

Changing actions and changing values

Imagine you're an effective altruist. You donate £10 a day to whatever the top charity on Giving What We Can is (currently Against Malaria Foundation). I want to convince you to donate to Oxfam, say.

"Well," you say, "if you take over and donate £10 to AMF in my place, I'd be perfectly willing to send my donation to Oxfam instead."

"Hum," I say, because I'm a hummer. "A donation to Oxfam isn't completely worthless to you, is it? How would you value it, compared with AMF?"

"At about a tenth."

"So, if I instead donated £9 to AMF, you should be willing to switch your £10 donations to Oxfam (giving you the equivalent value of £1 to AMF), and that would be equally good as the status quo?"

Similarly, if I want to make you change jobs, I should pay you, not the value of your old job, but the difference in value between your old job and your new one (monetary value plus all other benefits). This is the point at which you are indifferent to switching or not.
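A minimal numerical sketch of that indifference price, using the charity figures from the dialogue above (the function name is purely illustrative):

```python
# Compensation that makes you indifferent to a change, measured entirely in your
# *current* values (here: "pounds donated to AMF" equivalents).
def indifference_price(value_if_unchanged, value_if_changed):
    return value_if_unchanged - value_if_changed

status_quo   = 10 * 1.0   # your £10 to AMF, valued by your current values
after_switch = 10 * 0.1   # your £10 to Oxfam instead, valued at a tenth
print(indifference_price(status_quo, after_switch))   # 9.0 -> I donate £9 to AMF
```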

Now imagine it were practically possible to change people's values. What is the price at which a consequentialist would allow their values to be changed? It's the same argument: the price at which Gandhi should accept becoming a mass murderer is the difference (according to all of Gandhi's current values) between the expected effects of current-Gandhi and the expected effects of murderer-Gandhi. At that price, he has lost nothing (and gained nothing) by going through with the deal.

Indifference is key. We want the AGI to be motivated neither to preserve its previous values, nor to change them. It's obvious why we wouldn't want the AGI to insist on keeping its values, but the other direction is less obvious - shouldn't the AGI want to make moral progress, to seek out better values?

The problem is that having an AGI that strongly desires to improve its values is a danger - we don't know how it will go about doing so, what it will see as the most efficient way of doing so, or what the long-term effects might be (various forms of wireheading may be a danger). To mitigate this risk, it's better to have very close control over how the AGI desires such improvement. And the best way of doing this is to have the AGI indifferent to value change, and to have a separate (possibly tunable) module that regulates any positive desire towards value improvements. This gives us a much better understanding of how the AGI could behave in this regard.

So in effect we are seeking to have AGIs that apply "conservation of expected evidence" to their values - it does not benefit them to try and manipulate their values in any way. See this post for further thoughts on the matter.

 

Pay and be paid: the price of value change

The above gives an effective model for value-change indifference. It's even easier with utility-function-based agents that we get to design: instead of paying them with money or changes in the world, we can pay them with utility. So, if we want to shift an agent from utility "v" to utility "w", it has to gain the expected difference (according to its current utility function, i.e. v) between it being a v-maximiser and it being a w-maximiser.

So we can define a meta-utility function U, consisting of a current utility function (which the agent uses to make decisions) along with a collection of constant terms. Every time an agent changes their current utility function, a new constant term is added to undo the expected effect of the change. So for instance, if an agent hears evidence that causes it to update its current utility function from v to w, then its meta-utility U changes as:

U = v + (Past Constants)   →

U = w + E(v|v→v) - E(v|v→w) + (Past Constants).

Here (Past Constants) are previous constant terms dating from previous changes of utility, v→w denotes the change of utility function v into utility function w, and v→v denotes the counterfactual where v was left unchanged. I generally prefer to define counterfactuals, when I can, by taking a stochastic process that almost always has one outcome: i.e. a process that keeps v constant with probability 1/10^100 and otherwise takes v to w. That way, conditioning on v→v is a perfectly reasonable thing to do, but v→w is the only thing that happens in practice. This formula requires that the agent assess its own future effectiveness at accomplishing certain goals, given that it has them, so it is vulnerable to the usual Löbian problems.

This formula is still imperfect. A clue is that it isn't symmetric in v and w. The problem can be seen more clearly if w=v-10, for example. Then both expected terms are equal (as changing from v to v-10 has no effect on the agent's behaviour), so the agent would derive no benefit from the change, but would end up with a penalty of -10. Hence it would not want to change, which destroys the whole point. Or conversely, if w=v+10, it would desperately want to change, and would sacrifice up to 10 of value to do so.

The problem is that our initial model is incorrect. We don't want a v-maximiser that is ok changing utilities to w. We want a U-maximiser that is ok changing its current v to w. To achieve that, we need to tweak the formula slightly: define the agent as a U-maximiser and replace the change-of-utility formula with:

U = v + (Past Constants)   →

U = w + E(v|v→v) - E(w|v→w) + (Past Constants).

Here we subtract the expected value of w (given v→w) rather than the expected value of v (note that this is similar to the formula presented here, if we assume the evidence is moral evidence rather than world knowledge).

Then under this new formula, the agent is properly indifferent to changes of the type v→v±10. Indeed, it will behave exactly as a perfect v-maximiser until such time as it transitions to w, whereupon it will behave as a perfect w-maximiser.
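As a minimal sketch of the difference between the two formulas, here is a toy model in which "expected value" is just the utility of the single action the agent would end up taking; the action names and numbers are invented for illustration:

```python
actions = ["a", "b"]
v = {"a": 3.0, "b": 1.0}                          # current utility function
w = {act: val + 10.0 for act, val in v.items()}   # shifted utility: w = v + 10

def best(utility):
    """The action a maximiser of `utility` would take."""
    return max(actions, key=lambda act: utility[act])

E_v_given_stay   = v[best(v)]   # E(v|v→v): value of v if the agent stays a v-maximiser
E_v_given_switch = v[best(w)]   # E(v|v→w): value of v if it becomes a w-maximiser
E_w_given_switch = w[best(w)]   # E(w|v→w): value of w if it becomes a w-maximiser

naive_constant     = E_v_given_stay - E_v_given_switch   # first (flawed) formula
corrected_constant = E_v_given_stay - E_w_given_switch   # corrected formula

U_before          = E_v_given_stay                          # 3.0
U_after_naive     = E_w_given_switch + naive_constant       # 13.0: it would want to switch
U_after_corrected = E_w_given_switch + corrected_constant   # 3.0: indifferent, as desired
print(U_before, U_after_naive, U_after_corrected)
```

Under the first formula the agent pockets the full +10 from switching to w = v+10; under the corrected formula its U is unchanged, so it has no stake in whether the switch happens.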

This can be seen in the following situations (a small numerical sketch follows the list):

  • Suppose the agent can take immediate action to increase the value of v, from 0 to a, or delay for a while and increase the value of v from 0 to b later (no other options are possible). However, in the interim period, its utility will have changed to w (and it knows this), so will no longer be a v-maximiser. What will it do?

Answer: Just as a standard v-maximiser would: if a>b, it acts immediately, and if b>a, it delays (and ultimately does nothing, because it will be a w-maximiser when the opportunity arises). To see this, note that if it acts immediately, then at the values transition it will have E(v|v→v)=a (because it will have acted, and no further increases are possible). If it delays, then at the values transition it will have E(v|v→v)=b (because if it stayed a v-maximiser, it would act to achieve b in the future). So it will choose depending on which of a or b is higher (the effects on w can be ignored; see the next questions).

  • Suppose the agent is currently a v-maximiser, but can take immediate actions to change the value of w (none of these actions changes the value of v). It knows it will soon be a w-maximising agent. What will it do?

Answer: It will be indifferent between all these actions. Suppose it takes actions that increase the utility w by a. Then when it transitions to its new values, w will be worth an extra a. But E(w|v→w) will also go up by a. Since that term is being subtracted, the increase of a is cancelled out and has no effect. So a U-maximising agent that is currently using v will not choose behaviours that increase any of the future utilities it will be using.

  • Suppose the agent is currently a v-maximiser, but will soon transition to being either a w or w' maximiser. It can manipulate its environment to ensure it becomes one or the other. What will it do?

Answer: It will be indifferent between the two options, and will continue v-maximising. At the moment of transition, the E(v|v→v) term does not contain either w or w'. As for the other terms, it has the opportunity to transition to utility w-E(w|v→w) or w'-E(w|v→w'). But notice that

E(w-E(w|v→w)|it is now a w-maximising agent)

= E(w-E(w|v→w)|v→w)

= E(w|v→w)-E(w|v→w)

= 0.

Thus the expected utility increase from transitioning to w is zero. Similarly for w'. Thus the agent has no preferences between transitioning to w or to w'. Note that this formula implies that if the agent's initial utility function is u (initially, U=u), then E(U)=E(u), as each utility function change adds zero expected increase. This could be seen as the defining property of this approach: the agent cannot generate extra utility by influencing its own values.
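A minimal simulation of the second and third scenarios (a sketch only; the option names, utilities and numbers are all invented for illustration). Whatever the current v-maximiser does to boost or select the utility function it will transition to, its expected U is unchanged:

```python
# Stage-1 options for the current v-maximiser. By assumption they don't change v at
# all (as in the scenarios above), so any preference between them could only come
# from manipulating the future utility function.
#   "boost_w" : raises what the future w-maximiser will achieve by +5
#   "pick_w"  : ensures the agent transitions to w
#   "pick_w2" : ensures the agent transitions to w'
options = {
    "boost_w": ("w",  5.0),
    "pick_w":  ("w",  0.0),
    "pick_w2": ("w2", 0.0),
}

base_value  = {"w": 7.0, "w2": 2.0}   # what each future maximiser achieves unaided
E_v_if_stay = 4.0                     # E(v|v→v): what staying a v-maximiser would get

for option, (future_u, boost) in options.items():
    achieved = base_value[future_u] + boost   # E(future_u | v→future_u)
    constant = E_v_if_stay - achieved         # compensation added at the transition
    total_U  = achieved + constant            # U from the transition onward
    print(option, total_U)                    # 4.0 in every case: no option is favoured
```

The compensation constant always cancels whatever the agent arranged for its future self, which is just the E(U)=E(u) property described above.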

For all these reasons, this formula seems to give a proper model of an agent that acts only on its current utility function, but has no objections to transitioning to a new one.

Note that the formula is silent on when and how utility functions should be changed. So it won't accomplish miracles: if the agent has stupid criteria for changing its utility function, it will change it in a stupid way. All that the formula accomplishes is that the agent will not attempt to manipulate the criteria by which its utility function changes.

 

A last wrinkle...

EDIT: Paul Christiano has convinced me that the ultra-sophisticated Cake or Death problem has been solved by the above method. Yay! There may still be an issue as to what order to update in if a piece of information has both empirical and values implications. It feels like the order shouldn't matter, but I'm not totally sure.

That's the current setup for value loading. What are people's thoughts on it?

Comments (51)

I very much liked the analogy to Conservation of Expected Evidence---it strikes me as potentially deep; we want a system where EU and VoI are balanced such that it will never 'try' to end up with particular values, just as a Bayesian will never try to find evidence pointing in a particular direction.

I'm not sure who originally coined the phrase "Cake or Death problem" but I have a suspicion that it was me, and if so, I strongly suspect that I said it at a workshop and that I did it just to give the problem a 5-second handle that would work for a workshop. It's possible we should rename it before anyone publishes a paper.

coined the phrase "Cake or Death problem" but I have a suspicion that it was me

Alas, I am the guilty party - I came up with it during an old FHI discussion on value loading.

Huh. Okay, I'm pretty sure I have an actual memory of inventing this, though I was hesitant about saying so. But I also remember inventing it at a fairly recent MIRI workshop, while the term clearly dates back to at least 2012. Maybe I saw and subconsciously remembered, or maybe "Cake or Death" is just the obvious thing to call utility alternatives.

"Cake or Death" was part of an Eddie Izzard joke from 1998-- I think it has achieved some kind of widespread memetic success, though, since I've seen it in quite a few places since.

You probably do have a memory, it's just false. Human brains do that.

The "Cake or Death" expression has strong Portal connotations for me :-)

Cake or death appears to be a minor meme: http://en.wikipedia.org/wiki/Cake_or_Death or google.

Yes. I am suggesting that it seems quite probable to me that I am the one who took that Eddie Izzard routine, which I have seen, and turned it into the name of value learning problems.

[anonymous]

Could you explain this a bit more deeply? I get the feeling I'm missing something as I try to pop myself out of Human Mode and put myself in Math Agent Mode.

To my mind, evidence to a value learner is still evidence, value learning is an epistemic procedure. Just as we don't optimize the experimental data to confirm a hypothesis in science... we don't optimize the value-learning data to support a certain utility function, it's irrational. It's logical uncertainty to consider hypothetical scenarios: the true data on which I'm supposed to condition is the unoptimized data, so any action I take to alter the value-learning input data is in fact destroying information about what I'm supposed to value, it's increasing my own logical uncertainty.

The intuition here is almost frequentist: my mind tells me that there exists a True Utility Function which is Out There, and my value-learning evidence constitutes data to learn what that function is, and if it's my utility function, then my expected utility after learning always has a higher value than before learning, because learning reduces my uncertainty.

EDIT: Ok, I think this post makes more sense if I assume I'm thinking about an agent that doesn't employ any form of logical uncertainty, and therefore has no epistemic access to the distinction between truth and self-delusion. Since the AI's actions can't cause its epistemic beliefs to fail to converge but can cause its moral beliefs to fail to converge (because morality is not ontologically fundamental), there is therefore a problem of making sure it doesn't optimize its moral data away based on an incompletely learned utility function.

The problem is when you want to work with a young AI where the condition on which the utility function depends lies in the young AI's decision-theoretic future. I.e. the AI is supposed to update on the value of an input field controlled by the programmers, but this input field (or even abstractions behind it like "the programmers' current intentions", should the AI already be mature enough to understand that) are things which can be affected by the AI. If the AI is not already very sophisticated, like more sophisticated than anyone presently has any good idea how to formally talk about, then in the process of building it, we'll want to do "error correction" type things that the AI should accept even though we can't yet state formally how they're info about an event outside of the programmers and AI which neither can affect.

Roughly, the answer is: "That True Utility Function thing only works if the AI doesn't think anything it can do affects the thing you defined as the True Utility Function. Defining something like that safely would represent a very advanced stage of maturity in the AI. For a young AI it's much easier to talk about the value of an input field. Then we don't want the AI trying to affect this input field. Armstrong's trick is trying to make the AI with an easily describable input field have some of the same desirable properties as a much-harder-to-describe-at-our-present-stage-of-knowledge AI that has the true, safe, non-perversely-instantiable definition of how to learn about the True Utility Function."

[anonymous]

Right, ok, that's actually substantially clearer after a night's sleep.

One more question, semi-relevant: how is the decision-theoretic future different from the actual future?

The actual future is your causal future, your future light cone. Your decision-theoretic future is anything that logically depends on the output of your decision function.

This seems like a very useful idea - thanks!

An interesting thing to note is that, intuitively, it feels like a lot of humans are indifferent about many of their own values. E.g. people know that their values will probably change as they age, and many (though definitely not all) people just kind of shrug their shoulders about this. They know that their values at age 80 are likely to be quite different than their values at age 18, but their reaction to this thought is neither "yay changing values" nor "oh no how can I prevent this", but rather a "well, that's just the way it is".

Jiro

It depends on what you mean by a change in values. Someone might know, for instance, that when they get older they would be more inclined to be conservative, but that that's because they act for their own benefit and being conservative is more beneficial to an older person (because it's pro-wealthy and pro-family and because he'll have a family and more money then). The average man on the street would say that the older person has "changed his values to become more conservative". But he really hasn't changed his intrinsic values, he's just changed his instrumental values, because those depend on circumstances.

[anonymous]

That depends on which of your evaluative judgements we consider to have what levels of noise. If we take a more sophisticated psychological view, it's actually very well-founded that at least some preferences are formed by our biology, and some others are formed by our early-life experiences, and then others are formed by experiences laid down when we've already got the foundations of a personality, and so on. And the "lower layers" are much less prone to change, or at least, to noisy change, to change without some particular life-event behind it.

Problem: not only will such an AI not resist its utility function being altered by you, it will also not resist its utility function being altered by a saboteur or by accident. I don't think I'd want to call this proposal a form of value learning, since it does not involve the AI trying to learn values, and instead just makes the AI hold still while values are force-fed to it.

The AI will not resist its values being changed in the particular way that is specified to trigger a U transition. It will resist other changes of value.

That's true; it will resist changes to its "outer" utility function U. But it won't resist changes to its "inner" utility function v, which still leaves a lot of flexibility, even though that isn't its true utility function in the VNM sense. That restriction isn't strong enough to avoid the problem I pointed out above.

I will only allow v to change if that change will trigger the "U adaptation" (the adding and subtracting of constants). You have to specify what processes count as U adaptations (certain types of conversations with certain people, eg) and what doesn't.

Oh, I see. So the AI simply losing the memory that v was stored in and replacing it with random noise shouldn't count as something it will be indifferent about? How would you formalize this such that arbitrary changes to v don't trigger the indifference?

By specifying what counts as an allowed change in U, and making the agent into a U-maximiser. Then, just as standard maximisers defend their utilities, it should defend U (not counting the allowed update, and only that update).

[anonymous]

I think there is a genuine problem here... the AI imposes no obstacle to "trusted programmers" changing its utility function. But apart from the human difficulties (the programmers could be corrupted by power, make mistakes etc.) what stops the AI manipulating the programmers into changing its utility function e.g. changing a hard to satisfy v into some w which is very easy to satisfy, and gives it a very high score?

[This comment is no longer endorsed by its author]
[anonymous]

You can't always solve human problems with AI design.

I'm not sure what you mean. The problem I was complaining about is an AI design problem, not a human problem.

[anonymous]

No, I would say that if you start entering false utility data into the AI and it believes you, because after all it was programmed to be indifferent to new utility data, that's your problem.

If the AI's utility function changes randomly for no apparent reason because the AI has literally zero incentive to make sure that doesn't happen, then you have an AI design problem.

[anonymous]

It didn't change for no reason. It changed because someone fed new data into the AI's utility-learning algorithm which made it change. Don't give people root access if you don't want them using it!

Being changed by an attacker is only one of the scenarios I was suggesting. And even then, presumably you would want the AI to help prevent them from hacking its utility function if they aren't supposed to have root access, but it won't.

Anyway, that problem is just a little bit stupid. But you can also get really stupid problems, like the AI wants more memory, so it replaces its utility function with something more compressible so that it can scavenge from the memory where its utility function was stored.

Nice post! I hadn't read the previous posts on Cake or Death-like problems, but your post persuaded me to read them.

Enforcing indifference between choices to prevent abuse is a beautiful idea. You remark that as a consequence of this indifference the expected change in utility for changing from utility function 'v' to function 'w' is equal to 0, but isn't the converse also true (i.e. the desired update formula follows from the conservation of expected utility)? Above you justify/motivate the change in the update formula (marked in bold in your post) to E(w|v->w) on grounds of symmetry, but doesn't it simply follow from demanding that changing the utility function itself should not change the (average) utility? This sounds like a direct formalisation of demanding that our agent is indifferent with respect to its own utility function.

I think this is correct. When I write the principle up properly, I'll probably be emphasising that issue more.

Really interesting, but I'm a bit confused about something. Unless I misunderstand, you're claiming this has the property of conservation of moral evidence... But near as I can tell, it doesn't.

Conservation of moral evidence would imply that if it expected that tomorrow it would transition from v to w, then right now it would be acting on w rather than v (except for being indifferent as to whether or not it actually transitions to w). But what you have here would, if I understood correctly, act on v until the moment it transitions to w, even though it knew in advance it was going to transition to w.

Indeed! An ideal moral reasoner could not predict the changes to their moral system.

I couldn't guarantee that, but instead I got a weaker condition: an agent that didn't care about the changes to their moral system.

Ah, alright.

Actually, come to think about it, even specifying the desired behavior would be tricky. Like if the agent assigned a probability of 1/2 to the proposition that tomorrow they'd transition from v to w, or some other form of mixed hypothesis re possible future transitions, what rules should an ideal moral-learning reasoner follow today?

I'm not even sure what it should be doing. Mix over normalized versions of v and w? What if at least one is unbounded? Yeah, on reflection, I'm not sure what the Right Way for a "conserves expected moral evidence" agent is. There're some special cases that seem to be well specified, but I'm not sure how I'd want it to behave in the general case.

Is there a difference between a value loading agent that changes its utility function based on evidence, and a UDT agent with a fixed utility function over all possible universes, which represents different preferences for different universes (e.g., it prefers more cakes in the universe where the programmer says "I want cake" and more deaths in the universe where the programmer says "I want death")?

Every value loading agent I've considered (that pass the naive cake-or-death problem, at least) can be considered equivalent to a UDT agent.

I'm just not sure it's a useful way of thinking about it, because the properties that we want - "conservation of moral evidence" and "don't manipulate your own moral changes" - are not natural UDT properties, but dependent on a particular way of conceptualising a value loading agent. For instance, the kid that doesn't ask whether eating cookies is bad, has a sound formulation as a UDT agent, but this doesn't seem to capture what we want.

EDIT: this may be relevant http://lesswrong.com/r/discussion/lw/kdx/conservation_of_expected_moral_evidence_clarified/

It seems to me that there are natural ways to implement value loading as UDT agents, with the properties you're looking for. For example, if the agent values eating cookies in universes where its creator wants it to eat cookies, and values not eating cookies in universes where its creator doesn't want it to eat cookies (glossing over how to define "creator wants" for now), then I don't see any problems with the agent manipulating its own moral changes or avoiding asking whether eating cookies is bad. So I'm not seeing the motivation for coming up with another decision theory framework here...

It seems like the natural way to address value learning is to have beliefs about what is really valuable, e.g. by having some distribution over normalized utility functions and maximizing E[U] over both empirical and moral uncertainty.

In that case we are literally incapable of distorting results (just like we are incapable of changing physical facts by managing the news), and we will reason about VOI in the correct way. I have never understood what about the Bayesian approach was unsuitable. Of course it has many of its own difficulties, but I don't think you've resolved any of them. Instead you get a whole heap of extra problems from giving up on a principled and well-understood approach to learning and replacing it with something ad hoc.

I'm also confused about what this agent actually does (but I might just be overlooking something).

You write "U = ..." a bunch of times, but then you talk about an agent whose utility is completely different from that, i.e. an agent that cares about "adding a constant" to the definition of U. That's obviously not what a U-maximizer would do. Instead the agent seems to have U = v + C, where C is a compensatory term defined abstractly as the sum of all future adjustments produced by the indifference formula.

I guess this C is defined with respect to the agent's current beliefs, conditioned on the events leading up to the compensation (defining it with respect to their beliefs at the time the compensation occurs seems unworkable). But at that point can't we just collapse the double expectations E[E[w|v-->w]] = E[w|v-->w]? And at that point we can just write the entire expression as E[v|v-->v-->v--->v...], which seems both more correct and much simpler.

Moreover, I don't yet see why to bother with all of this machinery. We have some AI whose values might change in some way. You seem to be saying "just give the AI a prior probability of 99.99999% that each change won't actually happen, even though they really will." As far as I can tell, all of the intuitive objections against this kind of wildly false belief also apply to this kind of surgically modified values (e.g. the AI will still make all of the same implausible inferences from its implausible premise).

One difference is that this approach only requires being able to surgically alter utility functions rather than beliefs. But you need to be able to specify the events you care about in your AI's model of the world, and at that point it seems like those two operations are totally equivalent.

It seems like the natural way to address value learning is to have beliefs about what is really valuable, e.g. by having some distribution over normalized utility functions and maximizing E[U] over both empirical and moral uncertainty.

This can go disastrously wrong. We lack a good update rule for moral uncertainty. Suppose the rule is "X is bad iff a human says it's bad". Then killing all humans prevents the AI from ever concluding X is bad, which might be something it desires. See sophisticated cake or death for another view of the problem: http://lesswrong.com/lw/f3v/cake_or_death/

In that case we are literally incapable of distorting results (just like we are incapable of changing physical facts by managing the news)

Moral facts are not physical facts. We want something like "X is bad if humans would have said X is bad, freely, unpressured and unmanipulated", but then we have to define "freely, unpressured and unmanipulated".

You seem to be saying "just give the AI a prior probability of 99.99999% that each change won't actually happen, even though they really will." As far as I can tell, all of the intuitive objections against this kind of wildly false belief also apply to this kind of surgically modified values (e.g. the AI will still make all of the same implausible inferences from its implausible premise).

It has no incorrect beliefs about the world. It is fully aware that the changes are likely to happen, but its meta-utility causes it to ignore this fact - it cannot gain anything by using its knowledge of that probability.

This can go disastrously wrong. We lack a good update rule for moral uncertainty. Suppose the rule is "X is bad iff a human says it's bad". Then killing all humans prevents the AI from ever concluding X is bad, which might be something it desires. See sophisticated cake or death for another view of the problem: http://lesswrong.com/lw/f3v/cake_or_death/

Again, assuming you can't make the inference from "X is bad if people say X is bad" and "people probably say X is bad" to "X is probably bad." But this is a very simple and important form of inference that almost all practical systems would make. I don't see why you would try to get rid of it!

Also, I agree we lack a good framework for preference learning. But I don't understand why that leads you to say "and so we should ignore the standard machinery for probabilistic reasoning," given that we also don't have any good framework for preference learning that works by ignoring probabilities.

Moral facts are not physical facts.

A Bayesian is incapable of distorting any facts by managing the news, except for facts which actually depend on the news.

We want something like "X is bad if humans would have said X is bad, freely, unpressured and unmanipulated", but then we have to define "freely, unpressured and unmanipulated".

The natural approach is to build a model where "humans don't want X" causes "humans say X is bad." In even a rudimentary model of this form (of the kind that we can build today), pressure or manipulation will then screen off the inference from human utterances to human preferences.

Is there any plausible approach to value learning that doesn't capture this kind of inference? I think this is one of the points where MIRI and the mainstream academic community are in agreement (though MIRI expects this will be really tough).

It has no incorrect beliefs about the world. It is fully aware that the changes are likely to happen, but its meta-utility causes it to ignore this fact - it cannot gain anything by using its knowledge of that probability.

I brought this up in the post on probability vs utility. So far you haven't pointed to any situation where these two possibilities do anything different. If they do the same thing, and one of them is easier to understand and has been discussed at some length, it seems like we should talk about the one that is easier to understand.

In even a rudimentary model of this form (of the kind that we can build today), pressure or manipulation will then screen off the inference from human utterances to human preferences.

This seems surprising to me, because I think a model that is able to determine the level of 'pressure' and 'manipulation' present in a situation is not rudimentary. That is, yes, if I have a model where "my preferences" have a causal arrow to "my utterances," and the system can recognize that it's intervening at "my utterances" then it can't infer readily about "my preferences." But deciding where an intervention is intervening in the graph may be difficult, especially when the thing being modeled is a person's mind.

Yes, we can't build models today that reliably make these kinds of inferences. But if we consider a model which is architecturally identical, yet improved far enough to make good predictions, it seems like it would be able to make this kind of inference.

As Stuart points out, the hard part is pointing to the part of the model that you want to access. But for that you don't have to define "freely, unpressured and unmanipulated." For example, it would be sufficient to describe any environment that is free of pressure, rather than defining pressure in a precise way.

This all looks clever, apart from the fact that the AI becomes completely indifferent to arbitrary changes in its value system. The way you describe it, the AI will happily and uncomplainingly accept a switch from a friendly v (such as promoting human survival, welfare and settlement of the Galaxy) to an almost arbitrary w (such as making paperclips), just by pushing the right "update" buttons. An immediate worry is about who will be in charge of the update routine, and what happens if they are corrupt or make a mistake: if the AI is friendly, then it had better worry about this as well.

Interestingly, the examples you started with suggested that the AI should be rewarded somehow in its current utility v as a compensation for accepting a change to a different utility w. That does sound more natural, and more stable against rogue updates.

"Well," you say, "if you take over and donate £10 to AMF in my place, I'd be perfectly willing to send my donation to Oxfam instead."

"Hum," I say, because I'm a hummer. "A donation to Oxfam isn't completely worthless to you, is it? How would you value it, compared with AMF?"

"At about a tenth."

"So, if I instead donated £9 to AMF, you should be willing to switch your £10 donations to Oxfam (giving you the equivalent value of £1 to AMF), and that would be equally good as the status quo?"

Question: I don't understand your Oxfam/AMF example. According to me, if you decided to donate £10 to AMF, I see that Oxfam, which I care about 0.1 times as much as AMF, has lost £1 worth of AMF donation, while AMF has gained £10. If I then decide to follow through with my perfect willingness, and I donate £10 to Oxfam, only then do I have equilibrium, because

Before: £10 * 0.1 utiliton + £10 * 1 utiliton = 11 utilitons.

After: £10 * 0.1 utiliton + £10 * 1 utiliton = 11 utilitons.

But in the second hypothetical,

After: £11 * 0.1 utiliton + £9 * 1 utiliton = 10.1 utilitons.

Which seems clearly inferior. In fact, even if you offered to switch donations with me, I wouldn't accept, because I may not trust you to fulfil your end of the deal, resulting in a lower expected utility.

I'm clearly missing some really important point here, but I fail to see how the example is related to utility function updating...

In the first situation, you were donating £10 to AMF (10 utilons).

Then I ask you to switch to Oxfam. You said yes, if I covered your donation to AMF. This would indeed give you £10+0.1*£10=£11, as you said.

I said "hang on." I pointed out that this was pure profit for you, and that if in instead I gave £9 to AMF, then this would be equivalent to your first situations (£9 (from me to AMF) + 0.1*£10 (from you to Oxfam) = £10). This is the point at which you are indifferent to changing.

because I may not trust you to fulfil your end of the deal

We removed those potential issues to get a clearer example.

Ah! I finally get it! Unfortunately I haven't gotten the math. Let me try to apply it, and you can tell me where (if?) I went wrong.

U = v + (Past Constants) →

U = w + E(v|v→v) - E(w|v→w) + (Past Constants).

Before, U = v + 0, setting (Past Constants) to 0 because we're in the initial state. v = 0.1*Oxfam + 1*AMF.

Therefore, U = 10 utilitons.

After I met you, you want me to change my v to a w that weights Oxfam higher, but only if a constant is given (the E terms): U' = w + E(v|v->v) - E(w|v->w). w = 1*Oxfam + 0.1*AMF.

What we want is for U = U'.

E(v|v->v) = ? I'm guessing this term means, "Let's say I'm a v maximiser. How much is v?" In that case, E(v|v->v) = 10 utilitons.

E(w|v->w) = ? I'm guessing this term means, "Let's say I become a w maximiser. How much is w?" In that case, E(w|v->w) = 10 utilitons.

U' = w + 10 - 10 = w.

Let's try a different U*, with utility function w* = 1*Oxfam + 10*AMF (it acts the same as a v-maximiser). E(v|v->v) = 10 utilitons. E(w*|v->w*) = 100 utilitons. U* = w* + 10 - 100 = w* - 90.

Trying this out, we obviously will be donating 10 to AMF in both utility functions. U = v = 0.1*Oxfam + 1*AMF = 0.1*0 + 1*10 = 10 utilitons. U* = w* - 90 = 1*Oxfam + 10*AMF - 90 = 0 + 100 - 90 = 10 utilitons.

Obviously all these experiments are useless. v = 0.1*Oxfam + 1*AMF is a completely useless utility function. It may as well be 0.314159265*Oxfam + 1*AMF. Let's try something that actually makes some sense, (economically.)

Let's have a simple marginal utility curve, (note partial derivatives) dv/dOxfam = 1-0.1*Oxfam, dv/dAMF = 10-AMF. In both cases, donating more than 10 to either charity is plain stupid.

U = v
v = (Oxfam-0.05*Oxfam^2) + (10*AMF-0.5*AMF^2)
Maximising U leads to AMF = 100/11 ≈ 9.09, Oxfam ≈ 0.91
v happens to be: v = 555/11 ≈ 50.45

(Note: Math is mostly intuitive to me, but when it comes to grokking quadratic curves by applying them to utility curves which I've never dabbled with before, let's just say I have a sizeable headache about now.)

Now you, because you're so human and you think we simulated AI can so easily change our utility functions, come over to me and tell me to change v to w = (100*Oxfam-5*Oxfam^2) + (10*AMF-0.5*AMF^2). What you're saying is to increase dw/dOxfam = 100 * dv/dOxfam, while leaving dw/dAMF = dv/dAMF. Again, partial derivatives.

U' = w + E(v|v->v) - E(w|v->w).
Maximising w leads to Oxfam = 100/11 ≈ 9.09, AMF = 0.91, the opposite of before.
w = 5550/11 ≈ 504.5
U' = w + 555/11 - 5550/11 = w - 4995/11
Which still checks out.
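A quick numerical check of these figures (a sketch; it assumes, as the calculation above implicitly does, that the £10 budget is split between the two charities, Oxfam + AMF = 10):

```python
from fractions import Fraction as F

def v(ox, amf): return (ox - F(5, 100) * ox**2) + (10 * amf - F(1, 2) * amf**2)
def w(ox, amf): return (100 * ox - 5 * ox**2) + (10 * amf - F(1, 2) * amf**2)

# Claimed optima: the v-maximiser gives Oxfam 10/11 and AMF 100/11; the w-maximiser the reverse.
v_max = v(F(10, 11), F(100, 11))    # 555/11  ~ 50.45
w_max = w(F(100, 11), F(10, 11))    # 5550/11 ~ 504.5
print(v_max, w_max, v_max - w_max)  # the last term is -4995/11, the new constant

# Sanity check that these really are the maxima on the budget line Oxfam + AMF = 10:
grid = [F(i, 110) for i in range(1101)]   # Oxfam from 0 to 10 in steps of 1/110
assert v_max == max(v(ox, 10 - ox) for ox in grid)
assert w_max == max(w(ox, 10 - ox) for ox in grid)
```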

Also, I think I finally get the math too, after working this out numerically. It's basically U = (Something), and any utility function change must preserve that (Something): U' = (Something) is a requirement. So you have your U = v + (Constants), and you set U' = U, just that you have to maximise v or w before determining your new set of (Constants):

max(v) + (Constants) = max(w) + (New Constants)

(New Constants) = max(v) - max(w) + (Constants), which are your E(v|v->v) - E(w|v->w) + (Constants) terms, except under different names.

Huh. If only I had thought max(v) and max(w) from the start... but instead I got confused with the notation.

Thanks for sticking it out to the end :-)