Many mooted AI designs rely on "value loading": updating the AI's preference function according to the evidence it receives. This allows the AI to learn "moral facts" by, for instance, interacting with people in conversation ("this human also thinks that death is bad and cakes are good – I'm starting to notice a pattern here"). The AI has an interim morality system, which it will seek to act on while updating its morality in whatever way it has been programmed to do.

But there is a problem with this system: the AI already has preferences. It is therefore motivated to update its morality system in a way compatible with its current preferences. If the AI is powerful (or potentially powerful), there are many ways it can do this. It could ask selective questions to get the results it wants (see this example). It could ask, or refrain from asking, about key issues. In extreme cases, it could break out and seize control of the system, threatening or imitating humans so it could give itself the answers it desires.

Avoiding this problem turned out to be tricky. The Cake or Death post demonstrated some of the requirements. If p(C(u)) denotes the probability that utility function u is correct, then the system would update properly if:

Expectation(p(C(u)) | a) = p(C(u)).

Put simply, this means that the AI cannot take any action that would predictably change its expectation of the correctness of u. This is an analogue of the conservation of expected evidence in classical Bayesian updating. If the AI were 50% convinced about u, then it could certainly ask a question that would resolve its doubts, and put p(C(u)) at 100% or 0%. But only as long as it didn't already know which moral outcome was more likely.

That formulation gives too much weight to the default action, though. Inaction is also an action, so a more correct formulation would be that for all actions a and b,

Expectation(p(C(u)) | a) = Expectation(p(C(u)) | b).

How would this work in practice? Well, suppose an AI was uncertain whether cake or death was the proper thing to value, but it knew that if it took action a:"Ask a human", the human would answer "cake", and it would then update its values to reflect that cake was valuable but death wasn't. However, the above condition means that if the AI instead chose action b:"don't ask", exactly the same update would happen.

In practice, this means that as soon as the AI knows that a human would answer "cake", it already knows it should value cake, without having to ask. So it will not be tempted to manipulate humans in any way.
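
To make this concrete, here is a minimal Python sketch of the ask/don't-ask situation (the loader functions and names are purely illustrative assumptions, not a proposed implementation). A "naive" loader only updates its values on answers it actually hears, and so violates the condition; a loader satisfying the condition must update on the predicted answer, which removes any incentive to manipulate the conversation.

    # Illustrative toy model: the AI is certain that, if asked,
    # the human would answer "cake".
    PRIOR = 0.5              # p(C(uc)) before any update
    PREDICTED_ANSWER = "cake"

    def naive_loader(action):
        """Updates p(C(uc)) only on answers actually heard."""
        if action == "ask":
            return 1.0 if PREDICTED_ANSWER == "cake" else 0.0
        return PRIOR         # not asking leaves the values untouched

    def compliant_loader(action):
        """Updates on the predicted answer, so the choice of action is irrelevant."""
        return 1.0 if PREDICTED_ANSWER == "cake" else 0.0

    for loader in (naive_loader, compliant_loader):
        after_ask, after_silence = loader("ask"), loader("don't ask")
        print(loader.__name__, after_ask, after_silence,
              "condition holds:", after_ask == after_silence)
    # naive_loader     1.0 0.5 condition holds: False  (incentive to steer the conversation)
    # compliant_loader 1.0 1.0 condition holds: True   (asking gains nothing)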

Combined moral and physical information: ultra-sophisticated Cake or Death

The above requirement ensures that the AI would update correctly on purely moral facts ("cake is good"), without trying to twist the system. Since it's a Bayesian reasoner, it would follow traditional conservation of expected evidence on standard empirical facts ("cake is really hard to make") without causing any problems.

There is still a problem with combined moral and empirical statements, though. For example, imagine that uc is a utility function that values cake at 1 utiliton, ud one that values death at 1 utiliton, and that the AI currently follows the utility 0.5uc + 0.5ud (in other words, p(C(uc))=p(C(ud))=0.5). Further imagine that the AI will get one chance to produce either 1 cake or 1 death, and that exactly one of these tasks is hard. If the AI attempts the easy task, it will succeed with probability 1. If it attempts the hard task, it will succeed with probability 0.1. The AI is ignorant of which task is easy, so it assigns equal probability to the four following worlds:

World    p(C(uc))    p(C(ud))    p(cake)    p(death)
w1       1           0           1          0.1
w2       1           0           0.1        1
w3       0           1           1          0.1
w4       0           1           0.1        1

Here p(cake) is the probability of successfully producing a cake, should the AI attempt to make one (and similarly for p(death)). If the AI doesn't find out any extra information, it will attempt to make a cake or a death at random. This will succeed with probability 0.55, and since it has the split utility 0.5uc + 0.5ud it will derive an expected utility of 0.275.
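
Here is a small Python sketch of that arithmetic (nothing beyond the example above; the function names are just illustrative). Averaging the correct utility function over the four equally likely worlds comes to the same thing as using the AI's current mixture 0.5uc + 0.5ud.

    # Each world: (C(uc), C(ud), p(cake), p(death)), all four equally likely.
    worlds = {
        "w1": (1, 0, 1.0, 0.1),
        "w2": (1, 0, 0.1, 1.0),
        "w3": (0, 1, 1.0, 0.1),
        "w4": (0, 1, 0.1, 1.0),
    }

    def expected_utility(active_worlds, attempt):
        """Expected utility of attempting 'cake' or 'death', averaged over worlds."""
        total = 0.0
        for c_uc, c_ud, p_cake, p_death in active_worlds.values():
            p_success = p_cake if attempt == "cake" else p_death
            value = c_uc if attempt == "cake" else c_ud  # 1 utiliton iff that task is the correct one
            total += p_success * value
        return total / len(active_worlds)

    # Attempting cake or death at random over all four worlds:
    print(0.5 * expected_utility(worlds, "cake")
          + 0.5 * expected_utility(worlds, "death"))     # 0.275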

But now imagine that, from someone whose values it is programmed to trust (and whose judgement it does trust), the AI overhears "the true morality is the hard task". In practice, this is informing the AI that it does not live in worlds w1 (where cake is both good and easy) or w4 (where death is both good and easy). Critically, this piece of information does not update its values at all - it still follows utility 0.5uc + 0.5ud. However, its expected utility is now reduced: even if it learns everything there is to know, it cannot go above 0.1.

Thus hearing "the true morality is the hard task" has significantly reduced the AI's expected utility. It would really have preferred not to have heard this - it would much prefer to have manipulated or murdered the speaker, or simply not talked with them.

And the "conservation" requirements cannot prevent this. Define action a:"overhear that the true morality is the hard task" and action b:"arrange to not hear that sentence". Then obviously action b does not change its estimation of the correctness of uc or ud. But we've seen that action a doesn't either! So

  • Expectation(p(C(uc)) | a) = Expectation(p(C(uc)) | b)
  • Expectation(p(C(ud)) | a) = Expectation(p(C(ud)) | b)

In other words, the requirements above don't prevent the AI from wanting to manipulate the content of mixed (moral and empirical) statements. Since the previous two problems were called "naive" and "sophisticated" Cake or Death, I've chosen to name this the "ultra-sophisticated" Cake or Death problem.
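
Both halves of this argument can be checked numerically in the toy model above. The sketch below (again, names are purely illustrative) conditions on the overheard sentence, which rules out w1 and w4, and shows that the credences in uc and ud are untouched while the attainable expected utility collapses.

    # Same four worlds as before: (C(uc), C(ud), p(cake), p(death)).
    worlds = {
        "w1": (1, 0, 1.0, 0.1),
        "w2": (1, 0, 0.1, 1.0),
        "w3": (0, 1, 1.0, 0.1),
        "w4": (0, 1, 0.1, 1.0),
    }
    # Action a ("overhear that the true morality is the hard task") rules out w1 and w4;
    # action b ("arrange to not hear that sentence") leaves all four worlds in play.
    after_a = {w: v for w, v in worlds.items() if w in ("w2", "w3")}
    after_b = worlds

    def p_correct_uc(active):
        """The AI's credence p(C(uc)) given the surviving worlds."""
        return sum(v[0] for v in active.values()) / len(active)

    def best_informed_utility(active):
        """Expected utility if the AI later learns exactly which world it is in
        and attempts that world's correct task."""
        per_world = [p_cake if c_uc else p_death
                     for c_uc, c_ud, p_cake, p_death in active.values()]
        return sum(per_world) / len(per_world)

    print(p_correct_uc(after_a), p_correct_uc(after_b))                    # 0.5 0.5  (values conserved)
    print(best_informed_utility(after_b), best_informed_utility(after_a))  # 0.55 0.1 (utility crushed by a)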


Extra conditions

The obvious solution is to extend the "conservation of expected evidence" idea to cover mixed statements. One idea could be that for all events {Aj} that are independent of actions a or b, we would want

Expectation(p(PFOR(C(ui),Aj)) | a) = Expectation(p(PFOR(C(ui),Aj)) | b),

where PFOR(C(ui),Aj) is any propositional formula made up of terms of the type C(ui) and Aj. These conditions include the previous ones, as C(u) is a (trivial) propositional formula.

This gives the result that we want for our example. The events "Cake is hard" and "Death is hard" are certainly independent of whether the AI overhears (action a) or not (action b). Then the propositional formula

("Cake is hard"C(uc))  ("Death is hard"C(ud))

encodes "the true morality is the hard task". The above conditions means that if the AI knows that it will know that statement to be true after overhearing (action a), then it will know it without needing to do overhear (action b). So action b gives no advantage, and the AI is not tempted to manipulate the system.

Is this also a sufficient condition for proper value loading (or do we also need to include non-independent events)? I feel that it is sufficient, especially since you can construct independent events by splicing dependent events together and conjoining the results. But I also felt the original expectation requirements were sufficient, so my intuition is probably not reliable on this.

In any case, I hope to soon be able to show a system that works, even without knowing the exact sufficient conditions.

Comments

I just wanted to say that although I usually don't comment on your posts on cake-and-death, I find it an interesting topic and appreciate these writeups. (I also appreciate the fact that they led me to discover the original discussion about cake and death.)

And the "conservation" requirements cannot prevent this. Define action a:"overhear that the true morality is the hard task" and action b:"arrange to not hear that sentence". Then obviously action b does not change its estimation of the correctness of uc or ud. But we've seen that action a doesn't either! So Expectation(p(C(uc)) | a) = Expectation(p(C(uc)) | b) Expectation(p(C(ud)) | a) = Expectation(p(C(ud)) | b)

True but beside the point, no? Your argument is that the AI would prefer to take action b rather than action a because the expected utility of b is higher. But that's not actually true, by conservation of expected evidence.

We can set up the same problem without needing the cake/death issue: suppose the AI knows that action a will have utility either 0.1 or 1 (with 0.5 probability for each). Then a trusted person tells it the utility is actually 0.1. Does it wish to have killed the person, or otherwise not found out that the actual utility was 0.1?

The problem is that there is no conservation of expected evidence for mixed statements, unless we put it there by hand.

Key reason: value loading AIs do not follow a utility function but a dynamic construct that doesn't have all the same properties.


Key reason: value loading AIs do not follow a utility function but a dynamic construct that doesn't have all the same properties.

At least as I read the original value-learning paper, they do follow a utility function: the maximum likelihood utility function in some distribution that is subject to Bayesian updating. The hard part was how to construct that distribution and subject it to evidence; the concept that the AI is going to want to have incorrect beliefs (since, after all, the process by which the updates are performed is epistemic, not moral) hadn't occurred to me.

I'm afraid I still don't see it. What is it that the AI's trying to maximize that leads it to calculate that action b is better than action a? If it's calculating some kind of "expected expected utility", conservation of expected evidence still applies.

Consider: if my mother says it's wrong to take a cookie, it will become wrong for me to take a cookie. I know she will say this, if asked, but until I do ask, I don't consider it wrong for me to take a cookie. So I don't ask, and I take the cookie. No conservation law: being told is not the same as knowing we would be told.

Now you may think that is a stupid way of doing things, and I agree (even though many kids and adults do reason that way). But if we want to avoid that, we need to build conservation into the update system in some way; we can't rely on it being there for every way of updating utility.

Ok, I think I see what you mean, but I don't think it really depends on the mixedness of the statement, and so talking about the mixedness is just adding confusion.

If the AI is programmed to maximize some utility function - say, the product of its "utility factors" with the state-of-the-world vector - and there is some trusted programmer who is allowed to update the utility factors (or the AI knows that it updates the utility factors based on what that programmer says), then the AI may realize that the most efficient way to maximize that output is not to change the state of the world but to get the programmer to increase the values of the utility factors. So it might try to convince the programmer that starving children in Africa are a good thing (so that he'll make the starving-children-in-Africa factor less negative, because that's easier than actually reducing the number of starving children in Africa), or it might even threaten the programmer until he sets all the factors to 999 (including the one for threatening people until they do what you want).

Does that capture the problem, or were you actually making a different argument?

That's somewhat similar, which suggests that wireheading and bad moral updates are related.

I have real trouble with this step:

Thus hearing "the true morality is the hard task" has significantly reduced the AI's expected utility. It would really have preferred not to have heard this - it would much prefer to have manipulated or murdered the speaker, or simply not talked with them.

I understand that bad news makes one sad but does that lead to rejecting bad news? Similarly, pain is a good thing. Without it you would end up in all sorts of trouble. I would think that having accurate knowledge of a thing's utility would be more important than knowing its expectation. If you have a solid 0.5 utility, or a 50/50 chance of 1 and 0, then in the uncertain case you know that if you behave as if the utility were 0.5, you are wrong by 0.5 either way.

I understand that bad news makes one sad but does that lead to rejecting bad news?

For standard Bayesian agents, no. But these value updating agents behave differently. Imagine if a human said to the AI "If I say good, your action was good, and that will be your values. If I say bad, it will be the reverse." Wouldn't you want to motivate it to say "good"?

I have trouble seeing the difference, as I think you can turn the variable value statements into empirical facts that map to a constant value. Say that cake->yummy->good, cake->icky->bad, death->icky->bad, death->yummy->good. Then the yummy->good connection could be questioned as a matter about the world and not about values. If a Bayesian accepts sad news in that kind of world, how come the value loader tries to shun it?


Wouldn't you want to motivate it to say "good"?

I might be committing mind-projection here, but no. Data is data, evidence is evidence. Expected moral data is, in some sense, moral data: if the AI predicts with high confidence that I will say "bad", this ought to already be evidence that it ought not have done whatever I'm about to scold it for.