You know that when you title a post with "clarified", you're just asking for the gods to smite you down, but let's try...

There has been some confusion about the concept of "conservation of expected moral evidence" that I touched upon in my posts here and here. The fault for the confusion is mine, so this is a brief note to try and explain it better.

The canonical example is that of a child who wants to steal a cookie. That child gets its morality mainly from its parents. The child strongly suspects that if it asks, all parents will indeed confirm that stealing cookies is wrong. So it decides not to ask, and happily steals the cookie.

I argued that this behaviour showed a lack of "conservation of expected moral evidence": if the child knows what the answer would be, then that should be equivalent to actually asking. Some people got this immediately; others were confused: the agents I defined seemed Bayesian, and Bayesian agents already obey conservation of expected evidence, so how could they violate that principle?

The answer is... both groups are right. The child can be modelled as a Bayesian agent reaching sensible conclusions. If it values "I don't steal the cookie" at 0, "I steal the cookie without being told not to" at 1, and "I steal the cookie after being told not to" at -1, then its behaviour is rational - and those values are acceptable utility values over possible universes. So the child (and many value loading agents) are Bayesian agents with the usual properties.
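To make that concrete, here is a minimal sketch of the child's reasoning as a plain expected-utility calculation (the function names and the 0.95 credence are made up for illustration; only the -1/0/+1 utilities come from the paragraph above):

```python
# The cookie example as straightforward expected-utility maximisation.
# P_WRONG is a hypothetical number standing in for "the child strongly
# suspects the parents would say stealing is wrong".

P_WRONG = 0.95

# Utilities as given above: outcomes depend on whether the child steals
# and on whether it has been told not to.
U_NO_STEAL = 0
U_STEAL_UNTOLD = 1   # steal without being told not to
U_STEAL_TOLD = -1    # steal after being told not to

def eu_dont_ask_and_steal():
    # Never told not to, so stealing is worth +1 for sure.
    return U_STEAL_UNTOLD

def eu_ask_then_act_optimally():
    # If told "wrong" (prob P_WRONG): best response is not to steal (0 > -1).
    # If told "fine" (prob 1 - P_WRONG): best response is to steal (+1 > 0).
    return (P_WRONG * max(U_NO_STEAL, U_STEAL_TOLD)
            + (1 - P_WRONG) * max(U_NO_STEAL, U_STEAL_UNTOLD))

print(eu_dont_ask_and_steal())                # 1
print(round(eu_ask_then_act_optimally(), 3))  # 0.05
# "Don't ask, steal the cookie" maximises expected utility: the child is a
# perfectly coherent Bayesian agent, and the problem lies in its utilities,
# not in its reasoning.
```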

But we are adding extra structure to the universe. Based on our understanding of what value loading should be, we are decreeing that the child's behaviour is incorrect. Though it doesn't violate expected utility, it violates any sensible meaning of value loading. Our idea of value loading is that, in a sense, values should be independent of many contingent things. There is nothing intrinsically wrong with "stealing cookies is wrong iff the Milky Way contains an even number of pulsars", but it violates what values should be. Similarly for "stealing cookies is wrong iff I ask about it".

But let's dig a bit deeper... Classical conservation of expected evidence fails in many cases. For instance, I can certainly influence the variable X="what Stuart will do in the next ten seconds" (or at least, my decision theory is constructed on assumptions that I can influence that). My decisions change X's expected value quite dramatically. What I can't influence are facts that are not contingent on my actions. For instance, I can't change my expected estimation of the number of pulsars in the galaxy last year. Were I super-powerful, I could change my expected estimation of the number of pulsars in the galaxy next year - by building or destroying pulsars, for instance.
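For facts of that uninfluenceable kind, conservation of expected evidence is just the law of total probability. A small sketch with made-up numbers (the prior and the survey likelihoods are purely illustrative):

```python
# Conservation of expected evidence for a fact the agent cannot influence:
# the prior equals the expectation of the posterior over possible observations.

prior_even = 0.5               # P(the galaxy contained an even number of pulsars last year)
p_says_even_given_even = 0.8   # hypothetical survey: P(report "even" | truly even)
p_says_even_given_odd = 0.3    # hypothetical survey: P(report "even" | truly odd)

# P(report "even") by total probability
p_report_even = prior_even * p_says_even_given_even + (1 - prior_even) * p_says_even_given_odd

# Posterior after each possible report (Bayes' rule)
post_if_even_report = prior_even * p_says_even_given_even / p_report_even
post_if_odd_report = prior_even * (1 - p_says_even_given_even) / (1 - p_report_even)

# The expectation of the posterior recovers the prior: no decision of mine
# can shift what I currently expect to believe about last year's pulsars.
expected_posterior = (p_report_even * post_if_even_report
                      + (1 - p_report_even) * post_if_odd_report)
print(round(expected_posterior, 10))  # 0.5, equal to prior_even
```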

So conservation of expected evidence only applies to things that are independent of the agent's decisions. When I say we need to have "conservation of expected moral evidence" I'm saying that the agent should treat their (expected) morality as independent of their decisions. The kid failed to do this in the example above, and that's the problem.
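In the cookie example, that desideratum can be written as a simple invariance check. A toy sketch (same hypothetical 0.95 credence as before; the function names are made up):

```python
# An agent "conserves expected moral evidence" if its expected valuation of an
# act does not depend on its own decision to gather the moral evidence.

P_WRONG = 0.95   # credence that the parents would say stealing is wrong

def child_value_of_stealing(asked: bool) -> float:
    # The child from the example: +1 if it hasn't been told not to; if it asks
    # first, the expected value drops to P_WRONG * (-1) + (1 - P_WRONG) * (+1).
    return P_WRONG * (-1) + (1 - P_WRONG) * (+1) if asked else +1

def conserving_value_of_stealing(asked: bool) -> float:
    # The desired behaviour: what the agent expects to be told is treated as
    # if it had already been told, so asking changes nothing.
    return P_WRONG * (-1) + (1 - P_WRONG) * (+1)

def conserves(value_fn) -> bool:
    return value_fn(asked=True) == value_fn(asked=False)

print(conserves(child_value_of_stealing))       # False: the violation described above
print(conserves(conserving_value_of_stealing))  # True: the desideratum
```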

So conservation of expected moral evidence is something that would be automatically true if morality were something real and objective, and is also a desideratum when constructing general moral systems in practice.

asr

The canonical example is that of a child who wants to steal a cookie. That child gets its morality mainly from its parents. The child strongly suspects that if it asks, all parents will indeed confirm that stealing cookies is wrong. So it decides not to ask, and happily steals the cookie.

I find this example confusing. I think what it shows is that children (humans?) aren't very moral. The reason the child steals instead of asking isn't anything to do with the child's subjective moral uncertainty -- it's that the penalty for stealing-before-asking is lower than stealing-after-asking, and the difference in penalty is enough to make "take the cookie and ask forgiveness if caught" better than "ask permission".

I suspect this is related to our strong belief in being risk-averse when handing out penalties. If I think there's a 50% chance my child misbehaved, the penalty won't be 50% of the penalty if they were caught red-handed. Often, if there's substantial uncertainty about guilt, the penalty is basically zero -- perhaps a warning. Here, the misbehavior is "doing a thing you knew was wrong;" even if the child knows the answer in advance, when the child explicitly asks and is refused, the parent gets new evidence about the child's state of mind, and this is the evidence that really matters.

I suspect this applies to the legal system and society more broadly as well -- because we don't hand out partial penalties for possible guilt, we encourage people to misbehave in ways that are deniable.

A child will eventually learn the independence. It's a very different thing to hardcode the assumption of independence from the start. Values are also complex, and there might not be a natural line on what it is natural to care about. If I take "the morality of the agent is independent of its decisions" really far, it will mean the AI believes it's already a good boy or a bad boy no matter what it does (it just doesn't know which). This seems to be the opposite of what moral behavior is about. There is also no telling that the valued actions need to be simple concepts. After all, cake is good unless diabetes, and death might not be all bad if euthanasia.

Also, if I am inconsistent or outright make an error in assigning the labels, I would rather have the AI open a dialogue back to me about it than silently construct a really convoluted utility function.

We could of course think of a value loader as a two-step process, where the values loaded are not taken as-is as the actual utility function but are modified to make sense in a "non-cheating way". That is, the actual utility function always contains terms about moral fairness even if they are not explicitly input. This reduces the amount of values that are softcoded, since the meta-morality is hardcoded. But this layering also solves the original problem. The agent might evaluate the new situation with lower expected utility and would choose not to go there - if it had a choice. But by layering we are taking the choice away, so the wrong valuation doesn't matter. By enforcing the hand-inputted conservation of moral evidence we tie the AI's hands in forming a moral strategy. "We know better" beforehand.

An interesting point, hinting that my approach at moral updating ( http://lesswrong.com/lw/jxa/proper_value_learning_through_indifference/ ) may be better than I supposed.

I was more getting at the fact that it narrows down the problem instead of generalising it. It reduces the responsibilities of the AI and widens those of humans. If you solved this problem you would only get up to the most virtuous human (which isn't exactly bad). Going beyond would require ethics competency that would have to be added separately, as we are tying its hands in this department.

I take the point in practice, but there's no reason we couldn't design something to follow a path towards ultra-ethicshood that had the conservation property. For instance, if we could implement "as soon as you know your morals would change, then change them", this would give us a good part of the "conservation" law.
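A minimal sketch of that rule, with hypothetical numbers for the cookie case: the agent's operative valuation of an act is replaced, now, by the expectation of what it would become after the anticipated observation.

```python
# "As soon as you know your morals would change, then change them":
# pre-emptively adopt the valuation you expect to hold after the observation.

def preemptive_update(p_observation: dict, valuation_after: dict) -> float:
    """Adopt now the valuation the agent expects to hold after observing."""
    return sum(p_observation[o] * valuation_after[o] for o in p_observation)

# Hypothetical numbers: being told "wrong" (prob 0.95) would set the valuation
# of stealing to -1; being told "fine" (prob 0.05) would set it to +1.
p_observation = {"told_wrong": 0.95, "told_fine": 0.05}
valuation_after = {"told_wrong": -1.0, "told_fine": +1.0}

print(round(preemptive_update(p_observation, valuation_after), 3))  # -0.9
# The agent now values stealing at -0.9 whether or not it actually asks,
# which is a good part of the conservation property.
```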

[anonymous]

So conservation of expected moral evidence is something that would be automatically true if morality were something real and objective, and is also a desideratum when constructing general moral systems in practice.

Yes, but usual learning and prediction algorithms deal entirely with things that are "real and objective", in the sense that you simply cannot change them (i.e. laws of science).

This is yet another domain where my intuitions are outpacing my ability to learn mathematics. For domains where my actions can affect the experiment, I know damn well I should avoid affecting the experiment. The justification is damn simple when you think of data/information as a substance: experiments/learning are done to gain information, and if I alter the outcome of the experiment I gain information only about my own decisions, which I already had, thus rendering the experiment/learning pointless.

This leads to the question of how to model value learning as collecting moral information, and thus make it epistemically natural for the agent to conclude that biasing its own learning process yields falsehoods.
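One toy way to make that natural (illustrative numbers and names only): if the agent's utility depends only on the act and on the moral fact - not on whether it asked - then asking has non-negative value of information, so there is no incentive to dodge the question.

```python
# Value of moral information: when utility depends only on (act, moral fact),
# gathering the evidence can never lower expected utility.

P_WRONG = 0.95
U = {("steal", "wrong"): -1, ("steal", "fine"): +1,
     ("refrain", "wrong"): 0, ("refrain", "fine"): 0}

def eu_without_asking() -> float:
    # Pick the act that is best under current uncertainty.
    return max(P_WRONG * U[(a, "wrong")] + (1 - P_WRONG) * U[(a, "fine")]
               for a in ("steal", "refrain"))

def eu_with_asking() -> float:
    # Learn the fact first, then pick the best act for each answer.
    return (P_WRONG * max(U[(a, "wrong")] for a in ("steal", "refrain"))
            + (1 - P_WRONG) * max(U[(a, "fine")] for a in ("steal", "refrain")))

print(eu_without_asking())         # 0.0 (refraining is best under uncertainty)
print(round(eu_with_asking(), 3))  # 0.05 (asking can only help)
# The child's pathology only appears because its utility also depends on
# whether it was told, which is exactly what the post is ruling out.
```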

This area (or perhaps just the example?) is complicated somewhat because for authority-based moral systems (parental, religious, legal, professional...) directly ignoring a command/ruling is in itself considered an immoral act, on top of whatever the content of said act was. And even if the immorality of the act is constant, most of those systems seem to recognise, in principle and/or in practice, that acting when you suspect you'd get a different order is different from directly breaking orders.

This makes sense for all sorts of practical reasons: caution around uncertainty, the clearer Schelling point of 'you directly disobeyed', and, cynically, the fact that it can allow those higher in the authority chain plausible deniability (the old "I'm going to pay you based on your sales. Obviously I won't tell you to use unscrupulous methods. If you ask me directly about any, I will explicitly ban them. But I might not ask you in too much detail what you did to achieve those sales: that's up to you").

Yep, the example isn't perfect - but it is easy to grasp, and it's obviously not an ideal behaviour.

"So conservation of expected moral evidence is something that would be automatically true if morality were something real and objective, and is also a desiderata when constructing general moral systems in practice."

This seems to go against your pulsar example... I guess you mean something like: "if [values were] real, objective, and immutable"?

Sorry, I don't get your point. Could you develop it?