This article was originally on the FHI wiki and is being reposted to LW Discussion with permission. All content in this article is credited to Daniel Dewey.

In value loading, the agent will pick the action:

argmax_{a ∈ A} Σ_{w ∈ W} p(w|e,a) Σ_{u ∈ U} u(w) p(C(u)|w)

Here A is the set of actions the agent can take, e is the evidence the agent has already seen, W is the set of possible worlds, and U is the set of utility functions the agent is considering.

The parameter C(u) is some measure of the 'correctness' of the utility u, so the term p(C(u)|w) is the probability of u being correct, given that the agent is in world w. A simple example is an AI that completely trusts its programmers: if u is a utility function that claims that giving cake is better than giving death, w1 is a world where the programmers have said "cake is better than death", and w2 is a world where they have said the opposite, then p(C(u)|w1) = 1 and p(C(u)|w2) = 0.
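To make the formula concrete, here is a minimal Python sketch of the value-loading decision rule applied to the cake/death example above. The world representation, model functions and numbers are all illustrative assumptions, not part of the original article.

```python
# Two candidate utility functions over toy worlds.
def u_cake(w):   # claims that giving cake is better than giving death
    return 1.0 if w["outcome"] == "cake" else 0.0

def u_death(w):  # claims the opposite
    return 1.0 if w["outcome"] == "death" else 0.0

U = [u_cake, u_death]

def p_correct(u, w):
    """p(C(u)|w): the AI completely trusts the programmers, so a utility
    function is correct with probability 1 exactly when the world's
    programmers have endorsed it."""
    endorsed = u_cake if w["said"] == "cake is better" else u_death
    return 1.0 if u is endorsed else 0.0

def p_world(w, e, a):
    """p(w|e,a): a toy world-model in which the action fixes the outcome
    and the evidence e fixes what the programmers have said."""
    return 1.0 if (w["outcome"] == a and w["said"] == e) else 0.0

def value_loading_choice(A, e, worlds):
    """Pick argmax_a sum_w p(w|e,a) * sum_u u(w) * p(C(u)|w)."""
    def expected_value(a):
        return sum(p_world(w, e, a) * sum(u(w) * p_correct(u, w) for u in U)
                   for w in worlds)
    return max(A, key=expected_value)

worlds = [{"outcome": o, "said": s}
          for o in ("cake", "death")
          for s in ("cake is better", "death is better")]

# Given evidence that the programmers said "cake is better", the agent gives cake.
print(value_loading_choice(["cake", "death"], "cake is better", worlds))  # -> cake
```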

There are several challenging things in this formula:

W : How to define/represent the class of all worlds under consideration

U : How to represent the class of all utility functions over such worlds

C : What we are claiming about the utility function: that it is true? That it is believed by humans?

p(C(u)|w) : How to define this probability

Σ_{u ∈ U} u(w) p(C(u)|w) : How to sum up utility functions (a moral uncertainty problem)

In contrast, the term

p(w|e,a)

is mostly the classic AI problem. Predicting what the world is like from evidence is hard, but it is a well-known, well-studied problem and not unique to the present research. There is one twist here: the world w includes the agent's own future actions, which will depend on how good future states look to it. This recursive definition eventually bottoms out, as in a game of chess (what happens when I make a move depends on what moves I make after that). It may cause an additional exponential explosion in evaluating the formula, though, so the agent may need to make probabilistic guesses about its own future behaviour in order to actually compute an action.
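A rough sketch of how that recursion bottoms out, assuming a made-up transition model, utility function and horizon of my own choosing (not from the article):

```python
def best_action(state, actions, transition, utility, horizon):
    """Pick the action with the highest expected utility, assuming the agent
    will also pick optimally at every later step (like looking ahead in chess)."""
    def value(s, depth):
        if depth == 0:
            return utility(s)  # the recursion bottoms out at the horizon
        return max(sum(p * value(s2, depth - 1) for s2, p in transition(s, a))
                   for a in actions)
    return max(actions,
               key=lambda a: sum(p * value(s2, horizon - 1)
                                 for s2, p in transition(state, a)))

# Toy usage: "safe" gains 1 per step for sure; "risky" gains 3 or loses 2 with
# equal probability. The branching over actions and outcomes is what makes the
# exact calculation blow up exponentially with the horizon.
def transition(s, a):
    return [(s + 1, 1.0)] if a == "safe" else [(s + 3, 0.5), (s - 2, 0.5)]

print(best_action(0, ["safe", "risky"], transition, utility=lambda s: s, horizon=3))  # -> safe
```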

This value loading equation is not subject to the classical Cake or Death problem, but is vulnerable to the more advanced version of the problem, if the agent is able to change the expected future value of p(C(u)) through its actions.

Daniel Dewey's Paper

The above idea was partially inspired by a draft of Learning What to Value, a paper by Daniel Dewey. He restricted attention to streams of interactions, and his equation, in a simplified form, is:

argmax_{a ∈ A} Σ_{s ∈ S} p(s|e,a) Σ_{u ∈ U} u(s) p(C(u)|s)

where S is the set of all possible streams of past and future observations and actions.
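Under the same toy assumptions as the earlier sketch, the only structural change is that the sums run over interaction streams rather than worlds. In the sketch below, p_stream and p_correct are assumed stand-ins for p(s|e,a) and p(C(u)|s); nothing here comes from Dewey's paper itself.

```python
def dewey_choice(A, e, streams, U, p_stream, p_correct):
    """Stream-based variant: utility and correctness are evaluated on complete
    interaction streams s instead of on worlds w."""
    def expected_value(a):
        return sum(p_stream(s, e, a) * sum(u(s) * p_correct(u, s) for u in U)
                   for s in streams)
    return max(A, key=expected_value)
```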

Comments

Maybe "value loading" is a term most people here can be expected to know, but I feel like this post would really be improved by ~1 paragraph of introduction explaining what's being accomplished and what the motivation is.

As it is, even the text parts make me feel like I'm trying to decipher an extremely information-dense equation.

Maybe "value loading" is a term most people here can be expected to know

It's the first time I've seen the term, and the second it has appeared at all on LessWrong.

It may be more current among "people who are on every mailing list, read every LW post, or are in the Bay Area and have regular conversations with [the SI]" (from its original mention on LW).

It's more an FHI term than a SI/LessWrong term.

It's often called "indirect normativity": a strategy in which, instead of directly encoding the goal for an AI (or moral agent), we specify a certain way of learning what to value / inferring human values, so that the AI can deduce human values and then implement them.

Ah, so it means the same thing as "value learning?" For some reason when I read "value loading" I thought of, like, overloading a function :D "I want cake, and that desire is also a carnal lust for BEES!"

What helped me was thinking of it in terms of: "Oh, like 'reading' human preferences as if they were an XML config file that the program loads at runtime."

Could you define the "Cake or Death problem" and give an example of a decision-making system that falls prey to it?

First nitpick: Since the sum on i (i just being some number I'm using to number utility functions) of u_i(w)·p(C(u_i)|w) is a function only dependent on w, it's really just a complicatedly-written utility function. I think you want u_i(w)·p(C(u_i)|w, e) - that would allow the agent to gain some sort of evidence about its utility function. Also, since C(u_i) is presumably supposed to represent a fixed logical thingamabob, to be super-precise we could talk about some logical uncertainty measure over whether the utility function is correct, M(u_i, w, e), rather than a probability - but I think we don't have to care about that.

Second nitpick: To see what happens, let's assume our agent has figured out its utility function - it now picks the action with the largest sum on w of p(w|e, a)·u(w), where "w" is a world describing present, past and future, and u(w) is its one true utility function. This looks a lot like an evidential decision theory (EDT) agent, which runs into known problems. For example, if there were a disease that had low utility but made you unable to punch yourself in the face, an EDT agent would want to punch itself in the face in order to increase the probability that it doesn't have the disease.
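A toy numeric version of that EDT failure, with all numbers invented for illustration: suppose the disease is worth -100 and makes punching impossible, punching costs 1, half of all agents are diseased, and healthy agents happen to punch themselves half the time.

```python
p_disease = 0.5
p_punch_if_healthy = 0.5
p_punch_if_diseased = 0.0  # the disease prevents punching

def p_disease_given(action):
    # Bayes: P(disease | action), treating the chosen action as evidence.
    like_d = p_punch_if_diseased if action == "punch" else 1 - p_punch_if_diseased
    like_h = p_punch_if_healthy if action == "punch" else 1 - p_punch_if_healthy
    return like_d * p_disease / (like_d * p_disease + like_h * (1 - p_disease))

def edt_expected_utility(action):
    cost = -1 if action == "punch" else 0
    return cost + p_disease_given(action) * (-100)

print(edt_expected_utility("punch"))     # -1.0: punching is strong evidence of health
print(edt_expected_utility("no punch"))  # about -66.7: not punching is evidence of disease
```

EDT therefore prefers the punch, even though punching does nothing to prevent the disease.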

I'll post the "cake or death" problem in a post soon.

This one?

(Remember: always give your esoteric philosophical conundra good names.)

Oh, okay, thanks. So, shallowly speaking, you just needed to multiply the utilities of the strategies "don't ask and pick cake" and "don't ask and pick death" by 0.5.