Crossposted at the Intelligent Agents Forum.

This is an example of humans not being (idealised) agents.

Imagine a human who has a preference to not live beyond a hundred years. However, they want to live to see next year, and it's predictable that every year they are alive, they will have the same desire to survive till the next year.

This human (not a completely implausible example, I hope!) has a contradiction between their long-term and short-term preferences. So which is accurate? It seems we could resolve these preferences in favour of either the short-term ("live forever") or the long-term ("die after a century") preference.

Now, at this point, maybe we could appeal to meta-preferences - what would the human themselves want, if they could choose? But often these meta-preferences are un- or under-formed, and can be influenced by how the question or debate is framed.

Specifically, suppose we are scheduling this human's agenda. We have the choice of having them meet one of two philosophers (not meeting anyone is not an option). If they meet Professor R. T. Long, he will advise them to follow their long-term preferences. If instead they meet Paul Kurtz, he will advise them to pay attention to their short-term preferences. Whichever one they meet, they will argue for a while and then settle on the recommended preference resolution. And they will not change that resolution, whoever they meet subsequently.

Since we are doing the scheduling, we effectively control the human's meta-preferences on this issue. What should we do? And what principles should we use to do so?

It's clear that this can apply to AIs: if they are simultaneously aiding humans and learning their preferences, they will have multiple opportunities for this sort of preference-shaping.


I often think that the idea that "humans have values" is wrong. Humans don't "have" values; they are boxes in which different values appear, pursue their goals, and dissolve.

I have had countless different values during my life, and they often contradict each other. There is something like a democracy of values in the human mind, where different values affect my behaviour according to some form of interaction between them. Sometimes it is a dictatorship.

But if we look at a human as a box for values, this still creates some preferred set of values. One is the need to preserve the box, that is, survival (and life extension). Another, perhaps less obvious, is preventing the dictatorship of any one value.

This is a set of meta-values which help the different values thrive and interact - values which come from the social environment, from books I read, from biological drives, and from personal choices.

This is correct. In fact, it is common on LW to use the word "agent" to mean something that rigidly pursues a single goal as though it were infinitely important. The title of this post uses it this way. But no agents exist in this sense, and no agents should exist. We are not agents in that way, and should not want to be.

On the other hand, this is a bad way to use the word "agent", since it is better just to use it for humans as they are.

That's why I used the "(idealised) agent" description (but titles need to be punchier).

Though I think "simple goal" is incorrect. The goal can be extremely complex - much more complex than human preferences. There's no limit to the subtleties you can pack into a utility function. There is a utility function that will fit perfectly to every decision you make in your entire life, for example.
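
For concreteness, here is a minimal sketch of why such a fitted utility function always exists (my own construction, with a made-up example history, not anything from the post): just reward agreement with the recorded decisions, point by point.

```python
# A toy "fitted" utility function (purely illustrative; the history below
# is made up): it is maximised by exactly the decisions actually taken.

def fitted_utility(history):
    """Given `history`, a dict mapping situation -> action actually taken,
    return a utility function that those actions perfectly maximise."""
    def utility(situation, action):
        # 1 for doing whatever was actually done, 0 for anything else.
        return 1.0 if history.get(situation) == action else 0.0
    return utility

# Whatever decisions appear in the history...
my_history = {"breakfast_2017_03_01": "toast", "career_choice": "research"}
u = fitted_utility(my_history)

# ...come out as utility-maximising under u, by construction.
assert u("career_choice", "research") > u("career_choice", "banking")
```

A function built this way is of course vacuous: it has no predictive content, and a slightly different history yields a slightly different function.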

The reason to look for an idealised agent, though, is that a utility function is stable in a way that humans are not. If there is some stable utility function that encompasses human preferences (it might be something like "this is the range of human preferences" or similar) then, if given to an AI, the AI will not seek to transform humans into something else in order to satisfy our "preferences".

The AI has to be something of an agent, so its model of human preferences has to be an agent-ish model.

"There is a utility function that will fit perfectly to every decision you make in your entire life, for example."

Sure, but I don't care about that. If two years from now a random glitch causes me to do something a bit different, which means that my full set of actions matches some slightly different utility function, I will not care at all.

Is that really the standard definition of an agent, though? Most textbooks I've seen talk of agents working towards the achievement of a goal, but say nothing about the permanence of that goal system. I would expect an "idealized agent" to always take actions that maximize the likelihood of achieving its goals, but that is orthogonal to whether the system of goals changes.

Then take my definition of an agent in this post as "an expected utility maximiser with a clear and distinct utility that is, in practice, Cartesianly separated from the rest of the universe", and I'll try to be clearer in subsequent posts.

I think that any agent with a single, simple goal is dangerous; such people are called "maniacs". Addicts also have only one goal.

One way to try to create a "safe agent" is to give it a very long list of goals. A human being comes with a complex set of biological drives, and culture provides a complex set of values. This large set of values creates context for any single value or action.

So replace the paperclip-tiling AI with the yak-shaving AI? :-D

Not all complex values are safe. For example, the negation of human values is exactly as complex as human values but is the most dangerous set of values possible.

This is true, as long as you do not allow any consistent way of aggregating the list (and humans do not have a way to do that, which prevents them from being dangerous).

Statements about preferences are not preferences. "I don't want to live past 100", for most people, is a false statement, not a contradictory desire.

I don't think it's false; it's more like implicitly conditioned on what you expect. I would say it unrolls into "I don't want to live past 100 given that I expect myself to be sick, feeble-minded, and maybe in permanent low-grade pain by that point".

Take away the implied condition and the preference will likely change as well.

Unfortunately the implied conditional is often an alief, not a belief. So if you say "imagine that you were healthy, smart, and happy..." they'll still often say they don't want to live that long. But if there were a lot of healthy, smart, happy 100-year-olds, people would change their minds.

So if you say "imagine that you were healthy, smart, and happy..." they'll still often say they don't want to live that long.

And what makes you believe that? I doubt that you have data.

It isn't rare to come across a genuinely healthy, smart, and happy 80-year-old who says that they feel they have basically lived long enough. Obviously this is anecdotal, but I would estimate that I have seen or heard of such incidents at least ten times in my life. So this is not only a counterfactual: people sometimes preserve these preferences even when the situation is actual.

In fact I do. Parental data :-(

Fair enough. Either way, it's not a contradiction; it's just imprecision in the communication of preferences.

Note that there may be inconsistency over time - predicted preferences and actual preferences often differ. I don't see any reason that wouldn't be true of an AI as well.

It's a true statement, in that people will take actions that match up with that preference.

Exactly. If someone says, "I don't want to live past 100, and therefore I will not bother to exercise," and they do not bother to exercise, it does not make sense to claim, "You secretly want to live past 100, even though you don't realize it."

There's a difference between contradictory preferences and time-inconsistent preferences. A rational agent can both want to live at least one more year and not want to live more than a hundred years, and this is not contradicted by the possibility that the agent's preferences will have changed 99 years later, so that the agent then wants to live at least another year. Of course, the agent has an incentive to influence its future self to have the same preferences it does (ie so that 99 years later, it wants to die within a year), so that its preferences are more likely to get achieved.
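
As a minimal sketch of the distinction (my own toy model, with purely illustrative numbers): a single consistent preference can value each year of life up to some horizon and disvalue years beyond it, and the dynamic inconsistency only appears if the future self re-anchors that horizon.

```python
def utility_at_age(current_age, death_age, horizon=100):
    """Utility the agent at `current_age` assigns to dying at `death_age`:
    each remaining year up to `horizon` is worth +1, each year beyond it -1.
    (Purely illustrative numbers.)"""
    valued_years = min(death_age, horizon) - current_age
    excess_years = max(death_age - horizon, 0)
    return valued_years - excess_years

# At age 30, the agent prefers surviving another year...
assert utility_at_age(30, 31) > utility_at_age(30, 30)
# ...and also prefers dying at 100 to dying at 101 - no contradiction.
assert utility_at_age(30, 100) > utility_at_age(30, 101)

# The dynamic inconsistency appears only if the 100-year-old re-anchors
# the horizon and now wants yet another year:
assert utility_at_age(100, 101, horizon=150) > utility_at_age(100, 100, horizon=150)
```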

To rephrase my comment on your previous post, I think the right solution isn't to extrapolate our preferences, but to extrapolate our philosophical abilities and use that to figure out what to do with our preferences. There's no unique way to repair a utility function that assumes a wrong model of the world, or reconcile two utility functions within one agent, but if the agent is also a philosopher there might be hope.

but to extrapolate our philosophical abilities and use that to figure out what to do with our preferences.

Do you expect that there will be a unique way of doing this, too?

Many philosophical problems seem to have correct solutions, so I have some hope. For example, the Absent-Minded Driver problem is a philosophical problem with a clear correct solution. Formalizing the intuitive process that leads to solving such problems might be safer than solving them all up front (possibly incorrectly) and coding the solutions into FAI.
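
For readers unfamiliar with the example, here is the standard planning-stage calculation, using Piccione and Rubinstein's usual payoffs (0 for exiting at the first intersection, 4 for exiting at the second, 1 for driving past both); the planning-optimal continuation probability derived below is the solution usually pointed to.

```latex
% Let p be the probability of continuing at each intersection.
\mathbb{E}[u](p) = 0\cdot(1-p) + 4\cdot p(1-p) + 1\cdot p^2 = 4p - 3p^2,
\qquad
\frac{d}{dp}\bigl(4p - 3p^2\bigr) = 4 - 6p = 0
\;\Longrightarrow\;
p^* = \tfrac{2}{3}, \quad \mathbb{E}[u](p^*) = \tfrac{4}{3}.
```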

It seems that the problems to do with rationality have correct solutions, but not the problems to do with values.

Why? vNM utility maximization seems like a philosophical idea that's clearly on the right track. There might be other such ideas about being friendly to imperfect agents.

vNM is about rationality - about decisions.

Being friendly to imperfect agents is something I've seen no evidence for; it's very hard to even define.

I have what feels like a naive question. Is there any reason that we can't keep appealing to even higher-order preferences? I mean, when I find that I have these sorts of inconsistencies, I find myself making an additional moral judgment that tries to resolve the inconsistency. So couldn't you show the human (or, if the AI is doing all this in its 'head', a suitably accurate simulation of the human) that their preference depends on the philosopher that we introduce them to? Or in other cases where, say, ordering matters, show them multiple orderings, or their simulations' reactions to every possible ordering where feasible, and so on. Maybe this will elicit a new judgment that we would consider morally relevant. But this all relies on simulation, I don't know if you can get the same effect without that capability, and this solution doesn't seem even close to being fully general.

I imagine that this might not do much to resolve your confusion however. It doesn't do much to resolve mine.

I don't think most humans have higher-order preferences beyond, say, two levels at most.

Okay, well, that doesn't jibe with my own introspective experience.

It seems to me to jibe with how many people react to unexpected tensions between different parts of their values (e.g. "global warming" vs "markets solve everything", or "global warming" vs "nuclear power is bad"). If the tension can't be ignored or justified away, they often seem to base their new decision on affect and social factors, far more than on any principled meta-preference for how to resolve tensions.

But you can still keep asking the "why" question and go back dozens of layers, usually suffering a combinatorial explosion of causes, and even recursion in some cases. Only very, very rarely have I encountered a terminal, genesis cause for which there isn't a "why" - the will to live is honestly the only one occurring to me right now. Everything else has causes upon causes as far as I'd care to look...

Oh, in that sense, yeah. I meant as in having articulated meta-preferences that explain lower level preferences.

it's predictable that every year they are alive, they will have the same desire to survive till the next year.

As I've pointed out before, this is false. There is an annual probability that they will want to die within a year, and there is no reason to believe this probability will diminish indefinitely. So sooner or later they will not want to survive another year.
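
To spell out the probabilistic step (my own formalisation of the comment's argument): if the annual chance of wanting to die within a year stays bounded below by some fixed ε > 0, rather than shrinking towards zero, then

```latex
\Pr(\text{still wants to survive after } n \text{ years}) \;\le\; (1-\varepsilon)^n \to 0 \quad \text{as } n \to \infty,
```

so with probability one the desire to survive another year eventually lapses.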


You do understand the ideas of "illustrative example" and "thought experiments", right? ^_^

[This comment is no longer endorsed by its author]