I often think that the idea that "humans have values" is wrong. Humans don't "have" values. They are boxes in which different values appear, reach their goals, and dissolve.
I have had countless different values during my life, and they often contradict each other. There is something like a democracy of values in the human mind, where different values affect my behaviour according to some form of interaction between them. Sometimes it is a dictatorship.
But if we look at a human as a box for values, this still creates a preferred set of values. One is the need to preserve the box, that is, survival (and life extension). Another, perhaps less obvious, is preventing the dictatorship of any one value.
These form a set of meta-values that help different values thrive and interact: values that come from the social medium, from books I read, from biological drives, and from personal choices.
This is correct. In fact, it is common on LW to use the word "agent" to mean something that rigidly pursues a single goal as though it were infinitely important. The title of this post uses it this way. But no agents exist, in this sense, and no agents should exist. We are not agents and should not want to be, in that way.
On the other hand, this is a bad way to use the word "agent", since it is better just to apply it to humans as they are.
That's why I used the "(idealised) agent" description (but titles need to be punchier).
Though I think "simple" goal is incorrect. The goal can be extremely complex - much more complex than human preferences. There's no limit to the subtleties you can pack into a utility function. There is a utility function that will fit perfectly to every decision you make in your entire life, for example.
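To make the "fits every decision" claim concrete, here is a minimal sketch of one trivial construction, assuming each decision is recorded as a unique (situation, action) pair. The names fit_utility and best_action and the sample history are hypothetical, chosen only for illustration.

```python
def fit_utility(history):
    """Build a utility function that rates each recorded choice above all others.

    history: iterable of (situation, chosen_action) pairs; situations are
    assumed to be unique (e.g. time-stamped), which is an assumption of this sketch.
    """
    chosen = dict(history)

    def utility(situation, action):
        # 1.0 for the action actually taken in that situation, 0.0 otherwise.
        return 1.0 if chosen.get(situation) == action else 0.0

    return utility


def best_action(utility, situation, options):
    """Return the option with the highest utility in the given situation."""
    return max(options, key=lambda action: utility(situation, action))


# Hypothetical decision log and the options that were available each time.
history = [
    ("monday breakfast", "toast"),
    ("monday commute", "bike"),
    ("tuesday breakfast", "cereal"),
]
options = {
    "monday breakfast": ["toast", "cereal", "nothing"],
    "monday commute": ["bus", "bike"],
    "tuesday breakfast": ["toast", "cereal", "nothing"],
}

u = fit_utility(history)
replayed = [(s, best_action(u, s, options[s])) for s, _ in history]
```

Maximising u reproduces the whole log, but the function is just a lookup table over past choices. It fits perfectly and says nothing about situations outside the log.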
The reason to look for an idealised agent, though, is that a utility function is stable in a way that humans are not. If there is some stable utility function that encompasses human preferences (it might be something like "this is the range of human preferences" or similar) then, if given to an AI, the AI will not seek to transform humans into something else in order to satisfy our "preferences".
The AI has to be something of an agent, so its model of human preferences has to be an agent-ish model.
"There is a utility function that will fit perfectly to every decision you make in your entire life, for example."
Sure, but I don't care about that. If two years from now a random glitch causes me to do something a bit different, which means that my full set of actions matches some slightly different utility function, I will not care at all.
Is that really the standard definition of agent though? Most textbooks I've seen talk of agents working towards the achievement of a goal, but they say nothing about the permanence of that goal system. I would expect an "idealized agent" to always take actions that maximize the likelihood of achieving its goals, but that is orthogonal to whether the system of goals changes.
I think that any agent with a short single goal is dangerous, and such people are called "maniacs". Addicts also have only one goal.
One way to try to create a "safe agent" is to give it a very long list of goals. A human being comes with a complex set of biological drives, and culture provides a complex set of values. This large set of values creates context for any value or action.
I don't think it's false, it's more like implicitly conditioned on what you expect. I would say it unrolls into "I don't want to live past 100 given that I expect myself to be sick, feeble-minded, and maybe in permanent low-grade pain by that point".
Take away the implied condition and the preference will likely change as well.
It isn't rare to come across an actually healthy, smart, and happy 80-year-old who says that they feel that they have basically lived long enough. Obviously this is anecdotal, but I would estimate that I have seen or heard of such incidents at least ten times in my life. So this is not only a counterfactual. People sometimes preserve these preferences even when the situation is actual.
There's a difference between contradictory preferences and time-inconsistent preferences. A rational agent can both want to live at least one more year and not want to live more than a hundred years, and this is not contradicted by the possibility that the agent's preferences will have changed 99 years later, so that the agent then wants to live at least another year. Of course, the agent has an incentive to influence its future self to have the same preferences it does (i.e. so that 99 years later, it wants to die within a year), so that its preferences are more likely to be achieved.
To rephrase my comment on your previous post, I think the right solution isn't to extrapolate our preferences, but to extrapolate our philosophical abilities and use that to figure out what to do with our preferences. There's no unique way to repair a utility function that assumes a wrong model of the world, or reconcile two utility functions within one agent, but if the agent is also a philosopher there might be hope.
Many philosophical problems seem to have correct solutions, so I have some hope. For example, the Absent-Minded Driver problem is a philosophical problem with a clear correct solution. Formalizing the intuitive process that leads to solving such problems might be safer than solving them all up front (possibly incorrectly) and coding the solutions into FAI.
I have what feels like a naive question. Is there any reason that we can't keep appealing to even higher-order preferences? I mean, when I find that I have these sorts of inconsistencies, I find myself making an additional moral judgment that tries to resolve the inconsistency. So couldn't you show the human (or, if the AI is doing all this in its 'head', a suitably accurate simulation of the human) that their preference depends on the philosopher that we introduce them to? Or in other cases where, say, ordering matters, show them multiple orderings, or their simulations' reactions to every possible ordering where feasible, and so on. Maybe this will elicit a new judgment that we would consider morally relevant. But this all relies on simulation, I don't know if you can get the same effect without that capability, and this solution doesn't seem even close to being fully general.
I imagine that this might not do much to resolve your confusion however. It doesn't do much to resolve mine.
It seems to me to jibe with how many people react to unexpected tensions between different parts of their values (e.g. "global warming" vs "markets solve everything", or "global warming" vs "nuclear power is bad"). If the tension can't be ignored or justified away, they often seem to base their new decision on affect and social factors, far more than on any principled meta-preference for how to resolve tensions.
But you can still keep asking the "why" question and go back dozens of layers, usually suffering combinatorial explosion of causes, and even recursion in some cases. Only very, very rarely have I ever encountered a terminal, genesis cause for which there isn't a "why" -- the will to live is honestly the only one occurring to me right now. Everything else has causes upon causes as far as I'd care to look...
it's predictable that every year they are alive, they will have the same desire to survive till the next year.
As I've pointed out before, this is false. There is an annual probability that they will want to die within a year, and there is no reason to believe this probability will diminish indefinitely. So sooner or later they will not want to survive another year.
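A rough numerical sketch of this point, assuming (purely for illustration) that each year's chance of wanting to die is independent of the others. The function name and the 1% figure are hypothetical, not taken from the discussion.

```python
def prob_still_wants_to_live(annual_probs):
    """Probability that the person has wanted to survive in every year so far.

    annual_probs: per-year probabilities of wanting to die within a year,
    treated as independent (an assumption of this sketch).
    """
    prob = 1.0
    for p in annual_probs:
        prob *= 1.0 - p
    return prob


# Hypothetical constant 1% annual chance of wanting to die within a year.
after_century = prob_still_wants_to_live([0.01] * 100)
after_millennium = prob_still_wants_to_live([0.01] * 1000)
```

As long as the annual probability stays above some fixed positive floor, this product shrinks towards zero as the number of years grows. It stays bounded away from zero only if the annual probabilities themselves fall fast enough.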
Crossposted at the Intelligent Agents Forum.
This is an example of humans not being (idealised) agents.
Imagine a human who has a preference to not live beyond a hundred years. However, they want to live to next year, and it's predictable that every year they are alive, they will have the same desire to survive till the next year.
This human (not a completely implausible example, I hope!) has a contradiction between their long and short term preferences. So which is accurate? It seems we could resolve these preferences in favour of the short term ("live forever") or the long term ("die after a century") preferences.
Now, at this point, maybe we could appeal to meta-preferences - what would the human themselves want, if they could choose? But often these meta-preferences are un- or under-formed, and can be influenced by how the question or debate is framed.
Specifically, suppose we are scheduling this human's agenda. We have the choice of making them meet one of two philosophers (not meeting anyone is not an option). If they meet Professor R. T. Long, he will advise them to follow their long term preferences. If instead they meet Paul Kurtz, he will advise them to pay attention to their short term preferences. Whichever one they meet, they will argue for a while and will then settle on the recommended preference resolution. And then they will not change that, whoever they meet subsequently.
Since we are doing the scheduling, we effectively control the human's meta-preferences on this issue. What should we do? And what principles should we use to do so?
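The lock-in described above can be written as a toy model, assuming the human adopts whatever the first philosopher recommends and never revises it. The ADVICE table and the settled_resolution function are hypothetical names for this sketch.

```python
# Which resolution each philosopher argues for, as described in the post.
ADVICE = {
    "R. T. Long": "long-term",
    "Paul Kurtz": "short-term",
}


def settled_resolution(meetings):
    """Return the resolution the human ends up with after a sequence of meetings.

    The first meeting fixes the resolution; later meetings change nothing.
    """
    resolution = None
    for philosopher in meetings:
        if resolution is None:
            resolution = ADVICE[philosopher]
    return resolution


long_first = settled_resolution(["R. T. Long", "Paul Kurtz"])
kurtz_first = settled_resolution(["Paul Kurtz", "R. T. Long"])
```

The two schedules contain the same meetings and differ only in order, yet they yield opposite resolutions. In this model the human's final meta-preference is entirely a function of the scheduler's choice.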
It's clear that this can apply to AIs: if they are simultaneously aiding humans as well as learning their preferences, they will have multiple opportunities to do this sort of preference-shaping.