Holden's Objection 1: Friendliness is dangerous

PhilGoetz

18th May 2012

Nick_Beckstead asked me to link to posts I referred to in this comment. I should put up or shut up, so here's an attempt to give an organized overview of them.

Since I wrote these, LukeProg has begun tackling some related issues. He has accomplished the seemingly-impossible task of writing many long, substantive posts none of which I recall disagreeing with. And I have, irrationally, not read most of his posts. So he may have dealt with more of these same issues.

I think that I only raised Holden's "objection 2" in comments, which I couldn't easily dig up; and in a critique of a book chapter, which I emailed to LukeProg and did not post to LessWrong. So I'm only going to talk about "Objection 1: It seems to me that any AGI that was set to maximize a "Friendly" utility function would be extraordinarily dangerous." I've arranged my previous posts and comments on this point into categories. (Much of what I've said on the topic has been in comments on LessWrong and Overcoming Bias, and in email lists including SL4, and isn't here.)

The concept of "human values" cannot be defined in the way that FAI presupposes

Human errors, human values: Suppose all humans shared an identical set of values, preferences, and biases. We cannot retain human values without retaining human errors, because there is no principled distinction between them.

A comment on this post: There are at least three distinct levels of human values: the values an evolutionary agent holds that maximize its reproductive fitness, the values a society holds that maximize its fitness, and the values a rational optimizer holds who has chosen to maximize social utility. They often conflict. Which of them are the real human values?

Values vs. parameters: Eliezer has suggested using human values, but without time discounting (= changing the time-discounting parameter). CEV presupposes that we can abstract human values and apply them in a different situation that has different parameters. But the parameters are values. There is no distinction between parameters and values.

A comment on "Incremental progress and the valley": The "values" that our brains try to maximize in the short run are designed to maximize different values for our bodies in the long run. Which are human values: The motivations we feel, or the effects they have in the long term? LukeProg's post Do Humans Want Things? makes a related point.

Group selection update: The reason I harp on group selection, besides my outrage at the way it's been treated for the past 50 years, is that group selection implies that some human values evolved at the group level, not at the level of the individual. This means that increasing the rationality of individuals may enable people to act more effectively in their own interests, rather than in the group's interest, and thus diminish the degree to which humans embody human values. Identifying the values embodied in individual humans - supposing we could do so - would still not arrive at human values. Transferring human values to a post-human world, which might contain groups at many different levels of a hierarchy, would be problematic.

I wanted to write about my opinion that human values can't be divided into final values and instrumental values, the way discussion of FAI presumes they can. This is an idea that comes from mathematics, symbolic logic, and classical AI. A symbolic approach would probably make proving safety easier. But human brains don't work that way. You can and do change your values over time, because you don't really have terminal values.

Strictly speaking, it is impossible for an agent whose goals are all indexical goals describing states involving itself to have preferences about a situation in which it does not exist. Those of you who are operating under the assumption that we are maximizing a utility function with evolved terminal goals should, I think, admit that these terminal goals all involve either ourselves or our genes. If they involve ourselves, then utility functions based on these goals cannot even be computed once we die. If they involve our genes, then they are goals that our bodies are pursuing, which we call errors, not goals, when we, the conscious agents inside our bodies, evaluate them. In either case, there is no logical reason for us to wish to maximize some utility function based on these after our own deaths. Any action I wish to take regarding the distant future necessarily presupposes that the entire SIAI approach to goals is wrong.

My view, under which it does make sense for me to say I have preferences about the distant future, is that my mind has learned "values" that are not symbols, but analog numbers distributed among neurons. As described in "Only humans can have human values", these values do not exist in a hierarchy with some at the bottom and some on the top, but in a recurrent network which does not have a top or a bottom, because the different parts of the network developed simultaneously. These values therefore can't be categorized into instrumental or terminal. They can include very abstract values that don't need to refer specifically to me, because other values elsewhere in the network do refer to me, and this will ensure that actions I finally execute incorporating those values are also influenced by my other values that do talk about me.

Even if human values existed, it would be pointless to preserve them

Only humans can have human values:

The only preferences that can be unambiguously determined are the preferences a person (mind+body) implements, which are not always the preferences expressed by their beliefs.
If you extract a set of consciously-believed propositions from an existing agent, then build a new agent to use those propositions in a different environment, with an "improved" logic, you can't claim that it has the same values, since it will behave differently.
Values exist in a network of other values. A key ethical question is to what degree values are referential (meaning they can be tested against something outside that network); or non-referential (and hence relative).
Supposing that values are referential helps only by telling you to ignore human values.
You cannot resolve the problem by combining information from different behaviors, because the needed information is missing.
Today's ethical disagreements are largely the result of attempting to extrapolate ancestral human values into a changing world.
The future will thus be ethically contentious even if we accurately characterize and agree on present human values, because these values will fail to address the new important problems.

Human values differ as much as values can differ: There are two fundamentally different categories of values:

Non-positional, mutually-satisfiable values (physical luxury, for instance)
Positional, zero-sum social values, such as wanting to be the alpha male or the homecoming queen

All mutually-satisfiable values have more in common with each other than they do with any non-mutually-satisfiable values, because mutually-satisfiable values are compatible with social harmony and non-problematic utility maximization, while non-mutually-satisfiable values require eternal conflict. If you found an alien life form from a distant galaxy with non-positional values, it would be easier to integrate those values into a human culture with only human non-positional values than to integrate already-existing positional human values into that culture.

It appears that some humans have mainly the one type, while other humans have mainly the other type. So talking about trying to preserve human values is pointless - the values held by different humans have already passed the most-important point of divergence.

Enforcing human values would be harmful

The human problem: This argues that the qualia and values we have now are only the beginning of those that could evolve in the universe, and that ensuring that we maximize human values - or any existing value set - from now on, will stop this process in its tracks, and prevent anything better from ever evolving. This is the most-important objection of all.

Re-reading this, I see that the critical paragraph is painfully obscure, as if written by Kant; but it summarizes the argument: "Once the initial symbol set has been chosen, the semantics must be set in stone for the judging function to be "safe" for preserving value; this means that any new symbols must be defined completely in terms of already-existing symbols. Because fine-grained sensory information has been lost, new developments in consciousness might not be detectable in the symbolic representation after the abstraction process. If they are detectable via statistical correlations between existing concepts, they will be difficult to reify parsimoniously as a composite of existing symbols. Not using a theory of phenomenology means that no effort is being made to look for such new developments, making their detection and reification even more unlikely. And an evaluation based on already-developed values and qualia means that even if they could be found, new ones would not improve the score. Competition for high scores on the existing function, plus lack of selection for components orthogonal to that function, will ensure that no such new developments last."

Averaging value systems is worse than choosing one: This describes a neural network that encodes preferences, takes some input pattern, and computes a new pattern that optimizes these preferences. Such a system is taken as an analogue of a value system together with an ethical system for attaining those values. I then define a measure for the internal conflict produced by a set of values, and show that a system built by averaging together the parameters from many different systems will have higher internal conflict than any of the systems that were averaged together to produce it. The point is that the CEV plan of "averaging together" human values will result in a set of values that is worse (more self-contradictory) than any of the value systems it was derived from.
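The averaging argument can be illustrated with a toy model. The following is a minimal sketch under stated assumptions, not the network or the conflict measure from the linked post: each value system is assumed to be a Hopfield-style network whose weights are the Hebbian outer product of a single ±1 preference pattern, and "internal conflict" is assumed to be the smallest fraction of total weight magnitude that must be violated by any state. The names `hebbian_weights`, `average_weights`, and `internal_conflict` are hypothetical.

```python
from itertools import product


def hebbian_weights(pattern):
    """Weights of a toy value system: w[i][j] = p[i] * p[j], zero diagonal."""
    n = len(pattern)
    return [[0.0 if i == j else float(pattern[i] * pattern[j])
             for j in range(n)] for i in range(n)]


def average_weights(systems):
    """Element-wise mean of several weight matrices (the 'averaging' step)."""
    n = len(systems[0])
    k = len(systems)
    return [[sum(w[i][j] for w in systems) / k for j in range(n)]
            for i in range(n)]


def internal_conflict(weights):
    """Assumed measure: minimum, over all +/-1 states, of the fraction of
    weight magnitude on pairs whose weight disagrees with the state."""
    n = len(weights)
    pairs = [(i, j) for i in range(n) for j in range(i + 1, n)]
    total = sum(abs(weights[i][j]) for i, j in pairs)
    if total == 0:
        return 0.0
    best = min(
        sum(abs(weights[i][j]) for i, j in pairs
            if weights[i][j] * s[i] * s[j] < 0)
        for s in product((-1, 1), repeat=n)  # brute force; small n only
    )
    return best / total


# Three hypothetical preference patterns over four binary issues.
patterns = [(1, 1, 1, 1), (1, 1, -1, -1), (1, -1, 1, -1)]
systems = [hebbian_weights(p) for p in patterns]

individual = [internal_conflict(w) for w in systems]
averaged = internal_conflict(average_weights(systems))

print("conflict of each original system:", individual)
print("conflict of the averaged system: ", averaged)
```

In this toy case each original system can be fully satisfied by its own pattern, while the averaged weights cannot all be satisfied by any single state. This only shows the direction of the effect for one hand-picked example; it does not establish the general claim.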

A point I may not have made in these posts, but made in comments, is that the majority of humans today think that women should not have full rights, homosexuals should be killed or at least severely persecuted, and nerds should be given wedgies. These are not incompletely-extrapolated values that will change with more information; they are values. Opponents of gay marriage make it clear that they do not object to gay marriage based on a long-range utilitarian calculation; they directly value not allowing gays to marry. Many human values horrify most people on this list, so they shouldn't be trying to preserve them.

431 Comments

DanArmak 14y

If the majority and the minority are so fundamentally different that their killing each other is not forbidden by the universal human CEV, then no.

I don't understand what you mean by "fundamentally different". You said the AI would not do anything not backed by an all-human consensus. If a majority of humanity wishes to kill a minority, obviously there won't be a consensus to stop the killing, and AI will not interfere. I prefer to live in a universe whose living AI does interfere in such a case.

On what moral grounds would it do the prevention?

Libertarianism is one moral principle that would argue for prevention. So would most varieties of utilitarianism (ignoring utility monsters and such). Again, I would prefer living with an AI hard-coded to one of those moral ideologies (though it's not ideal) over your view of CEV.

Until everybody agrees that this new AGI is not good after all. Then the original AGI will interfere and dismantle the new one (the original is still the first and the strongest).

Forever keeping this capability in reserve is most of what being a singleton means. But think of the practical implications: it has to be omnipresent, omniscient, and prevent other AIs from ever being as powerful as it is - which restricts those other AIs' abilities in many endeavors. All the while it does little good itself. So from my point of view, the main effect of successfully implementing your view of CEV may be to drastically limit the opportunities for future AIs to do good.

And yet it doesn't limit the opportunity to do evil, at least evil of the mundane death & torture kind. Unless you can explain why it would prevent even a very straightforward case like 80% of humanity voting to kill the other 20%.

But I can be sure that CEV fixes values that are based on false factual beliefs - this is a part of the definition of CEV.

But you said it would only do things that are approved by a strong human consensus. And I assure you that, to take an example, the large majority of the world's population who today believe in the supernatural will not consent to having that belief "fixed". Nor have you demonstrated that their extrapolated volition would want for them to be forcibly modified. Maybe their extrapolated volition simply doesn't value objective truth highly (because they today don't believe in the concept of objective truth, or believe that it contradicts everyday experience).

I have no idea what your FAI will do. But you can be sure that it is something about which you (and everybody) would agree, either directly or if you were more intelligent and knew more.

Yes, but I don't know what I would approve of if I were "more intelligent" (a very ill-defined term). And if you calculate that something, according to your definition of intelligence, and present me with the result, I might well reject that result even if I believe in your extrapolation process. I might well say: the future isn't predetermined. You can't calculate what I necessarily will become. You just extrapolated a creature I might become, which also happens to be more intelligent. But there's nothing in my moral system that says I should adopt the values of someone else because they are more intelligent. If I don't like the values I might say, thank you for warning me, now I shall be doubly careful not to evolve into that kind of creature! I might even choose to forgo the kind of increased intelligence that causes such an undesired change in my values.

Short version: "what I would want if I were more intelligent (according to some definition)" isn't the same as "what I will likely want in the future", because there's no reason for me to grow in intelligence (by that definition) if I suspect it would twist my values. So you can't apply the heuristic of "if I know what I'm going to think tomorrow, I might as well think it today".

~Endorses(A1, CEV) ~Endorses(A2, CEV) Endorses(A1, CEV) Endorses(A2, CEV)

I think you may be missing a symbol there? If not, I can't parse it... Can you spell out for me what it means to just write the last three Endorses(...) clauses one after the other?

Does it look like the most rational behavior?

It may be quite rational for everyone individually, depending on projected payoffs. Unlike a PD, starting positions aren't symmetrical and players' progress/payoffs are not visible to other players. So saying "just cooperate" doesn't immediately apply.

Wouldn't it be better for everybody to cooperate in this Prisoner's Dilemma, and do it with a credible precommitment?

How can a state or military precommit to not having a supersecret project to develop a private AGI?

And while it's beneficial for some players to join in a cooperative effort, it may well be that a situation of several competing leagues (or really big players working alone) develops and is also stable. It's all laid over the background of existing political, religious and personal enmities and rivalries - even before we come to actual disagreements over what the AI should value.

wedrifid 14y

If a majority of humanity wishes to kill a minority, obviously there won't be a consensus to stop the killing, and AI will not interfere.

This assumes that CEV uses something along the lines of a simulated vote as an aggregation mechanism. Currently the method of aggregation is undefined so we can't say this with confidence - certainly not as something obvious.

gRR 14y

The majority may wish to kill the minority for wrong reasons - based on false beliefs or insufficient intelligence. In which case their CEV-s won't endorse it, and the FAI will interfere. "Fundamentally different" means their killing each other is endorsed by someone's CEV, not just by themselves. [...] Strong consensus of their CEV-s. [...] Extrapolated volition is based on objective truth, by definition. [...] The process of extrapolation takes this into account. [...] Sorry, bad formatting. I meant four independent clauses: each of the agents does not endorse CEV, but endorses CEV. [...] That's a separate problem. I think it is easier to solve than extrapolating volition or building AI.
