Holden's Objection 1: Friendliness is dangerous

PhilGoetz

13 Holden's Objection 1: Friendliness is dangerous

18th May 2012

7 min read

13

Nick_Beckstead asked me to link to posts I referred to in this comment. I should put up or shut up, so here's an attempt to give an organized overview of them.

Since I wrote these, LukeProg has begun tackling some related issues. He has accomplished the seemingly-impossible task of writing many long, substantive posts none of which I recall disagreeing with. And I have, irrationally, not read most of his posts. So he may have dealt with more of these same issues.

I think that I only raised Holden's "objection 2" in comments, which I couldn't easily dig up; and in a critique of a book chapter, which I emailed to LukeProg and did not post to LessWrong. So I'm only going to talk about "Objection 1: It seems to me that any AGI that was set to maximize a "Friendly" utility function would be extraordinarily dangerous." I've arranged my previous posts and comments on this point into categories. (Much of what I've said on the topic has been in comments on LessWrong and Overcoming Bias, and in email lists including SL4, and isn't here.)

The concept of "human values" cannot be defined in the way that FAI presupposes

Human errors, human values: Suppose all humans shared an identical set of values, preferences, and biases. We cannot retain human values without retaining human errors, because there is no principled distinction between them.

A comment on this post: There are at least three distinct levels of human values: The values an evolutionary agent holds that maximize their reproductive fitness, the values a society holds that maximizes its fitness, and the values a rational optimizer holds who has chosen to maximize social utility. They often conflict. Which of them are the real human values?

Values vs. parameters: Eliezer has suggested using human values, but without time discounting (= changing the time-discounting parameter). CEV presupposes that we can abstract human values and apply them in a different situation that has different parameters. But the parameters are values. There is no distinction between parameters and values.

A comment on "Incremental progress and the valley": The "values" that our brains try to maximize in the short run are designed to maximize different values for our bodies in the long run. Which are human values: The motivations we feel, or the effects they have in the long term? LukeProg's post Do Humans Want Things? makes a related point.

Group selection update: The reason I harp on group selection, besides my outrage at the way it's been treated for the past 50 years, is that group selection implies that some human values evolved at the group level, not at the level of the individual. This means that increasing the rationality of individuals may enable people to act more effectively in their own interests, rather than in the group's interest, and thus diminish the degree to which humans embody human values. Identifying the values embodied in individual humans - supposing we could do so - would still not arrive at human values. Transferring human values to a post-human world, which might contain groups at many different levels of a hierarchy, would be problematic.

I wanted to write about my opinion that human values can't be divided into final values and instrumental values, the way discussion of FAI presumes they can. This is an idea that comes from mathematics, symbolic logic, and classical AI. A symbolic approach would probably make proving safety easier. But human brains don't work that way. You can and do change your values over time, because you don't really have terminal values.

Strictly speaking, it is impossible for an agent whose goals are all indexical goals describing states involving itself to have preferences about a situation in which it does not exist. Those of you who are operating under the assumption that we are maximizing a utility function with evolved terminal goals, should I think admit these terminal goals all involve either ourselves, or our genes. If they involve ourselves, then utility functions based on these goals cannot even be computed once we die. If they involve our genes, they they are goals that our bodies are pursuing, that we call errors, not goals, when we the conscious agent inside our bodies evaluate them. In either case, there is no logical reason for us to wish to maximize some utility function based on these after our own deaths. Any action I wish to take regarding the distant future necessarily presupposes that the entire SIAI approach to goals is wrong.

My view, under which it does make sense for me to say I have preferences about the distant future, is that my mind has learned "values" that are not symbols, but analog numbers distributed among neurons. As described in "Only humans can have human values", these values do not exist in a hierarchy with some at the bottom and some on the top, but in a recurrent network which does not have a top or a bottom, because the different parts of the network developed simultaneously. These values therefore can't be categorized into instrumental or terminal. They can include very abstract values that don't need to refer specifically to me, because other values elsewhere in the network do refer to me, and this will ensure that actions I finally execute incorporating those values are also influenced by my other values that do talk about me.

Even if human values existed, it would be pointless to preserve them

Only humans can have human values:

The only preferences that can be unambiguously determined are the preferences a person (mind+body) implements, which are not always the preferences expressed by their beliefs.
If you extract a set of consciously-believed propositions from an existing agent, then build a new agent to use those propositions in a different environment, with an "improved" logic, you can't claim that it has the same values, since it will behave differently.
Values exist in a network of other values. A key ethical question is to what degree values are referential (meaning they can be tested against something outside that network); or non-referential (and hence relative).
Supposing that values are referential helps only by telling you to ignore human values.
You cannot resolve the problem by combining information from different behaviors, because the needed information is missing.
Today's ethical disagreements are largely the result of attempting to extrapolate ancestral human values into a changing world.
The future will thus be ethically contentious even if we accurately characterize and agree on present human values, because these values will fail to address the new important problems.

Human values differ as much as values can differ: There are two fundamentally different categories of values:

Non-positional, mutually-satisfiable values (physical luxury, for instance)
Positional, zero-sum social values, such as wanting to be the alpha male or the homecoming queen

All mutually-satisfiable values have more in common with each other than they do with any non-mutually-satisfiable values, because mutually-satisfiable values are compatible with social harmony and non-problematic utility maximization, while non- mutually-satisfiable values require eternal conflict. If you find an alien life form from a distant galaxy with non-positional values, it would be easier to integrate those values into a human culture with only human non-positional values, than to integrate already-existing positional human values into that culture.

It appears that some humans have mainly the one type, while other humans have mainly the other type. So talking about trying to preserve human values is pointless - the values held by different humans have already passed the most-important point of divergence.

Enforcing human values would be harmful

The human problem: This argues that the qualia and values we have now are only the beginning of those that could evolve in the universe, and that ensuring that we maximize human values - or any existing value set - from now on, will stop this process in its tracks, and prevent anything better from ever evolving. This is the most-important objection of all.

Re-reading this, I see that the critical paragraph is painfully obscure, as if written by Kant; but it summarizes the argument: "Once the initial symbol set has been chosen, the semantics must be set in stone for the judging function to be "safe" for preserving value; this means that any new symbols must be defined completely in terms of already-existing symbols. Because fine-grained sensory information has been lost, new developments in consciousness might not be detectable in the symbolic representation after the abstraction process. If they are detectable via statistical correlations between existing concepts, they will be difficult to reify parsimoniously as a composite of existing symbols. Not using a theory of phenomenology means that no effort is being made to look for such new developments, making their detection and reification even more unlikely. And an evaluation based on already-developed values and qualia means that even if they could be found, new ones would not improve the score. Competition for high scores on the existing function, plus lack of selection for components orthogonal to that function, will ensure that no such new developments last."

Averaging value systems is worse than choosing one: This describes a neural-network that encodes preferences, and takes some input pattern and computes a new pattern that optimizes these preferences. Such a system is taken as analogous for a value system and an ethical system to attain those values. I then define a measure for the internal conflict produced by a set of values, and show that a system built by averaging together the parameters from many different systems will have higher internal conflict than any of the systems that were averaged together to produce it. The point is that the CEV plan of "averaging together" human values will result in a set of values that is worse (more self-contradictory) than any of the value systems it was derived from.

A point I may not have made in these posts, but made in comments, is that the majority of humans today think that women should not have full rights, homosexuals should be killed or at least severely persecuted, and nerds should be given wedgies. These are not incompletely-extrapolated values that will change with more information; they are values. Opponents of gay marriage make it clear that they do not object to gay marriage based on a long-range utilitarian calculation; they directly value not allowing gays to marry. Many human values horrify most people on this list, so they shouldn't be trying to preserve them.

Group Selection

Personal Blog

13

New Comment

Rendering 0/431 comments, sorted by

top scoring

(show more) Click to highlight new comments since: Today at 2:05 AM

Some comments are truncated due to high volume. (⌘F to expand all)Change truncation settings

Moderation Log

13 Holden's Objection 1: Friendliness is dangerous

by PhilGoetz

18th May 2012

7 min read

431

13

The concept of "human values" cannot be defined in the way that FAI presupposes

Even if human values existed, it would be pointless to preserve them

Only humans can have human values:

The only preferences that can be unambiguously determined are the preferences a person (mind+body) implements, which are not always the preferences expressed by their beliefs.
If you extract a set of consciously-believed propositions from an existing agent, then build a new agent to use those propositions in a different environment, with an "improved" logic, you can't claim that it has the same values, since it will behave differently.
Values exist in a network of other values. A key ethical question is to what degree values are referential (meaning they can be tested against something outside that network); or non-referential (and hence relative).
Supposing that values are referential helps only by telling you to ignore human values.
You cannot resolve the problem by combining information from different behaviors, because the needed information is missing.
Today's ethical disagreements are largely the result of attempting to extrapolate ancestral human values into a changing world.
The future will thus be ethically contentious even if we accurately characterize and agree on present human values, because these values will fail to address the new important problems.

Human values differ as much as values can differ: There are two fundamentally different categories of values:

Non-positional, mutually-satisfiable values (physical luxury, for instance)
Positional, zero-sum social values, such as wanting to be the alpha male or the homecoming queen

Enforcing human values would be harmful

Group Selection

Personal Blog

13

New Comment

Rendering 0/431 comments, sorted by

top scoring

(show more) Click to highlight new comments since: Today at 2:05 AM

Some comments are truncated due to high volume. (⌘F to expand all)Change truncation settings

Moderation Log

More from PhilGoetz

Curated and popular this week

431Comments

431

Comment Permalink

DanArmak14y00

These risks exist. However, I think it is very likely in our case that there will be strong consensus for values that reduce the problem a bit. Non-interference, for one, is much less controversial than transhumanism, but would allow transhumanism for those who accept it.

Why do you think this is "very likely"?

Today there are many people in the world (gross estimate: tens of percents of world population) who don't believe in noninterference. True believers of several major faiths (most Christian sects, mainstream Islam) desire enforced religious conversion of others, either as a commandment of their faith (for its own sake) or for the metaphysical benefit of those others (to save them from hell). Many people "believe" (if that is the right word) in the subjugation of certain minorities, or of women, children, etc. which involves interference of various kinds. Many people experience future shock which prompts them to want laws that would stop others from self-modifying in certain ways (some including transhumanism).

Why do you think it very likely these people's CEV will contradict their current values and beliefs? Please consider that:

We emphatically don't know the outcome of CEV. If we were sure that it would have any property X, we could hardcode X into the algorithm and make the CEV's task that much easier. Anything you think is very likely for CEV to decide, you should be proportionally willing for me to hardcode into my algorithm, constraining the possible results of CEV.
In these examples, you expect other people's extrapolated values to come to match your actual values. This seems on the outside view like a human bias. Do you expect an equal amount of your important, present-day values to be contradicted and disallowed by humanity's CEV? Can you think of probable examples?

This is one reason why I think doing the CEV of just the AI team or whoever is the best approach. We have strong reason to suspect that the eventual result will respect everyone, and bootstrapping from a small group (or even just one person) seems much more reliable and safer.

I agree completely - doing the CEV of a small trusted team, who moreover are likely to hold non-extrapolated views similar to ours (e.g. they won't be radical Islamists), would be much better than CEV; much more reliable and safe.

But you contradict yourself a little. If you really believed CEV looked a lot like CEV, you would have no reason to consider it safer. If you (correctly) think it's safer, that must be because you fear CEV will contain some pretty repugnant conclusions that CEV won't.

From this I understand that while you think CEV would have a term for "respecting" the rest of humanity, that respect would be a lot weaker than the equal (and possibly majority-voting-based) rights granted them by CEV.

I think that statement is too strong. Keep in mind that it's extrapolated volition. I doubt the islamists' values are reflectively consistent. Weaken it to the possibility of there being multiple attractors in EV-space, some of which are bad, and I agree. Infectious memeplexes that can survive CEV scare the crap out of me.

I doubt any one human's values are reflectively consistent. At the very least, every human's values contradict one another in the sense that they compete among themselves for the human's resources, and the human in different moods and at different points in time prefers to spend on different values.

Because infectious memeplexes scare me too, I don't want anyone to build CEV (or rather, to run a singleton AI that would implement it) - I would much prefer CEV or better CEV or better yet, a non-CEV process whcih more directly relies on my and other people's non-extrapolated preferences.

[anonymous]14y10

Why do you think this is "very likely"?

It just seems likely, based on my understanding of what people like and approve of.

Scrict non-interference is unlikely to end up in CEV, because there are many cases where interventions are the right thing to do. I just meant it as a proof that there are less controversial principles that will block a lot of bullshit. Not as a speculation of something that will actually end up in CEV.

religion, bigots, conservatism.
Why do you think it very likely these people's CEV will contradict their current values an

... (read more)

4TheOtherDave14y

A possibly related question: suppose you were about to go off on an expedition in a spaceship that would take you away from Earth for thirty years, and the ship is being stocked with food. Suppose further that, because of an insane bureaucratic process, you have only two choices: either (a) you get to choose what food to stock right now, with no time for nutritional research, or (b) food is stocked according to an expert analysis of your body's nutritional needs, with no input from you. What outcome would you anticipate from each of those choices? Suppose a hundred arbitrarily selected people were also being sent on similar missions on similar spaceships, and your decision of A or B applied to them as well (either they get to choose their food, or an expert chooses food for them). What outcome would you anticipate from each choice?

See in context