An overall schema for the friendly AI problems: self-referential convergence criteria

17 Stuart_Armstrong 13 July 2015 03:34PM

A putative new idea for AI control; index here.

After working for some time on the Friendly AI problem, it's occurred to me that a lot of the issues seem related. Specifically, all the following seem to have commonalities:

Speaking very broadly, there are two features they all share:

  1. The convergence criteria are self-referential.
  2. Errors in the setup are likely to cause false convergence.

What do I mean by that? Well, imagine you're trying to reach reflective equilibrium in your morality. You do this by using good meta-ethical rules, zooming up and down at various moral levels, making decisions on how to resolve inconsistencies, etc... But how do you know when to stop? Well, you stop when your morality is perfectly self-consistent, when you no longer have any urge to change your moral or meta-moral setup. In other words, the stopping point (and the convergence to the stopping point) is entirely self-referentially defined: the morality judges itself. It does not include any other moral considerations. You input your initial moral intuitions and values, and you hope this will cause the end result to be "nice", but the definition of the end result does not include your initial moral intuitions (note that some moral realists could see this process dependence as a positive - except for the fact that these processes have many convergent states, not just one or a small grouping).
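To make that structure concrete, here is a minimal toy sketch (my own illustration, with invented names like `revise` and `coupling`, not Armstrong's formalism): a value-revision loop whose stopping rule tests only the current state against itself, so the initial values appear nowhere in the convergence test.

```python
# Toy sketch of a self-referentially defined stopping point (my illustration).
# "Values" are just a vector; revise() nudges them toward internal consistency;
# we stop when a revision step no longer changes anything.

import numpy as np

def revise(values, coupling):
    """One revision step: pull each value toward the weighted average of
    the values it is coupled to, resolving inconsistencies locally."""
    return 0.5 * values + 0.5 * coupling @ values

def reflective_equilibrium(initial_values, coupling, tol=1e-9, max_steps=100_000):
    values = initial_values.copy()
    for _ in range(max_steps):
        new_values = revise(values, coupling)
        # Self-referential stopping criterion: the state is judged only
        # against itself.  `initial_values` appears nowhere in this test.
        if np.linalg.norm(new_values - values) < tol:
            return new_values
        values = new_values
    return values

# Example: three coupled values; the coupling matrix averages the other two.
coupling = np.array([[0.0, 0.5, 0.5],
                     [0.5, 0.0, 0.5],
                     [0.5, 0.5, 0.0]])
print(reflective_equilibrium(np.array([1.0, 0.0, -1.0]), coupling))
```

Whatever this converges to, the test that declares "done" knows nothing about where the process started; that is the worry.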

So when the process goes nasty, you're pretty sure to have achieved something self-referentially stable, but not nice. Similarly, a nasty CEV will be coherent and have no desire to further extrapolate... but that's all we know about it.

The second feature is that any process has errors - computing errors, conceptual errors, errors due to the weakness of human brains, etc... If you visualise these as noise, you can see that noise in a convergent process is more likely to cause premature convergence, because if the process ever reaches a stable self-referential state, it will stay there (and if the process is a long one, early noise will cause great divergence at the end). For instance, imagine you have to reconcile your belief in preserving human cultures with your belief in individual human freedom. A complex balancing act. But if, at any point along the way, you simply jettison one of the two values completely, things become much easier - and once jettisoned, the missing value is unlikely to ever come back.
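Here is a toy simulation of that failure mode - entirely made-up dynamics of my own, not a model from the post: two values are balanced against each other under noise, and any step that knocks a value to zero is absorbing, so the process "converges" having silently lost one of them.

```python
# Toy sketch (hypothetical dynamics): balancing two values under noise,
# where a value knocked down to zero is jettisoned and never recovered.

import random

def balance(culture=1.0, freedom=1.0, steps=1000, noise=0.05):
    dropped = set()
    for _ in range(steps):
        target = (culture + freedom) / 2  # naive compromise between the two values
        if "culture" not in dropped:
            culture += 0.1 * (target - culture) + random.gauss(0, noise)
        if "freedom" not in dropped:
            freedom += 0.1 * (target - freedom) + random.gauss(0, noise)
        # Absorbing error state: once a value hits zero, it stays jettisoned.
        if culture <= 0:
            culture = 0.0
            dropped.add("culture")
        if freedom <= 0:
            freedom = 0.0
            dropped.add("freedom")
    return culture, freedom, dropped

# More noise makes it more likely that a value is lost before "convergence":
print(balance(noise=0.01))
print(balance(noise=0.5))
```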

Or, more simply, the system could get hacked. When exploring a potential future world, you could become so enamoured of it that you overwrite any objections you had. It seems very easy for humans to fall into these traps - and again, once you lose something of value in your system, you don't tend to get it back.

 

Solutions

And again, very broadly speaking, there are several classes of solutions to deal with these problems:

  1. Reduce or prevent errors in the extrapolation (eg solving the agent tiling problem).
  2. Solve all or most of the problem ahead of time (eg traditional FAI approach by specifying the correct values).
  3. Make sure you don't get too far from the starting point (eg reduced impact AI, tool AI, models as definitions).
  4. Figure out the properties of a nasty convergence, and try to avoid them (eg some of the ideas I mentioned in "crude measures", general precautions that are done when defining the convergence process).

 

Holden's Objection 1: Friendliness is dangerous

11 PhilGoetz 18 May 2012 12:48AM

Nick_Beckstead asked me to link to posts I referred to in this comment.  I should put up or shut up, so here's an attempt to give an organized overview of them.

Since I wrote these, LukeProg has begun tackling some related issues.  He has accomplished the seemingly-impossible task of writing many long, substantive posts, none of which I recall disagreeing with.  And I have, irrationally, not read most of his posts.  So he may have dealt with more of these same issues.

I think that I only raised Holden's "objection 2" in comments, which I couldn't easily dig up; and in a critique of a book chapter, which I emailed to LukeProg and did not post to LessWrong.  So I'm only going to talk about "Objection 1:  It seems to me that any AGI that was set to maximize a "Friendly" utility function would be extraordinarily dangerous."  I've arranged my previous posts and comments on this point into categories.  (Much of what I've said on the topic has been in comments on LessWrong and Overcoming Bias, and in email lists including SL4, and isn't here.)

 

The concept of "human values" cannot be defined in the way that FAI presupposes

Human errors, human values:  Suppose all humans shared an identical set of values, preferences, and biases.  We cannot retain human values without retaining human errors, because there is no principled distinction between them.

A comment on this post:  There are at least three distinct levels of human values:  the values an evolutionary agent holds that maximize its reproductive fitness, the values a society holds that maximize its fitness, and the values held by a rational optimizer who has chosen to maximize social utility.  They often conflict.  Which of them are the real human values?

Values vs. parameters:  Eliezer has suggested using human values, but without time discounting (= changing the time-discounting parameter).  CEV presupposes that we can abstract human values and apply them in a different situation that has different parameters.  But the parameters are values.  There is no distinction between parameters and values.

A comment on "Incremental progress and the valley":  The "values" that our brains try to maximize in the short run are designed to maximize different values for our bodies in the long run.  Which are human values:  The motivations we feel, or the effects they have in the long term?  LukeProg's post Do Humans Want Things? makes a related point.

Group selection update:  The reason I harp on group selection, besides my outrage at the way it's been treated for the past 50 years, is that group selection implies that some human values evolved at the group level, not at the level of the individual.  This means that increasing the rationality of individuals may enable people to act more effectively in their own interests, rather than in the group's interest, and thus diminish the degree to which humans embody human values.  Identifying the values embodied in individual humans - supposing we could do so - would still not arrive at human values.  Transferring human values to a post-human world, which might contain groups at many different levels of a hierarchy, would be problematic.

I wanted to write about my opinion that human values can't be divided into final values and instrumental values, the way discussion of FAI presumes they can.  This is an idea that comes from mathematics, symbolic logic, and classical AI.  A symbolic approach would probably make proving safety easier.  But human brains don't work that way.  You can and do change your values over time, because you don't really have terminal values.

Strictly speaking, it is impossible for an agent whose goals are all indexical goals describing states involving itself to have preferences about a situation in which it does not exist.  Those of you who are operating under the assumption that we are maximizing a utility function with evolved terminal goals should, I think, admit that these terminal goals all involve either ourselves or our genes.  If they involve ourselves, then utility functions based on these goals cannot even be computed once we die.  If they involve our genes, then they are goals that our bodies are pursuing - goals that we, the conscious agents inside our bodies, call errors rather than goals when we evaluate them.  In either case, there is no logical reason for us to wish to maximize some utility function based on these after our own deaths.  Any action I wish to take regarding the distant future necessarily presupposes that the entire SIAI approach to goals is wrong.

My view, under which it does make sense for me to say I have preferences about the distant future, is that my mind has learned "values" that are not symbols, but analog numbers distributed among neurons.  As described in "Only humans can have human values", these values do not exist in a hierarchy with some at the bottom and some on the top, but in a recurrent network which does not have a top or a bottom, because the different parts of the network developed simultaneously.  These values therefore can't be categorized into instrumental or terminal.  They can include very abstract values that don't need to refer specifically to me, because other values elsewhere in the network do refer to me, and this will ensure that actions I finally execute incorporating those values are also influenced by my other values that do talk about me.

Even if human values existed, it would be pointless to preserve them

Only humans can have human values:

  • The only preferences that can be unambiguously determined are the preferences a person (mind+body) implements, which are not always the preferences expressed by their beliefs.
  • If you extract a set of consciously-believed propositions from an existing agent, then build a new agent to use those propositions in a different environment, with an "improved" logic, you can't claim that it has the same values, since it will behave differently.
  • Values exist in a network of other values.  A key ethical question is to what degree values are referential (meaning they can be tested against something outside that network); or non-referential (and hence relative).
  • Supposing that values are referential helps only by telling you to ignore human values.
  • You cannot resolve the problem by combining information from different behaviors, because the needed information is missing.
  • Today's ethical disagreements are largely the result of attempting to extrapolate ancestral human values into a changing world.
  • The future will thus be ethically contentious even if we accurately characterize and agree on present human values, because these values will fail to address the new important problems.


Human values differ as much as values can differ:  There are two fundamentally different categories of values:

  • Non-positional, mutually-satisfiable values (physical luxury, for instance)
  • Positional, zero-sum social values, such as wanting to be the alpha male or the homecoming queen

All mutually-satisfiable values have more in common with each other than they do with any non-mutually-satisfiable values, because mutually-satisfiable values are compatible with social harmony and non-problematic utility maximization, while non-mutually-satisfiable values require eternal conflict.  If you find an alien life form from a distant galaxy with non-positional values, it would be easier to integrate those values into a human culture with only human non-positional values, than to integrate already-existing positional human values into that culture.

It appears that some humans have mainly the one type, while other humans have mainly the other type.  So talking about trying to preserve human values is pointless - the values held by different humans have already passed the most-important point of divergence.

 

Enforcing human values would be harmful

The human problem:  This argues that the qualia and values we have now are only the beginning of those that could evolve in the universe, and that ensuring that we maximize human values - or any existing value set - from now on, will stop this process in its tracks, and prevent anything better from ever evolving.  This is the most-important objection of all.

Re-reading this, I see that the critical paragraph is painfully obscure, as if written by Kant; but it summarizes the argument: "Once the initial symbol set has been chosen, the semantics must be set in stone for the judging function to be "safe" for preserving value; this means that any new symbols must be defined completely in terms of already-existing symbols.  Because fine-grained sensory information has been lost, new developments in consciousness might not be detectable in the symbolic representation after the abstraction process.  If they are detectable via statistical correlations between existing concepts, they will be difficult to reify parsimoniously as a composite of existing symbols.  Not using a theory of phenomenology means that no effort is being made to look for such new developments, making their detection and reification even more unlikely.  And an evaluation based on already-developed values and qualia means that even if they could be found, new ones would not improve the score.  Competition for high scores on the existing function, plus lack of selection for components orthogonal to that function, will ensure that no such new developments last."

Averaging value systems is worse than choosing one:  This describes a neural network that encodes preferences, takes some input pattern, and computes a new pattern that optimizes those preferences.  Such a system is taken as an analogue of a value system together with an ethical system for attaining those values.  I then define a measure of the internal conflict produced by a set of values, and show that a system built by averaging together the parameters from many different systems will have higher internal conflict than any of the systems that were averaged together to produce it.  The point is that the CEV plan of "averaging together" human values will result in a set of values that is worse (more self-contradictory) than any of the value systems it was derived from.
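The post's argument uses a specific neural-network model and conflict measure; as a much simpler stand-in, here is a toy sketch of the same qualitative phenomenon.  Each value system is represented by pairwise preference strengths, "internal conflict" is counted as the number of cyclic (intransitive) triples, and averaging three individually consistent systems produces a system containing a cycle.  Both the representation and the conflict measure are my own simplifications, not the ones in the original post.

```python
# Toy illustration: averaging consistent preference systems yields a
# conflicted (cyclic) one.  M[i][j] > 0 means option i is preferred to j.

import numpy as np
from itertools import permutations

def margin_matrix(ranking, n=3):
    """Pairwise margins implied by a strict ranking, e.g. [0, 1, 2] = 0 > 1 > 2."""
    M = np.zeros((n, n))
    for pos, a in enumerate(ranking):
        for b in ranking[pos + 1:]:
            M[a, b], M[b, a] = 1.0, -1.0
    return M

def conflict(M):
    """Count cyclic triples i>j, j>k, k>i - zero for any consistent ranking."""
    n = len(M)
    return sum(1 for i, j, k in permutations(range(n), 3)
               if M[i, j] > 0 and M[j, k] > 0 and M[k, i] > 0)

A = margin_matrix([0, 1, 2])   # values 0 > 1 > 2
B = margin_matrix([1, 2, 0])   # values 1 > 2 > 0
C = margin_matrix([2, 0, 1])   # values 2 > 0 > 1

avg = (A + B + C) / 3
print(conflict(A), conflict(B), conflict(C))  # each system is internally consistent: 0 0 0
print(conflict(avg))                          # the averaged system contains a preference cycle
```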


A point I may not have made in these posts, but made in comments, is that the majority of humans today think that women should not have full rights, homosexuals should be killed or at least severely persecuted, and nerds should be given wedgies.  These are not incompletely-extrapolated values that will change with more information; they are values.  Opponents of gay marriage make it clear that they do not object to gay marriage based on a long-range utilitarian calculation; they directly value not allowing gays to marry.  Many human values horrify most people on this list, so they shouldn't be trying to preserve them.

Objections to Coherent Extrapolated Volition

11 XiXiDu 22 November 2011 10:32AM

In poetic terms, our coherent extrapolated volition is our wish if we knew more, thought faster, were more the people we wished we were, had grown up farther together; where the extrapolation converges rather than diverges, where our wishes cohere rather than interfere; extrapolated as we wish that extrapolated, interpreted as we wish that interpreted.

— Eliezer Yudkowsky, May 2004, Coherent Extrapolated Volition

Foragers versus industry era folks

Consider the difference between a hunter-gatherer, who cares about his hunting success and about becoming the new tribal chief, and a modern computer scientist who wants to determine whether a “sufficiently large randomized Conway board could turn out to converge to a barren ‘all off’ state.”

The utility of success in hunting down animals or in proving abstract conjectures about cellular automata is largely determined by factors such as your education, culture and environmental circumstances. The same forager who cared about killing a lot of animals, to get the best ladies in his clan, might under different circumstances have turned out to be a vegetarian mathematician caring solely about his understanding of the nature of reality. The two sets of values are to some extent mutually exclusive, or at least disjoint. Yet both sets of values are what the person wants, given the circumstances. Change the circumstances dramatically and you change the person’s values.

What do you really want?

You might conclude that what the hunter-gatherer really wants is to solve abstract mathematical problems; he just doesn’t know it. But there is no set of values that a person “really” wants. Humans are largely defined by the circumstances they reside in.

  • If you already knew a movie, you wouldn’t watch it.
  • To be able to get your meat from the supermarket changes the value of hunting.

If “we knew more, thought faster, were more the people we wished we were, and had grown up closer together”, then we would stop desiring what we had learnt, wish to think faster still, become yet more different people, and grow bored of, and apart from, the people similar to us.

A singleton is an attractor

A singleton will inevitably change everything by causing a feedback loop between itself as an attractor and humans and their values.

Many of our values and goals - what we want - are culturally induced or the result of our ignorance. Reduce our ignorance and you change our values. One trivial example is our intellectual curiosity: if we no longer need to figure out what we want on our own, our curiosity is impaired.

A singleton won’t extrapolate human volition but will instead implement an artificial set of values, arrived at by abstract, higher-order contemplation about rational conduct.

With knowledge comes responsibility, with wisdom comes sorrow

Knowledge changes and introduces terminal goals. The toolkit called ‘rationality’ - the rules and heuristics developed to help us achieve our terminal goals - also alters and deletes them. A stone age hunter-gatherer seems to possess very different values than we do. Learning about rationality and various ethical theories such as Utilitarianism would alter those values considerably.

Rationality was meant to help us achieve our goals, e.g. become a better hunter. Rationality was designed to tell us what we ought to do (instrumental goals) to achieve what we want to do (terminal goals). Yet what actually happens is that we are told - that we learn - what we ought to want.

If an agent becomes more knowledgeable and smarter, its goal and reward system will not stay intact unless it is specifically designed to be stable. An agent who originally wanted to become a better hunter and feed his tribe would end up wanting to eliminate poverty in Obscureistan. The question is: how much of this new “wanting” is the result of using rationality to achieve terminal goals, and how much is a side effect of using rationality? How much is left of the original values, versus the values induced by a feedback loop between the toolkit and its user?

Take for example an agent facing the Prisoner’s dilemma. Such an agent might originally tend to cooperate, and only after learning about game theory decide to defect to gain a greater payoff. Was it rational for the agent to learn about game theory, in the sense that it helped the agent achieve its goal - or in the sense that it deleted one of its goals in exchange for an allegedly more “valuable” goal?
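For concreteness, the standard payoff structure being appealed to here looks like this (illustrative numbers of my own, not taken from the post); once the agent knows the matrix, defection dominates whatever the other player does.

```python
# A standard one-shot Prisoner's Dilemma payoff matrix (illustrative numbers).
# Whatever the other player does, defecting yields a higher payoff.

PAYOFF = {  # (my_move, their_move) -> my payoff
    ("cooperate", "cooperate"): 3,
    ("cooperate", "defect"):    0,
    ("defect",    "cooperate"): 5,
    ("defect",    "defect"):    1,
}

def best_response(their_move):
    return max(("cooperate", "defect"), key=lambda m: PAYOFF[(m, their_move)])

print(best_response("cooperate"))  # defect
print(best_response("defect"))     # defect -> defection dominates cooperation
```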

Beware rationality as a purpose in and of itself

It seems to me that becoming more knowledgeable and smarter is gradually altering our utility functions. But what is it that we are approaching if the extrapolation of our volition becomes a purpose in and of itself? Extrapolating our coherent volition will distort or alter what we really value by installing a new cognitive toolkit designed to achieve an equilibrium between us and other agents with the same toolkit.

Would a singleton be a tool that we can use to get what we want, or would the tool use us to do what it does? Would we be modeled, or would it create models? Would we be extrapolating our volition, or rather following our extrapolations?

(This post is a write-up of a previous comment, posted here to get feedback from a larger audience.)

Friendlier AI through politics

1 Jonathan_Graehl 16 August 2009 09:29PM

David Brin suggests that some kind of political system populated with humans and diverse but imperfectly rational and friendly AIs would evolve in a satisfactory direction for humans.

I don't know whether creating an imperfectly rational general AI is any easier, except that limited perceptual and computational resources obviously imply less-than-optimal outcomes; still, why shouldn't we hope for the best outcome achievable given those constraints?  I imagine the question will become more settled before anyone nears unleashing a self-improving superhuman AI.

An imperfectly friendly AI, perfectly rational or not, is a very likely scenario.  Is it sufficient to create diverse singleton value-systems (demographically representative of humans' values) rather than a single monolithic Friendly AI built on a consensus over all humans' values?

What kind of competitive or political system would make fragmented, squabbling AIs safer than an attempt to get the monolithic approach right?  Brin seems to have some hope of improving politics regardless of AI participation, but I'm not sure exactly what his dream is or how to get there - perhaps his "disputation arenas" would work if the participants were rational and altruistically honest.