The lotus-eaters are examples of humans who have followed hedonism all the way through to its logical conclusion. In contrast, the "mindless outsourcers" are a possible consequence of the urge to efficiency: competitive pressures making uploads choose to destroy their own identity.

In my "Mahatma Armstrong" version of Eliezer's CEV, a somewhat altruistic entity ends up destroying all life, after a series of perfectly rational self-improvements. And in many examples where AIs are supposed to serve human preferences, these preferences are defined by a procedure (say, a question-answer process) that the AI can easily manipulate.

Stability and stopping properties

Almost everyone agrees that human values are under-determined (we haven't thought deeply and rigorously about every situation) and changeable by life experience. Therefore, it makes no sense to use "current human values" as a goal; this concept doesn't even exist in any rigorous sense.

So we need some way of extrapolating true human values. All the previous examples could be considered examples of extrapolation, and they all share the same problem: they are defined by their "stopping criteria" more than by their initial conditions.

For example, the lotus-eaters have reached a soporific hedonism they don't want to wake up from. There is no longer "anyone there" to change anything in the mindless outsourcers. CEV is explicitly assumed to be convergent: convergent to a point where the idealised entity no longer sees any need to change. The AI example is a bit different in flavour, but the "stopping criteria" are whatever the human chooses/is tricked into/is forced into saying. This means that the AI could be an optimisation process pushing the human to say whatever it wants them to.

Importantly, all these stopping criteria are local: they explicitly care only about the situation when the stopping criterion is reached, not about the journey there, nor about the initial conditions.

Processes can drift very far from their starting point, if they have local stopping criteria, even under very mild selection pressure. Consider the following game: each of two players is to name a number, n1 and n2, between 0 and 100. The player with the highest number gets that much in euros, and the one with the strictly lowest one gets that much plus two in euros. Each player starts at 100, and each in turn is allowed to adjust their numbers until they don't want to any more.

Then if both players are greedy and myopic, one player will start by dropping to 99, followed by the next player dropping theirs to 98, and so on, going back and forth between the players until one stands at 0 and the other at 1. Obviously if the numbers could be chosen from a larger range, there's no limit to the amount of loss that such a process could generate.
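To make the dynamics concrete, here is a minimal simulation of the unravelling (an illustration, not from the original argument). One assumption is needed to pin down the payoffs: both players are paid the lower of the two numbers, with the strictly lower player getting the two-euro bonus - the reading under which the back-and-forth above plays out.

```python
# Minimal sketch, under the assumption stated above (traveller's-dilemma-style payoffs).

def payoff(mine: int, theirs: int) -> int:
    base = min(mine, theirs)           # both are paid the lower number...
    return base + 2 if mine < theirs else base   # ...plus two for the strictly lower player

def myopic_adjustment(top: int = 100) -> list[tuple[int, int]]:
    """Players take turns moving to whatever number maximises their immediate payoff,
    holding the other player's number fixed; stop when neither wants to move."""
    numbers = [top, top]
    history = [tuple(numbers)]
    while True:
        moved = False
        for i in (0, 1):
            other = numbers[1 - i]
            best = max(range(top + 1), key=lambda n: payoff(n, other))
            if payoff(best, other) > payoff(numbers[i], other):
                numbers[i] = best
                history.append(tuple(numbers))
                moved = True
        if not moved:
            return history

trajectory = myopic_adjustment()
print(trajectory[:3], "...", trajectory[-1])
# [(100, 100), (99, 100), (99, 98)] ... (1, 0)
```

Each individual move is a strict improvement for the player making it, yet the process only stops once almost all the value has been competed away.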

Similarly, if our process of extrapolating human values has local stopping criteria, there's no limit to how bad they could end up being, or how "far away" in the space of values they could go.

This, by the way, explains my intuitive dislike for some types of moral realism. If there are true objective moral facts that humans can access, then whatever process counts as "accessing them" becomes... a local stopping condition for defining value. So I don't tend to focus on arguments about how correct or intuitive that process is; instead, I want to know where it ends up.

Barriers to total drift

So, how can we prevent local stopping conditions from shooting far across the landscape of possible values? This is actually a problem I've been working on for a long time; you can see this in my old paper "Chaining God: A qualitative approach to AI, trust and moral systems". I would not recommend reading that paper - it's hopelessly amateurish, anthropomorphising, and confused - but it shows one of the obvious solutions: tie values to their point of origin.

There seem to be roughly three interventions that could overcome the problem of local stopping criteria.

  • I. The first is to tie the process to the starting point, as above. Now, initial human values are not properly defined; nevertheless it seems possible to state that some values are further away from this undefined starting point than others (paperclippers are very far, money-maximisers quite far, situations where recognisably human beings do recognisably human stuff are much closer). Then the extrapolation process gets a penalty for wandering too far afield, so the stopping conditions are no longer purely local (a rough sketch follows this list).
  • II. If there is an agent-like piece in the extrapolation process, we can remove rigging (previously called bias) or influence, so that the agent can't manipulate the extrapolation process. This is a partial measure: it replaces a targeted extrapolation process with a random walk, which removes one major issue but doesn't solve the whole problem.
  • III. Finally, it is often suggested that constraints be added to the extrapolation process. For example, if the human values are determined by human feedback, then we can forbid the AI from coercing the human in any way, or restrict it to only using some methods (such as relaxed conversation). I am dubious about this kind of approach. It firstly assumes that concepts like "coercion" and "relaxed conversation" can be defined - but if that were the case, we'd be closer to solving the issue directly. And secondly, it assumes that restrictions that apply to humans also apply to AIs: we can't easily change the core values of fellow humans with conversation, but super-powered AIs may be able to do so.
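As a minimal sketch of intervention I (an illustration only, not the method used later): treat extrapolation as optimisation of some quality criterion minus a penalty for distance from the starting point. The names `candidate_values`, `quality` and `distance_from_origin` are hypothetical stand-ins.

```python
# Rough sketch of intervention I: penalise extrapolations that wander far from the origin.
def extrapolate(candidate_values, quality, distance_from_origin, penalty_weight=1.0):
    """Pick the candidate value system that scores best on the extrapolation criterion,
    minus a penalty proportional to its distance from the (imperfectly known) starting point."""
    return max(
        candidate_values,
        key=lambda v: quality(v) - penalty_weight * distance_from_origin(v),
    )
```

The design point is simply that the origin term keeps the stopping condition from being purely local: however the extrapolation wanders, the starting point never drops out of the objective.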

In my methods, I'll mostly be using interventions of types I and II.

Comments
> Similarly, if our process of extrapolating human values has local stopping criteria, there's no limit to how bad they could end up being, or how "far away" in the space of values they could go.

I feel like there's a distinct difference between "human values could end up arbitrarily distant from current ones" and "human values could end up arbitrarily bad".

In that, I feel that I have certain values that I care a lot about (for example, wanting there to be less intense suffering in the world) and which I wouldn't want to change; and also other values for which I don't care about how much they'd happen to drift.

If you think of my important values as points on dimensions x1-x10, and my non-important values as points on dimensions x11-x100, then, assuming that the value space is infinite, my values on dimensions x11-x100 could drift arbitrarily far away from their current positions. So the distance between my future and current values could end up being arbitrarily large, but if my important values remained in their current positions, this would still not be arbitrarily bad, since the values that I actually care about have not drifted.
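A minimal numerical version of this toy model (dimensions and numbers invented purely for illustration):

```python
import numpy as np

current = np.zeros(100)
future = np.zeros(100)
future[10:] = 1e6  # only the unimportant dimensions (x11-x100) drift, but they drift a lot

total_drift = np.linalg.norm(future - current)                 # arbitrarily large
important_drift = np.linalg.norm(future[:10] - current[:10])   # stays 0

print(total_drift, important_drift)  # huge distance overall, zero on the values that matter
```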

Obviously this toy model is flawed, since I don't think it actually makes sense to model values as being totally independent of each other, but maybe you get the intuition that I'm trying to point at anyway.

This would suggest that the problem is not that "values can get arbitrarily distant" but rather something like "the meta-values that make us care about some of our values having specific values, may get violated". (Of course, "values can get arbitrarily distant" can still be the problem if you have a meta-value that says that they shouldn't do that.)

> and also other values for which I don't care about how much they'd happen to drift.

Hum. In what way can you be said to have these values then? Maybe these are un-endorsed preferences? Do you have a specific example?

Off the top of my head, right now I value things such as nature, literature, sex, democracy, the rule of law, the human species and so on, but if my descendants had none of those things and had replaced them with something totally different and utterly incomprehensible, that'd be fine with me as long as they were happy and didn't suffer much.

If I said that some of these were instrumental preferences, and some of these were weak preferences, would that cover it all?

Some are instrumental yes, though I guess that for "weak preferences", it would be more accurate to say that I value some things for my own sake rather than for their sake. That is, I want to be able to experience them myself, but if others find them uninteresting and they vanish entirely after I'm gone, that's cool.

(There has to be some existing standard term for this.)

That doesn't sound complicated or mysterious at all - you value these for yourself, but not necessarily for everyone. So if other people lack these values, then that's not far from your initial values, but if you lack them, then it is far.

This seems to remove the point of your initial answer?

> So if other people lack these values, then that's not far from your initial values, but if you lack them, then it is far.

Well, that depends on how you choose the similarity metric. Like, if you code "the distance between Kaj's values and Stuart's values" as the Jaccard distance between them, then you could push the distance between our values arbitrarily close to its maximum just by adding values I have but you don't, or vice versa. So if you happened to lack a lot of my values, then our values would be far.

Jaccard distance probably isn't a great choice of metric for this purpose, but I don't know what a good one would be.

If we make the (false) assumption that we both have utility/reward functions, and E_U(V) is the expected value of utility V when a U-maximiser is in charge, then we can measure the distance between utilities U and V as d(U,V) = E_U(U) - E_V(U).

This is non-symmetric and doesn't obey the triangle inequality, but it is a very natural measure - it represents the cost to U to replace a U-maximiser with a V-maximiser.
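A toy numerical illustration of this distance, with made-up worlds and utilities, and a maximiser that simply picks its top-rated world:

```python
worlds = ["w1", "w2", "w3"]
U = {"w1": 10.0, "w2": 3.0, "w3": 0.0}   # invented utilities
V = {"w1": 0.0, "w2": 3.0, "w3": 10.0}

def best_world(utility):
    # A deterministic maximiser picks the world it rates highest.
    return max(worlds, key=lambda w: utility[w])

def d(U, V):
    # E_U(U) - E_V(U): U's value under a U-maximiser, minus U's value under a V-maximiser.
    return U[best_world(U)] - U[best_world(V)]

print(d(U, V))  # 10.0: the cost to U of handing control to a V-maximiser
print(d(U, U))  # 0.0: no cost if nothing changes
```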

Equivalently, we can say that we don't know how we should define the dimensions of the human values or the distance measure from current human values, and if we pick these definitions arbitrarily, we will end up with arbitrary results.

> This, by the way, explains my intuitive dislike for some types of moral realism. If there are true objective moral facts that humans can access, then whatever process counts as "accessing them" becomes... a local stopping condition for defining value.

I'm not sure I understand what you're getting at here. Yes, they are both local stopping conditions, but there seems to be a clear disanalogy. The other local stopping conditions seem to be bad not because they are stopping conditions, but because most contemporary people don't want to end up as lotus-eaters, or as mindless outsourcers. We would oppose such a development even if it wasn't stable! For example, a future where we oscillate between lotus-eaters and mindless outsourcers seems about as bad as either individual scenario. So it's not really the stability we object to.

But in that case, it's not clear why we should be opposed to the moral realism. After all, many people would like to go there, even if we are presently a long way away.

Answer 2:

My general position: if the destination is good, we should be able to tell, now, that it is at least an acceptable destination, maybe with some oddities. And by saying that, I'm saying that the destination needs to be tied to our current values, at least to some extent.

If you claim that the destination of moral realism is good, then a) demonstrate this, or b) let me add something tying it to our current values anyway (this shouldn't change much, if you already assume the destination is good, and would help if it isn't).

(note: moral realism is complex and varied and I only sorta-understand a small portion of it; this answer is about that small portion)

We don't know anything about the features of the moral realism destination, just that it is, in some sense, the true morality. Some moral realists seem to be explicitly ok with the destination being something that horrifies us today; in fact, this is even expected.

So this is advocating that we push off towards an unknown moral destination, with the expectation that we will find it awful, and with only local properties defining it. Pardon my lack of enthusiasm ^_^

> nevertheless it seems possible to state that some values are further away from this undefined starting point than others (paperclippers are very far, money-maximisers quite far, situations where recognisably human beings do recognisably human stuff are much closer)

Whether a value system recommends creating humans doing human stuff depends not just on the value system but also on the relative costs of creating humans doing human stuff versus creating other good things. So it seems like defining value distance requires either making some assumptions about the underlying universe, or somehow measuring the distance between utility functions and not just the distance between recommendations. Maybe you'd end up with something like "if a hedon is a thousand times cheaper than a unit of eudaimonia and the values recommend using the universe's resources for hedons, that means the values are very distant from ours, but if a hedon is a million times cheaper than a unit of eudaimonia and the values recommend using the universe's resources for hedons, the values could still be very close to ours".
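A toy version of this point (all weights and prices invented): the same value system can recommend filling the universe with hedons or with eudaimonia depending purely on the relative prices, so the recommendation alone doesn't fix how distant the values are from ours.

```python
our_values = {"hedon": 1, "eudaimonia": 5000}   # invented weights: we mostly care about eudaimonia

def recommendation(values, price_hedon, price_eudaimonia):
    # Spend the universe's resources on whichever good gives more value per unit cost.
    return "hedons" if values["hedon"] / price_hedon > values["eudaimonia"] / price_eudaimonia else "eudaimonia"

print(recommendation(our_values, 1, 1_000))      # eudaimonia: moderately cheap hedons aren't worth it
print(recommendation(our_values, 1, 1_000_000))  # hedons: the very same values now recommend hedons
```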

> For example, if the human values are determined by human feedback, then we can forbid the AI from coercing the human in any way, or restrict it to only using some methods (such as relaxed conversation).

It seems to me the natural way to do this is by looking for coherence among all the possible ways the AI could ask the relevant questions. This seems like something you'd want anyway, so if there's a meta-level step at the beginning where you consult humans on how to consult their true selves, you'd get it for free. Maybe the meta-level step itself can be hacked through some sort of manipulation, but it seems at least harder (and probably there'd be multiple levels of meta).
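A very rough sketch of what "looking for coherence" might mean operationally (my reading, not a specification): pose the same underlying question in many phrasings and only accept an answer that a large majority of phrasings agree on. `phrasings` and `ask_human` are hypothetical stand-ins.

```python
from collections import Counter

def coherent_answer(phrasings, ask_human, threshold=0.8):
    # Ask the same underlying question in every phrasing and tally the answers.
    answers = Counter(ask_human(p) for p in phrasings)
    answer, count = answers.most_common(1)[0]
    if count / len(phrasings) >= threshold:
        return answer
    return None  # no coherent answer; fall back to more deliberation or a meta-level step
```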

See my next post - human values are contradictory, meta-values especially so.

It seems to me that the fact that we're having conversations like this implies that there's some meta level where we agree on the "rules of the game".