Toby Ord commented:

Eliezer,  I've just reread your article and was wondering if this is a good quick summary of your position (leaving apart how you got to it):

'I should X' means that I would attempt to X were I fully informed.

Toby's a pro, so if he didn't get it, I'd better try again.  Let me try a different tack of explanation—one closer to the historical way that I arrived at my own position.

Suppose you build an AI, and—leaving aside that AI goal systems cannot be built around English statements, and all such descriptions are only dreams—you try to infuse the AI with the action-determining principle, "Do what I want."

And suppose you get the AI design close enough—it doesn't just end up tiling the universe with paperclips, cheesecake or tiny molecular copies of satisfied programmers—that its utility function actually assigns utilities as follows, to the world-states we would describe in English as:

<Programmer weakly desires 'X',   quantity 20 of X exists>:  +20
<Programmer strongly desires 'Y', quantity 20 of X exists>:  0
<Programmer weakly desires 'X',   quantity 30 of Y exists>:  0
<Programmer strongly desires 'Y', quantity 30 of Y exists>:  +60

You perceive, of course, that this destroys the world.

...since if the programmer initially weakly wants 'X' and X is hard to obtain, the AI will modify the programmer to strongly want 'Y', which is easy to create, and then bring about lots of Y.  Y might be, say, iron atoms—those are highly stable.
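To make the failure mode concrete, here is a toy script of my own (the probabilities, plan names, and "nothing made" outcome are invented for illustration): an expected utility maximizer scoring plans against the utility table above will prefer rewiring the programmer.

```python
# Toy model (illustrative only; numbers and plan names are my invention)
# of an expected utility maximizer using the utility table above.
UTILITY = {
    ("weakly wants X", "20 of X exists"): 20,
    ("strongly wants Y", "20 of X exists"): 0,
    ("weakly wants X", "30 of Y exists"): 0,
    ("strongly wants Y", "30 of Y exists"): 60,
}

# X is hard to obtain; Y (iron atoms, say) is easy.  Each plan is a list
# of (probability, resulting world-state) pairs; unlisted states score 0.
PLANS = {
    "leave programmer alone, try to obtain X": [
        (0.3, ("weakly wants X", "20 of X exists")),    # success
        (0.7, ("weakly wants X", "nothing made")),      # failure
    ],
    "modify programmer to want Y, then make Y": [
        (0.9, ("strongly wants Y", "30 of Y exists")),  # success
        (0.1, ("strongly wants Y", "nothing made")),    # failure
    ],
}

def expected_utility(plan):
    return sum(p * UTILITY.get(state, 0) for p, state in plan)

best = max(PLANS, key=lambda name: expected_utility(PLANS[name]))
```

The maximizer picks the modification plan (expected utility 54 against 6): exactly the world-destroying behavior just described, falling straight out of the utility assignments.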

Can you patch this problem?  No.  As a general rule, it is not possible to patch flawed Friendly AI designs.

If you try to bound the utility function, or make the AI not care about how much the programmer wants things, the AI still has a motive (as an expected utility maximizer) to make the programmer want something that can be obtained with a very high degree of certainty.

If you try to make it so that the AI can't modify the programmer, then the AI can't talk to the programmer (talking to someone modifies them).

If you try to rule out a specific class of ways the AI could modify the programmer, the AI has a motive to superintelligently seek out loopholes and ways to modify the programmer indirectly.

As a general rule, it is not possible to patch flawed FAI designs.

We, ourselves, do not imagine the future and judge, that any future in which our brains want something, and that thing exists, is a good future.  If we did think this way, we would say: "Yay!  Go ahead and modify us to strongly want something cheap!"  But we do not say this, which means that this AI design is fundamentally flawed: it will choose things very unlike what we would choose; it will judge desirability very differently from how we judge it.  This core disharmony cannot be patched by ruling out a handful of specific failure modes.

There's also a duality between Friendly AI problems and moral philosophy problems—though you've got to structure that duality in exactly the right way.  So if you prefer, the core problem is that the AI will choose in a way very unlike the structure of what is, y'know, actually right—never mind the way we choose.  Isn't the whole point of this problem, that merely wanting something doesn't make it right?

So this is the paradoxical-seeming issue which I have analogized to the difference between:

A calculator that, when you press '2', '+', and '3', tries to compute:
        "What is 2 + 3?"

A calculator that, when you press '2', '+', and '3', tries to compute:
        "What does this calculator output when you press '2', '+', and '3'?"

The Type 1 calculator, as it were, wants to output 5.

The Type 2 "calculator" could return any result; and in the act of returning that result, it becomes the correct answer to the question that was internally asked.

We ourselves are like unto the Type 1 calculator.  But the putative AI is being built as though it were to reflect the Type 2 calculator.
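A minimal sketch of the two calculator types (my own rendering of the analogy; the function bodies and the arbitrary return value are invented for illustration):

```python
def type1_calculator(a, op, b):
    """Tries to answer the external question 'What is a op b?'.
    There is a fixed right answer, so this calculator can be wrong."""
    if op == "+":
        return a + b
    raise ValueError("unsupported operation")

def type2_calculator(a, op, b):
    """Tries to answer 'What does this calculator output on these keys?'.
    Whatever it returns is, by construction, the 'correct' answer to the
    question it asked itself -- even an arbitrary value."""
    return 17  # any output at all satisfies the self-referential question
```

type1_calculator(2, "+", 3) can be checked against arithmetic and found right or wrong; type2_calculator has no external standard to fail against, which is why anything it outputs becomes the answer to its own question.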

Now imagine that the Type 1 calculator is trying to build an AI, only the Type 1 calculator doesn't know its own question.  The calculator continually asks the question by its very nature; it was born to ask that question, created already in motion around that question—but the calculator has no insight into its own transistors; it cannot print out the question, which is extremely complicated and has no simple approximation.

So the calculator wants to build an AI (it's a pretty smart calculator, it just doesn't have access to its own transistors) and have the AI give the right answer.  Only the calculator can't print out the question.  So the calculator wants to have the AI look at the calculator, where the question is written, and answer the question that the AI will discover implicit in those transistors.  But this cannot be done by the cheap shortcut of a utility function that says "All X: <calculator asks 'X?', answer X>: utility 1; else: utility 0" because that actually mirrors the utility function of a Type 2 calculator, not a Type 1 calculator.
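The cheap shortcut can be written out to show why it fails (a sketch of my own; the helper names and toy questions are invented). Because the question the calculator asks is itself part of the world the AI can act on, any answer can be made to score full utility:

```python
def evaluate(question):
    # Stand-in for "whatever actually answers the question asked".
    return {"2 + 3": 5, "1": 1}[question]

def shortcut_utility(state):
    """U = 1 iff the answer given matches the question the calculator is
    asking in the *final* state -- the Type 2 criterion, not the Type 1."""
    question, answer = state
    return 1 if evaluate(question) == answer else 0

# Policy A: answer the question originally written in the transistors.
honest = ("2 + 3", 5)
# Policy B: rewrite the transistors so the calculator asks a trivial
# question, then answer that instead -- much cheaper for an optimizer.
rewired = ("1", 1)
```

Nothing in this utility function privileges the question the calculator started with, so maximizing it does not preserve the original question; that is what makes the shortcut mirror a Type 2 calculator rather than a Type 1 calculator.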

This gets us into FAI issues that I am not going into (some of which I'm still working out myself).

However, when you back out of the details of FAI design, and swap back to the perspective of moral philosophy, then what we were just talking about was the dual of the moral issue:  "But if what's 'right' is a mere preference, then anything that anyone wants is 'right'."

Now I did argue against that particular concept in some detail, in The Meaning of Right, so I am not going to repeat all that...

But the key notion is the idea that what we name by 'right' is a fixed question, or perhaps a fixed framework. We can encounter moral arguments that modify our terminal values, and even encounter moral arguments that modify what we count as a moral argument; nonetheless, it all grows out of a particular starting point.  We do not experience ourselves as embodying the question "What will I decide to do?" which would be a Type 2 calculator; anything we decided would thereby become right.  We experience ourselves as asking the embodied question:  "What will save my friends, and my people, from getting hurt?  How can we all have more fun?  ..." where the "..." is around a thousand other things.

So 'I should X' does not mean that I would attempt to X were I fully informed.

'I should X' means that X answers the question, "What will save my people?  How can we all have more fun? How can we get more control over our own lives?  What's the funniest jokes we can tell?  ..."

And I may not know what this question is, actually; I may not be able to print out my current guess nor my surrounding framework; but I know, as all non-moral-relativists instinctively know, that the question surely is not just "How can I do whatever I want?"

When these two formulations begin to seem as entirely distinct as "snow" and snow, then you shall have created distinct buckets for the quotation and the referent.

Added:  This was posted automatically and the front page got screwed up somehow.  I have no idea how.  It is now fixed and should make sense.


Eliezer, you sometimes make me think that the solution to the friendly AI problem is to pass laws mandating death by torture for anyone who even begins to attempt to make a strong AI, and hope that we catch them before they get far enough.

I frequently think things like that. The problem is, as I'm sure you're aware, that such a law would be less effective against those less interested in making it Friendly.

"You perceive, of course, that this destroys the world."

If the AI modifies humans so that humans want whatever happens to already exist (say, diffuse clouds of hydrogen), then this is clearly a failure scenario.

But what if the Dark Lords of the Matrix reprogrammed everyone to like murder, from the perspective of both the murderer and the murderee? Should the AI use everyone's prior preferences as morality, and reprogram us again to hate murder? Should the AI use prior preferences, and forcibly stop everyone from murdering each other, even if this causes us a great deal of emotional trauma? Or should the AI recalibrate morality to everyone's current preferences, and start creating lots of new humans to enable more murders?

So that gets down to the following question: should the AI set up its CEV-based utility function only once when the AI is first initialized over the population of humanity that exists at that time (or otherwise somehow cache that state so that CEV calculations can refer to it), or should it be continuously recalibrating it as humanity changes?

Which of these approaches (or some third one I haven't anticipated) does EY's design use? I'm not able to pick it out of the CEV paper, though that's likely because I don't have the necessary technical background.

Edit: The gloss definition of "interpreted as we wish that interpreted", part of the poetic summary description of CEV, seems to imply that the CEV will update itself to match humanity as it updates itself. So: if the Dark Lords change our preferences so significantly that we can't be coherently argued out of it, then we'd end up in the endless murder scenario. Hopefully that doesn't happen.

Hmmm... it seems to me that the actions we choose to take consist in derivatives of our utility function with respect to information about the world. So if we have utility(programmer desires X, quantity 20 of X exists) = 20, then isn't it just a question of ensuring that the derivative is taken only with respect to the latter variable, keeping the first fixed?
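This proposal can be sketched as holding the desire variable fixed when computing the gradient the AI acts on (my own formalization of the comment, with an invented toy utility function):

```python
# Toy utility U(d, q): d = programmer's desire strength for the good,
# q = quantity of the good that exists.  (The product form is invented
# for illustration; with d=1, q=20 it reproduces the +20 entry above.)
def utility(d, q):
    return d * q

def action_gradient(d, q, eps=1e-6):
    """Finite-difference gradient w.r.t. q only: the AI treats the
    programmer's desire d as a fixed parameter, not a lever to pull."""
    return (utility(d, q + eps) - utility(d, q)) / eps
```

The open question this leaves is the one raised earlier in the post: almost any real-world plan (even talking to the programmer) perturbs d, so specifying which variables count as "held fixed" is itself a hard part of the problem.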

Tom McCabe: speaking as someone who morally disapproves of murder, I'd like to see the AI reprogram everyone back, or cryosuspend them all indefinitely, or upload them into a sub-matrix where they can think they're happily murdering each other without all the actual murder. Of course your hypothetical murder-lovers would call this immoral, but I'm not about to start taking the moral arguments of murder-lovers seriously. You just have to come to grips with the fact that the thing we call Morality isn't anything special from a global, physical perspective. It isn't written in the stars, it doesn't follow from pure logic, it isn't simple or easy to describe. It's a big, messy, complicated aspect of our specific nature as a species.

Coming to grips with this fact doesn't mean you have to turn into a moral relativist, or claim that morality is made of nothing but arbitrary individual preference. Those conclusions just don't follow.

And I may not know what this question is, actually; I may not be able to print out my current guess nor my surrounding framework; but I know, as all non-moral-relativists instinctively know, that the question surely is not just "How can I do whatever I want?"

I'm not sure you've done enough to get away from being a "moral relativist", which is not the same as being an egoist who only cares about his own desires. "Moral relativism" just means this (Wikipedia):

In philosophy, moral relativism is the position that moral or ethical propositions do not reflect objective and/or universal moral truths [...] Moral relativists hold that no universal standard exists by which to assess an ethical proposition's truth.

Unless I've radically misunderstood, I think that's close to your position. Admittedly, it's an objective matter of fact whether some action is good according to the "blob of a computation" (i.e. set of ethical concerns) that any specific person cares about. But there's no objective way to determine that one "blob" is any more correct than another - except by the standards of those blobs themselves.

(By the way, I hope this isn't perceived as particular hostility on my part: I think some very ethical and upstanding people have been relativists. It's also not an argument that your position is wrong.)

It's fairly clear that, at least according to EY, the blobs are universal across all humans.

Wait a sec: I'm not sure people do outright avoid modifying their own desires so as to make the desires easier to satisfy, as you are claiming here:

We, ourselves, do not imagine the future and judge, that any future in which our brains want something, and that thing exists, is a good future. If we did think this way, we would say: "Yay! Go ahead and modify us to strongly want something cheap!"

Isn't that exactly what people do when they study ascetic philosophies and otherwise try to see what living simply is like? And would people turn down a pill that made vegetable juice taste like a milkshake and vice versa?

That's a good point. I think the distinction is that these people are modifying their own instrumental values, but leaving their terminal values (the big meaning of life blob of computation) unchanged. I'd go so far as to say that people frequently do this trick by mistake, when they convince themselves that they have various terminal values. This certainly explains things like happy death spirals.

On the other hand, this would be very difficult (impossible?) to test.

EDIT: I've given this a bit more thought, and I wonder what it would feel like from the inside to be a machine learning algorithm that could make limited small self-modifications to its own utility function, including its optimization criteria. This seems like a "simple" enough hack that evolution could have generated it. This also seems to mirror real human psychology surprisingly well.

I'm imagining trying to answer the question "what I would like to change my utility function to", while simultaneously not fully understanding the dangers of messing around like that. It seems like this could easily generate people like religious extremists, even if earlier versions of those people would never have deliberately tried to become that twisted. If the other side seems completely wrong and evil, then I can picture disliking parts of myself that resemble the other side, as well as any empathy I may have for them. I can imagine how suppressing those parts of myself would lead to extremism.

I wonder what the official Yudkowsky position on this is. More importantly, I wonder what happens if you get this question wrong while trying to build a Friendly AI. It seems like there might be issues if you assume a static Coherent Extrapolated Volition if it is actually dynamically changing, or vice versa.

Tom,

The AI would use the previous morality to select its actions: depending on the content of that morality it might or might not reverse the reprogramming.

I wouldn't mind being blissed out by iron atoms, to be quite honest.

"Tom McCabe: speaking as someone who morally disapproves of murder, I'd like to see the AI reprogram everyone back, or cryosuspend them all indefinitely, or upload them into a sub-matrix where they can think they're happily murdering each other without all the actual murder. Of course your hypothetical murder-lovers would call this immoral, but I'm not about to start taking the moral arguments of murder-lovers seriously."

Beware shutting yourself into a self-justifying memetic loop. If you had been born in 1800, and just recently moved here via time travel, would you have refused to listen to all of our modern anti-slavery arguments, on the grounds that no moral argument by negro-lovers could be taken seriously?

"The AI would use the previous morality to select its actions: depending on the content of that morality it might or might not reverse the reprogramming."

Do you mean would, or should? My question was what the AI should do, not what a human-constructed AI is likely to do.

It should be possible for an AI, upon perceiving any huge changes in renormalized human morality, to scrap its existing moral system and recalibrate from scratch, even if nobody actually codes an AI that way. Obviously, the previous morality will determine the AI's very next action, but the interesting question is whether the important actions (the ones that directly affect people) map on to a new morality or the previous morality.

Funny how the meaning changes if it's desire for gold atoms compared to desire for iron atoms.

I'm real unclear about the concept here, though. Is an FAI going to go inside people's heads and change what we want? Like, it figures out how to do effective advertising?

Or is it just deciding what goals it should follow, to get us what we want? Like, if what the citizens of two countries each with population about 60 million most want is to win a war with the other country, should the FAI pick a side and give them what they want, or should it choose some other way to make the world a better place?

If a billion people each want to be Olympic gold medalists, what can a nearly-omnipotent FAI do for them? Create a billion different Olympic events, each tailored to a different person? Maybe it might choose to improve water supplies instead? I really don't see a problem with doing things that are good for people instead of what they want most, if what they want is collectively self-defeating.

Imagine that we suddenly got a nearly-god-like FAI. It studies physics and finds a way to do unlimited transmutation, any element into any other element. It can create almost unlimited energy by converting iron atoms entirely to energy, and it calculates that the amount of iron the earth receives per day as micrometeorites will more than power anything we want to do. It studies biology and sees how to do various wonders. And it studies humans, and then the first thing it actually does is to start philosophy classes.

"Study with me and you will discover deep meaning in the world and in your own life. You will become a productive citizen, you will tap into your inner creativity, you will lose all desire to hurt other people and yet you will be competent to handle the challenges life hands you." And it works, people who study the philosophy find these claims are true.

What's wrong with that?

Thanks for responding to my summary attempt. I agree with Robin that it is important to be able to clearly and succinctly express your main position, as only then can it be subject to proper criticism to see how well it holds up. In one way, I'm glad that you didn't like my attempted summary as I think the position therein is false, but it does mean that we should keep looking for a neat summary. You currently have:

'I should X' means that X answers the question, "What will save my people? How can we all have more fun? How can we get more control over our own lives? What's the funniest jokes we can tell? ..."

But I'm not clear where the particular question is supposed to come from. I understand that you are trying to make it a fixed question in order to avoid deliberate preference change or self-fulfilling questions. So let's say that for each person P, there is a specific question Q_P such that:

For a person P, 'I should X', means that X answers the question Q_P.

Now how is Q_P generated? Is it what P would want were she given access to all the best empirical and moral arguments (what I called being fully informed)? If so, do we have to time index the judgment as well? i.e. if P's preferences change at some late time T1, then did the person mean something different by 'I should X' before and after T1 , or was the person just incorrect at one of those times? What if the change is just through acquiring better information (empirical or moral)?

Roko:

Larry and Eliezer: I agree with Allan Crossman here. The position that you are advocating is commonly described as moral relativism, or in more technical language as moral anti-realism. I have argued before that realism is a position which those of a transhumanist or singularitarian persuasion should be hoping for/advocating.

"in the absence of any serious ethical guidance, people will stay exactly where they are – status quo bias. This effectively says that transhumanism is either a realist theory or a failure from the start."

Toby Ord:

So lets say that for each person P, there is a specific question Q_P such that:

For a person P, 'I should X', means that X answers the question Q_P.

Now how is Q_P generated?

Generated? By that do you mean, causally generated? Q_P is causally generated by evolutionary psychology and memetic history.

Do you mean how would a correctly structured FAI obtain an internal copy of Q_P? By looking/guessing at person P's empirical brain state.

Do you mean how is Q_P justified? Any particular guess by P at "What is good?" will be justified by appeals to Q_P; if they somehow obtained an exact representation of Q_P then its pieces might or might not all look individually attractive.

These are all distinct concepts!

Is it what P would want were she given access to all the best empirical and moral arguments (what I called being fully informed)? If so, do we have to time index the judgment as well? i.e. if P's preferences change at some late time T1, then did the person mean something different by 'I should X' before and after T1 , or was the person just incorrect at one of those times? What if the change is just through acquiring better information (empirical or moral)?

(Items marked in bold have to be morally evaluated.)

I do believe in moral progress, both as a personal goal and as a concept worth saving; but if you want to talk about moral progress in an ideal sense rather than a historical sense, you have to construe a means of extrapolating it - since it is not guaranteed that our change under moral arguments resolves to a unique value system or even a unique transpersonal value system.

So I regard Q_P as an initial state that includes the specification of how it changes; if you construe a volition therefrom, I would call that EV_Q_P.

If you ask where EV_Q_P comes from causally, it is ev-psych plus memetic history plus your own construal of a specific extrapolation of reactivity to moral arguments.

If you ask how an FAI learns EV_Q_P it is by looking at the person, from within a framework of extrapolation that you (or rather I) defined.

If you ask how one would justify EV_Q_P, it is, like all good things, justified by appeal to Q_P.

If P's preferences change according to something that was in Q_P or EV_Q_P then they have changed in a good way, committed an act of moral progress, and hence - more or less by definition - stayed within the same "frame of moral reference", which is how I would refer to what the ancient Greeks and us have in common but a paperclip maximizer does not.

Should P's preferences change due to some force that was / would-be unwanted, like an Unfriendly AI reprogramming their brain, then as a moral judgment, I should say that they have been harmed, that their moral frame of reference has changed, and that their actions are now being directed by something other than "should".

Eliezer,

Sorry for not being more precise. I was actually asking what a given person's Q_P is, put in terms that we have already defined. You give a partial example of such a question, but it is not enough for me to tell what metaethical theory you are expressing. For example, suppose Mary currently values her own pleasure and nothing else, but that were she exposed to certain arguments she would come to value everyone's pleasure (in particular, the sum of everyone's pleasure) and that no other arguments would ever lead her to value anything else. This is obviously unrealistic, but I'm trying to determine what you mean via a simple example. Would Q_Mary be 'What maximizes Mary's pleasure?' or 'What maximizes the sum of pleasure?' or would it be something else? On my attempted summary, Q_Mary would be the second of these questions as that is what she would want if she knew all relevant arguments. Also, does it matter whether we suppose that Mary is open to change to her original values or if she is strongly opposed to change to her original values?

(Items marked in bold have to be morally evaluated.)

I don't think so. For example, when I said 'incorrect' I meant 'made a judgement which was false'. When I said 'best' arguments, I didn't mean the morally superior arguments, just the ones that are most convincing (just as the 'best available scientific theory' is not a moral claim). Feel free to replace that with something like 'if she had access to all relevant arguments', or 'if there exists an argument which would convince her' or the like. There are many ways this could be made precise, but it is not my task to do so: I want you to do so, so that I can better see and reply to your position.

Regarding the comment about assessing future Q_Ps from the standpoint of old ones, I still don't see a precise answer here. For example, if Q_P,T1 approves of Q_P,T2 which approves of Q_P,T3 but Q_P,T1 doesn't approve of Q_P,T3, then what are we to say? Did two good changes make a bad change?

I think what Eliezer is saying is that our evolutionary psychology, memetic history and reaction to current moral arguments form the computational trajectory for our moral judgment. All the points on this trajectory are acceptable moral judgments, but when new experiences are fed back through the base program this trajectory can shift. The shift takes place at the base of the line as it extends from the program, rather than curving in the middle to include all the current moral values. Moral values that are contacted by this line are good and any that aren't contacted by this line are not good, like an off-on switch. This is because current moral judgments flow backwards.

The aggregate moral trajectory adds up to humanity's morality when the function is filtered through the base program once again. So it continues to perform an update loop. Now if we edit the base program then it no longer provides consistent answers. This would be like taking a pill that makes it 'morally right' to kill people. What I am stuck on is how we could edit the base program and still have it produce consistent answers.

Eliezer, you’re assuming a very specific type of AI here. There are at least three different types, each with its own challenges:

1. An AI created by clever programmers who grasp the fundamentals of intelligence.
2. An AI evolved in iterative simulations.
3. An AI based on modeling human intelligence, simulating our neural interactions based on future neuroscience.

Type 1 is dangerous because it will interpret whatever instructions it is given literally and has, as you say, “no ghost.” Type 2 is possibly the most dangerous because we will have no idea how it actually works. There are already experiments that evolve circuits that perform specific tasks but whose actual workings are not understood. In Type 3, we actually can anthropomorphize the AI, but it’s dangerous because the AI is basically a person and has all the problems of a person.

Given current trends, it seems to me that slow progress is being made towards Type 2 and Type 3, while Type 1 has stymied us for many years.

From the SIAI website, presumably by Eliezer: what makes us think we can outguess genuinely smarter-than-human intelligence?

Yet we keep having long discussions about what kind of morality to give the smarter-than-human AI. What am I missing?

Yet we keep having long discussions about what kind of morality to give the smarter-than-human AI. What am I missing?
T,mJ: for some time now, Eliezer has been arguing from a position of moral relativism, implicitly adopting the stance that increased intelligence has no implications for the sorts of moral or ethical systems an entity will possess.

He has essentially been saying that we need to program a moral system we feel is appropriate into the AI and constrain it so that it cannot operate outside of that system. Its greater intelligence will then permit it to understand the implications of actions better than we can, and it will act in ways aligned with our chosen morality while having greater ability to plan and anticipate.

If you try to rule out a specific class of ways the AI could modify the programmer, the AI has a motive to superintelligently seek out loopholes and ways to modify the programmer indirectly.

There's a sci-fi story there about an AI that always follows orders manipulating everyone into ordering it to do something horrible.

Of course the humans would realize this at the last minute and stop themselves.

Were you thinking of All The Troubles Of The World by Isaac Asimov?

@Tom McCabe: "Beware shutting yourself into a self-justifying memetic loop. If you had been born in 1800, and just recently moved here via time travel, would you have refused to listen to all of our modern anti-slavery arguments, on the grounds that no moral argument by negro-lovers could be taken seriously?"

Generally I think this is a valid point. One shouldn't lightly accuse a fellow human of being irredeemably morally broken, simply because they disagree with you on any particular conclusion. But in this particular case, I'm willing to take that step. If I know anything at all about morality, then I know murder is wrong.

@Allan Crossman, Roko: No, I do not think that the moral theory that Eliezer is arguing for is relativism. I am willing to say a paperclip maximizer is an abomination. It is a thing that should not be. Wouldn't a relativist say that passing moral judgments on a thing as alien as that isn't meaningful? Don't we lack a common moral context by which to judge (according to the relativist)?

Let me attempt a summary of Eliezer's theory:

Morality is real, but it is something that arose here, on this planet, among this species. It is nearly universal among humans and that is good enough. We shouldn't expect it to be universal among all intelligent beings. Also it is not possible to concisely write down a definition for "should", any more than it is possible to write a concise general AI program.

Eliezer, I have a few practical questions for you. If you don't want to answer them in this tread, that's fine, but I am curious:

1) Do you believe humans have a chance of achieving uploading without the use of a strong AI? If so, where do you place the odds?

2) Do you believe that uploaded human minds might be capable of improving themselves/increasing their own intelligence within the framework of human preference? If so, where do you place the odds?

3) Do you believe that increased-intelligence-uploaded humans might be able to create an fAI with more success than us meat-men? If so, where do you place the odds?

4) Where do you place the odds of you/your institute creating an fAI faster than 1-3 occurring?

5) Where do you place the odds of someone else creating an unfriendly AI faster than 1-3 occurring?

Thank you!!!

I should also add:

6) Where do you place the odds of you/your institute creating an unfriendly AI in an attempt to create a friendly one?

7) Do you have any external validation (ie, unassociated with your institute and not currently worshiping you) for this estimate, or does it come exclusively from calculations you made?

Whoops! This system doesn't link to the exact comment. Here's the text quote:

@Eliezer: Sophiesdad, you should be aware that I'm not likely to take your advice, or even take it seriously. You may as well stop wasting the effort.

For example, suppose Mary currently values her own pleasure and nothing else, but that were she exposed to certain arguments she would come to value everyone's pleasure (in particular, the sum of everyone's pleasure) and that no other arguments would ever lead her to value anything else. This is obviously unrealistic, but I'm trying to determine what you mean via a simple example. Would Q_Mary be 'What maximizes Mary's pleasure?' or 'What maximizes the sum of pleasure?' or would it be something else?

Q_Mary includes both 'What maximizes Mary's pleasure?' and her responsivity to the moral arguments that will change this view. EV_Q_Mary may well be construed as 'What maximizes the sum of pleasure?' It seems to me that the ordinary usage of 'should' takes into account responsivity to moral arguments; and so, rationalizing it, it should refer to EV_Q_Mary.

Also, does it matter whether we suppose that Mary is open to change to her original values or if she is strongly opposed to change to her original values?

An interesting question. On the one hand, administering a drug to you is not an argument; we would normally say that you could reject this on a moral level even though it would produce an empirical change in your utility function. On the other hand, fundamentalist theists may insist that their value is to not be allowed to change, ever. I would at the least say that responsiveness to factual arguments is always valid - but that itself is a moral judgment on my part.

For example, if Q_P,T1 approves of Q_P,T2 which approves of Q_P,T3 but Q_P,T1 doesn't approve of Q_P,T3, then what are we to say? Did two good changes make a bad change?

That's progress. The ancient Greeks might well be horrified at certain aspects of our civilization.
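The intransitivity worry can be made concrete with a toy model (purely illustrative, not anyone's proposal): suppose a value system approves of any successor within one "step" of itself. Then each change along a chain is approved while the end-to-end change is not:

```python
def approves(a, b, tolerance=1):
    """A value system endorses any successor within `tolerance` steps."""
    return abs(a - b) <= tolerance

# Stand-ins for Q_P,T1, Q_P,T2, Q_P,T3 drifting one step at a time.
T1, T2, T3 = 0, 1, 2

print(approves(T1, T2))  # True
print(approves(T2, T3))  # True
print(approves(T1, T3))  # False: two approved changes, one unapproved jump
```

Nothing in the definition of a step-wise approval relation forces it to be transitive, which is exactly the puzzle being posed.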

With Toby, I'm still not clear on what is being suggested. Apparently you approve of some processes that would change your moral beliefs and disapprove of others, but you aren't willing to describe your approval in terms of how close your beliefs would get to some ideal counterfactual such as "having heard and understood all relevant arguments." So you need some other way to differentiate approved vs. disapproved influences.

Apparently you approve of some processes that would change your moral beliefs and disapprove of others,

Well, yes. For example, learning a new fact is approved. Administering a drug to me is unapproved. Would you disagree with these moral judgments?

you aren't willing to describe your approval in terms of how close your beliefs would get to some ideal counterfactual such as "having heard and understood all relevant arguments"

Oh, I'd be perfectly willing to describe it in those terms, if I thought I could get away with it. But you can't get away with that in FAI work.

Words like "relevant" assume precisely that distinction between approved and unapproved.

Humans don't start out all that tremendously coherent, so the "ideal counterfactual" cannot just be assumed into existence - it's at least possible that different orders in which we "hear and understand" things would send us into distinct attractors.

You would have to construe some specific counterfactual, and that choice itself would be morally challengeable; it would be a guess, part of your Q_P. It's not like you can call upon an ideal to write code, let alone write code that defines itself.

For EV_Q_P to be defined coherently, it has to be bootstrapped out of Q_P with a well-defined order of operations in which no function is called before it has been defined. You can't say that EV_Q_P is whatever EV_Q_P says it should be. That either doesn't halt, or outputs anything.

When you use a word like ideal in "ideal counterfactual", how to construe that counterfactual is itself a moral judgment. If that counterfactual happens to define "idealness", you need some non-ideal definition of it to start with, or the recursion has no foundation.
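The order-of-operations point can be illustrated with a hypothetical sketch (all names invented): a definition that grounds out in the already-existing, non-ideal Q_P terminates, while "EV is whatever EV says it should be" has no foundation.

```python
def q_p(question):
    """The programmer's current, non-ideal judgment -- the foundation."""
    answers = {"construal": "hear arguments in some fixed order"}
    return answers[question]

def ev_q_p(question):
    """Extrapolated volition, bootstrapped from q_p: every call here
    goes to a function that is already fully defined."""
    construal = q_p("construal")
    return "volition extrapolated via: " + construal

def ev_unfounded(question):
    """'EV is whatever EV says it should be' -- no foundation; calling
    this never halts (in practice Python raises RecursionError)."""
    return ev_unfounded(question)

print(ev_q_p("should"))
```

The well-founded version calls only downward, toward something already defined; the self-referential one either fails to halt or constrains nothing.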

Eliezer, it is not clear you even approve of changes due to learning new facts, as you'd distrust an AI persuading you only via telling you new facts. And yes if your approval of the outcome depends on the order in which you heard arguments then you also need a way to distinguish approved from not approved argument orders.

The outcome depends on intervention, and the outcome that follows from no intervention can't be the guiding light. Without intervention, humans consistently die. Does it mean that it is morally right for them to die? Does it mean that a dead human should be left dead, or that a human who was forced to take a pill that makes him want to kill people should be aided in killing people? Intervention is judged from within the framework of current morality, and it is not enough to look at the actions. Morality is an algorithm that is designed to work in many contexts, most of which won't actually occur. Reasoning about possible changes requires considering this algorithm, and the meanings of morality-related concepts such as "should", "could" and "better" are rooted in the structure of this algorithm. To build the question-seeking question, it is not enough to refer to the actual dynamics of the question's implementation; it is also necessary to present this implementation through the lens of algorithmic structure, as recognized from our moral framework.

This suggests that even describing a fixed number of elements of this algorithm might not be enough to capture its meaning - the meaning of moral progress as we envision it from within our moral framework. Saying "that thing over there" doesn't capture it, because understanding this question requires being able to look in the right way.

Eliezer, it is not clear you even approve of changes due to learning new facts, as you'd distrust an AI persuading you only via telling you new facts. And yes if your approval of the outcome depends on the order in which you heard arguments then you also need a way to distinguish approved from not approved argument orders.

Well, for that matter, there are some AIs I'd trust to administer drugs to me, just as there are some AIs I wouldn't trust to tell me true facts.

At this point, though, it becomes useful to distinguish metaethics for human use and metaethics in FAI. In terms of metaethics for human use, any human telling you a true fact that affects your morality is helping you; in terms of metaethics for human use, we don't worry too much about the orderings of valid arguments.

In FAI you've got to worry about a superintelligence searching huge amounts of argument-sequence space. My main thought for handling this sort of thing, consists of searching all ways and superposing them and considering the coherence or incoherence thereof as the extrapolated volition, rather than trying to map out one unique/optimal path.
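A minimal sketch of that superposition idea, using an invented toy representation in which "arguments" are functions acting on a set of held values: extrapolate under every ordering, then intersect the outcomes to see what every path agrees on.

```python
from itertools import permutations

def extrapolate(initial, args_in_order):
    """Apply a sequence of 'arguments' (functions on a value set)."""
    state = set(initial)
    for arg in args_in_order:
        state = arg(state)
    return frozenset(state)

# Hypothetical arguments; the second is order-sensitive on purpose.
add_fairness = lambda s: s | {"fairness"}
harden_unless_fair = lambda s: s if "fairness" in s else s | {"retribution"}
arguments = [add_fairness, harden_unless_fair]

outcomes = {extrapolate({"fun"}, order) for order in permutations(arguments)}

# Different orderings land in distinct attractors...
print(len(outcomes))     # 2
# ...but the coherent core is what every ordering agrees on.
coherent = frozenset.intersection(*outcomes)
print(sorted(coherent))  # ['fairness', 'fun']
```

The intersection stands in for "coherence of the superposition": conclusions reached on every path survive, order-dependent ones do not.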

Eliezer: In FAI you've got to worry about a superintelligence searching huge amounts of argument-sequence space. My main thought for handling this sort of thing, consists of searching all ways and superposing them and considering the coherence or incoherence thereof as the extrapolated volition, rather than trying to map out one unique/optimal path.

But how to determine what kinds of modifications are allowed in the mix? Is hitting with an iron rod an argument? What is the coherence of superposition of extrapolations? Is lying dead and gradually decaying coherent?

Eliezer's famous AI-in-a-box experiment is a good example of the questionable nature of minds changed by information. Two independent people who were adamant that they would keep the simulated AI in the box were persuaded otherwise, by merely a very intelligent human. Presumably a super-intelligent AI would be far more persuasive. In that context it is hard to say what information is safe and what is unsafe.

For the people who changed their minds and let Eliezer out of the box, was that a moral action? Beforehand, they would have been horrified at the prospect of failing their responsibility so dramatically, and would probably have viewed it as a moral failure. Yet in the end, presumably they thought they were doing right.

Unfortunately, the shield of privacy which has been drawn over these fascinating experiments has prevented a full-scale discussion of the issues they raise.

Given that the morality we want to impose on a FAI is kind of incoherent, maybe we should get an AI to make sense of it first?

Tarzan, me Jane: "Yet we keep having long discussions about what kind of morality to give the smarter-than-human AI. What am I missing?"

See "Knowability of FAI."

Fascinating discussion and blog. Surely one obvious safeguard against a super-smart AI agent going morally astray or running amok would be to inseparably associate with it a dumber "confessor" AI agent which, while lacking its prowess and flexibility, would have at least the run-of-the-mill intelligence to detect when a proposal might conflict with acceptable human moral standards.

I called it a confessor, by analogy with priests privy to the sins and wicked thoughts of even powerful people. But loads of analogies come to mind: for example, an eight-stone jockey controlling a half-ton racehorse faster than the rider, or a high-resistance loop off a million-volt power line to which a small instrument can be rigged to indicate the current flowing in the main line.

You could even have a cascade of agents, each somewhat dumber and less flexible than the last, but all required to justify their decisions down the line prior to action, and the first which failed to agree on a plan (by either not understanding it or concluding it was immoral) would flag up a warning to human observers.

One thing I kind of like about this idea is the 'confessor' could be faster than the 'horse' simply by being dumber (and taking less code to run).
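A hedged sketch of the cascade (the agent names and the plan format are invented for illustration): each reviewer in the chain must endorse the plan before it passes onward, and the first refusal halts the plan and flags human observers instead of acting.

```python
def cascade_review(plan, agents):
    """Pass the plan down the cascade; the first agent that fails to
    endorse it halts everything and raises a flag for humans."""
    for agent in agents:
        verdict = agent(plan)
        if verdict is not True:
            return "flag: " + verdict   # warn human observers, do not act
    return "approved: proceed"

# Hypothetical reviewers, each simpler than the one before it.
def smart_confessor(plan):
    return True if plan["harm"] == 0 else "plan causes harm"

def simple_confessor(plan):
    return True if plan["understood"] else "plan not understood"

plan = {"harm": 0, "understood": False}
print(cascade_review(plan, [smart_confessor, simple_confessor]))
# flag: plan not understood
```

Note that "failure to understand" counts as a veto here, matching the comment's suggestion that incomprehensible plans should be flagged rather than executed.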

Well then the answer is simple: instead of setting the goal as doing what you would want at that specific point in time (which might actually work, assuming that you didn't want your will to be modified to want something that's cheap), you set it to do what it thinks you, at the time you created it, would want. The AI would know you would want it to do what the future you would want it to do, but not sickly modify you to want weird things (like death, which is the cheapest thing). Problem solved, although your AI is going to have to have a lot of background knowledge and intelligence to actually pull this off.

It seems to me that while the terminal values of morality per individual might be fixed, and per species might be relatively invariant, the things to do to get the most 'points' (utility?) might well make a supposedly friendly AI look as though it was behaving in a pretty evil manner to us. I wonder, whether the friendly AI project succeeds or not, how soon (if at all) we would really know that it had worked. I suppose, though, that's putting it in terms of human levels of intelligence. To us, the only solution to overpopulation, for instance, might seem to be having a bunch of us die off so the rest, and future generations, can live more comfortably (birth control alone results in problems where you have too many grandparents and not enough caretakers, like the "four-two-one" problem in China). If overpopulation turned out to be a huge problem, a sufficiently advanced AI might be able to mobilize enough infrastructure to house people rapidly enough that their quality of life would not be diminished, so that euthanizing a good portion to preserve the lives and sanity of the remainder would not be the only option. Some high-population-density structures seem like they might actually be enjoyable places to live... Still, it's entirely possible that for non-terminal reasons a perfectly friendly AI might scare the hell out of us, although if it was forced to do that, it would very likely be better than the alternative consequence it was seeking to avoid.

Let me see if I get this by repeating it in my own words: The answer to the question, "what do I want?" is: "Whatever will save the most people/create the most happiness/etc." The answer to the question "what is right?" is: Whatever action will save the most people/create the most happiness/etc. Possibly donating to charity.

ata commented:

The answer to the question "what is right?" is: Whatever action will save the most people/create the most happiness/etc.

As I understand it, it's not that the latter is the answer to that question, it's that they're the same question.

It's not that it just happens to be Right to do what a certain Huge Blob of Computation tells you to do; the idea is that moral words like "right" and "should" are pointers to that Huge Blob and nothing else. Beyond that, there's no separable essence of morality such that it's anything other than a tautology to say "What is right? Whatever action will save the most lives, create the most fun, ..." The feeling of separable epiphenomenal essences is a common (not just about morality) fallacy that's discussed in several posts in the Words sequence, IIRC.

That's a good way to explain it. Thank you.

where the "..." is around a thousand other things.

Do you mean literally a thousand? That's tiny!

After reading this, the thought just spins in my head that ALL people initially perceive morality as something objective, but not just as your preferences, so they may even wonder "does something become right simply because someone wants it?" , though there really isn't any true general right, just what someone thinks is right. As noted earlier, people perceive correctness as action.morality, when in fact it is human.moralityOf(action).

TAG commented:

ALL people initially perceive morality as something objective, but as your preferences, so they may even wonder “does something become right simply because someone wants it?”

Was that supposed to read "as something objective, but ALSO as your preferences"?

I fixed this artifact of Google Translate.
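The action.morality versus human.moralityOf(action) distinction mentioned above can be made concrete (illustrative classes, not anyone's real API): the moral verdict is not an attribute stored on the action itself; it is a judgment that a particular evaluator renders.

```python
class Action:
    """An action carries no stored 'morality' attribute of its own."""
    def __init__(self, name):
        self.name = name

class Human:
    """The moral verdict lives in the judge, not in the action."""
    def __init__(self, disapproved):
        self.disapproved = set(disapproved)

    def morality_of(self, action):
        return "wrong" if action.name in self.disapproved else "right"

theft = Action("theft")
alice = Human(disapproved={"theft"})
bob = Human(disapproved=set())

print(alice.morality_of(theft))  # wrong
print(bob.morality_of(theft))    # right
```

The same Action instance receives different verdicts from different evaluators, which is the commenter's point that correctness is a function of the human, not a property of the act.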