William_S comments on Superintelligence 24: Morality models and "do what I mean" - Less Wrong

Post author: KatjaGrace 24 February 2015 02:00AM


Comment author: William_S 24 February 2015 08:51:35PM 0 points

After this section, it feels like the "do what I mean"/"do what I want" instruction pretty much solves the problem of what we want the AI to value. If the creator of the AI doesn't want things that lead to a good future, then it seems unlikely they would succeed in specifying a good future through any other means either. On the other hand, if the creator wants the right thing, then DWIM seems to avoid all perverse instantiations. Additionally, the only technical requirement seems to be that the AI can follow natural language instructions (perhaps supplemented with some simpler definition of value for the AI to use while it is still learning). Overall, my impression is that this area doesn't require nearly as much work as other parts of superintelligence design (such as getting an AI to value goals described in natural language in the first place).