Vaniver comments on Debunking Fallacies in the Theory of AI Motivation - Less Wrong

Post author: Richard_Loosemore 05 May 2015 02:46AM




Comment author: Richard_Loosemore 05 May 2015 06:25:51PM, 6 points

You make a valid point - and one worth discussing at length, sometime - but the most important thing right now is that you have misunderstood my position on the question.

First of all, there is a very big distinction between a few people (or even the whole population!) making a deliberate choice to wirehead, and the nanny AI deciding to force everyone to wirehead because that is its interpretation of "making humans happy" (and doing so in a context in which those humans do not want to do it).

You'll notice that in the above quote from my essay, I said that most people would consider it a sign of insanity if a human being were to suggest forcing ALL humans to wirehead, and doing so on the grounds that this was the best way to achieve universal human happiness. If that same human were to suggest that we should ALLOW some humans to wirehead if they believed it would make them happy, then I would not for a moment label that person insane, and quite a lot of people would react the same way.

So I want to be very clear: I very much do acknowledge that the questions regarding various forms of voluntary wireheading are serious and unanswered. I'm in complete agreement with you on that score. But in my paper I was talking only about the apparent contradiction between (a) forcing people to do something as they screamed their protests, while claiming that those people had asked for this to be done, and (b) an assessment that this behavior was both intelligent and sane. My claim in the above quote was that there is a prima facie case to be made that the proper conclusion is that the behavior would indeed not be intelligent and sane.

(Bear in mind that the quote was just a statement of a prima facie case. The point was not to declare that the AI really is insane and/or not intelligent, but to say that there are grounds for questioning. Then the paper goes on to look into the whole problem in more detail. And, most important of all, I am trying to suggest that responding to this by simply declaring that the AI has a 'different' type of intelligence, or even a superior type of 'intelligence', would be a glaring example of sweeping the prima facie case straight into the trashcan without even looking at it.)

You go on to make a different point:

I also suspect that this attitude generalizes: there are many places where [...] you seem to be responding with "but clearly X will be generated," or "but superintelligence will generate X." Is it actually clear? Will it be obvious in prospect that an AI design is superintelligent at determining what we want, rather than just obvious in retrospect? What features of the AI design should we be looking for?

Well, I hope I have reassured you that, even if I do do that, it would not be a generalization of my attitude.

But now, do I do that? I try really hard not to take anything for granted and simply make an appeal to the obviousness of any idea. So you will have to give me some case-by-case examples if you think I really have done that.

You mention only one, and it is a biggie.

The whole issue of proving safety is deeply problematic (a whole essay in itself). I tried to talk about it in the above, but there was not enough space to develop it fully.

The basic points are these:

1) Proof-of-correctness techniques can indeed help with some things, but as Selmer Bringsjord found out to his embarrassment at the 2009 AGI conference, there is a problem if anyone thinks that proofs can be had in high-complexity situations. As I explained at that time, there are two things that go into a proof-of-correctness machine: (a) the target whose correctness is to be proved, and (b) a specification of what qualifies as correctness. In those cases where the specification of what qualifies is simply a list of syntactic rules that must be followed, this approach is valid -- but when the specification of what qualifies as "correct" becomes huge (for example, it involves a massive, open-ended stipulation of the full meaning of "moral behavior" or "friendliness toward humanity"), the specification ITSELF will have bugs in it, and the specification will be of the same magnitude as the target! This means that the specification needs to be checked for correctness first ... and you can see that this quickly leads to an infinite regress.

What this means is that when you point to the security engendered by type-checking in functional programming, you seem to be implying that we need something like that for checking or proving the safety of AGI systems, but that is never going to be possible. Yes, certain classes of errors can be guaranteed not to occur in some types of programming, but that will never generalize to systems in which the specification of what counts as correct is large. That is a dream that we need to let go of.

(I started writing a paper about that a few weeks ago, in fact, but I stopped doing it because someone pointed out that MIRI no longer even claims that a rigorous proof of Friendliness is even possible. My understanding is, then, that this point I just made is now generally accepted.)
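The asymmetry between target and specification can be made concrete with a toy sketch (Python, purely illustrative, with hypothetical function names). For sorting, the full correctness spec is two short clauses, far smaller than any realistic implementation, which is exactly the regime where proof-of-correctness works; nothing comparably compact exists for a predicate like "friendly toward humanity":

```python
from collections import Counter

def satisfies_sort_spec(inp, out):
    """Complete correctness spec for sorting: just two clauses, far
    smaller than most sort implementations. This size gap is what makes
    checking tractable."""
    ordered = all(a <= b for a, b in zip(out, out[1:]))  # clause 1: output is ordered
    permutation = Counter(inp) == Counter(out)           # clause 2: same multiset of elements
    return ordered and permutation

# For "this behavior is friendly toward humanity," no comparably compact
# spec exists: the spec would rival the target system in size, and would
# itself need checking, inviting the regress described in the text.
```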

2) However, statistically-oriented 'proofs' can be useful in systems where the overall behavior is caused by a large ensemble of interacting processes, and where the overall behavior cannot be hijacked by any one of the atoms in that ensemble. To wit: we can make dramatically precise statements about the ensemble properties of thermodynamic systems, even though the individual atoms can do whatever they like.

It is that second type of proof-of-friendliness (not really a 'proof', only a statistical argument) that I was alluding to when I talked about Swarm Relaxation systems.

So, I don't think I was just assuming that Swarm Relaxation could lead to stronger claims about friendliness; I was basing the claim on a line of reasoning.
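The thermodynamic analogy can be illustrated numerically. A minimal sketch, assuming independent Gaussian "atom" velocities as a standard toy model: any individual value is unpredictable, yet the ensemble statistic concentrates tightly around its expected value.

```python
import random

random.seed(0)
N = 100_000

# Each "atom" gets an independent velocity; any single one can take
# essentially any value.
velocities = [random.gauss(0.0, 1.0) for _ in range(N)]

# The ensemble statistic (mean squared velocity, i.e. average kinetic
# energy in natural units) concentrates near its expected value of 1.0,
# with fluctuations of order 1/sqrt(N).
mean_energy = sum(v * v for v in velocities) / N
print(f"{mean_energy:.3f}")  # close to 1.000
```

This is the sense in which precise statements about the ensemble coexist with total unpredictability of its members.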

Comment author: Vaniver 05 May 2015 11:20:20PM, 3 points

the most important thing right now is that you have misunderstood my position on the question.

Thanks for the clarification!

But in my paper I was talking only about the apparent contradiction between (a) forcing people to do something as they screamed their protests, while claiming that those people had asked for this to be done, and (b) an assessment that this behavior was both intelligent and sane.

I'm glad to hear that we're focusing on this narrow issue, so let me try to present my thoughts on it more clearly. Unfortunately, this involves bringing up many individual examples of issues, none of which I particularly care about; I'm trying to point at the central issue that we may need to instruct an AI how to solve these sorts of problems in general, or we may run into issues where an AI extrapolates its models incorrectly.

When people talk about interpersonal ethics, they typically think in terms of relationships. Two people who meet in the street have certain rules for interaction, and teachers and students have other rules for interaction, and doctors and patients other rules, and so on. When considering superintelligences interacting with intelligences, the sort of rules we will need seem categorically different, and the closest analogs we have now are system designers and engineers interacting with users.

When we consider people interacting with people, we can rely on 'informed consent' as our gold standard, because it's flexible and prevents most bad things while allowing most good things. But consent has its limits; society extends children only limited powers of consent, reserving many (but not all) of them for their parents; some people are determined mentally incapable, and so on. We have complicated relationships where one person acts in trust for another person (I might be unable to understand a legal document, but still sign it on the advice of my lawyer, who presumably can understand that document, or be unable to understand the implications of undergoing a particular medical treatment, but still do it on the advice of my doctor), because the point of those relationships is that one person can trade their specialized knowledge to another person, but the second person is benefited by a guarantee the first person is actually acting in their interest.

We can imagine a doctor wireheading their patient when the patient did not in fact want to be wireheaded, for a wide variety of reasons. I'm neither a doctor nor a lawyer, so I can't tell you what sort of consent a doctor needs to inject someone with morphine--but it seems to me that sometimes things will be uncertain, and the doctor will drug someone when a doctor with perfect knowledge would not have, but we nevertheless endorse that decision as the best one the doctor could have made at the time.

But informed consent starts being less useful when we get to systems. Consider a system that takes biomedical data from every person in America seeking an organ transplant, and the biomedical data from donated organs as they become available, and matches organs to patients. Everyone involved likely consented to be involved (or, at least, didn't not consent if there's an opt-out organ donation system), but there are still huge ethical questions remaining to be solved. What tradeoffs are we willing to accept between maximizing QALYs and ensuring equity? How much biomedical data can we use to predict the QALYs from any particular transplant, and what constitutes unethical discrimination?

It seems unlikely to me that the main challenge of superintelligence is that a superintelligence will force or trick us into doing things that we obviously don't want to do. It seems likely to me that the main challenge is that it will set up systems, potentially with some form of mandatory participation, and we thus need to create a generalized system architect that can solve those ethical, moral, and engineering problems for us while designing arbitrary systems, without us having coded in the exact solutions.

Notice also that consent applies to individual rights, not community rights, but many people's happiness and livelihoods may rest on community rights. There are already debates over whether or not deafness should be cured: how sad to always be the youngest deaf person, or for deaf culture to disappear, but to avoid that harm, we need some people to be deaf instead of hearing, which is its own harm. Managing this in a way that truly maximizes human flourishing seems like it requires a long description.

Many human ethical problems are solved by rounding small numbers to zero, but superintelligences represent the ability to actually track those small numbers, which means entire legal categories that rest on a sharp divide between 0 and 1 could become smooth. For example, consider 'sexual harassment' defined as 'unwanted advances.' Should a SI censor any advances it thinks that the receiver will not want, or is that taking sovereignty from the receiver to determine whether or not they want any advance?

My understanding is, then, that this point I just made is now generally accepted.

Right, and I agree with it as well. I think the remaining useful insight of functional programming is that minimizing side effects increases code legibility, and if we want to be confident in the reasoning of an AI system (or an AI system wants to be able to confidently predict the impact of a proposed self-modification) we likely want the code to be partitioned as neatly as possible, so any downstream changes or upstream dependencies can be determined simply.
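A minimal sketch of the legibility point, using hypothetical function names: the impure version's result depends on hidden call history, so predicting any one call site requires global knowledge, while the pure version exposes every dependency in its signature.

```python
# Impure: the result depends on hidden mutable state, so reasoning about
# any single call requires knowing the entire call history.
_log = []
def tally_impure(x):
    _log.append(x)
    return x + len(_log)

# Pure: all inputs and outputs are explicit in the signature, so each
# call can be understood, and its downstream impact predicted, in
# isolation -- the partitioning property described above.
def tally_pure(x, history):
    new_history = history + (x,)
    return x + len(new_history), new_history
```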

Neural nets, and related systems, do not have a preference for legibility built into the underlying structure, and so we may not want to use them or related systems for goal management code, or take especial care when connecting them to goal management code.

where the overall behavior cannot be hijacked by any one of the atoms in that ensemble.

Hmm. I'm going to have to think about this claim longer, but right now I disagree. I think the model of human reasoning that seems most natural to me is hierarchical control systems. When I think of "swarms," that implies to me some sort of homogeneity between the agents, such as might describe groups of humans but not necessarily individual humans. (If we consider humans as swarms of neurons, which is how I originally read your statement, then the 'swarm-like' properties map on the control loops of the hierarchical view.)

But it seems to me that if the atoms in the swarm have specialized roles (like neurons), then a small number of atoms behaving strangely can lead to the swarm behaving strangely. (This is easier to see in the controls case, but I think is also sensible in the swarm of neurons model.) I'm thinking of the various extreme cases of psychology as examples: stuff like destroying parts of cats' brains and seeing how they behave, or the various exotic psychological disorders whose causes can be localized in the brain, or so on. That a system is built out of many subcomponents, instead of being a single logical reasoner, does not seem to me to be a significant source of safety.
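A toy contrast between the two architectures (illustrative only): a homogeneous averaging swarm degrades gracefully when one member is corrupted, while a pipeline containing a specialized gating unit can be hijacked by that single unit.

```python
def swarm_output(readings):
    """Homogeneous ensemble: the output is the mean, so one corrupted
    member shifts it by only O(1/n)."""
    return sum(readings) / len(readings)

def gated_output(readings, gate_open):
    """Specialized roles: a single gate unit decides whether the
    ensemble's signal passes at all, so corrupting that one unit
    flips the entire output."""
    signal = sum(readings) / len(readings)
    return signal if gate_open else -signal

healthy = [1.0] * 100
one_bad = [1.0] * 99 + [-1.0]

print(swarm_output(one_bad))         # 0.98: barely perturbed
print(gated_output(healthy, False))  # -1.0: hijacked by one unit
```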

(Now, I do think that various 'moral congress' ideas might represent some sort of safety--if you need many value systems to all agree that something is a good idea, then it seems less likely that extreme alternatives will be chosen in exotic scenarios, and a single value system can be a composite of many simpler value systems. This is ideas like 'bagging' from machine learning applied to goals--but the gains seem minor to me.)
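The 'moral congress' idea above can be sketched as supermajority agreement across independent value systems, in the spirit of bagging; the value systems and threshold below are hypothetical toys.

```python
def congress_approves(action, value_systems, threshold=1.0):
    """Approve an action only if at least `threshold` fraction of the
    value systems endorse it: bagging applied to goals rather than
    predictions. A high threshold makes extreme options unlikely to
    pass in exotic scenarios."""
    votes = sum(1 for endorses in value_systems if endorses(action))
    return votes / len(value_systems) >= threshold

# Three toy value systems voting on an action described by simple features.
value_systems = [
    lambda a: a["harm"] == 0,         # strict non-harm
    lambda a: a["consent"],           # consent-based
    lambda a: not a["irreversible"],  # caution about lock-in
]

print(congress_approves(
    {"harm": 0, "consent": True, "irreversible": False}, value_systems))   # True: unanimous
print(congress_approves(
    {"harm": 0, "consent": False, "irreversible": False}, value_systems))  # False: one system vetoes
```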