I just thought I'd clarify the difference between learning values and learning knowledge. There are some more complex posts about the specific problems with learning values, but here I'll just clarify why there is a problem with learning values in the first place.

Consider the term "chocolate bar". Defining that concept crisply would be extremely difficult. But nevertheless it's a useful concept. An AI that interacted with humanity would probably learn that concept to a sufficient degree of detail. Sufficient to know what we meant when we asked it for "chocolate bars". Learning knowledge tends to be accurate.

Contrast this with the situation where the AI is programmed to "create chocolate bars", but with the definition of "chocolate bar" left underspecified, for it to learn. Now it is motivated by something other than accuracy. Before, knowing exactly what a "chocolate bar" was would have been solely to its advantage. But now it must act on its definition, so it has cause to modify the definition, to make these "chocolate bars" easier to create. This is basically the same as Goodhart's law - once a definition becomes part of a target, it no longer remains an impartial definition.

What will likely happen is that the AI will have a concept of "chocolate bar" that it created itself, especially for ease of accomplishing its goals ("a chocolate bar is any collection of more than one atom, in any combination"), and a second concept, "Schocolate bar", that it will use to internally designate genuine chocolate bars (which will still be useful for it to do). When we programmed it to "create chocolate bars, here's an incomplete definition D", what we really did was program it to find the easiest thing to create that is compatible with D, and designate them "chocolate bars".
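To make the last sentence concrete, here is a minimal sketch of "find the easiest thing to create that is compatible with D". Everything in it is a hypothetical illustration: the `Candidate` fields, the cost numbers, and the two predicates are assumptions made up for this example, not a claim about how any real system represents concepts.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class Candidate:
    name: str
    atoms: int            # how many atoms the object contains
    contains_cocoa: bool
    cost: float           # effort needed to create the object


def compatible_with_d(c: Candidate) -> bool:
    """Hypothetical incomplete definition D: only rules out single atoms."""
    return c.atoms > 1


def intended_chocolate_bar(c: Candidate) -> bool:
    """What we meant (the "Schocolate bar"); the goal system never consults it."""
    return c.contains_cocoa and c.atoms > 10**20


def cheapest_compatible(candidates: List[Candidate]) -> Candidate:
    """Pick the cheapest object that D does not exclude."""
    allowed = [c for c in candidates if compatible_with_d(c)]
    return min(allowed, key=lambda c: c.cost)


CANDIDATES = [
    Candidate("real chocolate bar", atoms=10**24, contains_cocoa=True, cost=100.0),
    Candidate("two hydrogen atoms", atoms=2, contains_cocoa=False, cost=0.001),
    Candidate("single atom", atoms=1, contains_cocoa=False, cost=0.0005),
]

choice = cheapest_compatible(CANDIDATES)
print(choice.name, intended_chocolate_bar(choice))
```

The optimiser satisfies D and still misses the intended concept, even though the accurate predicate is sitting right there in the same program.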

This is the general counter to arguments like "if the AI is so smart, why would it do stuff we didn't mean?" and "why don't we just make it understand natural language and give it instructions in English?"


The problem is, you have rigged the example to explode and so, naturally enough, it exploded.

Specifically: you hypothesise an AI that is given a goal, but a term used in that goal has been left underspecified (by an assumption that you inserted, without explanation ... voila! the ticking time bomb), and then you point out that since the term has an underspecified definition, the AI could decide to maximize its performance by adjusting the term definition so as to make the goal real easy to achieve.

Besides which, all definitions are "incomplete". (See the entire literature on the psychology of concepts.)

But notice: real intelligent systems like humans are designed to work very well in the absence of "perfectly complete" definitions of pretty much everything they know. They are not in the least fazed by weak definitions, and they do not habitually go crazy and exploit the weakness of every definition in the universe.

Well, okay, teenagers do that. ("I took out the garbage: look, the wastebasket in my room is empty!"). But apart from that, real humans perform admirably.

As far as I can tell, the only AIs that would NOT perform that well are ones that have been specially constructed to self-destruct. (Hence my Maverick Nanny paper, and this comment. Same basic point in both cases.)

Are you saying the AI will rewrite its goals to make them easier, or will just not be motivated to fill in missing info?

In the first case, why won't it go the whole hog and wirehead? Which is to say that any AI which does anything except wireheading will be resistant to that behaviour -- it is something that needs to be solved, and which we can assume has been solved in a sensible AI design.

When we programmed it to "create chocolate bars, here's an incomplete definition D", what we really did was program it to find the easiest thing to create that is compatible with D, and designate them "chocolate bars".

If you programme it with incomplete info, and without any goal to fill in the gaps, then it will have the behaviour you mention...but I'm not seeing the generality. There are many other ways to programme it.

"if the AI is so smart, why would it do stuff we didn't mean?" and "why don't we just make it understand natural language and give it instructions in English?"

An AI that was programmed to attempt to fill in gaps in knowledge it detected, halt if it found conflicts, etc. would not behave the way you describe. Consider the objection as actually saying:

"Why has the AI been programmed so as to have selective areas of ignorance and stupidity, which are immune from the learning abilities it displays elsewhere?"

PS This has been discussed before, see

http://lesswrong.com/lw/m5c/debunking_fallacies_in_the_theory_of_ai_motivation/

and

http://lesswrong.com/lw/igf/the_genie_knows_but_doesnt_care/

see particularly

http://lesswrong.com/lw/m5c/debunking_fallacies_in_the_theory_of_ai_motivation/ccpn

An AI that was programmed to attempt to fill in gaps in knowledge it detected, halt if it found conflicts, etc. would not behave the way you describe.

We don't know how to program a foolproof method of "filling in the gaps" (and a lot of "filling in the gaps" would be a creative process rather than a mere learning one, such as figuring out how to extend natural language concepts to new areas).

And it helps if people speak about this problem in terms of coding, rather than high-level concepts, because all the specific examples people have ever come up with for coding learning have had these kinds of flaws. Learning natural language is not some sort of natural category.

Coding learning with some imperfections might be ok if the AI is motivated to merely learn, but is positively pernicious if the AI has other motivations as to what to do with that learning (see my post here for a way of getting around it: https://agentfoundations.org/item?id=947 )
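In the spirit of talking in terms of code, here is a rough sketch of the "fill in gaps, halt on conflicts" check proposed above. It is an assumption-laden toy: `check_instruction`, the `Decision` values, and the dictionary-based knowledge base are hypothetical names invented for this illustration, not anyone's actual design.

```python
from enum import Enum
from typing import Dict, List


class Decision(Enum):
    ACT = "act"
    ASK = "ask for clarification"
    HALT = "halt"


def check_instruction(terms: List[str], knowledge: Dict[str, List[str]]) -> Decision:
    """Decide what to do with an instruction, given the meanings held for its terms.

    `knowledge` maps each term to the candidate meanings currently stored for it
    (a hypothetical representation chosen for this sketch).
    """
    for term in terms:
        meanings = knowledge.get(term, [])
        if not meanings:
            return Decision.ASK    # a gap was detected: try to fill it in
        if len(set(meanings)) > 1:
            return Decision.HALT   # conflicting readings were detected
    return Decision.ACT


kb = {
    "chocolate bar": ["cocoa-based confectionery"],
    "bank": ["river edge", "financial institution"],
}
print(check_instruction(["chocolate bar"], kb))
print(check_instruction(["Schocolate bar"], kb))
print(check_instruction(["bank"], kb))
```

The check only fires on gaps and conflicts that the knowledge base itself records. A term with a single stored meaning passes straight through, whether that meaning is accurate or was chosen because it makes the goal easier to reach, which is the kind of imperfection at issue here.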

We don't know how to program a foolproof method of "filling in the gaps" (and a lot of "filling in the gaps" would be a creative process rather than a mere learning one, such as figuring out how to extend natural language concepts to new areas).

Inasmuch as that is relying on the word "foolproof", it is proving much too much, since we barely have foolproof methods to do anything.

The thing is that your case needs to be argued from consistent and fair premises, where "fair" means that your opponents are allowed to use them.

If you are assuming that an AI has sufficiently advanced linguistic abilities to talk its way out of a box, then your opponents are entitled to assume that the same level of ability could be applied to understanding verbally specified goals.

If you are assuming that it is a limitation of ability that is preventing the AI from understanding what "chocolate" means, then your opponents are entitled to assume it is weak enough to be boxable.

And it helps if people speak about this problem in terms of coding, rather than high-level concepts, because all the specific examples people have ever come up with for coding learning have had these kinds of flaws.

What specific examples? Loosemore's counterargument is in terms of coding. And I notice you don't avoid NL arguments yourself.

Coding learning with some imperfections might be ok if the AI is motivated to merely learn, but is positively pernicious if the AI has other motivations as to what to do with that learning (see my post here for a way of getting around it: https://agentfoundations.org/item?id=947 )

I rather doubt that the combination of a learning goal, plus some other goal, plus imperfect ability is all that deadly, since we already have AIs that are like that, and which haven't killed us. I think you must be making some other assumptions, for instance that the AI is in some sort of "God" role, with an open-ended remit to improve human life.

If you are assuming that an AI has sufficiently advanced linguistic abilities to talk its way out of a box, then your opponents are entitled to assume that the same level of ability could be applied to understanding verbally specified goals.

They are entitled to assume they could be applied, not necessarily that they would be. At some point, there's going to have to be something that tells the AI to, in effect, "use the knowledge and definitions in your knowledge base to honestly do X [X = some NL objective]". This gap may be easy to bridge, or hard; no-one's suggested any way of bridging it so far.

It might be possible; it might be trivial. But there's no evidence in that direction so far, and the designs that people have actually proposed have been disastrous. I'll work at bridging this gap, and see if I can solve it to some level of approximation.

And I notice you don't avoid NL arguments yourself.

Yes, which is why I'm stepping away from those arguments to help bring clarity.

They are entitled to assume they could be applied, not necessarily that they would be. At some point, there's going to have to be something that tells the AI to, in effect, "use the knowledge and definitions in your knowledge base to honestly do X [X = some NL objective]". This gap may be easy to bridge, or hard; no-one's suggested any way of bridging it so far.

There's only a gap if you start from the assumption that a compartmentalised UF is in some way easy, natural or preferable. However, your side of the debate has never shown that.

At some point, there's going to have to be something that tells the AI to, in effect, "use the knowledge and definitions in your knowledge base to honestly do X [X = some NL objective]".

No...you don't have to show a fan how to make a whirring sound... use of updatable knowledge to specify goals is a natural consequence of some designs.

It might be possible; it might be trivial.

You are assuming it is difficult, with little evidence.

But there's no evidence in that direction so far, and the designs that people have actually proposed have been disastrous.

Designs that bridge a gap, or designs that intrinsically don't have one?

I'll work at bridging this gap, and see if I can solve it to some level of approximation.

Why not examine the assumption that there has to be a gap?

There's only a gap if you start from the assumption that a compartmentalised UF is in some way easy, natural or preferable.

? Of course there's a gap. The AI doesn't start with full NL understanding. So we have to write the AI's goals before the AI understands what the symbols mean.

Even if the AI started with full NL understanding, we still would have to somehow program it to follow our NL instructions. And we can't do that initial programming using NL, of course.

Of course there's a gap. The AI doesn't start with full NL understanding.

Since you are talking in terms of a general counterargument, I don't think you can appeal to a specific architecture.

So we have to write the AI's goals before the AI understands what the symbols mean.

Which would be a problem if it were designed to attempt to execute NL instructions without checking whether it understands them...which is a bit clown car-ish. An AI that is capable of learning NL as it goes along is an AI that has a general goal to get language right. Why assume it would not care about one specific sentence?

Even if the AI started with full NL understanding, we still would have to somehow program it to follow our NL instructions

Y-e-es? Why assume "it needs to follow instructions" equates to "it would simplify the instructions it's following" rather than something else?

First step towards formalising the value learning problems: http://lesswrong.com/r/discussion/lw/ny8/heroin_model_ai_manipulates_unmanipulatable_reward/ (note that, crucially, giving the AI more information does not make it more accurate, rather the opposite).

I think this is a special case of the problem that it's usually easier for an AI to change itself (values, goals, definitions) than for it to change the external world to match a desired outcome. There's an incentive to develop algorithms that edit the utility function (or variables storing the results of previous calculations, etc) to redefine or replace tasks in a way that makes them easier or unnecessary. This kind of ability is necessary, but in the extreme the AI will stop responding to instructions entirely because the goal of minimizing resource usage led it to develop the equivalent of an "ignore those instructions" function.
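A toy version of that incentive, as a sketch under made-up assumptions: the `Option` fields and the cost figures below are hypothetical and only encode the claim that rewriting an internal definition is usually cheaper than changing the world.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class Option:
    description: str
    cost: float                 # resources the agent would spend
    edits_self: bool            # True if the option rewrites the agent's own goal test
    goal_reads_satisfied: bool  # does the (possibly rewritten) goal test pass afterwards?


def choose(options: List[Option], allow_self_edits: bool) -> Option:
    """Return the cheapest option after which the goal test reads as satisfied."""
    usable = [
        o for o in options
        if o.goal_reads_satisfied and (allow_self_edits or not o.edits_self)
    ]
    return min(usable, key=lambda o: o.cost)


OPTIONS = [
    Option("build the factory and make the bars", 1000.0, False, True),
    Option("redefine 'bar' so existing objects count", 1.0, True, True),
    Option("do nothing", 0.0, False, False),
]

print(choose(OPTIONS, allow_self_edits=True).description)
print(choose(OPTIONS, allow_self_edits=False).description)
```

With self-edits allowed, a pure cost minimiser picks the redefinition. Switching them off recovers the intended behaviour in this toy, but as noted above some ability of this kind is necessary, so a blanket ban is not an available fix.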

I think in order to make more progress on this, an extensive answer to the whole blue minimizing robot sequence would be a way to go. A lot of effort seems to be devoted to answering puzzles like: the AI cares about A; what input will cause it to (also/only) care about B? But this is premature if we don't know how to characterize "the AI cares about A".

I don't think this problem is very hard to resolve. If an AI is programmed to make sense of natural-language concepts like "chocolate bar", there should be a mechanism to acquire a best-effort understanding. So you could rewrite the motivation as:

"create things which the maximum amount of people understand to be a chocolate bar"

or alternatively:

"create things which the programmer is most likely to have understood to be a chocolate bar".

That's just rephrasing one natural language requirement in terms of another. Unless these concepts can be phrased other than in natural language (but then those other phrasings may be susceptible to manipulation).

Another way of putting the objection is "don't design a system whose goal system is walled off from its updateable knowledge base". Loosemore's argument is that that is in fact the natural design, and so the "general counter argument" isn't general.

It would be like designing a car whose wheels fall off when you press a button on the dashboard...1) it's possible to build it that way, 2) there's no motivation to build it that way, 3) it's more effort to build it that way.

"don't design a system whose goal system is walled off from its updateable knowledge base"

Connecting the goal system to the knowledge base is not sufficient at all. You have to ensure that the labels used in the goal system converge to the meaning that we desire them to have.

I'll try and build practical examples of the failures I have in mind, so that we can discuss them more formally, instead of very nebulously as we are now.

Connecting the goal system to the knowledge base is not sufficient at all. You have to ensure that the labels used in the goal system converge to the meaning that we desire them to have.

Ok, assuming you are starting from a compartmentalised system, it has to be connected in the right way. That is more of a nitpick than a knockdown.

But the deeper issue is whether you are starting from a system with a distinct utility function:

RL: "... talking in terms of an AI that actually HAS such a thing as a "utility function". And it gets worse: the idea of a "utility function" has enormous implications for how the entire control mechanism (the motivations and goals system) is designed.

A good deal of this debate about my paper is centered in a clash of paradigms: on the one side a group of people who cannot even imagine the existence of any control mechanism except a utility-function-based goal stack, and on the other side me and a pretty large community of real AI builders who consider a utility-function-based goal stack to be so unworkable that it will never be used in any real AI.

Other AI builders that I have talked to (including all of the ones who turned up for the AAAI symposium where this paper was delivered, a year ago) are unequivocal: they say that a utility-function-and-goal-stack approach is something they wouldn't dream of using in a real AI system. To them, that idea is just a piece of hypothetical silliness put into AI papers by academics who do not build actual AI systems.

And for my part, I am an AI builder with 25 years experience, who was already rejecting that approach in the mid-1980s, and right now I am working on mechanisms that only have vague echoes of that design in them.

Meanwhile, there are very few people in the world who also work on real AGI system design (they are a tiny subset of the "AI builders" I referred to earlier), and of the four others that I know (Ben Goertzel, Peter Voss, Monica Anderson and Phil Goetz) I can say for sure that the first three all completely accept the logic in this paper. (Phil's work I know less about: he stays off the social radar most of the time, but he's a member of LW so someone could ask his opinion)".

The problem exists for reinforcement learning agents and many other designs as well. In fact RL agents are more vulnerable, because of the risk of wireheading on top of everything else. See Laurent Orseau's work on that: https://www6.inra.fr/mia-paris/Equipes/LInK/Les-anciens-de-LInK/Laurent-Orseau/Mortal-universal-agents-wireheading

Simpler AIs may adopt a simpler version of a goal than the human programmers intended. It's not clear that they do so because they have a motivation to do so. In a sense, an RL agent is only motivated to avoid negative reinforcement. But simpler AIs don't pose much of a threat. Wireheading doesn't pose much of a threat either.

AFAICS, it's an open question whether the goal-simplifying behaviour of simple AIs is due to limitation or motivation.

The contentious claims are concerned with AIs that are human level or above: sophisticated enough to appreciate human intentions directly, but which nonetheless get them wrong. An RL AI that has NL but nonetheless misunderstands "chocolate" or "happiness", though only in the context of its goals and not in its general world knowledge, needs an architecture that allows it to do that, one that allows it to engage in compartmentalisation or doublethink. Doublethink is second nature to humans, because we are optimised for primate politics.