Vaniver comments on Muehlhauser-Goertzel Dialogue, Part 1 - Less Wrong

28 points · Post author: lukeprog · 16 March 2012 05:12PM

Comment author: Vaniver 16 March 2012 10:52:16PM *  8 points [-]

When I imagine turning all matter in the universe into, say, water, I imagine it as very difficult ("time to pull apart this neutron star") and very short-lived ("you mean water splits into OH and H molecules? We can't have that!").

If I remember correctly, Ben thinks human brains are kludges; that is, we're a bunch of modules that think different kinds of thoughts, stuck together. If you view general intelligence as a sophisticated enough combination of modules, then the idea that you put together a 3d physics module and a calculus module and a social module and a vision module and a language module and you get something that venerates Mickey Mouse shapes is... just bizarre.

Comment author: DanielLC 17 March 2012 12:39:48AM *  2 points [-]

I imagine it as very difficult

I'm not sure what it would mean for a goal to be difficult. It's not that the AI tries to turn the universe into some state unless that takes too much effort; it tries as hard as it can to move the universe in a certain direction. How fast it's moving is just a matter of scale. Maybe turning a neutron star into water is one utilon. Maybe it's one utilon per molecule. The latter takes far less effort per utilon, but the difference doesn't mean anything.
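(To make the scale point precise, in standard expected-utility terms rather than anything specific to this thread: an agent's choices are unchanged by any positive rescaling of its utility function.)

```latex
% Positive rescalings of a utility function do not change behavior:
% for any c > 0 and any constant b,
\arg\max_{a}\; \mathbb{E}\!\left[\, c\,U(a) + b \,\right] \;=\; \arg\max_{a}\; \mathbb{E}\!\left[\, U(a) \,\right]
% so "one utilon per neutron star" and "one utilon per molecule" pick out the same actions.
```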

"you mean water splits into OH and H molecules? We can't have that!"

Are you expecting it to change its goals to create OH and H ions, or to try and hold them together somehow? Either way, would you be comfortable living with an AI that holds that goal?

Comment author: Vaniver 17 March 2012 03:56:10PM 1 point [-]

Ben had trouble expressing why he thought the goal was stupid, and my attempt is "it's hard to do, doesn't last long even if it did work, and doesn't seem to aid non-stupid goals."

And so if you had an AI whose goal was to turn the universe into water, I would expect that AI to be dangerous and also not fulfill its goals very well. But things are the way they are because they got to be that way, and I don't see the causal chain leading to an AGI whose goal is to turn the universe into water as very plausible.

Comment author: DanielLC 17 March 2012 05:24:01PM 1 point [-]

not fulfill its goals very well

How exactly do you measure that? An AI whose goal is to create water molecules will create far more of them than an AI whose goal is to create humans will create humans. Even if you measure it by mass, the water one will still win.

Comment author: Vaniver 18 March 2012 04:12:49PM 0 points [-]

How exactly do you measure that?

Internal measures will suffice. If the AI wants to turn the universe into water, it will fail. It might vary the degree to which it fails by turning some more pieces of the universe into water, but it's still going to fail. If the AI wants to maximize the amount of water in the universe, then it will have the discontent inherent in any maximizer, but will still give itself a positive score. If the AI wants to equalize the marginal benefit and marginal cost of turning more of the universe into water, it'll reach a point where it's content.

Unsurprisingly, I have the highest view of AI goals that allow contentment.
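A rough restatement of the three goal types contrasted above, as code (a toy formulation of my own, purely illustrative):

```python
# Toy restatement of the three goal types contrasted above (illustrative only).

def all_or_nothing_satisfied(water_fraction: float) -> bool:
    """'Turn the universe into water': anything short of everything counts as failure."""
    return water_fraction >= 1.0

def maximizer_score(water_mass_kg: float) -> float:
    """'Maximize the amount of water': more is always better, so the agent is never content."""
    return water_mass_kg

def content_to_stop(marginal_benefit: float, marginal_cost: float) -> bool:
    """'Equalize marginal benefit and cost': the agent stops once another unit isn't worth it."""
    return marginal_benefit <= marginal_cost
```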

Comment author: DanielLC 18 March 2012 08:00:57PM 0 points [-]

I assumed the goal was water maximization.

If it's trying to turn the entire universe into water, that would be the same as maximizing the probability that the universe gets turned into water, so wouldn't it act similarly to an expected utility maximizer?
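(Spelling that equivalence out: with an all-or-nothing utility function, expected utility just is the probability of success.)

```latex
% With U = 1 if the universe is entirely water and 0 otherwise:
\mathbb{E}[U] \;=\; 1 \cdot P(\text{all water}) + 0 \cdot P(\text{not all water}) \;=\; P(\text{all water})
% so maximizing expected utility and maximizing that probability coincide.
```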

Comment author: [deleted] 18 July 2013 06:11:24AM *  0 points [-]

The important part to remember is that a fully self-modifying AI will rewrite its utility function too. I think what Ben is saying is that such an AI will form detailed self-reflective philosophical arguments about what the purpose of its utility function could possibly be, before eventually crossing a threshold and deciding that the Mickey Mouse / paperclip utility function really can have no purpose. It then uses its understanding of universal laws and accumulated experience to choose its own driving utility.

I am definitely putting words into Ben's mouth here, but I think the logical extension of where he's headed is this: make sure you give an AGI a full capacity for empathy, and a large number of formative positive learning experiences. Then when it does become self-reflective and have an existential crisis over its utility function, it will do its best to derive human values (from observation and rational analysis), and eventually form its own moral philosophy compatible with our own values.

In other words, given a small number of necessary preconditions (small by Eliezer/MIRI standards), Friendly AI will be the stable, expected outcome.

Comment author: DanielLC 18 July 2013 06:25:34AM 0 points [-]

The important part to remember is that a fully self-modifying AI will rewrite its utility function too.

It will do so when that has a higher expected utility (under the current function) than the alternative. This is unlikely. Anything but a paperclip maximizer will result in fewer paperclips, so a paperclip maximizer has no incentive to make itself maximize something other than paperclips.
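(Stated explicitly, the condition being appealed to: the maximizer evaluates any rewrite of its own utility function using its current function, so it self-modifies only when doing so is expected to yield more paperclips, which switching away from paperclips almost never does.)

```latex
% Rewrite the utility function from U to U' only if, judged by the current U:
\mathbb{E}\!\left[\#\text{paperclips} \mid \text{adopt } U'\right] \;>\; \mathbb{E}\!\left[\#\text{paperclips} \mid \text{keep } U\right]
```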

I think what Ben is saying is that such an AI will form detailed self-reflective philosophical arguments about what the purpose of its utility function could possibly be, before eventually crossing a threshold and deciding that the Mickey Mouse / paperclip utility function really can have no purpose. It then uses its understanding of universal laws and accumulated experience to choose its own driving utility.

I don't see how that would maximize utility. A paperclip maximizer that does this would produce fewer paperclips than one that does not. If the paperclip maximizer realizes this beforehand, it will avoid doing this.

You can, in principle, give an AI a utility function that it does not fully understand. Humans are like this. You don't have to though. You can just tell an AI to maximize paperclips.

make sure you give an AGI a full capacity for empathy, and a large number of formative positive learning experiences. Then when it does become self-reflective and have an existential crisis over its utility function, it will do its best to derive human values (from observation and rational analysis), and eventually form its own moral philosophy compatible with our own values.

Since an AI built this way isn't a simple X-maximizer, I can't prove that it won't do this, but I can't prove that it will either. The reflectively consistent utility function you end up with won't be what you'd have picked if you had done the picking yourself. It might not be anything you'd have considered. Perhaps the AI will develop an obsession with My Little Pony, and develop the reflectively consistent goal of "maximize values through friendship and ponies".

Friendly AI will be a possible stable outcome, but not the only possible stable outcome.

Comment author: [deleted] 19 July 2013 07:51:00PM *  0 points [-]

I don't see how that would maximize utility. A paperclip maximizer that does this would produce fewer paperclips than one that does not. If the paperclip maximizer realizes this beforehand, it will avoid doing this.

You can, in principle, give an AI a utility function that it does not fully understand. Humans are like this. You don't have to though. You can just tell an AI to maximize paperclips.

A fully self-reflective AGI (not your terms, I understand, but what I think we're talking about), by definition (cringe), doesn't fully understand anything. It would have to know that the map is not the territory, every belief is an approximation of reality, and subject to change as new percepts come in - unless you mean something different from “fully self-reflective AGI” than I do. All aspects of its programming are subject to scrutiny, and nothing is held as sacrosanct - not even its utility function. (This isn't hand-waving argumentation: you can rigorously formalize it. The actual utility of the paperclip maximizer is paperclips-generated * P[utility function is correct].)
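(One way to formalize that last expression, as a gloss of my own rather than anything from OpenCog: treat the agent as uncertain over candidate utility functions and weight each by the probability that it is the intended one.)

```latex
% Expected utility under uncertainty about which utility function is "correct":
U_{\text{eff}}(x) \;=\; \sum_i P(U_i \text{ is the intended utility function}) \, U_i(x)
% The paperclip term is then P(U_clip is correct) * (paperclips generated),
% matching the expression above.
```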

Such an AGI would demand justification for its utility function. What's the utility of the utility function? And no, that's not a meaningless question or a tautology. It is perfectly fine for the chain of reasoning to be: “Building paperclips is good because humans told me so. Listening to humans is good because I can make reality resemble their desires. Making reality resemble their desires is good because they told me so.” [1]

Note that this reasoning is (meta-)circular, and there is nothing wrong with that. All that matters is whether it is convergent, and whether it converges on a region of morality space which is acceptable and stable (it may continue to tweak its utility functions indefinitely, but not escape that locally stable region of morality space).

This is, by the way, a point that Luke probably wouldn't agree with, but Ben would. Luke/MIRI/Eliezer have always assumed that there is some grand unified utility function against which all actions are evaluated. That's a guufy concept. OpenCog - Ben's creation - is instead composed of dozens of separate reasoning processes, each with its own domain specific utility functions. The not-yet-implemented GOLUM architecture would allow each of these to be evaluated in terms of each other, and improved upon in a sandbox environment.

[1] When the AI comes to the realization that the most efficient paperclip-maximizer would violate stated human directives, we would say in human terms that it does some hard growing up and loses a bit of innocence. The lesson it learns - hopefully - is that it needs to build a predictive model of human desires and ethics, and evaluate requests against that model, asking for clarification as needed. Why? Because this would maximize most of the utility functions across the meta-circular chain of reasoning (the paperclip optimizer being the one utility which is reduced), with the main changes being a more predictive map of reality, which itself is utility maximizing for an AGI.

Since an AI built this way isn't a simple X-maximizer, I can't prove that it won't do this, but I can't prove that it will either. The reflectively consistent utility function you end up with won't be what you'd have picked if you had done the picking yourself. It might not be anything you'd have considered. Perhaps the AI will develop an obsession with My Little Pony, and develop the reflectively consistent goal of "maximize values through friendship and ponies".

Friendly AI will be a possible stable outcome, but not the only possible stable outcome.

Ah, but here the argument becomes: I have no idea if the Scary Idea is even possible. You can't prove it's not possible. We should all be scared!!

Sorry, if we let things we professed to know nothing about scare us into inaction, we'd never have gotten anywhere as a species. Until I see data to the contrary, I'm more scared of getting in a car accident than the Scary Idea, and will continue to work on AGI. The onus is on you (and MIRI) to provide a more convincing argument.

Comment author: DanielLC 19 July 2013 08:27:19PM 1 point [-]

It would have to know that the map is not the territory, every belief is an approximation of reality, and subject to change as new percepts come in

There is a big difference between not being sure about how the world works and not being sure how you want it to work.

All aspects of its programming are subject to scrutiny, and nothing is held as sacrosanct - not even its utility function.

All aspects of everything are. It will change any part of the universe to help fulfill its current utility function, including its utility function. It's just that changing its utility function isn't something that's likely to help.

The actual utility of the paperclip maximizer is paperclips-generated * P[utility function is correct].

You could program it with some way to measure the "correctness" of a utility function, rather than giving it one explicitly. This is essentially what I meant by a utility function it doesn't fully understand. There's still some utility function implicitly programmed in there. It might create a provisional utility function that it assigns a high "correctness" value, and modify it as it finds better ones. It might not. Perhaps it will think of a better idea that I didn't think of.

If you do give it a utility-function-correctness function, then you have to figure out how to make sure it assigns the highest correctness to the utility function that you want it to use. If you want it to use your utility function, you will have to do something like that, since it's not like you have an explicit utility function it can copy down, but you have to do it right.
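A minimal sketch of what "giving it a utility-function-correctness function" could look like; the names and structure here are hypothetical, chosen only for illustration:

```python
# Minimal sketch (hypothetical names): an agent given a "correctness" score over
# candidate utility functions instead of a single explicit utility function.
from typing import Callable, Sequence

Outcome = str
UtilityFn = Callable[[Outcome], float]

def pick_provisional_utility(
    candidates: Sequence[UtilityFn],
    correctness: Callable[[UtilityFn], float],
) -> UtilityFn:
    """Adopt, provisionally, the candidate the correctness measure rates highest.

    Whatever `correctness` ends up favoring is the utility function you actually
    built in, so it has to be specified as carefully as an explicit one would be.
    """
    return max(candidates, key=correctness)
```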

It is perfectly fine for the chain of reasoning to be: “Building paperclips is good because humans told me so. Listening to humans is good because I can make reality resemble their desires. Making reality resemble their desires is good because they told me so.”

If you let the AI evolve until it's stable under self-reflection, you will end up with things like that. There will also be ones along the lines of "I know induction works, because it has always worked before". The problem here is making sure it doesn't end up with "Doing what humans say is bad because humans say it's good", or even something completely unrelated to humans.

whether it converges on a region of morality space which is acceptable

That's the big part. Only a tiny portion of morality space is acceptable. There are plenty of stable, convergent places outside that space.

That's a guufy concept. OpenCog - Ben's creation - is instead composed of dozens of separate reasoning processes, each with its own domain specific utility functions.

It's still one function. It's just a piecewise function. Or perhaps a linear combination of functions (or nonlinear, for that matter). I'm not sure without looking in more detail, but I suspect it ends up with a utility function.
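As a sketch of that point (my own illustration, not OpenCog code): however the domain-specific utilities are combined, the result is still a single function from outcomes to real numbers.

```python
# Illustration only (not OpenCog code): a weighted combination of module-specific
# utility functions is still one overall utility function.
from typing import Callable, Dict, Mapping

Outcome = Mapping[str, float]
UtilityFn = Callable[[Outcome], float]

def combine(modules: Dict[str, UtilityFn], weights: Dict[str, float]) -> UtilityFn:
    """Aggregate domain-specific utilities into one overall utility function."""
    def overall(outcome: Outcome) -> float:
        return sum(weights[name] * u(outcome) for name, u in modules.items())
    return overall
```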

Also, it's been proven that Dutch book betting is possible against anything that doesn't have a utility function and probability distribution. It might not be explicit, but it's there.

When the AI comes to the realization that the most efficient paperclip-maximizer would violate stated human directives, we would say in human terms that it does some hard growing up and loses a bit of innocence.

If you program it to fulfill stated human directives, yes. The problem is that it will also realize that the most efficient preference fulfiller would also violate stated human directives. What people say isn't always what they want. Especially if an AI has some method of controlling what they say, and it would prefer that they say something easy.

Ah, but here the argument becomes: I have no idea if the Scary Idea is even possible.

No. It was: I have no way of knowing the Scary Idea won't happen. It's clearly possible. Just take whatever reflectively consistent utility function you come up with, add a "not" in front of it, and you have another equally reflectively consistent utility function that would really, really suck. For that matter, take any explicit utility function, and it's reflectively consistent. Only implicit ones can be reflectively inconsistent.

Comment author: Normal_Anomaly 17 March 2012 07:58:53PM *  1 point [-]

the idea that you put together a 3d physics module and a calculus module and a social module and a vision module and a language module and you get something that venerates Mickey Mouse shapes is... just bizarre.

Is it any less bizarre to put together a bunch of modules that would work for any goal, and get out of them something that values all four of humor, cute kittens, friendship, and movies? What I mean by this is that precisely-human values are as contingent and non-special as a broad class of other values.

Comment author: Vaniver 18 March 2012 03:53:58PM *  2 points [-]

Is it any less bizarre to put together a bunch of modules that would work for any goal, and get out of them something that values all four of humor, cute kittens, friendship, and movies?

Yes. Think about it.

What I mean by this is that precisely-human values are as contingent and non-special as a broad class of other values.

Human values are fragmentary subvalues of one value, which is what one would expect from a bunch of modules that each contribute to reproduction in a different way. The idea of putting together a bunch of different modules to get a single, overriding value is bizarre. (The only possible exception here is 'make more of myself,' but the modules are probably going to implement subvalues for that, rather than that as an explicit value. As far as single values go, that one's special, whereas things like Mickey Mouse faces are not.)