[Link] A review of proposals toward safe AI

LESSWRONG

by XiXiDu

5th Apr 2011

Eliezer Yudkowsky set out to define more precisely what it means for an entity to have “what people really want” as a goal. Coherent Extrapolated Volition was his proposal. Though CEV was never meant as more than a working proposal, his write-up provides the best insights to date into the challenges of the Friendly AI problem, its pitfalls, and possible paths to a solution.

[...]

Ben Goertzel responded with Coherent Aggregated Volition, a simplified variant of CEV. In CAV, the entity’s goal is a balance between the desires of all humans, but it looks at the volition of humans directly, without extrapolation to a wiser future. This omission is not just to make the computation easier (it is still quite intractable), but to show some respect to humanity’s desires as they are, without extrapolation to a hypothetical improved morality.

[...]

Stuart Armstrong’s “Chaining God” is a different approach, aimed at the problem of interacting with and trusting the good will of an ultraintelligence so far beyond us that we have nothing in common with it. In a succession of AIs of gradually increasing intelligence, each guarantees the trustworthiness of one that is slightly smarter than it. This resembles Yudkowsky’s idea of a self-improving machine which verifies that its next stage has the same goals, but here the successive levels of intelligence remain active simultaneously, so that they can continue to verify Friendliness.

Ray Kurzweil thinks that we will achieve safe ultraintelligence by gradually becoming that ultraintelligence. We will merge with the rising new intelligence, whether by interfacing with computers or by uploading our brains to a computer substrate.

Link: adarti.blogspot.com/2011/04/review-of-proposals-toward-safe-ai.html

10 comments

[-]jimrandomh15y80

I just took a look at Ben Goertzel's CAV (Coherent Aggregated Volition). As far as I can tell, it includes people's death-to-outgroups volitions unmodified and thereby destroys the world, whereas CEV (which came first) doesn't. He presents the desire to murder as an example, fails to address it, and then goes on to talk about running experiments on aggregating the volitions of trivial, non-human agents. That looks like a serious rationality failure in the direction of ignoring danger, and I get the same impression from his other writing, too.

The more of Ben Goertzel's writing I read, the less comfortable I am with him controlling OpenCog. If OpenCog turns into a seed AI, I don't think it's safe for him to be the one making the launch/no-launch decision. I also don't think it's safe for him to be setting directions for the project before then, either.

[-]TheOtherDave15y50

it includes people's death-to-outgroups volitions unmodified [...] whereas CEV (which came first) doesn't

Is there a pointer available to the evidence that an "extrapolation" process a la CEV actually addresses this problem? (Or, if practical, can it be summarized here?)

I've read some but not all of the CEV literature, and I understand that this process is intended to solve this problem, but I haven't been able to grasp from that how we know it actually does.

It seems to depend on the idea that if we had world enough and time, we would outgrow things like "death-to-outgroups," and therefore a sufficiently intelligent seed AI tasked with extrapolating what we would want given world enough and time will naturally come up with a CEV that doesn't include such things... perhaps because such things are necessarily instrumental values rather than reflectively stable terminal values, perhaps for other reasons.

But surely there has to be more to it than that, as the "world enough and time" theory seems itself unjustified.

[-]jimrandomh15y20

Is there a pointer available to the evidence that an "extrapolation" process a la CEV actually addresses this problem?

I think there's some uncertainty about that, actually. The extrapolation procedure is never really specified in CEV, and I could imagine some extrapolation procedures which probably do eliminate the death-to-outgroups volition, and some extrapolation procedures which don't. So an actual implementation would have a lot of details to fill in, and there are ways of filling in those details which would be bad (but this is true of everything about AI, really).

Underspecification is a problem with CEV, and Goertzel's CAV paper rightly complains about it. The problem is that he proposes to stick an identity function into the extrapolation-procedure slot, and the identity function is one of the procedures that fails.
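To make the "slot" framing concrete, here is a toy sketch in Python. Everything in it is an assumption made for illustration: neither write-up specifies a data model, so the `Volition` dictionary, the averaging in `aggregate`, and the outcome names are all hypothetical. In particular, `drop_outgroup_harm` is not CEV's extrapolation (which is unspecified); it simply hard-codes the desired answer to show what a non-identity procedure in the slot would have to accomplish.

```python
from typing import Callable, Dict, List

# Hypothetical toy model: a "volition" maps an outcome name to how strongly
# one person wants it. Nothing here comes from the CEV or CAV write-ups.
Volition = Dict[str, float]
Extrapolate = Callable[[Volition], Volition]


def aggregate(volitions: List[Volition], extrapolate: Extrapolate) -> Volition:
    """Run each volition through the extrapolation slot, then average per outcome."""
    if not volitions:
        raise ValueError("need at least one volition to aggregate")
    extrapolated = [extrapolate(v) for v in volitions]
    outcomes = {name for v in extrapolated for name in v}
    return {
        name: sum(v.get(name, 0.0) for v in extrapolated) / len(extrapolated)
        for name in outcomes
    }


def identity(volition: Volition) -> Volition:
    """The slot left empty: desires pass through exactly as they are."""
    return dict(volition)


def drop_outgroup_harm(volition: Volition) -> Volition:
    """Assumed stand-in for one imaginable extrapolation; it hard-codes the answer."""
    return {name: w for name, w in volition.items() if name != "harm_outgroup"}


people = [{"peace": 1.0, "harm_outgroup": 0.5}, {"peace": 0.5}]
unmodified = aggregate(people, identity)
filtered = aggregate(people, drop_outgroup_harm)
print("harm_outgroup" in unmodified, "harm_outgroup" in filtered)
```

With the identity function in the slot, the harmful desire survives aggregation, just diluted. Whether any real extrapolation procedure would remove it, rather than a filter that assumes the conclusion, is exactly the open question.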

[-]TheOtherDave15y60

(nods) I haven't read the CAV paper and am very much not defending it. You just sounded knowledgeable about CEV so I figured I'd take the opportunity to ask an unrelated question that's been bugging me pretty much since I got here.

Agreed that there are lots of implementation details to work out either way (to say the least), but this seems like more than an implementation detail.

The whole idea underlying the FAI-through-CEV enterprise is that providing a seed AI with direct access to human minds is a better source for a specification of the values we want an AI to optimize for than, say, explicitly trying to develop an ideal ethical theory.

And the argument I've seen in favor of that idea has mostly focused on the many ways that a formalized explicit ethical theory can go horribly horribly wrong, which is entirely fair.

But it's absurd to say "A won't work, let's do B" if B will fail just as badly as A does. If pointing to human minds and saying "go!" doesn't reliably exclude values that, if optimized for, go horribly horribly wrong, then it really does seem like a fundamentally different third option is needed.

That's not just an implementation detail, and it's not just a place where it's possible to get something wrong.

[-]XiXiDu15y20

Here is an interesting interview between Hugo de Garis and Ben Goertzel:

Gut feeling: I’d probably sacrifice myself to create a superhuman artilect, but not my kids…. I do have huge ambitions and interests going way beyond the human race – but I’m still a human.

[...]

And the better an AGI theory we have, the more intelligently we’ll be able to bias the odds. But I doubt we’ll be able to get a good AGI theory via pure armchair theorizing. I think we’ll get there via an evolving combination of theory and experiment – experiment meaning, building and interacting with early-stage proto-AGI systems of various sorts.

[-]Normal_Anomaly15y00

experiment meaning, building and interacting with early-stage proto-AGI systems of various sorts.

I'm not very familiar with Goertzel's ideas. Does he recognize the importance of not letting the proto-AGI systems self-improve while their values are uncertain?

[-]benelliott15y20

From what I've gathered, Ben thinks that these experiments will reveal that friendliness is impossible, that 'be nice to humans' is not a stable value. I'm not sure why he thinks this.

[-]cafesofie15y10

OpenCog is open source anyway: anything Goertzel can do can be done by anyone else. If Goertzel didn't think it was safe to run, what's stopping someone else from running it?

[-]benelliott15y00

Isn't that even worse?

[-]timtyler15y-10

Doesn't practically everyone want their intelligent machines to be nice? More examples:

Respectful AI Project Page - Tim Freeman

A Proposed Design for Distributed Artificial General Intelligence - Matt Mahoney
