Eliezer_Yudkowsky comments on The Hidden Complexity of Wishes - Less Wrong

58 Post author: Eliezer_Yudkowsky 24 November 2007 12:12AM


Comment author: Eliezer_Yudkowsky 31 August 2013 06:15:30PM 5 points [-]

It'd work great if 'affecting' wasn't secretly a Magical Category based on how you partition physical states into classes that are instrumentally equivalent relative to your end goals.

Comment author: Kawoomba 31 August 2013 06:59:38PM *  0 points [-]

Point. I'd still expect some variant of "keep (general) interference minimal / do not perturb human activity / build your models using the minimal actions possible" to be easier to formalize than human friendliness, wouldn't you?

Comment author: shminux 31 August 2013 07:20:24PM *  0 points [-]

One usual caveat is reflective consistency: are you OK with creating a faithful representation of humans in these models and then terminating them? If so, how do you know you are not one of those models?

Comment author: RobbBB 31 August 2013 07:30:08PM *  1 point [-]

A relatively non-scary possibility: The AI destroys itself, because that's the best way to ensure it doesn't positively 'affect' others in the intuitive sense you mean. (Though that would still of course have effects, so this depends on reproducing in AI our intuitive concept of 'side-effect' vs. 'intended effect'....)

Scarier possibilities, depending on how we implement the goal:

  • the AI doesn't kill you and then simulate you; rather, it kills you and then simulates a single temporally locked frame of you, to minimize the possibility that it (or anything) will change you.

  • the AI just kills everyone, because a large and drastic change now reduces to ~0 the probability that it will cause any larger perturbations later (e.g., when humans might have a big galactic civilization that it would be a lot worse to perturb).

  • the AI has a model of physics on which all of its actions (eventually) have a roughly equal effect on the atoms that at present compose human beings. So it treats all its possible actions (and inactions) as equivalent, and ignores your restriction in making decisions.

Comment author: Kawoomba 31 August 2013 08:19:47PM 0 points [-]

Yes, implementing such a goal is not easy and has pitfalls of its own; however, it's probably easier than the alternative, since a metric for "no large-scale effects" seems easier to formalize than "human friendliness", where we have little idea of what that's even supposed to mean.

Comment author: Eliezer_Yudkowsky 31 August 2013 09:39:33PM 2 points [-]

The trouble is that communicating with a human or helping them build the real FAI in any way is going to strongly perturb the world. So actually getting anything useful this way requires solving the problem of which changes to humans, and consequent changes to the world, are allowed to result from your communication-choices.

Comment author: Kawoomba 31 August 2013 09:55:00PM *  0 points [-]

helping them build the real FAI

Except it's not, as far as the artificial agent is concerned:

Its goals are strictly limited to "develop your models using the minimal actions possible [even 'just parse the internet, do not use anything beyond wget' could suffice], after x number of years have passed, accept new goals from y source." The new goals could be anything. (It could even be a boat!).

The usefulness regarding FAI becomes evident only at that latter stage, stemming from the foom'ed AI's models being used to parse the new goals of "do that which I'd want you to do". It's sidestepping the big problem (aka "cheating"), but so what?

Comment author: Eliezer_Yudkowsky 31 August 2013 11:28:12PM 5 points [-]

It's allowed to emit arbitrary HTTP GETs? You just lost the game.

Comment author: Kawoomba 01 September 2013 06:41:15AM *  0 points [-]

Ah, you mean because wget can invoke e.g. PHP functions / inject SQL code, thus gaining control of other computers, etc.?
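
To make the objection concrete: an HTTP GET is not a passive read. A minimal sketch (the host and payload below are hypothetical, purely for illustration) of how an ordinary GET URL can smuggle an SQL-injection payload that a vulnerable server would execute as a side effect of "just fetching":

```python
# Illustration only: a GET request can carry arbitrary payloads in its
# query string. A server that splices the 'q' parameter into an SQL
# statement unescaped would execute the injected fragment.
from urllib.parse import urlencode

base = "http://example.com/search"            # placeholder host
payload = {"q": "x'; DROP TABLE users; --"}   # classic injection fragment
url = base + "?" + urlencode(payload)         # percent-encodes the payload
print(url)
```

Anything that can choose which URLs to fetch can therefore exert causal influence on the machines it "merely reads from", which is why restricting the agent to wget buys less safety than it appears to.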

A sturdier approach to just getting data would be to allow it only to passively listen in on some Tier 1 provider's backbone (no manipulation of the data flow other than mirroring packets, which is easy to formalize). Once that goal is formulated, the agent wouldn't want to circumvent it.

Still seems plenty easier to solve than "friendliness", as is programming it to ask for new goals after x time. Maintaining invariants under self-modification remains as a task.

It's not fruitful for me to propose implementations (even though I just did, heh) and for someone else to point out holes (I don't mean to solve that task in 5 minutes), just as it wouldn't be for you to propose full-fledged implementations of friendliness and for someone else to point out holes. Both are non-trivial tasks.

My question is this: given your current interpretation of both approaches ("passively absorb data, ask for new goals after x time" vs. "implement friendliness in the pre-foomed agent outright"), which seems more manageable while still resulting in an FAI?