Stuart_Armstrong comments on Domesticating reduced impact AIs - All

Post author: Stuart_Armstrong 14 February 2013 04:59PM

Comment author: Stuart_Armstrong 14 February 2013 05:45:08PM *  1 point [-]

Thinking about possible ways to munchkin this setup...

:-D That's exactly what I want.

One example: create a copy of the universe...

Don't get that - can you develop the idea? "create a copy of the universe" seems... non-trivial.

Comment author: shminux 14 February 2013 05:59:14PM *  1 point [-]

Non-trivial indeed, but why would it stop a capable AI determined to maximize its utility?

Further notes:

F contains everything that can be reached from O, along a geodesic with proper-time more than two hours.

Geodesic is probably too weak, as it implies ballistic motion only. A timelike or null future-pointing path is somewhat safer (one should be allowed to use engines). Still, anything outside O's original lightcone, or within a two-hour proper-time window, appears to be fair game for paperclipping.
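
A rough way to write down the two readings (the notation is mine, not the post's: \tau(\gamma) for the proper time along a curve \gamma, J^{+}(O) for the causal future of the event O):

```latex
% Two candidate definitions of the constrained far-future set F quoted above:
F_{\mathrm{geo}}  = \{\, x : \exists\ \text{future-directed timelike geodesic } \gamma
                      \text{ from } O \text{ to } x \text{ with } \tau(\gamma) > 2\,\mathrm{h} \,\}
F_{\mathrm{path}} = \{\, x : \exists\ \text{future-directed timelike or null path } \gamma
                      \text{ from } O \text{ to } x \text{ with } \tau(\gamma) > 2\,\mathrm{h} \,\}
% Every geodesic is in particular such a path, so F_geo \subseteq F_path; the
% unconstrained ("fair game") region J^{+}(O) \setminus F is therefore at least
% as large under the geodesic definition as under the path one.
```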

Moreover, any timelike curve can be mimicked by a near-zero-proper-time curve. Does this mean the AI can extend the window out into the future indefinitely?

Comment author: Stuart_Armstrong 14 February 2013 06:13:55PM *  0 points [-]

Do you mean the actual AI making an actual physical copy of the actual universe?

Added a clause that the AI must obey the laws of physics; it was implicit, but now it's explicit.

Still, anything outside O's original lightcone, or within a two-hour proper-time window, appears to be fair game for paperclipping.

Yes. But the AI can only reach that if it breaks the laws of physics, and if it can do that, we likely have time travel so our efforts are completely for naught.

Moreover, any timelike curve can be mimicked by a near-zero-proper-time curve. Does this mean the AI can extend the window out into the future indefinitely?

No. I defined F as anything that can be reached by a timelike geodesic of length two hours (though you're right that it needn't be a geodesic). Just because the point can be reached by a path of zero length doesn't mean that it's excluded from F.
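
Put another way (this formalization is mine, not the post's): F uses an existential quantifier over curves, so the existence of an additional short path to a point is simply irrelevant; the worry would only apply under a universal reading, which is not the one used:

```latex
% Existential reading (the definition used): x is in F if SOME qualifying geodesic reaches it.
F  = \{\, x : \exists\ \text{future-directed timelike geodesic } \gamma
      \text{ from } O \text{ to } x \text{ with } \tau(\gamma) > 2\,\mathrm{h} \,\}
% Universal reading (NOT the definition used): x would drop out of F whenever any
% near-zero-proper-time curve reached it.
F' = \{\, x : \forall\ \text{future-directed causal curves } \gamma
      \text{ from } O \text{ to } x,\ \tau(\gamma) > 2\,\mathrm{h} \,\}
% Under F, adding a near-null curve from O to x changes nothing: x is in F already,
% so the two-hour unconstrained window cannot be stretched further into the future this way.
```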

Comment author: shminux 14 February 2013 06:25:44PM 3 points [-]

But the AI can only reach that if it breaks the laws of physics

... As we know them now. Even then, not quite. The AI might just build a version of the Alcubierre drive, or a wormhole, or... In general, it would try to exploit any potential discrepancy between the domains of U and R.

Comment author: Stuart_Armstrong 14 February 2013 06:29:38PM 0 points [-]

Ok, I concede that if the AI can break physics as we understand it, the approach doesn't work. A valid point, but a general one for all AI (if the AI can break our definitions, then even a friendly AI isn't safe, even if the definitions in it seem perfect).

Any other flaws in the model?

Comment author: Larks 16 February 2013 09:12:02AM 5 points [-]

There's a big difference between UFAI because it turned out that Peano arithmetic was inconsistent, which no-one thinks possible, and UFAI because our current model of physics was wrong / the true model was given negligible probability, which seems very likely.

Comment author: Stuart_Armstrong 16 February 2013 09:25:45AM 1 point [-]

Yes.

This is related to ontology crises - how does the AI generalise old concepts across new models of physics?

But it may be a problem for most FAI designs, as well.

Comment author: Eliezer_Yudkowsky 17 February 2013 02:29:21AM 3 points [-]

Um, I wouldn't hurt people if I discovered I could violate the laws of physics. Why should a Friendly AI?

Comment author: Stuart_Armstrong 18 February 2013 01:10:01PM 2 points [-]

Here's my intuition: Eliezer and other friendly humans have got their values partially through evolution and selection. Genetic algorithms tend to be very robust - even robust to the problem not being properly specified. So I'd assume that Eliezer and evolved FAIs would preserve their friendliness if the laws of physics were changed.

An AI with a designed utility function is very different, however. Such AIs are very vulnerable to ontology crises, as their values are grounded in formal descriptions - and if the premises of the description change, their whole value system changes.
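
As a toy illustration of this point (the sketch and all its names are mine, purely hypothetical, and not part of the original argument): a utility function hard-coded against one world-model silently loses its grounding when the model of physics is swapped out, because the concept it was written in terms of no longer exists in the new ontology.

```python
# Toy illustration of an ontology crisis: a designed utility function is
# grounded in one formal world-model and loses its meaning when that model changes.
from dataclasses import dataclass, field


@dataclass
class ClassicalWorld:
    """The world-model the utility function was written against."""
    happy_humans: int = 0


@dataclass
class FieldTheoryWorld:
    """A new physics model with no primitive 'happy_humans' concept."""
    field_amplitudes: dict = field(default_factory=dict)


def designed_utility(world) -> float:
    """Utility defined directly in terms of the old ontology's concepts."""
    # Once the ontology no longer contains 'happy_humans', this silently
    # falls back to an arbitrary default instead of tracking what we meant.
    return float(getattr(world, "happy_humans", 0.0))


print(designed_utility(ClassicalWorld(happy_humans=3)))  # 3.0 -- as intended
print(designed_utility(FieldTheoryWorld()))              # 0.0 -- grounding lost
```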

Now, presumably we can do better than that, and design a FAI to be robust across ontology changes - maybe mix in some evolution, or maybe some cunning mathematics. If this is possible, however, I would expect the same approach to succeed with a reduced impact AI.

Comment author: Eliezer_Yudkowsky 18 February 2013 05:55:39PM 11 points [-]

I got 99 psychological drives but inclusive fitness ain't one.

In what way is evolution supposed to be robust? It's slow, stupid, doesn't reproduce the content of goal systems at all and breaks as soon as you introduce it to a context sufficiently different from the environment of evolutionary ancestry because it uses no abstract reasoning in its consequentialism. It is the opposite of robust along just about every desirable dimension.

Comment author: ialdabaoth 17 February 2013 02:34:53AM 1 point [-]

Why shouldn't it? To rephrase, why do you intuitively generalize your own utility function to that of a FAI?

Comment author: gjm 17 February 2013 02:50:41AM 9 points [-]
  1. Because having a utility function that somewhat resembles humans' (including Eliezer's) is part of what Eliezer means by "Friendly".

  2. Maybe some Friendly AIs would in fact do that. But Eliezer's saying there's no obvious reason why they should; why would finding that the laws of physics aren't what we think they are cause an AI to stop acting Friendly, any more than (say) finding much more efficient algorithms for doing various things, discovering new things about other planets, watching an episode of "The Simpsons", or any of the countless other things an AI (or indeed a human) might do from time to time?

If I'm right that #2 is part of what Eliezer is saying, maybe I should add that I think it may be missing the point Stuart_Armstrong is making, which (I think) isn't that an otherwise-Friendly AI would discover it can violate what we currently believe to be the laws of physics and then go mad with power and cease to be Friendly, but that a purported Friendly AI design's Friendliness might turn out to depend on assumptions about the laws of physics (e.g., via bounds on the amount of computation it could do in certain circumstances, or how fast the number of intelligent agents within a given region of spacetime can grow with the size of the region, or how badly the computations it actually does can deviate from some theoretical model because of noise, etc.), and if those assumptions then turned out to be wrong it would be bad.

(To which my model of Eliezer says: So don't do that, then. And then my model of Stuart says: Avoiding it might be infeasible; there are just too many, too non-obvious, ways for a purported proof of Friendliness to depend on how physics works -- and the best we can do might turn out to be something way less than an actual proof, anyway. But by now I bet my models have diverged from reality. It's just as well I'm just chattering in an LW discussion and not trying to predict what a superintelligent machine might do.)

Comment author: shminux 17 February 2013 03:20:36AM *  0 points [-]

Presumably all the math you are working on is required for your proof of friendliness? And if the assumptions behind the math do not match the physics, wouldn't it invalidate the proof, or at least its relevance to the world we live in? And then all bets are off?

Comment author: Eliezer_Yudkowsky 17 February 2013 04:56:44AM 0 points [-]

Even invalidating a proof doesn't automatically mean the outcome is the opposite of the proof. The key question is whether there's a cognitive search process actively looking for a way to exploit the flaws in a cage. An FAI isn't looking for ways to stop being Friendly, quite the opposite. More to the point, it's not actively looking for a way to make its servers or any other accessed machinery disobey the previously modeled laws of physics in a way that modifies its preferences despite the proof system. Any time you have a system which sets that up as an instrumental goal you must've done the Wrong Thing from an FAI perspective. In other words, there's no super-clever being doing a cognitive search for a way to force an invalidating behavior - that's the key difference.