eli_sennesh comments on Steelmaning AI risk critiques - Less Wrong Discussion

26 Post author: Stuart_Armstrong 23 July 2015 10:01AM

Comments (98)

Comment author: AndreInfante 28 July 2015 03:40:37AM 0 points

A ULM also requires a utility function or reward circuitry with some initial complexity, but we can also use the same universal learning algorithms to learn that component. It is just another circuit, and we can learn any circuit that evolution learned.

Okay, so we just have to determine human terminal values in detail, and plug them into a powerful maximizer. I'm not sure I see how that's different from the standard problem statement for friendly AI. Learning values by observing people is exactly what MIRI is working on, and it's not a trivial problem.

For example: say your universal learning algorithm observes a human being fail a math test. How does it determine that the human being didn't want to fail the math test? How does it cleanly separate values from their (flawed) implementation? What does it do when people's values differ? These are hard questions, and precisely the ones that are being worked on by the AI risk people.
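To make the math-test point concrete, here is a minimal sketch under a toy model of my own (the two skill levels, the uniform prior, and the function names are all assumptions for illustration, not anything MIRI has published). It asks how much a single observed failure tells a Bayesian observer about whether the student wanted to pass.

```python
from itertools import product

# Toy model (an assumption for illustration): a student either wants to pass
# or wants to fail, and has a skill level that sets how reliably they get
# the outcome they want.
SKILLS = (0.1, 0.9)


def p_fail(wants_to_pass: bool, skill: float) -> float:
    """Probability of observing a failed test under one hypothesis."""
    # A student who wants to pass fails when skill lets them down;
    # a student who wants to fail does so with probability equal to skill.
    return 1.0 - skill if wants_to_pass else skill


def posterior_wants_to_pass() -> float:
    """P(wants to pass | one observed failure), uniform prior over hypotheses."""
    hypotheses = list(product((True, False), SKILLS))
    # With a uniform prior, the posterior weight is just the likelihood.
    weights = {h: p_fail(*h) for h in hypotheses}
    total = sum(weights.values())
    return sum(w for (wants, _), w in weights.items() if wants) / total


if __name__ == "__main__":
    print(posterior_wants_to_pass())
```

Under these assumptions the posterior stays at about 0.5: "wants to pass but is bad at math" and "wants to fail and is good at getting what they want" explain the failure equally well, so the observation alone does not separate values from competence. Any real value learner would need extra structure or evidence to break that tie.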

Other points of critique:

Saying the phrase "safe sandbox sim" is much easier than making a virtual machine that can withstand a superhuman intelligence trying to get out of it. Even if your software is perfect, the intelligence can still figure out that its world is artificial and find ways of blackmailing its captors. A more robust solution is probably what MIRI is looking into: designing agents that won't resist attempts to modify them (corrigibility).

You want to be careful about just plugging a learned human utility function into a powerful maximizer, and then raising it. If it's maximizing its own utility, which is necessary if you want it to behave anything like a child, what's to stop it from learning human greed and cruelty, and becoming an eternal tyrant? I don't trust a typical human to be god.

And even if you give up on that idea, and instead maximize a utility function defined in terms of humanity's values, you still have problems. For starters, you want to be able to prove formally that its goals will remain stable as it self-modifies, and that it won't create powerful sub-agents who don't share those goals. That is the other class of problems that MIRI works on.

Comment author: [deleted] 03 August 2015 03:59:33AM 1 point

Okay, so we just have to determine human terminal values in detail, and plug them into a powerful maximizer.

Why do you even go around thinking that the concept of "terminal values", which is basically just a consequentialist steelmanning Aristotle, cuts reality at the joints?

For starters, you want to be able to prove formally that its goals will remain stable as it self-modifies

That part honestly isn't that hard once you read the available literature about paradox theorems.