eli_sennesh comments on Steelmaning AI risk critiques - Less Wrong Discussion
You are viewing a comment permalink. View the original post to see all comments and the full post content.
You are viewing a comment permalink. View the original post to see all comments and the full post content.
Comments (98)
Okay, so we just have to determine human terminal values in detail, and plug them into a powerful maximizer. I'm not sure I see how that's different from the standard problem statement for friendly AI. Learning values by observing people is exactly what MIRI is working on, and it's not a trivial problem.
For example: say your universal learning algorithm observes a human being fail a math test. How does it determine that the human being didn't want to fail the math test? How does it cleanly separate values from their (flawed) implementation? What does it do when peoples' values differ? These are hard questions, and precisely the ones that are being worked on by the AI risk people.
Other points of critique:
Saying the phrase "safe sandbox sim" is much easier than making a virtual machine that can withstand a superhuman intelligence trying to get out of it. Even if your software is perfect, it can still figure out that its world is artificial and figure out ways of blackmailing its captors. Probably doing what MIRI is looking into, and designing agents that won't resist attempts to modify them (corrigibility) is a more robust solution.
You want to be careful about just plugging in a learned human utility function into a powerful maximizer, and then raising it. If it's maximizing its own utility, which is necessary if you want it to behave anything like a child, what's to stop it from learning human greed and cruelty, and becoming an eternal tyrant? I don't trust a typical human to be god.
And even if you give up on that idea, and have to maximize a utility function defined in terms of humanity's values, you still have problems. For starters, you want to be able to prove formally that its goals will remain stable as it self-modifies, and it won't create powerful sub-agents who don't share those goals. Which is the other class of problems that MIRI works on.
Why do you even go around thinking that the concept of "terminal values", which is basically just a consequentialist steelmanning Aristotle, cuts reality at the joints?
That part honestly isn't that hard once you read the available literature about paradox theorems.