Moss_Piglet comments on The genie knows, but doesn't care - Less Wrong
(Note: I'm also a layman, so my non-expert opinions necessarily come with a large salt side-dish)
My guess here is that most of the "AI Drives" (to self-improve, be rational, retain its goal structure, etc.) are considered necessary for a functional learning/self-improving algorithm. If the program cannot recognize and make rules for new patterns observed in data, make sound inferences based on known information, or keep after its objective, it will not be much of an AGI at all; it will not even be able to function as well as a modern targeted advertising program.
The rest, such as self-preservation, are justified as being logical requirements of the task. Rather than having self-preservation as a terminal value, the paperclip maximizer will value its own existence as an optimal means of proliferating paperclips. It makes intuitive sense that those sorts of 'drives' would emerge from almost any goal, but then again my intuition is not necessarily very useful for these sorts of questions.
This point might also be a source of confusion;
As Dr Valiant (great name or the greatest name?) classifies things in Probably Approximately Correct, Winning Chess would be a 'theoryful' task while Discovering (Interesting) Mathematical Proofs would be a 'theoryless' one. In essence, a theoryful task has simple and well-established rules for the process, which could be programmed optimally in advance with little-to-no modification needed afterwards, while a theoryless task is complex and messy enough that an imperfect (Probably Approximately Correct) learning process would have to be employed to suss out all the rules.
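For anyone who wants the "Probably Approximately Correct" part made concrete, here is a minimal sketch of the standard textbook toy case, not anything taken from Valiant's book directly. It assumes the hidden rule is a single threshold on [0, 1] and that examples are drawn uniformly; the names `learn_threshold`, `true_error` and `sample_bound` are made up for illustration.

```python
import math
import random


def learn_threshold(samples):
    """Return the smallest positively labeled point, or 1.0 if there is none.

    The learned rule is "x >= threshold means positive", which is always
    consistent with the training samples.
    """
    positives = [x for x, label in samples if label]
    return min(positives) if positives else 1.0


def true_error(learned, target):
    """Probability of a wrong answer under the assumed uniform distribution.

    The two rules disagree exactly on the interval between the thresholds.
    """
    return abs(learned - target)


def sample_bound(eps, delta):
    """Samples sufficient for error <= eps with probability >= 1 - delta.

    For this learner P(error > eps) <= (1 - eps)^m <= exp(-eps * m),
    so m >= ln(1 / delta) / eps is enough.
    """
    return math.ceil(math.log(1.0 / delta) / eps)


def draw_samples(n, target, rng):
    """Draw n uniform points and label them with the hidden target rule."""
    return [(x, x >= target) for x in (rng.random() for _ in range(n))]


rng = random.Random(0)   # fixed seed so runs are repeatable
target = 0.3             # hypothetical hidden rule the learner never sees
for n in (10, 100, 1000):
    learned = learn_threshold(draw_samples(n, target, rng))
    print(n, round(true_error(learned, target), 4))
```

The learner never recovers the rule exactly (hence "approximately"), and an unlucky draw can always leave it badly off (hence "probably"), but both kinds of slack shrink as the number of labeled examples grows.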
Now obviously the program will benefit from labels in its training data for what is and is not an "interesting" mathematical proof; otherwise it can just screw around with computationally cheap arithmetic proofs (1 + 1 = 2, 1.1 + 1 = 2.1, 1.2 + 1 = 2.2, etc.) until the heat death of the universe. Less obviously, as the hidden tank example shows, insufficient labeling or bad labels will lead to other unintended results.
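To illustrate the failure mode the tank story points at, here is a toy sketch with entirely invented data. It assumes the usual telling of the anecdote, in which the training photos with tanks all happened to share an incidental property (here a hypothetical `cloudy` flag), and a learner that simply keeps whichever single feature best matches the labels.

```python
def stump_accuracy(data, feature):
    """Fraction of examples where the feature's value equals the label."""
    hits = sum(1 for feats, label in data if feats[feature] == label)
    return hits / len(data)


def fit_stump(data, features):
    """Pick the single feature that agrees with the training labels most often."""
    return max(features, key=lambda f: stump_accuracy(data, f))


# Invented training set: the label is "photo contains a tank".
# Every tank photo happens to be cloudy, and the shape cue is a bit noisy.
train = [
    ({"has_tank_shape": True,  "cloudy": True},  True),
    ({"has_tank_shape": True,  "cloudy": True},  True),
    ({"has_tank_shape": False, "cloudy": True},  True),   # tank mostly hidden
    ({"has_tank_shape": False, "cloudy": False}, False),
    ({"has_tank_shape": False, "cloudy": False}, False),
    ({"has_tank_shape": True,  "cloudy": False}, False),  # tank-shaped rock
]

# Invented test set where the weather no longer lines up with the tanks.
test = [
    ({"has_tank_shape": True,  "cloudy": False}, True),
    ({"has_tank_shape": True,  "cloudy": False}, True),
    ({"has_tank_shape": False, "cloudy": True},  False),
    ({"has_tank_shape": False, "cloudy": True},  False),
]

features = ["has_tank_shape", "cloudy"]
chosen = fit_stump(train, features)
print("learned feature:", chosen)
print("train accuracy:", stump_accuracy(train, chosen))
print("test accuracy:", stump_accuracy(test, chosen))
```

On the training set the weather is a perfect predictor, so the learner does exactly what the labels reward and still ends up with a rule that has nothing to do with tanks.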
So, applying that back to Friendliness: despite attempts to construct a Fun Theory, human value is currently (and may well forever remain) theoryless. A learning process whose goal is to maximize human value is going to have to be both well constructed and have very good labels initially to not be Unfriendly. Of course, it could very well correct itself later on (that is in fact at the core of a PAC algorithm), but then we get into questions of FOOM-ing and labels of human value in the environment, which I am not equipped to deal with.