Moss_Piglet comments on The genie knows, but doesn't care - Less Wrong
I am trying to understand whether the kind of AI underlying the scenario you have in mind is a possible and likely outcome of human AI research.
As far as I can tell, as a layman, goals and capabilities are intrinsically tied together. How could a chess computer be capable of winning against humans at chess without the terminal goal of achieving checkmate?
Coherent and specific goals are necessary to (1) decide which actions are instrumentally useful and (2) judge the success of self-improvement. If the given goal is logically incoherent, or too vague for the AI to tell success from failure, would it work at all?
If I understand your position correctly, you would expect a chess-playing general AI, one that does not know about checkmate or "winning at chess", to instead improve against such goals as "modeling states of affairs well" or "making sure nothing intervenes in chess playing". You believe that these goals do not have to be programmed by humans, because they are emergent goals, an instrumental consequence of being generally intelligent.
These universal instrumental goals, these "AI drives", seem to be a major reason why you believe it is important to make the AI care about behaving correctly. You believe that these AI drives are a given, and that the only way to prevent an AI from being an existential risk is to channel these drives, to focus this power on protecting and amplifying human values.
My perception is that the drives you imagine are not special and will be as difficult to get "right" as any other goal. I think it is very unlikely that humans would not only want to make an AI exhibit such drives, but would also succeed at making such drives emerge.
As far as I am aware, here is what you believe an AI to want:
Which AIs that humans would ever want to create would require all of these drives, and how easy will it be for humans to make an AI exhibit these drives, compared to making an AI that can do what humans want without them?
Take mathematics. What are the difficulties associated with making an AI better than humans at mathematics, and will an AI need these drives in order to do so?
Humans did not evolve to play chess or do mathematics. Yet it is considerably less difficult to design a chess AI than an AI that is capable of discovering interesting and useful mathematics.
I believe that the difficulty is due to the fact that it is much easier to formalize what it means to play chess than what it means to do mathematics. The difference between chess and mathematics is that chess has a specific terminal goal in the form of a clear definition of what constitutes winning. Mathematics has unambiguous rules, but no specific terminal goal and no clear definition of what constitutes winning.
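The contrast can be made concrete in code. The sketch below (using tic-tac-toe rather than chess, purely to keep it short) shows that a game's terminal goal is a small, mechanically checkable predicate; the stub for mathematics is the honest state of affairs, since no analogous predicate for "interesting mathematics" is known.

```python
# A game's "winning" condition is a short, mechanically checkable predicate.
# Tic-tac-toe stands in for chess here purely to keep the example tiny.
def wins(board, player):
    """board: 3x3 list of lists containing 'X', 'O', or None."""
    lines = (
        [[board[r][c] for c in range(3)] for r in range(3)]      # rows
        + [[board[r][c] for r in range(3)] for c in range(3)]    # columns
        + [[board[i][i] for i in range(3)],                      # main diagonal
           [board[i][2 - i] for i in range(3)]]                  # anti-diagonal
    )
    return any(all(cell == player for cell in line) for line in lines)

# No comparably crisp predicate exists for mathematics:
def is_interesting(proof):
    raise NotImplementedError("no formal definition of 'winning' at mathematics")

board = [['X', 'X', 'X'],
         ['O', 'O', None],
         [None, None, None]]
print(wins(board, 'X'))  # True: the terminal goal is fully formalized
```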
Progress in artificial intelligence capability is related not only to whether humans have evolved for a certain skill, or to how many computational resources it requires, but also to how difficult it is to formalize the skill: its rules and what it means to succeed.
In light of this, how difficult would it be to program the drives you imagine, versus simply making an AI win against humans at a given activity without exhibiting these drives?
All these drives are very vague ideas: not like "winning at chess", but more like "being better at mathematics than Terence Tao".
The point I am trying to make is that these drives constitute additional complexity, rather than being simple ideas that you can just assume, and from which you can reason about the behavior of an AI.
It is in this context that the "dumb superintelligence" argument tries to highlight something: it is likely incredibly hard to make these drives emerge in a seed AI. Such scenarios implicitly presuppose that humans succeed at encoding intricate ideas about what "winning" means in all the cases required to overpower humans, but not in cases such as winning at chess or doing mathematics. I like to analogize such a scenario to the creation of a generally intelligent autonomous car that works perfectly well at not destroying itself in a crash, but which somehow manages to maximize the number of people it runs over.
I agree that if you believe that it is much easier to create a seed AI to exhibit the drives that you imagine, than it is to make a seed AI use its initial resources to figure out how to solve a specific problem, then we agree about AI risks.
(Note: I'm also a layman, so my non-expert opinions necessarily come with a large salt side-dish)
My guess here is that most of the "AI drives" (self-improvement, rationality, retaining one's goal structure, etc.) are considered necessary for a functional learning/self-improving algorithm. If the program cannot recognize and make rules for new patterns observed in data, make sound inferences based on known information, or keep after its objective, it will not be much of an AGI at all; it will not even be able to function as well as a modern targeted-advertising program.
The rest, such as self-preservation, are justified as logical requirements of the task. Rather than having self-preservation as a terminal value, the paperclip maximizer will value its own existence as an optimal means of proliferating paperclips. It makes intuitive sense that those sorts of 'drives' would emerge from almost any goal, but then again my intuition is not necessarily very useful for these sorts of questions.
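A toy expected-value calculation (all numbers invented for illustration) shows how this works: an action that keeps the agent alive scores higher on the paperclip objective itself, so self-preservation falls out of the arithmetic without ever being a terminal value.

```python
# Toy sketch with invented numbers: self-preservation emerges instrumentally.
# The agent scores only paperclips; surviving just means more paperclips later.
def expected_paperclips(immediate, survival_prob, future_if_alive):
    return immediate + survival_prob * future_if_alive

risky = expected_paperclips(immediate=10, survival_prob=0.10, future_if_alive=1000)
safe  = expected_paperclips(immediate=1,  survival_prob=0.99, future_if_alive=1000)
print(safe > risky)  # True: the "safe" action wins on paperclips alone
```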
This point might also be a source of confusion:
As Dr. Valiant (great name or the greatest name?) classifies things in Probably Approximately Correct, winning at chess would be a 'theoryful' task while discovering (interesting) mathematical proofs would be a 'theoryless' one. In essence, the theoryful has simple and well-established rules for the process, which could be programmed optimally in advance with little to no modification needed afterwards, while the theoryless is complex and messy enough that an imperfect (Probably Approximately Correct) learning process would have to be employed to suss out all the rules.
Now obviously the program will benefit from labeling in its training data for what is and is not an "interesting" mathematical proof; otherwise it can just screw around with computationally cheap arithmetic proofs (1 + 1 = 2, 1.1 + 1 = 2.1, 1.2 + 1 = 2.2, etc.) until the heat death of the universe. Less obviously, as the hidden-tank example shows, insufficient labeling or bad labels will lead to other unintended results.
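A minimal sketch of that degenerate strategy, with a hypothetical scoring rule: if the objective is just "maximize the count of true statements proved", trivial identities are the cheapest possible output, so the objective is maximized without anything interesting ever being produced.

```python
# Hypothetical naive objective: score = number of true statements produced.
# The cheapest strategy is an endless stream of worthless identities.
def trivial_theorems(n):
    """Yield n true but uninteresting statements: 1 + 1 = 2, 2 + 1 = 3, ..."""
    for a in range(1, n + 1):
        yield f"{a} + 1 = {a + 1}"

naive_score = sum(1 for _ in trivial_theorems(1_000_000))
print(naive_score)  # a million "theorems", zero mathematical interest
```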
So, applying that back to Friendliness: despite attempts to construct a Fun Theory, human value is currently (and may well forever remain) theoryless. A learning process whose goal is to maximize human value is going to have to be both well constructed and given very good labels initially in order not to be Unfriendly. Of course, it could very well correct itself later on (that is in fact at the core of a PAC algorithm), but then we get into questions of FOOM-ing and labels of human value in the environment which I am not equipped to deal with.