Eliezer has written a great deal about the concept of Friendly AI, for example in a document from 2001 titled Creating Friendly AI 1.0. The new SIAI overview states that:
SIAI's primary approach to reducing AI risks has thus been to promote the development of AI with benevolent motivations which are reliably stable under self-improvement, what we call “Friendly AI” [22].
The SIAI Research Program lists under its Research Areas:
Mathematical Formalization of the "Friendly AI" Concept. Proving theorems about the ethics of AI systems, an important research goal, is predicated on the possession of an appropriate formalization of the notion of ethical behavior on the part of an AI. And, this formalization is a difficult research question unto itself.
Despite the enormous value that the construction of a Friendly AI would have; at present I'm not convinced that researching the Friendly AI concept is a cost-effective way of reducing existential risk. My main reason for doubt is that as far as I can tell, the problem of building a Friendly AI has not been taskified to a sufficiently fine degree for it to be possible to make systematic progress toward obtaining a solution. I'm open-minded on this point and quite willing to change my position subject to incoming evidence
In The First Step is to Admit That You Have a Problem Alicorn wrote:
If you want a peanut butter sandwich, and you have the tools, ingredients, and knowhow that are required to make a peanut butter sandwich, you have a task on your hands. If you want a peanut butter sandwich, but you lack one or more of those items, you have a problem [... ] treating problems like tasks will slow you down in solving them. You can't just become immortal any more than you can just make a peanut butter sandwich without any bread.
In Let them eat cake: Interpersonal Problems vs Tasks HughRistik wrote:
Similarly, many straight guys or queer women can't just find a girlfriend, and many straight women or queer men can't just find a boyfriend, any more than they can "just become immortal."
We know that the problems of making a peanut butter sandwich and of finding a romantic partner can (often) be taskified because many people have succeeded in solving them. It's less clear that a given problem that has never been solved can be taskified. Some problems are in principle unsolvable whether because they are mathematically undecidable or because physical law provides an obstruction to their solution. Other currently unsolved problems have solutions in the abstract but lack solutions that are accessible to humans. That taskification is in principle possible is not a sufficient condition for solving a problem but it is a necessary condition.
The Difficulty of Unsolved Problems
There's a long historical precedent of unsolved problems being solved. Humans have succeeded in building cars and skyscrapers, have succeeded in understanding the chemical composition of far away stars and of our own DNA, have determined the asymptotic distribution of the prime numbers and have given an algorithm to determine whether a given polynomial equation is solvable in radicals, have created nuclear bombs and have landed humans on the moon. All of these things seemed totally out of reach at one time.
Looking over the history of human achievement gives one a sense of optimism as to the feasibility of accomplishing a goal. And yet, there's a strong selection effect at play: successes are more interesting than failures and we correspondingly notice and remember successes more than failures. One need only page through a book like Richard Guy's Unsolved Problems in Number Theory to get a sense for how generic it is for a problem to be intractable. The ancient Greek inspired question of whether there are infinitely many perfect numbers remains out of reach for best mathematicians of today. The success of human research efforts has been as much a product of wisdom in choosing one's battles as it has been a product of ambition.
The Case of Friendly AI
My present understanding is that there are potential avenues for researching AGI. Richard Hollerith was kind enough to briefly describe Monte Carlo AIXI to me last month and I could sort of see how it might be in principle possible to program a computer to do Bayesian induction according to an approximation to a universal prior and implement the computer with a decision making apparatus based on its epistemological state at a given time. Some people have suggested to me that the amount of computer power and memory needed to implement human level Monte Carlo AIXI is prohibitively large but (in my current, very ill-informed state; by analogy with things that I've seen in computational complexity theory) I could imagine ingenious tricks yielding an approximation to Monte Carlo AIXI which uses much less computing power/memory and which is a sufficiently close to approximation to serve as a substitute for practical purposes. This would point to a potential taskification of the problem of building an AGI. I could also imagine that there are presently no practically feasible AGI research programs; I know too little about the state of strong artificial intelligence research to have anything but a very unstable opinion on this matter.
As Eliezer has said; the problem of creating a Friendly AI is inherently more difficult than that of creating an AGI and may be a problem much more difficult than that of creating an AGI. At present, the Friendliness aspect of a Friendly AI seems to me to strongly resist taskificaiton. In his poetic Mirrors and Paintings Eliezer gives the most detailed description of what a Friendly AI should do that I've seen, but the gap between concept and implementation here seems so staggeringly huge that it doesn't suggest to me any fruitful lines of Friendly AI research. As far as I can tell, Eliezer's idea of a Friendly AI is at this point not significantly more fleshed out (relative to the magnitude of the task) than Freeman Dyson's idea of a Dyson sphere. In order to build a Friendly AI, beyond conceiving of what a Friendly AI should be in the abstract one has to convert one's intuitive understanding of friendliness into computer code in a formal programming language.
I don't even see how one would start to research the problem of getting a hypothetical AGI to recognize humans as distinguished beings. Solving this problem would seem to require as a prerequisite an understanding of the make up of the hypothetical AGI; something which people don't seem to have a clear grasp of at the moment. Even if one does have a model for a hypothetical AGI, writing code conducive to it recognizing humans as distinguished beings seems like an intractable task. And even with a relatively clear understanding of how one would implement a hypothetical AGI with the ability to recognize humans as distinguished beings; one is still left with the problem of making such a hypothetical AGI Friendly toward such beings.
In view of all this, working toward stable whole-brain emulation of a a trusted and highly intelligent person concerned about human well being seems to me like a more promising strategy of reducing existential risk at the present time than researching Friendly AI. Quoting a comment by Carl Shulman
Emulations could [...] enable the creation of a singleton capable of globally balancing AI development speeds and dangers. That singleton could then take billions of subjective years to work on designing safe and beneficial AI. If designing safe AI is much, much harder than building AI at all, or if knowledge of AI and safe AI are tightly coupled, such a singleton might be the most likely route to a good outcome.
There are various things that could go wrong with whole-brain emulation and it would be good to have a better option but Friendly AI research doesn't seem to me to be one in light of an apparent total absence of even the outlines of a viable Friendly AI research program.
But I feel like I may have missed something here. I'd welcome any clarifications of what people who are interested in Friendly AI research mean by Friendly AI research. In particular, is there a conjectural taskification of the problem?
I meant: innate cognitive architecture which plays a role in metaethical thought.
You might be familiar with the idea that, according to CEV, you figure out the full complexity of human value using neuroscience (rather than relying on people's opinions about what they value), and then you "extrapolate" or "renormalize" that using "reflective decision theory" (which does not yet exist). The idea here is that the method of extrapolation should also be extracted from the details of human cognitive architecture, rather than just figured out through intuition or pure reason.
Suppose we have a person - or an intelligent agent - with a particular "value system" or "private decision theory". Note that we are talking about its actual decision theory, as embodied in its causal structure and decision-making dispositions, and not just its introspective opinions about how it decides. Given this actual value system, RDT is supposed to tell us what would happen to that value system if it were changed according to its own implicit ideals. All I'm saying is that there's a meta-ethical relativism for RDT, for large classes of decision architecture. Different theories about how to normatively self-modify a decision architecture ought to be possible, and the selection of which RDT is used should also be derived from the agent's own cognitive architecture.
Of course you can go meta again and say, maybe the RDT extraction procedure can also take different forms - etc. It's one of the tasks of the FAI/CEV/RDT research program to figure out when and how the ethical metalevels stop.