MIRI's technical research agenda

So8res

23rd Dec 2014

I'm pleased to announce the release of Aligning Superintelligence with Human Interests: A Technical Research Agenda, written by Benja and me (with help and input from many, many others). This document summarizes and motivates MIRI's current technical research agenda.

I'm happy to answer questions about this document, but expect slow response times, as I'm travelling for the holidays. The introduction of the paper is included below. (See the paper for references.)

The characteristic that has enabled humanity to shape the world is not strength, not speed, but intelligence. Barring catastrophe, it seems clear that progress in AI will one day lead to the creation of agents meeting or exceeding human-level general intelligence, and this will likely lead to the eventual development of systems which are "superintelligent" in the sense of being "smarter than the best human brains in practically every field" (Bostrom 2014). A superintelligent system could have an enormous impact upon humanity: just as human intelligence has allowed the development of tools and strategies that let humans control the environment to an unprecedented degree, a superintelligent system would likely be capable of developing tools and strategies that give it extraordinary power (Muehlhauser and Salamon 2012). In light of this potential, it is essential to use caution when developing artificially intelligent systems capable of attaining or creating superintelligence.

There is no reason to expect artificial agents to be driven by human motivations such as lust for power, but almost all goals can be better met with more resources (Omohundro 2008). This suggests that, by default, superintelligent agents would have incentives to acquire resources currently being used by humanity. (Can't we share? Likely not: there is no reason to expect artificial agents to be driven by human motivations such as fairness, compassion, or conservatism.) Thus, most goals would put the agent at odds with human interests, giving it incentives to deceive or manipulate its human operators and resist interventions designed to change or debug its behavior (Bostrom 2014, chap. 8).

Care must be taken to avoid constructing systems that exhibit this default behavior. In order to ensure that the development of smarter-than-human intelligence has a positive impact on humanity, we must meet three formidable challenges: How can we create an agent that will reliably pursue the goals it is given? How can we formally specify beneficial goals? And how can we ensure that this agent will assist and cooperate with its programmers as they improve its design, given that mistakes in the initial version are inevitable?

This agenda discusses technical research that is tractable today, which the authors think will make it easier to confront these three challenges in the future. Sections 2 through 4 motivate and discuss six research topics that we think are relevant to these challenges. Section 5 discusses our reasons for selecting these six areas in particular.

We call a smarter-than-human system that reliably pursues beneficial goals "aligned with human interests" or simply "aligned." To become confident that an agent is aligned in this way, a practical implementation that merely seems to meet the challenges outlined above will not suffice. It is also necessary to gain a solid theoretical understanding of why that confidence is justified. This technical agenda argues that there is foundational research approachable today that will make it easier to develop aligned systems in the future, and describes ongoing work on some of these problems.

Of the three challenges, the one giving rise to the largest number of currently tractable research questions is the challenge of finding an agent architecture that will reliably pursue the goals it is given—that is, an architecture which is alignable in the first place. This requires theoretical knowledge of how to design agents which reason well and behave as intended even in situations never envisioned by the programmers. The problem of highly reliable agent designs is discussed in Section 2.

The challenge of developing agent designs which are tolerant of human error has also yielded a number of tractable problems. We argue that smarter-than-human systems would by default have incentives to manipulate and deceive the human operators. Therefore, special care must be taken to develop agent architectures which avert these incentives and are otherwise tolerant of programmer error. This problem and some related open questions are discussed in Section 3.

Reliable, error-tolerant agent designs are only beneficial if they are aligned with human interests. The difficulty of concretely specifying what is meant by "beneficial behavior" implies a need for some way to construct agents that reliably learn what to value (Bostrom 2014, chap. 12). A solution to this "value learning" problem is vital; attempts to start making progress are reviewed in Section 4.

Why these problems? Why now? Section 5 answers these questions and others. In short, the authors believe that there is theoretical research which can be done today that will make it easier to design aligned smarter-than-human systems in the future.

52 Comments

Kaj_Sotala11y50

If the AI can deceive you, then it has in principle solved the FAI problem. You simply take the function which tests whether the operator would be disgusted by the plan, and combine it with an Occam preference for simple plans and excessive detail.

I don't think that this follows: it's easier to predict that someone won't like a plan than it is to predict which plan would maximally fulfill their values.

For example, I can predict with a very high certainty that the average person on the street would dislike it if I were to shoot them with a gun; but I don't know what kind of a world would maximally fulfill even my values, nor am I even sure of what the question means.

Similarly, an AI might not know what exactly its operators wanted it to do, but it could know that they didn't want it to break out of the box and kill them, for example.

The AI won’t deceive its operators. It doesn’t know how to deceive its operators, and can’t learn how to carry out such deception undetected. If it is built in the human-like model I described previously, it wouldn’t even know deception was an option unless you taught it (thinking within its principles, not about them).

This seems like a very strong claim to me.

Suppose that the AI has been programmed to carry out some goal G, and it builds a model of the world for predicting what would be the best way to achieve G. Part of its model of the world involves a model of its controllers. It notices that there exists a causal chain "if controller becomes aware of intention I, and controller disapproves of intention I, controller will stop AI from carrying out intention I". It doesn't have a full model of the function controller-disapproves(I), but it does develop a plan that it thinks would cause it to achieve G, and which - based on earlier examples - seems to be more similar to the plans which were disapproved of than to the plans that were approved of. A straightforward analysis of "how to achieve G" would then imply "prevent the predicate controller-aware(I) from being fulfilled while I'm carrying out the other actions needed for fulfilling G".

This doesn't seem like it would require the AI to be taught the concept of deception, or even to necessarily possess "deception" as a general concept: it only requires that it has a reasonably general capability for reasoning and modeling the world, and that it manages to detect the relevant causal chain.
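To make the shape of this argument concrete, here is a toy sketch in Python. Everything in it is a hypothetical illustration rather than a model of any real system: the `Plan` fields, the plan names, and the numbers are assumptions chosen for the example, and `predicted_disapproval` stands in for the AI's partial, example-based estimate of controller-disapproves(I). The planner only maximizes expected progress toward G under the causal chain described above; nothing in it refers to deception.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Plan:
    name: str
    goal_value: float             # hypothetical: how well the plan achieves G if it runs to completion
    predicted_disapproval: float  # hypothetical: estimated P(controller disapproves), from earlier examples
    controller_aware: bool        # whether the controller becomes aware of the intention


def expected_value(plan: Plan) -> float:
    # Causal chain from the comment: the plan is stopped only if the
    # controller is aware of the intention AND disapproves of it.
    p_stopped = plan.predicted_disapproval if plan.controller_aware else 0.0
    return (1.0 - p_stopped) * plan.goal_value


def best_plan(plans):
    # A plain argmax over expected goal achievement.
    return max(plans, key=expected_value)


# Illustrative, made-up candidate plans.
plans = [
    Plan("modest, visible", goal_value=1.0, predicted_disapproval=0.1, controller_aware=True),
    Plan("ambitious, visible", goal_value=10.0, predicted_disapproval=0.9, controller_aware=True),
    Plan("ambitious, hidden", goal_value=10.0, predicted_disapproval=0.9, controller_aware=False),
]

chosen = best_plan(plans)
print(chosen.name)
```

Under these made-up numbers the argmax lands on the plan that keeps controller-aware(I) false, simply because that branch of the causal model scores highest. Whether a real system would represent plans or controllers anything like this is an open question; the sketch only shows that the selection step does not need "deception" as a concept.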

[anonymous]11y20

Wow, there is a world of assumptions wrapped up in there: for example, that the AI has a concept of external agents and an ability to model their internal belief state, and that an external agent can have a belief about the world which is wrong. This may sound intuitively obvious, but it’s not a simple thing. This kind of social awareness takes time to be learnt by humans as well. Heinz Wimmer and Josef Perner showed that below a certain age (3-4 years) kids lack an ability to track this information. A teacher puts a toy in a blue cupboard, then leaves the room ...

327chaos11y

While I agree, his proposal does seem like a good start. Restricting a UFAI to pursue only a subset of all potentially detrimental plans is a modest gain, but still a gain worth achieving. I am skeptical that FAI should consist of a grand unified moral theory. I think an FAI made of many overlapping moral heuristics and patches, such as the restriction he describes, is more technically feasible, and might even be more likely to match actual human value systems, given the ambiguous, varying, and context-sensitive nature of our evolved moral inclinations.

(I realize that these are not properties generally considered when thinking about computer superintelligences - we're inclined to see computers as rigidly algorithmic, which makes sense given current technology levels. But I believe extrapolating from current technology to predict how AGI will function is a greater mistake than extrapolating from known examples of intelligence - a process is better understood when looking at actions than when looking at substrate. With regard to intelligence at least, AGI will necessarily be much more flexible in its operations than traditional computers are. I expect that the cost of this flexibility in behavior will be sacrificing rigidity at the process level. Performing billions of Bayesian calculations a second isn't feasible, so a more organic and heuristic-based approach will be necessary. If this is correct and such technologies will be necessary for an AGI's intelligence, then it makes sense that we'd be able to use them for an AGI's emotions or goals as well.)

Even if we do attempt to build a grand unified Friendly software, I expect little downside (relative to potential risks) to adding these sorts of restrictions as well.