MIRI's technical research agenda

So8res

I'm pleased to announce the release of Aligning Superintelligence with Human Interests: A Technical Research Agenda, written by Benja and me (with help and input from many, many others). This document summarizes and motivates MIRI's current technical research agenda.

I'm happy to answer questions about this document, but expect slow response times, as I'm travelling for the holidays. The introduction of the paper is included below. (See the paper for references.)

The characteristic that has enabled humanity to shape the world is not strength, not speed, but intelligence. Barring catastrophe, it seems clear that progress in AI will one day lead to the creation of agents meeting or exceeding human-level general intelligence, and this will likely lead to the eventual development of systems which are "superintelligent" in the sense of being "smarter than the best human brains in practically every field" (Bostrom 2014). A superintelligent system could have an enormous impact upon humanity: just as human intelligence has allowed the development of tools and strategies that let humans control the environment to an unprecedented degree, a superintelligent system would likely be capable of developing tools and strategies that give it extraordinary power (Muehlhauser and Salamon 2012). In light of this potential, it is essential to use caution when developing artificially intelligent systems capable of attaining or creating superintelligence.

There is no reason to expect artificial agents to be driven by human motivations such as lust for power, but almost all goals can be better met with more resources (Omohundro 2008). This suggests that, by default, superintelligent agents would have incentives to acquire resources currently being used by humanity. (Can't we share? Likely not: there is no reason to expect artificial agents to be driven by human motivations such as fairness, compassion, or conservatism.) Thus, most goals would put the agent at odds with human interests, giving it incentives to deceive or manipulate its human operators and resist interventions designed to change or debug its behavior (Bostrom 2014, chap. 8).

Care must be taken to avoid constructing systems that exhibit this default behavior. In order to ensure that the development of smarter-than-human intelligence has a positive impact on humanity, we must meet three formidable challenges: How can we create an agent that will reliably pursue the goals it is given? How can we formally specify beneficial goals? And how can we ensure that this agent will assist and cooperate with its programmers as they improve its design, given that mistakes in the initial version are inevitable?

This agenda discusses technical research that is tractable today, which the authors think will make it easier to confront these three challenges in the future. Sections 2 through 4 motivate and discuss six research topics that we think are relevant to these challenges. Section 5 discusses our reasons for selecting these six areas in particular.

We call a smarter-than-human system that reliably pursues beneficial goals "aligned with human interests" or simply "aligned." To become confident that an agent is aligned in this way, a practical implementation that merely seems to meet the challenges outlined above will not suffice. It is also necessary to gain a solid theoretical understanding of why that confidence is justified. This technical agenda argues that there is foundational research approachable today that will make it easier to develop aligned systems in the future, and describes ongoing work on some of these problems.

Of the three challenges, the one giving rise to the largest number of currently tractable research questions is the challenge of finding an agent architecture that will reliably pursue the goals it is given—that is, an architecture which is alignable in the first place. This requires theoretical knowledge of how to design agents which reason well and behave as intended even in situations never envisioned by the programmers. The problem of highly reliable agent designs is discussed in Section 2.

The challenge of developing agent designs which are tolerant of human error has also yielded a number of tractable problems. We argue that smarter-than-human systems would by default have incentives to manipulate and deceive the human operators. Therefore, special care must be taken to develop agent architectures which avert these incentives and are otherwise tolerant of programmer error. This problem and some related open questions are discussed in Section 3.

Reliable, error-tolerant agent designs are only beneficial if they are aligned with human interests. The difficulty of concretely specifying what is meant by "beneficial behavior" implies a need for some way to construct agents that reliably learn what to value (Bostrom 2014, chap. 12). A solution to this "value learning" problem is vital; attempts to start making progress are reviewed in Section 4.

Why these problems? Why now? Section 5 answers these questions and others. In short, the authors believe that there is theoretical research which can be done today that will make it easier to design aligned smarter-than-human systems in the future.

On my understanding of CogPrime, its answers to these questions are of the form "we'll just keep adding/refining heuristics and heuristic-learning mechanisms and heuristic-selection methods until it's smart, and so we don't need to answer these problems ourselves."

Have you read Engineering General Intelligence[1]? CogPrime is not a hack job -- there is significant theory going into the architectural design.

But I won’t belabor that point because even if CogPrime / OpenCog were as you describe, that pretty much describes how human intelligence works too. The only example of general intelligence we have is also a grab-bag of heuristics and heuristic-learning procedures. I hope we can agree that human intelligence was evolved, not designed. And it pretty much was a matter of adding more gears to get more capability (oversimplified a bit, but qualitatively correct).

The human mind does not have a naturalized induction algorithm. The human mind does not have a rigorous understanding of counterfactuals. The human mind does not have an explicit set of logical priors. If it did, we wouldn’t need all this rationality business and LessWrong. Calling these out as problems that need to be solved to build an AGI betrays a certain top-down, unified design bias which is neither reflected in the human mind nor relevant to architectures like CogPrime. CogPrime approaches these problems in the same way the human mind does -- namely it doesn’t solve them, because general solutions to these problems are not required to build human-level or better intelligence.

You seem under-confident that an AGI could be developed by “adding more gears” as you put it. Yet that is exactly how the only examples of general intelligence we know of originated. It would seem a less conservative assumption to presume that there exists some single-principle universal inference engine that can be realistically implemented and is compatible with human morality, itself a result of the complex interaction of our learned heuristics, long-term memory, and basic instincts.

CogPrime is extremely complex and I don’t want to oversimplify it. But relevant to this conversation, CogPrime does work by networking specialized pattern recognizers and procedures together, much like the above description of human thinking. However it does so in a way that is better adapted to modern computational hardware and ease of analysis, rather than precise neural emulation.

You seem to think that a heuristic approach is more likely to lead to unfriendly AI. I find this a hard position to sympathize with -- and not just because I personally favor boxing strategies that let us delay solving the FAI problem. You list the value learning problem as one of your core problems in AGI theory. Well, human values arise in a very complex way out of the structure of human thought processes, so it seems reasonable that an AGI designed to more closely resemble human thought processes would be more amenable to direct value loading -- simply structure the built-in processes to be roughly similar to the best we know about human neuroscience-based psychology, then test and iterate. (This is OpenCog’s plan. As said, I prefer side-stepping the issue with boxing.)

[1] http://lesswrong.com/lw/kq4/link\_engineering\_general\_intelligence\_the/

The only working generally intelligent system we know of (the brain) was indeed evolved, and I do indeed expect that we could invent general intelligence by doing something kinda like what evolution did.

However, I think that value loading on an evolved system would prove quite difficult: just as humans are godshatter, I expect you'd end up with an AI that is a thousand shards of desire but not the same shards as you. Maybe it would be possible to value-align such a system, but there are a number of reasons why I expect that this would be very difficult. (I...)

Kaj_Sotala

I'd be curious to hear your opinion about my recent paper. [...] Your link is broken (correct version): you need to escape underscores in URLs outside a link with a backslash, see formatting help. (Amusingly, the copy-pasted version in this comment looks to work fine.)
