I'm pleased to announce a new paper from MIRI: Toward Idealized Decision Theory.
Abstract:
This paper motivates the study of decision theory as necessary for aligning smarter-than-human artificial systems with human interests. We discuss the shortcomings of two standard formulations of decision theory, and demonstrate that they cannot be used to describe an idealized decision procedure suitable for approximation by artificial systems. We then explore the notions of strategy selection and logical counterfactuals, two recent insights into decision theory that point the way toward promising paths for future research.
Following the Corrigibility paper, this is the second in a series of six papers motivating MIRI's active research areas. Also included in the series will be a technical agenda, which motivates all six research areas and describes the reasons why we have selected these topics in particular, and an annotated bibliography, which compiles a fair bit of related work. I plan to post one paper every week or two for the next few months.
I've decided to start with the decision theory paper, as it's one of the meatiest. This paper compiles and summarizes quite a bit of work on decision theory that was done right here on LessWrong. There is a lot more to be said on the subject of decision theory than can fit into a single paper, but I think this one does a fairly good job of describing why we're interested in the field and summarizing some recent work in the area. The introduction is copied below. Enjoy!
As artificially intelligent machines grow more capable and autonomous, the behavior of their decision procedures becomes increasingly important. This is especially true in systems possessing great general intelligence: superintelligent systems could have a massive impact on the world (Bostrom 2014), and if a superintelligent system made poor decisions (by human standards) at a critical juncture, the results could be catastrophic (Yudkowsky 2008). When constructing systems capable of attaining superintelligence, it is important to ensure that they use highly reliable decision procedures.
Verifying that a system works well in test conditions is not sufficient for high confidence. Consider the genetic algorithm of Bird and Layzell (2002), which, if run on a simulated representation of a circuit board, would have evolved an oscillating circuit. Run in reality, the algorithm instead re-purposed the circuit tracks on its motherboard as a makeshift radio to amplify oscillating signals from nearby computers. Smarter-than-human systems acting in reality may encounter situations beyond both the experience and the imagination of their programmers. In order to verify that an intelligent system would make good decisions in the real world, it is important to have a theoretical understanding of why that algorithm, specifically, is expected to make good decisions even in unanticipated scenarios.
What does it mean to "make good decisions"? To formalize the question, it is necessary to precisely define a process that takes a problem description and identifies the best available decision (with respect to some set of preferences¹). Such a process could not be run, of course; but it would demonstrate a full understanding of the problem of decision-making. If someone cannot formally state what it means to find the best decision in theory, then they are probably not ready to construct heuristics that attempt to find the best decision in practice.
At first glance, formalizing an idealized process which identifies the best decision in theory may seem trivial: iterate over all available actions, calculate the utility that would be attained in expectation if that action were taken, and select the action which maximizes expected utility. But what are the available actions? And what are the counterfactual universes corresponding to what "would happen" if an action "were taken"? These questions are more difficult than they may seem.
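As a rough sketch of this naive recipe (mine, not the paper's; all names here are illustrative), the procedure is easy to write down once you are handed a way to compute counterfactual outcome distributions, and that hand-waved ingredient is exactly where the difficulty lies:

```python
def best_action(actions, utility, counterfactual_distribution):
    """Naive 'idealized' decision procedure: pick the action with the highest
    expected utility. Everything here is trivial *except* the assumed
    counterfactual_distribution(a), which must return a dict mapping each
    outcome to the probability that it "would happen" if a "were taken";
    that is the notion we do not yet know how to define."""
    def expected_utility(a):
        dist = counterfactual_distribution(a)   # the hard, undefined part
        return sum(p * utility(o) for o, p in dist.items())
    return max(actions, key=expected_utility)
```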
The difficulty is easiest to illustrate in a deterministic setting. Consider a deterministic decision procedure embedded in a deterministic environment. There is exactly one action that the decision procedure is going to select. What, then, are the actions it "could have taken"? Identifying this set may not be easy, especially if the line between agent and environment is blurry. (Recall the genetic algorithm repurposing the motherboard as a radio.) However, action identification is not the focus of this paper.
This paper focuses on the problem of evaluating each action given the action set. The deterministic algorithm will only take one of the available actions; how, then, is the counterfactual environment constructed, in which a deterministic part of the environment does something that it in fact does not do? Answering this question requires a satisfactory theory of counterfactual reasoning, and such a theory does not yet exist.
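To make the difficulty concrete, here is a toy deterministic model (purely illustrative, not from the paper): because the agent is a fixed function of a fixed world state, there is no world consistent with it taking the other action, so naive conditioning has nothing to condition on.

```python
# A toy deterministic agent embedded in a toy deterministic environment.
def agent(world_state):
    return "A" if world_state == 0 else "B"   # the agent's fixed policy

world_state = 0                               # fully known and fixed
actual_action = agent(world_state)            # "A": the one action that happens

# To evaluate "what would happen if the agent took B", we would need a world
# consistent with agent(world_state) == "B". But the world state is fixed and
# the agent is deterministic, so no such world exists; naive conditioning asks
# for P(outcome | action == "B"), the probability of an event that never occurs.
consistent_worlds = [s for s in [world_state] if agent(s) == "B"]
print(consistent_worlds)                      # []: nothing to condition on
```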
Many problems are characterized by their idealized solutions, and the problem of decision-making is no exception. To fully describe the problem faced by intelligent agents making decisions, it is necessary to provide an idealized procedure which takes a description of an environment and one of the agents within, and identifies the best action available to that agent. Philosophers have studied candidate procedures for quite some time, under the name of decision theory. The investigation of what is now called decision theory stretches back to Pascal and Bernoulli; more recently decision theory has been studied by Wald (1939), Lehmann (1950), Jeffrey (1965), Lewis (1981), Joyce (1999), Pearl (2000) and many others.
Various formulations of decision theory correspond to different ways of formalizing counterfactual reasoning. Unfortunately, the standard answers from the literature do not allow for the description of an idealized decision procedure. Two common formulations and their shortcomings are discussed in Section 2. Section 3 argues that these shortcomings imply the need for a better theory of counterfactual reasoning to fully describe the problem that artificially intelligent systems face when selecting actions. Sections 4 and 5 discuss two recent insights that give some reason for optimism and point the way toward promising avenues for future research. Nevertheless, Section 6 briefly discusses the pessimistic scenario in which it is not possible to fully formalize the problem of decision-making before the need arises for robust decision-making heuristics. Section 7 concludes by tying this study of decision theory back to the more general problem of aligning smarter-than-human systems with human interests.
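For concreteness, the two standard formulations discussed in Section 2 are evidential and causal decision theory; in the usual textbook notation (not necessarily the paper's own), they differ only in which probability appears inside the expected-utility sum:

```latex
% Evidential decision theory: condition on the action as ordinary evidence.
\mathrm{EU}_{\mathrm{EDT}}(a) = \sum_{o} U(o) \, P(o \mid a)

% Causal decision theory: use a causal counterfactual, e.g. Pearl's do-operator.
\mathrm{EU}_{\mathrm{CDT}}(a) = \sum_{o} U(o) \, P(o \mid \mathrm{do}(a))
```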
1: For simplicity, assume von Neumann-Morgenstern rational preferences, that is, preferences describable by some utility function. The problems of decision theory arise regardless of how preferences are encoded.
The first problem is to take a full description of an environment and an agent, and identify the best action available to that agent, explicitly assuming "full, post-hoc information." At this level we're not looking for a process that can be run, we're looking for a formal description of what is meant by "best available action." I would be very surprised if the resulting function could be evaluated within the described environment in general, and yeah, it will "require a halting oracle" (if the environment can implement Turing machines). Step one is not writing a practical program, step one is describing what is meant by "good decision." If you could give me a description of how to reliably identify "the best choice available" which assumed not only a halting oracle but logical omniscience and full knowledge of true arithmetic, that would constitute great progress. (The question is still somewhat ill-posed, but that's part of the problem: a well-posed question frames its answer.) At this level we're going to need something like logical counterfactuals, but we won't necessarily need logical uncertainty.
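To pin down the shape of this first problem, here is a purely schematic signature (all names are mine, and the body is deliberately a stub): the definition is allowed to assume a halting oracle, logical omniscience, and so on, because it is a specification of "best available action" rather than a program.

```python
from typing import Callable

# Schematic types: complete formal descriptions, e.g. the source code of a
# deterministic environment and a pointer to the subprogram that is "the agent".
Environment = str
Agent = str
Action = str

def best_available_action(env: Environment,
                          agent: Agent,
                          utility: Callable[[Environment], float]) -> Action:
    """Specification only: return the action such that, in the logical
    counterfactual where `agent` outputs that action inside `env`, the
    resulting history scores highest under `utility`. Defining what that
    logical counterfactual even means is the open problem; evaluating this
    function may require a halting oracle or stronger, which is acceptable
    at this level of analysis."""
    raise NotImplementedError("a definition of 'best action', not an algorithm")
```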
The second problem is figuring out how to do something kinda like evaluating that function inside the environment on a computer with unlimited finite computing power. This is the level where you need logical uncertainty etc. The second problem will probably be much easier to answer given an answer to the first question, though in practice I expect both problems will interact a fair bit.
Solving these two problems still doesn't give you anything practical: the idea is that the answers would reveal the solution which practical heuristics must approximate if they're going to act as intended. (It's hard to write a program that reliably selects good decisions even in esoteric situations if you can't formalize what you mean by "good decisions"; it's easier to justify confidence in heuristics if you understand the solution they're intended to approximate; etc.)
https://groups.google.com/forum/#!topic/mirix-workshops/MWuGo25eI8g