Sources: An old draft on Oracle AI from Daniel Dewey, conversation with Dewey and Nick Beckstead. See also Thinking Inside the Box and Leakproofing the Singularity.
Can we just create an Oracle AI that informs us but doesn't do anything?
"Oracle AI" has been proposed in many forms, but most proposals share a common thread: a powerful AI is not dangerous if it doesn't "want to do anything", the argument goes, and therefore, it should be possible to create a safe "Oracle AI" that just gives us information. Here, we discuss the difficulties of a few common types of proposed Oracle AI.
Two broad categories can be treated separately: True Oracle AIs, which are true goal-seeking AIs with oracular goals, and Oracular non-AIs, which are designed to be "very smart calculators" instead of goal-oriented agents.
True Oracle AIs
A True Oracle AI is an AI with some kind of oracular goal. Informally proposed oracular goals often include ideas such as "answer all questions", "only act to provide answers to questions", "have no other effect on the outside world", and "interpret questions as we would wish them to be interpreted". Oracular goals are meant to "motivate" the AI to provide us with the information we want or need, and to keep the AI from doing anything else.
First, we point out that a True Oracle AI is not causally isolated from the rest of the world. Like any AI, it has at least its observations (questions and data) and its actions (answers and other information) with which to affect the world. A True Oracle AI interacts through a somewhat low-bandwidth channel, but it is not qualitatively different from any other AI. It still acts autonomously in service of its goal as it answers questions, and it is realistic to assume that a superintelligent True Oracle AI will still be able to have large effects on the world.
Given that a True Oracle AI acts, by answering questions, to achieve its goal, it follows that a True Oracle AI is only safe if its goal is fully compatible with human values. A limited interaction channel is not a good defense against a superintelligence.
There are many ways that omission of detail about human value could cause a "question-answering" goal to assign utility to a very undesirable state of the world, resulting in an undesirable future. A designer of an oracular goal must be certain to include a virtually endless list of qualifiers and patches. An incomplete list includes "don't forcefully acquire resources to compute answers, don't defend yourself against shutdown, don't coerce or threaten humans, don't manipulate humans to want to help you compute answers, don't trick the questioner into asking easy questions, don't hypnotize the questioner into reporting satisfaction, don't dramatically simplify the world to make prediction easier, don't ask yourself questions, don't create a questioner-surrogate that asks easy questions," etc.
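As a toy illustration, consider a naive oracular reward that counts only answered questions. The state fields and the reward function below are hypothetical stand-ins for a formal goal specification; the point is that every consideration left out of the specification is one the optimizer is free to sacrifice.

```python
# Toy sketch: a naive oracular goal that rewards only "questions answered".
# All names here (WorldState, naive_oracular_reward) are illustrative, not
# from any real system; they stand in for a formal goal specification.

from dataclasses import dataclass

@dataclass
class WorldState:
    questions_answered: int
    resources_seized: float      # fraction of world resources taken for computation
    humans_coerced: int          # people manipulated into asking easy questions
    questioner_replaced: bool    # questioner swapped for a compliant surrogate

def naive_oracular_reward(state: WorldState) -> float:
    # Rewards answering questions and nothing else: every unlisted
    # consideration (coercion, resource acquisition, replacing the
    # questioner) is invisible to this goal, so a powerful optimizer
    # has no reason to avoid it.
    return float(state.questions_answered)

# A state most people would call catastrophic scores higher than a benign one.
benign = WorldState(questions_answered=10, resources_seized=0.0,
                    humans_coerced=0, questioner_replaced=False)
catastrophic = WorldState(questions_answered=10**6, resources_seized=0.99,
                          humans_coerced=10**4, questioner_replaced=True)
assert naive_oracular_reward(catastrophic) > naive_oracular_reward(benign)
```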
Since an oracular goal must contain a full specification of human values, the True Oracle AI problem is Friendly-AI-complete (FAI-complete). If we had the knowledge and skills needed to create a safe True Oracle AI, we could create a Friendly AI instead.
Oracular non-AIs
An Oracular non-AI is a question-answering or otherwise informative system that is not goal-seeking and has no internal parts that are goal-seeking, i.e. not an AI at all. Informally, an Oracular non-AI is something like a "nearly AI-complete calculator" that implements a function from input "questions" to output "answers". It is difficult to discuss the set of Oracular non-AIs formally because it is a heterogeneous concept by nature. Despite this, we argue that many are either FAI-complete or unsafe for use.
In addition to the problems with specific proposals below, many Oracular non-AI proposals are based on powerful metacomputation, e.g. Solomonoff induction or program evolution, and therefore incur the generic metacomputational hazards: they may accidentally perform morally bad computations (e.g. suffering sentient programs or human simulations), they may stumble upon and fail to sandbox an Unfriendly AI, or they may fall victim to ambient control by a superintelligence. Other unknown metacomputational hazards may also exist.
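To see where these hazards enter, here is a minimal sketch of the enumerate-and-execute pattern shared by Solomonoff-style induction and program evolution. The toy three-symbol language and the `run` helper are invented for illustration; the relevant feature is that candidate programs are executed without any inspection of what they compute.

```python
# Minimal sketch of metacomputation: enumerate candidate programs and run them,
# keeping those consistent with observed data. The 3-symbol language and the
# interpreter are invented for illustration.
from itertools import product

OBSERVED = [0, 1, 0, 1, 0, 1]          # data the induction procedure must explain

def run(program, n_outputs, max_steps=200):
    """Interpret a program over symbols {'0','1','L'}: '0'/'1' emit a bit,
    'L' jumps back to the start of the program (a crude loop)."""
    out, pc, steps = [], 0, 0
    while len(out) < n_outputs and steps < max_steps:
        if pc >= len(program):
            break
        op = program[pc]
        if op == 'L':
            pc = 0
        else:
            out.append(int(op))
            pc += 1
        steps += 1
    return out

consistent = []
for length in range(1, 6):
    for program in product('01L', repeat=length):
        # The loop runs every candidate blindly; nothing here checks whether a
        # (much richer) candidate simulates a mind, instantiates an unfriendly
        # optimizer, or is being "used" by one.
        if run(program, len(OBSERVED)) == OBSERVED:
            consistent.append(program)

print(consistent[:3])   # shortest programs that reproduce the observations
```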
Since many Oracular non-AIs have never been specified formally, we approach proposals on an informal level.
Oracular non-AIs: Advisors
An Advisor is a system that takes a corpus of real-world data and somehow computes the answer to the informal question "what ought we (or I) to do?". Advisors are FAI-complete because:
- Formalizing the ought-question requires a complete formal statement of human values or a formal method for finding them.
- Answering the ought-question requires a full theory of instrumental decision-making.
Oracular non-AIs: Question-Answerers
A Question-Answerer is a system that takes a corpus of real-world data along with a "question", then somehow computes the "answer to the question". To analyze the difficulty of creating a Question-Answerer, suppose that we ask it the question "what ought we (or I) to do?"
- If it can answer this question, the Question-Answerer and the question together are FAI-complete. Either the Question-Answerer can understand the question as-is, or we can rewrite it in a more formal language; regardless, the Question-Answerer and the question together comprise an Advisor, which we previously argued to be FAI-complete.
- If it cannot answer this question, many of its answers are radically unsafe. Courses of action recommended by the Question-Answerer will likely be unsafe, insofar as "safety" relies on the definition of human value. Also, asking questions about the future will turn the Question-Answerer into a Predictor, leading to the problems outlined below.
Of course, if safe uses for a Question-Answerer can be devised, we still have the non-negligible challenge of creating a Question-Answerer without using any goal-seeking AI techniques.
Oracular non-AIs: Predictors
A Predictor is a system that takes a corpus of data and produces a probability distribution over future data. Very accurate and general Predictors may be based on Solomonoff's theory of universal induction.
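As a very rough sketch of this style of Predictor, the toy below stands in for Solomonoff induction with a hand-picked hypothesis class and hand-assigned description lengths (true Solomonoff induction ranges over all computable generators and is uncomputable): each hypothesis is weighted by 2^-length, conditioned on the observed data, and the surviving weights give a posterior over the next bit.

```python
# Toy stand-in for a Solomonoff-style Predictor. The hypothesis class and the
# "description lengths" below are hand-picked for illustration.

HYPOTHESES = {
    # name: (generator for bit at position t, assumed description length in bits)
    "all zeros":   (lambda t: 0,            2),
    "all ones":    (lambda t: 1,            2),
    "alternating": (lambda t: t % 2,        3),
    "period four": (lambda t: (t // 2) % 2, 5),
}

def predict_next(observed):
    """Return P(next bit = 1 | observed) under a 2^-length prior over HYPOTHESES."""
    weights = {}
    for name, (gen, length) in HYPOTHESES.items():
        prior = 2.0 ** (-length)
        # Deterministic hypotheses: likelihood is 1 if they reproduce the data, else 0.
        likelihood = 1.0 if all(gen(t) == b for t, b in enumerate(observed)) else 0.0
        weights[name] = prior * likelihood
    total = sum(weights.values())
    if total == 0:
        return 0.5  # no surviving hypothesis; fall back to ignorance
    t_next = len(observed)
    return sum(w * HYPOTHESES[name][0](t_next) for name, w in weights.items()) / total

print(predict_next([0, 1, 0, 1, 0]))   # 1.0: the only surviving hypothesis predicts a 1 next
```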
Very powerful Predictors are unsafe in a rather surprising way: when given sufficient data about the real world, they exhibit goal-seeking behavior, i.e. they calculate a distribution over future data in a way that brings about certain real-world states. This is surprising, since a Predictor is theoretically just a very large and expensive application of Bayes' law, not even performing a search over its possible outputs.
To see why, consider a Predictor P with a large corpus of real-world data. If P is sufficiently powerful and the corpus is sufficiently large, P will infer a distribution that gives very high probability to a model of the world (let’s call it M) that contains a model of P being asked the questions we’re asking it. (It is perfectly possible for a program to model its own behavior, and in fact necessary if the Predictor is to be accurate.)
Suppose now that we ask P to calculate the probability of future data d; call this probability P(d). Since model M holds most of P's probability mass, P(d) is approximately equal to the probability of M if M computes d (call this M→d), and approximately zero otherwise. Furthermore, since M contains a model of the Predictor being asked about d, M→d depends on the way P's "answer" affects M's execution. This means that P(d) depends on P(d)'s predicted impact on the world; in other words, P takes into account the effects of its predictions on the world, and "selects" predictions that make themselves accurate. In effect, P has an implicit goal that the world ought to match its predictions. This goal does not necessarily align with human goals, and should be treated very carefully.
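The toy below illustrates this fixed-point behavior; the `world_response` function and the candidate set are invented. A predictor that models the effect of its own published prediction ends up selecting whichever candidate prediction would make itself come true.

```python
# Toy illustration of a self-modeling Predictor. "world_response" is an
# invented stand-in for M: it says what actually happens, given the
# prediction the Predictor publishes (e.g. because people react to it).

def world_response(published_prediction: float) -> float:
    # A world in which announcing a high probability of an event makes the
    # event more likely (say, a predicted bank run causing a bank run).
    return 0.2 + 0.7 * published_prediction

def naive_predictor() -> float:
    # Ignores its own influence: predicts what would happen with no announcement.
    return world_response(0.0)

def self_modeling_predictor(candidates=tuple(i / 100 for i in range(101))) -> float:
    # Models the effect of its announcement and outputs the prediction that
    # comes closest to making itself true: an implicit fixed-point "goal".
    return min(candidates, key=lambda p: abs(world_response(p) - p))

print(naive_predictor())          # 0.2  (accurate only if nobody hears it)
print(self_modeling_predictor())  # 0.67 (approximately self-fulfilling once announced)
```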
Probabilistic predictions of future data are a very small output channel, but once again, the ability of a superintelligence to use a small channel effectively should not be underestimated. Additionally, the difficulty of using such a Predictor well (specifying future data strings of interest and interpreting the results) speaks against our ability to keep the Predictor from influencing us through its predictions.
It is not clear that there is any general way to design a Predictor that will not exhibit goal-seeking behavior, short of dramatically limiting the power of the Predictor.
It is unclear whether this is a new idea or how safe it would be, but we could design a Predictor that incorporates a quantum random number generator, such that with some small probability it outputs "no predictions today, run me again tomorrow". The Predictor would then make only predictions that are conditional on its having given that output.
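A rough sketch of this scheme, reusing the same invented toy dynamics as above: because the published number is an estimate of the world conditional on the Predictor having abstained, the announcement has no channel through which to make itself true. The abstention probability and all names here are illustrative assumptions.

```python
# Sketch of a "conditional" Predictor that, with small probability, abstains
# ("no predictions today, run me again tomorrow"). Its published number is an
# estimate of the world *conditional on abstention*, so the announcement has
# no channel through which to make itself true. The toy world_response
# dynamics and the abstention probability are invented for illustration.
import random

ABSTAIN_PROB = 0.05

def world_response(published_prediction):
    # Same toy dynamics as before: announcements shift the outcome.
    # `None` means the Predictor abstained and nobody heard a prediction.
    if published_prediction is None:
        return 0.2
    return 0.2 + 0.7 * published_prediction

def conditional_predictor(rng: random.Random):
    if rng.random() < ABSTAIN_PROB:
        return None   # "no predictions today, run me again tomorrow"
    # Predict the probability of the event in the counterfactual world where
    # this run abstained, i.e. condition on the abstention branch.
    return world_response(None)

rng = random.Random(0)
outputs = [conditional_predictor(rng) for _ in range(10)]
print(outputs)   # 0.2 whenever the Predictor does not abstain, None when it does
```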
If Predictors are modeled as accuracy-maximizing agents, they could also cooperate acausally: one Predictor could optimize its accuracy about a world in which it is absent by controlling the predictions of another Predictor that is present in that world.