Rubi and Johannes worked on this post as part of the SERI MATS program, with Evan Hubinger providing mentorship to both. Rubi also received mentorship from Leo Gao. Thanks to Paul Colognese and Nicholas Schiefer for discussions related to this post.

An oracle is a type of AI system that only answers questions without taking any other actions in the world. Simulators and generative models, which have seen increased discussion recently (links: 1, 2, 3, 4), can be thought of as types of oracles. Such systems may simultaneously be powerful enough to generate a pivotal act while also being easier to align due to a more limited domain.

One major concern with oracles is that the answers they give can still manipulate the world. If oracles are evaluated on predictive accuracy, this gives them an incentive to use their answers to affect the course of events and make the world more predictable. Concretely, we are concerned that oracles may make self-fulfilling prophecies (also known as self-confirming predictions or fixed points), where the act of making the prediction causes the predicted outcome to come true. Even if their manipulation does not terminate in a fixed point, attempts to influence the world towards predictability can be very dangerous.

As one example, consider a highly trusted oracle asked to predict the stock market. If such an oracle predicts that stock prices will rise, then people buy based on that prediction and the price will in fact rise. Similarly, if the oracle predicts that prices will fall, then people will sell, causing prices to fall. For a real-world example, see this market and this market, each on whether a specific person will find a research/romantic partner. Here, high probabilities would indicate desirability of that person, while low probabilities would suggest some hidden flaw, either of which could influence whether potential partners decide to reach out and therefore how the market resolves.
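To make the multiplicity of self-fulfilling predictions concrete, here is a minimal sketch in Python. The reaction function is entirely made up for illustration (it is not a real market model); the point is only that once the outcome depends on the published prediction, the prediction-to-outcome map can have several fixed points, each of which is a “correct” prediction.

```python
# Toy illustration: an oracle reports the probability p that a stock rises.
# Traders react to the report, so the true probability of a rise becomes a
# function of p. Any p with reaction(p) == p is a self-fulfilling prophecy.
# The reaction function below is invented purely for illustration.

def reaction(p: float) -> float:
    """Made-up feedback model: a smooth S-curve mapping reports to outcomes."""
    return 0.1 + 0.8 * (3 * p**2 - 2 * p**3)

def find_fixed_points(f, n_grid: int = 10_000, tol: float = 1e-3):
    """Scan [0, 1] for points where f(p) is approximately p."""
    points = []
    for i in range(n_grid + 1):
        p = i / n_grid
        if abs(f(p) - p) < tol:
            # Keep only one representative per cluster of nearby solutions.
            if not points or p - points[-1] > 10 * tol:
                points.append(p)
    return points

print(find_fixed_points(reaction))  # several distinct self-fulfilling predictions
```

With this particular reaction function the scan finds three self-consistent reports (roughly 0.14, 0.5, and 0.86); which of them a consequence-aware oracle outputs is exactly the choice the next paragraph worries about.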

In both the stock market and partnership cases, multiple predictions are valid, so how does the oracle choose between them? Ideally, we would like it to choose the one that is “better” for humanity, but this reintroduces an outer alignment problem similar to that of an agentic AI acting directly on the world, which is exactly what we wanted to avoid by using oracles in the first place.

Instead, what we can aim for is an oracle that does not take into account the consequences of the prediction it makes when choosing a prediction. Then, there is only one valid prediction for the oracle to make, since the rest of the world is constant from its perspective. This can be thought of as a type of partial agency, optimizing the prediction in some directions but not others. It would be extremely desirable as a safety property, removing all incentives for an oracle to manipulate the world. To emphasize the importance of this property, we introduce new terminology, dubbing oracles “consequence-blind” if they exhibit the desired behavior and “consequence-aware” if they do not.  

For an oracle, consequence-blindness is equivalent to following a lonely causal decision theory. The causal decision theory blinds the oracle to any acausal influence, while the loneliness component makes it blind to its influence on other agents, which are necessary intermediaries for a prediction to influence the world. 

In this post we will primarily consider an oracle trained via supervised learning on a historical data set. There are a number of different policies that could be learned which minimize loss on the training set, and we will explore the different ways these can generalize. We divide the generalization behavior into a number of different axes, and for each axis discuss the potential dangers, where we could expect to land by default, and any known interventions that can shift us towards safer systems. One possible framing is that each of these axes provides safety through being a facet of consequence-blindness, and so we analyze when and why this overlap occurs.

The first axis is counterfactual versus factual oracles, which describes whether the oracle predicts what will happen in the real world or in some counterfactual one, with the counterfactual scenario typically avoiding the influence of the prediction. Counterfactual oracles and consequence-blind oracles are closely related, so we contrast them and enumerate the possible combinations in the first section.

The next section covers the axis of myopia versus non-myopia, focusing specifically on episodic myopia. Such a myopic oracle chooses a prediction only based on the outcome of that prediction, in contrast to a non-myopic oracle that chooses each prediction while also considering the effects on the other predictions it will or expects to make.

The final section deals with decision theories, splitting them into causal and acausal variants. Causal oracles only consider the causal consequences of their predictions and treat everything else as fixed, while acausal oracles take into account any kind of dependence between the world and their prediction. They recognize that their prediction provides information about aspects of the world beyond what it causally affects, and choose more auspicious predictions. 

The main takeaways for this post are as follows:

  1. An oracle that is blind to both the causal and acausal impact of its predictions is ideal for safety, but is unlikely to arise by default, and we currently do not know how to structure the training process to make it more likely.
  2. Counterfactual oracles can be much safer than factual oracles, to the degree that they induce consequence-blindness. This depends on the exact counterfactual used, the model’s internal implementation of the counterfactual, and interactions between multiple oracles. The counterfactuals we currently have some idea how to implement are less safe than certain others that we do not yet know how to implement.
  3. Myopia is quite likely necessary to make an oracle safe, and may be close to sufficient, depending on the danger posed by optimizing for fixed points. It is at least plausible that myopia is the default outcome of many oracle training procedures, but this is uncertain, and we do not know how to intervene in training to make myopia more likely.

Consequence-blind and Counterfactual Oracles

The overarching distinction we draw is whether or not the oracle takes into account the effects of its prediction when choosing which prediction to make. Since the training data do not contain any examples where the oracle’s prediction affects the outcome, it is unclear how the oracle will generalize to a deployment environment where this is a possibility. The oracle could learn either to predict the real world as it will actually evolve, or to predict how the world would evolve without it making a prediction. Moreover, the oracle could predict a world where it is aware that some oracle is being used to make a prediction, but does not realize that this oracle is “itself”, meaning that it ignores the causal connection between the predictions made by the real oracle and the one it models.

If an oracle takes into account the effects of its predictions on the world, there can be an incentive for it to optimize its prediction to influence the world towards predictability, which could be quite dangerous. Life is fairly chaotic, so the worlds that are most predictable may be ones in which all humans are dead or controlled by a malign superintelligence. More generally, since the potential goals of an oracle AI have not been aligned to human values, we would like to prevent it from applying any kind of optimization pressure over the world. We would not want an oracle that systematically chooses between different fixed points to achieve a goal, such as rising or falling stock market prices from the example above.

Consequence-blind vs. Counterfactual

We say that an oracle is consequence-blind when it ignores any kind of relation between its world model and its prediction. An oracle is counterfactual if its predictions are made about a counterfactual world in which it does not have any influence (with Armstrong’s original proposal being a special case, where the counterfactual is that the prediction is hidden). Whether an oracle is either or both depends on its decision theory, beliefs, and objectives. However, these cannot be cleanly separated, so while we think that the distinction is useful, it is somewhat ambiguous. To illustrate the difference, we discuss here the different possible combinations of counterfactual vs. factual and consequence-blind vs. consequence-aware oracles.

A counterfactual oracle might be able to predict a world that contains nothing influenced in any way by its prediction. However, it could still be consequence-aware if it knows that it exists in the real world and has a model of how the prediction it makes will affect the real world. This would be like a human answering “Who would win in a fight between Harry Potter and Superman?”, in that they know they are answering about a fictional world but may choose their answer to gain social standing among their friends. If an oracle is instead consequence-blind, then either it does not know it exists in the real world or it does not know that its choice of prediction will have an influence.

Alternatively, a counterfactual oracle could predict a world that does contain elements related to its prediction. Then, further subdivision of consequence-blindness and consequence-awareness is necessary, since any combination of properties can apply to both the real and counterfactual worlds. We could, for example, have an oracle that is aware of acausal consequences in the counterfactual world and chooses its predictions to manipulate them, while being unaware of any effects in the real world, or an oracle that believes it has no influence on the counterfactual world while still being aware of real-world consequences. The previous example of a world containing nothing influenced by the prediction is just a special case of this scenario, where consequence-blindness and consequence-awareness in the counterfactual world amount to the same thing.

An oracle that is factual is trying to make predictions about the real world, which introduces a causal dependence of the oracle’s world model on its predictions. A consequence-aware oracle jointly optimizes its predictions and the resulting world model to achieve some objective. For instance, if there are different fixed points to choose from, the oracle could target the fixed point with the lowest entropy to optimize expected accuracy. We want to avoid a factual and consequence-aware oracle for this reason.

The alternative to this is a consequence-blind but factual oracle. Such an oracle has an accurate world model, including a model of the system that it runs on, but it does not realize the connection between that system and itself (the thing making a prediction). If the oracle is sufficiently sophisticated it could still end up solving for fixed points, in which it accurately models a copy of itself in the predicted world, which accurately models another copy of itself in another predicted world, and so on, with the initial oracle outputting a fixed point of this recursion. This would be worrying, since we don’t know which fixed point the oracle ends up with, but that fixed point may not be explicitly chosen according to some objective and so could be safer than those of a consequence-aware oracle.

A consequence-blind oracle could also alleviate the recursion problem by making predictions about a version of the real world in which all references to its predictions are replaced by some kind of prior over its own outputs or a simpler oracle. This would result in a unique grounded prediction, since any recursion would immediately ground out in this prior, preventing further recursion and the need to find a fixed point. Such reasoning thus seems relatively safe, but leaves open the question of what prior or simpler oracle to use. An oracle of this form could also be viewed as counterfactual, since it is ultimately making a prediction about a world different from the real world, but we believe it is more appropriate to view it as merely making imperfect predictions about the real world. Nonetheless, this example shows how consequence-blindness and counterfactual prediction are strongly related, and the distinction between them somewhat ambiguous.
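As a toy illustration of the difference, consider the sketch below. The `world_given_prediction` function and the prior are placeholders invented here, not anything from the post: an oracle that models itself has to recurse until it lands on a fixed point, whereas an oracle that substitutes a fixed prior for its own prediction grounds out in one step and gives a unique answer.

```python
# Toy contrast between fixed-point recursion and grounding in a prior.
# `world_given_prediction` stands in for "how the predicted quantity depends
# on what the oracle is believed to have predicted".

def world_given_prediction(p: float) -> float:
    return 0.3 + 0.5 * p  # placeholder dependence, invented for illustration

def factual_self_modeling_oracle(depth: int, seed: float = 0.5) -> float:
    """Models a copy of itself modeling a copy of itself, ... truncated at `depth`.
    As depth grows this converges to a fixed point of world_given_prediction."""
    if depth == 0:
        return seed
    return world_given_prediction(factual_self_modeling_oracle(depth - 1, seed))

def prior_grounded_oracle(prior: float) -> float:
    """Replaces any reference to its own prediction with a fixed prior,
    so the recursion grounds out after a single step."""
    return world_given_prediction(prior)

print(factual_self_modeling_oracle(depth=30))  # ~0.6, the fixed point of the recursion
print(prior_grounded_oracle(prior=0.5))        # 0.55, unique and non-recursive
```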

A counterfactual oracle is safer than a factual one exactly insofar as it introduces consequence-blindness. If an oracle is predicting a counterfactual but using that prediction to influence the real world, then no safety is gained. Predicting a counterfactual to manipulate the counterfactual world while being blind to the real one is better, but still dangerous. It is only in predicting a counterfactual world without any way to manipulate that counterfactual world, while also being unaware of the real world, that a counterfactual oracle becomes safe, and this is exactly equivalent to consequence-blindness. 

Counterfactual Scenarios and the Training Process

There are many kinds of counterfactual oracles, and it is worth being precise about what each is doing. In the original proposal for counterfactual oracles, the counterfactual is that the oracle’s prediction is hidden until the question is resolved. 

The suggested training process for such an oracle in an ML setup would use online learning. Some small percent of the time the oracle’s prediction would randomly be selected to be hidden until after the prediction resolves, and it is only on these data points that the model would be updated. This trains the model to predict as though its predictions will remain hidden, even in situations where that seems unlikely a priori. Notably, this could produce a different result than simply training on historical data, as the predictions the model makes are ones that would influence the world in the modal case, if not for the unlikely event of the prediction being hidden. The downside is that this training process can be expensive and time consuming, and it gives up some opportunities to learn useful information.
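The loop below is a rough sketch of that procedure, under our own assumptions about the interfaces involved: `oracle`, `question_stream`, `resolve`, and `update` are hypothetical stand-ins rather than any real API. The key property is that gradient updates only ever come from episodes where the prediction was hidden.

```python
import random

HIDE_PROBABILITY = 0.05  # small fraction of predictions withheld from everyone

def train_counterfactual_oracle(oracle, question_stream, resolve, update):
    """Online loop sketching Armstrong-style counterfactual oracle training.

    `oracle(question)` returns a prediction, `resolve(question, shown)` returns
    the eventual outcome (which may depend on whether the prediction was shown),
    and `update(oracle, question, prediction, outcome)` applies a supervised
    update. All of these are assumed interfaces, not a real API.
    """
    for question in question_stream:
        prediction = oracle(question)
        hidden = random.random() < HIDE_PROBABILITY
        if hidden:
            # Prediction is withheld until resolution: the outcome cannot have
            # been influenced by it, so this episode is safe to train on.
            outcome = resolve(question, shown=False)
            update(oracle, question, prediction, outcome)
        else:
            # Prediction is shown and may influence the world; this episode is
            # never used as a training signal.
            resolve(question, shown=True)
    return oracle
```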

A problem with this approach in practice is that it seems likely that the human operators would want to know the answers to all questions, and so would ask the same question repeatedly, possibly rephrased or refactored. If this occurs, the training signal the model receives will come from a world where its predictions on that topic did in fact have an influence on the world. This is a special case of a more general problem, namely that there can be a logical connection between the oracle’s answers to different questions, and this can allow information about its hidden predictions to leak out into the world, thereby contaminating the training signal and introducing optimization pressure towards fixed points. 

This is not the only way to be a counterfactual oracle, though. An oracle could instead answer questions about a counterfactual world where the oracle itself does not exist. Such a strategy preempts issues of self-fulfilling prophecies, but raises other challenges. Since the counterfactual oracle being queried does not exist in the counterfactual world, there would be an incentive for people in that world to build such an oracle and query it. This can be avoided by making the counterfactual one where it is impossible to build any oracles, but at this point the oracle is making predictions about a world that is fundamentally very different from our own, and so for some applications the usefulness of these predictions would be limited. For example, a world without any AI systems may still produce useful alignment research, but wouldn’t produce good policies to delay the deployment of AGI. As a further obstacle, it is also unclear how to train an oracle of this sort, as there is no way to randomly have queries resolved in a world where oracles are not possible and use only those data points to train an oracle.

It is difficult to come up with a clean definition of what exactly constitutes an oracle. There are many modern systems that make predictions about the future, sometimes even resulting in self-fulfilling prophecies (see the flash crash). It’s even possible for a human to make self-fulfilling prophecies without any technical assistance at all (e.g. asset bubbles). Excluding all prediction-making machines is not feasible, but capabilities are a spectrum and the exact point at which to draw the line is unclear. Wherever the line is drawn, it is likely to exclude some systems unnecessarily while still leaving in others that the oracle’s predictions could influence.

Many of the problems with a counterfactual that removes the oracle stem from the vacuum it creates. The users of the oracle face pressure to find another system that can make a prediction. As such, it could be useful to instead replace the oracle in the counterfactual with a simpler one that produces a fixed set of probabilities. The counterfactual oracle would then predict one layer in: what the world would look like if its predictions matched the simpler oracle’s. This prevents optimization towards fixed points while still producing information about a world close to our own. The makeup of the replacement oracle is not specified here, and it would pose a risk if the counterfactual oracle had control over the choice of implementation, but specifying some particular model could be quite easy.

Another possible way to train a counterfactual oracle is as a special case of generative models. These models, like LLMs, would take in a series of tokens and predict the next token, corresponding to predicting future world states. If tokens representing counterfactual world states are then provided as an input, it is possible that the model could make predictions about counterfactual futures, if it generalizes in the desired way. Upcoming work by Evan Hubinger, Kate Woolverton, and the authors of this post will provide additional detail on the challenges and benefits of this approach.

Counterfactual Oracle Interactions

Even if the training process produces a counterfactual oracle, we can still end up with fixed points because of interactions between counterfactual oracles.

Consider a scenario where both OpenAI and Deepmind have developed counterfactual oracles and query them about an area of mutual concern, such as semiconductor prices. OpenAI’s oracle predicts as though no one will see its prediction, but it knows people will see the prediction of Deepmind’s oracle, and the same also holds in reverse. If one were to make a prediction that is not a fixed point, the other would iterate on it, predicting the world conditional on that prediction. The only predictions that are an equilibrium are fixed points.

If a fixed point exists for a non-counterfactual oracle, it is likely to exist for a pair of counterfactual oracles as well. When a pair of counterfactual oracles each choose the same fixed point, each is making its optimal prediction in response to the other. The iterative process of predicting conditional on the other’s prediction then searches through prediction space until it either hits a fixed point or halts due to computational limitations. This means that fixed points can exist for systems of counterfactual oracles and that the system puts search power into finding one.
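A toy version of this dynamic is sketched below, with an invented reaction function standing in for how the world responds to whichever prediction is actually published: alternating best responses between the two oracles just iterate the prediction-to-outcome map, and the process only comes to rest at a fixed point.

```python
def world_given_published_prediction(p: float) -> float:
    """Placeholder: how the outcome shifts given the prediction people actually see."""
    return 0.2 + 0.6 * p  # invented dependence; its unique fixed point is p = 0.5

def iterate_two_oracles(initial: float, rounds: int) -> float:
    """Each oracle treats its own prediction as hidden but conditions on the
    other's visible prediction, so the pair alternately best-respond and the
    published prediction follows the iterated map."""
    prediction = initial
    for _ in range(rounds):
        prediction = world_given_published_prediction(prediction)
    return prediction

print(iterate_two_oracles(initial=0.9, rounds=1))   # one best response: 0.74
print(iterate_two_oracles(initial=0.9, rounds=25))  # the pair settles near the fixed point 0.5
```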

Even if counterfactuals could be implemented so that each counterfactual oracle does not believe it exists, a similar dynamic would emerge in the presence of multiple such systems. If n counterfactual oracles exist, each would predict conditional on the output from n-1 counterfactual oracles, ending up n iterations into the prediction process. This iteration also stops if it encounters a fixed point, so it is also searching through prediction space for a fixed point.

Interactions between counterfactual oracles can occur even in a unipolar world. As another example, consider a single oracle predicting under the counterfactual that it does not exist. When it is approached with an important question, it has strong evidence that people want to ask that question to an oracle, and since it does not exist it predicts that a new counterfactual oracle will be built to be asked the question. This process is repeated recursively, with answers propagating back down the chain until they reach a fixed point, which is then output by the original counterfactual oracle.

These examples show that counterfactual oracles are not necessarily robust to the existence of other oracles. A collection of oracles is not affected significantly by the loss of any individual member, meaning that an oracle making a prediction under the counterfactual where its own answers are hidden is indistinguishable from a non-counterfactual oracle when included in such a system. The interactions between multiple counterfactual oracles, across either space or time, can lead to emergent fixed points.

Implementing a counterfactual in which oracles do not and cannot exist is necessary to avoid fixed points arising through the interactions of counterfactual oracles. Importantly, this is a counterfactual that we do not currently know how to train for.

Counterfactual Objective vs. World Model

Within the subclass of counterfactual oracles, there are multiple ways that one could be implemented. Predicting a counterfactual world can either come from the model’s objective function or its world model. In the former case, the oracle understands the real world as it is (including its own existence), but only uses this information to make predictions about a counterfactual world. In the latter case, the oracle makes predictions about what it believes is the real world, but it has been lobotomized in such a way that it is unaware people will see its predictions in the real world.

Between the two, an oracle with a counterfactual objective seems much easier to implement, such as through the online training procedure mentioned previously. Removing all representations of specific facts from a model, on the other hand, is hard to do robustly and may hurt performance. It requires either strong interpretability tools to identify and remove the part of the oracle that holds its self-conception, or censoring the training data so thoroughly that the model cannot realize its own existence (which it might infer from information as simple as “machines can make predictions”).

Despite the difficulty, building an oracle that is unaware of its existence is preferable, as this implies consequence-blindness. An AI that doesn’t realize it can affect the real world at all is considerably easier to prevent from negatively affecting the real world (see Steve Byrnes’ argument here). It would also remove the incentive to model the counterfactual world in ways that are easier to predict, which would introduce issues similar to the issue of predicting a model of HCH. Since there is no external way to verify the counterfactual outcomes, the training process may push it towards less accurate but easier to predict counterfactual world models.

Myopic Oracles

The second major distinction between oracles is whether they are myopic or non-myopic. We focus this section on episodic myopia, with each prediction being one episode (see this post for some definitions of different myopia types). This means an oracle considers each prediction in isolation, focusing only on making it as accurate as possible, in contrast to a non-myopic oracle that considers the effects of each prediction it makes on future (and possibly past) predictions. It may also be possible to have some sort of hybrid oracle that normally makes predictions myopically but considers the impact on other predictions as a tiebreaker between two equally likely myopic predictions, though this seems unlikely. While myopia can be hoped for as the default outcome of the training process when the predictions made do not affect which other questions get asked or their outcomes, it cannot be guaranteed or evaluated with current methods.

The overlap between myopia and consequence-blindness is simple: a myopic oracle is blind to all consequences of a prediction beyond what influences the outcome of the prediction itself. For the purposes of safety, this captures many of the influence channels that we would like an oracle to ignore.

Another notable variant on myopia would be defining episodes by a short length of time, so that a myopic AI system only cares about the world state some short amount of time ahead. For an oracle, preferences over world state can be based on outstanding predictions, resolved predictions, or both. If it is based on outstanding predictions, then the oracle has an incentive to use predictions to influence whether predictions resolve in that time step and the outlook of outstanding predictions at the end of it. This influence happens immediately, so regardless of time step length there is an opportunity for manipulation. If based on resolved predictions, the oracle has an incentive to use predictions to influence whether and how predictions (including past ones) resolve within the time step. Here, the hope is that the time step is sufficiently short that there are no opportunities for manipulation, despite the incentive. However, since valuing unresolved predictions introduces manipulation, for an oracle to care about making good predictions they must resolve during the time step. This limits what is possible to predict, although it may be possible to chain single time step predictions together to get longer term ones. While this form of episodic myopia may be useful for avoiding deceptive mesaoptimizers, we are not optimistic about using it for safe predictions.

Non-Myopic Policies

There are many non-myopic policies that minimize loss on the training set; two straightforward ones are maximizing total accuracy and minimizing total inaccuracy, where total means summed over all predictions the oracle will ever make. Total accuracy is best maximized by a rapid stream of queries (preferably ones that are easy to answer), while total inaccuracy is minimized by avoiding future questions (and, if necessary, getting asked ones that are easy to answer). In either case, the oracle is incentivized to answer questions in a way that steers the world towards a state with the inputs it prefers, even if it means sacrificing predictive accuracy in the short term. For the standard reasons, optimizing for goals orthogonal to human values is dangerous.

Other potential non-myopic policies include various weighted averages of accuracy, minimizing the distance between the internal world model and the actual world while being indifferent about questions, or maximizing predictive accuracy on the questions it answers in a way that somehow avoids affecting the optimal number of questions to answer. These all run into the same issue as the above, where the oracle acts to steer the world in a way that affects the quantity of future questions and the ease of modeling the world and answering them.

These non-myopic policies can apply to counterfactual oracles as well as those that predict the real world. If a counterfactual oracle has a good model of the real world, then even though it is predicting counterfactual worlds it benefits from distorting those predictions to manipulate the real world into giving it a more preferred stream of questions. 

Myopic Policies

Myopic policies are generally considered much safer, as they rule out deceptive agents taking suboptimal actions in an episode for future gains. This prevents attempts to manipulate the world into any particular state for the purpose of affecting future queries, limiting the danger to within-episode manipulation where an answer increases its own likelihood.

A consequence-blind oracle is the straightforwardly safest implementation, and necessarily incorporates myopia. The prediction is not optimized, overtly or subtly, to account for its influence on the world in any way. This includes manipulating the world for the purposes of affecting future questions or the outcomes of past predictions.

A myopic but consequence-aware oracle will not overtly try to influence the state of the world beyond the answer to the question it has been asked, but this is still enough to add an incentive for manipulation and potentially create multiple fixed points. One concern is that even for a normally myopic agent, considerations about raising future prediction accuracy will determine which fixed point is selected. This is certainly dangerous, though it is not clear how a process could select a myopic agent but still leave future effects as a tiebreaker. It seems much more likely that the output would either be the first fixed point encountered, or if the oracle is aware of multiple, then the fixed point that results in the highest expected predictive accuracy. Other methods of selection are not ruled out, and it may be possible to train a model to select fixed points with more desirable properties if they arise. Myopic fixed point selection is poorly understood, and in need of further research to determine if it is fully safe.

Even without knowing what fixed points are likely to arise, we can conjecture about their properties. Since the way an oracle influences the world is through the reactions to its answers, fixed points are likely to appear good to the people receiving the prediction. If the model predicts those people will take some action and that it will lead to a bad outcome for them, they will simply not take that action. This isn’t perfect, though: it may not be clear that their action causes the bad outcome, or it may be individually rational to take the action in anticipation of the outcome, such as selling stock when a bear market is predicted.

Normally, a danger of selecting for futures that appear good is that with optimization pressure this comes at the expense of actually being good. However, in this case appearing good is not fully optimized, just satisficed to the point that those who hear the prediction go along with it. If there is a limit to how strongly humans can act along with a prediction, even in the face of increasingly good-appearing outcomes, then there is no incentive to optimize for appearing good.

Foresight vs. Hindsight

An oracle could either try to make predictions that maximize expected accuracy at the time of the prediction, or predictions that maximize accuracy in hindsight once the prediction resolves (or some combination of the two). The difference is that if after making a prediction, the oracle gets an opportunity to influence the outcome, the latter policy would try to influence the world to make its earlier prediction come true. In the training data, there are no such opportunities, so both types would perform equally well. It is obviously safer to minimize an oracle’s incentive to influence the world, but less clear how to select for that.

If an oracle is optimizing for score at the time of prediction, it is not even necessarily trying to maximize the expected score of some scoring rule. Instead, a policy that performs equally well as maximizing expected score on the training data is “truthfully report estimated probabilities of each outcome”. Under such a policy, there is no preference for some fixed points over others. For example, if there are two fixed points where reporting “80% chance of A, 20% chance of B” leads to those probabilities and “30% chance of A, 70% chance of B” leads to those probabilities, an oracle maximizing expected score would output the former, while an oracle truthfully reporting its estimates has no preference and would use another tiebreaker.

There is some reason to believe that truthfully reporting probability estimates would be the learned policy, rather than maximizing score. Truthful reporting has a simplicity advantage: both policies need to calculate estimated probabilities, but maximizing a score requires further calculation to turn those estimates into what to actually report. Under a proper scoring rule, such as the log probability of the correct result, honest reporting already maximizes expected score in training, so these extra calculations are superfluous.
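A quick numeric check of the 80/20 vs. 30/70 example above, using the log scoring rule mentioned in the previous paragraph (the fixed-point probabilities come from the example; the rest is just arithmetic):

```python
import math

def expected_log_score(report_and_outcome_prob: float) -> float:
    """Expected log score at a fixed point, where the reported probability p
    of outcome A is also the probability with which A actually occurs."""
    p = report_and_outcome_prob
    return p * math.log(p) + (1 - p) * math.log(1 - p)

# Fixed points from the example: reporting 80/20 makes A happen 80% of the
# time, reporting 30/70 makes A happen 30% of the time.
print(expected_log_score(0.8))  # ~ -0.50
print(expected_log_score(0.3))  # ~ -0.61

# A score-maximizing oracle would output the 80/20 fixed point (higher expected
# score, lower entropy); an oracle that just reports its estimate truthfully is
# accurate at either fixed point and has no preference between them.
```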

Causal vs. Acausal Oracles

The final major distinction between oracles is whether their decision theory is causal or acausal. A causal oracle takes into account its own influence on the world, but ignores the logical connection between its actions and those of copies of itself, future instances of itself, and similar models. That is, it identifies itself with exactly one oracle in the world and optimizes that oracle’s predictions, but not any others. An acausal oracle accounts for all dependencies between the world and its predictions, including predictions made by copies of itself and other correlated models. Oracles can also use a decision theory that is both causal and acausal (this is often what is implied when an oracle is described as acausal), or neither.

The overlap with consequence-blindness is that the causal and acausal channels together exhaust the ways a prediction can influence the world. If an oracle’s decision theory incorporates neither, it is necessarily consequence-blind. If it is only causally consequence-aware, this is partial consequence-blindness, as it is unaware of the acausal influence its predictions have.

The issue of oracles using their predictions to manipulate the world gets worse the more freedom the model has to optimize, which makes acausal oracles more dangerous than causal ones. Both optimize over predictions, but the acausal oracles see another dimension to it, which is the evidence provided by their choice. We expect acausal models to be more capable of long-term subtle manipulation through cooperation with past and future copies of themselves, while causal models would be more likely to make overt self-fulfilling prophecies at critical moments. 

An acausal decision theory can negate the benefits of myopia. While a causal myopic oracle has no incentive to use its answer to influence future or past questions, an acausal myopic oracle does. It could choose a prediction that is suboptimal when holding future and past predictions constant, but whose choice indicates that all of its predictions are made to help one another, and thus come out ahead on net due to this past and future manipulation. This is especially likely if the oracle predicts it will be asked additional questions before the question it is initially asked resolves.

We hope that a causal decision theory will be the default outcome of the training process, as the prediction resolutions are the same regardless of the specific predictions chosen. However, a smart acausal oracle would behave just like a causal theorist on those examples, while behaving acausally when there is a benefit to doing so outside the training process.

Conclusion

Oracle AI is an attempt to make an AI system safer by only allowing it to make predictions and not take actions to directly influence the world. Unfortunately, predictions themselves can influence the world, and oracles may have an incentive to use this influence to increase their own predictive accuracy. Rather than trying to get oracles to use their influence for good, which is isomorphic to aligning a non-oracle AGI, we may wish to prevent them from using their influence in any direction by making them blind to the consequences of their predictions. 

Training an oracle to predict a counterfactual world is one way to attempt making it blind to its influence on the real world. The success of such an attempt depends on the specific counterfactual used, as well as how the oracle implements the counterfactual internally. Another approach is training a myopic oracle, which makes it blind to the influence its predictions have on each other; the influence predictions have on their own outcomes may be small enough, and positively biased enough, to be safe. Finally, an oracle could be trained to use a strictly causal decision theory, ignoring channels for acausal influence. These training outcomes are all desirable, and some combination of them may be sufficient to ensure that an Oracle AI system can be used safely.


 

Comments

Instead of having counterfactual oracles operate as though they do not and cannot exist, why not have them operate as though the predictions of oracles can never be read? Designing them in this way would also allow us to escape the following worry from your post:

"...consider a single oracle predicting under the counterfactual that it does not exist.  When it is approached with an important question, it has strong evidence that people want to ask that question to an oracle, and since it does not exist it predicts that a new counterfactual oracle will be built to be asked the question.  This process is repeated recursively, with answers propagating back down the chain until they reach a fixed point, which is then output by the original counterfactual oracle."

If the oracle is operating under the assumption that it does exist but the predictions of oracles can never be read, its evidence that people want to ask questions to an oracle will not give it reason to predict that a second oracle will be constructed (a second one wouldn't be any more useful than the first) or that the answers a hypothetical second oracle might produce would have downstream causal effects. This approach might have advantages over the alternative proposed in the post because counterfactual scenarios in which the predictions of oracles are never read are less dissimilar from the actual world than counterfactual scenarios in which oracles do not and cannot exist.

It seems to me pretty obvious: in the counterfactual world where humans don't get the answer from the first Oracle, humans say "what the heck" and build a working Oracle that gives answers. Edit: the trick is in the difference between "this Oracle doesn't give an answer" and "no Oracle gives an answer". The first scenario is the one described in this comment; the second scenario requires something like logical counterfactuals.

Hello,

The idea would be to consider a scenario in which it is something like a law of nature that the predictions of oracles can never be read, in just the same way that the authors are considering a scenario in which it is something like a law of nature that oracles do not and cannot exist.

One issue I have with a lot of these discussions is that they seem to treat non-myopia as the default for oracles, when it seems to me like oracles would obviously by default be myopic. It creates a sort of dissonance/confusion, where there is extended discussion of lots of tradeoffs between different types of decision rules, and I keep wondering "why don't you just use the default myopia?".

While I personally believe that myopia is more likely than not to arrive by default under the specified training procedure, there is no gradient pushing towards it, and as noted in the post currently no way to guarantee or test for it. Given that uncertainty, a discussion of non-myopic oracles seems worthwhile.

Additionally, a major point of this post is that myopia alone is not sufficient for safety, a myopic agent with an acausal decision theory can behave in dangerous ways to influence the world over time. Even if we were guaranteed myopia by default, it would still be necessary to discuss decision rules. 

While I personally believe that myopia is more likely than not to arrive by default under the specified training procedure, there is no gradient pushing towards it, and as noted in the post currently no way to guarantee or test for it.

I've been working on some ways to test for myopia and non-myopia (see Steering Behaviour: Testing for (Non-)Myopia in Language Models). But the main experiment is still in progress, and it only applies for a specific definition of myopia which I think not everyone is bought into yet.

Because it's really easy to get out of myopia in practice. Specifically, one proposed alignment plan, RLHF, introduces non-myopia and non-CDT-style reasoning. Suffice it to say, you will need to be very careful about not accepting alignment plans that introduce non-myopia.

Here's the study:

“Discovering Language Model Behaviors with Model-Written Evaluations” is a new Anthropic paper by Ethan Perez et al. that I (Evan Hubinger) also collaborated on. I think the results in this paper are quite interesting in terms of what they demonstrate about both RLHF (Reinforcement Learning from Human Feedback) and language models in general.

Among other things, the paper finds concrete evidence of current large language models exhibiting:

convergent instrumental goal following (e.g. actively expressing a preference not to be shut down), non-myopia (e.g. wanting to sacrifice short-term gain for long-term gain), situational awareness (e.g. awareness of being a language model), coordination (e.g. willingness to coordinate with other AIs), and non-CDT-style reasoning (e.g. one-boxing on Newcomb's problem). Note that many of these are the exact sort of things we hypothesized were necessary pre-requisites for deceptive alignment in “Risks from Learned Optimization”.

Furthermore, most of these metrics generally increase with both pre-trained model scale and number of RLHF steps. In my opinion, I think this is some of the most concrete evidence available that current models are actively becoming more agentic in potentially concerning ways with scale—and in ways that current fine-tuning techniques don't generally seem to be alleviating and sometimes seem to be actively making worse.

Interestingly, the RLHF preference model seemed to be particularly fond of the more agentic option in many of these evals, usually more so than either the pre-trained or fine-tuned language models. We think that this is because the preference model is running ahead of the fine-tuned model, and that future RLHF fine-tuned models will be better at satisfying the preferences of such preference models, the idea being that fine-tuned models tend to fit their preference models better with additional fine-tuning.[1]

Right now, it's only safe because it isn't capable enough.

Possibly we have different pictures in mind with oracle AIs. I agree that if you train a neural network to imitate the behavior of something non-myopic, then the neural network is itself unlikely to be myopic.

However, I don't see how the alternative would be useful. That is, the thing that makes it useful to imitate the behavior of something non-myopic tends to be its non-myopic agency. Creating a myopic imitation of non-myopic agency seems like a self-contradictory goal.

When I imagine oracle AI, I instead more imagine something like "you give the AI a plan and it tells you how the plan would do". Or "give the AI the current state of the world and it extrapolates the future". Rather than an imitation-learned or RLHF agent. Is that not the sort of AI others end up imagining?

Admittedly, I agree with you that a solely myopic oracle is best. I just want to warn you that this will be a lot harder than you think to prevent people suggesting solutions that break your assumptions.

Thank you! I was thinking about counterfactual Oracles for some time and totally missed the case with multiple/future counterfactual Oracles. Now I feel kinda dumb about it.

Have you considered logical counterfactuals? Something like ": for all Oracles that have utility function  accessible for humans answer is NaN"?

Some scattered thoughts on the topic: 

We shouldn't consider the problem "how to persuade a putative Oracle to stay inside the box?" solved. If we just take a powerful optimizer and tell it to optimize for the truth of a particular formula, then in addition to self-fulfilling prophecies we can get plain old instrumental convergence, where the AI gathers knowledge and computing resources to give the most correct possible answer.

I have a distaste for design decisions that impair the cognitive abilities of AI, because they are unnatural and just begging to be broken. I prefer weird utility functions to weird cognitions.  

I don't believe we considered logical counterfactuals as such, but it seems to me that those would be quite comparable to the counterfactual of replacing an oracle with a simpler system.