An AI-in-a-box success model

azsantosk

Given that the strategy of Death with Dignity is not particularly appealing to me, and that others have encouraged us to instead Do Something Else (if Alignment is Doomed), I am going to move forward and explain why, despite the well-known limitations, I am still optimistic on the old idea of creating an Oracle and keeping it in a box.

My reasoning is as follows:

The class of AGIs that may be deployed safely as an Oracle seems to be larger than the class of intent aligned or a Friendly AGIs.
It seems at least possible that building a safe Oracle will be considerably easier than solving general AGI alignment, and that we can do so in a timescale competitive with the development of similar unfriendly AI.
These Oracles are going to be extremely useful, and even if costly precautions are needed to make them safe, and it may still be possible for their operators to achieve decisive strategic advantage.

If these ideas are correct, then we may have a plan. It may be hard to execute and not at all guaranteed to succeed, but it has at least some potential to work. And plans like that are exactly the ones you should be looking for if you are losing and need to play to your outs.

Risks and general precautions

In order to understand the class of AGIs that may be deployed safely as a Oracle we need to understand first what the risks are. Here are the main ones that I am aware of:

The AI may persuade the human gatekeepers to let it out, as shown to be possible even by humans in the AI-Box Experiment.
The AI may escape by technical means, such as hacking, or by a combination of technical and social engineering, such as generating radio waves or using other "unintended output channels" to communicate with third-parties, and then using persuasion or manipulation in order to get them to do what it wants.
The AI may manipulate the human gatekeepers, secretly guiding them into world-states in which it is more likely to escape. For example, the AI may provide good-looking plans with subtle failure modes that are somehow beneficial to the AI, and that the AI can manage to avoid blame for.

Risks A and B are both serious, and both require costly countermeasures, but it seems to me that none of them actually make it impossible for us to create a secure Oracle AGI.

For example, we can:

Strictly limit the output bandwidth of the system. The less information content in the AGI's output, the less likely it is that it can persuade us.
Give AGI only as much information about specific human operators as it really has to know to perform its intended function. The more it knows about a specific human, the more tailor-made a persuasive message can be made to be. Messages capable of persuading humans in general are likely to be longer.
Perform memory erasure on the AGI by initiating a new AGI instance straight from the trained model and into new hardware often. It is unclear whether this makes it much more difficult for an AGI to output a generic persuasive message, but at least it can reset the AGI's psychological model of whoever is giving it inputs.
Run the AGI in a secure air-gapped box made carefully to prevent radio signals and other unintended output channels from leaking.

In my opinion risk C is actually the most devastating one, even though it seems to be made much less often.

For an extreme example, if we have an Oracle that would rather have another AGI rule the world than its human operators, then all its outputs will be unhelpful and manipulative, and we can expect no good from them.

Motivational control for Oracles

One proposal that I don't find particularly promising but that has been made frequently is that of building a system that is somehow self-unaware or less 'agentic'.

For several reasons I think it is unlikely that a system will not be 'agentic' if it is doing powerful optimization in production (like AlphaZero) and is trained to search for plans over large world-models. At the same time, systems that are very optimized but do not do optimization in production (like GPT-3) do not seem powerful enough to give us the kind of answers we expect from an Oracle AI.

So what I do consider promising is creating an AGI from among one of the several classes of agents that are expected to make for safer Oracles.

The first is the well-known class of myopic agents. One possibility is an agent that only cares about the rewards it gets for the current question-answer episode. These agents are expected to not engage in long-ranged persuasion and manipulation, and we can possibly make them unambitious.

I have proposed a second class, that of reversely-aligned agents. These are agents that, while not necessarily friendly, very much prefer their human operators to remain in power to the default scenario of allowing another AGI to take over. Moreover, they are not willing to take significant risks in that regard to gain power for themselves.

Such reversely-aligned AGIs are those for which the AI-in-a-box setup is going to be most helpful. Because their priority is making sure another AGI will not take over, they can be expected to provide humans operators with useful advice and technology in order to prevent that. But they are also expected to engage in persuasion and manipulation attempts if the opportunity costs for doing that are small. So a carefully-designed boxing mechanism may be critical for them to be safe.

I think myopic and reversely-aligned Oracles complement each other's shortcomings and that we should attempt to build both if possible. Next I will explain why I feel it may be realistic to do so.

How hard is it going to be to create a myopic Oracle, or a reversely-aligned one?

The goal of this section is to explain why I think creating a myopic or a reversely-aligned agent may be easier than creating a directly-intent-aligned one, which is what we usually think of when we discuss the alignment problem.

To explain this intuition I'll quickly go over some concepts related to paths for AGI development.

First I'd like to say that I subscribe fully to Steven Byrnes' view that evolution is not a good analogy for how humans will create AGI, and as a result I don't expect the AGI's core learning algorithm and reward calculation to be part of a black box.

I expect the first AGI to be much more like a steered optimized than a mesa-optimizer, doing something similar to model-based actor-critic RL with online learning, and that the difficulty of alignment will lie mostly on getting the thought assessors or value functions to assign positive valence to the right learned concepts of its world-model.

I expect direct intent alignment in such systems to be quite hard both because of the complexity of human values and because of Goodhart's Law.

However, if we have the more modest goal of creating a myopic AGI, then many of these problems go away. For example, it should be relatively easy to train the thought assessors on the ground-truth of getting the most reward on the current episode.

And if things go only a little off there, the outcomes don't necessarily become wildly different, because the boxed AGI has limited power, and the thought assessors will likely notice that the most malicious plans use up computational resources that would be better allocated to the short-term reward.

Training a myopic agent also reduces the likelihood of dangerous mesa-optimization. For if the episodes as relatively short, it seems unlikely that creating a whole new intelligent agent to help solve a hard task will be considered a promising way of maximizing the next reward.

There are many open questions, such as whether myopic training will make training less efficient, whether a dangerous curiosity drive is required in the beginning, and whether we can really turn it off safely later.^[1]

Part of this reasoning also apply to creating reverse aligned AGIs. Because we don't expect to ever let such AGIs out of the box, Goodhart's law is less of a concern. The more R-aligned an Oracle is, the more useful and safe it is going to be, with no requirement for absolute perfection.

At the same time, there are reasons to think that it will be easier to get the thought-assessors approximately right in this case. For reverse alignment, we only need the AGI to care about the expectation of some bounded property that is already nearly maximized in the "human victory" state and that is not maximized at all in the default scenarios. There are certainly many such properties, and we just need to find one that can be easily learned in the world-model and reliably identified by the thought-assessors based on ground-truth signals.

I don't have much of an idea of what such a property could be, but the fact that many different nonhuman motivations may be capable of inducing reverse alignment in an Oracle gives me a lot of hope for this approach.

Strategic usefulness of a myopic Oracle

Suppose we succeed at creating a safe Oracle AGI. That would be great! But what is there to prevent another team from developing an unsafe AGI that will destroy the world a little bit later?

Some people seem to think that, if our Oracle is myopic, then it is not going to be much useful. After all, it by definition does not care about the long-term consequences. So, the argument goes, the plans it generates will impress us but they won't work.

I'm a little skeptical of this argument. The important consideration I think is that the myopic AGI is not trying to generate an adversarial plan that is designed to both look good and to not work. If that were the case then there wouldn't be much hope (and this is the main problem with non-myopic, non-aligned Oracle AIs in my opinion). So our task is only to make sure that the easiest way to create a plan we like is to create a plan that is likely to work.

I mean, the plans it will give us will probably be far from the "best" plans. Many good plans are complicated and we won't understand them or why they'll work, so the AGI won't even try. But if we reward it appropriately for realistic plans that we can test quickly and that moves our situation forward, or new knowledge that is easy to verify, that should gradually increase our power (= our ability to impact the world towards positive states).

The way I see it not working is if we are idiots that consistently reward the AGI for giving us good-looking, wishful-thinking types of ideas, or if we decide to do something big and irreversible based on such a plan and it all goes wrong. But this looks stupid to me, I don't feel a smart enough developer team capable of creating the first AGI will act like that.

I don't think we'll be actually asking the myopic AGI for thinks like "what are pivotal things we can do to prevent x-risk for humans" or "how can we rule the world" or "how can we create an aligned AGI". These are not what good questions look like at all.

So what I think can work best is for we humans to create the long-term pivotal plans, separate them into multiple steps, and then ask specific, technical questions to the AGI that we can verify quickly, and that do not require irreversible steps.

A myopic Oracle can help us with the technical details of human-generated plans

I will provide a stupid plan just to sake of an example. In reality the plan need to be a lot better than that! But lets go.

Let us say the plan looks like that: (1) we want to make 100 billion dollars, (2) we want to know all researchers doing AGI work, (3) we want to bribe them so that they all stop. Of course step (3) is not going to work like that, but please do suspension of disbelief for a while.

Then for each sub-goal in the larger plan we can generate technical questions for which good answers provably move us closer to the sub-goal.

For instance, we may ask for an easy-to-understand ML algorithm that predicts short-term stock prices, and if it works we can know that we are making money very soon. So in this case we can get the plan, execute it and see if it provably makes a lot of money, and then you reward it after you already know the plan works.^[2]

So let us assume we can get insanely rich quickly once we develop a myopic AGI. This is by itself not pivotal, but it's a step closer. Let us go to step (2), we want to know what all potential AGI researchers are doing.

At this point if we want to we can create an intelligence/surveillance organization with budget 10x greater than the NSA. The NSA has annual budget of 10 billion, and we just made 100 billion in profits. So we are doing quite well!

And not only that, we can still ask questions to the AGI. Let us say we now ask for some zero-day exploits. This is the kind of thing NSA mathematicians do, but that the AGI will do a lot better. Again, once we have an output from the AGI, we can test it immediately to see if it works, that is, whether we can use it hack a given computer, before the AGI gets any reward.

The zero-day exploits are also not pivotal on their own. But that is another step closer, and the AGI didn't even had to do long-term world modelling to help us achieve that.

So now we are extremely rich, and we also run the biggest intelligence agency in the world, and we also know what all AGI researchers are doing (we can afford special operatives to investigate the few weirdos that appear to be doing AGI research disconnected from the internet). And the total AGI output has been quite limited so far.

The only thing missing now is step (3), making them stop. This may well be the hardest part and I won't go into details here, but I think you can see how far we have come. If we managed to get to this point without anyone else having developed an AGI yet, we can possibly prevent other AGI for a decade or more. At this point government-funded AGI projects are going to be the biggest concern, but these are only a few entities, and we may be able to find creative ways for dealing with them.

Of course my plan is lame! But if we can encourage our community to think about it for years, so I'm pretty sure we can find something better that might work. The important concept is that the long-term parts of the strategy should be planned by humans ourselves, and we should basically ask the myopic AGI for technology that can be verified somewhat quickly and that is expected to increase our general influence over the world.

A reversely-aligned Oracle can help us find long-term strategic plans that actually work

The major shortcoming of a myopic Oracle is that its long-term plans cannot be expected to work reliably. I have proposed above that we humans do the long-term planning instead. While I believe that may work, it is clear that there are many limitations in this approach.

This is not much of a problem for a R-aligned Oracle. Here our biggest concern is that, if reverse alignment is imperfect (as it is likely to be), we need to be very careful with persuasion and manipulation attempts.

So one possibility is for us to ask the R-aligned Oracle for help with the long-term strategic plan. We need to operate in a low-bandwidth regime due to persuasion risk, but because we are only interested in high-level ideas, that may be possible.

At the same time, we can borrow ideas from the AI-safety-by-debate to get a myopic Oracle to criticize the plan. Because the myopic Oracle is a little bit less likely to be trying to persuade us to leave the box, we can let its output be a little longer. And if we are careful to reward it accordingly for finding complex objections to ideas we would otherwise accept, this may help us mitigate most of the manipulation risks associated with following the advice of an R-aligned AGI.

^{^}
There are also many decision-theoretic questions regarding myopia. I don't really know well what to think of them. It may be easy for an Oracle to simulate a past version of itself, but I don't see how any present reward can be gained by doing that. As a result, in my mental model I expect the Oracle at any timestamp to not give much weight to the possibility of such acausal trades happening (I am still confused by that, so please correct me if I am wrong here).
^{^}
My background is in high-frequency trading, so I can tell from experience that there are short-term trading strategies that generate statistically-significant profits, p<0.01, within a single day, e.g. by making 1000 trades in 50 different markets and trades lasting a couple minutes each. In this example it is totally possible that out of 1000 trades 550+ turn out profitable which is very statistically significant already.

[-]Evan R. Murphy4y10

Interesting post. I agree that a myopic oracle is a promising idea. In particular if it were used to accelerate AGI alignment research. (It could be dangerous if used in many other ways.)

> The major shortcoming of a myopic Oracle is that its long-term plans cannot be expected to work reliably.

I don't think this is necessarily true. It depends on the sort of myopia which the oracle is imbued with. If it has a myopia where it is incapable of thinking about the long-term future, then this is correct.

But it could have a different kind of myopia where it is very capable of thinking about the long-term future, but its goal is still to maximize its short term reward. In this case, it could be incentivized to go to the trouble of creating a good long-term plan if that meant it was more likely to get the short-term reward.

If we had such a nice form of myopia, the best formats I've seen so far for creating such an oracle with it are debate (which you referenced) and market making. Market making is particularly interesting because it is compatible with per-step myopia, which seems like it will be easier to enforce than per-episode myopia (what debate seems to require).

I don't think there is any known way to verify/enforce any of these kinds of myopia yet though. I still need to read your post about reverse-intent alignment to better understand that idea.

LESSWRONG
LW