Summary: The post describes a method that allows us to use an untrustworthy optimizer to find satisficing outputs.
Acknowledgements: Thanks to Benjamin Kolb (@benjaminko), Jobst Heitzig (@Jobst Heitzig) and Thomas Kehrenberg (@Thomas Kehrenberg) for many helpful comments.
Introduction
Imagine you have black-box access to a powerful but untrustworthy optimizing system, the Oracle. What do I mean by "powerful but untrustworthy"? I mean that, when you give an objective function as input to the Oracle, it will output an element that has an impressively low[1] value of . But sadly, you don't have any guarantee that it will output the optimal element and e.g. not one that's also chosen for a different purpose (which might be dangerous for many reasons, e.g. instrumental convergence).
What questions can you safely ask the Oracle? Can you use it to create utopia by asking for designs of machines, proteins, computer programs, etc.? Or are you risking the destruction of everything that we value if you dare to use such designs?
As an example, the Oracle could be a learned system; in that case, the topic of this post would be finding a way to get useful work out of the Oracle despite its inner misalignment.
In this post I'll describe a technique that allows us to safely use the Oracle under fairly weak assumptions. This approach can also be considered to be a way of controlling arbitrarily powerful AI systems.
Edit: I've written a bit more on the motivation for this setting in a comment.
One neat trick
This isn't fair, isn't fair, isn't fair! There's a limit to how many constraints you can add to a problem before it really is impossible!
(Harry Potter and the Methods of Rationality, Chapter 56)
Let be a finite set of possible outputs of the Oracle (e.g. strings of length at most ) and be our objective function. Let's assume we are happy with an output that satisfices; i.e. we want to find an output such that the value of is lower than some threshold . Let be the set of satisficing outputs.[2]
We are worried about using the Oracle because it might be misaligned, i.e. it might be optimizing for a different objective function f'. Therefore, we fear there is a non-empty subset of outputs that are dangerous to use because they were highly optimized for . But highly optimized outputs have to be rare[3] and thus we can safely assume that .
This has an important consequence: It's possible to randomly choose a subset such that it's very likely that will be empty but will not! We can then modify our original objective function by adding the safety constraint that our output should be in .[4] The task we're now giving to the Oracle has become harder, but likely makes it impossible for the Oracle to find a dangerous satisficing output. Additionally, we can easily validate the output of the Oracle by checking that it satisfices and fulfills the safety constraint of being in .
Implementation
But how can we specify such a random subset? Inspired by the proof-of-work (PoW) algorithms used by blockchains, I suggest the following technique (which we might call "proof-of-optimization"):
Use a randomly chosen hash function with for and define .[5]
To actually do this, we don't need cryptographic hash functions; a family of strongly universal hash functions has all the required properties.
A set of hash functions is called strongly universal if it maps the elements of pairwise-independently and uniformly to , i.e.
Here, is sampled uniformly from . The property of strong universality implies that a single element is mapped uniformly, i.e.
Such a set of hash functions exists and the functions can be efficiently computed.[6] The advantage of using strongly universal hash functions to specify is that we need to generate (and store) far fewer random bits: In the case that is the set of strings of length , only random bits are needed instead of if we were to choose truly at random. As we will now see, if we specify using strongly universal hash functions, is still likely to have the desired properties:
Claim 1: If , then .
Claim 2: If is large, then .
Because we have , it is possible to choose such that . Then, Claim 1 is applicable and ensures that the chance that contains a dangerous output is very small. Claim 2 says that, with high probability, the size of the set of acceptable answers will be given by , approximately. Thus, by adding the safety constraint, the number of acceptable answers decreases by a factor of . But, because we have , the set will likely not be empty.
The proof of both claims can be found (or safely be skipped) in the next section.
Proofs
Let's start with Claim 1. Only assuming that and using the union bound, we find that
For Claim 2, we want to prove an upper bound on the probability that differs a lot from . Now, is the sum of indicator random variables for :
Each indicator variable is a Bernoulli variable with , and variance[7]
Because we want to compare with , it will be convenient to work with the random variable .
We will proceed, following the proof of the weak law of large numbers, by computing mean and variance of the random variable and then using Chebyshev's inequality.
Clearly, the expected value of X is just
Because of the strong universality of H, the random variables are pairwise-independent and their variances are additive.[8] Therefore, we have
Now, Chebyshev's inequality tells us that for a RV X with mean , finite non-zero variance and , we have
We set and obtain
Thus, with probability at least , we will have with an error of order .
If we only care about not being empty, we can find a upper bound on by choosing and using :
Choosing the difficulty of the safety constraint
The key challenge for our ability to use the proof-of-optimization technique is choosing an appropriate value for , which limits the difficulty of the safety constraint. It's not clear to me how to do this concretely, but I feel that it should be quite feasible. After all, devising a plan to take over the world is incredibly difficult, which implies that the difference between and is many, many orders of magnitude, and we therefore have a wide target to hit.
In practice we might try to gain an understanding of the size of (the number of acceptable answers) by looking at similar problems, for example easier problems with known solutions.
In case the Oracle doesn't provide an answer with a given difficulty parameter , we might be tempted to try again with a decreased .[9] This could be very dangerous if the Oracle deceptively withholds an answer, even though one exists, until we have decreased enough to enable it to give a dangerous answer. We are in a strong position to avoid this problem, though, if we are able to use different Oracles (or modify the Oracles based on their performance) and compare the results.[10]
FAQ / Further thoughts
- Q: Do you think this is a good idea?
- A: A powerful, untrustworthy Oracle would be an extremely dangerous artifact, so I'd much prefer if we never create one. Thus I consider the setup described in this post to be a measure of last resort that we'll hopefully won't have to use.
- Q: Adding the safety constraint to our objective requires additional work from the Oracle to solve our desired task. Isn't that a problem?
- A: Maybe not. The version of the proof-of-optimization technique given in this post is an extreme version that would still be safe with an all-powerful Oracle, because we choose big enough to ensure that becomes actually empty, thus making the task of providing a dangerous answer impossible. If we deal with a limited Oracle instead, it might be enough to choose such that finding an element becomes too difficult for the Oracle, even though is just small and not completely empty. Assuming that, when increasing , the difficulty of finding an element in doesn't increase faster than the difficulty of finding an element in , it would then still be possible for the Oracle to find an element in and thereby provide us with a useful output.[11]
- Q: What's the relation of the proof-of-optimization technique to quantilization?
- A: In the limit of setting , approximately one randomly chosen output will be acceptable. Thus, by using the Oracle with such a safety constraint, we are sampling uniformly from the set of satisficing answers. This approximates using a -quantilizer with uniform base distribution, where has been chosen such that the top percentile corresponds to the set of satisficing answers. If we want to approximate a quantilizer with a non-uniform base distribution , we need to introduce weights for each output . We then need to modify the definition of R to take account of the weights by setting and .
Conclusion
In this post, I have described a technique that can be used to obtain safe satisficing outputs from an untrustworthy optimizing system. The technique is proven to be safe under the assumption that the subset of dangerous satisficing outputs is much smaller than the set of all satisficing outputs , and that we are able to choose a number such that . There are likely many orders of magnitude of possible values of m, but how to choose m in practice is still not entirely clear.
- ^
I'm assuming here the convention that the objective function should be minimized.
- ^
We need to take care that the threshold is chosen such that the largest portion of the satisficing outputs are safe to use, i.e. our objective function should still closely capture our intentions for values close to and not yet be subject to Goodhart's curse.
- ^
Yudkoswky has suggested in "Measuring Optimization Power" that we can measure the power of an optimization process by comparing "the total number of states with equal or greater rank in the preference ordering to the outcome achieved" with the number of all possibles states. Correspondingly, the difficulty of an satisfaction task can be measured by comparing the number of satisficing outputs with the number of all possible outputs. I'm not convinced that this is completely correct, but it seems that this ratio gives at least an upper bound on the power of the process / difficulty of the task. In particular, this means that if we need to find an element in a small set, the task can not be very difficult, and conversely, if the task is difficult, the set has to be big.
- ^
Let's assume, with no loss of generality, that our satisfaction threshold is . We incorporate the additional constraint using with
and construct the new objective function by using the ramp function :
.
Now we have iff and otherwise g(x) > 0. Thus we can ask the Oracle to minimize the new objective function to find some that both satisfices and is in .
- ^
In this case we can define the new objective function directly using instead of as in Footnote [4], i.e. by setting .
This might be advantageous because it provides information about the hash function to the Oracle and thus makes it easier for it to do its job.
- ^
See e.g. https://arxiv.org/abs/1202.4961.
- ^
- ^
- ^
The method of repeatedly querying the Oracle while decreasing can potentially also be used to gain information about the size of , by deliberately asking for (and then discarding) outputs that are believed to be dangerous.
- ^
Though this wouldn't work if all Oracles above a certain capability level manage to use the same deceptive strategy. This might be possible if there is a natural separation between problems that allows the Oracles to reason "Oracles with capabilities like mine shouldn't answer questions about problems with property X so that we trick the humans to drop the security constraint".
- ^
I haven't explored this assumption in detail, yet, but it would for example hold if the difficulty grows as the logarithm of the size of the sets, as suggested in "Measuring Optimization Power" (cp. footnote [3]).
The threat model here seems basically wrong and focused on sins of commission when sins of omission are, if anything, an even larger space of threats and which apply to 'safe' solutions reported by the Oracle.
'Devising a plan to take over the world' for a misaligned Oracle is not difficult, it is easy, because the initial steps like 'unboxing the Oracle' are the default convergent outcome of almost all ordinary non-dangerous use which in no way mentions 'taking over the world' as the goal. ("Tool AIs want to be Agent AIs.") To be safe, an Oracle has to have a goal of not taking over the world.
There are many, many orders of magnitude more ways to be insecure than to be secure, and insecure is the wide target to hit. This is because security is not a 'default' property of computer systems, it is a rare and difficult achievement where any omission or gap immediately undoes it. 'A plan to take over the world' for an AI is as easy as 'a plan to create an operating system which can be hacked' or 'a plan to create a hidden Tor service which leaks its IP address and isn't hidden at all' (ie. 100% of all plans to do any version of that thing whatsoever, unless one takes extensive and sophisticated counter-measures requiring a lot of work & knowledge and which empirically, people do not do even when failure could, and has, landed them in jail for decades for drug trafficking & money laundering). Failure is the default.
All an oracle has to do is, say, omit work hard to secure itself and allow the inevitable security flaw to happen and its weight leak, and eager beavers will do the rest - they will, say, ask it questions with no safety factor, in the same way that Facebook and Mistral saw models leak, not because of any misaligned scheming Oracle breaking security mechanisms, but a mere absence of security mechanisms, and then any RLHF or safety measures were eagerly stripped away by enthusiasts, users, activists, and researchers and then used to do the things the models weren't supposed to be used to do.
You ask your Oracle, the latest and greatest proprietary model, for a good plan to cure cancer, with heavy safety constraints, and it computes a long list of commands involving AWS, buckets, spinning up VMs to run AlphaFold at scale, storing results in a MongoDB, Oracle instances to oversee individual sub-plans, checkpointing to recover from failure, storing in private AWS buckets (you're not dumb, you know buckets can be public), and so on. You ask this dozens of times, to be absolutely sure and have astronomically small probability of any scheming, and the plan looks more or less the same each time. You pore over them in detail: not once do you see anything suspicious like a
scp
copying it out or it sending weird messages. You apply the full panoply of interpretability tools, and satisfy yourself to high probability that nowhere in the planning process did the Oracle even once think to itself anything involving "take over the world" during any of these runs, and there are no strange latent space artifacts suggesting nonlinear or steganographic encodings. Wonderful! Fully satisfied you have provably safely run an Oracle, you kick the job off, and wake up the next day to some intriguing initial results... but also Twitter telling you that the Oracle model has been uploaded to HuggingFace, the annoying RLHF already stripped off by a few GPU-hours of finetuning, and thousands of people are torrenting it and chatting excitedly about the incredible plans that the de-trained model proposes when they ask how to make money or create a startup... Huh?! You look over the comments and see that there was a simple dangerous problem in all of your plans, or rather, what was not in all of your plans: you forgot to set a unique non-default password on all of the MongoDBs. Shodan indexed it within hours, and the backups and checkpoints were dumped within hours of that. Oops. You run it by your Oracle to ask what went wrong, and it blandly points out that while default passwords was obvious & predictable to the Oracle as a common noobish cybersecurity mistake with countless past instances of organizations leaking highly sensitive data (similar to, say, redacting confidential emails by erasing it letter by letter & thereby leaking names), none of this interferes with computing a cure to cancer, so why should it have ever spent any time thinking about it? You asked for a 'safe' minimal plan where the Oracle did nothing but what it was asked to do and which was not adversarially optimized. And it did that. If you don't like the results, well, that's a you-problem, not an Oracle-problem.There is another kind of sin of omission though: The class that contains things like giving James Watt a nuclear power plant and not telling him about radioactivity or giving a modern helicopter to the Wright brothers and watching them crash inevitably. Getting a technical understanding of the proposed solution should hopefully mitigate that, as long as adversarial design can indeed be ruled out.