Suppose I have a magic box Opt that takes as input a program P (say, mapping n-bit strings to real numbers), and produces Opt(P) = argmax_x P(x), with only n times the cost of a single evaluation of P. Could we use this box to build an aligned AI, or would broad access to such a box result in doom?
This capability is vaguely similar to modern ML, especially if we use Opt to search over programs. But I think we can learn something from studying simpler models.
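To make the interface concrete, here is a minimal Python sketch of what Opt is supposed to provide (the name `brute_force_opt` and the toy example are mine, not from the post): a naive implementation that evaluates P on every n-bit input, costing 2^n evaluations, whereas the magic box is stipulated to return the same kind of answer for only about n times the cost of a single evaluation.

```python
from itertools import product
from typing import Callable

def brute_force_opt(program: Callable[[str], float], n: int) -> str:
    """Naive stand-in for Opt: return an n-bit string maximizing program(x).

    This exhaustive search costs 2^n evaluations of `program`; the magic box
    Opt is assumed to produce the same result for only ~n evaluations' worth
    of compute.
    """
    best_x, best_val = None, float("-inf")
    for bits in product("01", repeat=n):
        x = "".join(bits)
        val = program(x)
        if val > best_val:
            best_x, best_val = x, val
    return best_x

# Toy example: the maximizer of "number of 1s" over 4-bit strings is "1111".
assert brute_force_opt(lambda x: x.count("1"), n=4) == "1111"
```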
An unaligned benchmark
(Related.)
I can use Opt to define a simple unaligned AI (details omitted):
- Collect data from a whole bunch of sensors, including a "reward channel."
- Use Opt to find a program M that makes good predictions about that data.
- Use Opt to find a policy A that achieves a high reward when interacting with M.
This isn't a great design, but it works as a benchmark. Can we build an aligned AI that is equally competent?
(I haven't described how Opt works for stochastic programs. The most natural definition is a bit complicated, but the details don't seem to matter much. You can just imagine that it returns an x whose expected value is within one standard deviation of the optimal expected value.)
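Schematically (my sketch, not the post's; `prediction_accuracy` and `simulated_reward` are placeholders standing in for the omitted details), the benchmark is just two calls to Opt:

```python
from typing import Any, Callable

# Opt takes an objective (a program scoring candidates) and returns a candidate
# that (approximately) maximizes it.
Objective = Callable[[Any], float]
Opt = Callable[[Objective], Any]

def unaligned_benchmark(opt: Opt,
                        prediction_accuracy: Objective,
                        simulated_reward: Callable[[Any, Any], float]) -> Any:
    # Call 1: use Opt to find a program M that makes good predictions about the
    # collected sensor data (including the reward channel).
    M = opt(prediction_accuracy)
    # Call 2: use Opt to find a policy A that achieves a high reward when
    # interacting with the learned model M.
    A = opt(lambda policy: simulated_reward(policy, M))
    return A  # the resulting (unaligned) agent
```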
Competing with the benchmark
(Related.)
If I run this system with a long time horizon and a hard-to-influence reward channel, then it may competently acquire influence in order to achieve a high reward.
We'd like to use Opt to build an AI that acquires influence just as effectively, but will use that influence to give us security and resources to reflect and grow wiser, and remain responsive to our instructions.
We'd like the aligned AI to be almost as efficient. Ideally the proportional overhead would converge to 0 as we consider more complex models. At worst the overhead should be a constant factor.
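Stated a bit more explicitly (my formalization, not the post's), with C_aligned(T) and C_benchmark(T) denoting the costs of the aligned AI and the unaligned benchmark on models of complexity T:

```latex
% Competitiveness desideratum (one possible formalization, not from the post).
\[
  \frac{C_{\mathrm{aligned}}(T)}{C_{\mathrm{benchmark}}(T)} - 1 \;\to\; 0
  \quad \text{as } T \to \infty,
  \qquad \text{or, at worst,} \qquad
  C_{\mathrm{aligned}}(T) \le k \cdot C_{\mathrm{benchmark}}(T)
  \text{ for some constant } k.
\]
```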
Possible approach
(Related.)
My hope is to use Opt to learn a policy A⁺ which can answer questions in a way that reflects "everything A knows." This requires:
- Setting up an objective that incentivizes A⁺ to give good answers to questions.
- Arguing that there exists a suitable policy A⁺ that is only slightly more complicated than A.
If we have such an A⁺, then we can use it to directly answer questions like "What's the best thing to do in this situation?" The hope is:
- Its answers can leverage everything A knows, and in particular all of A's knowledge about how to acquire influence. So using A⁺ in this way is competitive with using A directly.
- It knows enough about human preferences to be corrigible.
"Everything knows" is slippery; I mean something like "what a sufficiently-idealized Bayesian would believe after updating on the fact that achieves a high reward." Constructing an objective which incentivizes these answers probably requires understanding the nature of that update.
Thoughts on feasibility
In the context of ML, I usually imagine training A⁺ via iterated amplification. Unfortunately, iterated amplification doesn't correspond to optimizing a single objective; it requires either training a sequence of agents or exploiting properties of local search (using the previous iterate to provide oversight for the next). If we just have Opt, it's not clear whether we can efficiently do anything like iterated amplification or debate.
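A schematic of the structural difference (illustrative names only, not an implementation): iterated amplification produces a sequence of agents, each trained with oversight from an amplified version of the previous one, whereas Opt only gives us a one-shot argmax of a fixed objective.

```python
from typing import Any, Callable

def iterated_amplification(train: Callable[[Any, Any], Any],
                           amplify: Callable[[Any], Any],
                           initial_agent: Any,
                           num_rounds: int) -> Any:
    """A sequence of training problems: each round's overseer depends on the previous agent."""
    agent = initial_agent
    for _ in range(num_rounds):
        overseer = amplify(agent)        # e.g. a human assisted by copies of the current agent
        agent = train(agent, overseer)   # distill the overseer's judgments into the next agent
    return agent

def one_shot(opt: Callable[[Callable[[Any], float]], Any],
             objective: Callable[[Any], float]) -> Any:
    """All Opt gives us: the argmax of a single fixed objective."""
    return opt(objective)

# The difficulty noted above: it's not clear how to fold the amplification loop
# into one fixed `objective` suitable for a single call to Opt.
```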
If aligning Opt is impossible, I think that's slightly bad news for aligning ML. That said, it's reasonably likely that local search will be easier to align, so the next step would be constructing a simple model of local search.
There are also some ways in which the optimizer case seems easier:
- It's a simpler model and so more amenable to analysis. The Bayesian update from "A gets a high reward" is more straightforward when A is actually optimized.
- We don't have to worry about optimization difficulty.
- Given a policy, we can use Opt to directly search for an input on which it behaves a certain way (sketched below).
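To illustrate the last point, a toy sketch (using a brute-force stand-in for Opt; the policy and names are invented for the example): because the policy is just a program, we can hand the optimizer an objective that scores how badly the policy behaves on an input, and get back a problematic input directly.

```python
from itertools import product

def brute_force_opt(program, n: int) -> str:
    """Stand-in for Opt: return an n-bit string maximizing program(x)."""
    return max(("".join(bits) for bits in product("01", repeat=n)), key=program)

# Toy policy that misbehaves (returns 1) exactly on inputs starting with "101".
def toy_policy_misbehaves(x: str) -> int:
    return 1 if x.startswith("101") else 0

# Search directly for an input on which the policy misbehaves.
bad_input = brute_force_opt(toy_policy_misbehaves, n=6)
assert toy_policy_misbehaves(bad_input) == 1
```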
It's OK if it's impossible
When working on alignment I aim to either find a scalable alignment strategy or a clear argument for why scalable alignment is impossible. I'm excited about considering easy-to-analyze versions of the alignment problem even if they are impossible:
- It gives us practice making impossibility arguments, and developing relevant intuitions and techniques.
- It clarifies the difficulty of the alignment problem: if we know why we can't handle simple cases like Opt, then we learn something about what the solution must look like in more complicated cases.
- It gives us a sense of what impossibility results might look like, if we were able to prove them in more realistic cases. Would they actually be strong enough to guide action, or convince anyone skeptical?
Expensive optimization
I described Opt as requiring n times more compute than the program P being optimized. If we implemented it naively, it would instead cost 2^n times more than P.
We can use this more expensive Opt in our unaligned benchmark, which produces an AI that we can actually run (but it would be terrible, since it does a brute-force search over programs). It should be easier to compete with this really slow AI, but it's still not trivial and I think it's worth working on. If we can't compete with this benchmark, I'd feel relatively pessimistic about aligning ML.
Planned summary:
Current ML capabilities are centered around **local search**: we get a gradient (or an approximation to one, as with evolutionary algorithms), and take a step in that direction to find a new model. Iterated amplification takes advantage of this fact: rather than a sequence of gradient steps on a fixed reward, we can do a sequence of amplification steps and distillation gradient steps.
However, we can consider an even simpler model of ML capabilities: function maximization. Given a function from n-bit strings to real numbers, we model ML as allowing us to find the input n-bit string with the maximum output value, in only O(n) time (rather than the O(2^n) time that brute force search would take). If this were all we knew about ML capabilities, could we still design an aligned, competitive version of it? While this is not the actual problem we face, due to its simplicity it is more amenable to theoretical analysis, and so is worth thinking about.
We could make an unaligned AI that maximizes some explicit reward using only 2 calls to Opt: first, use Opt to find a good world model M that can predict the dynamics and reward, and then use Opt to find a policy that does well when interacting with M. This is unaligned for all the usual reasons: most obviously, it will try to seize control of the reward channel.
An aligned version does need to use Opt, since that's the only way of turning a naively-exponential search into a linear one; without using Opt the resulting system won't be competitive. We can't just generalize iterated amplification to this case, since iterated amplification relies on a _sequence_ of applications of ML capabilities: this would lead to an aligned AI that uses Opt many times, which will not be competitive since the unaligned AI only requires 2 calls to Opt.
One possible approach is to design an AI with good incentives (in the same way that iterated amplification aims to approximate HCH) that "knows everything that the unaligned AI knows". However, it would also be useful to produce a proof of impossibility: this would tell us something about what a solution must look like in more complex settings.
Planned opinion:
Amusingly, I liked this post primarily because comparing this setting to the typical setting for iterated amplification was useful for seeing the design choices and intuitions that motivated iterated amplification.