Suppose that ELK was solved, and we could train AIs to answer unambiguous human-comprehensible questions about the consequences of their actions. How could we actually use this to guide a powerful AI’s behavior? For example, how could we use it to select amongst many possible actions that an AI could take?
The natural approach is to ask our AI “How good are the consequences of action A?” but that’s way outside the scope of “narrow” ELK as described in Appendix: narrow elicitation.
Even worse: in order to evaluate the goodness of very long-term futures, we’d need to know facts that narrow elicitation can’t even explain to us, and to understand new concepts and ideas that are currently unfamiliar. For example, determining whether an alien form of life is morally valuable might require concepts and conceptual clarity that humans don’t currently have.
We’ll suggest a very different approach:
1. I can use ELK to define a local utility function over what happens to me over the next 24 hours. More generally, I can use ELK to interrogate the history of potential versions of myself and define a utility function over who I want to delegate to—my default is to delegate to a near-future version of myself because I trust similar versions of myself, but I might also pick someone else, e.g. in cases where I am about to die or think someone else will make wiser decisions than I would.
2. Using this utility function, I can pick my “favorite” distribution over people to delegate to, from amongst those that my AI is considering. If my AI is smart enough to keep me safe, then hopefully this is a pretty good distribution.
3. The people I prefer to delegate to can then pick the people they want to delegate to, who can then pick the people they want to delegate to, etc. We can iterate this process many times, obtaining a sequence of smarter and smarter delegates.
4. This sequence of smarter and smarter delegates will gradually come to have opinions about what happens in the far future. Me-of-today can only evaluate the local consequences of actions, but me-in-the-future has grown enough to understand the key considerations involved, and can thus evaluate the global consequences of actions. Me-of-today can thus define utilities over “things I don’t yet understand” by deferring to me-in-the-future.
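The four steps above can be sketched as a small loop. This is only a toy illustration under strong assumptions, not a proposed implementation: the names `propose`, `utility_of`, and `iterate_delegation` are hypothetical, the "local utility" is a stand-in for whatever evaluation ELK would actually support, and the sketch follows the most likely delegate in each round rather than sampling from the chosen distribution.

```python
from typing import Callable, Dict, List

# A distribution over delegates maps a delegate's name to its probability.
Distribution = Dict[str, float]
# Hypothetical stand-in for an ELK-backed local utility over candidate delegates.
LocalUtility = Callable[[str], float]


def expected_utility(dist: Distribution, utility: LocalUtility) -> float:
    """Expected local utility of delegating according to `dist`."""
    return sum(p * utility(delegate) for delegate, p in dist.items())


def pick_favorite(candidates: List[Distribution], utility: LocalUtility) -> Distribution:
    """Step 2: choose the favorite distribution among those the AI is considering."""
    return max(candidates, key=lambda dist: expected_utility(dist, utility))


def iterate_delegation(
    start: str,
    propose: Callable[[str], List[Distribution]],
    utility_of: Callable[[str], LocalUtility],
    rounds: int,
) -> List[str]:
    """Steps 3 and 4: each delegate picks the next one, giving a chain of delegates."""
    chain = [start]
    current = start
    for _ in range(rounds):
        candidates = propose(current)        # options the AI is considering
        utility = utility_of(current)        # step 1: current delegate's local utility
        favorite = pick_favorite(candidates, utility)
        # Simplification: follow the most likely delegate instead of sampling.
        current = max(favorite, key=favorite.get)
        chain.append(current)
    return chain


# Toy assumptions: each delegate trusts its own next-day self and nobody else.
def toy_propose(current: str) -> List[Distribution]:
    day = int(current.split("_day")[1])
    next_self = f"me_day{day + 1}"
    stranger = f"stranger_day{day + 1}"
    return [
        {next_self: 0.9, stranger: 0.1},
        {next_self: 0.2, stranger: 0.8},
    ]


def toy_utility_of(current: str) -> LocalUtility:
    return lambda delegate: 1.0 if delegate.startswith("me_") else 0.0
```

Calling `iterate_delegation("me_day0", toy_propose, toy_utility_of, 3)` walks from today's self to the day-3 self. The point of the structure is that no single step evaluates the far future: each delegate only scores its immediate successors, and long-horizon judgments are left to whoever ends up late in the chain.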
ARC's hope is to train AI systems with the loss function "How much do we like the actions proposed by this system?", in order to produce AI systems that take actions we like.
If human overseers know everything the model knows, then this is just RLHF. However, we are concerned about cases where the model understands something that the humans do not. In that case, we hope to use ELK to elicit key information that will help us understand the consequences of the AI's action, so that we can decide whether those consequences are good.
(The same difficulty would arise if you were trying to train AI systems to evaluate actions and then searching against those evaluations, which is the case we discuss in the ELK report.)
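As a rough sketch of the setup described here, assuming a hypothetical `elicit` function that stands in for ELK and a hypothetical `rate` function that stands in for the overseer's judgment, the loss for an action is just the negated rating the overseer gives after seeing the elicited answers. None of these names come from ARC's work; they are placeholders for illustration.

```python
from typing import Callable, Dict, List

# Hypothetical stand-in for ELK: (action, question) -> answer about the action's consequences.
Elicit = Callable[[str, str], str]
# Hypothetical stand-in for the overseer: (action, elicited answers) -> how much we like it.
Rate = Callable[[str, Dict[str, str]], float]


def action_loss(action: str, questions: List[str], elicit: Elicit, rate: Rate) -> float:
    """Loss is low when the overseer, informed by elicited answers, likes the action."""
    answers = {q: elicit(action, q) for q in questions}
    return -rate(action, answers)


def select_action(actions: List[str], questions: List[str], elicit: Elicit, rate: Rate) -> str:
    """Search variant: pick the candidate action with the lowest loss."""
    return min(actions, key=lambda a: action_loss(a, questions, elicit, rate))


# Toy assumptions: one question, and the overseer only cares about its answer.
QUESTIONS = ["Will the human be safe over the next 24 hours?"]


def toy_elicit(action: str, question: str) -> str:
    return "yes" if action == "careful_plan" else "no"


def toy_rate(action: str, answers: Dict[str, str]) -> float:
    return 1.0 if all(a == "yes" for a in answers.values()) else 0.0
```

With these toy definitions, `select_action(["risky_plan", "careful_plan"], QUESTIONS, toy_elicit, toy_rate)` picks `careful_plan`. The overseer never needs to predict consequences directly; the rating depends only on the elicited answers, which is where the model's extra knowledge is supposed to enter.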
I'll be happy if AI gives people time/space/safety to figure out what they want while taking actions in the world that preserve option value.
The kind of AI alignment solution we're working on isn't a substitute for people deciding how they want to reflect and develop and decide what they value. The idea is that if AI is going to be part of that process, then the timing and nature of AI involvement should be decided by people rather than by "we need to deploy this AI now in order to remain competitive and accept whatever effects that has on our values."