ARC's hope is to train AI systems with the loss function "How much do we like the actions proposed by this system?", in order to produce AI systems that take actions we like.
If human overseers know everything the model knows, then this is just RLHF. However, we are concerned about cases where the model understands something that the humans do not. In that case, we hope to use ELK to elicit the key information that will help us understand the consequences of the AI's actions, so that we can decide whether they are good.
(The same difficulty would arise if you were trying to train AI systems to evaluate actions and then searching against those evaluations, which is the case we discuss in the ELK report.)
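The setup above amounts to using human approval as the reward signal. A toy sketch of the outer loop, where the hypothetical `human_approval` function is a stand-in for the overseer's rating (real RLHF would train a reward model and update the policy, rather than just selecting the best candidate):

```python
def human_approval(action: str) -> float:
    """Stand-in for the overseer's rating of a proposed action.

    Hypothetical rule for illustration: the overseer prefers actions
    whose consequences are explained to them.
    """
    return 1.0 if "explain" in action else 0.0


def select_action(candidate_actions: list[str]) -> str:
    """Pick the candidate the overseer rates highest.

    This is search against approval; actual RLHF instead uses the
    approval signal as a loss to update the policy.
    """
    return max(candidate_actions, key=human_approval)


candidates = ["act silently", "act and explain the consequences"]
print(select_action(candidates))  # -> "act and explain the consequences"
```

The ELK concern is precisely that `human_approval` can be fooled when the model knows something the overseer does not; eliciting that latent knowledge is meant to make the rating informed.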
What happens if this finds a way to satisfy values that the human actually has, but would not have if they had been able to do ELK on their own brain? E.g., I'm pretty sure I don't want to want some things I want, and I'm worried about s-risks from the scaled version of locking in networks of conflicting things people currently truly want but truly wouldn't want to truly want. E.g., I'm pretty sure mine are milder than this, but some people truly want to hurt others, in ways the other doesn't want, in order to get ahead, and would resist any attempt to...
The ELK report has a section called "Indirect normativity: defining a utility function" that sketches out a proposal for using ELK to help align AI. Here's an excerpt:
Suppose that ELK was solved, and we could train AIs to answer unambiguous human-comprehensible questions about the consequences of their actions. How could we actually use this to guide a powerful AI’s behavior? For example, how could we use it to select amongst many possible actions that an AI could take?
The natural approach is to ask our AI “How good are the consequences of action A?” but that’s way outside the scope of “narrow” ELK as described in Appendix: narrow elicitation.
Even worse: in order to evaluate the goodness of very long-term futures, we’d need to know facts that narrow elicitation can’t even explain to us, and to understand new concepts and ideas that are currently unfamiliar. For example, determining whether an alien form of life is morally valuable might require concepts and conceptual clarity that humans don’t currently have.
We’ll suggest a very different approach:
1. I can use ELK to define a local utility function over what happens to me over the next 24 hours. More generally, I can use ELK to interrogate the history of potential versions of myself and define a utility function over who I want to delegate to—my default is to delegate to a near-future version of myself because I trust similar versions of myself, but I might also pick someone else, e.g. in cases where I am about to die or think someone else will make wiser decisions than I would.
2. Using this utility function, I can pick my “favorite” distribution over people to delegate to, from amongst those that my AI is considering. If my AI is smart enough to keep me safe, then hopefully this is a pretty good distribution.
3. The people I prefer to delegate to can then pick the people they want to delegate to, who can then pick the people they want to delegate to, etc. We can iterate this process many times, obtaining a sequence of smarter and smarter delegates.
4. This sequence of smarter and smarter delegates will gradually come to have opinions about what happens in the far future. Me-of-today can only evaluate the local consequences of actions, but me-in-the-future has grown enough to understand the key considerations involved, and can thus evaluate the global consequences of actions. Me-of-today can thus define utilities over “things I don’t yet understand” by deferring to me-in-the-future.
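The four steps above describe an iterated delegation loop: a local utility function picks the next delegate, and the process repeats. A schematic sketch, where `local_utility` and `candidate_delegates` are hypothetical stand-ins for the ELK-defined utility over delegates (step 1) and the distributions the AI is considering (step 2):

```python
def local_utility(delegate: str) -> float:
    """Stand-in for the ELK-defined utility over potential delegates,
    evaluated over a horizon the current delegator understands (step 1).
    Hypothetical scoring for illustration only."""
    return float(len(delegate))


def candidate_delegates(current: str) -> list[str]:
    """Stand-in for the set of possible successors the AI presents (step 2)."""
    return [current + "+", current + "++"]


def iterate_delegation(start: str, steps: int) -> str:
    """Steps 3-4: each delegate picks its own favorite successor,
    yielding a sequence of (hopefully) wiser and wiser delegates."""
    delegate = start
    for _ in range(steps):
        delegate = max(candidate_delegates(delegate), key=local_utility)
    return delegate


print(iterate_delegation("me", 3))  # -> "me++++++"
```

The point of the construction is that `local_utility` only ever has to make a judgment the current delegate can understand, yet the composed sequence can come to evaluate far-future consequences the original delegator could not.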
I would instead say that ELK is a component of getting good human feedback. The more ambitious the ELK (e.g., requiring the human to actually understand something correctly for it to count as reported), the more it is trying to do all the work needed to get good human feedback (which may involve a lot of subtle work), making it so that you only need a very simple wrapper around it (e.g., approval-directedness, RLHF, human-guided self-modification, or automated alignment research) to get good outcomes.
Suppose we arrive at a worst-case solution for ELK: how are we planning to use it? My initial guess was that ELK is meant to help make IDA viable, so that we may be able to use it for some automated alignment-type approach. However, that might not be right. Can someone clarify this? Thanks.