EDIT: Please note that the way I use the word "corrigibility" in this post isn't quite how Paul uses it. See this thread for clarification.
This is mostly a reply to Paul Christiano's Universality and security amplification and assumes familiarity with that post as well as Paul's AI alignment approach in general. See also my previous comment for my understanding of what corrigibility means here and the motivation for wanting to do AI alignment through corrigibility learning instead of value learning.
Consider the translation example again as an analogy about corrigibility. Paul's alignment approach depends on humans having a notion of "corrigibility" (roughly "being helpful to the user and keeping the user in control") which is preserved by the amplification scheme. Like the information that a human uses to do translation, the details of this notion may also be stored as connection weights in the deep layers of a large neural network, so that the only way to access them is to provide inputs to the human of a form that the network was trained on. (In the case of translation, this would be sentences and associated context, while in the case of corrigibility it would be human-understandable questions/tasks and context about the user's background and current situation.) This seems plausible because in order for a human's notion of corrigibility to make a difference, the human has to apply it while thinking about the meaning of a request or question and "translating" it into a series of smaller tasks.
In the language translation example, if the task of translating a sentence is broken down into smaller pieces, the system could no longer access the full knowledge the Overseer has about translation. By analogy, if the task of breaking down tasks in a corrigible way is itself broken down into smaller pieces (either for security or because the input task and associated context is so complex that a human couldn't comprehend it in the time allotted), then the system might no longer be able to access the full knowledge the Overseer has about "corrigibility".
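As a toy illustration of this concern, here is a minimal sketch (the names and the chunk bound are mine, purely illustrative, and not taken from either post): tacit knowledge that is only available when a human can consider a task as a whole is never invoked by a scheme that only ever shows the human small fragments.

```python
MAX_CHUNK = 20  # illustrative bound on how much one overseer query can convey


def holistic_judgment(task: str) -> str:
    """Stands in for tacit knowledge (about meaning, corrigibility, etc.)
    that a human can only apply to a task they see as a whole."""
    return f"nuanced handling of {task!r}"


def overseer_answer(fragment: str) -> str:
    """One bounded query to the human overseer, who sees only this fragment."""
    assert len(fragment) <= MAX_CHUNK
    return f"literal answer to {fragment!r}"


def amplified_solve(task: str) -> list[str]:
    if len(task) <= MAX_CHUNK:
        return [overseer_answer(task)]
    # The decomposition step is exactly where holistic_judgment(task) ought to
    # help, but the scheme never gets to evaluate it on the full task.
    mid = len(task) // 2
    return amplified_solve(task[:mid]) + amplified_solve(task[mid:])


print(amplified_solve("a long and subtly ambiguous request"))
```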
In addition to "corrigibility" (trying to be helpful), breaking down a task also involves "understanding" (figuring out what the intended meaning of the request is) and "competence" (how to do what one is trying to do). By the same analogy, humans are likely to have introspectively inaccessible knowledge about both understanding and competence, which they can't fully apply if they are not able to consider a task as a whole.
Paul is aware of this problem, at least with regard to competence, and his proposed solution is:
I propose to go on breaking tasks down anyway. This means that we will lose certain abilities as we apply amplification. [...] Effectively, this proposal replaces our original human overseer with an impoverished overseer, who is only able to respond to the billion most common queries.
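Concretely, one could picture this impoverished overseer as something like a lookup table distilled from the most common queries. The sketch below (the frequency cutoff and names are my own illustrative assumptions, not part of Paul's proposal) shows how rarer queries, which may be exactly the ones that call for deep understanding or corrigibility judgment, get degraded answers.

```python
from collections import Counter


def build_impoverished_overseer(query_log: list[str], top_n: int):
    """Keep canned answers only for the top_n most frequent queries."""
    common = [q for q, _ in Counter(query_log).most_common(top_n)]
    answers = {q: f"memorized answer to {q!r}" for q in common}

    def overseer(query: str) -> str:
        if query in answers:
            return answers[query]
        # Everything else gets a shallow fallback rather than the full
        # judgment the original human could have applied.
        return f"break {query!r} down further or decline"

    return overseer


overseer = build_impoverished_overseer(
    ["translate this sentence"] * 3 + ["is this request ambiguous?"], top_n=1
)
print(overseer("translate this sentence"))     # covered by the distilled table
print(overseer("is this request ambiguous?"))  # falls outside it
```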
How bad is this, with regard to understanding and corrigibility? Is an impoverished overseer who only learned a part of what a human knows about understanding and corrigibility still understanding/corrigible enough? I think the answer is probably no.
With regard to understanding, natural language is famously ambiguous. The fact that a sentence is ambiguous (has multiple possible meanings depending on context) is itself often far from apparent to someone with a shallow understanding of the language. (See here for a recent example on LW.) So the overseer will end up being overly literal, and misinterpreting the meaning of natural language inputs without realizing it.
With regard to corrigibility, if I try to think about what I'm doing when I'm trying to be corrigible, it seems to boil down to something like this: build a model of the user based on all available information and my prior about humans, use that model to help improve my understanding of the meaning of the request, then find a course of action that best balances between satisfying the request as given, upholding (my understanding of) the user's morals and values, and most importantly keeping the user in control. Much of this seems to depend on information (prior about humans), procedure (how to build a model of the user), and judgment (how to balance between various considerations) that are far from introspectively accessible.
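To make the shape of that procedure concrete, here is a very rough sketch. Every helper below is a placeholder, since the point above is precisely that the real information (the prior about humans), procedure (model building), and judgment (the balancing step) are not introspectively accessible.

```python
def prior_about_humans() -> dict:
    # Placeholder for a rich, mostly tacit prior over human values and needs.
    return {"values": "typical human values", "needs": "typical human needs"}


def build_user_model(prior: dict, context: dict) -> dict:
    # Placeholder for the (introspectively opaque) model-building procedure.
    return {**prior, **context}


def interpret(request: str, user_model: dict) -> str:
    # Placeholder for using the user model to disambiguate the request.
    return f"best guess at the intended meaning of {request!r}"


def corrigible_response(request: str, context: dict) -> dict:
    user_model = build_user_model(prior_about_humans(), context)
    meaning = interpret(request, user_model)
    # The judgment step: balance the request as given, the user's values,
    # and (most importantly) keeping the user in control.
    return {
        "satisfy": meaning,
        "uphold": user_model["values"],
        "keep_user_in_control": True,
    }


print(corrigible_response("tidy up my files", {"situation": "busy researcher"}))
```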
So if we try to learn understanding and corrigibility "safely" (i.e., in small chunks), we end up with an overly literal overseer that lacks common sense understanding of language and independent judgment of what the user's wants, needs, and shoulds are and how to balance between them. However, if we amplify the overseer enough, eventually the AI will have the option of learning understanding and corrigibility from external sources rather than relying on its poor "native" abilities. As Paul explains with regard to translation:
This is potentially OK, as long as we learn a good policy for leveraging the information in the environment (including human expertise). This can then be distilled into a state maintained by the agent, which can be as expressive as whatever state the agent might have learned. Leveraging external facts requires making a tradeoff between the benefits and risks, so we haven’t eliminated the problem, but we’ve potentially isolated it from the problem of training our agent.
So instead of directly trying to break down a task, the AI would first learn to understand natural language and what "being helpful" and "keeping the user in control" involve from external sources (possibly including texts, audio/video, and queries to humans), distill that into some compressed state, and then use that knowledge to break down the task in a more corrigible way. But first, since the lower-level (less amplified) agents contribute little besides the ability to carry out literal-minded tasks that don't require independent judgment, it's unclear what advantage doing this as an Amplified agent has over using ML directly to learn these things. And second, trying to learn understanding and corrigibility from external humans runs into the same problem as trying to learn them from the human Overseer: if you learn in large chunks, you risk corrupting the external human and then learning corrupted versions of understanding and corrigibility, but if you learn in small chunks, you won't get all the information that you need.
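Schematically, that alternative path looks something like the pipeline below (the names and the chunk_size parameter are illustrative; the parameter stands for the large-chunk/small-chunk tradeoff just described, not for anything either post specifies).

```python
def learn_from_external_sources(sources: list[str], chunk_size: int) -> list[str]:
    # Query external sources in bounded chunks: large chunks risk corruption,
    # small chunks lose information (the same dilemma as with the Overseer).
    lessons = []
    for source in sources:
        for start in range(0, len(source), chunk_size):
            lessons.append(source[start:start + chunk_size])
    return lessons


def distill(lessons: list[str]) -> str:
    # Placeholder for compressing what was learned into agent state.
    return " | ".join(lessons)


def break_down_task(task: str, distilled_state: str) -> list[str]:
    # Use the distilled knowledge when decomposing the task.
    return [f"subtask of {task!r} (informed by {len(distilled_state)} chars of state)"]


state = distill(learn_from_external_sources(
    ["texts and videos about being helpful", "answers from human experts"],
    chunk_size=16,
))
print(break_down_task("book my travel", state))
```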
The conclusion here seems to be that corrigibility can't be learned safely, at least not in a way that's clear to me.
That clarifies things a bit, but I'm not sure how to draw a line between what counts as aligned de dicto and what doesn't, or how to quantify it. Suppose I design an AI that uses a hand-coded algorithm to infer what the user wants and to optimize for that, and it generally works well but fails to infer that I disvalue mindcrimes. (For people who might be following this but not know what "mindcrimes" are, see section 3 of this post.) This seems analogous to IDA failing to infer that the user disvalues mindcrimes, so you'd count it as aligned? But there's a great (multi-dimensional) range of possible errors, and it seems like there must be some types or severities of value-learning errors where you'd no longer consider the AI to be “trying to do what I want it to do”, but I don't know what those are.
Can you propose a more formal definition, maybe something along the lines of "If in the limit of infinite computing power, this AI would achieve X% of the maximum physically feasible value of the universe, then we can call it X% Aligned"?
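To make the shape of that suggestion concrete, here is one way it could be written down, purely as a sketch (U is some fixed measure of value over outcomes, C is the computing power given to the AI A, and π ranges over physically feasible policies; all of these symbols are mine, introduced for illustration):

```latex
\[
  \text{Alignment}(A) \;=\;
  \lim_{C \to \infty}
  \frac{\mathbb{E}\big[\, U(\text{outcome of running } A \text{ with compute } C) \,\big]}
       {\max_{\pi \,\text{physically feasible}} \mathbb{E}\big[\, U(\text{outcome of } \pi) \,\big]}
  \times 100\%
\]
```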
Not sure how motivated you are to continue this line of discussion, so I'll mention that uncertainty/confusion about a concept/term as central as "alignment" seems really bad. For example, if you say "I think my approach can achieve AI alignment" and you mean one thing but the reader thinks you mean another, that might lead to serious policy errors. Similarly, if you hold a contest on "AI alignment" and a participant misinterprets what you mean and submits something that doesn't qualify as being on topic, that's likely to cause no small amount of frustration.
I don't have a more formal definition. Do you think that you or someone else has a useful formal definition we could use? I would be happy to adopt a more formal definition if it doesn't have serious problems.
Or: are there some kinds of statements that you think shouldn't be made without a more precise definition? Is there an alternative way to describe a vague area of research that I'm interested in, that isn't subject to the same criticism? Do you think I typically use "alignment"...