From my current understanding of Paul's IDA approach, I think there are two different senses in which corrigibility can be thought about with regard to IDA, each with a different level of guarantee.
1. On average, the reward function incentivizes behaviour which competes effectively and gives the user effective control.
2. There do not exist inputs on which the policy chooses an action because it is bad, or the value function outputs a high reward because the prior behaviour was bad. (Or else the policy on its own will generate bad consequences.)
3. The reward function never gives a behaviour a higher reward because it is bad. (Or else the test-time optimisation by MCTS can generate bad behaviour.) For example, if the AI deludes the human operator so that the operator can’t interfere with the AI’s behaviour, that behaviour can’t receive a higher reward even if it ultimately allows the AI to make more money.
Property 1 deals with "consequence corrigibility" (competence at producing actions that will produce outcomes in the world we would describe as corrigible).
Properties 2 & 3 deal with corrigibility in terms of "intent corrigibility"...
It appears that I seriously misunderstood what you mean by corrigibility when I wrote this post. But in my defense, in your corrigibility post you wrote, "We say an agent is corrigible (article on Arbital) if it has these properties," and the list includes helping you "Make better decisions and clarify my preferences" and "Acquire resources and remain in effective control of them"; to me these seem to require at least near-human-level ability to model the user and detect ambiguities. And others seem to have gotten the same impression from you. Did your conception of corrigibility change at some point, or did I just misunderstand what you wrote there?
Since this post probably gave even more people the wrong impression, I should perhaps write a correction, but I'm not sure how. How should I fill in this blank? "The way I interpreted Paul's notion of corrigibility in this post is wrong. It actually means ___."
Increases in competence as it is amplified, including competence at tasks like “model the user,” “detect ambiguities” or “make reasonable tradeoffs about VOI vs. safety”
Is there a way to resolve our disagreement/uncertainty about this, short of building such an AI and seeing what happens? (I'm imagining that it would take quite a lot of amplification before we can see clear results in these areas, so it's not something that can be done via a project like Ought?)
I think your post is (a) a reasonable response to corrigibility as outlined in my public writing, (b) a reasonable but not decisive objection to my current best guess about how amplification could work. In particular, I don't think anything you've written is too badly misleading.
In the corrigibility post, when I said "AI systems which help me do X" I meant something like "AI systems which help me do X to the best of their abilities," rather than having in mind some particular threshold for helpfulness at which an AI is declared corrigible (similarly, I'd say an AI is aligned if it's helping me achieve my goals to the best of its abilities, rather than fixing a certain level of helpfulness at which I'd call it aligned). I think that post was unclear, and my thinking has become a lot sharper since then, but the whole situation is still pretty muddy.
Even that's not exactly right, and I don't have a simple definition. I do have a lot of intuitions about why there might be a precise definition, but those are even harder to pin down.
(I'm generally conflicted about how much to try to communicate publicly about early stages of my...
I thought more about my own uncertainty about corrigibility, and I've fleshed out some intuitions on it. I'm intentionally keeping this a high-level sketch, because this whole framing might not make sense, and even if it does, I only want to expound on the portions that seem most objectionable.
Suppose we have an agent A optimizing for some values V. I'll call an AI system S high-impact calibrated with respect to A if, when A would consider an action "high-impact" with respect to V, S will correctly classify it as high-impact with probability at least 1-ɛ, for some small ɛ.
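To pin down what this definition is asking for, here's a minimal Python sketch of an empirical check of the property on a finite sample of actions. Everything here is hypothetical and purely illustrative: `agent_judges_high_impact` stands in for A's judgment with respect to V, and `system_flags_high_impact` stands in for S's classifier.

```python
from typing import Any, Callable, Sequence

def is_high_impact_calibrated(
    agent_judges_high_impact: Callable[[Any], bool],  # A's judgment: high-impact w.r.t. V?
    system_flags_high_impact: Callable[[Any], bool],  # S's classification of the action
    actions: Sequence[Any],                           # a sample of actions under consideration
    epsilon: float = 0.01,
) -> bool:
    """Among actions A considers high-impact, does S also flag them as
    high-impact with empirical frequency at least 1 - epsilon?"""
    high_impact = [a for a in actions if agent_judges_high_impact(a)]
    if not high_impact:
        return True  # vacuously satisfied on this sample
    caught = sum(1 for a in high_impact if system_flags_high_impact(a))
    return caught / len(high_impact) >= 1.0 - epsilon
```

Of course, the real difficulty is that we can't enumerate the relevant actions or query A's judgment at scale; the sketch only fixes what "calibrated" is supposed to mean here.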
My intuitions about corrigibility are as follows:
1. If you're not calibrated about high-impact, catastrophic errors can occur. (These are basically black swans, and black swans can be extremely bad.)
2. Corrigibility depends critically on high-impact calibration (when your AI is considering doing a high-impact thing, it's critical that it knows to check that action with you).
3. To learn how to be high-impact calibrated w.r.t. A, you will have to generalize properly from training examples of low/high-impact (i.e. be robust to distributional shift).
4. To robustly generalize, you...
I curated this post for these reasons:
I regret I don't have the time right now to try to summarise the specific ideas I found valuable (and what exact updates I made). Paul's subsequent post on the definition of alignment was helpful, and I...
The conclusion here seems to be that corrigibility can't be learned safely, at least not in a way that's clear to me.
1) Are you more comfortable with value learning, or do both seem unsafe at present?
2) If we had a way to deal with this particular objection (where, as I understand it, subagents are either too dumb to be sophisticatedly corrigible, or are sm...
This wasn't posted explicitly as a submission in the prize for probable problems; would it be OK for me to consider it as a submission, given the timing?
if you try to learn in large chunks, you risk corrupting the external human and then learning corrupted versions of understanding and corrigibility
Why do you think small vs large chunks is the key issue when it comes to corrupting the external human? Can you articulate the chunk size at which you believe things start to become problematic?
It may have to be learned through repeated failure. "Here's how not to do it" gets repeated a few times before you can avoid the worst professional practices.
EDIT: Please note that the way I use the word "corrigibility" in this post isn't quite how Paul uses it. See this thread for clarification.
This is mostly a reply to Paul Christiano's Universality and security amplification and assumes familiarity with that post as well as Paul's AI alignment approach in general. See also my previous comment for my understanding of what corrigibility means here and the motivation for wanting to do AI alignment through corrigibility learning instead of value learning.
Consider the translation example again as an analogy about corrigibility. Paul's alignment approach depends on humans having a notion of "corrigibility" (roughly "being helpful to the user and keeping the user in control") which is preserved by the amplification scheme. Like the information that a human uses to do translation, the details of this notion may also be stored as connection weights in the deep layers of a large neural network, so that the only way to access them is to provide inputs to the human of a form that the network was trained on. (In the case of translation, this would be sentences and associated context, while in the case of corrigibility this would be questions/tasks of a human understandable nature and context about the user's background and current situation.) This seems plausible because in order for a human's notion of corrigibility to make a difference, the human has to apply it while thinking about the meaning of a request or question and "translating" it into a series of smaller tasks.
In the language translation example, if the task of translating a sentence is broken down into smaller pieces, the system could no longer access the full knowledge the Overseer has about translation. By analogy, if the task of breaking down tasks in a corrigible way is itself broken down into smaller pieces (either for security or because the input task and associated context is so complex that a human couldn't comprehend it in the time allotted), then the system might no longer be able to access the full knowledge the Overseer has about "corrigibility".
In addition to "corrigibility" (trying to be helpful), breaking down a task also involves "understanding" (figuring out what the intended meaning of the request is) and "competence" (how to do what one is trying to do). By the same analogy, humans are likely to have introspectively inaccessible knowledge about both understanding and competence, which they can't fully apply if they are not able to consider a task as a whole.
Paul is aware of this problem, at least with regard to competence, and his proposed solution is:
How bad is this, with regard to understanding and corrigibility? Is an impoverished overseer who only learned a part of what a human knows about understanding and corrigibility still understanding/corrigible enough? I think the answer is probably no.
With regard to understanding, natural language is famously ambiguous. The fact that a sentence is ambiguous (has multiple possible meanings depending on context) is itself often far from apparent to someone with a shallow understanding of the language. (See here for a recent example on LW.) So the overseer will end up being overly literal, and misinterpreting the meaning of natural language inputs without realizing it.
With regard to corrigibility, if I try to think about what I'm doing when I'm trying to be corrigible, it seems to boil down to something like this: build a model of the user based on all available information and my prior about humans, use that model to help improve my understanding of the meaning of the request, then find a course of action that best balances between satisfying the request as given, upholding (my understanding of) the user's morals and values, and most importantly keeping the user in control. Much of this seems to depend on information (prior about humans), procedure (how to build a model of the user), and judgment (how to balance between various considerations) that are far from introspectively accessible.
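To make the shape of that procedure explicit, here is a purely illustrative Python sketch. Every callable it takes is hypothetical; each one stands in for exactly the kind of information, procedure, or judgment just listed, which is the part I'm claiming is far from introspectively accessible.

```python
from typing import Any, Callable, Sequence

def choose_corrigible_action(
    request: str,
    candidate_actions: Sequence[Any],
    available_info: Any,
    prior_about_humans: Any,
    build_user_model: Callable[[Any, Any], Any],
    interpret_request: Callable[[str, Any], Any],
    score_request_fit: Callable[[Any, Any], float],
    score_values_fit: Callable[[Any, Any], float],
    score_user_control: Callable[[Any, Any], float],
    control_weight: float = 2.0,  # "most importantly keeping the user in control"
) -> Any:
    # 1. Build a model of the user from available information and a prior about humans.
    user_model = build_user_model(available_info, prior_about_humans)
    # 2. Use that model to refine the understanding of what the request means.
    meaning = interpret_request(request, user_model)

    # 3. Balance satisfying the request as given, upholding the user's values,
    #    and (weighted most heavily) keeping the user in control.
    def score(action: Any) -> float:
        return (score_request_fit(action, meaning)
                + score_values_fit(action, user_model)
                + control_weight * score_user_control(action, user_model))

    return max(candidate_actions, key=score)
```

The scalar weights, and the idea of combining these considerations into a single score at all, are themselves assumptions made only for the sake of the sketch.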
So if we try to learn understanding and corrigibility "safely" (i.e., in small chunks), we end up with an overly literal overseer that lacks common sense understanding of language and independent judgment of what the user's wants, needs, and shoulds are and how to balance between them. However, if we amplify the overseer enough, eventually the AI will have the option of learning understanding and corrigibility from external sources rather than relying on its poor "native" abilities. As Paul explains with regard to translation:
So instead of directly trying to break down a task, the AI would first learn to understand natural language and what "being helpful" and "keeping the user in control" involve from external sources (possibly including texts, audio/video, and queries to humans), distill that into some compressed state, then use that knowledge to break down the task in a more corrigible way. But first, since the lower-level (less amplified) agents are contributing little besides the ability to execute literal-minded tasks that don't require independent judgment, it's unclear what advantages there are to doing this as an Amplified agent as opposed to using ML directly to learn these things. And second, trying to learn understanding and corrigibility from external humans has the same problem as trying to learn from the human Overseer: if you try to learn in large chunks, you risk corrupting the external human and then learning corrupted versions of understanding and corrigibility, but if you try to learn in small chunks, you won't get all the information that you need.
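As a rough sketch of the order of operations being described (with all names hypothetical): learn from external sources, distill what was learned into a compressed state, then use that state to break the task down.

```python
from typing import Any, Callable, Sequence

def break_down_with_learned_corrigibility(
    task: Any,
    external_sources: Sequence[Any],                  # e.g. texts, audio/video, queries to humans
    learn: Callable[[Any], Any],                      # extract lessons from one source
    distill: Callable[[Sequence[Any]], Any],          # compress lessons into reusable state
    decompose: Callable[[Any, Any], Sequence[Any]],   # break down the task using that state
) -> Sequence[Any]:
    lessons = [learn(src) for src in external_sources]
    knowledge = distill(lessons)       # the "compressed state"
    return decompose(task, knowledge)  # break the task down in a more corrigible way
```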
The conclusion here seems to be that corrigibility can't be learned safely, at least not in a way that's clear to me.