Thanks for writing this, Will! I think it's a good + clear explanation, and "high/low-bandwidth oversight" seems like a useful pair of labels.
I've recently found it useful to think about two kind-of-separate aspects of alignment (I think I first saw these clearly separated by Dario in an unpublished Google Doc):
1. "target": can we define what we mean by "good behavior" in a way that seems in-principle learnable, ignoring the difficulty of learning reliably / generalizing well / being secure? E.g. in RL, this would be the Bellman equation or recursive definition of the Q-function. The basic issue here is that it's super unclear what it means to "do what the human wants, but scale up capabilities far beyond the human's".
2. "hitting the target": given a target, can we learn it in a way that generalizes "well"? This problem is very close to the reliability / security problem a lot of ML folks are thinking about, though our emphasis and methods might be somewhat different. Ideally our learning method would be very reliable, but the critical thing is that we should be very unlikely to learn a policy that is powerfully optimizing for some other target (malign failure / daemon). E.g. inclusive genetic fitness is a fine target, but the learning method got humans instead -- oops.
I've largely been optimistic about IDA because it looks like a really good step forward for our understanding of problem 1 (in particular because it takes a very different angle from CIRL-like methods that try to learn some internal values-ish function by observing human actions). 2 wasn't really on my radar before (maybe because problem 1 was so open / daunting / obviously critical); now it seems like a huge deal to me, largely thanks to Paul, Wei Dai, some unpublished Dario stuff, and more recently some MIRI conversations.
Thanks for writing this, Will! I think it’s a good + clear explanation, and “high/low-bandwidth oversight” seems like a useful pair of labels.
Seconded! (I said this to William privately while helping to review his draft, and just realized that I should also say it publicly so more people will read this and get a better understanding of Paul's approach.)
About “target”/“hitting the target”, maybe we should stick with Dario's labels "training target" and "robustness"? He recently complained about people inventing new names for existing concepts, thereby causing confusion, and this seems to be a case where that's pretty easy to avoid?
On the object level, Paul seems to think that IDA can also help with a large part of "robustness" / "problem 2", namely transparency / interpretability / benign induction. See here.
BTW, I wish Dario would publish his Google Doc so more people can see and critique it. For example it doesn't seem like any MIRI people have access to it. In general I'm pretty concerned about the effect where an idea may look better than it actually is because few people have investigated it deeply yet and pointed out the potential flaws. (I was saying this about Paul's approach earlier, if you recall.) I was only recently given access to Dario's Doc myself, and I'm not sure whether it makes sense to try to discuss it within the Doc (given that few people will see the discussion) or here (where most people won't have seen the full force of Dario's arguments).
About “target”/“hitting the target”, maybe we should stick with Dario's labels "training target" and "robustness"? He recently complained about people inventing new names for existing concepts, thereby causing confusion, and this seems to be a case where that's pretty easy to avoid?
Except that "hitting the target" is much more evocative than "robustness" :/
Maybe "robustness (aka hitting the target)" would be okay to write.
I do agree. I think the main reason to stick with "robustness" or "reliability" is that that's how ML researchers refer to the problems of "my model doesn't generalize well / is subject to adversarial examples / didn't really hit the training target outside the training data", and it gives a bad impression when people rename existing problems. I'm definitely most in favor of giving a new name like "hitting the target" if we think the problem we care about differs in some substantial way (which could definitely happen going forward!).
"Target" and "hitting the target" seem to align very well with the terms "accuracy" and "precision", so much so that almost all images of accuracy and precision feature a target. Maybe we could use terms like "Value Accuracy" and "Value Precision", to be a bit more specific?
I think of "robustness" and "reliability" as aspects of "precision", but it's not the only ones. To me those two imply "improving on an already high chance of success", rather than the challenge of getting anywhere close to begin with, without specific external constraints. The reason why they apply to rockets or things with chances of failure is that they are typically discussed more specifically to all the individual parts of such items. Another case of non-robust precision may be an AI trying to understand some of the much finer details of what's considered "the human value function."
[Edit] I don't mean to proliferate the "new terms" problem, and don't intend for these specific terms to get used in the future; I'm just using them here for demonstration.
Terminology seems pretty important; blog posts and comments are good places for coming up with terms, but are there more formal ways of coming to a consensus on choosing some?
One place to start may be having someone make a list of all the currently used terms for a set of questions, then having a semi-formal system (even if it's with around 5-10 people, as long as they are the main ones) to decide on the ones to use going forward.
I'd be happy to volunteer to do a writeup or something of that sort.
One place to start may be having someone make a list of all the currently used terms for a set of questions
Even if only this step was accomplished, it seems worthwhile.
I agree re: terminology, but probably further discussion of unpublished docs should just wait until they're published.
I got the sense from Dario that he has no plans to publish the document in the foreseeable future.
OK -- if it looks like the delay will be super long, we can certainly ask him how he'd be OK w/ us circulating / attributing those ideas. In the meantime, there are pretty standard norms about unpublished work that's been shared for comments, and I think it makes sense to stick to them.
It occurs to me that a hybrid approach combining both HBO and LBO is possible. One simple thing we can do is build an HBO-IDA, and ask it to construct a "core for reasoning" for LBO, or just let the HBO-IDA agent directly act as the overseer for LBO-IDA. This potentially gets around a problem where LBO-IDA would work, except that unenhanced humans are not smart enough to build or act as the overseer in such a scheme. I'm guessing there are probably cleverer things we can do along these lines.
How does this address the security issues with HBO? Is the idea that only using the HBO system to construct a "core for reasoning" reduces the chances of failure by exposing it to fewer inputs/using it for less total time? I feel like I'm missing something...
I'd interpreted it as "using the HBO system to construct a "core for reasoning" reduces the chances of failure by exposing it to less inputs/using it for less total time", plus maybe other properties (eg. maybe we could look at and verify an LBO overseer, even if we couldn't construct it ourselves)
plus maybe other properties
That makes sense; I hadn't thought of the possibility that a security failure in the HBO tree might be acceptable in this context. OTOH, if there's an input that corrupts the HBO tree, isn't it possible that the corrupted tree could output a supposed "LBO overseer" that embeds the malicious input and corrupts us when we try to verify it? If the HBO tree is insecure, it seems like a manual process that verifies its output must be insecure as well.
One situation is: maybe an HBO tree of size 10^20 runs into a security failure with high probability, but an HBO tree of size 10^15 doesn't and is sufficient to output a good LBO overseer.
[Paul:] I believe the honest debater can quite easily win this game, and that this pretty strongly suggests that amplification will be able to classify the image.
I think this is only true for categories that the overseer (or judge in the debate game) can explicitly understand the differences between. For example instead of cats and dogs, suppose the two categories are photos of the faces of two people who look alike. These photos can be reliably recognized by humans (through subtle differences in their features or geometry) without the humans being able to understand or explain (from memory) what the differences are. (This is analogous to the translation example so is not really a new point.)
BTW, the AI safety via debate paper came out from OpenAI a few days ago (see also the blog post). It sheds some new light on Amplification and is also a very interesting idea in its own right.
BTW, the AI safety via debate paper came out from OpenAI a few days ago (see also the blog post). It sheds some new light on Amplification and is also a very interesting idea in its own right.
I've made an LW link post for it here.
First, thanks for this writeup. I've read through many of Paul's posts on the subject, but they now make a lot more sense to me.
One question: you write, `Corrigibility requires understanding of AI safety concepts. For example, breaking down the task “What action does the user want me to take?” into the two subtasks “What are the user’s values?” and “What action is best according to these values?” is not corrigible. It produces an action optimized for some approximate model of the user’s values, which could be misaligned.`
This is something I've been worried about recently. But I'm not sure what the alternative would be. The default model for an agent not explicitly asking that question seems to me like one where they try their best to optimize for their caller's values without being so explicit about it. This is assuming that their goal is to optimize their caller's values. It seems like they're trying to maximize either someone's values or some judgement that evaluates their actions against someone's values.
Is there an alternative thing they could maximize for that would be considered corrigible?
[edited]: Originally I thought the overseer was different from the human.
Is there an alternative thing they could maximize for that would be considered corrigible?
Paul answered this question in this thread.
"But I'm not sure what the alternative would be."
I'm not sure if it's what you're thinking of, but I'm thinking of “What action is best according to these values” == "maximize reward". One alternative that's worth investigating more (IMO) is imposing hard constraints.
For instance, you could have an RL agent taking actions in $(a_1, a_2) \in \mathbb{R}^2$, and impose the constraint that $a_1 + a_2 < 3$ by projection.
A recent near-term safety paper takes this approach: https://arxiv.org/abs/1801.08757
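As a minimal sketch of what "imposing a hard constraint by projection" could look like (my own toy illustration; the constraint and values are made up for demonstration and are not taken from that paper):

```python
import numpy as np

def project_action(a, c=3.0):
    """Euclidean projection of a raw action a = (a1, a2) onto the
    half-space {a : a1 + a2 <= c}, i.e. the closest point that
    satisfies the constraint."""
    w = np.ones_like(a)                   # normal vector of the constraint a1 + a2 <= c
    violation = np.dot(w, a) - c
    if violation > 0:                     # only move the action if it violates the constraint
        a = a - (violation / np.dot(w, w)) * w
    return a

raw_action = np.array([2.5, 1.5])         # violates a1 + a2 < 3
print(project_action(raw_action))         # -> [2. 1.], on the constraint boundary
```

The policy outputs an unconstrained action and the projection is applied before the action is executed, so the constraint is satisfied by construction rather than merely encouraged by the reward. (Strictly speaking this projects onto a1 + a2 <= 3; a strict inequality would need a small margin.)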
Is meta-execution HBO, LBO, or a general method that could be implemented with either? (my current credences: 60% LBO, 30% general method, 10% HBO)
I think it's a general method that is most applicable in LBO, but might still be used in HBO (e.g. an HBO overseer can read one chapter of a math textbook, but this doesn't let it construct an ontology that lets it solve complicated math problems, so instead it needs to use meta-execution to try to manipulate objects that it can't reason about directly).
I think the distinction in this paragraph is quite messy on inspection:
I think that IDA with a low bandwidth overseer is not accurately described as "AI learns to reason from humans", but rather as "Humans figure out how to reason explicitly, then the AI learns from the explicit reasoning". As Wei Dai has pointed out, amplified low bandwidth oversight will not actually end up reasoning like a human. Humans have implicit knowledge that helps them perform tasks when they see the whole task, but not all of this knowledge can be made explicit and broken into smaller pieces. Low bandwidth oversight requires that the overseer not use any of this knowledge.
First, I think that even if one were to split tasks into 15-minute increments, there would have to be a lot of explicit reasoning (especially to scale to arbitrarily high quality levels with extra resources).
Second, the issue really shouldn't be quality here. If it turns out that "implicit knowledge" produces demonstrably superior results (in any way) compared with "explicit work", and this cannot be fixed or minimized with scale, that seems like a giant hole in the general feasibility of HCH.
Third, humans already use a lot of explicit reasoning. Is a human who breaks down a math problem acting "less like a human", or sacrificing anything, compared to one who solves it intuitively? There are some kinds of "lossy" explicit reasoning, but other kinds seem to me superior to the way humans intuitively do things.
I generally expect that "explicit reasoning that humans decide is better than their own intuitive work", generally is better, according to human values.
I think you're right that it's not a binary HBO=human like, LBO=not human like, but rather that as the overseer's bandwidth goes down from an unrestricted human to HBO to LBO, our naive intuitions about what a large group of humans can do become less and less applicable, so we need to use explicit reasoning and/or develop new intuitions about what HBO-based and especially LBO-based IDA is ultimately capable of.
I generally expect that “explicit reasoning that humans decide is better than their own intuitive work”, generally is better, according to human values.
Sure, I agree with this, but in Paul's scheme you may be forced to use explicit reasoning even if you don't think it's better than intuitive work, so I'm not sure what the relevance of this statement is.
EDIT: Also (in case you're objecting to this part) I think in the case of LBO, the explicit reasoning is at such a small granularity (see Paul's example here) that calling it "not human like" is justified and can perhaps help break people out of a possible complacency where they're thinking of IDA as roughly like a very large group of humans doing typical human reasoning/deliberation.
[Note: I probably like explicit reasoning a lot more than most people, so keep that in mind.]
"our naive intuitions about what a large group of humans can do become less and less applicable, so we need to use explicit reasoning and/or develop new intuitions about what HBO-based and especially LBO-based IDA is ultimately capable of."
One great thing about explicit reasoning is that it does seem easier to reason about / develop intuitions about than human decisions made in more intuitive ways (to me). It's different, but I assume far more predictable. I think it could be worth outlining at some point what the most obvious forms of explicit reasoning will be (I've been thinking about this).
To me it seems like there are two scenarios:
1. Group explicit reasoning cannot perform as well as longer individual intuitive reasoning, even with incredibly large budgets.
2. Group explicit reasoning outperforms other reasoning.
I believe that which of these holds is one of the main cruxes for the competitiveness and general viability of IDA. If #2 turns out not to be true, then the whole scheme has some serious problems.
But in the case of #2, it seems like things are overall in a much better place. The system may cost a lot (until it can be distilled enough), but it could be easier to reason about, easier to predict, and able to produce superior results on essentially all axes.
[Note: I probably like explicit reasoning a lot more than most people, so keep that in mind.]
Great! We need more people like you to help drive this forward. For example I think we desperately need explicit, worked out examples of meta-execution (see my request to Paul here), but Paul seems too busy to do that (the reply he gave wasn't really at the level of detail and completeness that I was hoping for), and I find it hard to motivate myself to do it because I expect that I'll get stuck at some point, and it will be hard to tell whether that's because I didn't try hard enough, don't have the requisite skills, made a wrong decision several branches up, or because it's just impossible.
One great thing about explicit reasoning is that it does seem easier to reason about / develop intuitions about than human decisions made in more intuitive ways (to me).
That's an interesting perspective. I hope you're right. :)
After some more reflection on this, I'm now more in the middle. I've come to think more and more that these tasks will be absolutely tiny; and if so, they will need insane amounts of structure. Like, the system may need to effectively create incredibly advanced theories of expected value, and many of these may be nearly incomprehensible to us humans, perhaps even with a very large amount of explanation.
I intend to spend some time writing more of how I would expect a system like this to work in practice.
When you say "better", do you mean in an empirical way that will be apparent with testing? If so, would it result in answers that are rated worse by users, or in answers that are less corrigible/safe?
My personal guess is that explicit reasoning with very large budgets can outperform reasonably-bounded (or maybe even generously-bounded) systems using intuitive approaches on enough types of problems for this approach to be well worth taking seriously.
[Background: Intended for an audience that has some familiarity with Paul Christiano’s approach to AI Alignment. Understanding Iterated Distillation and Amplification should provide sufficient background.]
[Disclaimer: When I talk about “what Paul claims”, I am only summarizing what I think he means from reading his blog and participating in discussions on his posts. I could be mistaken/misleading in these claims.]
I’ve recently updated my mental model of how Paul Christiano’s approach to AI alignment works, based on recent blog posts and discussions around them (in which I found Wei Dai’s comments particularly useful). I think that the update that I made might be easy to miss if you haven’t read the right posts/comments, so I think it’s useful to lay it out here. I cover two parts: understanding the limits on what Paul’s approach claims to accomplish, and understanding the role of the overseer in Paul’s approach. These considerations are important to understand if you’re trying to evaluate how likely this approach is to work, or trying to make technical progress on it.
What does Paul’s approach claim to accomplish?
First, it’s important to understand what “Paul’s approach to AI alignment” claims to accomplish if it were carried out. The term “approach to AI alignment” can sound like it means “recipe for building a superintelligence that safely solves all of your problems”, but this is not how Paul intends to use the term. Paul goes into this in more detail in Clarifying “AI alignment”.
A rough summary is that his approach only claims to build an aligned agent that is roughly as capable as some known unaligned machine learning algorithm.
He does not claim that the end result of his approach is, for example, an agent that is strictly more capable than that algorithm, or one that safely solves every problem we might want solved.
It’s important to understand the limits of what Paul’s approach claims in order to understand what it would accomplish, and the strategic situation that would result.
What is the Overseer?
Iterated Distillation and Amplification (IDA) describes a procedure that tries to take an overseer and produce an agent that does what the overseer would want it to do, with a reasonable amount of training overhead. Here, “what the overseer would want it to do” is defined by repeating the amplification procedure. The post describes amplification as the overseer using a number of machine-learned assistants to solve problems. We can bound what IDA could accomplish by thinking about what the overseer could do if it could delegate to a number of copies of itself to solve problems (for a human overseer, this corresponds to HCH). To understand what this approach can accomplish, it’s important to understand what the overseer is doing. I think there are two different models of the overseer that could be inferred from different parts of the discussion around Paul’s work, which I label high bandwidth oversight and low bandwidth oversight.
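As a toy illustration of the delegation structure (my own sketch, not from Paul's posts; the task and decomposition are made up for concreteness), an overseer answering a question by consulting copies of itself looks roughly like:

```python
def hch(question, overseer):
    """Toy HCH-style recursion: the overseer either answers a question
    directly or splits it into subquestions, asks copies of itself,
    and combines the sub-answers."""
    if overseer.can_answer_directly(question):
        return overseer.answer(question)
    subquestions = overseer.decompose(question)
    subanswers = [hch(q, overseer) for q in subquestions]  # delegate to copies
    return overseer.combine(subanswers)

class SummingOverseer:
    """Stand-in overseer whose only direct skill is adding a couple of
    numbers; longer sums get decomposed into smaller ones."""
    def can_answer_directly(self, question):
        return len(question) <= 2
    def answer(self, question):
        return sum(question)
    def decompose(self, question):
        mid = len(question) // 2
        return [question[:mid], question[mid:]]
    def combine(self, subanswers):
        return sum(subanswers)

print(hch([3, 1, 4, 1, 5, 9, 2, 6], SummingOverseer()))  # -> 31
```

The distillation step in IDA would then train a fast model to imitate the input-output behavior of this much slower recursive process, and the next round of amplification would let the overseer delegate to that distilled model instead of to literal copies of itself.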
High Bandwidth Overseer
The impression that I get from many of Paul's posts is the following picture of the overseer:
A high bandwidth overseer is a human that takes in an input and has some amount of time (between 15 minutes and a day) to process it. The human can look at as much of the input as it can within the allowed time, and when it delegates a sub-task it can look at the results. The main requirement for a human acting as an overseer is that they are trying to be helpful. The overseer can take in arbitrary natural-language requests and attempt to fulfill them.
The main question that comes to mind considering this model of the overseer is the problem of competent task decomposition:
Can we break down tasks into reasonably sized pieces (i.e. can you solve a math problem from a textbook in 15 minutes by delegating to a number of people who don’t understand the material and only have 15 minutes each to interact with the textbook)?
This high bandwidth oversight model can be easier to use when trying to get a rough idea of how oversight works in order to understand how IDA works as a whole.
Low Bandwidth Overseer
However, looking at Universality and Security Amplification I found a different picture of what the overseer looks like:
A low bandwidth overseer is only allowed to process inputs from a finite input set. This includes both the original input query, and the results of subqueries. The overseer can pass information about the input or from a subquery to other copies of itself without looking at them.
To give a sense for the size of the input set, in Universality and Security Amplification, Paul speculates that input sets of size 10⁸ would probably be safe and inputs sets of size 10³⁰ would probably be unsafe.
To give a sense of what this implies, it helps to translate the range of input-set sizes from 10⁸ to 10³⁰ into the length of messages the overseer is allowed to look at (a rough calculation follows below).
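As a back-of-the-envelope illustration (my own, with a made-up vocabulary size, not a figure from Paul's post): if the overseer works with a vocabulary of roughly $10^4$ words, then

$$(10^4)^{2} = 10^{8}, \qquad (10^4)^{7} = 10^{28} \approx 10^{30},$$

so the probably-safe end of the range is on the order of all two-word messages, while the probably-unsafe end is on the order of all messages of about seven words.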
A source of potential confusion is that Paul still talks about a human fulfilling the role of the low bandwidth overseer (there are reasons for this that I won’t cover here). But when the space of information the human overseer can consider is reduced to a finite set, we could simply evaluate what the human does on every element in that set and produce a lookup table that replaces the human. In other words, if you don’t think that some task could be accomplished by an amplified lookup table as overseer, then you shouldn’t think it could be accomplished with a low bandwidth overseer.
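A minimal sketch of this observation (a toy of my own, with a made-up finite query format, not anything from Paul's posts):

```python
from itertools import product

# Hypothetical finite input set: all queries of the form "compare x y"
# with single-digit x and y (100 inputs in total).
INPUTS = [f"compare {x} {y}" for x, y in product(range(10), repeat=2)]

def human_overseer(query):
    """Stand-in for a human answering a single low-bandwidth query."""
    _, x, y = query.split()
    return "first" if int(x) > int(y) else "second or equal"

# Because the input set is finite, we can evaluate the human once on every
# element and replace them with a table; amplification then only ever
# consults the table.
LOOKUP_TABLE = {q: human_overseer(q) for q in INPUTS}

def low_bandwidth_overseer(query):
    assert query in LOOKUP_TABLE, "query is outside the vetted finite input set"
    return LOOKUP_TABLE[query]

print(low_bandwidth_overseer("compare 7 3"))  # -> "first"
```

Any intelligence in the overall system then has to come from how these table lookups are composed by amplification, not from the individual overseer steps.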
The reason for introducing this restriction is to avoid security issues as the system scales. The fear is that there would be an “attack” on the system: an input that could be shown to an overseer that would cause the overseer to become corrupted and try to sabotage the system. This could be some kind of misleading philosophical argument, some form of blackmail, a human adversarial example, etc. If an input like this exists, then as soon as the first agent is corrupted, it can try to spread the attack to other agents. The first agent could be corrupted either by chance, or through an attack being included in the input. Paul’s proposal in Universality and Security Amplification is to avoid this by limiting the set of inputs that the overseer can process, and then using various techniques to verify that the overseer is not attackable on that set of inputs. I read this as implying that he thinks a high bandwidth overseer would be sufficiently vulnerable to attack that, after some number of rounds of amplification, the overseer would be likely to become misaligned.
How does a low bandwidth overseer impact the difficulty of IDA?
Previously, I was mostly starting with the model of a high bandwidth overseer when thinking about IDA. I now think that Paul’s approach is intended to be using low bandwidth oversight, and that reasoning about the difficulty of the approach requires using this model.
Using a low bandwidth overseer more clearly requires solving a set of hard problems prior to using IDA at all (for example, producing explicit models of the reasoning, such as decision theory, that a human overseer would otherwise apply implicitly):
When considering the high bandwidth oversight regime, these problems seem more like they could be (at least partially) delegated to the high bandwidth overseer at runtime. For example, one could think of preparing a summary of the relevant factors of a decision and giving it to a highly competent human overseer. The human overseer might be able to make good enough decisions without having an explicit model of decision theory (especially if odd edge cases don’t come up in practice).
Evaluating IDA
There are a number of lines of reasoning that one could use to try to get a sense of whether IDA could work, or how it should be prioritized relative to other approaches. Here, I summarize some of these considerations and indicate which versions of oversight they apply to, as I think this is important to be clear about. I don’t make claims about how these should be weighed, or claim to be exhaustive (there are additional considerations that would weigh on both forms of oversight, or other parts of IDA).
Evaluating High Bandwidth Oversight
Positive Considerations
Negative Consideration: Not Scalable and Secure
In high bandwidth oversight, if one copy of the overseer is corrupted, this corruption could easily spread. As more copies of the overseer are used to solve a problem, there is a higher chance of one copy becoming corrupted. Or an adversary could introduce a corrupting input into the system.
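As a rough illustration of the scaling concern (a standard back-of-the-envelope estimate, not a number from the post): if each invocation of the overseer independently has some small probability $p$ of being corrupted by the input it sees, then after $N$ invocations the probability that at least one copy has been corrupted is

$$1 - (1 - p)^{N} \approx Np \quad \text{(for small } p\text{),}$$

so the chance of at least one corruption grows roughly linearly with the size of the amplification tree.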
Evaluating Low Bandwidth Oversight
Positive Considerations
Negative Consideration: No longer “learning to reason from humans”
I think that IDA with a low bandwidth overseer is not accurately described as “AI learns to reason from humans”, but rather as “Humans figure out how to reason explicitly, then the AI learns from the explicit reasoning”. As Wei Dai has pointed out, amplified low bandwidth oversight will not actually end up reasoning like a human. Humans have implicit knowledge that helps them perform tasks when they see the whole task, but not all of this knowledge can be made explicit and broken into smaller pieces. Low bandwidth oversight requires that the overseer not use any of this knowledge.
Now, it’s quite possible that performance could still be recovered by doing things like searching over a solution space, or by reasoning about when it is safe to use training data from insecure humans. But these solutions could look quite different from human reasoning. In discussion on Universality and Security Amplification, Paul describes why he thinks that a low bandwidth overseer could still perform image classification, but the process looks very different from a human using their visual system to interpret the image:
“I’ve now played three rounds of the following game (inspired by Geoffrey Irving who has been thinking about debate): two debaters try to convince a judge about the contents of an image, e.g. by saying “It’s a cat because it has pointy ears.” To justify these claims, they make still simpler claims, like “The left ear is approximately separated from the background by two lines that meet at a 60 degree angle.” And so on. Ultimately if the debaters disagree about the contents of a single pixel then the judge is allowed to look at that pixel. This seems to give you a tree to reduce high-level claims about the image to low-level claims (which can be followed in reverse by amplification to classify the image). I believe the honest debater can quite easily win this game, and that this pretty strongly suggests that amplification will be able to classify the image.”
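To make the flavor of this tree-of-claims concrete, here is a toy of my own (not from Paul's comment or the debate paper), using an arithmetic claim about an image rather than a perceptual one: a high-level claim about a whole region is reduced, via the debater's subclaims, to single-pixel claims that a low-bandwidth judge can check directly.

```python
import numpy as np

def honest_subclaims(image, region):
    """The honest debater's subclaims: the exact sums of the two halves
    of the region."""
    (r0, r1), (c0, c1) = region
    if c1 - c0 > 1:                        # split along columns first
        mid = (c0 + c1) // 2
        left, right = ((r0, r1), (c0, mid)), ((r0, r1), (mid, c1))
    else:                                  # then along rows
        mid = (r0 + r1) // 2
        left, right = ((r0, mid), (c0, c1)), ((mid, r1), (c0, c1))
    sums = (image[left[0][0]:left[0][1], left[1][0]:left[1][1]].sum(),
            image[right[0][0]:right[0][1], right[1][0]:right[1][1]].sum())
    return (left, sums[0]), (right, sums[1])

def verify(image, region, claimed_sum, debater=honest_subclaims):
    """Verify the high-level claim 'this region sums to claimed_sum' by
    recursing on the debater's subclaims; at a single pixel the judge is
    allowed to look at the image directly."""
    (r0, r1), (c0, c1) = region
    if (r1 - r0) * (c1 - c0) == 1:
        return image[r0, c0] == claimed_sum        # judge checks one pixel
    (left, s_left), (right, s_right) = debater(image, region)
    if s_left + s_right != claimed_sum:            # subclaims must support the claim
        return False
    return (verify(image, left, s_left, debater) and
            verify(image, right, s_right, debater))

img = np.arange(16).reshape(4, 4)
print(verify(img, ((0, 4), (0, 4)), int(img.sum())))      # True: honest claim survives
print(verify(img, ((0, 4), (0, 4)), int(img.sum()) + 1))  # False: dishonest claim is caught
```

In this toy the honest debater can always supply subclaims that check out, while any false top-level claim must eventually disagree with some pixel, which mirrors Paul's point that the honest debater can win the game.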
Conclusion: Weighing Evidence for IDA
The important takeaway is that considering IDA requires clarifying whether you are considering IDA with high or low bandwidth oversight, and then only counting considerations that actually apply to that approach. I think there’s a way to misunderstand the approach where you mostly think about high bandwidth oversight and count the feeling that it’s somewhat understandable, that it seems plausible it could work, and that it avoids some hard problems. But if you then also count Paul’s opinion that it could work, you may be overconfident: the approach that Paul claims is most likely to work is the low bandwidth oversight approach.
Additionally, I think it’s useful to consider both models as alternative tools for understanding oversight: for example, the problems in low bandwidth oversight might be less obvious but still important to consider in the high bandwidth oversight regime.
After understanding this, I am more nervous about whether Paul’s approach would work if implemented, due to the additional complications of working with low bandwidth oversight. I am somewhat optimistic that further work (such as fleshing out how particular problems could be addressed through low bandwidth oversight) will shed light on this issue, and either make it seem more likely to succeed or yield more understanding of why it won’t succeed. I’m also still optimistic about Paul’s approach yielding ideas or insights that could be useful for designing aligned AIs in different ways.
Caveat: high bandwidth oversight could still be useful to work on
High bandwidth oversight could still be useful to work on for the following reasons: