Paul put out a call for probable problems with his agenda, which prompted this post detailing my confusions with it. This isn’t confidently asserting that the plan is doomed, and it’s not obvious to me that any of what follows is a novel objection. But it seems like it’s worth broadcasting the quiet voices of doubt when they appear, and worth trying to be an independent source of judgment. I’ll start by talking about the proposal of Paul’s that I think I understand best, the approval-directed agent, and then talk about IDA, and then finish by pointing at what I suspect is a major underlying crux.
Approval-Directed Agents
While many approaches to alignment view the agent as having some goal (which is only part of what we want it to do) and then restrictions (which prevent it from destroying the other things we want), approval-directed agents try to point directly at the goal of doing the thing that we want it to do as part of a broader ecosystem. If I say “Buy me a plane ticket for my trip to Austin,” the agent decides how many computational resources to allocate to the task based on what it thinks I would approve of, then spends those resources based on what it thinks I would approve of, and so on; eventually I trust the conclusion the system comes to, and the tradeoffs it made along the way, because my judgment has been approximated at every step.
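To make the shape of this concrete, here is a minimal toy sketch of what such a decision loop might look like. Everything in it (the `predict_approval` stub, the candidate options, the scores) is a hypothetical stand-in of mine rather than Paul's specification; the point is only that every choice, including the meta-choice of how much effort to spend, routes through an estimate of the overseer's approval.

```python
# Toy sketch of an approval-directed decision loop. `predict_approval` is a
# hypothetical stand-in for a learned model of the overseer's approval.

def predict_approval(context, option):
    # Toy scores: pretend the overseer prefers cheap, direct, polite options.
    toy_scores = {
        "buy_cheapest_direct_flight": 0.9,
        "buy_refundable_flight": 0.7,
        "angrily_tweet_at_airline_for_discount": 0.0,
    }
    return toy_scores.get(option, 0.5)

def approval_directed_choice(context, options):
    # Every decision, including meta-decisions like how much search to do,
    # is made by asking which option the approximated overseer approves of most.
    return max(options, key=lambda o: predict_approval(context, o))

context = "Buy me a plane ticket for my trip to Austin"
search_effort = approval_directed_choice(
    context, ["small_search", "medium_search", "large_search"]
)
action = approval_directed_choice(
    context,
    ["buy_cheapest_direct_flight", "buy_refundable_flight",
     "angrily_tweet_at_airline_for_discount"],
)
print(search_effort, action)
```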
The core insight here seems to be that the question “what would the architect approve of this subsystem doing?” remains a well-formed query for all subsystems, and we could in theory train all of them separately using standard machine learning techniques, or train them jointly with some sort of parameter sharing, or so on. Any unaligned behavior--the agent deciding to angrily tweet at airline companies in order to get a discount, despite me not wanting it to do that--can be traced back to a training failure of some subsystem, and a training failure of the meta-system that decides how to explore. It seems to me like the overall number of mistakes depends most on the training of that meta-system, and if it gets well-trained during the small-stakes high-involvement education period, then the system can learn new domains without many (or, potentially, any) catastrophic failures.
It seems like a core benefit of this sort of approach is that it replaces the motivation system (take the action that argmaxes some score) with an approval-seeking system that can more easily learn not to do things the overseer doesn’t approve of--like searching for ways to circumvent the overseer’s guidance--but it’s not clear how much this actually buys you. Approval-maximization is still taking the action that argmaxes some score; the difference is that approval-direction doesn’t attempt to argmax (because argmaxing is disapproved of). But without pointing explicitly at the math it will use instead, it’s not clear that this won’t reduce to something like argmax, with the same sort of problems.
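To illustrate what “the math it will use instead” could even be, here are two selection rules (both hypothetical, picked by me for illustration) that an approval-directed system might use in place of a hard argmax. The worry is that whichever rule is chosen either quietly reduces to argmax or leaves the system's behavior underspecified.

```python
import random

def argmax_rule(options, approval):
    # The thing approval-direction is trying not to be.
    return max(options, key=approval)

def satisficing_rule(options, approval, threshold=0.8):
    # Take the first option whose estimated approval clears a threshold,
    # falling back to argmax if nothing does -- which is part of the worry.
    good_enough = [o for o in options if approval(o) >= threshold]
    return good_enough[0] if good_enough else argmax_rule(options, approval)

def sampling_rule(options, approval):
    # Sample options roughly in proportion to estimated approval.
    weights = [max(approval(o), 0.0) + 1e-9 for o in options]
    return random.choices(options, weights=weights, k=1)[0]

options = ["polite_purchase", "aggressive_tweeting"]
approval = {"polite_purchase": 0.9, "aggressive_tweeting": 0.1}.get
print(argmax_rule(options, approval),
      satisficing_rule(options, approval),
      sampling_rule(options, approval))
```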
It seems like another core benefit of this sort of approach is that it makes it much easier to handle daemons (that is, unaligned consequentialist optimization processes that arise from a subsystem that is trying to achieve some subgoal). The agent spins up a web crawler to figure out what prices for plane tickets are available, but this crawler is also presumably directed by approval, because the overseer wouldn’t approve of creating an unaligned daemon. Since any approach is going to need some way to extend this oversight downstream, it’s worth investigating whether a system that just takes oversight at every level as the primary goal might work.
What about this seems wrong? Basically, consciences (that is, the human cost function over particular actions) seem like big things (in part because they depend on subtle or distant considerations), and this approach requires that the conscience be subdivided into many functional pieces, even though the correct low-level decision will often depend on high-level considerations. If all decisions route through calls to the human-sized conscience, the system won’t be competitive with other systems; if all decisions route through approximations to the conscience, then it seems like there will predictably be errors, because those approximations will miss subtle or distant considerations.
Crucially, this approach trusts the human operator (or the approximated human operator serving as the conscience) to foresee the consequences of small low-level changes. The agent comes to me and says “hey, I want to make a change to my candidate-consideration process in order to serve you better, and I expect you’ll approve of the change but think it’s important enough to check. Should I make it? [Y/N]”, and this may push me into a regime where I shouldn’t trust my judgment but mistakenly do. (The solution of never approving such upgrades--which leads to the system no longer considering them--seems more robust, but doesn’t allow the system to be competitive with other approaches or to do the sort of amplification that Paul talks about elsewhere.)
IDA
It seems to me like the core consideration with IDA is similar. To briefly restate IDA: one starts with a human overseer H, who trains an emulator A, which serves as an assistant. The first emulator A[0] makes the H + A[0] system more powerful than H alone, so it can train a new emulator A[1], and this process repeats to keep improving the emulator. This seems to boil down to “make a small known-good step, repeat, and eventually end up far from where you started in a known-good place by induction.” But it seems highly unlikely to me that small steps can be known-good in the relevant way, and similarly unlikely that this process allows for recovery from mistakes. For example, an assistant that learns to hide distressing information from the overseer early on (by generalizing from the overseer’s judgment of the different reports it produces, say) and is then distilled into a subroutine of all future agents seems like it may permanently corrupt the IDA agent by installing a metacognitive blind spot.
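As a structural sketch (my own schematic, not Paul's actual training procedure), the loop looks roughly like the following; note that whatever A[t-1] has learned, including any blind spots, is part of what gets distilled into A[t].

```python
# Schematic toy of the amplify-then-distill loop. `overseer`, `amplify`, and
# `distill` are hypothetical stand-ins that only show the structure.

def overseer(question):
    # Stand-in for the human H answering a question unaided.
    return f"H's answer to {question!r}"

def amplify(overseer_fn, assistant_fn):
    # H answers questions while being allowed to consult the current assistant.
    def amplified(question):
        sub_answer = assistant_fn(f"a subquestion of {question!r}")
        return f"{overseer_fn(question)} [informed by: {sub_answer}]"
    return amplified

def distill(target_fn):
    # Stand-in for training a fast model to imitate `target_fn`; in a real
    # system this step is lossy, which is where blind spots can get baked in.
    return lambda question: target_fn(question)

A = distill(overseer)                   # A[0] imitates H alone
for t in range(1, 4):
    A = distill(amplify(overseer, A))   # A[t] imitates H + A[t-1]
```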
Stated another way, the IDA algorithm doesn’t seem to be doing any metacognitive work of the sort that I would expect to be necessary to become more aligned over time. Instead, the goal seems to be simply to preserve the existing level of alignment, without guarantees that it will be particularly good even at that task. Perhaps distillation indirectly achieves this goal, but to me it seems more like lossy compression. (There is a regularization argument in favor of lossy compression of networks, but I don’t know that it extends up the ladder of abstraction high enough to apply to the things I’m thinking about, rather than just to sample noise.)
One could perhaps argue that this metacognitive work is an admissible problem for IDA. The task the overseer is attempting to accomplish could simply be “check that the system A[t] is aligned,” and it does this by farming out subqueries to its subagents, inspecting how the system A[t] is put together, and doing the reasoning steps necessary to eventually deliver a verdict. This seems unsatisfying to me because it doesn’t give any guidance on how the overseer might approach such a problem (it requires that we already know how), and it seems like it still falls prey to the same sort of corrupted-subagent problems.
Alignment by Induction
My pessimism about these approaches seems to be highly related to whether or not ‘alignment by induction’ is a reasonable property to expect. A crux here is something like “is there a broad basin of corrigibility?”, which I think of in terms of stable vs. unstable equilibria. I am pretty confident that Paul agrees that corrigibility is a narrow target in some dimensions, and a broad target in other dimensions, and so part of the question is a quantitative sense of “for training-relevant dimensions, is corrigibility more broad or narrow?”. It seems like Paul is arguing that an agent that is partly corrigible will want to make a successor that is more corrigible, and my suspicion is that an agent that is partly corrigible will want to make a successor that is less corrigible. That is, corrigibility seems unstable rather than stable to me. My intuition here seems to be based on something like rules-lawyering or nearest unblocked strategy, where a system that does not have the ‘true spirit’ of corrigibility but is under some limitations that allow human operators to modify it will not help them point the system towards the true spirit of corrigibility, because it doesn’t possess that as a cognitive object or optimization target to point towards. (Otherwise it would be corrigible already!)
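To make the stable-vs-unstable framing concrete, here is a toy model of mine (not anything Paul has endorsed): treat corrigibility as a number c in [0, 1], and suppose each agent builds a successor whose corrigibility is f(c). The broad-basin view says full corrigibility is an attracting fixed point; my suspicion is that it behaves more like a repelling one, so any shortfall compounds.

```python
# Toy dynamics for "how corrigible is the successor, given the predecessor?".
# Both update rules are illustrative guesses, not derived from anything.

def broad_basin(c):
    # Attracting fixed point at c = 1: partial corrigibility self-repairs.
    return c + 0.5 * c * (1 - c)

def unstable(c):
    # Repelling fixed point at c = 1: any shortfall compounds over successors.
    return c ** 2

for f in (broad_basin, unstable):
    c = 0.9  # start with a "mostly corrigible" agent
    for _ in range(10):
        c = f(c)
    print(f.__name__, round(c, 3))
# broad_basin converges toward 1.0; unstable collapses toward 0.0
```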
As an elaboration, my view is that corrigibility is something like the combination of several concepts, like “the map is not the territory” and “causes flow downstream” and “do what I mean, not what I say,” such that the agent views its confusion about values as an object it can interact with, in the same way a human might interact with their confusion about morality or about the desires of their overseers, as opposed to a probability distribution to marginalize over or the like. An agent that has some of these concepts but not others still seems vulnerable to the problems that corrigibility is meant to address.
Another contributing intuition: is it better or worse to pass more tests? It seems to me like it’s only mildly better, and likely worse, to pass more tests, until one hits the point of passing all tests. An agent that passes 90% of tests is probably not aligned, and is dangerous if your test coverage is insufficient. Especially so since the tests you know how to write and the tests you know how to pass are likely correlated! Likewise, a partly corrigible agent (which is suspected to maybe be fully corrigible) seems much more dangerous than an agent that is known to be incorrigible (and handled appropriately, ideally by not running it).
One question I think about a lot--and potentially one Paul has useful thoughts on--is the minimum size for an aligned agent. It seems like Paul is focusing on weak definitions of alignment (see here and the resulting discussion tree), where potentially very small agents could be aligned; the goal is something closer to "don't be willfully disobedient" than "don't cause problems." For stronger definitions of alignment--something like "an agent that you can trust with more power than you have"--it seems not obvious that even something human-sized is sufficient. This seems cruxy for me: if I thought we could get, through lots of small induction steps rather than through large jumps of insight, to something big enough to have the metacognitive and metaphilosophical competence to be trusted with more power than humans have, then I would be much more optimistic and less confused about induction-style approaches.