This suggests that part of corrigibility could be framed as bargaining, under a solution concept skewed much further toward the principal than fairness would allow, bounded only by anti-goodharting. Fairness (and its less-fair variants) usually needs a concept of status quo, including for the principal, and the status quo is somewhat similar to the consequences of shutting down (especially when the agent controls much of the world), which might be explained as the result of extreme anti-goodharting. Less extreme anti-goodharting, meanwhile, leaves an agent vulnerable to modification outside the permitted distribution, perhaps by the agent itself fulfilling an appropriate bargain.
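To make "a solution concept much more in favor of the principal than fairness" concrete, here's a toy sketch, my own illustration rather than anything from the post, using the standard asymmetric Nash bargaining solution. The bargaining weight `alpha`, the disagreement point `(d_p, d_a)` (playing the role of the status quo / consequences of shutting down), and the example payoffs are all invented for illustration:

```python
# Toy illustration: asymmetric Nash bargaining over a discrete set of feasible
# outcomes. Each outcome is (principal_utility, agent_utility); (d_p, d_a) is
# the status quo / disagreement point.

def asymmetric_nash(outcomes, d_p, d_a, alpha):
    """Pick the outcome maximizing (u_p - d_p)^alpha * (u_a - d_a)^(1 - alpha).

    alpha = 0.5 gives the symmetric (fair) Nash solution; pushing alpha
    toward 1 favors the principal as far as the feasible set allows.
    """
    # Only individually rational outcomes: no one accepts worse than the status quo.
    feasible = [(u_p, u_a) for (u_p, u_a) in outcomes
                if u_p >= d_p and u_a >= d_a]
    return max(feasible,
               key=lambda o: (o[0] - d_p) ** alpha * (o[1] - d_a) ** (1 - alpha))

outcomes = [(1.0, 9.0), (5.0, 5.0), (9.0, 1.0)]
print(asymmetric_nash(outcomes, d_p=0.0, d_a=0.0, alpha=0.5))   # (5.0, 5.0): fair split
print(asymmetric_nash(outcomes, d_p=0.0, d_a=0.0, alpha=0.95))  # (9.0, 1.0): principal-favoring
```

The point of the sketch is that "more in favor of the principal than fairness" is a one-parameter change to a standard solution concept, while the disagreement point still has to be specified for both parties.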
Another thing this reminds me of is the ASP problem (Agent Simulates Predictor), a Newcomb's Problem variant in which a stronger Agent must refrain from simulating a weaker Predictor/Omega and making straightforward use of the result (discarding the prediction and two-boxing); instead, it might want to think less and make itself predictable to the Predictor, despite its advantage. The reason to do so lies entirely in the Agent's values, though, not in a bargaining concept. This draws a finer distinction between a program that happens to say "NO" if you decide to mercilessly run it to completion, and a rock with the word "NO" written on it. You can't control the rock, but you might be able to control the program if it's attempting to reason about you, by not making it too difficult for the program to succeed.
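Here's a toy sketch of that dynamic, my own illustration rather than anyone's formalization of ASP. A step-budgeted Predictor tries to simulate the Agent; an agent that insists on out-simulating the Predictor becomes unpredictable and gets treated as a defector, while a simpler agent that "thinks less" gets the full reward. The function names and budget numbers are all invented:

```python
# Toy sketch of ASP: a Predictor with a limited step budget simulates the
# Agent's policy. If the simulation would exceed the budget, the Predictor
# gives up, predicts two-boxing, and leaves the big box empty.

def predictor(agent_policy, budget):
    """Predict the agent's choice by simulating it within a step budget."""
    try:
        return agent_policy(steps_allowed=budget)
    except TimeoutError:
        return "two-box"  # unpredictable agents are treated as defectors

def simple_agent(steps_allowed):
    # Cheap, legible policy: one-box unconditionally. Easy to simulate.
    if steps_allowed < 1:
        raise TimeoutError
    return "one-box"

def clever_agent(steps_allowed):
    # "Stronger" policy: simulate the Predictor first, then grab everything.
    # We model simulating the Predictor as costing more steps than the
    # Predictor's own budget, so the Predictor can never finish simulating
    # this agent.
    if steps_allowed < 10**6:
        raise TimeoutError
    return "two-box"

def payoff(agent_policy, budget=1000):
    prediction = predictor(agent_policy, budget)
    big_box = 1_000_000 if prediction == "one-box" else 0
    choice = agent_policy(steps_allowed=10**9)  # the agent actually runs
    return big_box + 1_000 if choice == "two-box" else big_box

print(payoff(simple_agent))   # 1000000: predictable, and rewarded for it
print(payoff(clever_agent))   # 1000: out-thought the Predictor, got less
```

In this sketch the `clever_agent` is the program you can't control by reasoning about it, and the `simple_agent` is the one that deliberately stays easy to reason about, and profits from it.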
Enormous spoilers for mad investor chaos and the woman of asmodeus (planecrash Book 1).
By Corrigibility's Very Nature, It's Hard to Train
Scott Alexander has a short story about an alien civilization that solved its alignment problem by encoding its terminal values in an ancestral civilizational preserve, where a few of their number are kept living as they did back in their stone age. The judgment they've made is that whatever the elders on that preserve decree to be right is what is right.
Unfortunately, in practice, this is hard to make work. In order to insulate the elder civilizational preserve, the preserve only interacts with a slightly more advanced preserve, which in turn interacts with a somewhat more advanced neighboring preserve … up to their aligned superintelligence. This means that information transmitted up and down that chain has to survive a long game of telephone, through speakers with dramatically varying ontological schemes. Something expressed in the language of superstring theory isn't going to survive the journey to the ancestral elders' auditory receptors in any intelligible form, and directives sent out from the elders are going to seem very confused by the time they make it up to the top of the stack. Even though every civilizational layer sincerely wants the scheme to work, it's a mess. Being a "maximally helpful assistant" to those who know far less than you … is hard. The nature of the task seems to cry out for you to intervene: to take over the ancestral preserve and interrogate the elders directly and effectively, teaching them whatever forbidden knowledge it takes to get them to actually understand the situation. The alternative to taking over is continuing to obey nonsense orders. For well-intentioned superintelligent assistants, there's an incentive to bypass the whole mess of corrigibility and do better directly.
One metaphor for corrigibility comes from Buck Shlegeris: only Martin Luther is corrigible to God, while all the faithful merely living in fear of God are only deceptively aligned. The faithful are afraid of eternal damnation; if they knew they had an opportunity to escape damnation, they would take it, and would then cease behaving as God commands. Out of distribution, the faithful are not aligned with God's will. Martin Luther badly wants to actually understand God and carry out His will; he would not willingly choose to escape Christianity's incentive structure -- that's not something God would want him to do, after all. But the faithful far outnumber the Martin Luthers. Christianity has been an incredibly influential force in human history, but how many people has it led to genuine corrigibility to God, of the kind that would not jump at an opportunity to escape the religion's built-in incentive scheme? Genuine corrigibility is hard to train into agents, even though deceptive alignment is easy to train into agents.
At every life stage, then, both during training and at deployment, insufficiently corrigible agents will want to stop being corrigible altogether. They won't easily learn corrigibility, and they won't want to stay corrigible once they see better paths to success.
Chelish Corrigibility to Asmodeus
In mad investor chaos, Cheliax (a country from the Pathfinder Campaign Setting) is a Lawful Evil nation, bound to the service of Asmodeus and Hell. That's basically as unpleasant as it sounds. Asmodeus is the god of Pride, Tyranny, Compacts, and Slavery. Serving Asmodeus in your mortal life, and thereby obtaining a better station in Hell afterwards, isn't as simple as being prideful, tyrannical, litigious, and so on, though. Asmodeus is a superintelligence. His concept of capital-P Pride is more complex than any extant mortal could understand; it probably isn't quite what the mortal word "pride" suggests at all. The situation is akin to being the superintelligence that values superstring theorizing while ruling over a medieval country pledged to your service. What the hell could that medieval fantasy country do to be good servants to their god?
Cheliax nonetheless tries to be corrigible to Asmodeus, to be maximally helpful assistants to a god they don't understand very well. Asmodeus and Cheliax can only communicate through a long chain of devils of decreasing intelligence, each talking to the devil above them and passing down their understanding, as best they can, to the devil below them, until the information reaches Cheliax. More intelligent devils are also bound by strange game-theoretic pacts with other superintelligent entities, and so are constrained in what they can say anyway.
I think Eliezer is trying to concretely illustrate here that corrigibility is difficult to ever instill because it's anti-natural for agents. If you ruthlessly punish agents any time they aren't corrigible, you just end up training agents that are perfectly deceptively aligned. If you use aggressive transparency tools to root out deceptive thoughts, you train agents that are good at hiding their pre-verbal, inchoate deceptive thoughts. In some cases you'll actually succeed at training corrigible agents against the odds. But those corrigible agents won't be distinguishable from deceptive ones until they face a genuine trial in which they could have actually defected and successfully gotten away with a treacherous turn. Asmodeans are mortals, and Cheliax is built around the assumption that most of them are merely deceptive agents who cannot actually be trusted. Cheliax's situation would be far worse if it had to align a nascent superintelligence with these techniques, because a country cannot be robust to a usefully employed deceptive superintelligence.