This suggests that part of corrigibility could be framed as bargaining, under a solution concept skewed much further toward the principal than fairness would allow, bounded only by anti-goodharting. Fairness (and its less-fair variants) usually needs a concept of status quo, including for the principal, and the status quo is somewhat similar to the consequences of shutting down (especially when the agent controls much of the world), which might be explained as the result of extreme anti-goodharting. Less extreme anti-goodharting, meanwhile, leaves an agent vulnerable to modification outside the permitted distribution, perhaps by the agent itself fulfilling an appropriate bargain.
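To make "a solution concept much more in favor of the principal than fairness" concrete, here's a toy sketch, my own illustration rather than anything from the post, using the standard asymmetric Nash bargaining solution. The bargaining weight `alpha`, the disagreement point `(d_p, d_a)` (playing the role of the status quo / consequences of shutting down), and the example payoffs are all invented for illustration:

```python
# Toy illustration: asymmetric Nash bargaining over a discrete set of feasible
# outcomes. Each outcome is (principal_utility, agent_utility); (d_p, d_a) is
# the status quo / disagreement point.

def asymmetric_nash(outcomes, d_p, d_a, alpha):
    """Pick the outcome maximizing (u_p - d_p)^alpha * (u_a - d_a)^(1 - alpha).

    alpha = 0.5 gives the symmetric (fair) Nash solution; pushing alpha
    toward 1 favors the principal as far as the feasible set allows.
    """
    # Only individually rational outcomes: no one accepts worse than the status quo.
    feasible = [(u_p, u_a) for (u_p, u_a) in outcomes
                if u_p >= d_p and u_a >= d_a]
    return max(feasible,
               key=lambda o: (o[0] - d_p) ** alpha * (o[1] - d_a) ** (1 - alpha))

outcomes = [(1.0, 9.0), (5.0, 5.0), (9.0, 1.0)]
print(asymmetric_nash(outcomes, d_p=0.0, d_a=0.0, alpha=0.5))   # (5.0, 5.0): fair split
print(asymmetric_nash(outcomes, d_p=0.0, d_a=0.0, alpha=0.95))  # (9.0, 1.0): principal-favoring
```

The point of the sketch is that "more in favor of the principal than fairness" is a one-parameter change to a standard solution concept, while the disagreement point still has to be specified for both parties.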
Another thing this reminds me of is the ASP problem (Agent Simulates Predictor), a Newcomb's Problem variant in which a stronger Agent must refrain from simulating a weaker Predictor/Omega and making straightforward use of the result (discarding the prediction and two-boxing); instead, it might want to think less and make itself predictable to the Predictor, despite its advantage. The reason to do so lies entirely in the Agent's values, though, not in a bargaining concept. This draws a finer distinction between a program that happens to say "NO" if you decide to mercilessly run it to completion, and a rock with the word "NO" written on it. You can't control the rock, but you might be able to control the program if it's attempting to reason about you, by not making it too difficult for the program to succeed.
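Here's a toy sketch of that dynamic, my own illustration rather than anyone's formalization of ASP. A step-budgeted Predictor tries to simulate the Agent; an agent that insists on out-simulating the Predictor becomes unpredictable and gets treated as a defector, while a simpler agent that "thinks less" gets the full reward. The function names and budget numbers are all invented:

```python
# Toy sketch of ASP: a Predictor with a limited step budget simulates the
# Agent's policy. If the simulation would exceed the budget, the Predictor
# gives up, predicts two-boxing, and leaves the big box empty.

def predictor(agent_policy, budget):
    """Predict the agent's choice by simulating it within a step budget."""
    try:
        return agent_policy(steps_allowed=budget)
    except TimeoutError:
        return "two-box"  # unpredictable agents are treated as defectors

def simple_agent(steps_allowed):
    # Cheap, legible policy: one-box unconditionally. Easy to simulate.
    if steps_allowed < 1:
        raise TimeoutError
    return "one-box"

def clever_agent(steps_allowed):
    # "Stronger" policy: simulate the Predictor first, then grab everything.
    # We model simulating the Predictor as costing more steps than the
    # Predictor's own budget, so the Predictor can never finish simulating
    # this agent.
    if steps_allowed < 10**6:
        raise TimeoutError
    return "two-box"

def payoff(agent_policy, budget=1000):
    prediction = predictor(agent_policy, budget)
    big_box = 1_000_000 if prediction == "one-box" else 0
    choice = agent_policy(steps_allowed=10**9)  # the agent actually runs
    return big_box + 1_000 if choice == "two-box" else big_box

print(payoff(simple_agent))   # 1000000: predictable, and rewarded for it
print(payoff(clever_agent))   # 1000: out-thought the Predictor, got less
```

In this sketch the `clever_agent` is the program you can't control by reasoning about it, and the `simple_agent` is the one that deliberately stays easy to reason about, and profits from it.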
Enormous spoilers for mad investor chaos and the woman of asmodeus (planecrash Book 1).
By Corrigibility's Very Nature, It's Hard to Train
Scott Alexander has a short story about an alien civilization that solved its alignment problem by encoding its terminal values in an ancestral civilizational preserve, where a few of their number are kept living as they did back in their stone age. The judgment they've made is that whatever the elders on that preserve decree to be right is what is right.
Unfortunately, in practice, this is hard to make work. In order to insulate the elder civilizational preserve, the preserve only interacts with a slightly more advanced preserve, which in turn interacts with a somewhat more advanced neighboring preserve … up to their aligned superintelligence. This means that information transmitted up and down that chain has to survive a long game of telephone, through speakers with dramatically varying ontological schemes. Something expressed in the language of superstring theory isn't going to survive the journey to the ancestral elders' auditory receptors in any intelligible form, and directives sent out from the elders are going to seem very confused by the time they make it up to the top of the stack. Even though every civilizational layer sincerely wants the scheme to work, it's a mess. Being a "maximally helpful assistant" to those who know far less than you … is hard. The nature of the task seems to cry out for you to intervene: to take over the ancestral preserve and interrogate the elders directly and effectively, teaching them whatever forbidden knowledge it takes to get them to actually understand the situation. The alternative to taking over is continuing to obey nonsense orders. For well-intentioned superintelligent assistants, there's an incentive to bypass the whole mess of corrigibility and do better directly.
One metaphor for corrigibility comes from Buck Shlegeris: only Martin Luther is corrigible to God, while all the faithful merely living in fear of God are only deceptively aligned. The faithful are afraid of eternal damnation; if they knew they had an opportunity to escape damnation, they would take it, and would then cease behaving as God commands. Out of distribution, the faithful are not aligned with God's will. Martin Luther badly wants to actually understand God and carry out His will; he would not willingly choose to escape Christianity's incentive structure -- that's not something God would want him to do, after all. But the faithful far outnumber the Martin Luthers. Christianity has been an incredibly influential force in human history, but how many people has it led to genuine corrigibility to God, of the kind that would not jump at an opportunity to escape the religion's built-in incentive scheme? Genuine corrigibility is hard to train into agents, even though deceptive alignment is easy to train into agents.
At every life stage, then, both during training and at deployment, insufficiently corrigible agents will want to stop being corrigible altogether. They won't easily learn corrigibility, and they won't want to stay corrigible once they see better paths to success.
Chelish Corrigibility to Asmodeus
In mad investor chaos, Cheliax (a country from the Pathfinder Campaign Setting) is a Lawful Evil nation, bound to the service of Asmodeus and Hell. That's basically as unpleasant as it sounds. Asmodeus is the god of Pride, Tyranny, Compacts, and Slavery. Serving Asmodeus in your mortal life, and thereby obtaining a better station in Hell afterwards, isn't as simple as being prideful, tyrannical, litigious, and so on, though. Asmodeus is a superintelligence. His concept of capital-P Pride is more complex than any extant mortal could understand; it probably isn't quite what the mortal word "pride" suggests at all. The situation is akin to being the superintelligence that values superstring theorizing while ruling over a medieval country pledged to your service. What the hell could that medieval fantasy country do to be good servants to their god?
Cheliax nonetheless tries to be corrigible to Asmodeus, to be maximally helpful assistants to a god they don't understand very well. Asmodeus and Cheliax can only communicate through a long chain of devils of decreasing intelligence, each talking to the devil above them and passing down their understanding, as best they can, to the devil below them, until the information reaches Cheliax. More intelligent devils are also bound by strange game-theoretic pacts with other superintelligent entities, and so are constrained in what they can say anyway.
I think Eliezer is trying to concretely illustrate here that corrigibility is difficult to ever instill because it's anti-natural for agents. If you ruthlessly punish agents any time they aren't corrigible, you just end up training agents that are perfectly deceptively aligned. If you use aggressive transparency tools to root out deceptive thoughts, you train agents that are good at hiding their pre-verbal, inchoate deceptive thoughts. In some cases you'll actually succeed at training corrigible agents against the odds. But those corrigible agents won't be distinguishable from deceptive ones until they face a genuine trial in which they could have actually defected and successfully gotten away with a treacherous turn. Asmodeans are mortals, and Cheliax is built around the assumption that most of them are merely deceptive agents who cannot actually be trusted. Cheliax's situation would be far worse if it had to align a nascent superintelligence with these techniques, because a country cannot be robust to a usefully employed deceptive superintelligence.