Ah, thanks for pointing out the typo.
I'll probably create a post soon-ish with more visualizations covering cases like the ones you suggested.
You're right that the model is most pertinent to cases where we've already solved the alignment problem pretty well but still want to try other safety measures. I'm particularly thinking about cases where the AIs are so advanced that humans can't really supervise them well, so the AIs have to supervise each other. In that case, I'm not sure how p would behave as a function of AI capability. Maybe it's best to assume that p increases with capability, just so we're aware of what the worst case could look like?
I love the commentary about meaning and morality here: how Quirrell pushes back on Harry's "obvious theorems" about morality by pointing out that, in the end, everyone simply does what they want to do, with (I think) the suggestion that a Dark Lord is simply someone who is really good at getting what he wants, along with a lot of ambiguity about whether this is actually a good way to act.
And of course the pervasive strangeness of the fact that this conversation is occurring between rationalist wizards in an HP fan fiction just makes it even better...
Thanks for pointing this out -- it's helpful for placing this in idea space. Basically, what I'm describing here is an information cascade in which the actors fail to realize that others' beliefs are not independent.
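To make that concrete, here's a minimal toy simulation of the failure mode (my own illustrative sketch, not the model from the post; the function name, agent count, and signal accuracy are all made up for the example). Each agent gets a noisy private signal about a binary state and then treats every earlier announcement as if it were an independent signal, so once a couple of early announcements happen to agree, everyone afterwards just echoes them and their private information gets swamped.

```python
import random

def run_naive_cascade(n_agents=20, signal_accuracy=0.6, true_state=1, seed=0):
    """Toy sketch of an information cascade: agents announce beliefs in
    sequence and (wrongly) count every earlier announcement as an
    independent piece of evidence."""
    rng = random.Random(seed)
    announcements = []
    for _ in range(n_agents):
        # Private signal: correct with probability signal_accuracy.
        private = true_state if rng.random() < signal_accuracy else 1 - true_state
        # Naive tally: own signal plus every prior announcement, each
        # treated as if it were independent evidence.
        votes_for_1 = private + sum(announcements)
        votes_for_0 = (1 - private) + (len(announcements) - sum(announcements))
        if votes_for_1 > votes_for_0:
            announcement = 1
        elif votes_for_0 > votes_for_1:
            announcement = 0
        else:
            announcement = private  # tie: fall back on own signal
        announcements.append(announcement)
    return announcements

if __name__ == "__main__":
    for seed in range(5):
        print(run_naive_cascade(seed=seed))
```

Once two early announcements agree, every later agent's tally is dominated by the echoes, so the whole sequence locks in on roughly two signals' worth of real information -- which is the failure of independence I had in mind.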