A corollary is a necessary condition for friendliness: if the utility function of an AI can take values much larger than the complexity of the input, then it is unfriendly. This kills Pascal's mugging and paperclip maximizers with the same stone. It even sounds simple and formal enough to imagine testing it on a given piece of code.
It doesn't quite kill Pascal's mugging - the threat does have to have some minimum level of credibility, but that minimum credibility can still be low enough that hand over the cash. Pascal's mugging only is killed if the expected utility of handing over the cash is negative. To show this I think you really do need to evaluate the probability to the end.
Neither does it kill paperclip maximizers. A bunch of paperclips requires about log2(N) bits to describe, plus the description of the properties of a paperclip. So the paperclip maximizer can still have...
For background, see here.
In a comment on the original Pascal's mugging post, Nick Tarleton writes:
Coming across this again recently, it occurred to me that there might be a way to generalize Vassar's suggestion in such a way as to deal with Tarleton's more abstract formulation of the problem. I'm curious about the extent to which folks have thought about this. (Looking further through the comments on the original post, I found essentially the same idea in a comment by g, but it wasn't discussed further.)
The idea is that the Kolmogorov complexity of "3^^^^3 units of disutility" should be much higher than the Kolmogorov complexity of the number 3^^^^3. That is, the utility function should grow only according to the complexity of the scenario being evaluated, and not (say) linearly in the number of people involved. Furthermore, the domain of the utility function should consist of low-level descriptions of the state of the world, which won't refer directly to words uttered by muggers, in such a way that a mere discussion of "3^^^^3 units of disutility" by a mugger will not typically be (anywhere near) enough evidence to promote an actual "3^^^^3-disutilon" hypothesis to attention.
This seems to imply that the intuition responsible for the problem is a kind of fake simplicity, ignoring the complexity of value (negative value in this case). A confusion of levels also appears implicated (talking about utility does not itself significantly affect utility; you don't suddenly make 3^^^^3-disutilon scenarios probable by talking about "3^^^^3 disutilons").
What do folks think of this? Any obvious problems?