You're thinking about this too hard.
There are, in fact, three solutions, and two of them are fairly obvious ones.
1) We have observed 0 such things in existence. Ergo, when someone comes up to me and says that they are someone who will torture people I have no way of ever knowing existing unless I give them $5, I can simply assign them the probability of 0 that they are telling the truth. Seeing as the vast, vast majority of things I have observed 0 of do not exist, and we can construct an infinite number of things, assigning a probability of 0 to any particular thing I have never observed and have no evidence of is the only rational thing to do.
2) Even assuming they do have the power to do so, there is no guarantee that the person is being rational or telling the truth. They may torture those people regardless. They might torture them BECAUSE I gave them $5. They might do so at random. They might go up to the next person and say the next thing. It doesn't matter. As such, their demand does not change the probability that those people will be tortured at all, because I have no reason to trust them, and their words have not changed the probabilities one way or the other. Ergo, again, you don't give them money.
3) Given that I have no way of knowing whether those people exist, it just doesn't matter. Anything which is unobservable does not matter at all, because, by its very nature, if it cannot be observed, then it cannot be changing the world around me. Because that is ultimately what matters, it doesn't matter if they have the power or not, because I have no way of knowing and no way of determining the truth of the statement. Similar to the IPU, the fact that I cannot disprove it is not a rational reason to believe in it, and indeed the fact that it is non-falsifiable indicates that it doesn't matter whether it exists or not - the universe is identical either way.
It is inherently irrational to believe in things which are inherently non-falsifiable, because they have no means of influencing anything. In fact, that's pretty core to what rationality is about.
The problem is with formalizing solutions, and making them consistent with other aspects that one would want an AI system to have (e.g. ability to update on the evidence). Your suggested three solutions don't work in this respect because:
1) If we e.g. make an AI literally assign a probability of 0 to scenarios that are too unlikely, then it wouldn't be able to update on additional evidence using the simple Bayesian formula. So an actual Matrix Lord wouldn't be able to convince the AI he/she was a Matrix Lord even if he/she reversed gravity, or made it snow...
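To make the objection concrete, here is a minimal Python sketch of repeated Bayesian updating. The function names and all the numbers (three "miracles", each assumed to be far more likely if the speaker really is a Matrix Lord) are hypothetical and chosen only for illustration.

```python
def bayes_update(prior, p_evidence_given_h, p_evidence_given_not_h):
    """Return P(H | E) computed with Bayes' rule."""
    numerator = prior * p_evidence_given_h
    denominator = numerator + (1.0 - prior) * p_evidence_given_not_h
    if denominator == 0.0:
        raise ValueError("evidence has zero probability under both hypotheses")
    return numerator / denominator


def update_many(prior, likelihood_pairs):
    """Apply bayes_update once for each piece of evidence, in order."""
    probability = prior
    for p_e_h, p_e_not_h in likelihood_pairs:
        probability = bayes_update(probability, p_e_h, p_e_not_h)
    return probability


# Hypothetical likelihoods: each miracle is assumed to be very probable if the
# claimant is a Matrix Lord and extremely improbable otherwise.
miracles = [(0.9, 1e-9)] * 3

zero_prior_posterior = update_many(0.0, miracles)    # prior of exactly 0
tiny_prior_posterior = update_many(1e-20, miracles)  # tiny but nonzero prior

print(zero_prior_posterior, tiny_prior_posterior)
```

Under these assumed numbers the prior of exactly 0 never moves, because the numerator of Bayes' rule is multiplied by 0 at every step, while the tiny nonzero prior ends up close to 1 after three such observations.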
This is meant as a rough collection of five ideas of mine on potential anti-Pascal Mugging tactics. I don't have much hope that the first three will be of any use at all, and I'm afraid that I'm not mathematically inclined enough to know whether the last two are any good even as a partial solution to the core problem of Pascal's Mugging -- so I'd appreciate it if people with better mathematical credentials than mine could see whether any of my intuitions can be formalized in a useful manner.
0. Introducing the problem (this may bore you if you're aware of both the original and the mugger-less form of Pascal's Mugging)
First of all the basics: Pascal's Mugging in its original form is described in the following way:
This is the "shallow" form of Pascal's mugging, which includes a person that (almost certainly) is attempting to deceive the prospective AI. However let's introduce some further statements similar to the above, to avoid particular objections that might be used in some (even shallower) attempted rebuttals:
And I won't tell you why, and I probably lie, but can you really take that chance?"
Blaise fills with trepidation as his calculations all turn out the devil's way.
And they say in the Paris catacombs, his ghost is fiddlin' to this day.
I think these are all trivial variations of this basic version of Pascal's Mugging: The utility a prankster derives from the pleasure of successfully pranking the AI wouldn't be treated differently in kind to the utility of 5 dollars -- nor is the explicit offer of a trade different than the supposedly free offer of information.
The mugger-less version is on the other hand more interesting and more problematic. You don't actually need a person to make such a statement -- the AI, without any prompting, can assign prior probabilities to theories which produce outcomes of positive or negative value vastly greater than their assigned improbabilities. I've seen its best description in the comment by Kindly and the corresponding response by Eliezer:
Kindly: Very many hypotheses -- arguably infinitely many -- can be formed about how the world works. In particular, some of these hypotheses imply that by doing something counter-intuitive in following those hypothesis, you get ridiculously awesome outcomes. For example, even in advance of me posting this comment, you could form the hypothesis "if I send Kindly $5 by Paypal, he or she will refrain from torturing 3^^^3 people in the matrix and instead give them candy." Now, usually all such hypotheses are low-probability and that decreases the expected benefit from performing these counter-intuitive actions. But how can you show that in all cases this expected benefit is sufficiently low to justify ignoring it?
Eliezer Yudkowsky: Right, this is the real core of Pascal's Mugging [...]. For aggregative utility functions over a model of the environment which e.g. treat all sentient beings (or all paperclips) as having equal value without diminishing marginal returns, and all epistemic models which induce simplicity-weighted explanations of sensory experience, all decisions will be dominated by tiny variances in the probability of extremely unlikely hypotheses because the "model size" of a hypothesis can grow Busy-Beaver faster than its Kolmogorov complexity.
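As a rough illustration of the point in the quote (payoffs outgrowing the simplicity penalty), here is a minimal Python sketch. The 2^-bits prior, the base description length of 100 bits, and the cost of 8 extra bits per additional arrow are all assumptions made for illustration, not figures from the discussion. Only tiny up-arrow values can be computed at all, so the sketch stops at 3^^3.

```python
from fractions import Fraction


def up_arrow(a, n, b):
    """Knuth up-arrow with n arrows; only feasible for very small inputs."""
    if n == 1:
        return a ** b
    if b == 0:
        return 1
    return up_arrow(a, n - 1, up_arrow(a, n, b - 1))


def simplicity_prior(description_bits):
    """Assumed simplicity-weighted prior: 2 ** -bits."""
    return Fraction(1, 2 ** description_bits)


BASE_BITS = 100      # hypothetical cost of stating the hypothesis at all
BITS_PER_ARROW = 8   # hypothetical cost of writing one more arrow

contributions = []
for arrows in (1, 2):  # 3^3 and 3^^3; 3^^^3 is far too large to compute
    payoff = up_arrow(3, arrows, 3)
    prior = simplicity_prior(BASE_BITS + BITS_PER_ARROW * arrows)
    contributions.append(prior * payoff)  # probability-weighted payoff

growth = contributions[1] / contributions[0]
print(f"one extra arrow multiplies the weighted payoff by about {float(growth):.3g}")
```

In this toy setup the extra arrow costs a fixed factor of 2^8 in prior probability but multiplies the payoff by vastly more, which is the pattern the quote describes: the weighted payoff grows rather than shrinks as the hypothesis gets more extreme.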
The following lists five ideas of mine, ordered from least to most promising in the search for a general solution. Though I considered them seriously initially, I no longer really think that (1), (2) or (3) hold any promise, being limited, overly specific or even plain false -- I nonetheless list them for completeness' sake, to get them out of my head and in case anyone sees something in them that could potentially be the seed of something better. I'm slightly more hopeful for solutions (4) and (5) -- they feel to me intuitively as if they may be leading to something good. But I'd need math that I don't really have to prove or disprove it.
1. The James T. Kirk solution
Say there's a given prior possibility P(X=Matrix Lord) that any given human being is a Matrix Lord with the power to inflict 3^^^3 points of utility/disutility. The fact that such a being with such vast power seemingly wants five dollars (or a million dollars, or to be crowned Queen of Australia), makes it actually *less* likely that such a being is a Matrix Lord.
We don't actually need the vast unlikely probabilities to illustrate the truth of this. Let's consider an AI with a security backdoor -- it's known for a fact that there's one person in the world who has been given a 10-word passkey that can destroy the AI at will. (The AI is also disallowed from attempting to avoid such a penalty by e.g. killing the person in question.)
So let's say the prior probability for any given person being the key keeper in question is "1 in 7 billion".
Now Person X says to the AI: "Hey, I'm the key keeper. I refuse to give you any evidence of this, but I'll destroy you if you don't give me 20 dollars."
Does this make Person X more or less likely to be the key keeper? My own intuition tells me "less likely".
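Whether the demand makes Person X more or less likely to be the key keeper depends on a likelihood ratio: how much more (or less) often a real key keeper would make this exact demand compared with someone who is not the key keeper. A minimal Python sketch in odds form follows; the two ratios tried (1/10 and 10) are hypothetical placeholders, not estimates.

```python
from fractions import Fraction


def posterior(prior, likelihood_ratio):
    """Update a prior using P(statement | keeper) / P(statement | not keeper)."""
    prior_odds = prior / (1 - prior)
    posterior_odds = prior_odds * likelihood_ratio
    return posterior_odds / (1 + posterior_odds)


prior = Fraction(1, 7_000_000_000)  # the "1 in 7 billion" prior from the text

# Hypothetical case 1: real key keepers make this demand ten times less often
# than non-keepers do (the "less likely" intuition).
less_likely = posterior(prior, Fraction(1, 10))

# Hypothetical case 2: real key keepers make it ten times more often.
more_likely = posterior(prior, Fraction(10))

print(float(prior), float(less_likely), float(more_likely))
```

The sketch does not settle which case holds. It only shows that the "less likely" intuition amounts to the claim that the likelihood ratio is below 1.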
Unfortunately, one fundamental problem with the above syllogism is that at best it can tell us that it's only the muggerless version that we need fear. Useless for any serious purpose.
2. The presumption of unfriendliness
Blaise's tactic should have been not to obey the devil's warning, nor even to do the opposite of his suggestion (since the devil could be smart enough to know how to use reverse psychology), but rather to ignore him as much as possible: Blaise should end the song and dance at the point in time he would have if he weren't aware of the devil's statement.
All the above is obvious for cartoonish villains like the devil -- known malicious agents who are known to have a utility function opposed to ours -- and a Matrix Lord who is willing to torture 3^^^3 people for the purpose of getting 5 dollars is probably no better; better to just ignore them. But I wonder: Can't a similar logic be used in handling almost any agent whose utility function is merely different from one's own (which is the vast majority of agents in mindspace)?
Moreover, a thought occurs: Doesn't it seem likely that for any supposed impact X, the greater the promised X, the less likely two different minds are both positively inclined towards it? So for any supposed impact X, shouldn't the presumption of unfriendliness (incompatibility in utility functions) increase in like measure?
3. The Xenios Zeus.
Let's say that each current resident has a small chance (not necessarily the same small chance) of being a Matrix Lord willing to destroy the world and throw a temper tantrum that'll punish 3^^^3 people if you don't behave properly according to what he considers proper. Much like each traveller has a chance of being Zeus.
One might think that you might have to examine the data very closely to figure out which random person has the greatest probability of being Zeus -- but that rather fails to get the moral of the myth, which isn't "figure out who is secretly Zeus" but rather "treat everyone nicely, just in case". If someone does not reveal themselves to be a god, then they don't expect to be treated like a god, but might still expect human decency.
To put it in LW analogous terms one might argue that an AI could treat the value system of even Matrix Lords as roughly centered around the value system of human beings -- so that by serving the CEV of humanity, it would also have the maximum chance of pleasing (or at least not angering) any Matrix Lords in question.
Unfortunately in retrospect I think this idea of mine is, frankly, crap. Not only is it overly specific and again seems to treat the surface problem rather than the core problem, but I realized it reached the same conclusion as (2) by asserting the exact opposite -- the previous idea made an assumption of unfriendliness, this one makes an assumption of minds being centered around friendliness. If I'm using two contradictory ideas to lead to the same conclusion, it probably indicates that this is a result of having written the bottom line -- not of making an actually useful argument.
So not much hope remains in me for solutions 1-3. Let's go to 4.
4. The WWBBD Principle. (What Would a Boltzmann Brain Do?)
Calculations of future utility have a discounting factor naturally built into them -- which is the uncertainty of being able to calculate and control such a future properly. So in a very natural (no need to program it in) manner an AI would prefer the same utility for 5 seconds in the future rather than 5 minutes in the future, and for 5 minutes in the future rather than 5 years in the future.
This looks at first glance like a time-discount, but in actuality it's an uncertainty-discount. So an AI with very good predictive capacity would be able to discount future utility less, because the uncertainty would be less. But the uncertainty would never be quite zero.
So even as the thought of 3^^^3 lives outweighs the tiny probability, couldn't a similar factor push in the opposite direction, especially when dealing with hypotheses in which the AI will have no further control? I don't know. Bring in the mathematicians.
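One possible way to write down the uncertainty-discount from this section is sketched below in Python. It assumes, purely for illustration, that the AI's predictions stay valid from one second to the next with a fixed probability (the hypothetical `reliability_per_second`), so that utility expected t seconds ahead is weighted by that probability raised to the power t. Whether real uncertainty compounds this way is an open assumption.

```python
def discounted_utility(utility, reliability_per_second, seconds_ahead):
    """Weight a future utility by the chance the prediction still holds."""
    return utility * reliability_per_second ** seconds_ahead


FIVE_SECONDS = 5
FIVE_MINUTES = 5 * 60
FIVE_YEARS = 5 * 365 * 24 * 60 * 60

# Hypothetical reliabilities: an ordinary predictor and a much better one.
ordinary = 1 - 1e-6
excellent = 1 - 1e-9

for label, horizon in [("5 s", FIVE_SECONDS),
                       ("5 min", FIVE_MINUTES),
                       ("5 yr", FIVE_YEARS)]:
    print(label,
          discounted_utility(1.0, ordinary, horizon),
          discounted_utility(1.0, excellent, horizon))
```

In this toy model the better predictor discounts less at every horizon, but its weight still falls below 1 for any positive horizon, matching the remark that the uncertainty never quite reaches zero.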
5. The Law of Visible Impact (a.k.a. The Generalized Hanson)
I have to say that I find this argument unappealing and unconvincing. One problem I have with it is that it seems to treat the concept of "person" as ontologically fundamental -- it's an objection I kinda have also against the Simulation argument and the Doomsday Argument.
Moreover, wouldn't this argument cease to apply if I were merely witnessing the Pascal's Mugging taking place, so that as a mere witness I should be hoping for the mugged entity to submit? This sounds nonsensical.
But I think Hanson's argument can be modified so here I'd like to offer what I'll call the Generalized Hanson: Penalize the prior probability of hypotheses which argue for the existence of high impact events whose consequences nonetheless remain unobserved.
If life's creation is easy, why aren't we seeing alien civilizations consuming the stars? Therefore most likely life's creation isn't easy at all.
If the universe allowed easy time-travel, where are all the time-travellers? Hence the world most likely doesn't allow easy time-travel.
If Matrix Lords exist that are apt to create 3^^^3 people and torture them for their amusement, why aren't we being tortured (or witnessing such torture) right now? Therefore most likely such Matrix Lords are rare enough to nullify their impact.
In short, the higher the impact of a hypothetical event, the more evidence we should expect to see for it in the surrounding universe -- the absence of such evidence therefore counts against the hypothesis to a degree commensurate with the extent of the hypothetical impact.
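One possible formalization of this intuition, offered only as a sketch: assume each unit of impact independently has some small chance of leaving a visible trace (the hypothetical `trace_chance_per_unit`), and that we see no traces if the hypothesis is false. Then seeing nothing is evidence against the hypothesis, and more strongly so the larger the claimed impact. The model and all its numbers are assumptions, not something established in the discussion above.

```python
import math


def p_no_trace(impact, trace_chance_per_unit):
    """P(we observe nothing | hypothesis true), i.e. (1 - q) ** impact."""
    return math.exp(impact * math.log1p(-trace_chance_per_unit))


def penalized_posterior(prior, impact, trace_chance_per_unit):
    """Posterior of the hypothesis after observing no visible trace."""
    numerator = prior * p_no_trace(impact, trace_chance_per_unit)
    # If the hypothesis is false, seeing nothing is assumed to be certain.
    return numerator / (numerator + (1.0 - prior))


def expected_impact(prior, impact, trace_chance_per_unit):
    """Impact weighted by the penalized posterior."""
    return impact * penalized_posterior(prior, impact, trace_chance_per_unit)


PRIOR = 1e-6          # hypothetical prior for a high-impact hypothesis
TRACE_CHANCE = 1e-3   # hypothetical per-unit chance of a visible trace

for impact in (1e2, 1e4, 1e6):
    print(impact,
          penalized_posterior(PRIOR, impact, TRACE_CHANCE),
          expected_impact(PRIOR, impact, TRACE_CHANCE))
```

Under these assumptions the penalty shrinks exponentially in the impact, so the weighted impact eventually falls as the claimed impact grows instead of increasing without bound. Whether any such trace model is justified for unobservable Matrix-style hypotheses is exactly the open question.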
I'm probably expressing the above intuition quite badly, but again: I hope someone with actual mathematical skills can take the above and make it into something useful; or tell me that it's not useful as an anti-Pascal Mugging tactic at all.