For those who aren't familiar, Pascal's Mugging is a simple thought experiment that seems to demonstrate an intuitive flaw in naive expected utility maximization. In the classic version, someone walks up to you on the street, and says, 'Hi, I'm an entity outside your current model of the universe with essentially unlimited capabilities. If you don't give me five dollars, I'm going to use my powers to create 3^^^^3 people, and then torture them to death.' (For those not familiar with Knuth up-arrow notation, see here). The idea being that however small your probability is that the person is telling the truth, they can simply state a number that's grossly larger - and when you shut up and multiply, expected utility calculations say you should give them the five dollars, along with pretty much anything else they ask for.
Intuitively, this is nonsense. However, an AI under construction doesn't have a piece of code that lights up when exposed to nonsense. Not unless we program one in. And formalizing why, exactly, we shouldn't listen to the mugger is not as trivial as it sounds. The actual underlying problem has to do with how we handle arbitrarily small probabilities. There are a number of variations you could construct on the original problem that present the same paradoxical results. There are also a number of simple hacks you could undertake that produce the correct results in this particular case, but these are worrying (not to mention unsatisfying) for a number of reasons.
So, with the background out of the way, let's move on to a potential approach to solving the problem which occurred to me about fifteen minutes ago while I was lying in bed with a bad case of insomnia at about five in the morning. If it winds up being incoherent, I blame sleep deprivation. If not, I take full credit.
Let's take a look at a new thought experiment. Let's say someone comes up to you and tells you that they have magic powers, and will make a magic pony fall out of the sky. Let's say that, through some bizarrely specific priors, you decide that the probability that they're telling the truth (and, therefore, the probability that a magic pony is about to fall from the sky) is exactly 1/2^100. That's all well and good.
Now, let's say that later that day, someone comes up to you, and hands you a fair quarter and says that if you flip it one hundred times, the probability that you'll get a straight run of heads is 1/2^100. You agree with them, chat about math for a bit, and then leave with their quarter.
I propose that the probability value in the second case, while superficially identical to the probability value in the first case, represents a fundamentally different kind of claim about reality than the first case. In the first case, you believe, overwhelmingly, that a magic pony will not fall from the sky. You believe, overwhelmingly, that the probability (in underlying reality, divorced from the map and its limitations) is zero. It is only grudgingly that you inch even a tiny morsel of probability into the other hypothesis (that the universe is structured in such a way as to make the probability non-zero).
In the second case, you also believe, overwhelmingly, that you will not see the event in question (a run of heads). However, you don't believe that the probability is zero. You believe it's 1/2^100. You believe that, through only the lawful operation of the universe that actually exists, you could be surprised, even if it's not likely. You believe that if you ran the experiment in question enough times, you would probably, eventually, see a run of one hundred heads. This is not true for the first case. No matter how many times somebody pulls the pony trick, a rational agent is never going to get their hopes up.
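The arithmetic behind that contrast can be checked directly. A minimal Python sketch (the function name is my own; the log-space form is needed because 1 - 1/2^100 rounds to 1.0 in floating point):

```python
import math

# Probability of a run of 100 heads with a fair coin: one in 2^100.
p = 2.0 ** -100          # about 7.9e-31

# Chance of seeing at least one all-heads run in k independent
# repetitions of the 100-flip experiment: 1 - (1 - p)^k,
# computed in log space to avoid rounding (1 - p) to 1.0.
def p_at_least_one_run(p, k):
    return -math.expm1(k * math.log1p(-p))

# After 2^100 repetitions, the run is more likely than not to have
# appeared at least once (about 1 - 1/e, or 0.63). Nothing analogous
# holds for the pony claim, however many times it is repeated.
print(p_at_least_one_run(p, 2 ** 100))
```

As the number of repetitions grows, this probability approaches 1, which is exactly the sense in which you "would probably, eventually, see a run of one hundred heads."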
I would like, at this point, to talk about the notion of metaconfidence. When we talk to the crazy pony man, and to the woman with the coin, what we leave with are two identical numerical probabilities. However, those numbers do not represent the sum total of the information at our disposal. In the two cases, we have differing levels of confidence in our levels of confidence. Furthermore, this difference has actual ramifications for what a rational agent should expect to observe. In other words, even from a very conservative perspective, metaconfidence pays rent. By treating the two probabilities as identical, we are needlessly throwing away information. I'm honestly not sure if this topic has been discussed before; I am not up to date on the literature on the subject. If it has already been thoroughly discussed, I apologize for the waste of time.
Disclaimer aside, I'd like to propose that we push this a step further, and say that metaconfidence should play a role in how we calculate expected utility. If we have a very small probability of a large payoff (positive or negative), we should behave differently when metaconfidence is high than when it is low.
From a very superficial analysis, lying in bed, metaconfidence appears to be directional. In the pony case, low metaconfidence should not lead us to suspect that the true probability of a pony dropping out of the sky is HIGHER than our initial estimate. It works the other way as well: if we have a very high degree of confidence in some event (the sun rising tomorrow), and we get some very suspect evidence to the contrary (an ancient civilization predicting the end of the world tonight), and we update our probability downward slightly, our low metaconfidence in that update should not make us believe that the sun is less likely to rise tomorrow than we thought. Low metaconfidence should move our effective probability estimate against the direction of the evidence that we have low confidence in: the pony is less likely, and the sunrise is more likely, than a naive probability estimate would suggest.
So, the event in a claim like the pony claim (or Pascal's mugging), where we have both a very low estimated probability and a very low metaconfidence, should be treated as dramatically less likely to actually happen, in the real world, than the event in a case where we have a low estimated probability but a very high confidence in that probability. Compare the pony with the coins. Rationally, we can only mathematically justify so low a confidence in the crazy pony man's claims. However, in the territory, you can add enough coins that the two probabilities are mathematically equal, and you are still more likely to get a run of heads than you are to have a pony magically drop out of the sky. I am proposing metaconfidence weighting as a way to get around this issue, and to allow our map to more accurately reflect the underlying territory. It's not perfect, since metaconfidence is still, ultimately, calculated from our map of the territory, but it seems to me, based on my extremely brief analysis, that it is at least an improvement on the current model.
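To make the directional claim concrete, here is a deliberately crude sketch. The linear weighting scheme and the function name are my own illustration of the idea, not a formalism from the post: treat the estimate as trustworthy in proportion to a metaconfidence weight, and shrink it toward the relevant extreme otherwise.

```python
# Toy model: a probability estimate plus a metaconfidence weight w in
# [0, 1]. At w = 1 we trust the estimate fully; as w falls, the
# effective probability shrinks toward the extreme the evidence was
# pulling us away from (0 for the pony, 1 for the sunrise).
def effective_probability(estimate, w, extreme):
    return w * estimate + (1 - w) * extreme

pony = effective_probability(2.0 ** -100, 0.1, 0.0)    # pushed lower
sunrise = effective_probability(1 - 1e-9, 0.1, 1.0)    # pushed higher
coin = effective_probability(2.0 ** -100, 1.0, 0.0)    # unchanged
```

Under this toy scheme, the pony's effective probability lands below the coin run's even though the two raw estimates are numerically identical, while the sunrise is pulled back toward certainty.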
Essentially, this idea is based on the understanding that the numbers we generate and call probability do not, in fact, correspond to the actual rules of the territory. They are approximations, they are perturbed by observation, and our finite data set limits the resolution of the probability intervals we can draw. This causes systematic distortions at the extreme ends of the probability spectrum, especially at the small end, where the relative scale of the distortion rises dramatically as the actual probability shrinks. I believe that the apparently absurd behavior demonstrated by an expected-utility agent exposed to Pascal's mugging is a result of these distortions. I am proposing that we attempt to compensate by filling in the missing information at the extreme ends of the distribution with data from our model about our sources of evidence, and about the underlying nature of the territory. In other words, this is simply a way to use our available evidence more efficiently, and I suspect that, in practice, it eliminates many of the Pascal's-mugging-style problems we currently encounter.
I apologize for not having worked the math out completely. I would like to reiterate that it is six thirty in the morning, and I've only been thinking about the subject for about a hundred minutes. That said, I'm not likely to get any sleep either way, so I thought I'd jot the idea down and see what you folks thought. Having outside eyes is very helpful, when you've just had a Brilliant New Idea.
I am skeptical about "metaconfidence."
If you have all the relevant probabilities, you have all the information you need to calculate expected utilities for all possible choices--you don't need to decide on how "metaconfident" you are in these probabilities. Your probabilities may not be based on very good data, and so you might anticipate that these probabilities will change drastically when you update on new observations--I think you would call this "low metaconfidence." But your strategy for updating based on new evidence is already encoded in your priors for what evidence you expect to observe. I don't think metaconfidence is useful as a new, independent concept, apart from indicating that your probabilities are susceptible to change based on future evidence.
In particular, if "metaconfidence" considerations seem to be prompting you to distrust your probability as being either systematically too high or too low, then you should just immediately update your probability in that direction, by conservation of expected evidence. So when you say that "low metaconfidence should move our effective probability estimate against the direction of the evidence that we have low confidence in," then if you really believe that, you ought to preemptively update your "naive" probabilities until you no longer believe they are systematically biased.
I think a particular misconception may be leading you astray. You write: "You believe that if you ran the experiment in question enough times, you would probably, eventually, see a run of one hundred heads. This is not true for the first case."
I think what you are referring to here is that if you repeated the 100-coin-flip experiment 2^100 times, you expect to see 100 heads on average once. But even if the pony guy tried 2^100 times, you do not expect an average of one pony to fall from the sky. You expect him to simply lack the power, and so expect him to fail every time.
The real difference here is not metaconfidence, it's that each set of 100 coin flips is completely independent of any other set of 100 coin flips. But the pony guy's first attempt to summon a pony is not independent of his second attempt to summon a pony. If he fails on his first attempt, you strongly expect him to fail on every future attempt.
In more detail:
If you are assigning a probability of 1/2^100 to the pony guy's claim, you think that the claim is pretty improbable. How improbable? You are saying, "I expect that if I heard 2^100 similarly improbable, but completely independent and uncorrelated claims, on average one such claim would be true and all the rest would be false." One such completely independent claim to which you might assign the same probability might be if I predicted that "tomorrow, physicists will discover a new sentient fundamental particle that vacations in Bermuda every August." If your probability is 1/2^100, you really should expect that if you heard 2^100 equally astonishing claims, one of them would actually turn out to be true. But in particular you don't expect that if the pony guy repeated his claim 2^100 times, on average he'd be telling the truth once. Repetitions of the same claim by the same person are clearly not independent claims.
Superficially this may look different from the case of the coin flips. There, you expect that if you make 2^100 tests of the claim "flipping this coin 100 times will give all heads," it would be true on average once. To symmetrize the situation, let me make the coin claim more specific without changing it in any meaningful way: "The first 100 flips of this coin will all be heads." This, like "I can summon ponies from the sky" is a claim that is simply either true or false. When you assign a probability of 1/2^100 to the coin claim, you are again saying, "I expect that if I heard 2^100 similarly improbable, but completely independent and uncorrelated claims, on average one of them would be true." Because different flips of the same coin happen to have the property of statistical independence, similarly improbable but completely independent claims are easy to construct. For instance, "The second 100 flips of this coin will all be heads," etc.
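The independence point is easy to check with a quick Monte Carlo sketch. All the numbers here are my own exaggerated stand-ins for 1/2^100, chosen so the simulation finishes in a reasonable time; the structural contrast is what matters.

```python
import random

random.seed(0)
P = 1 / 16    # stand-in for 1/2^100, exaggerated for speed
N = 50        # repetitions of each claim
RUNS = 20000  # Monte Carlo samples

# Independent claims: each repetition is a fresh draw, like a new
# block of 100 coin flips. P(at least one success) = 1 - (1 - P)^N,
# which approaches 1 as N grows.
indep = sum(any(random.random() < P for _ in range(N))
            for _ in range(RUNS)) / RUNS

# Correlated repetitions: whether the claimant has the power is
# settled once, and all N repetitions share that single outcome, so
# P(at least one success) stays at P no matter how large N is.
corr = sum(random.random() < P for _ in range(RUNS)) / RUNS

print(indep, corr)  # roughly 0.96 vs 0.06
```

The expected *number* of successes is the same in both cases, but the distribution is completely different: the independent trials almost always produce at least one success, while the correlated claim succeeds either every time or, far more probably, never.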
This is all to say that I disagree that "the probability value in the second case, while superficially identical to the probability value in the first case, represents a fundamentally different kind of claim about reality than the first case."
The two probabilities represent exactly analogous claims, and there is no need for a notion of metaconfidence to distinguish them.