This is meant as a rough collection of five ideas of mine on potential anti-Pascal-Mugging tactics. I don't have much hope that the first three will be of any use at all, and I'm afraid I'm not mathematically inclined enough to know whether the last two are any good even as a partial solution to the core problem of Pascal's Mugging -- so I'd appreciate it if people with better mathematical credentials than mine could see whether any of my intuitions can be formalized in a useful manner.

0. Introducing the problem (this may bore you if you're aware of both the original and the mugger-less form of Pascal's Mugging)

First of all, the basics: Pascal's Mugging in its original form is described in the following way:

  • Now suppose someone comes to me and says, "Give me five dollars, or I'll use my magic powers from outside the Matrix to run a Turing machine that simulates and kills 3^^^3 people."

This is the "shallow" form of Pascal's mugging, which includes a person that (almost certainly) is attempting to deceive the prospective AI. However let's introduce some further statements similar to the above, to avoid particular objections that might be used in some (even shallower) attempted rebuttals:

  • "Give me five dollars, and I'll use my magic powers from outside the Matrix to increase the utility of every human being by 3^^^^3 utilons" (a supposedly positive trade rather than a blackmailer's threat)
  • "I'm an alien in disguise - unless you publicly proclaim allegiance to your insect overlords, we will destroy you then torture all humanity for 3^^^^3 years" (a prankster asks for something which might be useful to an actual alien, but on a material-level not useful to a human liar)
  • "My consciousness has partially time-travelled from the future into the past, and one of the few tidbits I remember is that it would be of effectively infinite utility if you asked everyone to call you Princess Tutu." (no trade offered at all, seemingly just a statement of epistemic belief)
  • Says the Devil "It's infinitely bad to end that song and dance
    And I won't tell you why, and I probably lie, but can you really take that chance?"
    Blaise fills with trepidation as his calculations all turn out the devil's way.
    And they say in the Paris catacombs, his ghost is fiddlin' to this day.

I think these are all trivial variations of this basic version of Pascal's Mugging: the utility a prankster derives from the pleasure of successfully pranking the AI wouldn't be treated differently in kind from the utility of 5 dollars -- nor is the explicit offer of a trade different from the supposedly free offer of information.

The mugger-less version is, on the other hand, more interesting and more problematic. You don't actually need a person to make such a statement -- the AI, without any prompting, can assign prior probabilities to theories which produce outcomes of positive or negative value vastly greater than their assigned improbabilities. I've seen its best description in the comment by Kindly and the corresponding response by Eliezer:

Kindly: Very many hypotheses -- arguably infinitely many -- can be formed about how the world works. In particular, some of these hypotheses imply that by doing something counter-intuitive in following those hypotheses, you get ridiculously awesome outcomes. For example, even in advance of me posting this comment, you could form the hypothesis "if I send Kindly $5 by Paypal, he or she will refrain from torturing 3^^^3 people in the matrix and instead give them candy." Now, usually all such hypotheses are low-probability and that decreases the expected benefit from performing these counter-intuitive actions. But how can you show that in all cases this expected benefit is sufficiently low to justify ignoring it?

Eliezer Yudkowsky: Right, this is the real core of Pascal's Mugging [...]. For aggregative utility functions over a model of the environment which e.g. treat all sentient beings (or all paperclips) as having equal value without diminishing marginal returns, and all epistemic models which induce simplicity-weighted explanations of sensory experience, all decisions will be dominated by tiny variances in the probability of extremely unlikely hypotheses because the "model size" of a hypothesis can grow Busy-Beaver faster than its Kolmogorov complexity.

The following lists five ideas of mine, ordered least-to-most-promising in the search for a general solution. Though I considered them seriously initially, I no longer really think that (1), (2) or (3) hold any promise, being limited, overly specific or even plain false -- I nonetheless list them for completeness' sake, to get them out of my head, and in case anyone sees something in them that could potentially be the seed of something better. I'm slightly more hopeful for solutions (4) and (5) -- they feel to me intuitively as if they may be leading to something good. But I'd need math that I don't really have in order to prove or disprove that.

1. The James T. Kirk solution

To cut to the punchline: the James T. Kirk solution to Pascal's Mugging is "What does God need with a starship?"

Say there's a given prior probability P(X=Matrix Lord) that any given human being is a Matrix Lord with the power to inflict 3^^^3 points of utility/disutility. The fact that such a being, with such vast power, seemingly wants five dollars (or a million dollars, or to be crowned Queen of Australia) actually makes it *less* likely that such a being is a Matrix Lord.

We don't actually need vastly unlikely probabilities to illustrate the truth of this. Let's consider an AI with a security backdoor -- it's known for a fact that there's one person in the world who has been given a 10-word passkey that can destroy the AI at will. (The AI is also disallowed from attempting to avoid this penalty by e.g. killing the person in question.)

So let's say the prior probability of any given person being the key keeper in question is 1 in 7 billion.

Now Person X says to the AI: "Hey, I'm the key keeper. I refuse to give you any evidence of this, but I'll destroy you if you don't give me 20 dollars."

Does this make Person X more or less likely to be the key keeper? My own intuition tells me "less likely".
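
To make the intuition concrete, here is a minimal Bayes-rule sketch in Python. The likelihoods are made-up illustrative assumptions (how often a real key keeper, versus a scammer, would make such a cheap, evidence-free demand), not anything derivable from the scenario itself:

```python
# Toy Bayesian update for the key-keeper scenario. All numbers are illustrative assumptions.
prior = 1 / 7e9                # P(a random person is the key keeper)

p_claim_if_keeper = 1e-9       # assumed: a real keeper demanding $20 with no proof seems bizarre
p_claim_if_not_keeper = 1e-6   # assumed: scammers who'd try this line are far more common

posterior = (p_claim_if_keeper * prior) / (
    p_claim_if_keeper * prior + p_claim_if_not_keeper * (1 - prior)
)

print(f"prior:     {prior:.3e}")      # ~1.4e-10
print(f"posterior: {posterior:.3e}")  # ~1.4e-13 -- lower than the prior, matching the intuition
```

Whenever the claim is likelier to come from a non-keeper than from the real keeper, the update goes downward; the open question is whether that effect can ever be strong enough to matter against 3^^^3-sized stakes.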

Unfortunately, one fundamental problem with the above syllogism is that at best it can tell us that it's only the mugger-less version we need fear. Useless for any serious purpose.

2. The presumption of unfriendliness

This is most obviously the method that, in the examples above, poor Blaise should have used to defeat the devil's claim of infinite badness. In a universe where ending 'that song and dance' can be X-bad, the statement should also be considered that it could be X-bad NOT to end it, or indeed X-good to end it. The devil (being a known malicious entity) is much more likely to push Pascal towards doing whatever would result in the infinite badness. And in the fictional evidence provided by the song, that's indeed what the devil achieves: to harm Blaise Pascal and his ghost for an arbitrarily long time -- by issuing that very warning and using Pascal's calculations against him.

Blaise's tactic should have been not to obey the devil's warning, nor even to do the opposite of his suggestion (since the devil could be smart enough to know how to use reverse psychology), but rather to ignore him as much as possible: Blaise should end the song and dance at the point in time he would have if he hadn't been aware of the devil's statement.

All the above is obvious for cartoonish villains like the devil -- agents known to be malicious, with utility functions opposed to ours -- and a Matrix Lord who is willing to torture 3^^^3 people for the purpose of getting 5 dollars is probably no better; better to just ignore them. But I wonder: can't similar logic be used in handling almost any agent with a utility function that is merely different from one's own (which is the vast majority of agents in mindspace)?

Moreover, a thought occurs: doesn't it seem likely that the greater the promised impact X, the less likely it is that two different minds are both positively inclined towards it? So for any supposed impact X, shouldn't the presumption of unfriendliness (incompatibility in utility functions) increase in like measure?

3. The Xenios Zeus

This idea was inspired by the old myth about Zeus and Hermes walking around pretending to be travellers in need, to examine which people were hospitable and which were not. I think there may exist similar myths about other gods in other mythologies.

Let's say that each person currently alive has a small chance (not necessarily the same small chance) of being a Matrix Lord willing to destroy the world and throw a temper tantrum that'll punish 3^^^3 people if you don't behave according to what he considers proper. Much like each traveller has a chance of being Zeus.

One might think you'd have to examine the data very closely to figure out which random person has the greatest probability of being Zeus -- but that rather misses the moral of the myth, which isn't "figure out who is secretly Zeus" but rather "treat everyone nicely, just in case". If someone does not reveal themselves to be a god, then they don't expect to be treated like a god, but they might still expect human decency.

To put it in analogous LW terms, one might argue that an AI could treat the value systems of even Matrix Lords as roughly centered around the value system of human beings -- so that by serving the CEV of humanity, it would also have the maximum chance of pleasing (or at least not angering) any Matrix Lords in question.

Unfortunately in retrospect I think this idea of mine is, frankly, crap. Not only is it overly specific and again seems to treat the surface problem rather than the core problem, but I realized it reached the same conclusion as (2) by asserting the exact opposite -- the previous idea made an assumption of unfriendliness, this one makes an assumption of minds being centered around friendliness. If I'm using two contradictory ideas to lead to the same conclusion, it probably indicates that this is a result of having written the bottom line -- not of making an actually useful argument.

So not much hope remains in me for solutions 1-3. Let's go to 4.

4. The WWBBD Principle (What Would a Boltzmann Brain Do?)

If you believed with 99% certainty that you were a Boltzmann Brain, what should you do? The smart thing would probably be: whatever you would do if you weren't a Boltzmann Brain. You can dismiss the hypotheses where you have no control over the future, because it's only the ones where you have control that matter on a decision-theoretic basis.

Calculations of future utility have a discounting factor naturally built into them -- namely the uncertainty of being able to calculate and control such a future properly. So in a very natural manner (no need to program it in), an AI would prefer the same utility 5 seconds in the future rather than 5 minutes in the future, and 5 minutes in the future rather than 5 years in the future.

This looks at first glance like a time-discount, but in actuality it's an uncertainty-discount. So an AI that had very good predictive capacity would be able to discount future utility less, because the uncertainty would be less. But the uncertainty would never be quite zero.
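
A small sketch of how such an uncertainty-discount might fall out of compounding predictive error, with an arbitrary per-second reliability figure chosen purely for illustration:

```python
# Sketch: a discount that emerges from compounding uncertainty about prediction/control,
# not from an explicit time-preference term. The per-second figure is an assumption.
p_per_second = 0.9999999   # assumed chance the AI's model stays accurate for one more second

def effective_weight(seconds):
    # Weight on a utility that only materializes if prediction/control holds the whole time.
    return p_per_second ** seconds

for label, secs in [("5 seconds", 5), ("5 minutes", 300), ("5 years", 5 * 365 * 24 * 3600)]:
    print(f"{label:>9}: weight {effective_weight(secs):.7f}")
```

A better predictor (p_per_second closer to 1) discounts less, but the weight never reaches exactly 1, matching the point above.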

One thing that may be missed in the above is that there exists not only an uncertainty in reaching a certain future state of the world, but also an uncertainty in how it would affect all subsequent future states. And the greater the impact, the greater such uncertainty about the future must be.

So even as the thought of 3^^^3 lives outweighs the tiny probability, couldn't it be that a similar factor punishes it in the opposite direction, especially when dealing with hypotheses in which the AI will be able to have no further control? I don't know. Bring in the mathematicians.

5. The Law of Visible Impact (a.k.a. The Generalized Hanson)

Robin Hanson's suggested solution to Pascal's Mugging has been the penalization of "the prior probability of hypotheses which argue that we are in a surprisingly unique position to affect large numbers of other people who cannot symmetrically affect us."

I have to say that I find this argument unappealing and unconvincing. One problem I have with it is that it seems to treat the concept of "person" as ontologically fundamental -- an objection I kinda have against the Simulation Argument and the Doomsday Argument as well.

Moreover, wouldn't this argument cease to apply if I were merely witnessing the Pascal's Mugging taking place -- implying that, as a mere witness, I should be hoping for the mugged entity to submit? This sounds nonsensical.

But I think Hanson's argument can be modified, so here I'd like to offer what I'll call the Generalized Hanson: penalize the prior probability of hypotheses which argue for the existence of high-impact events whose consequences nonetheless remain unobserved.

If life's creation is easy, why aren't we seeing alien civilizations consuming the stars? Therefore most likely life's creation isn't easy at all.

If the universe allowed easy time-travel, where are all the time-travellers? Hence the world most likely doesn't allow easy time-travel. 

If Matrix Lords exist that are apt to create 3^^^3 people and torture them for their amusement, why aren't we being tortured (or witnessing such torture) right now? Therefore most likely such Matrix Lords are rare enough to nullify their impact.

In short, the higher the impact of a hypothetical event, the more evidence we should expect to see for it in the surrounding universe -- so the non-visibility of such evidence counts against the hypothesis in proportion to the magnitude of the hypothetical impact.
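
One way this intuition might be cashed out, as a rough sketch: treat the silence as an ordinary Bayesian update, under an assumed (made-up) model in which each unit of impact independently has some tiny chance of leaving a visible trace:

```python
import math

# Sketch of the "Law of Visible Impact" as a Bayesian update: a hypothesis that predicts a
# huge impact also predicts many chances to leave visible traces, so observing none of them
# is strong evidence against it. The per-unit trace probability is an assumption.

def weight_after_silence(prior, impact, p_trace_per_unit=1e-20):
    # P(no visible traces | impact) decays exponentially in the impact under this toy model.
    log_p_no_traces = impact * math.log1p(-p_trace_per_unit)
    return prior * math.exp(log_p_no_traces)

print(weight_after_silence(prior=1e-20, impact=1e25))  # huge impact: crushed by the silence
print(weight_after_silence(prior=1e-20, impact=1e5))   # modest impact: essentially unpenalized
```

Whether a penalty of this shape can be justified in general, and whether it can keep up with Busy-Beaver-sized claims, is exactly the part that needs the mathematicians.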

I'm probably expressing the above intuition quite badly, but again: I hope someone with actual mathematical skills can take the above and make it into something useful; or tell me that it's not useful as an anti-Pascal Mugging tactic at all.

Comments

I thought the standard (human) solution is to evaluate the probability first and reject anything at the noise level, before even considering utilities. Put the scope insensitivity to good use. Unfortunately, as an amateur, I don't understand Eliezer's (counter?) point; maybe someone can give a popular explanation. Or is this what Kindly did?

I don't think that's it. If there's an asteroid on an orbit that passes near Earth, and the measurement errors are such that there's a 10^-6 chance of it hitting the Earth, we absolutely should spend money on improving the measurements. Likewise for money spent on the reliability of the software controlling nuclear missiles.

On the other hand, prior probability assignment is imperfect, and there may be hypotheses (expressed as strings in English) which have unduly high priors, and a sufficiently clever selfish agent can find and use one such hypothesis to earn a living, potentially diverting funds from legitimate issues.

That's a good point, actually. (Your comment probably got downvoted owing to your notoriety, not its content.) But somehow the asteroid example does not feel like Pascal's mugging. Maybe because we are more sure in the accuracy of the probability estimate, whereas for Pascal's mugging all we get is an upper boundary?

But somehow the asteroid example does not feel like Pascal's mugging.

If we're to talk directly about what Dmytry is talking about without beating around the bush, an asteroid killed the dinosaurs and AI did not, therefore discussion of asteroids is not Pascal's mugging and discussion of AI risks is; the former makes you merely scientifically aware, and the latter makes you a fraudster, a crackpot or a sucker. If AI risks were real, then surely the dinosaurs would have been killed by an AI instead.

So, let's all give money to prevent an asteroid extinction event that we know only happens once in a few hundred million years, and let's not pay any attention to AI risks, because AI risks after all never happened to the dinosaurs, and must therefore be impossible, much like self-driving cars, heavier-than-air flight, or nuclear bombs.

Downvoted for ranting.


I thought it was satire?

It is. It's also a rant.

It makes a good point, mind, and I upvoted it, but it's still needlessly ranty.

Then maybe my sarcasm/irony/satire detector is broken.

Well, I seriously doubt Aris actually thinks AI risks are Pascal's Mugging by definition. That doesn't prevent this from being a rant, it's just a sarcastic rant.

Your comment probably got downvoted owing to your notoriety, not its content.

The smirkiness of the last paragraph, which is yet another one of Dmytry's not-so-veiled accusations that EY and all of MIRI are a bunch of fraudsters. I hate this asshole's tactics. Whenever he doesn't directly lie and slander, he prances about smirking and insinuating.

Your comment probably got downvoted owing to your notoriety, not its content.

By a total of 3 people, 1 of them definitely ArisKatsaris, and 2 others uncomfortable about the "sufficiently clever selfish agent".

But somehow the asteroid example does not feel like Pascal's mugging. Maybe because we are more sure in the accuracy of the probability estimate, whereas for Pascal's mugging all we get is an upper boundary?

I am thinking speculation is the key, especially guided speculation. Speculation comes together with misevaluation of probability, misevaluation of consequences, misevaluation of the utility of alternatives; basically just about every issue with how humans process hypotheses can get utilized, as the mugger is also human and would first try to convince himself. In real-world examples you can see tricks such as asking "why are you so certain of [negation of the proposition]", only re-evaluating the consequences of giving the money without re-evaluating the consequences of keeping the money, trying to teach victims to evaluate claims in a way that is more susceptible (update one side of the utility comparison, then act on the resulting spurious difference), and so on.

Low probabilities really are a red herring. Even if you were to do something formal, like Solomonoff induction (physics described with a Turing machine tape), you could make some ridiculous modification to the laws of physics which adds invisible consequences to something, in as little as 10-20 extra bits. All actions would be dominated by the simplest modification to the laws of physics which leaves the most bits for making up huge consequences.

If you believed with 99% certainty that you were a Boltzmann Brain, what should you do? The smart thing would probably be: whatever you would do if you weren't a Boltzmann Brain. You can dismiss the hypotheses where you have no control over the future, because it's only the ones where you have control that matter on a decision-theoretic basis.

A Boltzmann Brain has a fraction of a second to a few seconds to exist. If there's a sufficiently high chance you are a Boltzmann Brain, it may make sense to maximize your very-short-term utility. One obvious way to do so is to spend that time fantasizing about whatever gender or genders you prefer.

So even as the thought of 3^^^3 lives outweighs the tiny probability, couldn't it be that a similar factor punishes it in the opposite direction, especially when dealing with hypotheses in which the AI will be able to have no further control? I don't know. Bring in the mathematicians.

In general, yes, there is such a similar factor. There's another one pushing it back, etc. The real problem isn't that when you take tiny probabilities of huge outcomes into account it doesn't give you what you want. It's that it diverges and doesn't give you anything at all. Pascal's mugging at least has the valid, if unpopular, solution of just paying the mugger. There is no analogous solution to a divergent expected utility.
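
A toy illustration of that divergence, using 2^(2^k) as a computable stand-in for Busy-Beaver-style growth (the real argument involves uncomputably fast growth; this is only meant to show the partial sums blowing up rather than settling):

```python
# Toy divergence: give each hypothesis of description length k a simplicity prior of 2^-k,
# but let the utility it promises grow much faster than 2^k. Expected-utility partial sums
# then explode instead of converging.
def partial_sum(max_k):
    total = 0
    for k in range(1, max_k + 1):
        # prior 2^-k times promised utility 2^(2^k), kept as an exact integer
        total += 2 ** (2 ** k) // 2 ** k
    return total

for max_k in (5, 10, 15):
    print(max_k, "terms:", len(str(partial_sum(max_k))), "digits in the partial sum")
```

Any promised utility that grows faster than the simplicity prior shrinks produces the same behavior, which is the Busy-Beaver point quoted in the post.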

  5. The Law of Visible Impact (a.k.a. The Generalized Hanson)

This one only works if they are making 3^^^3 people. If instead they are just making one person suffer with an intensity of 3^^^3, it doesn't apply.

The real problem isn't that when you take tiny probabilities of huge outcomes into account it doesn't give you what you want. It's that it diverges and doesn't give you anything at all.

Exactly.

In essence, money changes hands only if expected utilities converge. The transaction happens because the agent computes an invalid approximation to the (convergent or divergent) expected utility by summing expected utilities over available scenarios; a sum which is greatly affected by a scenario that has been made available via its suggestion by the mugger, creating a situation where your usual selfish agent does best by producing a string that makes you give it money.

In essence, money changes hands only if expected utilities converge.

Great when someone tries to Pascal-mug you. Not so great when you're just trying to buy groceries, and you can't prove that they're not a Pascal mugger. The expected utilities don't just diverge when someone tries to take advantage of you. They always diverge.

Also, not making a decision is itself a decision, so there's no more reason for money to not change hands than there is for it to change hands, but since you're not actually acting against your decision theory, that problem isn't that bad.

Well, de facto they always converge, mugging or not, and I'm not going to take as normative a formalism where they do diverge. Edit: e.g. instead I can adopt the speed prior; it's far less insane than incompetent people make it out to be - the code-size penalty for optimizing out the unseen is very significant. Or if I don't like the speed prior (and other such "solutions"), I can simply be sane and conclude that we don't have a working formalism. Prescriptivism is silly when it is unclear how to decide efficiently under bounded computing power.

I can simply be sane and conclude that we don't have a working formalism.

That's generally what you do when you find a paradox that you can't solve. I'm not suggesting that you actually conclude that you can't make a decision.

Of course. And on the practical level, if I want other agents to provide me with more accurate information (something that has high utility scaled by all potential unlikely scenarios), I must try to make production of falsehoods non-profitable.

This one only works if they are making 3^^^3 people. If instead they are just making one person suffer with an intensity of 3^^^3, it doesn't apply.

I think it still works, at least for the particular example, because the probability P(Bob=Matrix Lord who is willing to torture 3^^^3 people) and the probability P(Bob=Matrix Lord who is willing to inflict suffering of 3^^^3 intensity on one person) are closely correlated via the probabilities P(Matrix), P(Bob=Matrix Lord), P(Bob=Matrix Lord who is a sadistic son of a bitch), etc...

So if we have strong evidence against the former, we also have strong evidence against the latter (not quite as strong, but still very strong). The silence of the gods provides evidence against loud gods, but it also provides evidence against any gods at all...

The argument is that, since you're 3^^^3 times more likely to be one of the other people if there are indeed 3^^^3 other people, that's powerful evidence that what he says is false. If he's only hurting one person a whole lot, then there's only a 50% prior probability of being that person, so it's only one bit of evidence.

The prior probabilities are similar, but we only have evidence against one.
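
A quick sketch of the evidence-counting in that argument (treating "which of the N+1 people am I?" as uniform, which is itself a contestable assumption):

```python
import math

# Under the hypothesis, there are N victims plus the one person being addressed; finding
# yourself to be the addressed one rather than a victim is roughly log2(N + 1) bits of
# evidence against the hypothesis, under a uniform self-location assumption.
def bits_against(n_victims):
    return math.log2(n_victims + 1)

print(bits_against(1))        # 1.0 bit  -- the "only one other person harmed" case
print(bits_against(10**50))   # ~166 bits -- and 3^^^3 victims would give unimaginably more
```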

The argument is that, since you're 3^^^3 times more likely to be one of the other people if there are indeed 3^^^3 other people, that's powerful evidence that what he says is false.

Enh, that's Hanson's original argument, but what I attempted to do is generalize it so that we don't actually need to rely on the concept of "person", nor be counting points of view. I would want the generalized argument to hopefully work even for a Clippy who is threatened with the bending of 3^^^3 paperclips, even though paperclips don't have a point of view. Because any impact, even impact on non-people, ought to have a prior for visibility analogous to its magnitude.

That's not a generalization. That's an entirely different argument. The original was about anthropic evidence. Yours is about prior probability. You can accept or reject them independently of each other. If you accept both, they stack.

Because any impact, even impact on non-people, ought to have a prior for visibility analogous to its magnitude.

I don't think that works. Consider a modification of the laws of physics so that alternate universes exist, incompatible with advanced AI, with people and paperclips, each paired to a positron in our world. Or whatever would be the simplest modification which ties them to something that Clippy can affect. It is conceivable that some such modification could come in at a prior of 1 in a million.

There are sane situations with low probability, by the way: for example, if NASA calculates that an asteroid, based on measurement uncertainties, has a 1 in a million chance of hitting the Earth, we'd be willing to spend quite a bit of money on a "refine measurements; if it's still a threat, launch rockets" strategy. But we don't want to start spending money any time someone who can't get a normal job gets clever about crying 3^^^3 wolves, and even less so for speculative, untestable laws of physics under a description-length-based prior.

If Matrix Lords exist that are apt to create 3^^^3 people and torture them for their amusement, why aren't we being tortured (or witnessing such torture) right now?

Maybe we are being tortured, as viewed by people in the other worlds, who view us as we view North Korea.

If Matrix Lords exist that are apt to create 3^^^3 people and torture them for their amusement, why aren't we being tortured (or witnessing such torture) right now?

Questions of this type make no sense if you unpack "we": "If Matrix Lords exist that are apt to create 3^^^3 people and torture them for their amusement, why aren't [people that live in a normal universe and are not being tortured] being tortured (or witnessing such torture) right now?"

...I don't think that's how unpacking works.

Let's consider the statement "99% of dogs are green" -- isn't it reasonable to ask "If so many dogs are green, why haven't I ever seen a green dog?"

Surely I shouldn't be unpacking this to "If so many dogs are green why haven't [people who haven't seen a green dog] seen a green dog." My own experience of the universe does provide evidence about the universe.

Good point. I notice I'm confused, but I can't figure out why. I'll keep thinking, thanks for the example.

It occurs to me that the mugger is trying to get us to use a penalty from beyond our hypothesis space, to make decisions within the hypothesis space.

"I will torture 3^^^3 humans forever", necessarily comes from a hypothesis space of at least size 3^^^3. However, our universe, at least as far as we know, has nowhere near 3^^^3 possible states available. A single one of the total number of states is the finest granularity available to use.

Consider the mugging using numbers from within our universe. We might get for example "I will torture 1e50 humans until the end of the universe". This is something we can work with - getting a probability of order 1e-50 shouldn't be terribly difficult.

In order to counter a 3^^^3 penalty, we need to look at the larger hypothesis space, not just what's available in our current understanding of the universe.
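
A sketch of the in-universe version of that comparison; the inverse scaling of the prior with the size of the claim is an assumption made for illustration, not a derived rule:

```python
# In-universe version of the threat: the claimed harm is bounded by what the universe can
# actually contain, so a probability penalty of matching size is at least conceivable.
claimed_victims = 1e50
assumed_prior = 1.0 / claimed_victims   # assumed penalty scaled to the size of the claim

print(assumed_prior * claimed_victims)  # 1.0 -- the threat lands at everyday stakes
```

For 3^^^3 there is no in-universe outcome count to match the claim against, which is why the comment reaches for the larger hypothesis space.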

your image of Slender Man!Hanson

why did you do this I will never sleep again


I have a point I've thought of in regard to Pascal's Mugging where I notice I am confused: what if an entire localized area is being mugged?

Example: Assume you give in to standard muggings. Someone drops off a collection box right near your house. The collection box says "For each person that steps near this box on a day, there must be a deposit of 5 dollars before one day passes or I will simulate and kill 3^^^3 people as a Matrix Lord. Any interference or questioning of the box will make me simulate and kill 3^^^^3 people as a Matrix Lord."

Since we are assuming you are the kind of person who gives in to standard muggings, when you step to within 1 foot of the box out of curiosity to read the text, you think "Yikes! That's a standard mugging!" and deposit 5 dollars, determined not to interfere with or question the box.

The next day, you get within 10 feet of the box when walking home. Do you give money to the box?

Do you give money if it's a person repeating the same message over and over again?

What if you only get within 100 feet, or you can barely hear the message from your distance?

So my confusion is that the assumption "I give in to muggings" works fairly clearly for single-target muggings, but it doesn't resolve much at all in the case of a localized area-of-effect mugging. And there doesn't seem to be a reason to assume a mugging would not affect an area. Fake muggers would certainly want more victims. A real Matrix Lord might want to make demands of multiple people for their inscrutable purposes. And in my example the penalty for questioning is so bad it seems unlikely you could justify gathering information about what "near" meant in order to resolve the problem.

This is clearly not a problem if you don't give in to muggings at all, but I realized it means that giving in to muggings alone doesn't actually clearly define your behavior in a case like this. There's the kind of person who only gives when they are a foot away, the kind of person who hears about the box from a mile away and still comes by and gives every day, the kind of person who shoots anyone who steps near the box because they might interfere, and the kind of person who shoots HIM because HE is interfering... I can see potential arguments any of these people could make for how they were trying to follow the will of the mugger, but none of them agree with each other at all.

Now that I've laid that out though, I'm not sure if that can be turned into a clear argument against muggings.

A closer rationalist analog of your #3 would be Newcomb's problem, in which you can't tell "from the inside" whether you're really in a Newcomblike situation, or Omega's simulation of you being in that situation, in order for Omega to decide what to put in the sealed box.

I understand the first and the last solutions the best....

My estimate of probability for the statement "the mugger will kill X people" is not independent of X. The larger the number of people the mugger claims to be able to kill, the smaller the probability I assign to him following through with it. With an appropriate algorithm for deriving the probability estimate from X, the expected gain from letting myself be mugged is finite even though X can get arbitrarily large.
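
A toy numerical version of this claim, with an arbitrary 1/X^3 falloff chosen only because it decays fast enough for the sum to settle:

```python
# If credence in "the mugger can and will harm X people" falls off fast enough in X,
# the expected harm averted by paying stays bounded no matter how large an X is named.
def expected_harm_averted(max_x):
    return sum(x * (1.0 / x ** 3) for x in range(1, max_x + 1))

for max_x in (10, 1_000, 100_000):
    print(max_x, round(expected_harm_averted(max_x), 5))  # approaches pi^2/6 ~ 1.64493
```

Whether an accurate prior actually falls off this fast is the crux the reply below pushes on.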

This is a very post-hoc justification. It isn't at all clear that the most accurate set of priors will actually do that. Note, for example, that from a Kolmogorov standpoint, 3^^^^3 is pretty simple.

It's not post-hoc at all. The larger the number of people the mugger gives for X, the less significant $5 would be to him. This decreases the likelihood of the scenario in which such a mugger would actually want to mug someone for a value that is so meaningless to him. This in turn means that, given such a mugging, it is more likely that the mugger is a fake the larger X is.

Of course, you could argue that the mugger has some motive other than direct financial gain from the money (if he just wants to watch humans squirm, the insignificance of $5 doesn't matter), but the same applies to all motives: watching a human squirm is less significant to a very powerful mugger than to a less powerful mugger, just like $5 is less significant to a very powerful mugger than to a less powerful mugger.

Furthermore, larger numbers for X are more beneficial for fake muggers than smaller numbers, when mugging utility maximizers. I should therefore expect the distribution of fake muggers to be more weighted towards high values of X than the distribution of real muggers (it is beneficial to some real muggers to increase X, but not to all of them). Again, the larger X is, the greater the chance the mugger is fake.

Which mission? The FAI mission? The GiveWell mission? I am confused :(

I don't suppose he said this somewhere linkable?

Here he claims that the default outcome of AI is very likely safe, but attempts at Friendly AI are very likely deadly if they do anything (although I would argue this neglects the correlation between what AI approaches are workable in general and for would-be FAI efforts, and what is dangerous for both types, as well as assuming some silly behaviors and that competitive pressures aren't severe):

I believe that unleashing an all-powerful "agent AGI" (without the benefit of experimentation) would very likely result in a UFAI-like outcome, no matter how carefully the "agent AGI" was designed to be "Friendly." I see SI as encouraging (and aiming to take) this approach. I believe that the standard approach to developing software results in "tools," not "agents," and that tools (while dangerous) are much safer than agents. A "tool mode" could facilitate experiment-informed progress toward a safe "agent," rather than needing to get "Friendliness" theory right without any experimentation. Therefore, I believe that the approach SI advocates and aims to prepare for is far more dangerous than the standard approach, so if SI's work on Friendliness theory affects the risk of human extinction one way or the other, it will increase the risk of human extinction. Fortunately I believe SI's work is far more likely to have no effect one way or the other

... doesn't GiveWell recommend that for pretty much every charity, because you should be giving it to the Top Three Effective Charities?

"Right, this is the real core of Pascal's Mugging [...]. For aggregative utility functions over a model of the environment which e.g. treat all sentient beings (or all paperclips) as having equal value without diminishing marginal returns, and all epistemic models which induce simplicity-weighted explanations of sensory experience, all decisions will be dominated by tiny variances in the probability of extremely unlikely hypotheses because the "model size" of a hypothesis can grow Busy-Beaver faster than its Kolmogorov complexity."

I think others have expressed doubts that the promised utility of hypotheses should be able to grow in Busy-Beaver fashion faster than a properly updated probability, but I don't see a similar argument on your list. It seems to have been taken for granted because a simple hypothesis clearly can describe a very large utility.

But what is the probability that any process, including a Matrix Lord, can successfully perform O(3^^^3) computations/operations/actions? Even a Matrix Lord should have a mean time to failure (MTTF), and the estimate of MTTF should be directly derivable from the complexity of the Matrix Lord. A Matrix Lord of complexity O(2^256) should have an MTTF like O(2^256) operations, not O(3^^^3) operations. Even if each operation produces 1 utilon without regard to whether the process completes or not, that limits the expected number of utilons to 1 × MTTF, or O(2^256), making the expected value O(2^256) × P(the Matrix Lord is telling the truth | a Matrix Lord of complexity 2^256 exists).

This doesn't work if the MTTF increases over time due perhaps to a self-improving and self-sustaining process, but that actually makes sense. I'm far more likely to believe a mugger that says "Give me $5 or I'm going to release a self-replicating nanobot that will turn the universe into grey goo" or "Give me $5 or I turn on this UFAI" than one that says "Give me $5 or I'll switch off the Matrix that's running your universe." "Give me $5 or I'll start an unreasonably long process that eventually will produce so much negative utility that it overrides all your probability estimation ability" just makes me think that whatever process the mugger may have will fail long before the expected value of accepting would become positive.

The limit as predicted MTTF goes to infinity is probably just Pascal's Wager. If you believe that an entity can keep a Matrix running infinitely long, that entity can affect your utility infinitely. That probably requires an (eventually) infinitely complex entity so that it can continue even if any finite number of its components fail. That leaves the door open to being mugged by a self-improving process in an infinite universe, but I think that also makes sense. Bound your utility if you're worried about infinite utility.
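
A back-of-the-envelope sketch of the MTTF bound with placeholder numbers (every figure below is an assumption, picked only to show the shape of the calculation):

```python
# Even granting the Matrix Lord hypothesis, the expected utilons at stake are capped by the
# expected number of operations completed before failure, not by the 3^^^3 figure quoted.
mttf_operations = 2.0 ** 256    # assumed mean operations before a complexity-2^256 Lord fails
p_lord_exists = 1e-70           # assumed prior that such a being exists at all
p_follows_through = 1e-6        # assumed chance it acts on the threat if it does exist

expected_utilons_at_stake = mttf_operations * p_lord_exists * p_follows_through
print(f"{expected_utilons_at_stake:.3e}")   # ~1.2e+01 -- everyday scale, not 3^^^3 scale
```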

You're thinking about this too hard.

There are, in fact, three solutions, and two of them are fairly obvious ones.

1) We have observed 0 such things in existence. Ergo, when someone comes up to me and says that they will torture people whose existence I have no way of ever verifying unless I give them $5, I can simply assign a probability of 0 to their telling the truth. Seeing as the vast, vast majority of things I have observed 0 of do not exist, and we can construct an infinite number of things, assigning a probability of 0 to any particular thing I have never observed and have no evidence of is the only rational thing to do.

2) Even assuming they do have the power to do so, there is no guarantee that the person is being rational or telling the truth. They may torture those people regardless. They might torture them BECAUSE I gave them $5. They might do so at random. They might go up to the next person and say the next thing. It doesn't matter. As such, their demand does not change the probability that those people will be tortured at all, because I have no reason to trust them, and their words have not changed the probabilities one way or the other. Ergo, again, you don't give them money.

3) Given that I have no way of knowing whether those people exist, it just doesn't matter. Anything which is unobservable does not matter at all, because, by its very nature, if it cannot be observed, then it cannot be changing the world around me. Because that is ultimately what matters, it doesn't matter whether they have the power or not, because I have no way of knowing and no way of determining the truth of the statement. Similar to the IPU, the fact that I cannot disprove it is not a rational reason to believe in it, and indeed the fact that it is non-falsifiable indicates that it doesn't matter whether it exists at all or not - the universe is identical either way.

It is inherently irrational to believe in things which are inherently non-falsifiable, because they have no means of influencing anything. In fact, that's pretty core to what rationality is about.

The problem is with formalizing solutions, and making them consistent with other aspects that one would want an AI system to have (e.g. ability to update on the evidence). Your suggested three solutions don't work in this respect because:

1) If we e.g. make an AI literally assign a probability of 0 to scenarios that are too unlikely, then it wouldn't be able to update on additional evidence based on the simple Bayesian formula. So an actual Matrix Lord wouldn't be able to convince the AI he/she was a Matrix Lord even if he/she reversed gravity, or made it snow indoors, etc. (A minimal sketch of this zero-prior problem follows at the end of this comment.)

2) The assumption that a person's words provide literally zero evidence one way or another seems, again, something you axiomatically assume rather than something that arises naturally. Is it really zero? Not just effectively zero where human discernment is concerned, but literally zero? Not even 0.000000000000000000000001% evidence in either direction? That would seem highly coincidental. How do you ensure an AI would treat such words as zero evidence?

3) We would hopefully want the AI to care about things it can't currently directly observe, or it wouldn't care at all about the future (which it likewise can't currently directly observe).

The issue isn't helping human beings not fall prey to Pascal's Mugging -- they usually don't. The issue is to figure out a way to program a solution, or (even better) to see that a solution arises naturally from other aspects of our system.
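
On point (1), a minimal illustration of why a literal zero prior is unrecoverable under Bayes' rule, no matter how strong the evidence (the specific likelihoods are assumptions for illustration):

```python
# Bayes' rule can never raise a prior of exactly 0: the posterior's numerator contains the
# prior as a factor. A merely tiny prior, by contrast, can still be driven up by evidence.
def bayes_update(prior, p_evidence_if_true, p_evidence_if_false):
    numerator = p_evidence_if_true * prior
    denominator = numerator + p_evidence_if_false * (1 - prior)
    return numerator / denominator if denominator > 0 else 0.0

# Evidence as dramatic as reversed gravity: very likely under the hypothesis, absurdly
# unlikely otherwise.
print(bayes_update(prior=0.0,   p_evidence_if_true=0.999, p_evidence_if_false=1e-12))  # 0.0
print(bayes_update(prior=1e-12, p_evidence_if_true=0.999, p_evidence_if_false=1e-12))  # ~0.5
```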

1) If we e.g. make an AI literally assign a probability of 0 to scenarios that are too unlikely, then it wouldn't be able to update on additional evidence based on the simple Bayesian formula. So an actual Matrix Lord wouldn't be able to convince the AI he/she was a Matrix Lord even if he/she reversed gravity, or made it snow indoors, etc.

Neither of those feats is even particularly impressive, though. Humans can make it snow indoors, and likewise an apparent reversal of gravity can be achieved via numerous routes, ranging from inverting the room to affecting one's sense of balance to magnets.

Moreover, there are numerous more likely explanations for such feats. An AI, for instance, would have to worry about someone "hacking its eyes", which would be a far simpler means of accomplishing that feat. Indeed, without other personnel around to give independent confirmation and careful testing, one should always assume that one is hallucinating, or that it is trickery. It is the rational thing to do.

You're dealing with issues of false precision here. If something is so very unlikely, then it shouldn't be counted in your calculations at all, because the likelihood is so low that it is negligible, and most likely any "likelihood" you have guessed for it is exactly that - a guess. Unless you have strong empirical evidence, treating its probability as 0 is correct.

2) The assumption that a person's words provide literally zero evidence one way or another seems, again, something you axiomatically assume rather than something that arises naturally. Is it really zero? Not just effectively zero where human discernment is concerned, but literally zero? Not even 0.000000000000000000000001% evidence in either direction? That would seem highly coincidental. How do you ensure an AI would treat such words as zero evidence?

Same way it thinks about everything else. If someone walks up to you on the street and claims souls exist, does that change the probability that souls exist? No, it doesn't. If your AI can deal with that, then it can deal with this situation. If your AI can't deal with someone saying that the Bible is true, then it has larger problems than Pascal's mugging.

3) We would hopefully want the AI to care about things it can't currently directly observe, or it wouldn't care at all about the future (which it likewise can't currently directly observe).

You seem to be confused here. What I am speaking of here is the greater sense of observability, what someone might call the Hubble Bubble. In other words, causality. Them torturing things that have no causal relationship with me - things outside of the realm that I can possibly ever affect, as well as outside the realm that can possibly ever affect me - is irrelevant, and it may as well not happen, because there is not only no way of knowing if it is happening, there is no possible way that it can matter to me. It cannot affect me, I cannot affect them. It's just the way things work. It's physics here.

Them threatening things outside the bounds of what can affect me doesn't matter at all - I have no way of determining their truthfulness one way or the other, nor has it any way to impact me, so it doesn't matter if they're telling the truth or not.

The issue isn't helping human beings not fall prey to Pascal's Mugging -- they usually don't. The issue is to figure out a way to program a solution, or (even better) to see that a solution arises naturally from other aspects of our system.

The above three things are all reasonable ways of dealing with the problem. Assigning it a probability of 0 is what humans do, after all, when it all comes down to it, and if you spend time thinking about it, 2 is obviously something you have to build into the system anyway - someone walking up to you and saying something doesn't really change the likelihood of very unlikely things happening. And having it just not care about things outside of what is causally linked to it, ever, is another reasonable approach, though it still would leave it vulnerable to other things if it was very dumb. But I think any system which is reasonably intelligent would deal with it as some combination of 1 and 2 - not believing them, and not trusting them, which are really quite similar and related.

You're being too verbose, which makes me personally find discussion with you rather tiring, and you're not addressing the actual points I'm making. Let me try to ask some more specific questions:

1) Below what point do you want us to treat a prior probability as effectively 0, never to be updated upwards no matter what the evidence? E.g. one in a billion? One in a trillion? What's the exact point, and can you justify it to me?

2) Why do you keep talking about things not being "causally linked", since all of the examples of Pascal's mugging given above do describe causal links? It's not as if I said anything weird about acausal trade or some such; every example I gave describes normal causal links.

Assigning it a probability of 0 is what humans do, after all, when it all comes down to it,

Humans don't tend to explicitly assign probabilities at all.

If someone walks up to you on the street and claims souls exist, does that change the probability that souls exist? No, it doesn't.

Actually, since people rarely bother to claim that things exist when they actually do (e.g. nobody goes around claiming "tables exist" or "the sun exists"), such people claiming that souls exist are probably minor evidence against their existence.

I wouldn't debate with someone who assigns a "probability of 0" to anything (especially as in "actual 0"), other than to link to any introduction to Bayesian probability. But your time is of course your own :-), and I'm biting the lure too often, myself.

people claiming that souls exist are probably minor evidence against their existence

Well, it points to the belief as being one which constantly needs to be reaffirmed, so it at least hints at some controversy regarding the belief (alternative: It could be in-group identity affirming). Whether you regard that as evidence in favor (evolution) or against (resurrection) the belief depends on how cynical you are about human group beliefs in general.

I've always understood the Kirk solution as the correct answer to the actual Pascal's Wager:

God doesn't value self-modification. God values faith. One of the properties of faith is that self-modification cannot create faith that did not previously exist.

In short, faith is not a kind of belief.

Not that this argument addresses the problems discussed in the post.

God doesn't value self-modification. God values faith. One of the properties of faith is that self-modification cannot create faith that did not previously exist.

You seem to be privileging the Abrahamic hypothesis. Of the vast space of possible gods, why would you expect that variety to be especially likely?

Hell is an Abrahamic (Islamic/Christian only, I think) thing. To the extent that we should automatically discount inferences about a God's personality based on Christianity/Islam, we should also discount the possibility of hell.

One of the properties of faith is that self-modification cannot create faith that did not previously exist.

Yes it can.

That's just evidence against it. Is it really strong enough?

Suppose someone says to you:

My only terminal value is for people to acknowledge their insect overlords. Please clap thirty times or I will inflict 3^^^3 suffering.

Assuming that there is no way to convert from clapping to acknowledging-insect-overlords, your optimal response is essentially, "You are very confused. No claps for you."

I suppose I'm assuming that Matrix lords are not confused to the point of incoherence. Matrix Lords who are that confused fall under the ignore-because-cannot-affect-the-future rationale.

Assuming that there is no way to convert from clapping to acknowledging-insect-overlords

You also have to consider the possibility that this assumption is false. Perhaps you're the one who is confused.

Matrix Lords who are that confused fall under the ignore-because-cannot-affect-the-future rationale.

Confused people can affect the future. You can't be as certain about what they'll do, but how certain do you need to be with a payoff that big?

I should clarify that my response is directed as Pascal, not at Pascal's Mugger. Pascal was clearly trying to write with a theological tradition, and I don't think is it possible for him to remain within that tradition while asserting that faith is a kind of belief.

Perhaps you're the one who is confused.

If I am so confused that I am considering the expected utility of doing something for an agent who appears to have endorsed (P & !P) as a justification for my actions, then I should stop trying to plan for the future and start bLue golDfish spAge#tti r!m@

I'm not suggesting that your error was deriving a contradiction from P & !P. I'm suggesting that your error was when you concluded !P in the first place. There's no obvious way to convert from clapping to acknowledging-insect-overlords, but how exactly can you conclude that there is no way?

If that were the case, why would the church care about gaining converts? This seems like it would demand a particularly idiosyncratic definition of "self modification."