# Notes on logical priors from the MIRI workshop

(This post contains a lot of math, and assumes some familiarity with the earlier LW posts on decision theory. Most of the ideas are by me and Paul Christiano, building on earlier ideas of Wei Dai.)

The September MIRI workshop has just finished. There were discussions of many topics, which will probably be written up by different people. In this post I'd like to describe a certain problem I brought up there and a sequence of ideas that followed. Eventually we found an idea that seems to work, and also it's interesting to tell the story of how each attempted idea led to the next one.

The problem is a variant of Wei Dai's A Problem About Bargaining and Logical Uncertainty. My particular variant is described in this comment, and was also discussed in this post. Here's a short summary:

In Counterfactual Mugging with a logical coin, a "stupid" agent that can't compute the outcome of the coinflip should agree to pay, and a "smart" agent that considers the coinflip as obvious as 1=1 should refuse to pay. But if a stupid agent is asked to write a smart agent, it will want to write an agent that will agree to pay. Therefore the smart agent who refuses to pay is reflectively inconsistent in some sense. What's the right thing to do in this case?

Thinking about that problem, it seems like the right decision theory should refuse to calculate certain things, and instead behave updatelessly with regard to some sort of "logical prior" inherited from its creators, who didn't have enough power to calculate these things. In particular, it should agree to pay in a Counterfactual Mugging with a digit of pi, but still go ahead and calculate that digit of pi if offered a straight-up bet instead of a counterfactual one.

What could such a decision theory look like? What kind of mathematical object is a "logical prior"? Perhaps it's a probability distribution over inconsistent theories that the creators didn't yet know to be inconsistent. How do we build such an object, what constraints should it satisfy, and how do we use it in a decision algorithm?

#### Attempt 1

Sometime ago, Benja Fallenstein and Paul Christiano came up with a way to build a probability distribution over all consistent theories. We start with a single empty theory with weight 1, and then refine it by successively adding all possible axioms. Each new axiom A is either already settled by the current theory T (i.e. either A or ¬A is provable in T), in which case we do nothing; or A is independent of T, in which case we choose randomly between T+A and T+¬A. This process gives nonzero probabilities to all finitely axiomatized consistent theories, and leads to well-behaved conditional probabilities like P(PA+Con(PA)|PA). (I gloss over the distinction between finitely and recursively axiomatizable theories. Building a variant of this construction that works for recursively axiomatizable theories is left as an exercise to the reader.)

At the workshop we tried to use a similar construction, but with local consistency instead of global consistency. (Let's say that a "locally consistent theory" is a set of statements that has no short proof of inconsistency, for some reasonable meaning of "short".) We keep adding axioms as before, but only check that the theory so far is locally consistent. Hopefully this way we might end up with a nice distribution over all theories, including inconsistent ones.

The trouble with this approach is that for some axiom A and some locally consistent theory T, both T+A and T+¬A might be locally inconsistent, so we won't be able to continue. We can try to salvage the idea by somehow adjusting the probabilities retroactively, but then it seems hard to guarantee that the probability of each statement goes to a well-defined limit, it might end up oscillating. So we abandoned this idea.

#### Attempt 2

Instead of trying to build up a prior over possibly inconsistent theories, we can try assigning each theory a "quality" between 0 and 1. For example, we can take a probability distribution over all local consistency constraints that a theory must satisfy, and define the "quality" of a theory as the probability that it satisfies a randomly chosen constraint. For purposes of that definition, the theories have to be complete, but possibly inconsistent. Theories with quality 1 will be the consistent ones, because they satisfy all the constraints. Then we can define the quality of an individual statement as the maximum quality of any theory that includes that statement.

This all seems fine so far, but at the next step we run into trouble. In Counterfactual Mugging with a logical coin, Omega needs to evaluate a counterfactual statement like "if a digit of pi was different, the agent would do such-and-such thing". We can try to straightforwardly define "A counterfactually implies B" as true if the quality of "A and B" is equal to the quality of A. Intuitively that means that the most consistent theory that includes A also includes B.

Unfortunately, that definition does not have very good properties. Both "PA counterfactually implies Con(PA)" and "PA counterfactually implies ¬Con(PA)" will be defined as true, because the theories PA+Con(PA) and PA+¬Con(PA) are both consistent and have quality 1. That doesn't seem to agree with our intuitions about how logical counterfactuals should work.

#### Some ideas about logical counterfactuals

Our intuitions about decision theory seem to say that some logical counterfactuals are "natural" and others are "spurious". For example, if the agent does not in fact take a certain action, then there's a "natural" proof of how much utility it would have implied, and also many "spurious" proofs that prove all sorts of absurdities by first simulating the agent and then using the fact that the action was not taken. (See this post for more about spurious proofs.)

Natural counterfactuals look more correct to us than spurious ones because their proofs don't rely on fully simulating the agent. Maybe we could try weighting logical counterfactuals based on their proof length, and see where that idea takes us?

In simple decision theory problems, we are interested in counterfactuals like "A()=1 implies U()=1", "A()=1 implies U()=2", and so on, where A is the agent program that returns an action, and U is the universe program that contains the agent and returns a utility value. (See this post or this post for more about this setting.) Naively assigning some probability P(L) to each counterfactual statement with proof length L doesn't seem to work, because the problem might be so complex that all relevant statements have long proofs. But if we took some set of counterfactuals assumed to be mutually exclusive, with proof lengths L_{1}, L_{2}, ..., then it might be reasonable to give them probabilities like P(L_{1})/(P(L_{1})+P(L_{2})+...). That looks suspiciously similar to Bayes' theorem, so it seems natural to try and come up with some sort of probabilistic interpretation where this will be an actual application of Bayes' theorem. Thankfully, it's easy to find such an interpretation.

Let's take a theory like PA, generate random proofs in it with probability depending on length, and then throw out the invalid ones. If we view each statement of the form "X implies Y" as an "arrow" from statement X to statement Y, then the above process of picking random valid proofs induces a probability distribution on all true arrows, including those where statement X is false. Now we can define the conditional probability P(U()=u|A()=a) as the probability that a random arrow ends at "U()=u", conditional on the event that the arrow starts from "A()=a" and ends at some statement of the form "U()=...". This way, all possible statements "U()=u" for a given "A()=a" have probabilities that sum to 1, and the true statement "A()=a" leads to a single statement "U()=u" with probability 1. These seem like good properties to have.

More generally, when people are uncertain about some math statement, they often seem to reason like this: "I guess I'm likely to find a proof if I try this method. Hmm, I checked a couple obvious proofs and they didn't work. Guess I have to lower my probability of the statement." Such reasoning corresponds to Bayesian updating as you observe new proofs, which is similar to the above setting.

Now, can we extend this idea to obtain a "logical prior" over inconsistent theories as well?

### Attempt 3, successful

There's nothing stopping us from using the above construction with an inconsistent theory, because the properties that follow from having a coherent probability distribution over arrows should still hold. Let's assume that we care about theories T_{1},T_{2},... which are incomplete and possibly inconsistent. (An example of such a theory is PA+"the billionth digit of pi is even"). In each of these theories, we can construct the above distribution on arrows. Then we can add them up, using some weighting of the theories, and get an overall distribution on statements of the form "A()=a implies U()=u". It has the nice property that you only need to sample finitely many proofs to compute the conditional probabilities to any desired precision, so it's easy to write a decision algorithm that will do approximate utility maximization based on these probabilities. If we only care about a single theory, like PA, then the algorithm seems to give the same answers as our earlier algorithms for UDT.

There's still a remaining problem to fix. If we care equally about the theory where the billionth digit of pi is even and the one where it's odd, then our algorithm will accept a straight bet on that digit of pi, instead of calculating it and accepting conditionally. An approach to solving that problem can be found in Wei Dai's proposed "UDT2", which optimizes over the next program to run, instead of the value to return. We can use that idea here, ending up with an algorithm that I will call "UDT1.5", to avoid incrementing the major version number for something that might turn out to be wrong. To put it all together:

- We care about some program U() that returns a utility value. Also we care about some probability distribution over theories T
_{1},T_{2},... that are incomplete and possibly inconsistent. - Our decision algorithm A() attempts to find a program B such that running it will approximately maximize the expected value of U(), using the probabilities defined below. Then A runs B() and returns the resulting value.
- For each possible program B, the conditional probability P(U()=u|A runs B) is defined as the probability that a random valid proof sampled from a random T
_{i}proves that "if A runs B, then U()=u", conditional on it being a proof that "if A runs B, then U()=something".

(The algorithm can easily be extended to the case where A receives an argument representing observational data. Namely, it should just pass the argument to B, which corresponds to Wei Dai's UDT1.1 idea of optimizing over the agent's input-output map.)

Given a suitable prior, the above algorithm solves my original variant of Counterfactual Mugging with stupid and smart agents, but still correctly decides to calculate the digit of pi if offered a straight bet instead of a counterfactual one. And even apart from that, it feels right to make decisions using a weighted sum of theories T_{1},T_{2},... as well as a weighted sum of universes U_{1},U_{2},... composing U. I'm not sure if the right decision theory should eventually work like this, using "Tegmark level 4" over possible universe programs and "Tegmark level 5" over logically inconsistent theories, or perhaps we can define a single kind of uncertainty that generalizes both indexical and logical. In any case, that's as far as the workshop went, and it seems worthwhile to continue thinking about such questions.

Many thanks to all participants, to MIRI which invited me, to Kevin and Sean who hosted me, and anyone else I might have missed!

## Comments (45)

BestWait, what is Tegmark level 5?

*16 points [-]It's a name we made up for mathematically impossible universes that we still care about because we haven't yet proved them to be mathematically impossible. That becomes relevant in problems like Counterfactual Mugging with a logical coin.

What is a "locally consistent theory"?

Let's say "a set of statements that has no short proof of inconsistency", for some reasonable meaning of "short".

That's what I guessed, but you said "So we abandoned this idea," implying that the rest of the article was completely different, while the rest of the article was about still proof lengths, just smoothing thresholds into weights, so I became skeptical of the guess. I don't have any suggestions for how to talk about false starts and how they relate, but I think it might be useful for insight into my confusion. (Actually, in this particular case, I do have a suggestion, which is to use the term "proof length" much earlier.)

*0 points [-]Thanks! Made a small edit to the post.

You lost me at part

The problem is that, I see no reason why smart agent should refuse to pay. Both stupid and smart agent know it as logical certainty that they just lost. There's no meaningful difference between being smart and stupid in this case, that I can see. Both however like to be offered such bets, where logical coin is flipped, so they pay.

I mean, we all agree that a "smart" agent, that refused to pay here, would receive $0 if Omega flipped logical coin of asking if 1st digit of pi was an odd number, while "stupid" agent would get $1,000,000.

*1 point [-]Note that there's no prior over Omega saying that it's equally likely to designate 1=1 or 1≠1 as heads. There's only one Omega, and with that Omega you want to behave a certain way. And with the Omega that designates "the trillionth digit of pi is even" as heads, you want to behave differently.

Specifically, you want to bet on 'heads'. The trillionth digit of pi is a two.

I think we need to find a trickier logical uncertainty as a default example. There is a (mildly) interesting difference between logical uncertainties that we could easily look up or calculate like "Is 1,033 a prime?" or "is the trillionth digit of pi even?" and logical uncertainties that can not be plausibly looked up. Both types of uncertainty are sometimes relevant but often we want a 'logical coin' that isn't easily cheated.

After asking about this on #LW irc channel, I take back my initial objection, but I still find this entire concept of logical uncertainty kinda suspicious.

Basically, if I'm understanding this correctly, Omega is simulating an alternate reality which is exactly like ours, and where the only difference is that Omega

sayssomething like "I just checked if 0=0, and turns out it's not. If it was, I would've given you moneyzzz(iff you would give me moneyzzz in this kind of situation), but now that 0!=0, I must ask you for $100." Then the agent notices, in that hypothetical situation, that actually 0=0, so actually Omega is lying, so he is in hypothetical, and thus he can freely give moneyzzz away to help to real you. Then, because some agents can't tell for all possible logical coins if they are lied to or not, they might have to pay real moneyzzz, while sufficiently intelligent agents might be able to cheat the system if they are able to notice if they are lied to about the state of the logical coin.I still don't understand why a stupid agent would want to make a smart AI that did pay. Also, there are many complications that restrict decisions of both smart and stupid agents, given argument I've given here, stupid agents still might prefer not paying, and smart agents might prefer paying, if they gain some kind of insght to how Omega chose these logical coins. Also, this logical coin problemacy seems to me like a not-too-special special class of Omega problems where some group of agents is able to detect if they are in counterfactuals

Note that the agent is not necessarily able to detect that it's in a counterfactual, see Nesov's comment.

Yes, those agents you termed "stupid" in your post, right?

*1 point [-]The smart ones too, I think. If you have a powerful calculator and you're in a counterfactual, the calculator will give you the wrong answer.

Well, to be exact, your formulation of this problem has pretty much left this counterfactual entirely undefined. Naive approximation, that the world is just like ours, and Omega just lies in counterfactual, would not contain such weird calculators which give you wrong answers. If you want to complicate problem by saying that some specific class of agents have a special class of calculators that one would usually think to work in certain way, but actually they work in a different way, well, so be it. That's however just a free-floating parameter you have left unspecified and that, unless stated otherwise, should be assumed not to be the case.

*0 points [-]Hmm, no, I assumed that Omega would be using logical counterfactuals, which are pretty much the topic of the post. In logical counterfactuals, all calculators behave differently ;-) But judging from the number of people asking questions similar to yours, maybe it wasn't a very transparent assumption...

*0 points [-]I asked about these differences in my second post in this post tree, where I explained how I understood these counterfactuals to work. I explained as clearly as I could that, for example, calculators should work as they do in real world. I did this explaining in hopes of someone voicing disagreement if I had misunderstood how these logical counterfactuals work.

However, modifying any calculator would mean that there can not be, in principle, any "smart" enough ai or agent that could detect it was in counterfactual. Our mental hardware that checks if logical coin should've been heads or tails is a calculator the same as any computer, and again, there does not seem to be any reason to assume Omega leaves some calculators unchanged while changes results of others.

Unless, this thing is just assumed to happen, with some silently assumed cutaway point where calculators become so internal they are left unmodified.

*1 point [-]Calculators are not modified, they are just interpreted differently, so that when trying to answer the question of what happens in a certain situation (containing certain calculators etc.) we get different answers depending on what the assumptions are. The situation is the same, but the (simplifying) assumptions about it are different, and so simplified inferences about it are different as well. In some cases simplification is unavoidable, so that dependence of conclusions on assumptions becomes an essential feature.

*1 point [-]My current understanding of logical counterfactuals is something like this: if the inconsistent formal theory PA+"the trillionth digit of pi is odd" has a short proof that the agent will take some action, which is much shorter than the proof in PA that the trillionth digit of pi is in fact even, then I say that the agent takes that action in that logical counterfactual.

Note that this definition leads to only one possible counterfactual action, because two different counterfactual actions with short proofs would lead to a short proof by contradiction that the digit of pi is odd, which by assumption doesn't exist. Also note that the logical counterfactual affects all calculator-like things automatically, whether they are inside or outside the agent.

That's an approximate definition that falls apart in edge cases, the post tries to make it slightly more exact.

*0 points [-]As a sort of metaphor/intuition pump/heuristic thingy to make sure I understood things right: Can you think of this as A living in platonic space/tegmark 4 multiverse "before" your universe and having unlimited power, you importing B from it and using that repeatedly as your limited AI, and the set of axiom sets A cares about as a "level 5 tegmark multiverse" with each of the logics being it's own tegmark 4 multiverse?

Perhaps not so useful in building such an AI, but I find these kind of more intuitive rough approximations useful when trying to reason about it using my own human brain.

*1 point [-]I'm not sure "existence" is the best intuition pump, maybe it's better to think in terms of "caring", like "I care about what these programs would return" and "I care about what would happen if these logical facts were true". There might well be only one existing program and only one set of true logical facts, but we care about many different ones, because we are uncertain.

I already have an intuition setup where "what I care about" and "what really exists" are equivalent. Since, you know, there's nothing else "exists" could mean I can think of and "what I care about" is what it seems to be used like?

May or may not need an additional clause about things that exist to things that exist also existing, recursively.

*3 points [-]Let's say you "care" about some hypothetical if you'd be willing to pay a penny today unconditionally in order to prevent your loved ones from dying in that hypothetical. If we take some faraway digit of pi, you'll find that you "care" about both the hypothetical where it's even and the hypothetical where it's odd, even though you know in advance that one of those provably does not "exist". And if you only had a limited time to run a decision theory, you wouldn't want to run any decision theory that threw away these facts about your "care". That's one of the reasons why it seems more natural to me to use "care" rather than "existence" as the input for a decision theory.

Sounds like both of us could use either interpretation without any difference in conclusion, and just find different abstractions useful for thinking about it due to small differences in what other kinds of intuitions we've previously trained. Just a minor semantic hiccup and nothing more.

How does a logical coin work exactly? To come up with such a thing, wouldn't Omega first need to pick a particular formula? If the statement is about the nth digit of pi, then he needs to pick n. Was this n picked at random? What about the sign of the test itself? If not, how can you be sure that the logical coin is fair?

*0 points [-]The approach outlined in the post assumes that "fairness" of the coin is determined by your initial state of logical uncertainty about which math statements are true, rather than indexical uncertainty about which particular Omega algorithm you're going to face. Though I agree that's a big assumption, because we still don't understand logical uncertainty very well.

A priori I wouldn't trust Omega to be fair. I only know that he doesn't lie. If Omega also said that he chose the logical statement in some fair way, then that would assure me the logical coin is identical to a normal coin. He can do this either using real uncertainty, like rolling a die to pick from a set of statements where half of them are true. Or he could use logical uncertainty himself, by not calculating the digit of pi before deciding to make the bet, and having a prior that assigns 50% probability to either outcome.

*1 point [-]For what it's worth, the post assumes that Omega decides to participate in the game unconditionally, its code doesn't have a branch saying it should play only if such-and-such conditions are met. I'm not sure if that answers your question.

*0 points [-]Wait, I'm afraid I'm already lost and this question seems so simple as to suggest I'm missing some important premise of the hypothetical scenario: Why would the stupid agent want this? Why wouldn't it want to write a smart agent that calculates the millionth digit and makes the winning choice?

Restatement of what I understand about the problem:

You offer me lots of money if the millionth digit of pi is even and a small loss of it is odd. I should take the bet since I can't calculate the answer and it might as well be random .

You offer me lots of money if the millionth digit of pi is even and a small loss of it is odd,

andthe chance to build a calculator to calculate the answer. I should still take the bet, even if my calculator tells me that it's odd.If I'm rephrasing it correctly it, then

why?! If you'regiventhe chance to make a calculator to solve the problem, why wouldn't youuseit?*2 points [-]What you're describing is not Counterfactual Mugging, it's just a bet, and the right decision is indeed to use the calculator. The interesting feature of Counterfactual Mugging is that Omega is using counterfactual reasoning to figure out what you would have done if the coin had come out differently. You get the money only if you would have paid up in the counterfactual branch. In that case the right decision is to not use the calculator, I think. Though other people might have different intuitions, I'm sort of an outlier in how much I'm willing to follow UDT-ish reasoning.

*2 points [-]The setup is such that muggings and rewards are grouped in pairs, for each coin there is a reward and a mugging, and the decision in the mugging only affects the reward of that same coin. So even if you don't know where the coin comes from, or whether there are other coins with the same setup, or other coins where you don't have a calculator, your decision on a mugging for a particular coin doesn't affect them. If you can manage it, you should pay up only in counterfactuals, situations where you hypothetically observe Omega asserting an incorrect statement.

Recognizing counterfactuals requires that the calculator can be trusted to be more accurate than Omega. If you trust the calculator, the algorithm is that if the calculator disagrees with Omega, you pay up, but if the calculator confirms Omega's correctness, you refuse to pay (so this confirmation of Omega's correctness translates into a different decision than just observing Omega's claim without checking it).

Perhaps in the counterfactual where the logical coin is the opposite of what's true, the calculator should be assumed to also report the incorrect answer, so that its result will still agree with Omega's. In this case, the calculator provides no further evidence, there is no point in using it, and you should unconditionally pay up.

*2 points [-]Yeah, that's pretty much the assumption made in the post, which goes on to conclude (after a bunch of math) that you should indeed pay up unconditionally. I can't tell if there's any disagreement between us...

*0 points [-]Hm, yeah, that sounds really odd.

I think the reason is sounds so odd is: how the hell is Omega calculating what your answer would have been if 1=0?

If what Omega is really calculating is what you would have done if you were merely

toldsomething equivalent to 1=0, then sure, paying up can make sense.*1 point [-]It seems to me that the relevant difference between "1=0" and "the billionth digit of pi is even" is that the latter statement has a really long disproof, but there might be a much shorter proof of what the agent would do if that statement were true. Or at least I imagine Omega to be doing the same sort of proof-theoretic counterfactual reasoning that's described in the post. Though maybe there's some better formalization of Counterfactual Mugging with a logical coin that we haven't found...

*0 points [-]Even if you're cutting off Omega's proofs at some length, there are plenty of math problems that people can't do that are shorter than high-probability predictions that people will or won't pay up. Certainly when I imagine the problem, I imagine it in the form of predicting someone who's been told that the trillionth digit of pi is even and then paying out to that person depending on their counterfactual actions.

Of course, that leads to odd situations when the agent being predicted can do the math problem, but Omega still says "no bro, trust me, the trillionth digit of pi really is even." But an agent who can do the math will still give Omega the money because decision theory, so does it really matter?

*0 points [-]If you're proposing to treat Omega's words as just observational evidence that isn't connected to math and could turn out one way or the other with probability 50%, I suppose the existing formalizations of UDT already cover such problems. But how does the agent assign probability 50% to a particular math statement made by Omega? If it's more complicated than "the trillionth digit of pi is even", then the agent needs some sort of logical prior over inconsistent theories to calculate the probabilities, and needs to be smart enough to treat these probabilities updatelessly, which brings us back to the questions asked at the beginning of my post... Or maybe I'm missing something, can you specify your proposal in more detail?

*0 points [-]Well, I was thinking more in terms of a logical prior over single statements, see my favorite here.

But yeah I guess I was missing the point of the problem.

Also: suppose Omega comes up to you and says "If 1=0 was true I would have given you billion dollars if and only if you would give me 100 dollars if 1=1 was true. 1=1 is true, so can you spare $100?" Does this sound trustworthy? Frankly not, it feels like there's a principle of explosion problem that insists that Omega would have given you all possible amounts of money at once if 1=0 was true.

A formulation that avoids the principle of explosion is "I used some process that I cannot prove the outcome of to pick a digit of pi. If that digit of pi was odd I would have given you a billion dollars iff [etc]."

*-1 points [-]Are you saying that Omega won't even

offeryou the deal unless it used counter-factual reasoning to figure out what you'll do once it offers?So if Omega has already offered you the deal and you know the coin came out against your favor, and you find you are physically capable of rejecting the deal, you

shouldreject the deal. You've already fooled Omega into thinking you'll take the deal.It's just that if you've successfully "pre-committed" to the extent that a 100% accurate Omega has predicted you will take the offer, you'll be physically incapable of not taking the offer. It's just like Newcombs problem.

And if

that'strue, it means that the problem we are facing is, how to make an algorithm that can't go back on its pre-commitments even after it gains the knowledge of how the bet came out.*-1 points [-]Retraction was unintentional - I thought this was a duplicate comment and "unretract" isn't a thing.

You can delete and then re-post a retracted comment if it has no replies yet.