# A model of UDT with a halting oracle

This post requires some knowledge of mathematical logic and computability theory. The basic idea is due to Vladimir Nesov and me.

Let the universe be a computer program U that can make calls to a halting oracle. Let the agent be a subprogram A within U that can also make calls to the oracle. The source code of both A and U is available to A.

def U():

# Fill boxes, according to predicted action.

box1 = 1000

box2 = 1000000 if (A() == 1) else 0

# Compute reward, based on actual action.

return box2 if (A() == 1) else (box1 + box2)

A complete definition of U should also include the definition of A, so let's define it. We will use the halting oracle only as a provability oracle for some formal system S, e.g. Peano arithmetic. Here's the algorithm of A:

- Play chicken with the universe: if S proves that A()≠a for some action a, then return a.
- For every possible action a, find some utility value u such that S proves that A()=a ⇒ U()=u. If such a proof cannot be found for some a, break down and cry because the universe is unfair.
- Return the action that corresponds to the highest utility found on step 2.

Now we want to prove that the agent one-boxes, i.e. A()=1 and U()=1000000. That will follow from two lemmas.

Lemma 1: S proves that A()=1 ⇒ U()=1000000 and A()=2 ⇒ U()=1000. Proof: you can derive that from just the source code of U, without looking at A at all.

Lemma 2: S doesn't prove any other utility values for A()=1 or A()=2. Proof: assume, for example, that S proves that A()=1 ⇒ U()=42. But S also proves that A()=1 ⇒ U()=1000000, therefore S proves that A()≠1. According to the first step of the algorithm, A will play chicken with the universe and return 1, making S inconsistent unsound (thx Misha). So if S is sound, that can't happen.

We see that the agent defined above will do the right thing in Newcomb's problem. And the proof transfers easily to many other toy problems, like the symmetric Prisoner's Dilemma.

### But why? What's the point of this result?

There's a big problem about formalizing UDT. If the agent chooses a certain action in a deterministic universe, then it's a true fact about the universe that choosing a different action would have caused Santa to appear. Moreover, if the universe is computable, then such silly logical counterfactuals are not just true but provable in any reasonable formal system. When we can't compare actual decisions with counterfactual ones, it's hard to define what it means for a decision to be "optimal".

For example, one previous formalization searched for formal proofs up to a specified length limit. Problem is, that limit is a magic constant in the code that can't be derived from the universe program alone. And if you try searching for proofs without a length limit, you might encounter a proof of a "silly" counterfactual which will make you stop early before finding the "serious" one. Then your decision based on that silly counterfactual can make it true by making its antecedent false... But the bigger problem is that we can't say exactly what makes a "silly" counterfactual different from a "serious" one.

In contrast, the new model with oracles has a nice notion of optimality, relative to the agent's formal system. The agent will always return whatever action is proved by the formal system to be optimal, if such an action exists. This notion of optimality matches our intuitions even though the universe is still perfectly deterministic and the agent is still embedded in it, because the oracle ensures that determinism is just out of the formal system's reach.

**P.S.** I became a SingInst research associate on Dec 1. They did not swear me to secrecy, and I hope this post shows that I'm still a fan of working in the open. I might just try to be a little more careful because I wouldn't want to discredit SingInst by making stupid math mistakes in public :-)

## Comments (100)

Best*8 points [-](In the setting where there were no oracles, this problem was generally discussed on the decision theory list, with various conditions used to prove when intended moral arguments [A=A1=>U=U1] will in fact be proven by the agent (are agent-provable, not just provable). The "chicken rule" condition used here I figured out in explicit form sometime in April ("To avoid confusion, immediately perform any action that implies absurdity"), though I'm not sure anyone published an actual proof using this condition. A few days ago, cousin_it posted a thought experiment involving a Turing oracle, and I pointed out that using the chicken rule in this setting elicits a much nicer result.)

*3 points [-]That's an accurate account, thanks for posting it here! I'm always worried about coming off as though I'm grabbing priority...

*5 points [-]Thanks for writing this up. I'm currently trying to repurpose my failed unpublished post draft from August (on acausal dependencies) to this result, to give a more systematic account of the problem it addresses, and describe the no-oracles case in more detail. Hope to get it done in about a week.

Edit: And posted, 2.5 months late.Congratulations! This is a key step forward. And also, congrats to the SIAI for getting both you and Vladimir Nesov as official researchers.

*4 points [-]Thanks!

In case anyone's interested: I'm not a paid researcher and don't want to become one, as long as I can support myself by programming.

Congratulations! Do you mind an off-topic question? I see you work at Google. What's Google's attitude toward such extracurricular activities? Obviously, I am not asking about specific negotiations, but about written and unwritten company rules. For example, can you dedicate Google company time to SingInst work? (I am thinking of the famous Google 20% rule here.)

The answer is refreshingly boring: I can do this stuff, but not on company time. 20% time is supposed to be used for Google projects.

*11 points [-]Sorry, I just had this image:

cousin_it: I want to use my 20% time to prevent the extermination of humanity.

Google overlord: Okay, and this would help Google ... how, exactly?

(I know, I know, oversimplification.)

Ads^^^3

Heh.

*5 points [-]Would it be naive to hope for a criterion that roughly says: "A conditional P ⇒ Q is silly iff the 'most economical' way of proving it is to deduce it from ¬P or else from Q." Something like: "there exists a proof of ¬P or of Q which is strictly shorter than the shortest proof of P ⇒ Q"?

A totally different approach starts with the fact that your 'lemma 1' could be proved without knowing anything about A. Perhaps this could be deemed a sufficient condition for a counterfactual to be serious. But I guess it's not a necessary condition?

*4 points [-]Both these approaches have been proposed on the workshop list. Good job figuring them out so quickly!

I can make such a criterion fall into nasty Loebian traps by maliciously tweaking the formal system to make some proofs longer than others. That means any proof of correct behavior (like one-boxing) must rely on the intimate details of the proof enumeration order, but we have no idea how to talk formally about such things.

A doesn't necessarily get the code of U neatly factored into A and everything else. The agent has to find copies of itself in the universe, it doesn't get told the positions of all copies explicitly. Note that if we replace the U in the post with some other U' that can be proven equivalent to U by S, then A can notice that equivalence, unscramble the code of U' into U, and win.

Something related to the 'most economical proof' thought is the following: suppose both P and Q depend on some variable x, and the conditional P(x) ⇒ Q(x) is true for all values of x. Then it is silly if either ¬P(x) holds for all x, or if Q(x) holds for all x.

The tricky thing would be to introduce x in a meaningful way. In the case where we want to prove conditionals of the form "agent does X ⇒ world does Y", we want to avoid ending up with a conditional that's true because we prove "agent doesn't do X" (I think we're actually alright with an unconditional proof of "world does Y"). So we want to make the agent's actions somehow depend on whatever x is. For instance, if we gave the agent a very very low probability of picking a random action (and let x be the source of randomness), would that do anything?

That solution has also been discussed on the list. Unfortunately, for it to work as intended, the different copies of the agent must use correlated random variables, which is not very realistic in problems like the Prisoner's Dilemma.

Right, but I'm not suggesting necessarily that the agent actually make decisions randomly, merely that we use this as an aid in proofs. For instance, suppose we assume that all agents (including A itself) have a probability of p (independently) of being replaced by a different algorithm, which just acts randomly. Then, we can try to prove that Pr[world does Y|agent does X] in this probability space approaches 1 as p approaches 0 (colloquially, "agent does X ⇒ world does Y with high probability").

Then if we end up accidentally proving silly things, then we merely end up proving that Pr[agent does X] goes to 0 (because it doesn't happen except when actions are chosen at random). However, the conditional probability shouldn't be affected.

For instance, take Newcomb's problem. The terrible thing in the usual case is that we stumble on a proof that A()=2, so then it follows that A()=1 => U() = 0, which scares us so we output A()=2.

Here that can't happen. Suppose we stumbled on a proof that A()=2 w.h.p. Then when we calculate the conditional probability Pr[U() = 0 | A() = 1], we calculate the probability Pr[U() = 0 and A() = 1] and end up with p^2/4 -- if even one of U() and A() ends up what it's supposed to be, this doesn't happen. So the conditional probability is p/2, which still goes to 0. On the other hand, when we calculate Pr[U() = 1000000 | A() = 1] = Pr[U() = 1M and A() = 1] / Pr[A() = 1], the numerator is p^2/4 + p/2(1-p) -- in the case where A went haywire but U didn't, this is what happens. So this conditional probability is 1 - p/2, which goes to 1. We believe this over the other probability, so we actually output A()=1 -- and this shouldn't happen.

*0 points [-]I tried to parse your comment but couldn't. What are expressions like Pr[U()=0 | A()=1] supposed to mean? What event is "A()=1"? Newcomb's problem makes two calls to A() which affect the resulting utility differently. If the two instances of A always get "randomly replaced" together or not at all, then I agree that it solves Newcomb's problem, but I think the assumption is too strong. On the other hand, if they get "randomly replaced" independently, I think you need to give a more careful argument, and also I think it won't work :-(

*0 points [-]This is why I defined the probability space to be that, instead of A sometimes doing something random, there's a low probability that A is replaced with a different agent that always does something random. I don't see why the assumption is too strong. We can define the probability space any way we like, since we don't actually have to implement it, all we need is to be able to prove things about the probability space.

Now that I say it carefully, it's somewhat reminiscent of the problem you're always objecting to: that we can't separate A from the rest of the universe. But if we can pick out the things that are "agents" -- basically, if we pick out anything that's not immediately predictable, and I think that can be made rigorous -- then we can make this definition.

Oh, but in the actual Newcomb's problem, the two calls to A are actually calls to different but identical routines, aren't they? Are they? One of them is A's actual thought process, the other is Omega's absolutely perfect prediction of A's thought process. But on the other hand, none of the proofs go through if you can't verify that the two copies are the same, which is equivalent to making them the same subroutine.

*2 points [-]Yeah, the problem in the non-oracle setting is about separating A from the rest of the universe. I feel that any good solution to this problem should be "crisp" rather than delegated to A's fuzzy reasoning abilities, because at this point in our research we're not yet trying to make a good optimizer, but trying to define mathematically what optimization means in the first place.

This is one of the most interesting posts I've seen on Less Wrong. It is a non-trivial result and it goes a long way towards dealing with a lot of the problems with UDT. I'm deeply impressed and wish I had some helpful comment to make other than general praise.

Thanks!

Is this the first time an advanced decision theory has had a mathematical expression rather than just a verbal-philosophical one?

This totally deserves to be polished a bit and published in a mainstream journal.

*3 points [-]That's a question of degree. Some past posts of mine are similar to this one in formality.

Nesov also said in an email on Jan 4 that now we can write this stuff up. I think Wei and Gary should be listed as coauthors too.

I still want to figure out games (like PD) in the oracle setting first. After the abortive attempt on the list, I didn't yet get around to rethinking the problem. Care to take a stab?

*1 point [-]The symmetric case (identical payoffs and identical algorithms) is trivial in the oracle setting. Non-identical algorithms seems to be moderately difficult, our candidate solutions in the non-oracle setting only work because they privilege one of the outcomes apriori, like Loebian cooperation. Non-identical payoffs seems to be very difficult, we have no foothold at all.

I think we have a nice enough story for "fair" problems (where easy proofs of moral arguments exist), and no good story for even slightly "unfair" problems (like ASP or non-symmetric PD). Maybe the writeup should emphasize the line between these two kinds of problems. It's clear enough in my mind.

Part of the motivation was to avoid specifying agents as algorithms, specifying them as (more general) propositions about actions instead. It's unclear to me how to combine this with possibility of reasoning about such agents (by other agents).

*0 points [-]That's very speculative, I don't remember any nontrivial results in this vein so far. Maybe the writeup shouldn't need to wait until this gets cleared up.

(It's not "advanced", it's not even in its infancy yet. On the other hand, there is a lot of decision theory that's actually advanced, but solves different problems.)

*2 points [-]I think Luke meant "advanced" as in

superrationality, not "advanced" as inhighly developed.BTW, nice work.

*2 points [-]Specifically, given

anyformal system S for reasoning about the world and agent's place in it, the chicken rule (step 1) forces S to generate consistent theories of consequences for all possible actions. This seems to crack a long-standing problem in counterfactual reasoning, giving a construction for counterfactual worlds (in form of consistent formal theories) from any formal theory that has actual world as a model.*3 points [-]...and the construction turns out not as interesting as I suspected. Something like this is very easy to carry out by replacing the agent A with another that can't be understood in S, but is equivalent to A (according to a system stronger than S). As a tool for understanding decision problems, this is intended to solve the problem of parsing the world in terms of A, finding how it depends on A, finding where A is located in the world, but if we can find all instances of A in the world to perform such surgery on them, we've already solved the problem!

Perhaps A can decide to make itself incomprehensible to itself (to any given S, rather), thus performing the transformation without surgery, formalization of free will by an act of then-mysterious free will? This could still be done. But it's not clear if this can be done "from the outside", where we don't have the power of making A transform to make the dependence of the world on its actions clear.

*1 point [-]You put the oracle outside the universe, but the logical impossibility of an oracle is still visible from inside the universe. For example, this agent (EDIT simplified from original comment):

I don't see why exploring a universe with such contradictions is useful for UDT, can you explain?

*3 points [-]The universe defined in the post doesn't have any contradictions. Every oracle studied in theoretical CS is "incomplete" in the sense that it can't answer all questions about programs that invoke the oracle itself, e.g. the halting oracle cannot answer all questions about programs that can invoke the halting oracle. But it can answer

somequestions about such programs, which turns out to be enough for the agent to make a decision.I am not convinced that the oracle can answer

anyquestions about programs that call the oracle. That is simply not one of the things it does.The halting problem corresponds to a language L which consists of all pairs (P, w) such that program P halts on input w. Then the oracle is defined to return 1 if you give it a pair in L, and 0 otherwise. But to do this, we have to have L defined already, so that definition cannot refer to this oracle. A recursive definition "L is the set of programs with an oracle for L that halt" is just ill-defined.

*5 points [-]We're not asking the oracle directly about the halting behavior of programs that call the oracle. Instead we're using it as a provability oracle, asking it only questions of the form "if a program exhaustively searches for proofs of such-and-such statement in the formal system S, will it ever terminate?"

Since the statements we're trying to prove, in turn, refer to the behavior of programs that call the oracle, we need S to have some assumptions about the behavior of the oracle. These assumptions can be pretty weak, e.g. the proof in the post seems to go through even if S knows only that the oracle is a pure function that returns identical outputs for identical inputs. Or you could make the assumptions stronger, e.g. formalize the notion of "halting oracle" within S. That wouldn't break anything either, I think.

*2 points [-]I see, thanks, that clears it up.

However, now I have an objection to a different step! (which I'm sure you will also address)

Just to clarify, let's say that the first step of A first checks if S proves that A()≠2, and then checks if S proves that A()≠1 (we could do this in either order, it doesn't matter, but this order makes the following explanation make sense better).

S proves that A()≠1. However, S will only be inconsistent if S can also prove that A()=1. To prove that A()=1, S needs to prove that S proves that A()≠1 (does this still make sense?). Furthermore, S needs to prove that S does not prove that A()≠2 (otherwise, A will perversely return 2 before it considers perversely returning 1).

In particular, the statement "S does not prove that A()≠2" implies "S is consistent" and therefore S cannot prove it.

I also have the suspicion that "S proves that A()≠1" is not something S can prove, but my logic-fu is not strong enough to decide one way or the other.

*2 points [-]Let's step through the algorithm. If S proves that A()≠2, then A will immediately return 2. That means S has proved a false statement about the universe. If S doesn't prove that A()≠2, but proves that A()≠1, then A will immediately return 1, which also means S has proved a false statement about the universe.

So yeah, I guess calling it a violation of consistency was wrong. It's actually a violation of soundness: all axioms of S must be true statements about the universe. If S is sound, then the above reasoning shows that S cannot prove A()≠a for any a. (Also soundness implies consistency, which might explain why I made the mistake.) Thanks for pointing that out! Edited the post accordingly.

Alright, that makes sense (if my reading of proofs seems uncharitable at times, it is because I know enough about logic to know when a statement doesn't make sense, but I don't know enough to tell what the statement wants to be).

Soundness seems like an interesting property to think about for a formal system. I am reminded of the bizarre systems you can get, e.g., by taking PA and adding the axiom "PA is inconsistent". This, if I recall correctly, is consistent provided PA itself is consistent, but (whether or not it's consistent) it definitely can't be sound.

*6 points [-]I want people to read my math uncharitably and poke holes in it, otherwise I wouldn't post :-)

Do we know in advance which questions are going to be answered? Do we know agent A from your OP isn't going to end up crying in step 2?

I apologize if I'm just asking about standard CS assumptions here. I have a CS bachelor's degree but I don't remember ever discussing programs that can call halting oracles. Either my memory is faulty (more likely) or I chose the wrong electives.

*4 points [-]It's easy to write a universe program U that would make A cry. The post only proves that A won't end up crying for one specific U, and outlines an argument why it won't cry in some other similar problems.

The class of universes where A doesn't end up crying is supposed to roughly correspond to the informal class of "fair" decision problems where the agent's action is the only thing that determines the resulting utility, but I have no strong argument why it's a good formalization of that class.

One example of an "unfair" decision problem would be a universe that rewarded you for having a specific algorithm, rather than for returning a specific value. Such "unfair" problems can be made up to punish any decision theory you can come up with, so they're probably not a good test case.

Thanks for explaining.

Is anyone at all working on classes of "unfair" problems, such as ones that give different utilities based upon the amount of time spent computing? Or ones that take into consideration any type of resource used to make that decision (energy or memory). This class seems important to me and less arbitrary than "unfair" problems that punish specific algorithms.

*1 point [-]Wei Dai has a tentative decision theory that covers some of those cases. I didn't find it very convincing, but it's likely that I overlooked something. Any work on such problems would be very welcome, of course.

I'll have a think. An optimal decision maker for all scenarios seems impossible if your utility is reduced by an amount proportional to the time take to make the decision ("solving death" has this structure, less people die if you solve it earlier). The best in general I can think of is an infinite table mapping scenarios to a the decision computed by something like your UDT + oracle for that scenario. And this can be beaten in each individual scenario by a specialised algorithm for that scenario, that needs no look up.

And it still has an infinite quantity which I don't like in my theories that I might want to connect to the real world one day (and requires an infinite amount of precomputation).

I wonder if there is a quality apart from strict optimality that we need to look for. Making the optimal decision in most problems( what is the correct weighting of scenarios)? Making the right decision eventually?

Anyway I'll think some more. It is definitely thornier and nastier than "fair" problems.

*0 points [-]I recently made some progress on your question. Section 4 seems to be the most relevant.

Would it be correct to say that S can't prove Lemma 2?

Yes.

*1 point [-]Is this right? I'm wondering if you're assuming soundness relative to the natural semantics about A, because in step one, it isn't clear that there is a contradiction in S rather than a failure of S to be sound with respect to the semantics it's supposed to model (what actions A can take and their utility). I might be confused, but wouldn't this entail contradiction of the

soundnessof S rather than entailing that S is inconsistent? S would only be inconsistent if it can prove both A() = a and A()≠a, but unless you have further hidden assumptions about S I don't see why A returning a would entail that SprovesA() = a.This is how I want to interpret this: S is some deduction system capable of talking about all actions A() can make, and proving some range of utility results about them. S is also consistent and sound.

Play chicken with the universe: if S proves that A()≠a for some action a, then return a.

If S proves for all a there is some u such that [A() = a ⇒ U() = u] , output argmax (a) else exit and output nothing.

Since proving A

can'ttake an action a ( that is, A()≠a ) entails that S is notsound(because A will take such an action in step 1), S can't prove any such result. Also, since proving that an action has two distinct utility values leads to A≠a, thesoundness and consistencyof S entails that this can't happen. Does this seem right?Also, step two seems too strong. Wouldn't it suffice to have it be:

1) For all actions a, if there is some u such that [A() = a ⇒ U() = u] and u > 0, (), else add [A() = a ⇒ U() = 0] to the axioms of S (only for the duration of this decision)

2) output argmax (a)

My thought is that there could be some possible actions the agent can take that might not have provable utility and it seems like you should assign an expected utility of 0 to them (no value being privileged, it averages out to 0), but if you can prove that at least one action has positive utility, then you maximize expected utility by choosing the one with the highest positive utility.

This is weaker than the current step two but still seems to have the desired effect. Does this hold water or am I missing something?

*1 point [-]You're right. Misha already pointed that out and I edited the post.

Yeah, such a variation would work, but I'm uncomfortable calling A the optimal algorithm in a universe where some actions don't have provable utility. Such universes often arise from non-symmetric multiplayer games and I'd rather not insinuate that I have a good solution for those.

"S proves that A()=1 ⇒ U()=42. But S also proves that A()=1 ⇒ U()=1000000, therefore S proves that A()≠1" I don't see how this follows. Perhaps it is because, if the system was sound, it would never prove more than one value for U() for a given a, therefore by the principle of explosion it proves A()≠1? But that doesn't seem to actually follow. I'm aware that this is an old post, but on the off chance that anyone ever actually sees this comment, help would be appreciated.

Suppose S proves that A()≠1 and A()≠2. What does A return?

A will check for one of those first, and then return it. The order in which it checks 1 and 2 doesn't matter, because this can only happen if S is unsound.

So, the primary point is to make it trivial to show that there is no proof in S regarding the output of A?

Exactly. We're worried about proving something like "A()=1 ⇒ U()=0" by proving "A()≠1", which would convince A() not to return 1. This avoids shortcuts of that nature.

*0 points [-]So, in the example given, it is clear that A()=1. and also clear that S cannot prove that A()=1. What system are we using to show that A()=1, and why do we believe that that system is sound?

EDIT: More generally, is there any sound formal system T which can prove that A()=1 for all sound formal systems S that can prove that A()=1 ⇒ U()=1000000 and A()=2 ⇒ U()=1000?

If so, then T cannot prove that A()=1 ⇒ U()=1000000 and A()=2 ⇒ U()=1000, why do we believe that? If not, why do we believe that A()=1?

*1 point [-]Presumably whichever system one typically uses to reason about the soundness of Peano arithmetic would suffice.

Considering that the algorithm involved is already impossible to implement by virtue of calling a halting oracle, I don't see what further harm it does to make the additional assumption that Peano arithmetic is sound.

The provable oracle is only needed for the general case implementation: If you have a proof that S does not prove A()≠n for any n, and a the further proofs that A()=n ⇒ U()=u for each n, then the need for the oracle is satisfied. If you replace the oracle with a routine which sometimes breaks down and cries, this specific example might still work.

What I'm saying is that either there exists a sound system which can prove the three statements A()=1, A()=1 ⇒ U()=1000000 and A()=2 ⇒ U()=1000, or there does not exist such a sound system. If such a system exists, then it does not prove that A()=1 ⇒ A()≠2 (or any other value), and therefore does not contain Peano arithmetic. Since the proof here does make all three claims, either the system used here either does not contain PA or is unsound. Since PA is assumed whenever numbers are used, the system is unsound.

The claim that the algorithm will do the right thing in Newcomb's problem is suspect, because any formal system that can use the proof provided is unsound.

*0 points [-]Why not?

Because such a system would prove that A()≠2, which would result in A() returning 2 in the first step, when that system is used as the system S. Since it proves something which is not true, it is unsound.

A sound system need not include the axioms of PA to include the axiom "Anything that PA proves is true".

Such a system T can be sound and still prove that A()=1, but cannot prove that A()=1 ⇒ A()≠2, because:

T does not contain the axiom A=n ⇒ A≠m, for m≠n. Therefore A()=1 ⇒ A()≠2 is not provable in T

PA does not prove (T proves A()=1 ⇒ A()≠2), and does not include the axiom "Anything that T proves is true." A()=1 is not provable in PA, and therefore A()≠2 is not provable in PA.

No sound system that can be used as S can prove that A()≠2: Since U only requires that S to be sound and be able to prove A()=a ⇒ U()=u for all permissible a, no sound system can prove A()=a ⇒ U()=u for all permissible a and A()≠2.

*3 points [-]Wait, no, no, no. The system in which we are working to analyze the algorithm A() and prove that it works correctly is not the same as the system S used by A()! Although S is not powerful enough to prove A()≠2 -- this would require that S is aware of its own soundness -- we can pick a system S that we know is sound, or can assume to be sound, such as Peano arithmetic, and based on that assumption we can prove that A()=1.

What about the prisoner's dilemma?

Here's a concrete example of a game between non-identical agents: agent A of this post, who maximizes outcome, and agent B, who cooperates iff the other player plays the same move.

You seem to imply that you consider this approach practically equivalent to your earlier approach with bounded proof length. Is that right?

But in your first post, you say you're not sure what happens with two player games, while in your second, you say "if agent A tries to reason about agent B in the same way, it will fail miserably." What changed your mind? Do you still hold this position? Did you consider this PD pairing back then?

*2 points [-]This post says only that the result transfers to the symmetric PD, i.e. the case with identical agents. I still don't have any good solution for other cases.

B cooperates iff S proves that A()=B(), right? In this case, if A and B are using the same S, I think A will cry and B will defect :-( Here's why:

1) S cannot prove any statement like "S cannot prove X", because that requires S to prove the consistency of S.

2) Therefore S cannot prove something like "S cannot prove that A doesn't cooperate".

3) Therefore S cannot prove that A won't play chicken on step 1.

4) Therefore S cannot prove that A()=B().

5) Therefore B will defect, and S cannot prove that.

6) Therefore A can't find a complete set of proofs saying all of A's actions imply B's defection.

7) If A can find a complete set of proofs saying one of A's actions implies B's cooperation, then A will do that action, making the counterfactual true. But B defects, so that can't happen.

8) By (6) and (7), the only remaining possibility is that A can't find a complete set of proofs at all, so it breaks down and cries.

I haven't checked the above argument very carefully, so it can contain many mistakes. Please recheck before accepting it!

ETA:Douglas found a fatal mistake in the argument, so don't bother checking :-)I don't see how you got from 3) S cannot prove that A won't play chicken to 4) S cannot prove that A()=B(). That inference seems way too general; it should apply to any agent of this type. In particular, S certainly can prove that A()=A(). Perhaps 3 is supposed to break some symmetry between A and B? Perhaps I wasn't clear that I wanted B to play chicken, too. That restores the symmetry, if that's what the point of 4 was.

I also think that there is a subtle gap in 5 => 6. If S could prove that all of A's actions imply B's defection, then S could prove that B defects, hence contradiction. But A's decision doesn't need to examine all of A's actions: crying can be omitted. I doubt that this actually helps, though.

*5 points [-]I can't solve your problem yet, but I found a cute lemma. Let P be the proposition "A()=a" where a is the first action inspected in step 1 of A's algorithm.

(S⊢¬P)⇒P (by inspection of A)

S⊢((S⊢¬P)⇒P) (S can prove 1)

(S⊢(S⊢¬P))⇒(S⊢P) (unwrap 2)

(S⊢¬P)⇒(S⊢(S⊢¬P)) (property of any nice S)

(S⊢¬P)⇒(S⊢P) (combine 3 and 4)

(S⊢¬P)⇒(S⊢(P∧¬P)) (rephrasing of 5)

(S⊢¬P)⇒¬Con(S)

All the above steps can also be formalized within S, so each player knows that if any player plays chicken with the first inspected action, then S is inconsistent. The proof generalizes to the second inspected action, etc., by looking at the first one that yields a contradiction. But if S is inconsistent, then it will make all players play chicken. So if one player plays chicken, then all of them do, and that fact is provable within S.

Did you manage to make any progress?

I tried this in the case that the output of A provably lies in the set {a,b}. I only managed to prove

(S⊢¬P)⇒¬Con(S+Con(S))

where P is the proposition "A()=b" where b is the second inspected action. But this still implies

Thanks! I think you found a real hole and the conclusion is also wrong. Or at least I don't see how

(S⊢¬P)⇒¬Con(S+Con(S))

implies the conclusion.

The conclusions that I think I can draw are

¬(S⊢A()=a)∧(S⊢A()=b) ⇒ Con(S)∧¬Con(S+Con(S)) ⇒ A()=b

So if one player plays chicken on the second inspected action, then all of them do.

I'm still not getting it. Can you explain the proof of the following:

Con(S)∧¬Con(S+Con(S)) ⇒ A()=b

¬Con(S+Con(S)) ⇒ S⊢¬Con(S)

⇒ S⊢Prv(A()≠a) (if S is inconsistent, then it proves anything)

⇒ S⊢A()=a (because A plays chicken)

⇒ S⊢A()≠b

Con(S)∧¬Con(S+Con(S)) ⇒ Con(S)∧(S⊢A()≠b) (follows directly from 4)

⇒ (S⊬A()=b)∧(S⊢A()≠b) (a consistent theory can't prove a proposition and its negation)

⇒ (S⊬A()≠a)∧(S⊢A()≠b) (here I'm assuming that S⊢(A()=a∨A()=b). I don't know what to do if A can choose between more than two alternatives.)

⇒ A()=b (A plays chicken on the second inspected action)

Cool.

I had a lot of trouble reading this because in my mind ⇒ binds tighter than ⊢. When I figured it out, I was going to suggest that you use spaces to hint at parsing, but you already did. I don't know what would have helped.

Since we like symmetry, I'm going to change notation from A and B to I and O for "I" and "opponent." (or maybe "input" and "output")

We should be careful about the definition of B. Simply saying that it cooperates if I()=O() causes it to blow up against the defectbot. Instead, consider the propositions PC: I()=C ⇒ O()=C and PD: I()=D ⇒ O()=D. We really mean that B should cooperate if S proves P=PC∧PD. What if it doesn't? There are several potential agents: B1 defects if S doesn't prove P; B2 defects if S proves ¬P, but breaks down and cries if it is undecidable; B3 breaks down if either PC and PD are undecidable, but defects they are both decidable and one is false. B3 sounds very similar to A and so I think that symmetry proves that they cooperate together. If we modified A not to require that every action had a provable utility, but only that one action had a utility provably as big as all others, then I think it would cooperate with B2.

These examples increase my assessment of the possibility that A and B1 cooperate.

(I'm ignoring the stuff about playing chicken, because the comment I'm responding to seems to say I can.)

*0 points [-]I think your conclusions can be right, but the proofs are vague. Can you debug your reasoning like you debugged mine above?

Nice catches! You're right on both points. I have to go to sleep now, will figure this out tomorrow.

*0 points [-]Wouldn't the same job be done by the agent using proper counterfactuals instead of logic ones, which seems like something that would also be needed for other purposes?

I don't know who (if anyone) has done any work on this, but when a human considers a counterfactual statement like "If Gore won in 2000" that is very underspecified, because the implicit assumption is to discard contradicting knowledge, but how to do that exactly is left open. Humans just know that they should assume something like "Bush didn't win Florida" instead of "266 > 271".

If an example agent needs to be able to use precisely defined proper counterfactuals I think it might be possible to do that with an ordering function for its current knowledge. The agent would start with the couterfactual under consideration, add items from its knowledge base in the order specified for that counterfactual, test for each item whether it can find a contradiction, and discard the current item whenever it finds a contradiction (from consideration for evaluating the counterfactual).

For the example I think the order would look like this: A=a, S, U except for A, the source code of A.

That would do much the same thing as "playing chicken with the universe" with respect of not being impressed by proofs about its output, no?

More generally I think the items would be further split up, particularly "A behaves like a program with this source code in all other cases" would come before "A behaves like a program with this source code in the case under consideration". Other instances of A would also have to be treated as instances of A and not as programs with the source code (i.e. statements like "A' has the same source code as A" would come before either of their source codes in the ordering).

Does that make sense?

*1 point [-]The idea of using an ordering function for knowledge is new to me, thanks!

The hard part is getting "U except for A". Given the source code of U and the source code for A, we don't know how to "factor" one program by another (or equivalently, find all correlates of the agent in the universe). If we knew how to do that, it would help a lot with UDT in general.

I guess you would actually have various knowledge items about the world, some of them implying things about A, and any that in conjunction with the others so far cause a contradiction with A=a that the agent can find would be discarded. Maybe that already would be enough; I'm not sure.

What considerations should be used to order the knowledge items?

*0 points [-]That's a really difficult question. It's hard to say what principles humans follow when evaluating counterfactuals, and even harder to say in how far that's a reasonable example to follow.

I think higher level observational laws should usually have a higher priority than concrete data points they are based on, and all else equal they should be in descending order of generality and confidence. That the US. president can veto US federal legislation and that the person who can veto US federal legislation is the same person as the commander in chief of the US military forces should both have a higher priority than that George W. Bush could veto US federal legislation.

It would also depend on what the counterfactual is used for. For counterfactuals concerning the past timing would obviously extremely important.

In the case of considering the counterfactual implications of a decision the agent makes you could try ascending order of strength as Bayesian evidence about the agent as a secondary criterion, maybe? Or perhaps instead ratio of that strength to general importance? (Which would probably require nested counterfactuals? Are we concerned with computability yet?)

EDIT: I think the knowledge items would have redundancy so that even if the agent can derive itself directly from the laws of physics and needs to reject (one of) them it can reconstruct almost normal physics from various observational laws. It also seems redundancy could reduce the importance of the initial order somewhat.

Stupid question: A() always returning 1 manifestly solves the Newcomb problem, no fancy proofs required. What am I missing?

*6 points [-]The implementation of A in the post works correctly for many different U's.

*-1 points [-]Another possible modification of the algorithm:

1. Remove the playing chicken step.

2. Keep the step 2 - proving "A()=a ⇒ U()=u".

3. Change the choice rule to:

Return the action for which S does not prove that A()≠a, and which corresponds to the highest utility found on step 2 among all such actions.

*1 point [-]Your scheme doesn't guarantee that all actions will be considered, or that a known collection of proofs of statements [A()=a ⇒ U()=u] will generate the action that agent actually chooses. If S proves that A()=a1, a1 is what your agent will choose, without any consequentialist reason for this choice apparent in the decision algorithm. You are letting S decide for you, implementing a kind of fatalist decision-making, asking "What do the laws of nature say I'm going to do?" instead of "What should I do?" The chicken rule (diagonal step) is what provides these guarantees (there are likely other ways to this same end).

I visualized the process like this: S would start proving things. It will notice that "if S does not prove A()!=a AND S does not prove A()!=b AND AgentStep2Proves("A()=a => U()=u") AND AgentStep2Proves("A()=b => U()=w") AND u>w THEN A()!=b", so it would prove A()!=b. Etc.

Which is wrong since it makes S look on itself, which would make it inconsistent. Except it is inconsistent anyway...

On the other hand, this method still does not let the agent prove its decision, since it would make the step 2 explode. Damn, I'm stupid :(

But this should work. It's just TDT, I think. Does it have serious drawbacks that make UDT/ADT desirable?

I still have doubts about the consistency of this architecture. What if the agent sees a copy of itself perform some action in a situation exactly equal to the situation in which the agent finds itself now. Would it not mean that the agent can now prove that it would perform the same action? (There would be a difference between the agent and the copy, but only an "additive" difference - the agent will have additional knowledge that the copy wouldn't - so whatever the copy proved, the agent must also be able to prove. [And this fact would be provable to the agent!]).

An example situation would be the Parfit's hitchhiker. If the system finds itself in town, saved by the driver, would it not be able to prove that it will cooperate?

In fact, the system has its own source code, so won't it directly simulate itself in any case, as a side-effect of running S? Guess it's just a standard proof why S is impossible...

*-1 points [-]Is it possible to change the definition of A() to something like this:

Let UX(X) be the function that results from the source of U() if all calls "A()" were changed to "eval(X)".

For some action p, prove that for any possible agent X, UX(X) <= UX("return p").

return p.

Then the agent would be able to prove that it returns a specific action, and yet there would be no contradictions.

It seems to me this should be better than the "chicken rule" (it's really funny and clever, though! :)). But I'm getting no response to similar comments in other posts. Am I making some ridiculous mistake here?

So, one of the classical examples of what we want this decision algorithm to do is the following: if it is playing a Prisoner's dilemma game with an identical algorithm, then it should cooperate. The informal reasoning (which this post is trying to formalize) is the following: the agent proves that the output of A() and of A'() are always equal, compares the values of "A() = A'() = cooperate" and "A() = A'() = defect" and decides to cooperate.

The key point is that the agent is not initially told which other agents are identical to it. So your suggestion fails to work because if we replace "A()" with "eval(cooperate)" or "eval(defect)" then we end up proving things about how the agent A'() plays against a CooperateBot or a DefectBot. We conclude that A'() defects against both of these, and end up defecting ourselves.

*0 points [-]Ok, modification:

def U():

box1 = 1000

box2 = (A_1() == 2) ? 0 : 1000000

return box2 + ((A_2() == 2) ? box1 : 0)

or

def U():

if(A_1() == 'C' and A_2() == 'C') return 2;

if(A_1() == 'D' and A_2() == 'C') return 3;

...

def A():

Forall i try to prove A()==A_i()

Let UX(X) be the function that results from the source of U() if all calls "A_i()", for which the system was able to prove A()==A_i() were changed to "eval(X)".

For some action p, prove that for any possible action q, UX("return q") <= UX("return p").

Return p.

What results should be equivalent to TDT, I think.

If every step in that code is valid, then that would work. In my opinion the shakiest step in your algorithm is the "Forall i" in the first line of A(). I'll probably reason myself into a hole if I try to figure out whether or not it's valid.

Well, at least in some cases (e.g., if A's code is equal to A_i's code) the proof is immediate. Proof is also possible for minor obvious modifications, like variable renaming, so the agent should still behave better than a quining agent for PD.

My goal for inventing this is to have a consistent agent that can know (can prove) its decision beforehand, at least sometimes, without getting into contradictions...

*0 points [-]No, it wouldn't defect against itself, because UX will call the same eval("return p") twice:

UX(X) = { return PDPayoffMatrix[eval(X), eval(X)]; }

The payoff with p=cooperate is greater, so the agent will cooperate.

EDIT: Sorry, that was all wrong, and missed your point. Thinking now...