
cousin_it comments on A model of UDT with a halting oracle - Less Wrong Discussion

41 Post author: cousin_it 18 December 2011 02:18PM


Comment author: cousin_it 19 December 2011 08:36:01PM *  5 points [-]

I can't solve your problem yet, but I found a cute lemma. Let P be the proposition "A()=a" where a is the first action inspected in step 1 of A's algorithm.

  1. (S⊢¬P)⇒P (by inspection of A)

  2. S⊢((S⊢¬P)⇒P) (S can prove 1)

  3. (S⊢(S⊢¬P))⇒(S⊢P) (unwrap 2)

  4. (S⊢¬P)⇒(S⊢(S⊢¬P)) (property of any nice S)

  5. (S⊢¬P)⇒(S⊢P) (combine 3 and 4)

  6. (S⊢¬P)⇒(S⊢(P∧¬P)) (rephrasing of 5)

  7. (S⊢¬P)⇒¬Con(S) (from 6: a theory that proves a contradiction is inconsistent)

All the above steps can also be formalized within S, so each player knows that if any player plays chicken with the first inspected action, then S is inconsistent. The proof generalizes to the second inspected action, etc., by looking at the first one that yields a contradiction. But if S is inconsistent, then it will make all players play chicken. So if one player plays chicken, then all of them do, and that fact is provable within S.
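The derivation above can be restated in standard provability-logic notation; writing □φ for "S proves φ" is a gloss on the comment's notation, not something from the original:

```latex
% \Box\varphi abbreviates "S \vdash \varphi"; \bot is falsity.
\begin{align*}
&\Box\neg P \to P
  && \text{(1: the chicken rule, by inspection of } A\text{)}\\
&\Box(\Box\neg P \to P)
  && \text{(2: } S \text{ proves 1)}\\
&\Box\Box\neg P \to \Box P
  && \text{(3: distribute } \Box \text{ over the implication in 2)}\\
&\Box\neg P \to \Box\Box\neg P
  && \text{(4: provable } \Sigma_1\text{-completeness of } S\text{)}\\
&\Box\neg P \to \Box P
  && \text{(5: chain 4 and 3)}\\
&\Box\neg P \to \Box(P \wedge \neg P)
  && \text{(6: 5 together with } \Box\neg P \to \Box\neg P\text{)}\\
&\Box\neg P \to \neg\mathrm{Con}(S)
  && \text{(7: } \Box(P \wedge \neg P) \text{ is } \Box\bot\text{)}
\end{align*}
```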

Did you manage to make any progress?

Comment author: Nisan 29 January 2012 07:36:23AM 3 points [-]

The proof generalizes to the second inspected action, etc., by looking at the first one that yields a contradiction.

I tried this in the case that the output of A provably lies in the set {a,b}. I only managed to prove

(S⊢¬P)⇒¬Con(S+Con(S))

where P is the proposition "A()=b" where b is the second inspected action. But this still implies

if one player plays chicken, then all of them do, and that fact is provable within S.

Comment author: cousin_it 29 January 2012 11:18:43AM 0 points [-]

Thanks! I think you found a real hole, and the conclusion may be wrong too. At least, I don't see how

(S⊢¬P)⇒¬Con(S+Con(S))

implies the conclusion.

Comment author: Nisan 29 January 2012 07:18:00PM 1 point [-]

The conclusions that I think I can draw are

(S⊬A()≠a)∧(S⊢A()≠b) ⇒ Con(S)∧¬Con(S+Con(S)) ⇒ A()=b

So if one player plays chicken on the second inspected action, then all of them do.

Comment author: cousin_it 29 January 2012 09:15:08PM 0 points [-]

I'm still not getting it. Can you explain the proof of the following:

Con(S)∧¬Con(S+Con(S)) ⇒ A()=b

Comment author: Nisan 30 January 2012 12:59:37AM 2 points [-]
  1. ¬Con(S+Con(S)) ⇒ S⊢¬Con(S)

  2. ⇒ S⊢Prv(A()≠a) (S can formalize that an inconsistent theory proves anything)

  3. ⇒ S⊢A()=a (because A plays chicken)

  4. ⇒ S⊢A()≠b

  5. Con(S)∧¬Con(S+Con(S)) ⇒ Con(S)∧(S⊢A()≠b) (follows directly from 4)

  6. ⇒ (S⊬A()=b)∧(S⊢A()≠b) (a consistent theory can't prove a proposition and its negation)

  7. ⇒ (S⊬A()≠a)∧(S⊢A()≠b) (here I'm assuming that S⊢(A()=a∨A()=b). I don't know what to do if A can choose between more than two alternatives.)

  8. ⇒ A()=b (A plays chicken on the second inspected action)
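Step 8 invokes the chicken rule on the second inspected action. Here is a minimal sketch of that rule as a proof-search loop, assuming a hypothetical provability oracle `proves` for S (the function name and the string encoding of propositions are illustrative, not from the original):

```python
def chicken_step(actions, proves):
    """Scan actions in inspection order; if the theory proves A() != x
    for some action x, immediately play x ("playing chicken" with the
    proof system).  Return None if the rule never fires."""
    for x in actions:
        if proves(f"A() != {x}"):
            return x
    return None

# Stub oracle encoding the hypotheses reached at step 7:
# S does not prove A() != a, but S does prove A() != b.
oracle = lambda stmt: stmt == "A() != b"

print(chicken_step(["a", "b"], oracle))  # prints: b
```

Under these stubbed hypotheses the loop skips a and fires on b, which is exactly the situation step 8 describes.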

Comment author: Douglas_Knight 19 December 2011 10:34:14PM 1 point [-]

Cool.

I had a lot of trouble reading this because in my mind ⇒ binds tighter than ⊢. When I figured it out, I was going to suggest that you use spaces to hint at parsing, but you already did. I don't know what would have helped.

Comment author: Douglas_Knight 20 December 2011 07:25:12PM 0 points [-]

Since we like symmetry, I'm going to change notation from A and B to I and O for "I" and "opponent." (or maybe "input" and "output")

We should be careful about the definition of B. Simply saying that it cooperates if I()=O() causes it to blow up against the defectbot. Instead, consider the propositions PC: I()=C ⇒ O()=C and PD: I()=D ⇒ O()=D. We really mean that B should cooperate if S proves P, where P = PC∧PD. What if it doesn't? There are several potential agents: B1 defects if S doesn't prove P; B2 defects if S proves ¬P, but breaks down and cries if P is undecidable; B3 breaks down if either PC or PD is undecidable, but defects if they are both decidable and one is false. B3 sounds very similar to A and so I think that symmetry proves that they cooperate together. If we modified A not to require that every action had a provable utility, but only that one action had a utility provably as big as all others, then I think it would cooperate with B2.
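The three variants can be sketched as follows, again assuming a hypothetical provability oracle `proves` for S; the names `B1`, `B2`, `B3` follow the comment, while the oracle interface and string encodings are illustrative:

```python
def B1(proves):
    # B1 defects unless S proves P (where P = PC and PD).
    return "C" if proves("P") else "D"

def B2(proves):
    # B2 cooperates if S proves P, defects if S proves not-P,
    # and breaks down (and cries) if P is undecidable in S.
    if proves("P"):
        return "C"
    if proves("not P"):
        return "D"
    raise RuntimeError("P is undecidable in S: B2 breaks down")

def B3(proves):
    # B3 breaks down if either conjunct PC, PD is undecidable in S;
    # otherwise it cooperates iff S proves both conjuncts.
    for q in ("PC", "PD"):
        if not (proves(q) or proves(f"not {q}")):
            raise RuntimeError(f"{q} is undecidable in S: B3 breaks down")
    return "C" if proves("PC") and proves("PD") else "D"

# Example: an oracle for a theory that settles everything in B's favor.
friendly = lambda s: s in {"P", "PC", "PD"}
print(B1(friendly), B2(friendly), B3(friendly))  # prints: C C C
```

Against an oracle that decides both conjuncts but refutes one, B1, B2, and B3 all defect, matching the intended behavior against a defectbot.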

These examples increase my assessment of the possibility that A and B1 cooperate.

(I'm ignoring the stuff about playing chicken, because the comment I'm responding to seems to say I can.)

Comment author: cousin_it 22 December 2011 01:32:49PM *  0 points [-]

B3 sounds very similar to A and so I think that symmetry proves that they cooperate together. If we modified A not to require that every action had a provable utility, but only that one action had a utility provably as big as all others, then I think it would cooperate with B2.

I think your conclusions may be right, but the proofs are vague. Can you debug your reasoning like you debugged mine above?