At the request of Blueberry and jimrandomh, here's an expanded repost of my comment, which was itself a repost of my email sent to decision-theory-workshop.

(Wait, I gotta take a breath now.)

A note on credit: I can only claim priority for the specific formalization offered here, which builds on Vladimir Nesov's idea of "ambient control", which builds on Wei Dai's idea of UDT, which builds on Eliezer's idea of TDT. I really, really hope to not offend anyone.

(Whew!)

Imagine a purely deterministic world containing a purely deterministic agent. To make it more precise, agent() is a Python function that returns an integer encoding an action, and world() is a Python function that calls agent() and returns the resulting utility value. The source code of both world() and agent() is accessible to agent(), so there's absolutely no uncertainty involved anywhere. Now we want to write an implementation of agent() that would "force" world() to return as high a value as possible, for a variety of different worlds and without foreknowledge of what world() looks like. So this framing of decision theory makes a subprogram try to "control" the output of a bigger program it's embedded in.

For example, here's Newcomb's Problem:

def world():
  box1 = 1000
  box2 = 0 if agent() == 2 else 1000000
  return box2 + (box1 if agent() == 2 else 0)

A possible algorithm for agent() may go as follows. Look for machine-checkable mathematical proofs, up to a specified max length, of theorems of the form "agent()==A implies world()==U" for varying values of A and U. Then, after searching for some time, take the biggest found value of U and return the corresponding A. For example, in Newcomb's Problem above there are easy theorems, derivable even without looking at the source code of agent(), that agent()==2 implies world()==1000 and agent()==1 implies world()==1000000.
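Here's a minimal sketch of such an agent in Python. The provable() stub stands in for a real proof checker over some fixed formal theory; purely so the example runs, it is hard-coded with the two easy Newcomb theorems mentioned above, and the candidate actions and utilities are enumerated directly instead of enumerating proofs:

def provable(statement, max_proof_length):
  # Stand-in for a real proof checker: hard-coded with the two easy theorems
  # about the Newcomb world above.
  known_theorems = {
    "agent()==1 implies world()==1000000",
    "agent()==2 implies world()==1000",
  }
  return statement in known_theorems

def agent():
  actions = [1, 2]
  candidate_utilities = [0, 1000, 1000000, 1001000]
  best_action, best_utility = actions[0], -1
  for a in actions:
    for u in candidate_utilities:
      statement = "agent()=={} implies world()=={}".format(a, u)
      if provable(statement, 1000000) and u > best_utility:
        best_action, best_utility = a, u
  return best_action   # returns 1 here, i.e. one-boxing

With a real proof checker in place of the stub, this is exactly the "search for proofs, keep the best (A, U) pair" procedure described above.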

The reason this algorithm works is very weird, so you might want to read the following more than once. Even though most of the theorems proved by the agent are based on false premises (because it is obviously logically contradictory for agent() to return a value other than the one it actually returns), the one specific theorem that leads to maximum U must turn out to be correct, because the agent makes its premise true by outputting A. In other words, an agent implemented like that cannot derive a contradiction from the logically inconsistent premises it uses, because then it would "imagine" it could obtain arbitrarily high utility (a contradiction implies anything, including that), therefore the agent would output the corresponding action, which would prove the Peano axioms inconsistent or something.

To recap: the above describes a perfectly deterministic algorithm, implementable today in any ordinary programming language, that "inspects" an unfamiliar world(), "imagines" itself returning different answers, "chooses" the best one according to projected consequences, and cannot ever "notice" that the other "possible" choices are logically inconsistent with determinism. Even though the other choices are in fact inconsistent, and the agent has absolutely perfect "knowledge" of itself and the world, and as much CPU time as it wants. (All scare quotes are intentional.)

This is progress. We started out with deterministic programs and ended up with a workable concept of "could".

Hopefully, results in this vein may someday remove the need for separate theories of counterfactual reasoning based on modal logics or something. This particular result only demystifies counterfactuals about yourself, not counterfactuals in general: for example, if agent A tries to reason about agent B in the same way, it will fail miserably. But maybe the approach can be followed further.

111 comments

A subtlety: If the proof search finds "agent()==1 implies world()==5" and "agent()==2 implies world()==1000", then the argument in the post shows that it will indeed turn out that agent()==2 and world()==1000. But the argument does not show in any similar sense that 'if the agent had returned 1, then world() would equal 5' -- in fact this has no clear meaning (other than the one used inside the algorithm, i.e. that the theorem above is found in the proof search).

I do not think that this is actually a problem here, i.e. I don't think it'...

4cousin_it
Sorry, my previous reply was based on a misparsing of your comment, so I deleted it. So the "mathematical intuition module" can be 100% truthful and evil at the same time, because making true statements isn't the same as making provable statements. Funny. It seems the truth vs provability distinction is actually the heart of decision theory.
5Wei Dai
I came across a paragraph from Bruno Marchal today which strikingly reminds me of Vinge's "Hexapodia as the key insight". Bruno is a prolific participant on the "everything-list" that I created years ago, but I've never been able to understand much of what he talked about. Now I wonder if I should have made a greater effort to learn mathematical logic. Do his writings make sense to you?

No, they look like madness.

But logic is still worth learning.

EDIT: it seems I was partly mistaken. What I could parse of Marchal's mathy reasoning was both plausible and interesting, but parsing is quite hard work because he expresses his ideas in a very freaky manner. And there still remain many utterly unparseable philosophical bits.

Thanks for taking another look. BTW, I suggest that if you change the content of a comment substantially, it's better to post the new content as a reply to yourself; otherwise those who read LW by scanning new comments will miss it.

0[anonymous]
This is basically the idea of the "mathematical intuition module". Wei Dai explained it to me and I spent some time thinking about it, and now I think such an idea cannot work. To make it provably work, you'd need way too many assumptions about "well-behavedness" of the module that our actual math intuition doesn't satisfy.

Note that by proving "agent()==1 implies world()==1000000", and then performing action 1, you essentially conclude that "world()==1000000" was valid in the first place. "You had the mojo all along", as Aaronson said (on Löb's theorem). Decision making is all about lack of logical transparency, inability to see the whole theory from its axioms.

0cousin_it
Woah, I just noticed the weird alternative proof of Löb's theorem contained in that PDF. Can't even tell if it's correct. Thanks!

I was confused on the first reading: I thought you were presenting a contrasting reduction of couldness. Then, on reading the earlier comment, I saw the context in which you were presenting this: as a way to clarify the existing free will solution EY gave. Now, it makes more sense.

(It might be helpful to include more of the motivation for this explanation at the start.)

1cousin_it
It's true that I agree with Eliezer's solution and my post can be seen as a formalization of that, but I'd much rather not discuss free will here and now :-)
0Vladimir_Nesov
It is a formalization of UDT, but I don't see how it's a formalization of TDT.
0cousin_it
I think Silas meant Eliezer's "solution to free will", not TDT.
2SilasBarta
Posting to confirm. This is why I said:
0Vladimir_Nesov
I believe "solution to free will" was based on TDT (although if you make TDT informal, and then formalize the result, it can well be no longer a formalization of TDT, and work better too).

This has been extremely useful for me, even though I understood the basic idea already, because with the discussion it made me understand something very basic that somehow I'd always missed so far: I'd always thought of this kind of program in the context of applying classical decision theory with the possible actions being particular agent()s. But using classical decision theory already presupposes some notion of "could." Now I understand that the principle here yields (or is a real step towards) a notion of could-ness even in the real world: If...

0cousin_it
Good question, but surprisingly hard to parse. Took me three attempts. Basically, you point out that my implementation of agent() quantifies over possible choices, but sometimes it's useful for agents to quantify over programs that lead to choices. Maybe this falls magically out of the proof checker, maybe not. I cannot answer this. My brain hurts.
0Benya
Ouch, sorry! Too much time spent thinking about these things by myself without having to explain what I mean, probably. I think I still haven't gotten across the simple point I was trying to make; let me try again. Given an optimization problem like the TSP, there is a well-defined notion of "what utility/score do I get if I choose to do X," independent of the algorithm you use to compute a solution -- so there is a theoretical optimum solution independent of the algorithm you use to make a choice, and you can compare different heuristic algorithms on how close they come to the theoretical optimum. It seems useful to have this sort of description of the problem when trying to find actual practical algorithms. I used to hope that we could find an equivalent for timeless decision theories, an abstract description of the optimum solution that we could then try to approximate through heuristic algorithms, but in your work, the definition of "what would happen if I did X" is so intimately connected to the algorithm making the decision that it's at least not obvious to me how this could work.
2cousin_it
Most (all?) proofs of statements like "if I did X, utility would be U" don't actually need to look at the implementation of the agent, only find the call sites in world(). So we can compare different heuristics without things blowing up. Or did you mean something else? Do you have an example of a decision theory problem where the optimal solution is hard to find, but heuristics help? Or should I use the TSP as an example?
0Benya
(Sorry for the delayed reply.) I'm a little confused: it's true that we can easily compare heuristics for world()s that depend on the agent only through easily identifiable call sites, but if we're only considering this class of problems, what does your algorithm add over the simpler algorithm of replacing all call sites to agent() by a constant, running the result, and choosing the constant that gives the highest value? (I suppose I was hoping for more direct game theoretical applications than you, so I had situations in mind where the (rest of) the world does depend on the source of the agent.) Regarding examples, what I had in mind was: Surely your literal algorithm of enumerating all proofs is not efficient enough to run even on toy problems like your implementation of Newcomb! Also, surely for any real-world problem, it won't be feasible to find the optimal solution. So at some point, we will want to use the theory as a background for looking for practical (and thus necessarily heuristic) algorithms.
0cousin_it
Admittedly, not much. If world() calls agent2() that provably returns the same value as agent(), agent() will notice it. But this wasn't really the point of the post. The point was to show that having perfect knowledge of yourself doesn't necessarily stop you from considering counterfactuals about yourself. Of course it's all completely impractical. Perhaps a closer approximation to reality would be a "listener" algorithm that accepts proofs (e.g. by reading game theory books), instead of trying to find them.
0[anonymous]
It's true that agent() cannot see where world() ends and agent() begins, but we can see that! Otherwise it would be unclear what a "decision-theoretic problem" even means. So we can just compare the return values of world() for different implementations of agent(). Or have I mis-parsed your question?

Note that world() essentially doesn't talk about what can happen; instead, it talks about how to compute utility, and the computation of utility is a single fixed program without parameters (not even depending on the agent) that the agent "controls" from the inside.

To clarify, in the Newcomb example, the preference (world) program could be:

def world():
  box1 = 1000
  box2 = 0 if agent1() == 2 else 1000000
  return box2 + (box1 if agent2() == 2 else 0)

while the agent itself knows its code in the form agent3(). If it can prove both agent1() and agent...

1cousin_it
Yep - thanks. This is an important idea that I didn't want to burden the post with. It would have brought the argument closer to UDT or ADT research, and I'd much prefer if you and Wei Dai wrote posts on these topics, not me. In matters of ethics and priority, one cannot be too careful. Besides, I want to read your posts because you have different perspectives from mine on these matters.
0Vladimir_Nesov
Thanks, I think I'll write up this remark about the notion of preference (a little more generally, as a theory rather than a program, with a program as a special case, and with the lack of explicit dependence addressed in the discussion of why the program/theory is "fixed").
0Perplexed
One possible problem: There is a difference between a program and the function it computes. The notion of preference is perhaps captured best by a function, not a program. Many programs could be written, all computing the same function. This wouldn't matter much if we were only going to call or invoke the program, since all versions of the program compute the same result. But, this is not what cousin_it suggests here. He wants us to examine the source of the program, to prove theorems about the program. Given finite resources, the things we can prove about one program may not match the things we can prove about another program, even if the two programs would compute the same result in all cases if agent() returned the same result.
1Vladimir_Nesov
No, a program. What the program defines is a single constant value, not a function (remember: if we are talking about a program, it's a constant program, taking no parameters!). And of course it matters how that constant is defined, and not at all what that constant is. More generally, we can define that constant not by a program, but by a logical theory, which will be the topic of my next post.
1Perplexed
By a "logical theory" do you mean what logicians usually mean by a "theory"? A deductively closed set of sentences? Wow! Well, that eliminates a lot of the arbitrary character I was objecting to in using programs to represent the world/decision problem. But there still are a lot of deductive systems to choose among. I await your next post with interest.
0Vladimir_Nesov
I won't settle the choice, only point out the generality of the notion and how it applies, and a direction to look for further refinements.
0[anonymous]
Yep, in my original email world() was called utility().

OK, so just spelling out the argument for my own amusement:

The agent can prove the implications "agent() == 1 implies world() == 1000000" and "agent() == 2 implies world() == 1000", both of which are true.

Let x and y in {1, 2} be such that agent() == x and agent() != y.

Let Ux be the greatest value of U for which the agent proves "agent() == x implies world() == U". Let Uy be the greatest value of U for which the agent proves "agent() == y implies world() == U".

Then Ux is greater than or equal to Uy.

If the agent's for...

2Chris_Leong
Thanks for this comment! The idea that only a single Uy can be provable (otherwise we get a contradiction, and hence we can prove an arbitrarily high utility) has greatly clarified this post for me. I still haven't quite figured out why the agent can't use the above proof that those are the only two provable statements, plus the fact that agent()==1 gives a higher value, to prove that agent()!=2 and then create a contradiction.

Edit: Actually, I think I now understand this. After finding those two proofs, the agent can only conclude that there aren't any more proofs of the form "agent()==_ implies world()==_" if its reasoning (say PA) is consistent. But it can't actually prove this!
0cousin_it
Yes, this looks right.
gRR

The setup can be translated into a purely logical framework. There would be constants A (Agent) and W (World), satisfying the axioms:

(Ax1) A=1 OR A=2

(Ax2) (A=1 AND W=1000) OR (A=2 AND W=1000000)

(Ax3) forall a1,a2,w1,w2 ((A=a1 => W=w1) AND (A=a2 => W=w2) AND (w1>w2)) => NOT (A=a2)

Then the Agent's algorithm would be equivalent to a step-by-step deduction:

(1) Ax2 |- A=1 => W=1000

(2) Ax2 |- A=2 => W=1000000

(3) Ax3 + (1) + (2) |- NOT (A=1)

(4) Ax1 + (3) |- A=2

In this form it is clear that there are no logical contradictions at any tim...

0Vladimir_Nesov
The direction of setting the problem up in a logical setting is further pursued here and here. The axioms as you wrote them don't fit or work, for multiple reasons.

Conceptually, the problem is in figuring out how decisions influence outcomes, so having something like (Ax2) defies this purpose. Statements like that are supposed to be inferred by the agent from definitions of A and W that are opaque to the purpose of deciding, and finding statements of suitable form constitutes most of the activity in making a decision. If the agent can figure out what it decides, then by default it becomes able to come up with spurious arguments for any decision, which makes its inference system (axioms) either inconsistent or unpredictable (if we assume consistency and don't let (Ax3) into it).

The rules that say which arguments lead to which actions shouldn't be part of the theory that the agent uses to come up with the arguments. This excludes your (Ax3), which can only be assumed "externally" to the agent, as a way of seeing if it's doing OK, not something it should itself be able to see.

A contradiction can be inferred from your axioms in this way: (4) says A=2, which implies [A=1 => W=0], which by (Ax3) together with [A=2 => W=1000000] implies that A=1, which is a contradiction and a decision to do whatever. (I think you've also mixed up A=1 and A=2 while going from (Ax2) to (1) and (2).)
0gRR
Thanks! This site is rather large :) And the most interesting (for me) stuff is rarely linked from the sequences. That is, it seems so... Is there a more efficient way to get up to date? [I would SO like to have found this site 5 years ago!]

The axioms were not proposed seriously as a complete solution, of course. The (Ax2) can be thought of as a cached consequence of a long chain of reasoning. The main point of the post was about the word "could", after all. (Ax3) is just an axiomatic restatement of the "utility maximization" goal. I don't see why an agent can't know about itself that it's a utility maximizer.

Yes, A=1 and A=2 were mixed up in the (Ax2), fixed it now. But I don't see your contradiction. (Ax3) + [A=1 => W=0] + [A=2 => W=1000000] does not imply A=1. It implies NOT(A=1).
0Vladimir_Nesov
See the posts linked to from UDT, ADT, TDT, and the section "Decision theory" in this post.

It can; it's just a consideration it can't use in figuring out how the outcome depends on its decision. While looking for what is influenced by a decision, the decision itself should remain unknown (given that the decision is determined by what this dependence turns out to be, it's not exactly self-handicapping).

It seems the mixing up of A=1 and A=2 didn't spare this argument, but it's easy enough to fix. From A=2 we have NOT(A=1), so [A=1 => W=1,000,000,000], and together with [A=2 => W=1,000,000], by (Ax3) we have A=1, contradiction.
1gRR
I thought about a different axiomatization, which would not have the same consistency problems. Not sure whether this is the right place, but:

Let X be a variable ranging over all possible agents.

(World Axioms)
forall X [Decides(X)=1 => Receives(X)=1000000]
forall X [Decides(X)=2 => Receives(X)=1000]

(Agent Axioms - utility maximization)
BestX = argmax{X} Receives(X)
Decision = Decides(BestX)

Then Decision=1 is easily provable, regardless of whether any specific BestX is known. Does not seem to lead to contradictions.
1gRR
Yes, I see now, thanks. If I attempted to fix this, I would try to change (Ax3) to something like: forall a1,a2,w1,w2 ((A=a1 => W=w1) AND (A=a2 => W=w2) AND (w1>w2)) => NOT (FinalA=a2) where FinalA is the actual decision. But then, why should the world's axiom (Ax2) be defined in terms of A and not FinalA? This seems conceptually wrong... Ok, I'll go read up on the posts :)
[anonymous]

I enjoyed reading and thinking about this, but now I wonder if it's less interesting than it first appears. Tell me if I'm missing the point:

Since world is a function depending on the integers agent outputs, and not on agent itself, and since agent knows world's source code, agent could do the following: instead of looking for proofs shorter than some fixed length that world outputs u if agent outputs a, it could just simulate world and tabulate the values world(1), world(2), world(3),... up to some fixed value world(n). Then output the a < n + 1 for ...

0cousin_it
You're not completely wrong, but the devil is in the details. The world's dependence on the agent's output may indeed be non-obvious, as in Nesov's comment.
3[anonymous]
Are you saying there's a devil in my details that makes it wrong? I don't think so. Can you tell me a tricky thing that world can do that makes my code for agent worse than yours?

About "not obvious how world depends on agent's output," here's an only very slightly more tricky thing agent can do, that is still not as tricky as looking for all proofs of length less than n. It can write a program agent-1 (is this what Nesov is getting at?) that always outputs 1, then compute world(agent-1). It can next write a program agent-2 that always outputs 2, then compute world(agent-2). On and on up to agent-en. Then have agent output the k for which world(agent-k) is largest. Since world does not depend on any agent's source code, if agent outputs k then world(agent) = world(agent-k). Again this is just a careful way of saying "ask world what's the best agent can get, and do that."
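For concreteness, here's a minimal sketch of this brute-force idea, under the assumption (exactly the caveat discussed further down this thread) that world() depends on nothing but the agent's return value; world_given_output() is an illustrative stand-in for substituting a constant output into world():

def world_given_output(a):
  # Newcomb's Problem from the post, with the agent's return value substituted in.
  box1 = 1000
  box2 = 0 if a == 2 else 1000000
  return box2 + (box1 if a == 2 else 0)

def agent():
  # Try each constant output and return the one that makes world() largest.
  return max([1, 2], key=world_given_output)   # returns 1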
[anonymous]

Am I right that if world's output depends on the length of time agent spends thinking, then this solution breaks?

Edit: I guess "time spent thinking" is not a function of "agent", and so world(agent) cannot depend on time spent thinking. Wrong?

Edit 2: Wrong per cousin's comment. World does not even depend on agent, only on agent's output.

1cousin_it
Yes. This solution is only "optimal" if world() depends only on the return value of agent(). If world() inspects the source code of agent(), or measures cycles, or anything like that, all bets are off - it becomes obviously impossible to write an agent() that works well for every possible world(), because world() could just special-case and penalize your solution.

Edit: You weren't wrong! You identified an important issue that wasn't clear enough in the post. This is the right kind of discussion we should be having: in what ways can we relax the restrictions on world() and still hope to have a general solution?
0[anonymous]
One can interpret the phrase "world calls agent and returns utility" with different levels of obtuseness:

1. World looks up agent, examines it, runs it, sees what its output was and what the intermediate steps were, then decides what agent deserves.

2. World looks at a sheet of paper agent has written a number on. Analyzes handwriting. Then decides what agent deserves.

3. World does not even analyze handwriting.

You mean 3, right? That's all I meant by edit 2.
0cousin_it
Yes, that's right.
[anonymous]

Neat. It seems like one could think of the agent as surveying mathematical objects (sets of theorems of the form "agent()==A implies world()==U") and then forcing itself to be in a particular mathematical object (that which contains a particular true statement "agent()==A implies world()==U"). If we're thinking in terms of an ensemble, the agent would already be in a particular mathematical object, so perhaps its procedure could be thought of as forcing itself into a particular mathematical sub-structure.

When I think about this stuff I get the feeling I'm confused on many levels.

you might want to read the following more than once. Even though most of the theorems proved by the agent are based on false premises (because it is obviously logically contradictory for agent() to return a value other than the one it actually returns), the one specific theorem that leads to maximum U must turn out to be correct, because the agent makes its premise true by outputting A.

You were right: reading a second time helped. I figured out what is bothering me about your post. The above seems misspoken, and may lead to an overly constricted conc...

0cousin_it
I don't understand your comment at all. You seem to assume as "trivial" many things that I set out to demystify. Maybe you could formalize your argument?
3Benya
I'm not sure if I understand torekp's comment right, but it seems to me that the issue is the difference between "physical" and "logical" counterfactuals in the following sense:

For concreteness and simplicity, let's suppose physics can be modeled by a function step(state) that takes a state of the world, and returns the state of the world one instant later. Then you can ask what would happen if the current state were slightly different ("suppose there were an apple here in the air at this point in time; then it would fall down"). Call this a "physical" counterfactual.

Before reading Eliezer's arguments, I used to think this could be used to model decision-making: Let stateA be what the state of the world will really be at the point I've made my decision. For every choice X I could make, let stateA_X be the same as stateA except that the future copy of me is replaced by a version that has decided to do X. Now it's well-defined what the future of every stateA_X looks like, and if I have a well-defined utility function, then it's well-defined by which choice(s) this utility is maximized. Of course I cannot calculate this, but, I thought, the actual decision-making process could be seen as an approximation of this.

Except that this would lead to classical decision/game theory, with its unattractive consequences -- the case where Omega has just made two identical physical copies of me, and has us play a true Prisoner's Dilemma, etc. Since I can prove that the other copy of me will do the same thing in every stateA_X (because I'm only replacing myself, not the other copy), it follows that I should defect, because it gives me the higher payoff no matter what the other copy of me does.

Thus "physical" counterfactuals do not make for an acceptable decision theory, and we're forced to look for some notion not of "what would happen if the state of the world were different," but of something like "what would happen if this Turing machine would return a different result?" Th...
0torekp
Benja's got it: I'm interested in physical counterfactuals. They are the type that is involved in the everyday notion of what a person "could" do. As for decision theory, I think that the "logical" counterfactuals should supplement, not supplant, the physical counterfactuals.
3cousin_it
I understand that many people share this opinion, but I don't see much in the way of justification. What arguments do you have, besides intuition? I see no reason to expect the "correct" decision theory to be intuitive. Game theory, for example, isn't.
1torekp
"Logical" counterfactuals either mean those whose antecedents are logically impossible, or those whose consequents follow from the antecedents by logical necessity, or both. In your Newcomb's-problem-solving algorithm example, both features apply. But there are many decisions (or sub-problems within decisions) where neither feature applies. Where the agent is a human being without access to its "source code", it commits no contradiction in supposing "if I did A... but if I did B ..." even though at most one of these lies in the future that causality is heading toward. Additionally, one may be unable to prove that the expected utility of A would be U1 and the expected utility of B would be U2, even though one may reasonably believe both of these. Even when we know most of the relevant physical laws, we lack knowledge of relevant initial or boundary conditions, and if we had those we'd still lack the time to do the calculations. Our knowledge of physical counterfactuals isn't deductive, usually, but it's still knowledge. And we still need to know the physical consequences of various actions - so we'll have to go on using these counterfactuals. Of course, I just used another intuition there! I'm not concerned, though - decision theory and game theory will both be evaluated on a balance of intuitions. That, of course, does not mean that every intuitively appealing norm will be accepted. But those that are rejected will lose out to a more intuitively appealing (combination of) norm(s). In a previous comment, I said that physical counterfactuals follow trivially from physical laws, giving an ideal-gas example. But now I just said that our knowledge of physical counterfactuals is usually non-deductive. Problem? No: in the previous comment I remarked on the relation between truths in attempting to show that counterfactuals needn't be metaphysical monsters. In this comment I have been focusing on the agent's epistemic situation. We can know to a moral certainty that a cau

the agent has absolutely perfect "knowledge" of itself and the world, and as much CPU time as it wants.

I'm uncomfortable with the unbounded CPU time assumption. For example, what if world() computes the payoff in a Prisoner's Dilemma between two identical agent()s? You get non-termination due to a recursion without an escape clause.

0cousin_it
The implementation of agent() given in the post is guaranteed to terminate. It enumerates proofs about itself and the world up to a specified max proof length, it doesn't call itself recursively.
0Perplexed
Ok, I had been mistakenly assuming that it accomplished the proofs by simulating world(), which would necessarily involve recursive calls back to agent(), at least in a multi-player game. My bad. But I am still not convinced that agent() is guaranteed to terminate. Suppose, for example, that it cannot find any proofs of the form "agent()==A implies world()==U"? Also, in my example in which world() is supposed to compute the results of a Prisoner's Dilemma between two identical agents, is the "knowlege" that the two agents are indeed identical supposed to reside in world() or in agent()? I suspect that the right way to extend this thought experiment to the multi-player game scenario is to keep world() in the dark as to whether agent1() and agent2() are identical algorithms. Don't hard-code this fact into the agents, either. I think you have to let the two agents discover that they are identical to each other. The thing is, this "understanding" has to be achieved by proving things about proofs. So, I think that modal logic slips back in anyways.
3cousin_it
It's true that the program might fail to find any proofs at all. Then it could output a random action, I guess? Or just return zero?

Puzzled about your last paragraph. The "knowledge" that the agents are equivalent has to be derivable by inspecting the program as a whole. (This is a very strong condition, but it can be satisfied.) Proving things about proofs is something proof checkers with sufficiently strong axioms can do just fine, no modal logic needed.

But in general, I don't believe this sort of reasoning carries over to multi-player games. If we take the Prisoner's Dilemma and tweak the payoffs slightly to make them non-symmetric, the agents are back to square one. Game theory is inherently harder than individual rationality. I really should write another post on that, because it seems to be a Pons Asinorum that many intelligent people fail to cross.
2Perplexed
Seems to me you understood it and responded appropriately. Yes it is. One important difference is that game-playing agents generally use mixed strategies. So we need to distinguish the reading the strategy source code of another player from reading the pseudo-random-number source code and "seed". Rational agents will want to keep secrets. A second difference is that the "preference structure" of the agent ought to be modeled as data about the agent, rather than data about the world. And some of this is information which a rational agent will wish to conceal as well. So, as a toy model illustrating the compatibility of determinism and free will, your example (i.e. this posting) works fine. But as a way of modeling a flavor of rationality that "solves" one-shot PD and Newcomb-like problems in a "better" way than classical game theory, I think it doesn't (yet?) work.
3cousin_it
Slightly tangential: I have a semi-worked-out notion of a "fair" class of problems which includes Newcomb's Problem because the predictor makes a decision based only on your action, not the "ritual of cognition" that preceded it (to borrow Eliezer's phrase). But the one-shot PD doesn't reside in this "fair" class because you don't cooperate by just predicting the opponent's action. You need to somehow tie the knot of infinite recursion, e.g. with quining or Löbian reasoning, and all such approaches seem to inevitably require inspecting the other guy's "ritual of cognition". This direction of research leads to all sorts of clever tricks, but ultimately it might be a dead end. As for your main point, I completely agree. Any model that cannot accommodate limited information is bound to be useless in practice. I'm just trying to solve simple problems first, see what works and what doesn't in very idealized conditions. Don't worry, I'll never neglect game theory - it was accidentally getting my hands on Ken Binmore's "Fun and Games" that brought me back to math in the first place :-)
0Benya
I don't understand. Do you mean changing the numeric values, leaving the game the same, or changing the ordering, turning it into a different game?

In the first case, it seems to me that the argument from your other post should work just fine [Edit: Wrong, see cousin_it's explanation below] -- roughly: Suppose there's a short proof that agent1()=="C" iff agent2()=="cooperate". Then there's an only slightly longer proof that agent1()=="C" implies that both agents get the C/C payoff, and that agent1()=="D" implies both get the D/D payoff. Therefore, under the supposition, agent1 will choose "C"; similarly, agent2 will choose "cooperate". In other words: If there's a short proof that agent1()=="C" iff agent2()=="cooperate", then agent1()=="C" and agent2()=="cooperate". By the Löbian reasoning in your previous post (linked above), it follows that both agents cooperate.

Am I confused about something? Or is it just that you were in fact talking about changing the payoffs so that the game is not a PD any longer? (I think it still at least bears mentioning if your work already solves the class of problems that Eliezer has been talking about, "where the other agents' decisions can be viewed as functions of one argument, that argument being your own choice in that particular case.")
1cousin_it
This post does solve the class of problems Eliezer was talking about (I call it the "fair" class), but my previous Löbian post used a different algorithm that didn't do utility maximization. That algorithm explicitly said that it cooperates if it can find any proof that agent1()==agent2(), without comparing utilities or trying to prove other statements, and that explicitness made the proof work in the first place. Your proof fails here: It's extremely difficult for the agent to conclude that it will make any particular choice, because any proof of that (starting from your assumptions or any other assumptions) must also prove that the agent won't stumble on any other proofs that lead to yet higher outcomes. This amounts to the agent assuming the consistency of its own proof checker, which is a no go. As far as I can see right now, making the payoffs slightly non-symmetric breaks everything. If you think an individual utility-maximizing algorithm can work in general game-theoretic situations, you have a unique canonical solution for all equilibrium selection and bargaining problems with non-transferable utility. Judging from the literature, finding such a unique canonical solution is extremely unlikely. (The transferable utility case is easy - it is solved by the Shapley value and my Freaky Fairness algorithm.)
0Vladimir_Nesov
When does the agent need to have that knowledge? It just acts on certain conclusions, at the moment it obtains them. It can as well act on conclusions1 about its future conclusions2, the only problem might be that it never reaches those conclusions1 (and so won't act on them)
0Benya
The point is that I relied on the fact that if the agent can prove "If I cooperate, I get 5" and "If I defect, I get 2", then the agent will cooperate. This is a fact, but the agent cannot prove it: if its proof system were inconsistent, then it might (within maxsteps) also be able to prove "If I defect, I get 42," and thus end up defecting. Maybe your comment suggests that the agent should choose to cooperate as soon as it has proven the first two statements above; but as far as the agent knows, if its proof system is inconsistent, it might find the proof of "If I defect, I get 42" earlier than the proof of "If I defect, I get 2." The agent cannot prove that "I will prove C -> 5 and D -> 2" implies "I will cooperate."
1Vladimir_Nesov
The agent can just cooperate as soon as it proves "If I cooperate, I get X" and "If I defect, I get Y", for X>Y. This algorithm is justified, and the agent doesn't need to write its own algorithm, we are writing it.
2cousin_it
I don't completely understand your idea, but Löbian cooperation seems to require enumerating a lot of proofs without stopping early. If Bob can stop early, Alice now needs to prove that Bob will stumble on a specific proof earlier than that, rather than at all. I remember trying and failing to remove the exhaustive search for proofs from my previous post. Or maybe I'm missing something? Maybe you could formalize your idea and give some sort of proof? If it works, it'll be definite progress.
0Benya
IIUC, Vladimir is arguing that we can write an algorithm specifically for the PD that will cooperate if it finds any two proofs "C -> I get X" and "D -> I get Y", for X>Y. I think this algorithm will indeed do the right thing, do you agree? Intuitively, the point is that both in Vladimir's algorithm and your earlier proof, it is sufficient to know something of the form "there exists a proof such that..." to conclude that the program will have a certain behavior (cooperate), whereas in my purported proof, you need to know something of the form "there exists no proof such that...", aka "for all proofs, it holds that..." (which you'd also need in both Vladimir's proposal and your earlier proof to show that one of the agents defects).
0cousin_it
Ah. Now I see where the actual work is being done. Vladimir, sorry, I retract my request for clarification. The proposed algorithm will probably work for the usual Löbian reasons (don't have the energy to write a formal proof right now, though). But it is not utility maximization. This algorithm asymmetrically "privileges" cooperation over defection: finding a proof that cooperation is better will make it cooperate, but finding a proof that defection is better won't make it defect. So in spirit it's pretty similar to my old algorithm, and we probably won't find a solution to game theory on these lines.
0Vladimir_Nesov
Conjecture: If you just look for proofs of statements of the form (action=A1 => U=U1) AND (action=A2 => U=U2) AND (U1>=U2), where U1 and U2 are variables and A1 and A2 are different constants (cooperate and defect in some order), up to some sufficient timeout (after which you act randomly), and perform A1 the moment you see the first one, you'll play correctly against your near-copy (with a different timeout or syntactic differences), and against defecting and cooperating rocks.
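A minimal sketch of the conjectured algorithm, assuming a hypothetical proofs_found_within() that yields proven statements of the paired form above as tuples (A1, U1, A2, U2), in the order the proof search finds them; here the enumerator is hard-coded purely so the example runs:

import random

def proofs_found_within(timeout):
  # Hypothetical proof enumerator; hard-coded with one proven statement of
  # the paired form, "(action=C => U=3) AND (action=D => U=1) AND (3>=1)".
  return [("cooperate", 3, "defect", 1)]

def agent(timeout=1000):
  # Act on the first proven statement found; the U1 >= U2 check is already
  # guaranteed by the statement form and is kept only as a guard.
  for (a1, u1, a2, u2) in proofs_found_within(timeout):
    if u1 >= u2:
      return a1
  return random.choice(["cooperate", "defect"])   # timeout reached: act randomly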
0[anonymous]
Maybe my math intuition is utterly failing me today, but I find this conjecture very hard to believe. Could you try proving it?
0Benya
Now that I disagree with! If the algorithm finds a proof that defection is better, it will defect, because the proof system is consistent, so the algorithm can't in this case find a proof that cooperation is better. (It's true that it cannot prove that it will defect, of course.) In fact I'd say the algorithm privileges defection, because it'll defect unless it finds a reason to cooperate. I don't see any straight-forward generalization to arbitrary games, but a natural generalization to games where every player has only two choices springs to mind. Take as parameters the payoff matrix and the other players' source code; then play minimax unless you can prove that playing the other strategy will result in a higher payoff. [Edit: Er, unless the utility you can prove you get under the assumption that you play the other strategy is higher than etc etc... you know what I mean.] Haven't yet looked into whether this actually does give sensible answers to non-PD games, though. *goes off to do just that*
2cousin_it
What on Earth do you mean by this counterfactual? :-) The obvious extension of your proposed algorithm to games with many players and many strategies: assume any commonly-accepted method of finding a "fair strategy profile", attempt Löbian cooperation to reach it, otherwise play minimax. This doesn't solve fairness/bargaining, and we all are quite familiar with this sort of trick already. Essentially, what you're doing is "porting" old ideas from one enforcement method (quining) to another (Löbian cooperation), hoping that fairness will magically pop out this time. No. Fairness and enforcement are orthogonal issues. I'm really frustrated that people stubbornly fail to see fairness as the serious obstacle to "solving" game theory. There's no cheap accidental way around it. ETA: on rereading, this comment sounded harsh. Please don't take it personally - I just feel that at this point in the discussion, it's time to forcefully state some of my background intuitions so other people can break them down more easily.
2Benya
It's not a counterfactual, it's a logical implication, so it's true! :-) No, seriously, if it wasn't clear, I mean that against every opponent program for which it can prove a higher utility for defection, it will indeed defect; e.g. against the always-cooperating program and the defection rock. My intuition is that the algorithm above precisely does not encode a notion of "fair strategy profile," it just tries to maximize its own utility; I suppose that's a fuzzy notion, though. I wasn't arguing that it solves bargaining. An extension to many strategies should cooperate in the finitely iterated PD without wiring in any more notion about "fairness" than in the algorithm above, and without special-casing IPD -- if that's indeed simple I'm curious how it works, it's not obvious to me.
0cousin_it
Ohhh, I managed to miss that. Thank you. This is really good. About bargaining, I reason like this: if you have a program that's good enough at maximizing utility that it can cooperate with itself in the PD, and make this program play different games against itself, this will give us a canonical way to calculate a "fair" outcome for every game. But this is very unlikely, so a utility-maximizer that plays arbitrary games well is by implication also very unlikely. How do your programs play the iterated PD? Do they have memory? I can't understand what you mean yet.
0Vladimir_Nesov
It seems to me that bargaining will be resolved simultaneously with the problem of deciding under uncertainty (when you can't hope to find a proof of utility being precisely U).
0cousin_it
On one hand, this sounds reasonable a priori. On the other hand, making games "fuzzy" to solve bargaining has been tried, and it's not enough. On the third hand, I feel that some games might be genuinely indeterminate because they abstract too much, they don't include enough information from the real-world situation - information that in practice ends up determining the outcome. For example, (instantaneous) bargaining in the Rubinstein model depends on the players' (temporal) discount rates, and if you forgot to look at them, the instantaneous game seems pretty damn indeterminate.
0Benya
My proposal can't handle the iterated PD, only games where each player has two choices. I was replying to you saying there was an obvious generalization to more than two strategies -- if there is, then we can pass in the payoff matrix of the normal form of a finitely iterated PD (and if it's a sensible generalization, it should cooperate with itself on every round). Your argument about bargaining makes sense, though alas I don't know enough about bargaining to have a really informed opinion. It may be that the idea only works for sufficiently PD-like games, but if we can handle a class of them without special-casing that doesn't seem so bad. It does at least handle Chicken [Edit: and Stag Hunt] correctly, AFAICS, so it's not as if it's PD-only.
0cousin_it
Um, the obvious generalization to many strategies must "privilege" one of the strategies a priori, the same way as your algorithm "privileges" cooperation. Otherwise, what single statement would the proof checker be trying to prove? I don't see a way around that.
0Benya
Ah, sorry, now I understand what's going on. You are saying "there's an obvious generalization, but then you'd have to pick a 'fair' strategy profile that it would privilege." I'm saying "there's no obvious generalization which preserves what's interesting about the two-strategy case." So we're in agreement already.

(I'm not entirely without hope; I have a vague idea that we could order the possible strategies somehow, and if we can prove a higher utility for strategy X than for any strategy that is below X in the ordering, then the agent can prove it will definitely choose X or a strategy that is above it in the ordering. Or something like that. But I need to look at the details much more closely.)
0Benya
I'm not sure when exactly you added this, is this still what you think I'm doing? What I'm trying to do is to give an algorithm that correctly handles games "where the other agents' decisions can be viewed as functions of one argument, that argument being your own choice in that particular case" (same Eliezer quote as upthread), without special-casing for the details of particular games like you did with your solution of the PD. I.e., I'm trying to do what I at first thought your original post did, but what you explained to me it doesn't. AFAICS, if Eliezer's informal arguments are correct, these games have a clear correct solution without solving bargaining. [The ETA's duly noted :-)]
0cousin_it
Can you describe this class of games precisely? And can you define the solution precisely, without referring to our algorithms?
0Benya
I can describe a class of such games and its solution, but I'd hope that if a good decision theoretical agent exists that solves this class, it might also solve some wider class of problems in an intuitively correct way. -- That said, the class is symmetric games, and the solution is the strategy profile that yields the highest utility, among those strategy profiles in which all players choose the same strategy. [More precisely, a good algorithm should choose this algorithm when playing against itself or a similar algorithm, should maximize its utility when playing against an opponent that unconditionally plays some given strategy, and should "do something sensible" in other cases.]
1cousin_it
The quining-cooperator algorithm in the Prisoner's Dilemma forms a "Nash equilibrium in algorithms" against itself. This is a very desirable property to have, which roughly corresponds to "reflective consistency" in single-player decision theory. However, the algorithm you're trying to develop (which includes maximizing utility against an opponent that plays a strategy unconditionally) will not form a "Nash equilibrium in algorithms" against itself, even in symmetric games.

To see this, consider a simple bargaining game, a variant of "dividing the dollar". Each player sees the other's source code and outputs an integer. If the sum of the two integers is less than or equal to 10, both players get the amount in dollars that they asked for. If the sum is greater than 10, both get nothing.

Your algorithm will output 5 when playing against itself. But this means it's not a best reply to itself, because if your opponent is replaced with the simple program "return 9", you will maximize utility and get only 1.
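In the style of the post's world(), the bargaining game might be sketched like this (agentA and agentB are illustrative names, and the returned pair holds the two players' payoffs):

def agentA():
  return 9   # the simple "return 9" program from the example

def agentB():
  return 1   # asking for the remaining 1 is the best reply to agentA

def world():
  a, b = agentA(), agentB()
  if a + b <= 10:
    return (a, b)   # compatible demands: each player gets what they asked for
  return (0, 0)     # demands sum to more than 10: both get nothing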
0Benya
On reflection: I agree that "Nash equilibrium in algorithms" would be desirable, but you seem to find it a more knock-out argument than I do. If we're discussing what kinds of algorithms to enter into something like Axelrod's tournament, then clearly, Nash equilibrium in algorithms would be a very compelling property. But if world() really is (a utility function over) the whole universe, then it's less clear to me that "take 1" is not the right thing to do when encountering a "take 9" rock.

Intuitively, the reason you wouldn't want to "take 1" in this case is that you would not want to give someone else -- Clippy, say -- a motive to leave "take 9" rocks around for you to find. But there's the counter-intuition that:

(a) If Clippy does this as a result of reasoning about how you behave, then that's in fact a different situation (the assumption being that you're reasoning about the source of the whole world, not just about the source of the "take 9" rock in front of you). What you do in this situation influences whether Clippy will have left behind a "take 9" rock, so your decision algorithm should not be able to conclude that you maximize utility by "taking 1." [I'm assuming a UDT-like agent which one can think of as taking the source of the world and returning the actions to perform in different contingencies, so it does not make sense to say that we cannot influence what Clippy does because our sensory input tells us that Clippy has "already" left behind a "take 9" rock.]

(b) If Clippy does not do this as a result of actually reasoning about how you behave, then if you "take 5" (say), Clippy will still have left a "take 9" rock behind -- by assumption, there's no logical causality from what you do to what Clippy does.

I can't formalize this argument, and I'm not sure that (b) especially is the right way of thinking about this problem, but this counter-intuition seems at least as compelling to me as the original intuition. Am I missing something?

-- All this said, I n...
3cousin_it
Hello again. User Perplexed pointed me to Nash's 1953 paper Two-person cooperative games that seems to give a unique "fair" solution for all two-person competitive games in our setting. I've been thinking how to spin the algorithm in the paper into a flavor of utility maximization, but failing so far.
6Perplexed
The key is to understand that when two players form a coalition, they create a new entity distinct from its members, which has its own revealed preferences and its own utility to maximize. That is, you need to step up a level and imagine each member of a coalition as making a deal with the coalition itself, rather than simply dealing with the other members. In joining a coalition, a player gives up some or all of his decision-making power in exchange for a promise of security.

Define coalescence utility for player A as the amount of extra A utility A gets because the pair plays the cooperation game rather than the threat game. If you wish, think of this as a compensation paid to A by the collective in exchange for his cooperation in the objectives of the collective. Define the utility of cooperation to the 'collective' (i.e. to the union of A and B considered as an entity at a different level than A and B themselves) as the sum of the logarithms of the coalescence utilities of A and B. Then, the collective maximizes its own utility by maximizing the product of the coalescence utilities (which means the same as maximizing the sum of the logs). This approach of taking the logarithm of individual utilities to get components of corporate or collective utility extends nicely to coalitions with more than 2 members.

For some analogies in other (biological) kinds of optimization processes where taking the logarithm is the trick that makes it all work, see my favorite paper ever or this classic paper from Dick Lewontin: R. C. Lewontin and D. Cohen. On population growth in a randomly varying environment. Proceedings of the National Academy of the Sciences, 62:1056–1060, 1969.
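A toy illustration of the "maximize the product of coalescence utilities" rule; the threat-game payoffs and candidate cooperative payoffs below are made up for illustration:

import math

threat_point = (1.0, 2.0)   # payoffs A and B get in the threat game
candidates = [(3.0, 3.0), (2.5, 4.0), (4.0, 2.2)]   # feasible cooperative payoffs

def collective_utility(outcome):
  # Sum of logs of the coalescence utilities (each player's gain over the
  # threat game), which is maximized exactly when their product is maximized.
  gain_a = outcome[0] - threat_point[0]
  gain_b = outcome[1] - threat_point[1]
  return math.log(gain_a) + math.log(gain_b)

best = max(candidates, key=collective_utility)   # -> (2.5, 4.0)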
0Benya
True; thanks for pointing that out.
0Vladimir_Nesov
I don't see what theory you refer to here, in saying "proof theory is consistent". We have an algorithm that proves some theorems in PA. The only theory under discussion is PA. If the algorithm finds a proof that defection is better, it defects, period.
0Benya
If PA is the system the agent uses, then by "the proof system," I mean PA. I was speaking generically, in case you'd prefer to use ZFC or PA + Con(PA) or whatever. Er, are you saying that as soon as the algorithm finds such a proof, it stops searching and returns "defect"? That will not work, because then the Löbian reasoning won't go through, for the reason cousin_it gave. [Because PA cannot prove Con(PA), it also cannot prove that "there's a short proof that cooperate is better" implies "the agent cooperates" -- for all PA knows, PA might be inconsistent, and there might be an even shorter proof that defect is better.] Or do you mean that the algorithm looks through the whole list of proofs, and if it never finds a proof that cooperate is better, it defects? Then we're in agreement; I was simply pointing out that in this case, we know that if the algorithm proves "defect is better," then it will not cooperate, because we know that PA is consistent, so if the algorithm proves "defect is better" then it will not prove "cooperate is better."
0Vladimir_Nesov
It can enumerate all shorter proofs, and see if any of them imply a different action of the agent. (Note that PA is reflexive, that is, it can prove that any given finite set of its statements is logically consistent.)
0Benya
Can you give some link or citation about "reflexive"? I can't figure out what this means. It seems to me that if PA can prove every finite subset of its axioms consistent, then it can prove itself consistent: "If PA is inconsistent, then there would be a statement A and a proof of A and a proof of NOT A; but each proof only uses a finite set of axioms; the union of two finite sets is finite; so there is a finite set of axioms of PA that proves both A and NOT A; but every finite subset of PA is consistent, contradiction. Ergo, PA is consistent." Clearly, I'm misunderstanding something. (PA can't talk about sets in general, but finite sets are fine, and "finite subset of PA" just means "finite set of propositions, all of which are in the enumeration that is PA.")
4Vladimir_Nesov
It can't prove that every finite set of its statements is consistent, but for every finite set of its statements it can prove its consistency. (See this chapter, Corollary 8.)
0Benya
I don't understand. The relevant step in the argument, as far as I understand it, requires an inference from "there exists a proof in n steps that cooperate is better" to "the agent cooperates". It seems to me that "enumerating all shorter proofs" would requiring knowing the precise proof (or at least its length), not just the statement "there exists such a proof". But I'm probably not following what, exactly, you have in mind; could you expand on your argument? [I should say I was wrong to say "it ... cannot prove that" above -- all I can say is that one particular technique for showing that it can prove it doesn't go through.]
0Vladimir_Nesov
The setting is sufficiently confusing at this point that I give up (I'm not sure what we are talking about). I'll try to work on a proof (not the proof) of something along the lines of this conjecture.
0[anonymous]
If the agent checked all proofs of length less than X, and none of them prove statement T, there are no proofs of T shorter than X, even if the theory is inconsistent and T is provable (therefore, by longer proofs).
0Benya
I.e., that the agent won't find (contradictory) proofs that the same actions will lead to different, even higher utilities. Right, thanks.
0Vladimir_Nesov
(The following sounds too confusing, but making it clearer doesn't seem straightforward. Posting just in case communication succeeds on some level.) Payoffs are part of preference, preference is (aspect of) world program, and world program determines the outcome. By changing the payoffs, you change the world program, and where decision-making is based on logical inferences about it, you can thus fundamentally change the shape of possible inferences. "Nice" solutions mean parametrization of the solution by elements of the problem statement, but here parametrization is bound to talk about what can be logically inferred, so expect no nice solutions.
0Vladimir_Nesov
Sounds right (but it's a rather strong supposition).
0Benya
Sorry, I wasn't being as clear as I should have been: The above is true as a metatheorem, but uninteresting, because as you say, the supposition is rather strong. What I was trying to argue is that in the agent's proof system (for concreteness, say PA), when assuming the supposition, the conclusion follows, so by the deduction theorem, PA proves "the supposition implies the conclusion". I wanted to use this for a proof-size-bounded Löbian argument à la cousin_it. But this does not work because in PA, the conclusion does not follow when assuming the supposition, as pointed out by cousin_it.

Would I be right in thinking that this implies that you can't apply "could" to world()? I.e., "Our universe could have a green sky" wouldn't be meaningful without some sort of metaverse program that references what we would think is the normal world program?

Or have the utility computed also depend on some set of facts about the universe?

I think both of those would require that the agent have some way of determining the facts about the universe (I suppose they could figure it out from the source code, but that seems sort of illegitimate to me).

0cousin_it
This post only gives a way to apply "could" to yourself. This doesn't imply you can never apply "could" to the world, we might find another way to do that someday.
2atucker
It seems like there are two kinds of "could" at work here, one that applies to yourself and is based on consistent action to utility relationships, and another that involves uncertainty as to what actions cause what utilities (based on counterfactuals about the universe).
0cousin_it
Thanks, good point about uncertainty. I'm making a mental note to see how it relates to counterfactuals.

Following is a generalization to take into account logical uncertainty (inability to prove difficult statements, which can lead in particular to inability to prove any statements of the necessary kind).

Instead of proving statements of the form "agent()==A implies world()==U", let's prove statements of the form "(X and agent()==A) implies world()==U", where X are unsubstantiated assumptions. There are obviously provable statements of the new form. Now, instead of looking for a statement with highest U, let's for each A look at the sets ...

5cousin_it
Ouch. X could simply say "false" and the whole thing explodes in your face.
2Johnicholas
There are logics (relevant or paraconsistent) which do not have the classical "ex falso quodlibet".
0Vladimir_Nesov
True, didn't notice that. I still feel there must be a workaround.
2Johnicholas
If you sum over all explored X (and implicitly treat unexplored X as if they are zero probability) then you can be arbitrarily far from optimal Bayesian behavior. See Fitelson's "Bayesians sometimes cannot ignore even very implausible theories." http://fitelson.org/hiti.pdf
0cousin_it
Thanks for that link. Good stuff.
0atucker
What if instead of assuming your Xs, you got them out of the program?

To simplify, imagine a universe with only one fact. A function fact() returns 1 if that fact is true, and 0 if it's not. Now, agent can prove statements of the form "fact()==1 and agent()==A implies world()==U". This avoids the problem of plugging in false: agent doesn't start from assuming X as a statement of its own from which it can derive things; it looks at all the possible outputs of fact(), then sees how that combines with its action to produce utility.

Trivially, if agent has access to the code of fact(), then it can just figure out what that result would be, and it would know what actions correspond to what utilities. Otherwise, agent() could use a prior distribution or magically infer some probability distribution over whether fact is true or not, then choose its action based on normal expected utility calculations.

Note: The above departs from the structure of the world() function given here in that it assumes some way of interacting with world (or its source code) to find out fact() without actually getting the code of fact. Perhaps world is called multiple times, and agent() can observe the utility it accumulates and use that (combined with its "agent()==A and fact()==1 implies world() == U" proofs) to figure out whether fact is true or not?

Interestingly, while this allows agent to apply "could" to the world ("It could rain tomorrow"), it doesn't allow the agent to apply "could" to things that don't affect world's utility calculation ("People could have nonmaterial souls that can be removed without influencing their behavior in any way").
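A toy sketch of the expected-utility version of this idea; the proved conditional utilities and the prior over fact() below are made up for illustration:

p_fact_true = 0.7   # prior probability that fact() == 1

# Proved statements "fact()==f and agent()==A implies world()==U", keyed by (f, A).
proved_utility = {
  (1, 1): 100, (1, 2): 50,
  (0, 1): 0,   (0, 2): 80,
}

def agent():
  def expected_utility(a):
    return (p_fact_true * proved_utility[(1, a)]
            + (1 - p_fact_true) * proved_utility[(0, a)])
  return max([1, 2], key=expected_utility)   # returns 1 here (EU 70 vs. 59)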
[anonymous]

Does this differ from, or contradict something in, how EY reduces "could" in his free will solution? I thought the could-issue was resolved.