I think you're invoking TDT-style reasoning here, before the agent has self-modified into TDT.
I already said that agents which start out as pure CDT won't modify into pure TDTs - they'll only cooperate if someone gets a peek at their source code after they self-modified. However, humans, at least, are not pure CDT agents - they feel at least the impulse to one-box on Newcomb's Problem if you raise the stakes high enough.
This has nothing to do with evolutionary contexts of honor and cooperation and defection and temptation, and everything to do with our evolved instincts governing abstract logic and causality, which is what governs what sort of source code you think has what sort of effect. Even unreasonably pure CDT agents recognize that if they modify their source code at 7am, they should modify to play TDT against any agent that has looked at their source code after 7am. To humans, who are not pure CDT agents, the idea that you should play essentially the same way if Omega glimpsed your source code at exactly 6:59am, seems like common sense given the intuitions we have about logic and causality and elegance and winning. If you're going to all the trouble to invent TDT anyway, it seems like a waste of effort to two-box against Omega if he perfectly saw your source code 5 seconds before you self-modified. (These being the kind of ineffable meta-decision considerations that we both agree are important, but which are hard to formalize.)
Besides, I'm assuming a world where agents can't know or guess each others' source codes.
You are guessing their source code every time you argue that they'll choose D. If I can't make you see this as an instance of "guessing the other agent's source code" then indeed you will not see the large correlations at the start point, and if the agents start out highly uncorrelated then the rare TDT agents will choose the correct maximizing action, D. They will be rare because, by assumption in this case, most agents end up choosing to cooperate or defect for all sorts of different reasons, rather than by following highly regular lines of logic in nearly all cases - let alone the same line of logic that kept on predictably ending up at D.
There's a wide variety of cases where philosophers go astray by failing to recognize an instance of an everyday concept as an abstract concept. For example, they say in one breath that "God is unfalsifiable", and in the next breath talk about how God spoke to them in their heart, because they don't recognize "God spoke to me in my heart" as an instance of "God allegedly made something observable happen". Philosophers talk about qualia being epiphenomenal in one breath, and then in the next speak of how they know themselves to be conscious, because they don't recognize this self-observation as an instance of "something making something else happen" aka "cause and effect". The only things recognized as matching the formal-sounding phrase "cause and effect" are big formal things officially labeled "causal", not just stuff that makes other stuff happen.
In the same sense, you have this idea about modeling other agents as this big official affair that requires poring over their source code with a magnifying glass and then furthermore verifying that they can't change it while you aren't looking.
You need to recognize the very thought processes you are carrying out right now in arguing that just about anyone will choose D as an instance of guessing the outputs of the other agents' source codes and moreover guessing that most such codes and outputs are massively logically correlated.
This is witnessed by the fact that if we did get to see some interstellar transactions, and you saw that the first three transactions were (C, C), you would say, "Wow, guess Eliezer was right" and expect the next one to be (C, C) as well. (And of course if I witnessed three cases of (D, D) I would say "Guess I was wrong.") Even though the initial conditions are not physically correlated, we expect a correlation. What is this correlation, then? It is a logical correlation. We expect different species to end up following similar lines of reasoning, that is, performing similar computations, like factorizing 123,456 in spacelike separated galaxies.
It occurs to me that the problem is important enough that even if we can reach intuitive agreement, we should still do the math. But it doesn't help to solve the wrong problem, so do you think the following is the right formalization of the problem?
It commonly acknowledged here that current decision theories have deficiencies that show up in the form of various paradoxes. Since there seems to be little hope that Eliezer will publish his Timeless Decision Theory any time soon, I decided to try to synthesize some of the ideas discussed in this forum, along with a few of my own, into a coherent alternative that is hopefully not so paradox-prone.
I'll start with a way of framing the question. Put yourself in the place of an AI, or more specifically, the decision algorithm of an AI. You have access to your own source code S, plus a bit string X representing all of your memories and sensory data. You have to choose an output string Y. That’s the decision. The question is, how? (The answer isn't “Run S,” because what we want to know is what S should be in the first place.)
Let’s proceed by asking the question, “What are the consequences of S, on input X, returning Y as the output, instead of Z?” To begin with, we'll consider just the consequences of that choice in the realm of abstract computations (i.e. computations considered as mathematical objects rather than as implemented in physical systems). The most immediate consequence is that any program that calls S as a subroutine with X as input, will receive Y as output, instead of Z. What happens next is a bit harder to tell, but supposing that you know something about a program P that call S as a subroutine, you can further deduce the effects of choosing Y versus Z by tracing the difference between the two choices in P’s subsequent execution. We could call these the computational consequences of Y. Suppose you have preferences about the execution of a set of programs, some of which call S as a subroutine, then you can satisfy your preferences directly by choosing the output of S so that those programs will run the way you most prefer.
A more general class of consequences might be called logical consequences. Consider a program P’ that doesn’t call S, but a different subroutine S’ that’s logically equivalent to S. In other words, S’ always produces the same output as S when given the same input. Due to the logical relationship between S and S’, your choice of output for S must also affect the subsequent execution of P’. Another example of a logical relationship is an S' which always returns the first bit of the output of S when given the same input, or one that returns the same output as S on some subset of inputs.
In general, you can’t be certain about the consequences of a choice, because you’re not logically omniscient. How to handle logical/mathematical uncertainty is an open problem, so for now we'll just assume that you have access to a "mathematical intuition subroutine" that somehow allows you to form beliefs about the likely consequences of your choices.
At this point, you might ask, “That’s well and good, but what if my preferences extend beyond abstract computations? What about consequences on the physical universe?” The answer is, we can view the physical universe as a program that runs S as a subroutine, or more generally, view it as a mathematical object which has S embedded within it. (From now on I’ll just refer to programs for simplicity, with the understanding that the subsequent discussion can be generalized to non-computable universes.) Your preferences about the physical universe can be translated into preferences about such a program P and programmed into the AI. The AI, upon receiving an input X, will look into P, determine all the instances where it calls S with input X, and choose the output that optimizes its preferences about the execution of P. If the preferences were translated faithfully, the the AI's decision should also optimize your preferences regarding the physical universe. This faithful translation is a second major open problem.
What if you have some uncertainty about which program our universe corresponds to? In that case, we have to specify preferences for the entire set of programs that our universe may correspond to. If your preferences for what happens in one such program is independent of what happens in another, then we can represent them by a probability distribution on the set of programs plus a utility function on the execution of each individual program. More generally, we can always represent your preferences as a utility function on vectors of the form <E1, E2, E3, …> where E1 is an execution history of P1, E2 is an execution history of P2, and so on.
These considerations lead to the following design for the decision algorithm S. S is coded with a vector <P1, P2, P3, ...> of programs that it cares about, and a utility function on vectors of the form <E1, E2, E3, …> that defines its preferences on how those programs should run. When it receives an input X, it looks inside the programs P1, P2, P3, ..., and uses its "mathematical intuition" to form a probability distribution P_Y over the set of vectors <E1, E2, E3, …> for each choice of output string Y. Finally, it outputs a string Y* that maximizes the expected utility Sum P_Y(<E1, E2, E3, …>) U(<E1, E2, E3, …>). (This specifically assumes that expected utility maximization is the right way to deal with mathematical uncertainty. Consider it a temporary placeholder until that problem is solved. Also, I'm describing the algorithm as a brute force search for simplicity. In reality, you'd probably want it to do something cleverer to find the optimal Y* more quickly.)
Example 1: Counterfactual Mugging
Note that Bayesian updating is not done explicitly in this decision theory. When the decision algorithm receives input X, it may determine that a subset of programs it has preferences about never calls it with X and are also logically independent of its output, and therefore it can safely ignore them when computing the consequences of a choice. There is no need to set the probabilities of those programs to 0 and renormalize.
So, with that in mind, we can model Counterfactual Mugging by the following Python program:
def P(coin):
AI_balance = 100
if coin == "heads":
if S("heads") == "give $100":
AI_balance -= 100
if coin == "tails":
if Omega_Predict(S, "heads") == "give $100":
AI_balance += 10000
The AI’s goal is to maximize expected utility = .5 * U(AI_balance after P("heads")) + .5 * U(AI_balance after P("tails")). Assuming U(AI_balance)=AI_balance, it’s easy to determine U(AI_balance after P("heads")) as a function of S’s output. It equals 0 if S(“heads”) == “give $100”, and 100 otherwise. To compute U(AI_balance after P("tails")), the AI needs to look inside the Omega_Predict function (not shown here), and try to figure out how accurate it is. Assuming the mathematical intuition module says that choosing “give $100” as the output for S(“heads”) makes it more likely (by a sufficiently large margin) for Omega_Predict(S, "heads") to output “give $100”, then that choice maximizes expected utility.
Example 2: Return of Bayes
This example is based on case 1 in Eliezer's post Priors as Mathematical Objects. An urn contains 5 red balls and 5 white balls. The AI is asked to predict the probability of each ball being red as it as drawn from the urn, its goal being to maximize the expected logarithmic score of its predictions. The main point of this example is that this decision theory can reproduce the effect of Bayesian reasoning when the situation calls for it. We can model the scenario using preferences on the following Python program:
def P(n):
urn = ['red', 'red', 'red', 'red', 'red', 'white', 'white', 'white', 'white', 'white']
history = []
score = 0
while urn:
i = n%len(urn)
n = n/len(urn)
ball = urn[i]
urn[i:i+1] = []
prediction = S(history)
if ball == 'red':
score += math.log(prediction, 2)
else:
score += math.log(1-prediction, 2)
print (score, ball, prediction)
history.append(ball)
Here is a printout from a sample run, using n=1222222:
-1.0 red 0.5
-2.16992500144 red 0.444444444444
-2.84799690655 white 0.375
-3.65535182861 white 0.428571428571
-4.65535182861 red 0.5
-5.9772799235 red 0.4
-7.9772799235 red 0.25
-7.9772799235 white 0.0
-7.9772799235 white 0.0
-7.9772799235 white 0.0
S should use deductive reasoning to conclude that returning (number of red balls remaining / total balls remaining) maximizes the average score across the range of possible inputs to P, from n=1 to 10! (representing the possible orders in which the balls are drawn), and do that. Alternatively, S can approximate the correct predictions using brute force: generate a random function from histories to predictions, and compute what the average score would be if it were to implement that function. Repeat this a large number of times and it is likely to find a function that returns values close to the optimum predictions.
Example 3: Level IV Multiverse
In Tegmark's Level 4 Multiverse, all structures that exist mathematically also exist physically. In this case, we'd need to program the AI with preferences over all mathematical structures, perhaps represented by an ordering or utility function over conjunctions of well-formed sentences in a formal set theory. The AI will then proceed to "optimize" all of mathematics, or at least the parts of math that (A) are logically dependent on its decisions and (B) it can reason or form intuitions about.
I suggest that the Level 4 Multiverse should be considered the default setting for a general decision theory, since we cannot rule out the possibility that all mathematical structures do indeed exist physically, or that we have direct preferences on mathematical structures (in which case there is no need for them to exist "physically"). Clearly, application of decision theory to the Level 4 Multiverse requires that the previously mentioned open problems be solved in their most general forms: how to handle logical uncertainty in any mathematical domain, and how to map fuzzy human preferences to well-defined preferences over the structures of mathematical objects.
Added: For further information and additional posts on this decision theory idea, which came to be called "Updateless Decision Theory", please see its entry in the LessWrong Wiki.