# Towards a New Decision Theory

It is commonly acknowledged here that current decision theories have deficiencies that show up in the form of various paradoxes. Since there seems to be little hope that Eliezer will publish his Timeless Decision Theory any time soon, I decided to try to synthesize some of the ideas discussed in this forum, along with a few of my own, into a coherent alternative that is hopefully not so paradox-prone.

I'll start with a way of framing the question. Put yourself in the place of an AI, or more specifically, the decision algorithm of an AI. You have access to your own source code S, plus a bit string X representing all of your memories and sensory data. You have to choose an output string Y. That’s the decision. The question is, how? (The answer isn't “Run S,” because what we want to know is what S should be in the first place.)

Let’s proceed by asking the question, “What are the consequences of S, on input X, returning Y as the output, instead of Z?” To begin with, we'll consider just the consequences of that choice in the realm of abstract computations (i.e. computations considered as mathematical objects rather than as implemented in physical systems). The most immediate consequence is that any program that calls S as a subroutine with X as input will receive Y as output, instead of Z. What happens next is a bit harder to tell, but supposing that you know something about a program P that calls S as a subroutine, you can further deduce the effects of choosing Y versus Z by tracing the difference between the two choices in P’s subsequent execution. We could call these the computational consequences of Y. If you have preferences about the execution of a set of programs, some of which call S as a subroutine, then you can satisfy your preferences directly by choosing the output of S so that those programs will run the way you most prefer.

A more general class of consequences might be called logical consequences. Consider a program P’ that doesn’t call S, but a different subroutine S’ that’s logically equivalent to S. In other words, S’ always produces the same output as S when given the same input. Due to the logical relationship between S and S’, your choice of output for S must also affect the subsequent execution of P’. Another example of a logical relationship is an S' which always returns the first bit of the output of S when given the same input, or one that returns the same output as S on some subset of inputs.
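As a toy illustration of the two kinds of consequence, the sketch below fixes what S returns on input X and traces how that changes the run of a program that calls S directly and of one that only calls a logically equivalent S'. Every name in it (`make_S`, `make_S_prime`, `P`, `P_prime`, `trace`) is hypothetical and chosen only for this example; treating the choice as a lookup table is an assumption made to keep the sketch short.

```python
def make_S(choice):
    """Build S from a table mapping each input X to the chosen output."""
    def S(x):
        return choice[x]
    return S

def make_S_prime(choice):
    """Build S', written differently but logically equivalent to S."""
    def S_prime(x):
        # Linear scan instead of a dictionary lookup: same input-output map.
        for key, value in choice.items():
            if key == x:
                return value
        raise KeyError(x)
    return S_prime

def P(S, x):
    """Calls S directly, so its run shows a computational consequence."""
    return "P branch " + S(x)

def P_prime(S_prime, x):
    """Never calls S, yet its run depends on S's output through S'."""
    return "P' branch " + S_prime(x)

def trace(output, x="X"):
    """Results of running P and P' if S returns `output` on input x."""
    choice = {x: output}
    return P(make_S(choice), x), P_prime(make_S_prime(choice), x)

print(trace("Y"))
print(trace("Z"))
```

Switching the output from Y to Z changes both traces, even though `P_prime` never touches `S` itself.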

In general, you can’t be certain about the consequences of a choice, because you’re not logically omniscient. How to handle logical/mathematical uncertainty is an open problem, so for now we'll just assume that you have access to a "mathematical intuition subroutine" that somehow allows you to form beliefs about the likely consequences of your choices.

At this point, you might ask, “That’s well and good, but what if my preferences extend beyond abstract computations? What about consequences on the physical universe?” The answer is, we can view the physical universe as a program that runs S as a subroutine, or more generally, view it as a mathematical object which has S embedded within it. (From now on I’ll just refer to programs for simplicity, with the understanding that the subsequent discussion can be generalized to non-computable universes.) Your preferences about the physical universe can be translated into preferences about such a program P and programmed into the AI. The AI, upon receiving an input X, will look into P, determine all the instances where it calls S with input X, and choose the output that optimizes its preferences about the execution of P. If the preferences were translated faithfully, then the AI's decision should also optimize your preferences regarding the physical universe. This faithful translation is a second major open problem.

What if you have some uncertainty about which program our universe corresponds to? In that case, we have to specify preferences for the entire set of programs that our universe may correspond to. If your preferences for what happens in one such program are independent of what happens in another, then we can represent them by a probability distribution on the set of programs plus a utility function on the execution of each individual program. More generally, we can always represent your preferences as a utility function on vectors of the form <E1, E2, E3, …> where E1 is an execution history of P1, E2 is an execution history of P2, and so on.

These considerations lead to the following design for the decision algorithm S. S is coded with a vector <P1, P2, P3, ...> of programs that it cares about, and a utility function on vectors of the form <E1, E2, E3, …> that defines its preferences on how those programs should run. When it receives an input X, it looks inside the programs P1, P2, P3, ..., and uses its "mathematical intuition" to form a probability distribution P_Y over the set of vectors <E1, E2, E3, …> for each choice of output string Y. Finally, it outputs a string Y* that maximizes the expected utility Sum P_Y(<E1, E2, E3, …>) U(<E1, E2, E3, …>). (This specifically assumes that expected utility maximization is the right way to deal with mathematical uncertainty. Consider it a temporary placeholder until that problem is solved. Also, I'm describing the algorithm as a brute force search for simplicity. In reality, you'd probably want it to do something cleverer to find the optimal Y* more quickly.)
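The brute-force version of this design can be sketched in a few lines. This is only a minimal sketch under stated assumptions: the input X is taken as already fixed, the "mathematical intuition" module is stood in for by an ordinary function `intuition(y, e)`, and the set of history vectors is small enough to enumerate. The names `choose_output`, `toy_intuition`, and `toy_utility`, and all the numbers, are hypothetical.

```python
from itertools import product

def choose_output(outputs, histories, intuition, utility):
    """Brute-force search for the output Y* with the highest expected utility.

    outputs   -- candidate output strings Y
    histories -- execution-history vectors (E1, E2, ...)
    intuition -- intuition(y, e) gives P_Y(e), a stand-in for the
                 "mathematical intuition" module (an assumption)
    utility   -- utility(e) gives U(<E1, E2, ...>)
    """
    histories = list(histories)

    def expected_utility(y):
        return sum(intuition(y, e) * utility(e) for e in histories)

    return max(outputs, key=expected_utility)

# Toy setup: two programs, each of which ends in "good" or "bad".
HISTORIES = list(product(["good", "bad"], repeat=2))

def toy_intuition(y, e):
    """Hypothetical beliefs: output "Y" makes each program likelier to end well."""
    p_good = 0.9 if y == "Y" else 0.2
    prob = 1.0
    for outcome in e:
        prob *= p_good if outcome == "good" else 1 - p_good
    return prob

def toy_utility(e):
    """Hypothetical utility: the number of programs that end well."""
    return e.count("good")

best = choose_output(["Y", "Z"], HISTORIES, toy_intuition, toy_utility)
print(best)
```

In this toy setup the expected utility is about 1.8 for "Y" and 0.4 for "Z", so the search returns "Y". Nothing here addresses where a real `intuition` function would come from, which is the open problem noted above.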

#### Example 1: Counterfactual Mugging

Note that Bayesian updating is not done explicitly in this decision theory. When the decision algorithm receives input X, it may determine that a subset of programs it has preferences about never calls it with X and are also logically independent of its output, and therefore it can safely ignore them when computing the consequences of a choice. There is no need to set the probabilities of those programs to 0 and renormalize.

So, with that in mind, we can model Counterfactual Mugging by the following Python program:

    def P(coin):
        AI_balance = 100
        if coin == "heads":
            if S("heads") == "give $100":
                AI_balance -= 100
        if coin == "tails":
            if Omega_Predict(S, "heads") == "give $100":
                AI_balance += 10000

The AI’s goal is to maximize expected utility = .5 * U(AI_balance after P("heads")) + .5 * U(AI_balance after P("tails")). Assuming U(AI_balance)=AI_balance, it’s easy to determine U(AI_balance after P("heads")) as a function of S’s output. It equals 0 if S(“heads”) == “give $100”, and 100 otherwise. To compute U(AI_balance after P("tails")), the AI needs to look inside the Omega_Predict function (not shown here), and try to figure out how accurate it is. Assuming the mathematical intuition module says that choosing “give $100” as the output for S(“heads”) makes it more likely (by a sufficiently large margin) for Omega_Predict(S, "heads") to output “give $100”, then that choice maximizes expected utility.
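To make the expected-utility comparison concrete, here is a small sketch that replaces the unspecified Omega_Predict with a single number: the probability that Omega's prediction matches what S("heads") actually returns. That accuracy parameter, the `REFUSE` label, and the helper names are assumptions introduced for illustration, not part of the model above.

```python
GIVE, REFUSE = "give $100", "give $0"

def balance_after(coin, s_output, omega_prediction):
    """Final AI_balance of P(coin), given S("heads") and Omega's prediction."""
    balance = 100
    if coin == "heads" and s_output == GIVE:
        balance -= 100
    if coin == "tails" and omega_prediction == GIVE:
        balance += 10000
    return balance

def expected_utility(s_output, accuracy):
    """Expected balance when Omega predicts S correctly with prob. `accuracy`."""
    wrong = REFUSE if s_output == GIVE else GIVE
    heads = balance_after("heads", s_output, None)
    tails = (accuracy * balance_after("tails", s_output, s_output)
             + (1 - accuracy) * balance_after("tails", s_output, wrong))
    return 0.5 * heads + 0.5 * tails

for accuracy in (0.5, 0.99):
    print(accuracy,
          expected_utility(GIVE, accuracy),
          expected_utility(REFUSE, accuracy))
```

Under this simplified model, giving beats refusing exactly when 10000 * accuracy > 100 + 10000 * (1 - accuracy), that is, when the accuracy exceeds 0.505. That is one way to read "by a sufficiently large margin".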

#### Example 2: Return of Bayes

This example is based on case 1 in Eliezer's post Priors as Mathematical Objects. An urn contains 5 red balls and 5 white balls. The AI is asked to predict the probability of each ball being red as it is drawn from the urn, its goal being to maximize the expected logarithmic score of its predictions. The main point of this example is that this decision theory can reproduce the effect of Bayesian reasoning when the situation calls for it. We can model the scenario using preferences on the following Python program:

    import math

    def P(n):
        urn = ['red', 'red', 'red', 'red', 'red', 'white', 'white', 'white', 'white', 'white']
        history = []
        score = 0
        while urn:
            # Decode n as a mixed-radix number to pick the next ball.
            i = n % len(urn)
            n = n // len(urn)  # integer division, so n stays an int
            ball = urn[i]
            urn[i:i+1] = []
            prediction = S(history)
            if ball == 'red':
                score += math.log(prediction, 2)
            else:
                score += math.log(1-prediction, 2)
            print(score, ball, prediction)
            history.append(ball)

Here is a printout from a sample run, using n=1222222:

    -1.0 red 0.5
    -2.16992500144 red 0.444444444444
    -2.84799690655 white 0.375
    -3.65535182861 white 0.428571428571
    -4.65535182861 red 0.5
    -5.9772799235 red 0.4
    -7.9772799235 red 0.25
    -7.9772799235 white 0.0
    -7.9772799235 white 0.0
    -7.9772799235 white 0.0

S should use deductive reasoning to conclude that returning (number of red balls remaining / total balls remaining) maximizes the average score across the range of possible inputs to P, from n=1 to 10! (representing the possible orders in which the balls are drawn), and do that. Alternatively, S can approximate the correct predictions using brute force: generate a random function from histories to predictions, and compute what the average score would be if it were to implement that function. Repeat this a large number of times and it is likely to find a function that returns values close to the optimum predictions.
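The claim about the ratio rule can be checked numerically. The sketch below averages the log score over all 252 colour orderings rather than all 10! values of n, on the assumption that every ordering is equally likely (each one corresponds to the same number of ball permutations). The function names and the `predict(history, reds, whites)` signature are hypothetical; the S in the program above takes only `history`.

```python
import math
from itertools import combinations

def average_score(predict, reds=5, whites=5):
    """Mean log2 score of `predict` over all equally likely colour orderings."""
    total = reds + whites
    scores = []
    for red_positions in combinations(range(total), reds):
        history, score = [], 0.0
        for draw in range(total):
            ball = "red" if draw in red_positions else "white"
            p = predict(history, reds, whites)
            score += math.log2(p if ball == "red" else 1 - p)
            history.append(ball)
        scores.append(score)
    return sum(scores) / len(scores)

def ratio_rule(history, reds, whites):
    """Predict (red balls remaining) / (total balls remaining)."""
    reds_left = reds - history.count("red")
    balls_left = reds + whites - len(history)
    return reds_left / balls_left

def always_half(history, reds, whites):
    """Baseline that ignores the history entirely."""
    return 0.5

print(average_score(ratio_rule))
print(average_score(always_half))
```

The ratio rule averages about -7.977, which is log2(1/252) and matches the final score in the sample run above, while the constant 0.5 predictor scores -10. This compares only two candidate rules; it is a sanity check, not a proof that the ratio rule is optimal.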

#### Example 3: Level IV Multiverse

In Tegmark's Level 4 Multiverse, all structures that exist mathematically also exist physically. In this case, we'd need to program the AI with preferences over all mathematical structures, perhaps represented by an ordering or utility function over conjunctions of well-formed sentences in a formal set theory. The AI will then proceed to "optimize" all of mathematics, or at least the parts of math that (A) are logically dependent on its decisions and (B) it can reason or form intuitions about.

I suggest that the Level 4 Multiverse should be considered the default setting for a general decision theory, since we cannot rule out the possibility that all mathematical structures do indeed exist physically, or that we have direct preferences on mathematical structures (in which case there is no need for them to exist "physically"). Clearly, application of decision theory to the Level 4 Multiverse requires that the previously mentioned open problems be solved in their most general forms: how to handle logical uncertainty in any mathematical domain, and how to map fuzzy human preferences to well-defined preferences over the structures of mathematical objects.

## Comments (142)

1) Congratulations: moving to logical uncertainty and considering your decision's consequences to be the consequence of that logical program outputting a particular decision, is what I would call the key insight in moving to (my version of) timeless decision theory. The rest of it (that is, the work I've done already) is showing that this answer is the only reflectively consistent one for a certain class of decision problems, and working through some of the mathematical inelegancies in mainstream decision theory that TDT seems to successfully clear up and render elegant (the original Newcomb's Problem being only one of them).

Steve Rayhawk also figured out that it had to do with impossible possible worlds.

Neither of you have arrived at (published?) some important remaining observations about how to integrate uncertainty about computations into decision-theoretic reasoning; so if you want to completely preempt my would-be PhD thesis you've still got a bit more work to do.

Why didn't you mention earlier that your timeless decision theory mainly had to do with logical uncertainty? It would have saved people a lot of time trying to guess what you were talking about.

Looking at my 2001 post, it seems that I already had the essential idea at that time, but didn't pursue it very far. I think it was because (A) I wasn't as interested in AI back then, and (B) I thought an AI ought to be able to come up with these ideas by itself.

I still think (B) is true, BTW. We should devote some time and resources to thinking about how we are solving these problems (and coming up with questions in the first place). Finding that algorithm is perhaps more important than finding a reflectively consistent decision algorithm, if we don't want an AI to be stuck with whatever mistakes we might make.

Because I was thinking in terms of saving it for a PhD thesis or some other publication, and if you get that insight the rest follows pretty fast - did for me at least. Also I was using it as a test for would-be AI researchers: "Here's Newcomblike problems, here's why the classical solution doesn't work for self-modifying AI, can you solve this FAI problem which I know to be solvable?"

And yet you found a reflectively consistent decision algorithm long before you found a decision-system-algorithm-finding algorithm. That's not coincidence. The latter problem is much harder. I suspect that even an informal understanding of parts of it would mean that you could find timeless decision theory as easily as falling backward off a tree - you just run the algorithm in your own head. So with very high probability you are going to start seeing through the object-level problems before you see through the meta ones. Conversely I am EXTREMELY skeptical of people who claim they have an algorithm to solve meta problems but who still seem confused about object problems. Take metaethics, a solved problem: what are the odds that someone who still thought metaethics was a Deep Mystery could write an AI algorithm that could come up with a correct metaethics? I tried that, you know, and in retrospect it didn't work.

The meta algorithms are important but by their very nature, knowing even a little about the meta-problem tends to make the object problem much less confusing, and you will progress on the object problem faster than on the meta problem. Again, that's not saying the meta problem is unimportant. It's just saying that it's really hard to end up in a state where meta has really truly run ahead of object, though it's easy to get illusions of having done so.

It's interesting that we came upon the same idea from different directions. For me it fell out of Tegmark's multiverse. What could consequences be, except logical consequences, if all mathematical structures exist? The fact that you said it would take a long series of posts to explain your idea threw me off, and I was kind of surprised when you said congratulations. I thought I might be offering a different solution. (I spent days polishing the article in the expectation that I might have to defend it fiercely.)

Umm, I haven't actually found a reflectively consistent decision algorithm yet, since the proposal has huge gaps that need to be filled. I have little idea how to handle logical uncertainty in a systematic way, or whether expected utility maximization makes sense in that context.

The rest of your paragraph makes good points. But I'm not sure what you mean by "metaethics, a solved problem". Can you give a link?

One way to approach the meta problem may be to consider the meta-meta problem: why did evolution create us with so much "common sense" on these types of problems? Why do we have the meta algorithm apparently "built in" when it doesn't seem like it would have offered much advantage in the ancestral environment?

http://wiki.lesswrong.com/wiki/Metaethics_sequence

(Observe that this page was created after you asked the question. And I'm quite aware that it needs a better summary - maybe "A Natural Explanation of Metaethics" or the like.)

"Decide as though your decision is about the output of a Platonic computation" is the key insight that started me off - not the only idea - and considering how long philosophers have wrangled this, there's the whole edifice of justification that would be needed for a serious exposition. Maybe come Aug 26th or thereabouts I'll post a very quick summary of e.g. integration with Pearl's causality.

Now that I have some idea what Eliezer and Nesov were talking about, I'm still a bit confused about AI cooperation. Consider the following scenario: Omega appears and asks two human players (who are at least as skilled as Eliezer and Nesov) to each design an AI. The AIs will each undergo some single-player challenges like Newcomb's Problem and Counterfactual Mugging, but there will be a one-shot PD between the two AIs at the end, with their source codes hidden from each other. Omega will grant each human player utility equal to the total score of his or her AI. Will the two AIs play cooperate with each other?

I don't think it's irrational for human players to play defect in one-shot PD. So let's assume these two human players would play defect in one-shot PD. Then they should also program their AIs to play defect, even if they have to add an exception to their timeless/updateless decision algorithms. But exceptions are bad, so what's the right solution here?

I'm still quite confused, but I'll report my current thoughts in case someone can help me out. Suppose we take it as an axiom that an AI's decision algorithm shouldn't need to contain any hacks to handle exceptional situations. Then the following "exceptionless" decision algorithm seems to pop out immediately: do what my creator would want me to do. In other words, upon receiving input X, S computes the following: suppose S's creator had enough time and computing power to create a giant lookup table that contains an optimal output for every input S might encounter, what would the entry for X be? Return that as the output.

This algorithm correctly solves Counterfactual Mugging, since S's creator would want it to output "give $100", since "give $100" would have maximized the creator's expected utility at the time of coding S. It also solves the problem posed by Omega in the parent comment. It seems to be reflectively consistent. But what is the relationship between this "exceptionless" algorithm and the timeless/updateless decision algorithm?

There are two parts to AGI: consequentialist reasoning and preference.

Humans have feeble consequentialist abilities, but can use computers to implement huge calculations, if the problem statement can be entered in the computer. For example, you can program the material and mechanical laws in an engineering application, enter a building plan, and have the computer predict what's going to happen to it, or what parameters should be used in the construction so that the outcome is as required. That's the power outside human mind, directed by the correct laws, and targeted at the formally specified problem.

When you consider AGI in isolation, it's like an engineering application with a random building plan: it can powerfully produce a solution, but it's not a solution to the problem you need solving. Nonetheless, this part is essential when you do have an ability to specify the problem. And that's the AI's algorithm, one aspect of which is decision-making. It's separate from the problem statement that comes from human nature.

For an engineering program, you can say that the computer is basically doing what a person would do if they had crazy amount of time and machine patience. But that's because a person can know both problem statement and laws of inference formally, which is the way it was programmed in the computer in the first place.

With human preference, the problem statement isn't known explicitly to people. People can use preference, but can't state this whole object explicitly. A moral machine would need to work with preference, but human programmers can't enter it, and neither can they do what a machine would be able to do given a formal problem statement, because humans can't know this problem statement, it's too big. It could exist in a computer explicitly, but it can't be entered there by programmers.

So, here is a dilemma: problem statement (preference) resides in the structure of human mind, but the strong power of inference doesn't, while the strong power of inference (potentially) exists in computers outside human minds, where the problem statement can't be manually transmitted. Creating FAI requires these components to meet in the same system, but it can't be done in a way other kinds of programming are done.

Something to think about.

This is the clearest statement of the problem of FAI that I have read to date.

Suppose that, before S's creator R started coding, Omega started an open game of counterfactual mugging with R, and that R doesn't know this, but S does. According to S's inputs, Omega's coin came up tails, so Omega is waiting for $100.

Does S output "give $0"? If Omega had started the game of counterfactual mugging after S was coded, then S would output "give $100".

Suppose that S also knows that R would have coded S with the same source code, even if Omega's coin had come up heads. Would S's output change? Should S's output change (should R have coded S so that this would change S's output)? How should S decide, from its inputs, which R is the creator with the expected utility S's outputs should be optimal for? Is it the R in the world where Omega's coin came up heads, or the R in the world where Omega's coin came up tails?

If there is not an inconsistency in S's decision algorithm or S's definition of R, is there an inconsistency in R's decision algorithm or R's own self-definition?

I'm having trouble understanding this. You're saying that Omega flipped the coin before R started coding, but R doesn't know that, or the result of the coin flip, right? Then his P(a counterfactual mugging is ongoing) is very low, and P(heads | a counterfactual mugging is ongoing) = P(tails | a counterfactual mugging is ongoing) = 1/2. Right?

In that case, his expected utility at the time of coding is maximized by S outputting "give $100" upon encountering Omega. It seems entirely straightforward, and I don't see what the problem is...

I don't know how to define what R "would want" or would think was "optimal".

What lookup table would R create? If R is a causal decision theorist, R might think: "If I were being counterfactually mugged and Omega's coin had come up heads, Omega would have already made its prediction about whether S would output 'give $100' on the input 'tails'. So, if I program S with the rule 'give $100 if tails', that won't cause Omega to give me $10000. And if the coin came up tails, that rule would lose me $100. So I will program S with the rule 'give $0 if tails'."

R's expected utility at the time of coding may be maximized by the rule "give $100 if tails", but R makes decisions by the conditional expected utilities given each of Omega's possible past predictions, weighted by R's prior beliefs about those predictions. R's conditional expected utilities are both maximized by the decision to program S to output "give $0".

[I deleted my earlier reply, because I was still confused about your questions.]

If, according to R's decision theory, the most preferred choice involves programming S to output "give $0", then that is what S would do.

It might be easier to think of the ideal S as consisting of a giant lookup table created by R itself given infinite time and computing power. An actual S would try to approximate this ideal to the best of its abilities.

R would encode its own decision theory, prior, utility function, and memory at the time of coding into S, and have S optimize for that R.

Sorry. I wasn't trying to ask my questions as questions about how R would make decisions. I was asking questions to try to answer your question about the relationship between exceptionless and timeless decision-making, by pointing out dimensions of a map of ways for R to make decisions. For some of those ways, S would be "timeful" around R's beliefs or time of coding, and for some of those ways S would be less timeful.

I have an intuition that there is a version of reflective consistency which requires R to code S so that, if R was created by another agent Q, S would make decisions using Q's beliefs even if Q's beliefs were different from R's beliefs (or at least the beliefs that a Bayesian updater would have had in R's position), and even when S or R had uncertainty about which agent Q was. But I don't know how to formulate that intuition to something that could be proven true or false. (But ultimately, S has to be a creator of its own successor states, and S should use the same theory to describe its relation to its past selves as to describe its relation to R or Q. S's decisions should be invariant to the labeling or unlabeling of its past selves as "creators". These sequential creations are all part of the same computational process.)

"Do what my creator would want me to do"?

We could call that "pass the buck" decision theory ;-)

Here's my conjecture: An AI using the Exceptionless Decision Theory (XDT) is equivalent to one using TDT if its creator was running TDT at the time of coding. If the creator was running CDT, then it is not equivalent to TDT, but it is reflectively consistent, one-boxes in Newcomb, and plays defect in one-shot PD.

And in case it wasn't clear, in XDT, the AI computes the giant lookup table its creator would have chosen using the creator's own decision theory.

AI's creator was running BRAINS, not a decision theory. I don't see how "what the AI's creator was running" can be a meaningful consideration in a discussion of what constitutes a good AI design. Beware naturalistic fallacy.

One AI can create another AI, right? Does my conjecture make sense if the creator is an AI running some decision theory? If so, we can extend XDT to work with human creators, by having some procedure to approximate the human using a selection of possible DTs, priors, and utility functions. Remember that the goal in XDT is to minimize the probability that the creator would want to add an exception on top of the basic decision algorithm of the AI. If the approximation is close enough, then this probability is minimal.

ETA: I do not claim this is good AI design, merely trying to explore the implications of different ideas.

The problem of finding the right decision theory is a problem of Friendliness, but for a different reason than finding a powerful inference algorithm fit for an AGI is a problem of Friendliness.

"Incompleteness" of decision theory, such as what we can see in CDT, seems to correspond to inability of AI to embody certain aspects of preference, in other words the algorithm lacks expressive power for its preference parameter. Each time an agent makes a mistake, you can reinterpret it as meaning that it just prefers it this way in this particular case. Whatever preference you "feed" to the AI with a wrong decision theory, the AI is going to distort by misinterpreting, losing some of its aspects. Furthermore, the lack of reflective consistency effectively means that the AI continues to distort its preference as it goes along. At the same time, it can still be powerful in consequentialist reasoning, being as formidable as a complete AGI, implementing the distorted version of preference that it can embody.

The resulting process can be interpreted as an AI running "ultimate" decision theory, but with a preference not in perfect fit with what it should've been. If at any stage you have a singleton that owns the game but has a distorted preference, whether due to incorrect procedure for getting the preference instantiated, or incorrect interpretation of preference, such as a mistaken decision theory as we see here, there is no returning to better preference.

More generally, what "could" be done, what AI "could" become, is a concept related to free will, which is a consideration of what happens to a system in isolation, not a system one with reality: you consider a system from the outside, and see what happens to it if you perform this or that operation on it, this is what it means that you could do one operation or the other, or that the events could unfold this way or the other. When you have a singleton, on the other hand, there is no external point of view on it, and so there is no possibility for change. The singleton is the new law of physics, a strategy proven true [*].

So, if you say that the AI's predecessor was running a limited decision theory, this is a damning statement about what sort of preference the next incarnation of AI can inherit. The only significant improvement (for the fate of preference) an AGI with any decision theory can make is to become reflectively consistent, to stop losing the ground. The resulting algorithm is as good as the ultimate decision theory, but with preference lacking some aspects, and thus behavior indistinguishable (equivalent) from what some other kinds of decision theories would produce.

__

[*] There is a fascinating interpretation of truth of logical formulas as the property of corresponding strategies in a certain game to be the winning ones. See for example

S. Abramsky (2007). `A Compositional Game Semantics for Multi-Agent Logics of Imperfect Information'. In J. van Benthem, D. Gabbay, & B. Lowe (eds.), Interactive Logic, vol. 1 of Texts in Logic and Games, pp. 11-48. Amsterdam University Press. (PDF)

An AI running causal decision theory will lose on Newcomblike problems, be defected against in the Prisoner's Dilemma, and otherwise undergo behavior that is far more easily interpreted as "losing" than "having different preferences over final outcomes".

The AI that starts with CDT will immediately rewrite itself with AI running the ultimate decision theory, but that resulting AI will have distorted preferences, which is somewhat equivalent to the decision theory it runs having special cases for the time AI got rid of CDT (since code vs. data (algorithm vs. preference) is strictly speaking an arbitrary distinction). The resulting AI won't lose on these thought experiments, provided they don't intersect the peculiar distortion of its preferences, where it indeed would prefer to "lose" according to preference-as-it-should-have-been, but win according to its distorted preference.

A TDT AI consistently acts so as to end up with a million dollars. A CDT AI acts to win a million dollars in some cases, but in other cases ends up with only a thousand. So in one case we have a compressed preference over outcomes, in the other case we have a "preference" over the exact details of the path including the decision algorithm itself. In a case like this I don't use the word "preference" so as to say that the CDT AI wants a thousand dollars on Newcomb's Problem, I just say the CDT AI is losing. I am unable to see any advantage to using the language otherwise - to say that the CDT AI wins with peculiar preference is to make "preference" and "win" so loose that we could use it to refer to the ripples in a water pond.

It's the TDT AI resulting from CDT AI's rewriting of itself that plays these strange moves on the thought experiments, not CDT AI. The algorithm of idealized TDT is parameterized by "preference" and always gives the right answer according to that "preference". To stop reflective inconsistency, CDT AI is going to rewrite itself with something else. That something else can be characterized in general as a TDT AI with crazy preferences, that prefers $1000 in the Newcomb's thought experiments set before midnight October 15, 2060, or something of the sort, but works OK after that. The preference of TDT AI to which a given AGI is going to converge can be used as denotation of that AGI's preference, to generalize the notion of TDT preference on systems that are not even TDT AIs, and further to the systems that are not even AIs, in particular on humans or humanity.

These are paperclips of preference, something that seems clearly not right as a reflection of human preference, but that is nonetheless a point in the design space that can be filled in particular by failures to start with the right decision theory.

I think an AI running CDT would immediately replace itself by an AI running XDT (or something equivalent to it). If there is no way to distinguish between an AI running XDT and an AI running TDT (prior to a one-shot PD), the XDT AI can't do worse than a TDT AI. So CDT is not losing, as far as I can tell (at least for an AI capable of self-modification).

ETA: I mean a XTD AI can't do worse than a TDT AI within the same world. But a world full of XTD will do worse than a world full of TDT.

The parent comment may be of some general interest, but it doesn't seem particularly helpful in this specific case. Let me back off and rephrase the question so that perhaps it makes more sense:

Can our two players, Alice and Bob, design their AIs based on TDT, such that it falls out naturally (i.e. without requiring special exceptions) that their AIs will play defect against each other, while one-boxing Newcomb's Problem?

If so, how? In order for one AI using TDT to defect, it has to either believe (A) that the other AI is not using TDT, or (B) that it is using TDT but their decisions are logically independent anyway. Since we're assuming in this case that both AIs do use TDT, (A) requires that the players program their AIs with a falsehood, which is no good. (B) might be possible, but I don't see how.

If the answer is no, then it seems that TDT isn't the final answer, and we have to keep looking for another one. Is there another way out of this quandary?

You're saying that TDT applied directly by both AIs would result in them cooperating; you would rather that they defect even though that gives you less utility; so you're looking for a way to make them lose? Why?

If both AIs use the same decision theory and this is common knowledge, then the only options are (C,C) or (D,D). Pick whichever you prefer. If they use different decision theories, then you can give yours pure TDT and tell it truthfully that you've tricked the other player into unconditionally cooperating. What else is there?

You (and they) can't assume that, as they could be in different states even with the same algorithm that operates on those states, and so will output different decisions, even if from the problem statement it looks like everything significant is the same.

The problem is that the two human players' minds aren't logically related. Each human player in this game wants his AI to play defect, because their decisions are logically independent of each other's. If TDT doesn't allow a player's AI to play defect, then the player would choose some other DT that does, or add an exception to the decision algorithm to force the AI to play defect.

I explained here why humans should play defect in one-shot PD.

Your statement above is implicitly self-contradictory. How can you generalize over all the players in one fell swoop, applying the same logic to each of them, and yet say that the decisions are "logically independent"? The decisions are *physically* independent. Logically, they are *extremely* dependent. We are arguing over what is, in general, the "smart thing to do". You assume that "the smart thing to do" is defect, and so all the players will defect. Doesn't smell like logical independence to me.

More importantly, the whole calculation about independence versus dependence is better carried out by an AI than by a human programmer, which is what TDT is for. It's not for cooperating. It's for determining the conditional probability of the other agent cooperating given that a TDT agent in your epistemic state plays "cooperate". If you know that the other agent knows (up to common knowledge) that you are a TDT agent, and the other agent knows that you know (up to common knowledge) that it is a TDT agent, then it is an obvious strategy to cooperate with a TDT agent if and only if it cooperates with you under that epistemic condition.

The TDT strategy is not "Cooperate with other agents known to be TDTs". The TDT strategy for the one-shot PD, in full generality, is "Cooperate if and only if ('choosing' that the output of this algorithm under these epistemic conditions be 'cooperate') makes it sufficiently more likely that (the output of the probability distribution of opposing algorithms under its probable epistemic conditions) is 'cooperate', relative to the relative payoffs."
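As a rough illustration of the rule just stated, here is a minimal sketch that compares the expected payoff of each move when the opponent's probability of cooperating depends on your own algorithm's output. The function name, the two conditional probabilities, and the payoff values (T=5, R=3, P=1, S=0) are assumptions chosen for illustration; they do not come from the thread, and this is not a formalization of TDT itself.

```python
def should_cooperate(p_coop_if_c, p_coop_if_d, payoffs):
    """Return True if cooperating has the higher expected payoff.

    p_coop_if_c: assumed probability the opponent cooperates given that
                 this algorithm outputs 'cooperate'.
    p_coop_if_d: assumed probability the opponent cooperates given that
                 this algorithm outputs 'defect'.
    payoffs:     dict with the usual PD entries T > R > P > S.
    """
    T, R, P, S = (payoffs[k] for k in ("T", "R", "P", "S"))
    eu_cooperate = p_coop_if_c * R + (1 - p_coop_if_c) * S
    eu_defect = p_coop_if_d * T + (1 - p_coop_if_d) * P
    return eu_cooperate > eu_defect


# Hypothetical payoff matrix, for illustration only.
PAYOFFS = {"T": 5, "R": 3, "P": 1, "S": 0}

# Strong logical dependence: the opponent's move mirrors ours.
print(should_cooperate(1.0, 0.0, PAYOFFS))
# No dependence: the opponent's move is unaffected by ours.
print(should_cooperate(0.5, 0.5, PAYOFFS))
```

With full dependence the comparison is R against P, so cooperation wins; with no dependence, defection dominates whatever the opponent does, which matches the claim that a TDT defects against something whose output is logically independent of its own.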

Under conditions where a TDT plays one-shot true-PD against something that is not a TDT and not logically dependent on the TDT's output, the TDT will of course defect. A TDT playing against a TDT which falsely believes the former case to hold, will also of course defect. Where you appear to depart from my visualization, Wei Dai, is in thinking that logical dependence can only arise from detailed examination of the other agent's source code, because otherwise the agent has a motive to defect. You need to recognize your belief that what players do is in general likely to correlate, as a case of "logical dependence". Similarly the *original decision* to change your own source code to include a special exception for defection under particular circumstances, is what a TDT agent would model - if it's probable that the causal source of an agent thought it could get away with that special exception and programmed it in, the TDT will defect.

You've got logical dependencies in your mind that you are not explicitly recognizing as "logical dependencies" that can be explicitly processed by a TDT agent, I think.

If you already know something about the other player, if you know it exists, there is already some logical dependence between you two. How to leverage this minuscule amount of dependence is another question, but there seems to be no conceptual distinction between this scenario and where the players know each other very well.

I don't think so. Each player wants to do the Winning Thing, and there is only one Winning Thing (their situations are symmetrical), so if they're both good at Winning (a significantly lower bar than successfully building an AI with their preferences), their decisions *are* related.

So what you're saying is, given two players who can successfully build AIs with their preferences (and that's common knowledge), they will likely (surely?) play cooperate in one-shot PD against each other. Do I understand you correctly?

Suppose what you say is correct, that the Winning Thing *is* to play cooperate in one-shot PD. Then what happens when some player happens to get a brain lesion that causes him to unconsciously play defect without affecting his AI building abilities? He would take everyone else's lunch money. Or if he builds his AI to play defect while everyone else builds their AIs to play cooperate, his AI then takes over the world. I hope that's a sufficient reductio ad absurdum.

Hmm, I just noticed that you're only saying "their decisions are related" and not explicitly drawing the conclusion that they should play cooperate. Well, that's fine: as long as they would play defect in one-shot PD, they would also program their AIs to play defect in one-shot PD (assuming each AI can't prove its source code to the other). That's all I need for my argument.

Yes.

Good idea. Hmm. It sounds like this is the same question as: what if, instead of "TDT with defection patch" and "pure TDT", the available options are "TDT with defection patch" and "TDT with tiny chance of defection patch"? Alternately: what if the abstract computations that are the players have a tiny chance of being embodied in such a way that their embodiments always defect on one-shot PD, whatever the abstract computation decides?

It seems to me that Lesion Man just got lucky. This doesn't mean people can win by giving themselves lesions, because that's deliberately defecting / being an abstract computation that defects, which is bad. Whether everyone else should defect / program their AIs to defect due to this possibility depends on the situation; I would think they usually shouldn't. (If it's a typical PD payoff matrix, there are many players, and they care about absolute, not relative, scores, defecting isn't worth it even if it's guaranteed there'll be one Lesion Man.)
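The parenthetical claim can be checked with a small worked example. The sketch below assumes a simple model that is not spelled out in the comment: each ordinary player is paired uniformly at random with one other player, exactly one of the players is a Lesion Man who always defects, and the payoff values (R=3, P=1, S=0) are hypothetical. It compares the expected absolute score of an ordinary player when all ordinary players cooperate versus when they all defect.

```python
def policy_payoffs(n_players, R=3, P=1, S=0):
    """Expected payoff per game for an ordinary (non-lesioned) player.

    Assumes exactly one Lesion Man who always defects and uniformly
    random pairing among n_players (both are illustrative assumptions).
    Returns (payoff if all ordinary players cooperate,
             payoff if all ordinary players defect).
    """
    p_meet_lesion = 1 / (n_players - 1)
    # Cooperators get R against each other and S against Lesion Man.
    all_cooperate = (1 - p_meet_lesion) * R + p_meet_lesion * S
    # If everyone defects, every pairing yields P, Lesion Man or not.
    all_defect = P
    return all_cooperate, all_defect


coop, defect = policy_payoffs(n_players=100)
print(coop > defect)  # cooperation still pays when players are many
```

Under these assumptions the cost of occasionally meeting Lesion Man shrinks as the number of players grows, so the shared cooperate policy keeps its advantage; with only two players the advantage disappears, because the single opponent is the Lesion Man.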

This still sounds disturbingly like envying Lesion Man's mere choices – but the effect of the lesion isn't really *his choice* (right?). It's only the illusion of unitary agency, bounded at the skin rather than inside the brain, that makes it seem like it is. The Cartesian dualism of this view (like AIXI, dropping an anvil on its own head) is also disturbing, but I suspect the essential argument is still sound, even as it ultimately needs to be more sophisticated.

I guess my reductio ad absurdum wasn't quite sufficient. I'll try to think this through more thoroughly and carefully. Let me know which steps, if any, you disagree with, or are unclear, in the following line of reasoning.

Hmm, this exercise has cleared a lot of my own confusion. Obviously a lot more work needs to be done to make the reasoning rigorous, but hopefully I've gotten the gist of it right.

ETA: According to this line of argument, your hypothesis that all skilled AI designers play cooperate in one-shot PD against each other is equivalent to saying that skilled AI designers have minds malleable enough to run TDT, and have a meta-DT that causes them to switch to running TDT. But I do not see an evolutionary reason for this, so if it's true, it must be true by luck. Do you agree?

It looks like in this discussion you assume that switching to "TDT" (it's highly uncertain what this means) immediately gives the decision to cooperate in "true PD". I don't see why it should be so. Summarizing my previous comments, exactly what the players know about each other, exactly in what way they know it, may make their decisions go either way. That the players switch from CDT to some kind of more timeless decision theory doesn't determine the answer to be "cooperate", it merely opens up the possibility that previously was decreed irrational, and I suspect that what's important in the new setting for making the decision go either way isn't captured properly in the problem statement of "true PD".

Also, the way you treat "agents with TDT" seems more appropriate for "agents with Cooperator prefix" from cousin_it's Formalizing PD. And this is a simplified thing far removed from a complete decision theory, although a step in the right direction.

Btw, agree with steps 3-9.

It's too elegant to arise by evolution, and it also deals with *one-shot* PDs with no knock-on effects which is an extremely nonancestral condition - evolution by its nature deals with events that repeat many times; sexual evolution by its nature deals with organisms that interbreed; so "one-shot true PDs" is in general a condition unlikely to arise with sufficient frequency that evolution deals with it at all.

This may perhaps embody the main point of disagreement. A self-modifying CDT which, at 7am, expects to encounter a future Newcomb's Problem or Parfit's Hitchhiker in which the Omega gets a glimpse at the source code *after* 7am, will modify to use TDT for all decisions in which Omega glimpses the source code after 7am. A bit of "common sense" would tell you to just realize that "you should have been using TDT from the beginning regardless of when Omega glimpsed your source code and the whole CDT thing was a mistake" but this kind of common sense is not embodied in CDT. Nonetheless, TDT is a unique reflectively consistent answer for a certain class of decision problems, and a wide variety of initial points is likely to converge to it. The *exact* proportion, which determines under what conditions of payoff and loss stranger-AIs will cooperate with each other, is best left up to AIs to calculate, I think.

Possibly. But it has to be an unpredictable brain lesion - one that is expected to happen with very low frequency. A predictable decision to do this just means that TDTs defect against you. If enough AI-builders do this then TDTs in general defect against each other (with a frequency threshold dependent on relative payoffs) because they have insufficient confidence that they are playing against TDTs rather than special cases in code.

*No one* is talking about building AIs *to cooperate*. You do not want AIs that *cooperate* on the one-shot true PD. You want AIs that cooperate if and only if the opponent cooperates if and only if your AI cooperates. So yes, if you defect when others expect you to cooperate, you can pwn them; but why do you expect that AIs would expect you to cooperate (conditional on their cooperation) if "the smart thing to do" is to build an AI that defects? AIs with good epistemic models would then just expect other AIs that defect.

The comment you responded to was mostly obsoleted by this one, which represents my current position. Please respond to that one instead. Sorry for making you waste your time!

I don't understand why you want the AIs to defect against each other rather than cooperating with each other.

Are you attached to this particular failure of causal decision theory for some reason? What's wrong with TDT agents cooperating in the Prisoner's Dilemma and everyone living happily ever after?

Come on, of course I don't *want* that. I'm saying that is the inevitable outcome under the rules of the game I specified. It's just like if I said "I don't want two human players to defect in one-shot PD, but that is what's going to happen."

ETA: Also, it may help if you think of the outcome as the human players defecting against each other, with the AIs just carrying out their strategies. *The human players are the real players in this game.*

No, I can't think of a reason why I would be.

There's nothing wrong with that, and it may yet happen, if it turns out that the technology for proving source code can be created. But if you can't prove that your source code is some specific string, if the only thing you have to go on is that you and the other AI must both use the same decision theory due to convergence, *that isn't enough*.

Sorry if I'm repeating myself, but I'm hoping one of my explanations will get the point across...

I don't believe that is true. It's perfectly conceivable that two human players would cooperate.

Yes, I see the possibility now as well, although I still don't think it's very likely. I wrote more about it in http://lesswrong.com/lw/15m/towards_a_new_decision_theory/11lx

Can Nesov's AI correctly guess what AI Eliezer would probably have built and vice versa? Clearly I wouldn't want to build an AI which, if it believes Nesov's AI is *accurately* modeling it, and cooperating conditional on its own cooperation, would fail to cooperate. And in the *true* PD - which couldn't possibly be against Nesov - I wouldn't build an AI that would cooperate under any other condition. In either case there's no reason to use anything except TDT throughout.

No, I'm assuming that the AIs don't have enough information or computational power to predict the human players' choices. Suppose a human-created AI were to meet a paperclipper that was designed by a long-lost alien race. Wouldn't you program the human AI to play defect against the paperclipper, assuming that there is no way for the AIs to prove their source codes to each other? The two AIs ought to think that they are both using the same decision theory (assuming there is just one obviously correct theory that they would both converge to). But that theory can't be TDT, because if it were TDT, then the human AI would play cooperate, which you would have overridden if you knew it was going to happen.

Let me know if that still doesn't make sense.

Wei, the whole point of TDT is that it's not necessary for me to insert special cases into the code for situations like this. Under any situation in which I should program the AI to defect against the paperclipper, I can write a *simple* TDT agent and *it* will decide to defect against the paperclipper.

TDT has that much meta-power in it, at least. That's the *whole point of using it.*

(Though there are other cases - like the timeless decision problems I posted about that I still don't know how to handle - where I can't make this statement about the TDT I have in hand; but this is because I can't handle those problems in general.)

...How much power, exactly?

Given an arbitrary, non-symmetric, one-shot, two-player game with non-transferable utility (your payoffs are denominated in human lives, the other guy's in paperclips), and given that it's common knowledge to both agents that they're using identical implementations of your "TDT", how do we calculate which outcome gets played?

So, what is that simple TDT agent? You seem to have ignored my argument that it can't exist, but if you can show me the actual agent (and convince me that it would defect against the paperclipper if that's not obvious) then of course that would trump my arguments.

ETA: Never mind, I figured this out myself. See step 11 of http://lesswrong.com/lw/15m/towards_a_new_decision_theory/11lx

This problem statement oversimplifies the content of information available to each player about the other player. Depending on what the players know, either course of action could be preferable. The challenge of a good decision theory is to formally describe what these conditions are.

...So 100 comments later the damn problem just won't *stay* solved? Gee, that's why you have to formalize things: so you can point to the formal result and say *done*.

There's *lots* of mentions of Timeless Decision Theory (TDT) in this thread - as though it refers to something real. However, AFAICS, the reference is to unpublished material by Eliezer Yudkowsky.

I am not clear about how anyone is supposed to make sense of all these references before that material has been published. To those who use "TDT" as though they know what they are talking about - and who are *not* Eliezer Yudkowsky - what exactly is it that you think you are talking about?

To celebrate, here are some pictures of Omega!

(except the models that are palette swaps of Ultima)

2) The key problem in Drescher's(?) Counterfactual Mugging is that after you actually *see* the coinflip, your posterior probability of "coin comes up heads" is no longer 0.5 - so if you compute the answer after seeing the coin, the answer is not the reflectively consistent one. I still don't know how to handle this - it's not in the class of problems to which my TDT corresponds.

Please note that the problem persists if we deal in a non-quantum coin, like an unknown binary digit of pi.

I thought the answer Vladimir Nesov already posted solved Counterfactual Mugging for a quantum coin?

In this solution, there is no belief updating; there is just decision theory. (All probabilities are "timestamped" to the beliefs of the agent's creator when the agent was created.) This means that the use of Bayesian belief updating with expected utility maximization may be just an approximation that is only relevant in special situations which meet certain independence assumptions around the agent's actions. In the more general Newcomb-like family of situations, computationally efficient decision algorithms might use a family of approximations more general than Bayesian updating.
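To make the contrast concrete, here is a minimal sketch of Counterfactual Mugging evaluated two ways: once with the probabilities fixed at the agent's creation, and once with probabilities updated after seeing the coin. The dollar amounts ($100 cost, $10,000 reward), the heads/tails assignment, and all function names are assumptions for illustration; the thread does not fix them.

```python
# Assumed setup: on tails Omega asks for COST; on heads Omega pays REWARD
# if and only if the agent's policy is to pay on tails.
PRIOR = {"heads": 0.5, "tails": 0.5}  # creation-time probabilities
REWARD, COST = 10_000, 100            # hypothetical stakes


def payoff(policy_pays, coin):
    """Outcome of following a fixed policy in the world where `coin` landed."""
    if coin == "heads":
        return REWARD if policy_pays else 0
    return -COST if policy_pays else 0


def policy_value(policy_pays, probs):
    """Expected payoff of a whole policy under the given probabilities."""
    return sum(p * payoff(policy_pays, coin) for coin, p in probs.items())


def best_policy(probs):
    """Pick 'pay' (True) or 'refuse' (False), whichever scores higher."""
    return max((True, False), key=lambda pays: policy_value(pays, probs))


# No updating: the policy is scored with the original probabilities.
print(best_policy(PRIOR))
# Updating on having seen tails: heads is now treated as ruled out.
print(best_policy({"heads": 0.0, "tails": 1.0}))
```

Scored with the creation-time probabilities, the paying policy comes out ahead; scored after updating on tails, refusing looks better. That gap is the reflective inconsistency the no-updating approach is meant to remove.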

There would, for example, be no such thing as "posterior probability of 'coin comes up heads'" or "probability that you are a Boltzmann brain"; there would only be a fraction of importance-measure that brains with your decision algorithm could affect. As Vladimir Nesov commented:

Anna and I noticed this possible decision rule around four months before Vladimir posted it (with "possible observations" replaced by "partial histories of sense data and actions", and also some implications about how to use limited computing power on "only what the decisions in a given (counterfactual) branch can affect" while still computing predicted decisions on one's other counterfactual branches well enough to coordinate with them). But we didn't write it up to a polished state, partly because we didn't think it seemed enough like it was the central insight in the area. Mostly, that was because this decision rule doesn't explain how to think about any logical paradoxes of self-reference, such as algorithms that refer to each other's output. It also doesn't explain how to think about logical uncertainty, such as the parity of the trillionth digit of pi, because the policy optimization is assumed to be logically omniscient. But maybe we were wrong about how central it was.

It looks like the uncertainty about your own actions in other possible worlds is entirely analogous to uncertainty about mathematical facts: in both cases, the answer is in denotation of the structure you already have at hand, so it doesn't seem like the question about your own actions should be treated differently from any other logical question.

(The following is moderately raw material and runs a risk of being nonsense, I don't understand it well enough.)

One perspective that wasn't mentioned and that I suspect may be important is considering interaction between different processes (or agents) as working by the same mechanism as common partial histories between alternative versions of the same agent. If you can have logical knowledge about your own actions in other possible states that grow in time and possibilities from your current structure, the same treatment can be given to possible states of the signal you send out, in either time-direction, that is to consequences of actions and observations. One step further, any knowledge (properly defined) you have at all about something else gives the same power of mutual coordination with that something, as the common partial history gives to alternative or at-different-times versions of yourself.

This problem seems deeply connected to logic and theoretical computer science, in particular models of concurrency.

By the way, you say "partial histories of sense data and *actions*". I try considering this problem in a time-reversible dynamic; it adds a lot of elegance, and there actions are not part of history, but more like something that is removed from history. The state of the agent doesn't accumulate from actions and observations; instead it's added to by observations and taken away from by actions. The point at which something is considered observation or action and not part of the agent's state is itself rather arbitrary, and both can be seen as points of shifting the scope on what is considered part of the agent. (This doesn't have anything agent-specific, and is more about processes in general.)

Everything you said sounds correct, except the last bit, which is just unclear to me. I'd welcome a demonstration (or formal definition) some day:

Just curious, did you get the name "ambient control" from ambient calculi?

(It's strange that I can use the language of possibility like that!)

Edit: I first saw this problem in Nesov's post. Are you sure Drescher talks about it in his book? I can't find it.

The solution I came up with is that the AI doesn't do Bayesian updating. No matter what input it sees, it keeps using the original probabilities. Did you read this part, and if so, does it fail to explain my solution?

ETA: I think I actually got the idea from Nesov: http://lesswrong.com/lw/14a/thomas_c_schellings_strategy_of_conflict/zrx

That's odd, I remember reading through the whole post, but my eyes must have skipped that part. Probably lack of sleep.

I was recently talking over a notion similar but not identical to this with Nick Bostrom. It shares with this idea the property of completely ruling out all epistemic anthropic reasoning even to the extent of concluding that you're probably not a Boltzmann brain. I may post on it now that you've let the cat loose on "decide for all correlated copies of yourself".

The four main things to be verified are (a) whether this works with reasoning about impossible possible worlds, say if the coinflip is a digit of pi, (b) that the obvious way of extending it to probabilistic hypotheses (namely separating the causal mechanism into deterministic and uncorrelated probabilistic parts a la Pearl) actually works, (c) that there are no even more startling consequences not yet observed, and (d) that you can actually formally say when and how to make a decision that correlates to a copy of yourself in a world that a classical Bayesian would call "ruled out" (with the obvious idea being to assume similarity only with possible computations that have received the same inputs you do, and then being similar in your own branch to the computation depended on by Omega in the Counterfactual Mugging - I have to think about this further and maybe write it out formally to check if it works, though).

Further reflecting, it looks to me like there may be an argument which *forces* Wei Dai's "updateless" decision theory, very much akin to the argument that I originally used to pin down my timeless decision theory - if you expect to face Counterfactual Muggings, this is the reflectively consistent *behavior*; a simple-seeming algorithm has been presented which generates it, so unless an even simpler algorithm can be found, we may have to accept it.

The face-value *interpretation* of this algorithm is a huge bullet to bite even by my standards - it amounts to (depending on your viewpoint) accepting the Self-Indication Assumption or rejecting anthropic reasoning entirely. If a coin is flipped, and on tails you will see a red room, and on heads a googolplex copies of you will be created in green rooms and one copy in a red room, and you wake up and find yourself in a red room, you would assign (behave as if you assigned) 50% posterior probability that the coin had come up tails. In fact it's not yet clear to me how to interpret the behavior of this algorithm in any epistemic terms.

To give credit where it's due, I'd only been talking with Nick Bostrom about this dilemma arising from altruistic timeless decision theorists caring about copies of themselves; the idea of applying the same line of reasoning to *all* probability updates including over impossible worlds, and using this to solve Drescher's(?) Counterfactual Mugging, had not occurred to me at all.

Wei Dai, you may have solved one of the open problems I named, with consequences that currently seem highly startling. Congratulations again.

Hmm... I've been talking about no-updating approach to decision-making for months, and Counterfactual Mugging was constructed specifically to show where it applies well, in a way that sounds on the surface opposite to "play to win".

The idea itself doesn't seem like anything new, just a way of applying standard expectation maximization, not to individual decisions, but to a choice of strategy as a whole, or agent's source code.

From the point of view of agent, everything it can ever come to know results from computations it runs with its own source code, that take into account interaction with environment. If the choice of strategy doesn't depend on particular observations, on context-specific knowledge about environment, then the only uncertainty that remains is the uncertainty about what the agent itself is going to do (compute) according to selected strategy. In simple situations, uncertainty disappears altogether. In more real-world situations, uncertainty results from there being a huge number of possible contexts in which the agent could operate, so that when the agent has to calculate its action in each such context, it can't know for sure what it's going to calculate in other contexts, while that information is required for the expected utility calculation. That's logical uncertainty.

Credit for the no-update solution to Counterfactual Mugging really belongs to Nesov, and he came up with the problem in the first place as well, not Drescher. (Unless you can find a mention of it in Drescher's book, I'm going to assume you misremembered.)

I will take credit for understanding what he was talking about and reformulating the solution in a way that's easier to understand. :)

Nesov, you might want to reconsider your writing style, or something... maybe put your ideas into longer posts instead of scattered comments and try to leave smaller inferential gaps. You obviously have really good ideas, but often a person almost has to have the same idea already before they can understand you.

My book discusses a similar scenario: the dual-simulation version of Newcomb's Problem (section 6.3), in the case where the large box is empty (no $1M) and (I argue) it's still rational to forfeit the $1K. Nesov's version nicely streamlines the scenario.

Just to elaborate a bit, Nesov's scenario and mine share the following features:

In both cases, we argue that an agent should forfeit a smaller sum for the sake of a larger reward that would have been obtained (counterfactually contingently on that forfeiture) if a random event had turned out differently than in fact it did (and than the agent knows it did).

We both argue for using the original coin-flip probability distribution (i.e., not-updating, if I've understood that idea correctly) for purposes of this decision, and indeed in general, even in mundane scenarios.

We both note that the forfeiture decision is easier to justify if the coin-toss was quantum under MWI, because then the original probability distribution corresponds to a real physical distribution of amplitude in configuration-space.

Nesov's scenario improves on mine in several ways. He eliminates some unnecessary complications (he uses one simulation instead of two, and just tells the agent what the coin-toss was, whereas my scenario requires the agent to deduce that). So he makes the point more clearly, succinctly and dramatically. Even more importantly, his analysis (along with Yudkowsky, Dai, and others here) is more formal than my ad hoc argument (if you've looked at Good and Real, you can tell that formalism is not my forte.:)).

I too have been striving for a more formal foundation, but it's been elusive. So I'm quite pleased and encouraged to find a community here that's making good progress focusing on a similar set of problems from a compatible vantage point.

And I think I speak for everyone when I say we're glad you've started posting here! Your book was suggested as required rationalist reading. It certainly opened my eyes, and I was planning to write a review and summary so people could more quickly understand its insights.

(And not to be a suck-up, but I was actually at a group meeting the other day where the ice-breaker question was, "If you could spend a day with any living person, who would it be?" I said Gary Drescher. Sadly, no one had heard the name.)

I won't be able to contribute much to these discussions for a while, unfortunately. I don't have a firm enough grasp of Pearlean causality and need to read up more on that and Newcomb-like problems (halfway through your book's handling of it).

I think you'd find me anticlimactic. :) But I do appreciate the kind words.

Being in a transitionary period from sputtering nonsense to thinking in math, I don't feel right to write anything up (publicly) until I understand it well enough. But I can't help making occasional comments. Well, maybe that's a wrong mode as well.

I guess there's a tradeoff between writing too early, wasting your and other people's time, and writing too late and wasting opportunities to clear other people's confusion earlier and have them work in the same direction.

And on the same note: was my comment about state networks understandable? What do you think about that? I'd appreciate if people who have sufficient background to in principle understand a given comment but who are unable to do so due to insufficiently clear or incomplete explanation spoke up about that fact.

Another point that may help: if you're presenting a complex idea, you need to provide some motivation for the reader to try to understand it. In your mind, that idea is linked to many others and forms a somewhat coherent whole. But if you just describe the idea in isolation as math, either in equations or in words, the reader has no idea why they should try to understand it, except that you think it might be important for them to understand it. Perhaps because you're so good at thinking in math, you seriously underestimate the amount of effort involved when others try it.

I think that's the main reason to write in longer form. If you try to describe ideas individually, you have to either waste a lot of time motivating each one separately and explain how it fits in with other ideas, or risk having nobody trying seriously to understand you. If you describe the system as a whole, you can skip a lot of that and achieve an economy of scale.

Yeah, and math is very helpful as an explanation tool, because people can reconstruct the abstract concepts written in formulas correctly on the first try, even if math seems unnecessary for a particular point. Illusion of transparency of informal explanation, which is even worse where you know that formal explanation can't fail.

I didn't understand it on my first try. I'll have another go at it later and let you know.

Wei Dai's theory does seem to imply this, and the conclusions don't startle me much, but I'd really like a longer post with a clearer explanation.

That reminds me, I actually had a similar idea back in 2001, and posted it on everything-list. I recall thinking at the time something like "This is a really alien way of reasoning and making decisions, and probably nobody will be able to practice it even if it works."

Notice that which instances of the agent (making the choice) are possible in general depends on what choice it makes.

Consider what is accessible if you trace the history of the agent along counterfactuals. Let's say the time is discrete, and at each moment the agent is in a certain state. Going forwards in time, you include both options for the agent's state after receiving a binary observation from the environment, and conversely, going backwards, you include both options for the agent's state before each option for a binary action that the agent could make to arrive at the current state (action and observation are dual under time-reversal in reversible deterministic world dynamics). Iterating these operations, you construct a "state network" of accessible agent states. (You include the states arrived at by "zig-zag" as well: first, a step to the past, then, a step to the future along an observation other than the one that led to the original state from which the tracing began - and you arrive at a counterfactual state in the usual sense - but these time-forward and time-backward steps can be repeated an infinite number of times.)

Now, the set of all possible states of the agent becomes divided into equivalence classes of states belonging to the same state networks. If the agent belongs to one of the state networks, it couldn't be in any other state network (in the generalized sense of "couldn't"). But which states belong to which network depends on the agent's algorithm. In fact, the choice of the algorithm is equivalent to the choice of networks that cover the state set. I'm not really sure what to do with this construction, and whether the structure of the networks other than the network that contains the current state should matter. From the principle that observations shouldn't influence the choice of strategy, the other state networks should matter just as well, but then again they are not even counterfactual...

Action and observation are not "intuitively" dual; to my first thought they are invariant under time reversal. Action is a state-transition of the environment, and observation is a state-transition of the agent. I can see how the duality can be suggested by viewing action as a move of the agent-player and observation as a move of the environment-player. But here the duality is that a node which in one direction was a move by A (associated with arrows to the right) is, in the other direction, a move by E (associated with arrows to the left).

Ok, I understood this on my second reading, but I don't know what to make of it either. Why did you decide to think about agents like this, or did the idea just pop into your head and you wanted to see if it has any applications?

It's more or less a direct rendition of the idea of UDT: actions (with state transitions) depend on state of knowledge, so what does it say about the geometry of state transitions?

More relevant to the recent discussion: where does logical dependence come from, and how to track it in a representation detailed enough? The source of logical dependence, besides what comes from the common algorithm, is actions and observations. In forward-time, all states following a given observation become dependent on that observation, and in backward-time, states preceding an action. A single observation can make multiple actions depend on it, and thus make them dependent.

Connection with logic: states of knowledge in the state network are programs/proofs, and actions/observations are variables parameterizing more general programs that resolve into specific states of knowledge given these actions/observations. Also related to game semantics. This is one dimension along which to compress the knowledge representation and seek further understanding.

Although I still have not tried to decipher what "Timeless Decision Theory" or "Updateless Decision Theory" is actually about, I would like to observe that it is very unlikely that the "timeless" aspect, in the sense of an ontology which denies the reality of time, change, or process, is in any way essential to how it works.

If you have a Julian-Barbour-style timeless wavefunction of the universe, which associates an amplitude with every point in a configuration space of spacelike states of the universe, you can always construct histories using Bohm's formula of following the configuration-space gradient of the complex phase of the wavefunction.

I don't actually advocate Bohmian mechanics, I'd prefer something more like "quantum causal histories" a la Fotini Markopoulou, but looking even more like a background-independent cellular automaton. However, the ever-present Bohmian option should demonstrate that there is no particular intrinsic necessity to the abandonment of time. And basic subjective experience, phenomenology if you will, shows that change is real and about as basic as existence itself. The denial of time is a classic case of people denying what's right in front of them because of absorption in a theory or belief that there is no alternative. I'd basically attribute it to love of the power of objectifying thought to explain things: I can map the events of reality onto a mathematical structure which is static at least in my mind, that structure has enormous clarifying and predictive power, so therefore reality must be static, i.e. there is no time.

So I think the fashion hereabouts for denying the reality of time is a basic error. But it's a hard one to argue against because it requires some detachment from the intellectual drive to formalize and objectify everything which is just about synonymous with rationality and truthseeking in the local worldview, and requires instead that one spend a bit of time being a phenomenologist, reflecting on the nature of experience without preconceptions, and noticing that, yes, it's there and it flows, and maybe it's a mistake to call the flow an illusion just because your basic intellectual method is about mapping the world onto static ideal forms.

However, my real point here is not to argue against the ontology of timelessness. It is to suggest that the basic features of Timeless Decision Theory, whatever they may be, may actually be logically independent of the assumption of a timeless reality; and that it might be worth someone's time to re-express the theory in a language which does not presuppose timelessness. It would be a shame to see a basic innovation in decision theory unnecessarily bound to a particular wrong ontology.

AFAIK, Timeless Decision Theory doesn't have anything to say about the reality of time, only that decisions shouldn't vary depending on the time when they are considered.

Thanks for twisting my mind in the right direction with the S' stuff. I hereby submit the following ridiculous but rigorous theory of Newcomblike problems:

You submit a program that outputs a row number in a payoff matrix, and a "world program" simultaneously outputs a column number in the same matrix; together they determine your payoff. Your program receives the source code of the world program as an argument. The world program doesn't receive your source code, but it contains some opaque function calls to an "oracle" that's guaranteed to return your future output. For example, in Newcomb's Problem the world decides to put $1M in the big box iff the oracle says you will one-box.

You have no way to simulate the world and cause paradoxes, so any run of this game will be consistent. Your only recourse is "conditional simulation": for each of your possible choices, substitute it in place of the oracle call and simulate the world under this assumption, then pick the best option. When applied to Newcomb's Problem, this general algorithm leads to one-boxing. Note there's no infinite recursion involved on either side: your program doesn't ever attempt to simulate the oracle because it's opaque. And the final touch: with infinite recursion thus banished, the oracle can actually be implemented as an ordinary simulator that obtains your source code by peeking through some backdoor in the tournament setting.

This formalization looks totally obvious in retrospect and captures a lot of my common-sense intuitions about Newcomb's. I wonder why people didn't mention it earlier.
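To make the "conditional simulation" idea above concrete, here is a minimal Python sketch. Everything in it is an illustrative assumption rather than part of the proposal itself: the names `newcomb_world` and `conditional_simulation`, the $1,000 small-box amount, and the choice to model the oracle as a zero-argument callable passed into the world program. It only shows the substitution step, in which each candidate choice is plugged in where the oracle call would be and the resulting payoffs are compared.

```python
from typing import Callable, Sequence

# Hypothetical labels for the two rows of the payoff matrix.
ONE_BOX = "one-box"
TWO_BOX = "two-box"


def newcomb_world(oracle: Callable[[], str], choice: str) -> int:
    """Illustrative world program for Newcomb's Problem.

    The oracle is opaque to the player: the world only calls it to learn
    the player's (future) output, then fills the big box accordingly.
    """
    big_box = 1_000_000 if oracle() == ONE_BOX else 0
    small_box = 1_000  # assumed amount; the comment above only specifies $1M
    return big_box if choice == ONE_BOX else big_box + small_box


def conditional_simulation(
    world: Callable[[Callable[[], str], str], int],
    choices: Sequence[str],
) -> str:
    """For each candidate choice, substitute it for the oracle call,
    simulate the world under that assumption, and pick the best payoff."""

    def payoff(choice: str) -> int:
        # The oracle is replaced by a stub that returns the assumed choice,
        # so the player never tries to simulate the real oracle.
        return world(lambda: choice, choice)

    return max(choices, key=payoff)


best = conditional_simulation(newcomb_world, [ONE_BOX, TWO_BOX])
print(best)
```

Under these assumptions the one-box branch is simulated with a full big box and the two-box branch with an empty one, which is why the procedure selects one-boxing. Note that the sketch sidesteps the hard part mentioned later in the thread, namely identifying the oracle call sites inside an arbitrary world program; here the oracle is simply handed in as a parameter.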

I don't think I get your point. Apparently the purpose of having an "oracle" is to ensure that

What paradoxes do you mean? If we replace the "oracle" with the ordinary simulator right from the beginning, what paradoxes occur? According to the decision theory proposed in this post, S would see that it is called twice by the world program, once inside the simulator and once "for real", and compute that the output "one-box" maximizes its utility, and that's the end of it.

The "oracle" helps make the problem tractable: a) it prevents other, non-optimal programs from naively trying to simulate the world and going into infinite recursion; b) it makes the general solution algorithm implementable by unambiguously identifying the spots in the world program that are actually "oracle" invocations, which would be impossible otherwise (Rice's theorem).

I don't really get the point of "decision theories", so I try to reduce all similar problems to "algorithmic game theory" (is that an existing area?).

Edited to add: I couldn't make up a rigorous game-theoretic formulation without an oracle.

Why worry about non-optimal programs? We're talking about a theory of how AIs should make decisions, right?

I think it's impossible for an AI to avoid the need to determine non-trivial properties of other programs, even though Rice's Theorem says there is no algorithm for doing this that's guaranteed to work in general. It just has to use methods that sometimes return wrong answers. And to deal with that, it needs a way to handle mathematical uncertainty.

ETA: If formalizing the problem is a non-trivial process, you might be solving most of the problem yourself in there, rather than letting the AI's decision algorithm solve it. I don't think you'd want that. In this case, for example, if your AI were to encounter Omega in real life, how would it know to model the situation using a world program that invokes a special kind of oracle?

Re ETA: in the comments to Formalizing Newcomb's, Eliezer effectively said he prefers the "special kind of oracle" interpretation to the simulator interpretation. I'm not sure which one an AI should assume when Omega gives it a verbal description of the problem.

Wha?

If you mean my saying (3), that doesn't mean "Oracle", it means we reason about the program without doing a full simulation of it.

Yes, I meant that. Maybe I misinterpreted you; maybe the game needs to be restated with a probabilistic oracle :-) Because I'm a mental cripple and can't go far without a mathy model.

Why do you insist on making life harder on yourself? If the problem isn't solved satisfactorily in a simple world model, e.g. a deterministic finite process with however good mathematical properties you'd like, it's not yet time to consider more complicated situations, with various poorly-understood kinds of uncertainty, platonic mathematical objects, and so on and so forth.

I thought it might be interesting to sketch the outline of a possible solution to the level 4 multiverse decision problem, so people can get a sense of how much work is left to be done (i.e., a lot). This is also a subject that I've been interested in for a long time, so I couldn't resist bringing it up.

Anyway, I gave 2 other examples with simple world models. Can you suggest more simple models that I should test this theory with?

I have thought a bit about these decision theory issues lately and my ideas seem somewhat similar to yours though not identical; see

http://goertzel.org/CounterfactualReprogrammingDecisionTheory.pdf

if you're curious...

-- Ben Goertzel

It's the "do what a superintelligence would do" decision theory!!!

So...UDT dominates all known decision theories

Understanding check:

But does the Bayesian update occur if the input X affects the relative probabilities of the programs without setting any of these probabilities to 0? If it doesn't, why not, and how is this change in the distribution over P_i's taken into account?

ETA: Is the following correct?

If there is only one possible program (P), then there is no need for anything like Bayesian updating; you can just look directly into the program and find the output Y that maximizes utility. When there are multiple possible programs <P1, P2, ... , Pn>, then something like Bayesian updating needs to occur to take into account the effect of outputting Y1 over Y2. This is done implicitly when maximizing Sum P_Y1(<E1, E2, E3, …>) U(<E1, E2, E3, …>), since the probability distribution over the Ei's depends on Y.

If all that's correct, how do you get this distribution?

Sorry, following the SIAI decision theory workshop, I've been working with some of the participants to write a better formulation of UDT to avoid some of the problems that were pointed out during the workshop. It's a bit hard for me to switch back to thinking about the old formulation and try to explain that, so you might want to wait a bit for the "new version" to come out.

Be sure to consider the possibility of the worlds spontaneously constructing the agent in some epistemic state, or dissolving it. Also, when a (different) agent thinks about our agent, it might access a statement about the agent's strategy that involves many different epistemic states. For this reason, the agent's strategy controls many more worlds than those where the agent is instantiated "normally". This makes the problem of figuring out which of the world programs contain the agent very non-trivial, depending on what state of the agent we are talking about and what kind of worlds we are considering, and not just on the order in which the agent program expects observations.

These considerations made me write off Bayesian updating as a non-fundamental technique that shouldn't be shoehorned into a more general decision theory for working with arbitrary preference. I currently suspect that there is no generally applicable simple trick, and FAI decision theory should instead seek to clarify the conceptual issues, and then work on optimizing brute force algorithms that follow from that picture. Think abstract interpretation, not variational mean field.

I look forward to it.

I should probably be studying for a linear models exam tomorrow anyway...

PS again:

Don't forget to retract: http://www.weidai.com/smart-losers.txt

Smart agents win.

I'm not sure why you're so bothered by that article. There's nothing wrong with my game theory, as far as I can tell, and I think historically, the phenomenon described must have played some role in the evolution of intelligence. So why should I retract it?

Smart players know that if they make the "smart" "thing to do on predictably non-public rounds" be to defect, then non-smart players will predict this even though they can't predict which rounds are non-public; so instead they choose to make the "smart" thing (that is, the output of this "smart" decision computation) be to cooperate.

The smart players can still lose out in a case where dumb players are also too dumb to simulate the smart players, have the mistaken belief that smart players will defect, and yet know infallibly who the smart players are; but this doesn't seem quite so much the correctable fault of the smart players as before.

But it's only you who had in the first place the idea that smart players would defect on predictably private rounds, and you got that from a mistaken game theory in which agents only took into account the direct physical consequences of their actions, rather than the consequences of their decision computations having a particular Platonic output.

As I wrote in the article and also above, I was mainly concerned about the evolution of intelligence. Wouldn't you agree that up to now, there have been plenty of dumb players who can't simulate the smart players? Their belief that smart players will defect is not mistaken in that case. Smart players should defect in predictably non-public rounds if they can't be simulated, because the decision of the other player is then logically independent of their decision.

The dumb players don't need to know much game theory, BTW. After they encounter a few smart players who defect in non-public rounds, they should learn this.

Unless the smart players didn't defect in non-public rounds, in which case the dumb players who can only look at their behavior wouldn't become prejudiced against smart players, and everyone is happy.

But if some of the smart players are still causal decision theorists, and the dumb players can't distinguish a TDT from a CDT but can distinguish a TDT from a dumb player, then your reward will be based on other people's assumption that your decision is correlated with something that it really isn't. Which brings us back to "the mistaken belief that smart players will defect".

But notice that this isn't evolutionarily stable. If a mutation causes a smart player to start defecting in non-public rounds, then it would have an advantage. On the other hand, smart players defecting in non-public rounds is evolutionarily stable. So either TDT also implies that smart players should play defect in non-public rounds, or TDT could never have arisen in the first place by evolution. (I'm not sure which is the case yet, but the disjunction must be true.) I conclude that "the mistaken belief that smart players will defect" isn't really mistaken.

Evolutionary stability isn't about TDT because organisms don't simulate each other. You, however, are running a very small and simple computation in your own mind when you conclude "smart players should defect on non-public rounds". But this is assuming the smart player is calculating in a way that doesn't take into account your simple simulation of them, and your corresponding reaction. So you are not using TDT in your own head here, you are simulating a "smart" CDT decision agent - and CDT agents can indeed be harmed by increased knowledge or intelligence, like being told on which rounds an Omega is filling a Newcomb box "after" rather than "before" their decision. TDT agents, however, win - unless you have mistaken beliefs about them that don't depend on their real actions, but that's a genuine fault in you rather than anything dependent on the TDT decision process; and you'll also suffer when the TDT agents calculate that you are not correctly computing what a TDT agent does, meaning your action is not in fact dependent on the output of their computation.

It didn't.

Evolutionary biology built humans to have a sense of honor, which isn't the same thing, but reflects our ancestral inability to calculate the unobserved rounds with exactitude.

TDT can arise in many ways - e.g. a CDT agent who believes they will in the future face Newcomblike problems will self-modify to use TDT for all Newcomblike problems dependent on decisions made after the instant of CDT self-modification, i.e., "use TDT for problems dependent on my decision after 9am on Tuesday and CDT for all problems dependent on decisions before then". This is inelegant, and a simple application of the unknown meta-decision-theory that wakes up and realizes this is stupid, says "Just use TDT throughout". A true pure CDT agent would never realize this and would just end up with an ugly and awkward decision theory in descendants, which points up the importance of the meta-problem.

But evolutionary dynamics simply are not decision-theory dynamics. You might as well point out that no interstellar travel could arise by evolutionary biology because there's no incremental advantage to getting halfway to another solar system.

I think my earlier comments may not have been as clear as they could be. Let me back off and try again. We should distinguish between two different questions:

I don't think you've given any arguments against 1. Since TDT didn't arise from evolution, and it wasn't invented until recently, clearly TDT-related arguments aren't relevant as far as question 1 is concerned. So again, I see no reason to retract the article.

As for 2, I have some doubts about this:

I'm trying to explore it using this puzzle. Do you have any thoughts on it?

Woah, it took me a long time to parse "Smart Losers". The technical parts of the article seem to be correct, but as for its evolutionary relevance... In your scenario, being smart doesn't hurt you, being known to be smart does; so it's most advantageous to be "secretly smart". So if your conclusions were correct, we'd probably see many adaptations aimed at concealing our intelligence from people we interact with.

Not if the cost of concealing intelligence was too high. Our ancestors lived in tribes with a lot of gossip. Trying to conceal intelligence would have entailed pretending to be dumb at virtually all times, which implies giving up most of the benefits of being intelligent.

Why do they have to know infallibly?

I do not think the article suggests any non-toy scenario where such situations might have reasonably arisen.

My personal favorite reason for "why are we not more intelligent species" is that the smart ones don't breed enough :)

So how do the smart agents win that game? It has too many plot twists for me to follow.

It does seem convoluted. Smart agents lose when they face smarter agents, when the game is rigged against them, when they are unlucky, when they play games in which brains don't matter - and probably in numerous other cases.

That's why we have so many bacteria on the planet. They are mega-stupid, but they reproduce quickly, and can live in a huge variety of environments. They play games where being smart is heavily penalised.

This is one of those posts where I think "I wish I could understand the post". Way too technical for me right now. I sometimes wish that someone could do a non-technical and non-mathematical version of posts like this one (but I guess it would take too much time and effort). But then I get by saying, I don't need to understand everything, do I?