Eliezer_Yudkowsky comments on Newcomb's Problem and Regret of Rationality - Less Wrong
You are viewing a comment permalink. View the original post to see all comments and the full post content.
Doesn't have a name as far as I know. But I'm not sure it deserves one; would CDT really be a probable output anywhere besides a verbal theory advocated by human philosophers in our own Everett branch? Maybe, now that I think about it, but even so, does it matter?
But it will calculate that expected value using CDT!expectation, meaning that it won't see how self-modifying to be a timeless decision theorist could possibly affect what's already in the box, etcetera.
Is that really so bad, if it takes the state of the world at the point before it self-modifies as an unchangeable given, and self-modifies to a decision theory that only considers states from that point on as changeable by its decision theory? For one thing, doesn't that avoid Roko's basilisk?
If you do that, you'd be vulnerable to extortion from any other AIs that happen to be created earlier in time and can prove their source code.
I'm inclined to think that in most scenarios the first AGI wins anyway. And leaving solving decision theory to the AGI could mean you get to build it earlier.
I was thinking of meeting alien AIs, post-Singularity.
Huh? I thought we were supposed to be the good guys here? ;-)
But seriously, "sacrifice safety for speed" is the "defect" option in the game of "let's build AGI". I'm not sure how to get the C/C outcome (or rather C/C/C/...), but it seems too early to start talking about defecting already.
Besides, CDT is not well defined enough that you can implement it even if you wanted to. I think if you were forced to implement a "good enough" decision theory and hope for the best, you'd pick UDT at this point. (UDT is also missing a big chunk from its specifications, namely the "math intuition module" but I think that problem has to be solved anyway. It's hard to see how an AGI can get very far without being able to deal with logical/mathematical uncertainty.)
Really? That's surprising. My assumption had been that CDT would be much simpler to implement - but just give undesirable outcomes in whole classes of circumstance.
CDT uses a "causal probability function" to evaluate the expected utilities of various choices, where this causal probability function is different from the epistemic probability function you use to update beliefs. (In EDT they are one and the same.) There is no agreement amongst CDT theorists how to formulate this function, and I'm not aware of any specific proposal that can be straightforwardly implemented. For more details see James Joyce's The foundations of causal decision theory.
I understand AIXI reasonably well and had assumed it was a specific implementation of CDT, perhaps with some tweaks so the reward values are generated internally instead of being observed in the environment. Perhaps AIXI isn't close to an implementation of CDT, perhaps it's perceived as not specific or straightforward enough, or perhaps it's not counted as an implementation. Why isn't AIXI a counterexample?
You may be right that AIXI can be thought of as an instance of CDT. Hutter himself cites "sequential decision theory" from a 1957 paper which certainly predates CDT, but CDT is general enough that SDT could probably fit into its formalism. (Like EDT can be considered an instance of CDT with the causal probability function set to be the same as the epistemic probability function.) I guess I hadn't considered AIXI as a serious candidate due to its other major problems.
Four problems are listed there.
The first one is the claim that AIXI wouldn't have a proper understanding of its body because its thoughts are defined mathematically. This is just wrong, IMO; my refutation, for a machine that's similar enough to AIXI for this issue to work the same, is here. Nobody has engaged me in serious conversation about that, so I don't know how well it will stand up. (If I'm right on this, then I've seen Eliezer, Tim Tyler, and you make the same error. What other false consensuses do we have?)
The second one is fixed if we do the tweak I mentioned in the grandparent of this comment.
If you take the fix described above for the second one, what's left of the third one is the claim that instantaneous human (or AI) experience is too nuanced to fit in a single cell of a Turing machine. According to the original paper, page 8, the symbols on the reward tape are drawn from an alphabet R of arbitrary but fixed size. All you need is a very large alphabet and this one goes away.
I agree with the facts asserted in Tyler's fourth problem, but I do not agree that it is a problem. He's saying that Kolmogorov complexity is ill-defined because the programming language used is undefined. I agree that rational agents might disagree on priors because they're using different programming languages to represent their explanations. In general, a problem may have multiple solutions. Practical solutions to the problems we're faced with will require making indefensible arbitrary choices of one potential solution over another. Picking the programming language for priors is going to be one of those choices.
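A quick illustration of the language-dependence point (my own toy, using off-the-shelf compressors as stand-ins for two description languages): the shortest description of the same data differs between languages, so priors of the form 2^-K differ too, even though the invariance theorem bounds the gap by a constant.

```python
# Toy illustration: measure the "complexity" of a string as the length
# of its compressed form under two different compressors, standing in
# for two different description languages. This is only an analogy for
# Kolmogorov complexity, not the real (uncomputable) thing.
import bz2
import zlib

def k_a(s: bytes) -> int:  # "language" A
    return len(zlib.compress(s))

def k_b(s: bytes) -> int:  # "language" B
    return len(bz2.compress(s))

s = b"ab" * 1000
# Both languages find the pattern (compressed length << 2000 bytes),
# but they assign different lengths, hence different 2**-K priors.
print(k_a(s), k_b(s))
```

The practical upshot matches the comment: any implemented agent has to just pick a language, and that pick is one of the indefensible arbitrary choices.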
Thank you for the reference, and the explanation.
I am prompted to ask myself a question analogous to the one Eliezer recently asked:
Is it worth my while exploring the details of CDT formalization beyond just the page you linked to? There seems to be some advantage to understanding the details and conventions of how such concepts are described. At the same time, revising CDT thinking in too much detail may eliminate some entirely justifiable confusion as to why anyone would think it is a good idea! "Causal Expected Utility"? "Causal Tendencies"? What the? I only care about what will get me the best outcome!
Probably not. I only learned it by accident myself. I had come up with a proto-UDT that was motivated purely by anthropic reasoning paradoxes (as opposed to Newcomb-type problems like CDT and TDT), and wanted to learn how existing decision theories were formalized so I could do something similar. James Joyce's book was the most prominent such book available at the time.
ETA: Sorry, I think the above is probably not entirely clear or helpful. It's a bit hard for me to put myself in your position and try to figure out what may or may not be worthwhile for you. The fact is that Joyce's book is the decision theory book I read, and quite possibly it influenced me more than I realize, or is more useful for understanding the motivation for or the formulation of UDT than I think. It couldn't hurt to grab a copy of it and read a few chapters to see how useful it is to you.
Thanks for the edit/update. For reference it may be worthwhile to make such additions as a new comment, either as a reply to yourself or the parent. It was only by chance that I spotted the new part!
What pre-singularity actions are you worried about them taking?
What I was thinking was that a CDT-seeded AI might actually be safer precisely because it won't try to change pre-Singularity events, and if it's first the new decision theory will be in place in time for any post-Singularity events.
That's surprising to me - what should I read in order to understand this point better? EDIT: strike that, you answer that above.
They could modify themselves so that if they ever encounter a CDT-descended AI they'll start a war (even if it means mutual destruction) unless the CDT-descended AI gives them 99% of its resources.
They could also modify themselves to make the analogous threat if they encounter a UDT-descended AI, or a descendant of an AI designed by Tim Freeman, or a descendant of an AI designed by Wei Dai, or a descendant of an AI designed using ideas mentioned on LessWrong. I would hope that any of those AIs would hand over 99% of their resources if the extortionist could prove its source code and prove that war would be worse. I assume you're saying that CDT is special in this regard. How is it special?
(Thanks for the pointer to the James Joyce book, I'll have a look at it.)
If the alien AI computes the expected utility of "provably modify myself to start a war against CDT-AI unless it gives me 99% of its resources", it's certain to get a high value, whereas if it computes the expected utility of "provably modify myself to start a war against UDT-AI unless it gives me 99% of its resources" it might possibly get a low value (not sure because UDT isn't fully specified), because the UDT-AI, when choosing what to do when faced with this kind of threat, would take into account the logical correlation between its decision and the alien AI's prediction of its decision.
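A toy version of that calculation (the resource and war payoffs are numbers I made up for illustration): the alien's expected utility of committing to the threat depends entirely on whether the victim's decision theory pays up once the commitment is a fixed fact.

```python
# Toy extortion game; all payoffs are illustrative assumptions.
RESOURCES = 100
DEMAND = 0.99 * RESOURCES
WAR_PAYOFF = -50   # mutual destruction: worse than getting nothing
EU_NO_THREAT = 0   # alien's baseline if it never commits to the threat

def alien_eu_of_threat(victim_gives_in: bool) -> float:
    # Once committed, the alien collects the demand if the victim pays,
    # and must carry out the (mutually destructive) war if it refuses.
    return DEMAND if victim_gives_in else WAR_PAYOFF

# A CDT victim treats the alien's commitment as a fixed cause and pays,
# so threatening CDT looks profitable to the alien:
assert alien_eu_of_threat(victim_gives_in=True) > EU_NO_THREAT

# A UDT victim's policy of refusing is (by the logical correlation) what
# the alien predicts, making the commitment worse than not threatening:
assert alien_eu_of_threat(victim_gives_in=False) < EU_NO_THREAT
```

The asymmetry is that the UDT victim's refusal policy shows up inside the alien's prediction before the commitment is made, while the CDT victim's capitulation does too.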
Well, that's plausible. I'll have to work through some UDT examples to understand fully.
What model do you have of how entity X can prove to entity Y that X is running specific source code?
The proof that I can imagine is this: entity Y gives some secure hardware Z to X, then X allows Z to observe the process of X self-modifying to run the specified source code, and then X gives the secure hardware back to Y. Both X and Y can observe the creation of Z, so Y can know that it's secure and X can know that it's a passive observer rather than a bomb or something.
This model breaks the scenario, since a CDT playing the role of Y could self-modify any time before it hands over Z and play the game competently.
Now, if there's some way for X to create proofs of X's source code that will be convincing to Y without giving advance notice to Y, I can imagine a problem for Y here. Does anyone know how to do that?
(I acknowledge that if nobody knows how to do that, that means we don't know how to do that, not that it can't be done.)
Hmm, this explains my aversion to knowing the details of what other people are thinking. It can put me at a disadvantage in negotiations unless I am able to lie convincingly and say I do not know.
Do you have an intuition as to how it would do this without contradicting itself? I tried to ask a similar question but got it wrong in the first draft and afaict did not receive an answer to the relevant part.
I just want to know if my own intuition fails in the obvious way.
It is useful to separate in one's mind the difference between, on the one hand, being able to One Box and cooperate in PD with agents that you know well (shared source code) and, on the other hand, not firing on the Baby Eaters after they have already chosen not to fire on you. This is especially the case when first grappling with the subject. (Could you confirm, by the way, that Akon's decision in that particular paragraph or two is approximately what TDT would suggest?)
The above is particularly relevant because the "have access to each other's source code" is such a useful intuition pump when grappling with or explaining the solutions to many of the relevant decision problems. It is useful to be able to draw a line on just how far the source code metaphor can take you.
There is also something distasteful about making comparisons to a decision theory that isn't even implicitly stable under self-modification. A CDT agent will change to CDT++ unless there is an additional flaw in the agent beyond the poor decision making strategy. If I create a CDT agent, give it time to think, and then give it Newcomb's problem, it will One Box (and also no longer be a CDT agent). It is the errors in the agent that still remain after that time that need TDT or UDT to fix.
*nod* This is just the 'new rules starting now' option. What the CDT agent does when it wakes up in an empty, boring room and does some introspection.
The CDT is making a decision about whether to self-modify even before it meets the alien, based on its expectation of meeting the alien. How does CDT!expectation differ from Eliezer!expectation before we meet the alien?
Yes, because there are lemmas you can prove about (some) decision theory problems which imply that CDT and UDT give the same output. For example, CDT works if there exists a total ordering over inputs given to the strategy, common to all execution histories, such that the world program invokes the strategy only with increasing, non-repeating inputs on that ordering. There are (relatively) easy algorithms for these cases. Using CDT in general is then a matter of applying a theorem when one of its preconditions doesn't hold, which is one of the most common math mistakes ever.
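A deliberately degenerate toy of my own (not a proof of the lemma) showing the shape of such a case: the world program calls the strategy with strictly increasing, non-repeating inputs 0, 1, 2, rewards are additive per call, and optimizing the whole policy at once (UDT-style) agrees with locally maximizing each call's causal contribution (CDT-style).

```python
# Toy world meeting the lemma's condition: the strategy is invoked only
# with increasing, non-repeating inputs (0, 1, 2). Rewards are additive,
# so local causal maximization matches global policy optimization.
from itertools import product

def world(policy):
    # policy: dict mapping input t -> action in {0, 1};
    # each call contributes its action's value to the total payoff.
    return sum(policy[t] for t in range(3))

# UDT-style: optimize over entire input->action mappings at once.
best_udt = max(
    (dict(zip(range(3), acts)) for acts in product((0, 1), repeat=3)),
    key=world,
)

# CDT-style: at each input, pick the action that causally maximizes
# that call's contribution, holding everything else fixed.
best_cdt = {t: max((0, 1), key=lambda a: a) for t in range(3)}

assert world(best_udt) == world(best_cdt)
```

In a Newcomb-like problem the precondition fails (the predictor's simulation re-uses the same "input"), and the two come apart.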
Yes, for reasons of game theory and of practical singularity strategy.
Game theory, because things in Everett branches that are 'closest' to us might be the ones it's most important to be able to interact with, since they're easier to simulate and their preferences are more likely to have interesting overlap with ours. Knowing very roughly what to expect from our neighbors is useful.
And singularity strategy, because if you can show that architectures like AIXI-tl have some non-negligible chance of converging to whatever an FAI would have converged to, as far as actual policies go, then that is a very important thing to know; especially if a non-uFAI existential risk starts to look imminent (but the game theory in that case is crazy). It is not probable but there's a hell of a lot of structural uncertainty and Omohundro's AI drives are still pretty informal. I am still not absolutely sure I know how a self-modifying superintelligence would interpret or reflect on its utility function or terms therein (or how it would reflect on its implicit policy for interpreting or reflecting on utility functions or terms therein). The apparent rigidity of Goedel machines might constitute a disproof in theory (though I'm not sure about that), but when some of the terms are sequences of letters like "makeHumansHappy" or formally manipulable correlated markers of human happiness, then I don't know how the syntax gets turned into semantics (or fails entirely to get turned into semantics, as the case may well be).
This implies that the actually-implemented-CDT agent has a single level of abstraction/granularity, at roughly the naive-realist physical level, at which it's proving things about causal relationships. Like, it can't/shouldn't prove causal relationships at the level of string theory, and yet it's still confident that its actions are causing things despite that structural uncertainty; and yet, despite the symmetry, it for some reason cannot possibly see how switching a few transistors or changing its decision policy might affect things via relationships that are ultimately causal but currently unknown for reasons of boundedness rather than speculative metaphysics. It's plausible, but I think letting a universal hypothesis space, or maybe even just Goedelian limitations, enter the decision calculus at any point is going to make such rigidity unlikely. (This is related to how a non-hypercomputation-driven decision theory in general might reason about the possibility of hypercomputation, or the risk of self-diagonalization, I think.)