Eliezer_Yudkowsky comments on Newcomb's Problem and Regret of Rationality - Less Wrong

Post author: Eliezer_Yudkowsky 31 January 2008 07:36PM

You are viewing a single comment's thread.

Comment author: TimFreeman 16 May 2011 10:39:32PM 1 point

...if you build an AI that two-boxes on Newcomb's Problem, it will self-modify to one-box on Newcomb's Problem, if the AI considers in advance that it might face such a situation. Agents with free access to their own source code have access to a cheap method of precommitment.

...

But what does an agent with a disposition generally-well-suited to Newcomblike problems look like? Can this be formally specified?

...

Rational agents should WIN.

It seems to me that if all that is true, and you want to build a Friendly AI, then the rational thing to do here is build it and let it solve all problems like these. That way, you win, at least in the time-management sense. Well, you might lose if you encountered Omega before the FAI was up and running, but that seems unlikely. Am I missing something here?

It will also have to precommit to mere humans who can't read its source code and can't predict the future, so solving the problem in the case where you meet Omega doesn't solve the problem in general.

Comment author: Eliezer_Yudkowsky 17 May 2011 12:01:40AM 4 points

Causal decision theorists don't self-modify to timeless decision theorists. If you get the decision theory wrong, you can't rely on it repairing itself.

Comment author: TimFreeman 17 May 2011 12:09:04AM 7 points

You said:

Causal decision theorists don't self-modify to timeless decision theorists. If you get the decision theory wrong, you can't rely on it repairing itself.

but you also said:

...if you build an AI that two-boxes on Newcomb's Problem, it will self-modify to one-box on Newcomb's Problem, if the AI considers in advance that it might face such a situation.

I can envision several possibilities:

  • Perhaps you changed your mind and presently disagree with one of the above two statements.
  • Perhaps you didn't mean a causal AI in the second quote. In that case I have no idea what you meant.
  • Perhaps Newcomb's problem is the wrong example, and there's some other example motivating TDT that a self-modifying causal agent would deal with incorrectly.
  • Perhaps you have a model of causal decision theory that makes self-modification impossible in principle. That would make your first statement above true, in a useless sort of way, so I hope you didn't mean that.

Would you like to clarify?

Comment author: Eliezer_Yudkowsky 17 May 2011 12:54:32AM 8 points

Causal decision theorists self-modify to one-box on Newcomb's Problem with Omegas that looked at their source code after the self-modification took place; i.e., if the causal decision theorist self-modifies at 7am, it will self-modify to one-box with Omegas that looked at the code after 7am and two-box otherwise. This is not only ugly but also has worse implications for e.g. meeting an alien AI who wants to cooperate with you, or worse, an alien AI that is trying to blackmail you.
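
A minimal sketch of the resulting policy, as illustrative code rather than anything from the thread, makes the ugliness concrete: the agent's choice ends up indexed on when Omega's prediction was made relative to the self-modification.

    # Toy illustration: the policy a CDT agent arrives at by self-modifying at
    # time T.  It one-boxes only against Omegas whose predictions were made
    # after T, since only those predictions are causally downstream of the
    # modification.
    SELF_MODIFICATION_TIME = 7.0   # "7am", in arbitrary units

    def modified_cdt_policy(omega_prediction_time):
        if omega_prediction_time >= SELF_MODIFICATION_TIME:
            return "one-box"    # prediction made after the precommitment
        return "two-box"        # prediction already fixed; CDT sees no causal gain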

Bad decision theories don't necessarily self-repair correctly.

And in general, every time you throw up your hands in the air and say, "I don't know how to solve this problem, nor do I understand the exact structure of the calculation my computer program will perform in the course of solving this problem, nor can I state a mathematically precise meta-question, but I'm going to rely on the AI solving it for me 'cause it's supposed to be super-smart," you may very possibly be about to screw up really damned hard. I mean, that's what Eliezer-1999 thought you could say about "morality".

Comment author: TimFreeman 17 May 2011 03:42:24AM 0 points

Okay, thanks for confirming that Newcomb's problem is a relevant motivating example here.

"I don't know how to solve this problem, nor do I understand the exact structure of the calculation my computer program will perform in the course of solving this problem, nor can I state a mathematically precise meta-question, but I'm going to rely on the AI solving it for me 'cause it's supposed to be super-smart,"

I'm not saying that. I'm saying that self-modification solves the problem, assuming the CDT agent moves first, and that it seems simple enough that we can check that a not-very-smart AI solves it correctly on toy examples. If I get around to attempting that, I'll post to LessWrong.

Assuming the CDT agent moves first seems reasonable. I have no clue whether or when Omega is going to show up, so I feel no need to second-guess the AI about that schedule.

(Quoting out of order)

This is not only ugly...

As you know, we can define a causal decision theory agent in one line of math. I don't know a way to do that for TDT. Do you? If TDT could be concisely described, I'd agree that it's the less ugly alternative.
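
(For concreteness, one standard way to write that single line, though not necessarily the exact formulation Tim has in mind, is

$$ a^{*} \;=\; \arg\max_{a \in A} \; \sum_{o \in O} P\bigl(o \mid \mathrm{do}(a)\bigr)\, U(o), $$

where the causal probability term $P(o \mid \mathrm{do}(a))$ is exactly the piece that has to be handed to the agent from outside, as Wei_Dai notes below.)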

but also has worse implications for e.g. meeting an alien AI who wants to cooperate with you, or worse, an alien AI that is trying to blackmail you.

I'm failing to suspend disbelief here. Do you have motivating examples for TDT that seem likely to happen before Kurzweil's schedule for the Singularity causes us to either win or lose the game?

Comment author: Wei_Dai 17 May 2011 07:29:16PM 2 points

As you know, we can define a causal decision theory agent in one line of math.

If you appreciate simplicity/elegance, I suggest looking into UDT. UDT says that when you're making a choice, you're deciding the output of a particular computation, and the consequences of any given choice are just the logical consequences of that computation having that output.

CDT in contrast doesn't answer the question "what am I actually deciding when I make a decision?" nor does it answer "what are the consequences of any particular choice?" even in principle. CDT can only be described in one line of math because the answer to the latter question has to be provided to it via an external parameter.
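
A toy sketch of the UDT idea as applied to Newcomb's Problem, with the caveat that the world program, the payoffs, and the brute-force search below are illustrative assumptions standing in for the logical-consequence computation, not Wei_Dai's actual formalism:

    # World program: Omega fills the opaque box iff it predicts one-boxing.
    def newcomb_world(policy):
        prediction = policy()      # Omega runs (a copy of) the agent's policy
        opaque = 1000000 if prediction == "one-box" else 0
        choice = policy()          # the agent then actually chooses
        return opaque if choice == "one-box" else opaque + 1000

    # "Deciding the output of a computation": pick the output whose consequence,
    # as computed by the world program, is best.
    def udt_choose(world, options):
        return max(options, key=lambda a: world(lambda: a))

    print(udt_choose(newcomb_world, ["one-box", "two-box"]))   # prints one-box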

Comment author: TimFreeman 17 May 2011 07:41:57PM 0 points

Thanks, I'll have a look at UDT.

CDT can only be described in one line of math because the answer to the latter question has to be provided to it via an external parameter.

I certainly agree there.

Comment author: FAWS 17 May 2011 08:15:46PM 0 points

but also has worse implications for e.g. meeting an alien AI who wants to cooperate with you, or worse, an alien AI that is trying to blackmail you.

I'm failing to suspend disbelief here. Do you have motivating examples for TDT that seem likely to happen before Kurzweil's schedule for the Singularity causes us to either win or lose the game?

I'm reasonably sure Eliezer meant implications for the would-be friendly AI meeting alien AIs. That could happen at any time in the remaining life span of the universe.

Comment author: wedrifid 17 May 2011 06:43:47AM 2 points

Causal decision theorists don't self-modify to timeless decision theorists.

Why not? A causal decision theorist can have an accurate abstract understanding of both TDT and CDT and can calculate the expected utility of applying either. If TDT produces a better expected outcome in general, then it seems like self-modifying to become a TDT agent is the correct decision to make. Is there some restriction or injunction assumed to be in place with respect to decision algorithm implementation?

Thinking about it for a few minutes: it would seem that the CDT agent will reliably update away from CDT, but that the new algorithm will be neither CDT nor TDT (and not UDT either). It will be able to cooperate with agents when there has been some sort of causal entanglement between its modified source code and the other agent, but not able to cooperate with complete strangers. The resultant decision algorithm is enough of an attractor that it deserves a name of its own. Does it have one?

Comment author: ciphergoth 17 May 2011 11:04:32AM 1 point

Surely the important thing is that it will self-modify to whatever decision theory has the best consequences?

The new algorithm will not exactly be TDT, because it won't try to change decisions that have already been made the way TDT does. In particular this means that there's no risk from Roko's basilisk.

Disclaimer: I'm not very confident of anything I say about decision theory.

Comment author: Eliezer_Yudkowsky 17 May 2011 11:11:50AM 5 points

Doesn't have a name as far as I know. But I'm not sure it deserves one; would CDT really be a probable output anywhere besides a verbal theory advocated by human philosophers in our own Everett branch? Maybe, now that I think about it, but even so, does it matter?

A causal decision theorist can have an accurate abstract understanding of both TDT and CDT and can calculate the expected utility of applying either.

But it will calculate that expected value using CDT!expectation, meaning that it won't see how self-modifying to be a timeless decision theorist could possibly affect what's already in the box, etcetera.
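
(To spell that out as a rough worked step, in my notation rather than Eliezer's: for any Omega whose prediction has already been made, the box contents are causally upstream of the modification, so under the causal expectation

$$ P\bigl(\text{box full} \mid \mathrm{do}(\text{self-modify now})\bigr) \;=\; P(\text{box full}), $$

and the CDT agent computes zero gain from one-boxing against those Omegas, which is why it only adopts the time-indexed policy described above.)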

Comment author: ciphergoth 17 May 2011 12:06:46PM 1 point

Is that really so bad, if it takes the state of the world at the point before it self-modifies as an unchangeable given, and self-modifies to a decision theory that only considers states from that point on as changeable by its decision theory? For one thing, doesn't that avoid Roko's basilisk?

Comment author: Wei_Dai 17 May 2011 07:33:24PM 2 points

Is that really so bad, if it takes the state of the world at the point before it self-modifies as an unchangeable given, and self-modifies to a decision theory that only considers states from that point on as changeable by its decision theory?

If you do that, you'd be vulnerable to extortion from any other AIs that happen to be created earlier in time and can prove their source code.

Comment author: ciphergoth 18 May 2011 07:08:58AM 1 point

I'm inclined to think that in most scenarios the first AGI wins anyway. And leaving solving decision theory to the AGI could mean you get to build it earlier.

Comment author: Wei_Dai 18 May 2011 06:41:05PM 2 points

I'm inclined to think that in most scenarios the first AGI wins anyway.

I was thinking of meeting alien AIs, post-Singularity.

And leaving solving decision theory to the AGI could mean you get to build it earlier.

Huh? I thought we were supposed to be the good guys here? ;-)

But seriously, "sacrifice safety for speed" is the "defect" option in the game of "let's build AGI". I'm not sure how to get the C/C outcome (or rather C/C/C/...), but it seems too early to start talking about defecting already.

Besides, CDT is not well defined enough that you can implement it even if you wanted to. I think if you were forced to implement a "good enough" decision theory and hope for the best, you'd pick UDT at this point. (UDT is also missing a big chunk from its specification, namely the "math intuition module", but I think that problem has to be solved anyway. It's hard to see how an AGI can get very far without being able to deal with logical/mathematical uncertainty.)

Comment author: wedrifid 18 May 2011 06:53:13PM 0 points

Besides, CDT is not well defined enough that you can implement it even if you wanted to. I think if you were forced to implement a "good enough" decision theory and hope for the best, you'd pick UDT at this point.

Really? That's surprising. My assumption had been that CDT would be much simpler to implement - but just give undesirable outcomes in whole classes of circumstance.

Comment author: Wei_Dai 18 May 2011 07:34:18PM 4 points

CDT uses a "causal probability function" to evaluate the expected utilities of various choices, where this causal probability function is different from the epistemic probability function you use to update beliefs. (In EDT they are one and the same.) There is no agreement amongst CDT theorists how to formulate this function, and I'm not aware of any specific proposal that can be straightforwardly implemented. For more details see James Joyce's The foundations of causal decision theory.

Comment author: ciphergoth 18 May 2011 08:39:04PM 1 point

I was thinking of meeting alien AIs, post-Singularity.

What pre-singularity actions are you worried about them taking?

Huh? I thought we were supposed to be the good guys here? ;-)

What I was thinking was that a CDT-seeded AI might actually be safer precisely because it won't try to change pre-Singularity events, and if it's first the new decision theory will be in place in time for any post-Singularity events.

Besides, CDT is not well defined enough that you can implement it even if you wanted to.

That's surprising to me - what should I read in order to understand this point better? EDIT: strike that, you answer that above.

Comment author: Wei_Dai 18 May 2011 08:54:25PM 1 point

What pre-singularity actions are you worried about them taking?

They could modify themselves so that if they ever encounter a CDT-descended AI they'll start a war (even if it means mutual destruction) unless the CDT-descended AI gives them 99% of its resources.

Comment author: hairyfigment 18 May 2011 09:16:18PM 0 points

It's hard to see how an AGI can get very far without being able to deal with logical/mathematical uncertainty.

Do you have an intuition as to how it would do this without contradicting itself? I tried to ask a similar question but got it wrong in the first draft and afaict did not receive an answer to the relevant part.

I just want to know if my own intuition fails in the obvious way.

Comment author: wedrifid 17 May 2011 12:54:12PM 0 points

Doesn't have a name as far as I know. But I'm not sure it deserves one; would CDT really be a probable output anywhere besides a verbal theory advocated by human philosophers in our own Everett branch? Maybe, now that I think about it, but even so, does it matter?

It is useful to keep separate in one's mind the difference between, on the one hand, being able to one-box and cooperate in the PD with agents that you know well (shared source code) and, on the other hand, not firing on the Baby Eaters after they have already chosen not to fire on you. This is especially the case when first grappling with the subject. (Could you confirm, by the way, that Akon's decision in that particular paragraph or two is approximately what TDT would suggest?)

The above is particularly relevant because "have access to each other's source code" is such a useful intuition pump when grappling with or explaining the solutions to many of the relevant decision problems. It is useful to be able to draw a line on just how far the source code metaphor can take you.

There is also something distasteful about making comparisons to a decision theory that isn't even implicitly stable under self-modification. A CDT agent will change to CDT++ unless there is an additional flaw in the agent beyond the poor decision-making strategy. If I create a CDT agent, give it time to think, and then give it Newcomb's problem, it will one-box (and also no longer be a CDT agent). It is the errors in the agent that still remain after that time that need TDT or UDT to fix.

But it will calculate that expected value using CDT!expectation, meaning that it won't see how self-modifying to be a timeless decision theorist could possibly affect what's already in the box, etcetera.

*nod* This is just the 'new rules starting now' option. What the CDT agent does when it wakes up in an empty, boring room and does some introspection.

Comment author: TimFreeman 18 May 2011 09:13:25PM 0 points

But it will calculate that expected value using CDT!expectation, meaning that it won't see how self-modifying to be a timeless decision theorist could possibly affect what's already in the box, etcetera.

The CDT agent is making a decision about whether to self-modify even before it meets the alien, based on its expectation of meeting the alien. How does CDT!expectation differ from Eliezer!expectation before we meet the alien?

Comment author: jimrandomh 19 May 2011 01:44:50PM 1 point

Doesn't have a name as far as I know. But I'm not sure it deserves one; would CDT really be a probable output anywhere besides a verbal theory advocated by human philosophers in our own Everett branch?

Yes, because there are lemmas you can prove about (some) decision theory problems which imply that CDT and UDT give the same output. For example, CDT works if there exists a total ordering over inputs given to the strategy, common to all execution histories, such that the world program invokes the strategy only with increasing, non-repeating inputs on that ordering. There are (relatively) easy algorithms for these cases. Using CDT in general is then a matter of applying a theorem when one of its preconditions doesn't hold, which is one of the most common math mistakes ever.
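
A rough illustration of the flavor of such a precondition, with an invented toy world rather than jimrandomh's actual lemma: when the world program consults the strategy only on strictly increasing, non-repeating inputs, each call influences only what happens later, and optimizing each call in isolation agrees with optimizing the whole strategy.

    import itertools

    # World program: consults the strategy only on increasing, non-repeating
    # inputs 0, 1, 2; each answer affects only later events.  (Newcomb breaks
    # this precondition, since the strategy is effectively consulted twice on
    # the same input: once inside Omega's prediction and once for the choice.)
    def sequential_world(strategy):
        return sum((t + 1) * strategy[t] for t in (0, 1, 2))

    # Optimizing the whole strategy and optimizing each entry separately with
    # the rest held fixed both pick all 1s here.
    best = max(itertools.product((0, 1), repeat=3), key=sequential_world)
    print(best)   # (1, 1, 1)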

Comment author: Will_Newsome 21 May 2011 08:52:19AM 0 points

But I'm not sure it deserves one; would CDT really be a probable output anywhere besides a verbal theory advocated by human philosophers in our own Everett branch? Maybe, now that I think about it, but even so, does it matter?

Yes, for reasons of game theory and of practical singularity strategy.

Game theory, because things in Everett branches that are 'closest' to us might be the ones it's most important to be able to interact with, since they're easier to simulate and their preferences are more likely to have interesting overlap with ours. Knowing very roughly what to expect from our neighbors is useful.

And singularity strategy, because if you can show that architectures like AIXI-tl have some non-negligible chance of converging to whatever an FAI would have converged to, as far as actual policies go, then that is a very important thing to know; especially if a non-uFAI existential risk starts to look imminent (but the game theory in that case is crazy). It is not probable, but there's a hell of a lot of structural uncertainty, and Omohundro's AI drives are still pretty informal. I am still not absolutely sure I know how a self-modifying superintelligence would interpret or reflect on its utility function or terms therein (or how it would reflect on its implicit policy for interpreting or reflecting on utility functions or terms therein). The apparent rigidity of Goedel machines might constitute a disproof in theory (though I'm not sure about that), but when some of the terms are sequences of letters like "makeHumansHappy" or formally manipulable correlated markers of human happiness, then I don't know how the syntax gets turned into semantics (or fails entirely to get turned into semantics, as the case may well be).

But it will calculate that expected value using CDT!expectation, meaning that it won't see how self-modifying to be a timeless decision theorist could possibly affect what's already in the box, etcetera.

This implies that the actually-implemented CDT agent has a single level of abstraction/granularity, something like the naive-realist physical level, at which it's proving things about causal relationships. Like, it can't/shouldn't prove causal relationships at the level of string theory, and yet it's still confident that its actions are causing things despite that structural uncertainty; and yet, despite the symmetry, it for some reason cannot possibly see how switching a few transistors or changing its decision policy might affect things via relationships that are ultimately causal but currently unknown for reasons of boundedness and not speculative metaphysics. It's plausible, but I think letting a universal hypothesis space, or maybe even just Goedelian limitations, enter the decision calculus at any point is going to make such rigidity unlikely. (This is related to how a non-hypercomputation-driven decision theory in general might reason about the possibility of hypercomputation, or the risk of self-diagonalization, I think.)