Wei_Dai comments on Ingredients of Timeless Decision Theory - Less Wrong

43 Post author: Eliezer_Yudkowsky 19 August 2009 01:10AM

Comment author: Wei_Dai 19 August 2009 12:33:19PM 1 point

Yes, I think Eliezer made a similar point:

> What if the TDTs that you're playing against, decide to defect unconditionally if you submit a CDT player, in order to give you an incentive to submit a TDT player?

So if you run TDT, then there are at least two equilibria in this game, only one of which involves you submitting a CDT player. Can you think of a way to select between these two equilibria?

If not, I can fix this by changing the game a bit. Omega will now create his TDT AIs after you design yours, and hard code the source code of your AI into it as givens. His AIs won't even know about you, the real player.

Comment author: Eliezer_Yudkowsky 19 August 2009 02:47:20PM * 5 points

> Omega will now create his TDT AIs after you design yours, and hard code the source code of your AI into it as givens. His AIs won't even know about you, the real player.

They might simply infer you, the real player. You might as well tell the TDT AIs that they're up against a hardcoded Defect move as the "other player", but they won't know if that player has been selected. In fact, that pretty much is what you're telling them, if you show them a CDT player. The CDT player is a red herring - the decision to defect was made by you, in the moment of submitting a CDT player. There is no law against TDT players realizing this after Omega codes them.

I should note that in matters such as these, the phrase "hard code" should act as a warning sign that you're trying to fix something that, at least in your own mind, doesn't want to be fixed. (E.g. "hard code obedience into AIs, build it into the very circuitry!") Where you are tempted to say "hard code" you may just need to accept whatever complex burden you were trying to get rid of by saying "fix it in place with codes of iron!"

Comment author: Wei_Dai 19 August 2009 09:03:06PM * 1 point

By hard code, I meant code it into the TDT's probability distribution. (Even TDT isn't meta enough to say "My prior is wrong!") But that does make the example less convincing, so let me try something else.

Have Omega's AIs physically go first and you play for yourself. They get a copy of your source code, then make their moves in the 3-choose-2 PD game first. You learn their moves, then make your choice. Now, if you follow CDT, you'll reason that your decision has no causal effect on the TDTs' decisions, and therefore choose D. The TDTs, knowing this, will play C.
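
The CDT half of the reasoning above can be sketched as follows. This is a minimal illustration, not the game as specified: it collapses the 3-choose-2 structure to a single two-player Prisoner's Dilemma, and the payoff numbers and the name `cdt_choice` are assumptions, since the thread gives neither. It only shows that an agent who treats the observed move as causally fixed picks D whatever it observes.

```python
# Hypothetical payoffs for the second mover, keyed by (my_move, their_move).
# The thread gives no numbers; these are common illustrative PD values.
PAYOFF = {
    ("C", "C"): 3,
    ("C", "D"): 0,
    ("D", "C"): 5,
    ("D", "D"): 1,
}


def cdt_choice(observed_move: str) -> str:
    """Best-respond to a move that is already made and observed.

    The observed move is held fixed, which is the CDT assumption that
    my decision has no causal effect on it.
    """
    return max(("C", "D"), key=lambda mine: PAYOFF[(mine, observed_move)])


for their_move in ("C", "D"):
    print(their_move, "->", cdt_choice(their_move))
```

Under these assumed payoffs D dominates, so the first movers can predict the CDT player's move from its source code alone.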

And I think I can still show that if you run TDT, you will decide to self-modify into CDT before starting this game. First, if Omega's AIs know that you run TDT at the beginning, then they can use that "play D if you self-modify" strategy to deter you from self-modifying. But you can also use "I'll self-modify anyway" to deter them from doing that. So who wins this game? (If someone moves first logically, then he wins, but what if everyone moves simultaneously in the logical sense, which seems to be the case in this game?)

Suppose it's common knowledge that Omega mostly chooses CDT agents to participate in this game. Then "play D if you self-modify" isn't very "credible": Omega's AIs only see your source code after you self-modify, so to carry out the threat they'd have to play D whenever they predict that a TDT agent would self-modify, even if the actual player started with CDT. Given that, your "I'll self-modify anyway" would be highly credible.

I'm not sure how to formalize this notion of "credibility" among TDTs, but it seems to make intuitive sense.

Comment author: Eliezer_Yudkowsky 19 August 2009 09:37:37PM * 4 points

> And I think I can still show that if you run TDT, you will decide to self-modify into CDT before starting this game

Well that should never happen. Anything that would make a TDT want to self-modify into CDT should make it just want to play D, no need for self-modification. It should give the same answer at different times, that's what makes it a timeless decision theory. If you can break that without direct explicit dependence on the algorithm apart from its decisions, then I am in trouble! But it seems to me that I can substitute "play D" for "self-modify" in all cases above.

> First, if Omega's AIs know that you run TDT at the beginning, then they can use that "play D if you self-modify" strategy to deter you from self-modifying.

E.g., "play D if you play D to deter you from playing D" seems like the same idea; the self-modification doesn't add anything.

> So who wins this game? (If someone moves first logically, then he wins, but what if everyone moves simultaneously in the logical sense, which seems to be the case in this game?)

Well... it partially seems to me that, in assuming certain decisions are made without logical consequences - because you move logically first, or because the TDT agents have fixed wrong priors, etc. - you are trying to reduce the game to a Prisoner's Dilemma in which you have a certain chance of playing against a piece of cardboard with "D" written on it. Even a uniform population of TDTs may go on playing C in this case, of course, if the probability of facing cardboard is low enough. But by the same token, the fact that the cardboard sometimes "wins" does not make it smarter or more rational than the TDT agents.
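
The claim that TDTs may keep playing C when the chance of facing cardboard is low enough can be made concrete with a small expected-value sketch. It rests on assumptions not in the thread: a two-player Prisoner's Dilemma, illustrative payoffs R=3 (mutual C), P=1 (mutual D), S=0 (C against D), and a TDT opponent whose move always matches the agent's own.

```python
def ev_cooperate(p_cardboard: float, R: float = 3, S: float = 0) -> float:
    """Expected payoff of C: a TDT opponent also plays C, cardboard plays D."""
    return (1 - p_cardboard) * R + p_cardboard * S


def ev_defect(p_cardboard: float, P: float = 1) -> float:
    """Expected payoff of D: a TDT opponent also plays D, cardboard plays D.

    Both cases pay P, so the result does not depend on p_cardboard.
    """
    return P


def cooperation_threshold(R: float = 3, P: float = 1, S: float = 0) -> float:
    """Cardboard probability below which C has the higher expected payoff.

    Solves (1 - p) * R + p * S > P for p.
    """
    return (R - P) / (R - S)


print(cooperation_threshold())
```

With these assumed numbers, cooperation stays the better choice as long as the probability of facing cardboard is below 2/3; different payoffs shift the threshold but not the shape of the argument.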

Now, I want to be very careful about how I use this argument, because indeed a piece of cardboard with "only take box B" written on it, is smarter than CDT agents on Newcomb's Problem. But who writes that piece of cardboard, rather than a different one?

An authorless piece of cardboard genuinely does go logically first, but at the expense of being a piece of cardboard, which makes it unable to adapt to more complex situations. A true CDT agent goes logically first, but at the expense of losing on Newcomb's Problem. And your choice to put forth a piece of cardboard marked "D" relies on you expecting the TDT agents to make a certain response, which makes the claim that it's really just a piece of cardboard and therefore gets to go logically first, somewhat questionable.

Roughly, my reply is this: you're reasoning about the response of the TDT agents to your choosing the CDT algorithm, which makes you TDT, yet you're also trying to force your choice of the CDT algorithm to go logically first. That is begging the question.

I would, perhaps, go so far as to agree that in an extension of TDT to cases in which certain agents magically get to go logically first, then if those agents are part of a small group uncorrelated with yet observationally indistinguishable from a large group, the small group might make a correlated decision to defect "no matter what" the large group does, knowing that the large group will decide to cooperate anyway given the payoff matrix. But the key assumption here is the ability to go logically first.

It seems to me that the incompleteness of my present theory when it comes to logical ordering is the real key issue here.

Comment author: Wei_Dai 19 August 2009 10:01:42PM * 1 point

> Well that should never happen. Anything that would make a TDT want to self-modify into CDT should make it just want to play D, no need for self-modification. It should give the same answer at different times, that's what makes it a timeless decision theory. If you can break that without direct explicit dependence on the algorithm apart from its decisions, then I am in trouble! But it seems to me that I can substitute "play D" for "self-modify" in all cases above.

The reason to self-modify is to make yourself indistinguishable from players who started as CDT agents, so that Omega's AIs can't condition their moves on the player's type. Remember that Omega's AIs get a copy of your source code.

> A true CDT agent goes logically first, but at the expense of losing on Newcomb's Problem.

But a CDT agent would self-modify into something that doesn't lose on Newcomb's Problem if it expects to face that. On the other hand, if TDT doesn't self-modify into something that wins my game, isn't that worse? (Is it better to be reflectively consistent, or winning, if you had to choose one?)

> It seems to me that the incompleteness of my present theory when it comes to logical ordering is the real key issue here.

Yes, I agree that's a big piece of the puzzle, but I'm guessing the solution to that won't fully solve the "stupid winner" problem.

ETA: And for TDT agents that move simultaneously, there remains the problem of "bargaining", to use Nesov's term. Lots of unsolved problems... I wish you had started us working on this stuff earlier!

Comment author: Vladimir_Nesov 19 August 2009 10:19:37PM * 1 point

> The reason to self-modify is to make yourself indistinguishable from players who started as CDT agents, so that Omega's AIs can't condition their moves on the player's type.

Being (or performing an action) indistinguishable from X doesn't protect you from the inference that X probably resulted from such a plot. That you can decide to camouflage like this may even reduce X's own credibility (and so a lot of platonic/possible agents doing that will make the configuration unattractive). Thus, the agents need to decide among themselves what to look like: first-mover configurations are a limited resource.

(This seems like a step towards solving bargaining.)

Comment author: Wei_Dai 19 August 2009 10:25:03PM 0 points

Yes, I see that your comment does seem like a step towards solving bargaining among TDT agents. But I'm still trying to argue that if we're not TDT agents yet, maybe we don't want to become them. My comment was made in that context.

Comment author: Vladimir_Nesov 19 August 2009 10:47:32PM * 1 point

Let's pick up Eliezer's suggestion and distinguish the now-much-less-mysterious TDT from the different idea of "updateless decision theory" (UDT). UDT describes the choice of a whole strategy (a function from states of knowledge to actions), rather than the choice of an action in each given state of knowledge; TDT is an example of the latter class. TDT isn't a UDT, and UDT is a rather vacuous statement, as it only achieves reflective consistency pretty much by definition, but doesn't tell us much about the structure of preference or how to choose the strategy.

I don't want to become a TDT agent because, in the UDT sense, TDT agents aren't reflectively consistent. They could self-modify toward something more UDT-ish, but this is the same argument as with CDT self-modifying into TDT.
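
The distinction drawn above, between choosing a whole strategy and choosing an action in each state of knowledge, can be sketched in code. This is only an illustration of the types involved, not an implementation of UDT: the names `best_strategy` and `count_c`, the state labels, and the placeholder utility are all hypothetical, and the sketch says nothing about how preferences over strategies should actually be structured.

```python
from itertools import product


def best_strategy(states, actions, utility):
    """Pick one whole strategy: a mapping from every state to an action.

    Every possible mapping is enumerated and scored once by `utility`,
    rather than choosing an action separately inside each state.
    """
    candidates = (
        dict(zip(states, choice))
        for choice in product(actions, repeat=len(states))
    )
    return max(candidates, key=utility)


def count_c(strategy):
    """Placeholder utility (an assumption): count states mapped to "C"."""
    return sum(1 for action in strategy.values() if action == "C")


best = best_strategy(["s1", "s2"], ["C", "D"], count_c)
print(best)
```

The brute-force enumeration grows exponentially with the number of states, so this only works for toy cases; its purpose is to show that the object being chosen is the mapping itself.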

Comment author: Eliezer_Yudkowsky 19 August 2009 10:59:42PM 0 points

Dai's version of this is a genuine, reflectively consistent updateless decision theory, though. It makes the correct decision locally, rather than needing to choose a strategy once and for all time from a privileged vantage point.

That's why I referred to it as "Dai's decision theory" at first, but both you and Dai seem to think your idea was the important one, so I compromised and referred to it as Nesov-Dai decision theory.

Comment author: Vladimir_Nesov 19 August 2009 11:12:13PM * 2 points

Well, as I see UDT, it also makes decisions locally, with the understanding that this local computation is meant to find the best global solution given other such locally computed decisions. That is, each local computation can make a mistake, making the best global solution impossible, which may make it very important for the other local computations to predict (or at least notice) this mistake and find the local decisions that together with this mistake constitute the best remaining global solution, and so on. The structure of states of knowledge produced by the local computations for the adjacent local computations is meant to optimize the algorithm of local decision-making in those states, giving most of the answer explicitly, leaving the local computations to only move the goalpost a little bit.

The nontrivial form of the decision-making comes from the loop that makes local decisions maximize preference given the other local decisions, and those other local decisions do the same. Thus, the local decisions have to coordinate with each other, and they can do that only through the common algorithm and logical dependencies between different states of knowledge.

At which point the fact that these local decisions are part of the same agent seems to become irrelevant, so that a more general problem needs to be solved, one of cooperation among agents of any kind, or even more generally among processes that aren't exactly "agents".

Comment author: MichaelVassar 19 August 2009 05:08:04PM 0 points

After all, for anything you can hard code, the AI can build a new AI that lacks your hard coding and sacrifice its resources to that new AI.