Wei_Dai comments on Ingredients of Timeless Decision Theory - Less Wrong
Comments (226)
Today I finally came up with a simple example where TDT clearly loses and CDT clearly wins, and as a bonus, proves that TDT isn't reflectively consistent.
Omega comes to you and says
Say the payoffs of the PD are
Suppose you submit an AI running CDT. Then, Omega's AIs will reason as follows: "I have 1/2 chance of playing against a TDT, and 1/2 chance of playing against a CDT. If I play C, then my opponent will play C if it's a TDT, and D if it's a CDT, therefore my expected payoff is 5/2+0/2=2.5. If I play D, then my opponent will play D, so my payoff is 1. Therefore I should play C." Your AI then gets a payoff of 6, since it will play D.
Suppose you submit an AI running TDT instead. Then everyone will play C, so your AI will get a payoff of 5.
So you submit a CDT, whether you are running CDT or TDT. That's because explicitly giving the source code of your submitted AI to the other AIs makes the consequences of your decision the same under CDT and under TDT.
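The arithmetic in this comparison can be checked with a small sketch. The payoff values below are inferred from the numbers quoted in the reasoning (5, 0, 1 and 6), since the payoff table itself is not reproduced here. The function names and the modelling of a TDT opponent as mirroring the move are assumptions made for illustration only.

```python
# Payoffs to the row player, keyed by (my_move, opponent_move).
# Assumption: values inferred from the numbers quoted in the comment.
PAYOFF = {
    ("C", "C"): 5,
    ("C", "D"): 0,
    ("D", "C"): 6,
    ("D", "D"): 1,
}


def omega_ai_expected_payoffs(p_cdt):
    """Expected payoff of one of Omega's TDT AIs for each move.

    p_cdt is its probability of facing the submitted CDT.
    Assumption: a TDT opponent mirrors the move, a CDT opponent plays D.
    """
    play_c = (1 - p_cdt) * PAYOFF[("C", "C")] + p_cdt * PAYOFF[("C", "D")]
    # If this AI plays D, every kind of opponent plays D as well.
    play_d = (1 - p_cdt) * PAYOFF[("D", "D")] + p_cdt * PAYOFF[("D", "D")]
    return {"C": play_c, "D": play_d}


def submitted_ai_payoff(submitted):
    """Payoff of the AI you submit, either "CDT" or "TDT"."""
    p_cdt = 0.5 if submitted == "CDT" else 0.0
    values = omega_ai_expected_payoffs(p_cdt)
    omega_move = max(values, key=values.get)
    # A submitted CDT defects; a submitted TDT mirrors Omega's AIs.
    my_move = "D" if submitted == "CDT" else omega_move
    return PAYOFF[(my_move, omega_move)]


print(omega_ai_expected_payoffs(0.5))
print(submitted_ai_payoff("CDT"), submitted_ai_payoff("TDT"))
```

Under these assumptions, Omega's AIs keep preferring C as long as 5 * (1 - p_cdt) exceeds 1, that is, while p_cdt stays below 0.8.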
If you have to play this game yourself instead of delegating it, you can self-modify, and the payoffs are large enough, then you'd modify yourself from running TDT to running some other DT that plays D in this game! (Notice that I specified that Omega's AIs can't self-modify, so your decision to self-modify won't have the logical consequence that they also self-modify.)
It seems that I've given a counter-example to the claim that
Or does my example fall outside of the specified problem class?
If I wanted to defend the original thesis, I would say yes, because TDT doesn't cooperate or defect depending directly on your decision, but cooperates or defects depending on how your decision depends on its decision (which was one of the open problems I listed - the original TDT is for cases where Omega offers you straightforward dilemmas in which its behavior is just a direct transform of your behavior). So where one algorithm has one payoff matrix for defection or cooperation, the other algorithm gets a different payoff matrix for defection or cooperation, which breaks the "problem class" under which the original TDT is automatically reflectively consistent.
Nonetheless it's certainly an interesting dilemma.
Your comment here is actually pre-empting a comment that I'd planned to make after providing some of the background for the content of TDT. I'd thought about your dilemmas, and then did manage to translate into my terms a notion about how it might be possible to unilaterally defect in the Prisoner's Dilemma and predictably get away with it, provided you did so for unusual reasons. But the conditions on "unusual reasons" are much more difficult than your posts seem to imply. We can't all act on unusual reasons and end up doing the same thing, after all. How is it that these two TDT AIs got here, if not by act of Omega, if the sensible thing to do is always to submit a CDT AI?
To introduce yet another complication: What if the TDTs that you're playing against, decide to defect unconditionally if you submit a CDT player, in order to give you an incentive to submit a TDT player? Given that your reason for submitting a CDT player involves your expectation about how the TDT players will respond, and that you can "get away with it"? It's the TDT's responses that make them "exploitable" by your decision to submit a CDT player - so what if they employ a different strategy instead? (This is another open problem - "who acts first" in timeless negotiations.)
There might be a certain sense in which being in a "small subgroup internally correlated but not correlated with larger groups" could possibly act as a sort of resource for getting away with defection in the true PD, because if you're in a large group then defecting shifts the probability of an opponent likewise defecting by a lot, but if you're in a small subgroup then it shifts the probability of the opponent defecting by a little, so there's a lower penalty for defection, so in marginal cases a small subgroup might play defection while a large subgroup plays cooperate. (But again, the conditions on this are difficult. If all small subgroups reason this way, then all small subgroups form a large correlated group!)
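A toy calculation may make the marginal effect concrete. Everything here is a hypothetical model, not part of the theory: `q` stands for the probability that your opponent is in your correlated subgroup and so mirrors your move, everyone outside the subgroup is assumed to cooperate regardless, and the payoffs are illustrative values. The sketch also ignores the caveat that small subgroups reasoning alike become one large group.

```python
# Illustrative payoffs to the row player (assumed values):
# R = mutual cooperation, S = cooperate against a defector,
# T = defect against a cooperator, P = mutual defection.
R, S, T, P = 5, 0, 6, 1


def expected_payoffs(q):
    """Return (cooperate, defect) expected payoffs.

    q is the hypothetical probability that the opponent belongs to your
    correlated subgroup and mirrors your move. Opponents outside the
    subgroup are assumed to cooperate no matter what you do.
    """
    cooperate = q * R + (1 - q) * R
    defect = q * P + (1 - q) * T
    return cooperate, defect


def defection_pays(q):
    """True if defecting has the higher expected payoff in this toy model."""
    cooperate, defect = expected_payoffs(q)
    return defect > cooperate


for q in (0.1, 0.5):
    print(q, expected_payoffs(q), defection_pays(q))
```

In this model a small subgroup (small `q`) pays only a small penalty for defecting, while a large one loses most of the gains from mutual cooperation.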
Anyway - you can't end up in a small subgroup if you start out in a large one, because if you decide to deliberately condition on noise in order to decrease the size of your subgroup, that itself is a correlated sort of decision with a clear line of reasoning and motive, and others in your correlated group will try doing the same thing, with predictable results. So to the extent that lots of AI designers in distant parts of Reality are discussing this same issue with the same logic, we are already in a group of a certain minimum size.
But this does lead to an argument for CEV (values extrapolating / Friendly AI) algorithms that don't automatically, inherently correlate us with larger groups than we already started out being in. If uncorrelation is a nonrenewable resource then FAI programmers should at least be careful not to wantonly burn it. You can't deliberately add noise, but you might be able to preserve existing uncorrelation.
Also, other TDTs can potentially set their "minimum cooperator frequency threshold" at just the right level that if any group of noticeable size chooses to defect, all the TDTs start defecting - though this itself is a possibility I am highly unsure of, and once again it has to do with "who goes first" in timeless strategies, which is an open problem.
But these are issues in which my understanding is still shaky, and it very rapidly gets us into very dangerous territory like trying to throw the steering wheel out the window while playing chicken.
So far as evolved biological organisms go, I suspect that the ones who create successful Friendly AIs (instead of losing control and dying at the hands of paperclip maximizers), would hardly start out seeing only the view from CDT - most of them/us would be making the decision "Should I build TDT, knowing that the decisions of other biological civilizations are correlated to this one?" and not "Should I build TDT, having never thought of that?" In other words, we may already be part of a large correlated subgroup - though I sometimes suspect that most of the AIs out there are paperclip maximizers born of experimental accidents, and in that case, if there is no way of verifying source code, nor of telling the difference between SIs containing bio-values-preserving civs and SIs containing paperclip maximizers, then we might be able to exploit the relative smallness of the "successful biological designer" group...
...but a lot of this presently has the quality of "No fucking way would I try that in real life", at least based on my current understanding. The closest I would get might be trying for a CEV algorithm that did not inherently add correlation to decision systems with which we were not already correlated.
You're right, I failed to realize that with timeless agents, we can't do backwards induction using the physical order of decisions. We need some notion of the logical order of decisions.
Here's an idea. The logical order of decisions is related to simulation ability. Suppose A can simulate B, meaning it has trustworthy information about B's source code and has sufficient computing power to fully simulate B or sufficient intelligence to analyze B using reliable shortcuts, but B can't simulate A. Then the logical order of decisions is B followed by A, because when B makes his decision, he can treat A's decision as conditional on his. But when A makes her decision, she has to take B's decision as a given.
Does that make sense?
Moving second is a disadvantage (at least it seems to always work out that way, counterexamples requested if you can find them) and A can always use less computing power. Rational agents should not regret having more computing power (because they can always use less) or more knowledge (because they can always implement the same strategy they would use with less knowledge) - this sort of thing is a sure sign of reflective inconsistency.
To see why moving logically second is a disadvantage, consider that it lets an opponent playing Chicken always toss their steering wheel out the window and get away with it.
That both players desire to move "logically first" argues strongly that neither one will; that the resolution here does not involve any particular fixed global logical order of decisions.
(I should comment in the future about the possibility that bio-values-derived civs, by virtue of having evolved to be crazy, can succeed in moving logically first using crazy reasoning, but that would be a whole 'nother story, and of course also falls into the "Way the fuck too dangerous to try in real life" category relative to my present knowledge.)
BTW, thanks for this compact way of putting it.
Being logically second only keeps being a disadvantage because examples keep being chosen to be of the kind that make it so.
One category of counterexample comes from warfare, where if you know what the enemy will do and he doesn't know what you will do, you have the upper hand. (The logical versus temporal distinction is clear here: being temporally the first to reach an objective can be a big advantage.)
Another counterexample is in negotiation where a buyer and seller are both uncertain about fair market price; each may prefer the other to be first to suggest a price. (In practice this is often resolved by the party with more knowledge, or more at stake, or both - usually the seller - being first to suggest a price.)
You're right. Rock-paper-scissors is another counter-example. In these cases, the relationship between the logical order of moves and simulation ability seems pretty obvious and intuitive.
Except that the analogy to rock-paper-scissors would be that I get to move logically first by deciding my conditional strategy "rock if you play scissors" etc., and simulating you simulating me without running into an apparently non-halting computation (that would otherwise have to be stopped by my performing counterfactual surgery on the part of you that simulates my own decision), then playing rock if I simulate you playing scissors.
At least I think that's how the analogy would work.
I suspect that this kind of problem will run into computational complexity issues, not clever decision theory issues. Like with a certain variation on the St. Petersburg paradox (see the last two paragraphs), where you need to count to the greatest finite number to which you can count, and then stop.
Suppose I know that's your strategy, and decide to play the move equal to (the first googolplex digits of pi mod 3), and I can actually compute that but you can't. What are you going to do?
If you can predict what I do, then your conditional strategy works, which just shows that move order is related to simulation ability.
In this zero-sum game, yes, it's possible that whoever has the most computing power wins, if neither can access unpredictable random or private variables. But what if both sides have exactly equal computing power? We could define a Timeless Paper-Scissors-Rock Tournament this way - standard language, no random function, each program gets access to the other's source code and exactly 100 million ticks, if you halt without outputting a move then you lose 2 points.
This game is pretty easy to solve, I think. A simple equilibrium is for each side to do something like iterate x = SHA-512(x), with a random starting value, using an optimal implementation of SHA-512, until time is just about to run out, then output x mod 3. SHA-512 is easy to optimize (in the sense of writing the absolutely fastest implementation), and it seems very unlikely that there could be shortcuts to computing (SHA-512)^n until n gets so big (around 2^256 unless SHA-512 is badly designed) that the function starts to cycle.
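A minimal sketch of that strategy, assuming the tick budget can be modelled as a fixed iteration count and that the "random starting value" is a seed baked into the program's source. The function name, seed and iteration count are hypothetical; only `hashlib.sha512` is a real library call.

```python
import hashlib

MOVES = ("rock", "paper", "scissors")  # hypothetical mapping from 0, 1, 2


def hash_chain_move(seed: bytes, iterations: int) -> int:
    """Iterate x = SHA-512(x) a fixed number of times, then return x mod 3.

    Assumption: `iterations` stands in for "until time is just about to
    run out"; a real entry would derive it from its tick budget.
    """
    x = seed
    for _ in range(iterations):
        x = hashlib.sha512(x).digest()
    # Interpret the 64-byte digest as a big-endian integer.
    return int.from_bytes(x, "big") % 3


# The seed is chosen at random once, when the program is written,
# then hardcoded, since the game offers no random function.
move = hash_chain_move(b"hypothetical-hardcoded-seed", 100_000)
print(MOVES[move])
```

An opponent holding this source code could predict the move only by running the same chain, which by assumption leaves it no time to do anything else.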
I think I've answered your specific question, but the answer doesn't seem that interesting, and I'm not sure why you asked it.
But if you are TDT, you can't always use less computing power, because that might be correlated with your opponents also deciding to use less computing power, or will be distrusted by your opponent because it can't simulate you.
But if you simply don't have that much computing power (and opponent knows this) then you seem to have the advantage of logically moving first.
Lack of computing power could be considered a form of "crazy reasoning"...
Why does TDT lead to the phenomenon of "stupid winners"? If there's a way to explain this as a reasonable outcome, I'd feel a lot better. But is that like a two-boxer asking for an explanation of why, when the stupid (from their perspective) one-boxers keep winning, that's a reasonable outcome?
Substitute "move logically first" for "use less computing power"? Using less computing power seems like a red herring to me. TDT on simple problems (with the causal / logical structure already given) uses skeletally small amounts of computing power. "Who moves first" is a "battle"(?) over the causal / logical structure, not over who can manage to run out of computing power first. If you're visualizing this using lots of computing power for the core logic, rather than computing the 20th decimal place of some threshold or verifying large proofs, then we've got different visualizations.
The idea of "if you do this, the opponent does the same" might apply to trying to move logically first, but in my world this has nothing to do with computing power, so at this point I think it'd be pretty odd if the agents were competing to be stupider.
Besides, you don't want to respond to most logical threats, because that gives your opponent an incentive to make logical threats; you only want to respond to logical offers that you want your opponent to have an incentive to make. This gets into the scary issues I was hinting at before, like determining in advance that if you see your opponent predetermine to destroy the universe in a mutual suicide unless you pay a ransom, you'll call their bet and die with them, even if they've predetermined to ignore your decision, etcetera; but if they offer to trade you silver for gold at a Ricardian-advantageous rate, you'll predetermine to cooperate, etc. The point, though, is that "If I do X, they'll do Y" is not a blank check to decide that minds do X, because you could choose a different form of responsiveness.
But anyway, I don't see in the first place that agents should be having these sorts of contests over how little computing power to use. That doesn't seem to me like a compelling advantage to reach for.
If you've got that little computing power then perhaps you can't simulate your opponent's skeletally small TDT decision, i.e., you can't use TDT at all. If you can't close the loop of "I simulate you simulating me" - which isn't infinite, and actually terminates rather quickly in the simple cases I know how to analyze at all, because we perform counterfactual surgery inside the loop - then you can't use TDT at all.
No, I mean much crazier than that. Like "This doesn't follow, but I'm going to believe it anyway!" That's what it takes to get "unusual reasons" - the sort of madness that only strictly naturally selected biological minds would find compelling in advance of a timeless decision to be crazy. Like "I'M GOING TO THROW THE STEERING WINDOW OUT THE WHEEL AND I DON'T CARE WHAT THE OPPONENT PREDETERMINES" crazy.
It has not been established to my satisfaction that it does. It is a central philosophical intuition driving my decision theory that increased computing power, knowledge, or self-control, should not harm a rational agent.
...possibly employing mixed strategies, by analogy to the equilibrium of games where neither agent gets to go first and both must choose simultaneously? But I haven't done anything with this idea, yet.
This reminds me of logical Fatalism and the Argument from Bivalence
That's a good point, but what if the process that gives birth to CDT doesn't listen to the incentives you give it? For example, it could be evolution or random chance.
Here's an example, similar to Wei's example above. Imagine two parallel universes, both containing large populations of TDT agents. In both universes, a child is born, looking exactly like everyone else. The child in universe A is a TDT agent named Alice. The child in universe B is named Bob and has a random mutation that makes him use CDT. Both children go on to play many blind PDs with their neighbors. It looks like Bob's life will be much happier than Alice's, right?
What force will push against evolution and keep the number of Bobs small?
The problem is that "source code of your AI" is not a complete story, since your decisions as AI programmer also depended on the Omega AIs' code, and so what you give as the source of AI is already only one of the possible worlds that presupposes the behavior of Omega AIs.
Yes, I think Eliezer made a similar point:
So if you run TDT, then there are at least two equilibria in this game, only one of which involves you submitting a CDT. Can you think of a way to select between these two equilibria?
If not, I can fix this by changing the game a bit. Omega will now create his TDT AIs after you design yours, and hard code the source code of your AI into it as givens. His AIs won't even know about you, the real player.
They might simply infer you, the real player. You might as well tell the TDT AIs that they're up against a hardcoded Defect move as the "other player", but they won't know if that player has been selected. In fact, that pretty much is what you're telling them, if you show them a CDT player. The CDT player is a red herring - the decision to defect was made by you, in the moment of submitting a CDT player. There is no law against TDT players realizing this after Omega codes them.
I should note that in matters such as these, the phrase "hard code" should act as a warning sign that you're trying to fix something that, at least in your own mind, doesn't want to be fixed. (E.g. "hard code obedience into AIs, build it into the very circuitry!") Where you are tempted to say "hard code" you may just need to accept whatever complex burden you were trying to get rid of by saying "fix it in place with codes of iron!"
By hard code, I meant code it into the TDT's probability distribution. (Even TDT isn't meta enough to say "My prior is wrong!") But that does make the example less convincing, so let me try something else.
Have Omega's AIs physically go first and you play for yourself. They get a copy of your source code, then make their moves in the 3-choose-2 PD game first. You learn their move, then make your choice. Now, if you follow CDT, you'll reason that your decision has no causal effect on the TDT's decisions, and therefore choose D. The TDTs, knowing this, will play C.
And I think I can still show that if you run TDT, you will decide to self-modify into CDT before starting this game. First, if Omega's AIs know that you run TDT at the beginning, then they can use that "play D if you self-modify" strategy to deter you from self-modifying. But you can also use "I'll self-modify anyway" to deter them from doing that. So who wins this game? (If someone moves first logically, then he wins, but what if everyone moves simultaneously in the logical sense, which seems to be the case in this game?)
Suppose it's common knowledge that Omega mostly chooses CDT agents to participate in this game. Then "play D if you self-modify" isn't very "credible". That's because they only see your source code after you self-modify, so they'd have to play D if they predict that a TDT agent would self-modify, even if the actual player started with CDT. Given that, your "I'll self-modify anyway" would be highly credible.
I'm not sure how to formalize this notion of "credibility" among TDTs, but it seems to make intuitive sense.
Well that should never happen. Anything that would make a TDT want to self-modify into CDT should make it just want to play D, no need for self-modification. It should give the same answer at different times, that's what makes it a timeless decision theory. If you can break that without direct explicit dependence on the algorithm apart from its decisions, then I am in trouble! But it seems to me that I can substitute "play D" for "self-modify" in all cases above.
E.g., "play D if you play D to deter you from playing D" seems like the same idea, the self-modification doesn't add anything.
Well... it partially seems to me that, in assuming certain decisions are made without logical consequences - because you move logically first, or because the TDT agents have fixed wrong priors, etc. - you are trying to reduce the game to a Prisoner's Dilemma in which you have a certain chance of playing against a piece of cardboard with "D" written on it. Even a uniform population of TDTs may go on playing C in this case, of course, if the probability of facing cardboard is low enough. But by the same token, the fact that the cardboard sometimes "wins" does not make it smarter or more rational than the TDT agents.
Now, I want to be very careful about how I use this argument, because indeed a piece of cardboard with "only take box B" written on it, is smarter than CDT agents on Newcomb's Problem. But who writes that piece of cardboard, rather than a different one?
An authorless piece of cardboard genuinely does go logically first, but at the expense of being a piece of cardboard, which makes it unable to adapt to more complex situations. A true CDT agent goes logically first, but at the expense of losing on Newcomb's Problem. And your choice to put forth a piece of cardboard marked "D" relies on you expecting the TDT agents to make a certain response, which makes the claim that it's really just a piece of cardboard and therefore gets to go logically first, somewhat questionable.
Roughly, what I'm trying to reply is that you're reasoning about the response of the TDT agents to your choosing the CDT algorithm, which makes you TDT, but you're also trying to force your choice of the CDT algorithm to go logically first, but this is begging the question.
I would, perhaps, go so far as to agree that in an extension of TDT to cases in which certain agents magically get to go logically first, then if those agents are part of a small group uncorrelated with yet observationally indistinguishable from a large group, the small group might make a correlated decision to defect "no matter what" the large group does, knowing that the large group will decide to cooperate anyway given the payoff matrix. But the key assumption here is the ability to go logically first.
It seems to me that the incompleteness of my present theory when it comes to logical ordering is the real key issue here.
The reason to self-modify is to make yourself indistinguishable from players who started as CDT agents, so that Omega's AIs can't condition their moves on the player's type. Remember that Omega's AIs get a copy of your source code.
But a CDT agent would self-modify into something not losing on Newcomb's problem if it expects to face that. On the other hand, if TDT doesn't self-modify into something that wins my game, isn't that worse? (Is it better to be reflectively consistent, or winning, if you had to choose one?)
Yes, I agree that's a big piece of the puzzle, but I'm guessing the solution to that won't fully solve the "stupid winner" problem.
ETA: And for TDT agents that move simultaneously, there remains the problem of "bargaining" to use Nesov's term. Lots of unsolved problems... I wish you had started us working on this stuff earlier!
Being (or performing an action) indistinguishable from X doesn't protect you from the inference that X probably resulted from such a plot. That you can decide to camouflage like this may even reduce X's own credibility (and so a lot of platonic/possible agents doing that will make the configuration unattractive). Thus, the agents need to decide among themselves what to look like: first-mover configurations are a limited resource.
(This seems like a step towards solving bargaining.)
Yes, I see that your comment does seem like a step towards solving bargaining among TDT agents. But I'm still trying to argue that if we're not TDT agents yet, maybe we don't want to become them. My comment was made in that context.
Let's pick up Eliezer's suggestion and distinguish now-much-less-mysterious TDT from the different idea of "updateless decision theory", UDT, that describes choice of a whole strategy (function from states of knowledge to actions) rather than choice of actions in each given state of knowledge, of which latter class TDT is an example. TDT isn't a UDT, and UDT is a rather vacuous statement, as it only achieves reflective consistency pretty much by definition, but doesn't tell much about the structure of preference and how to choose the strategy.
I don't want to become a TDT agent because, in the UDT sense, TDT agents aren't reflectively consistent. They could self-modify towards a more UDT-ish look, but this is the same argument as with CDT self-modifying into a TDT.
After all, for anything you can hard code, the AI can build a new AI that lacks your hard coding and sacrifice its resources to that new AI.
Wei_Dai wrote on 19 August 2009 07:08:23AM:
That seems to violate the secrecy assumptions of the Prisoner's Dilemma problem! I thought each prisoner has to commit to his action before learning what the other one did. What am I missing?
Thanks!