pengvado comments on Towards a New Decision Theory - Less Wrong

50 Post author: Wei_Dai 13 August 2009 05:31AM

Comment author: pengvado 16 August 2009 04:10:05AM *  3 points [-]

You're saying that TDT applied directly by both AIs would result in them cooperating; you would rather that they defect even though that gives you less utility; so you're looking for a way to make them lose? Why?

If both AIs use the same decision theory and this is common knowledge, then the only options are (C,C) or (D,D). Pick whichever you prefer. If they use different decision theories, then you can give yours pure TDT and tell it truthfully that you've tricked the other player into unconditionally cooperating. What else is there?

Comment author: Vladimir_Nesov 16 August 2009 10:55:19AM 0 points [-]

If both AIs use the same decision theory then the only options are (C,C) or (D,D).

You (and they) can't assume that: even with the same algorithm, the agents could be operating on different internal states, and so may output different decisions, even if from the problem statement it looks like everything significant is the same.

Comment author: Wei_Dai 16 August 2009 05:36:40AM 0 points [-]

The problem is that the two human players' minds aren't logically related. Each human player in this game wants his AI to play defect, because their decisions are logically independent of each other's. If TDT doesn't allow a player's AI to play defect, then the player would choose some other DT that does, or add an exception to the decision algorithm to force the AI to play defect.

I explained here why humans should play defect in one-shot PD.

Comment author: Eliezer_Yudkowsky 16 August 2009 10:04:49PM 3 points [-]

The problem is that the two human players' minds aren't logically related. Each human player in this game wants his AI to play defect, because their decisions are logically independent of each other's.

Your statement above is implicitly self-contradictory. How can you generalize over all the players in one fell swoop, applying the same logic to each of them, and yet say that the decisions are "logically independent"? The decisions are physically independent. Logically, they are extremely dependent. We are arguing over what is, in general, the "smart thing to do". You assume that the "smart thing to do" is to defect, and so that all the players will defect. Doesn't smell like logical independence to me.

More importantly, the whole calculation about independence versus dependence is better carried out by an AI than by a human programmer, which is what TDT is for. It's not for cooperating. It's for determining the conditional probability of the other agent cooperating given that a TDT agent in your epistemic state plays "cooperate". If you know that the other agent knows (up to common knowledge) that you are a TDT agent, and the other agent knows that you know (up to common knowledge) that it is a TDT agent, then it is an obvious strategy to cooperate with a TDT agent if and only if it cooperates with you under that epistemic condition.

The TDT strategy is not "Cooperate with other agents known to be TDTs". The TDT strategy for the one-shot PD, in full generality, is "Cooperate if and only if ('choosing' that the output of this algorithm under these epistemic conditions be 'cooperate') makes it sufficiently more likely that (the output of the probability distribution of opposing algorithms under its probable epistemic conditions) is 'cooperate', relative to the relative payoffs."
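In code, the rule in that last paragraph is just an expected-utility comparison over conditional probabilities. The sketch below is a toy illustration; the payoff values and probability estimates are hypothetical, not anything specified in the thread:

```python
# Standard one-shot PD payoffs to "self" (hypothetical numbers):
# R = both cooperate, S = sucker's payoff, T = temptation, P = both defect.
R, S, T, P = 3, 0, 5, 1

def tdt_choice(p_opp_c_given_my_c, p_opp_c_given_my_d):
    """Cooperate iff 'choosing' that this algorithm output C makes the
    opponent's C sufficiently more likely, relative to the payoffs, for
    the EU of C to beat the EU of D."""
    eu_c = p_opp_c_given_my_c * R + (1 - p_opp_c_given_my_c) * S
    eu_d = p_opp_c_given_my_d * T + (1 - p_opp_c_given_my_d) * P
    return "C" if eu_c > eu_d else "D"

# Against a logically dependent opponent (e.g. a TDT with mutual knowledge),
# choosing C makes its C much more likely:
print(tdt_choice(0.95, 0.05))  # C
# Against a logically independent opponent, the conditional probabilities
# coincide and the dominance argument reasserts itself:
print(tdt_choice(0.5, 0.5))    # D
```

Note that nothing in the rule mentions "other TDTs" by name; cooperation falls out only when the conditional probabilities actually differ.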

Under conditions where a TDT plays one-shot true-PD against something that is not a TDT and not logically dependent on the TDT's output, the TDT will of course defect. A TDT playing against a TDT which falsely believes the former case to hold, will also of course defect. Where you appear to depart from my visualization, Wei Dai, is in thinking that logical dependence can only arise from detailed examination of the other agent's source code, because otherwise the agent has a motive to defect. You need to recognize your belief that what players do is in general likely to correlate, as a case of "logical dependence". Similarly the original decision to change your own source code to include a special exception for defection under particular circumstances, is what a TDT agent would model - if it's probable that the causal source of an agent thought it could get away with that special exception and programmed it in, the TDT will defect.

You've got logical dependencies in your mind that you are not explicitly recognizing as "logical dependencies" that can be explicitly processed by a TDT agent, I think.

Comment author: Vladimir_Nesov 16 August 2009 11:06:08AM 2 points [-]

If you already know something about the other player, if you know it exists, there is already some logical dependence between you two. How to leverage this minuscule amount of dependence is another question, but there seems to be no conceptual distinction between this scenario and where the players know each other very well.

Comment author: Nick_Tarleton 16 August 2009 07:52:20AM 1 point [-]

The problem is that the two human players' minds aren't logically related. Each human player in this game wants his AI to play defect, because their decisions are logically independent of each other's.

I don't think so. Each player wants to do the Winning Thing, and there is only one Winning Thing (their situations are symmetrical), so if they're both good at Winning (a significantly lower bar than successfully building an AI with their preferences), their decisions are related.

Comment author: Wei_Dai 16 August 2009 08:36:21AM *  0 points [-]

So what you're saying is, given two players who can successfully build AIs with their preferences (and that's common knowledge), they will likely (surely?) play cooperate in one-shot PD against each other. Do I understand you correctly?

Suppose what you say is correct, that the Winning Thing is to play cooperate in one-shot PD. Then what happens when some player happens to get a brain lesion that causes him to unconsciously play defect without affecting his AI building abilities? He would take everyone else's lunch money. Or if he builds his AI to play defect while everyone else builds their AIs to play cooperate, his AI then takes over the world. I hope that's a sufficient reductio ad absurdum.

Hmm, I just noticed that you're only saying "their decisions are related" and not explicitly concluding that they should play cooperate. Well, that's fine; as long as they would play defect in one-shot PD, they would also program their AIs to play defect in one-shot PD (assuming each AI can't prove its source code to the other). That's all I need for my argument.

Comment author: Nick_Tarleton 16 August 2009 09:15:40AM *  2 points [-]

So what you're saying is, given two players who can successfully build AIs with their preferences (and that's common knowledge), they will likely (surely?) play cooperate in one-shot PD against each other. Do I understand you correctly?

Yes.

Suppose what you say is correct, that the Winning Thing is to play cooperate in one-shot PD. Then what happens when some player happens to get a brain lesion that causes him to unconsciously play defect without affecting his AI building abilities? He would take everyone else's lunch money. Or if he builds his AI to play defect while everyone else builds their AIs to play cooperate, his AI then takes over the world. I hope that's a sufficient reductio ad absurdum.

Good idea. Hmm. It sounds like this is the same question as: what if, instead of "TDT with defection patch" and "pure TDT", the available options are "TDT with defection patch" and "TDT with tiny chance of defection patch"? Alternately: what if the abstract computations that are the players have a tiny chance of being embodied in such a way that their embodiments always defect on one-shot PD, whatever the abstract computation decides?

It seems to me that Lesion Man just got lucky. This doesn't mean people can win by giving themselves lesions, because that's deliberately defecting / being an abstract computation that defects, which is bad. Whether everyone else should defect / program their AIs to defect due to this possibility depends on the situation; I would think they usually shouldn't. (If it's a typical PD payoff matrix, there are many players, and they care about absolute, not relative, scores, defecting isn't worth it even if it's guaranteed there'll be one Lesion Man.)

This still sounds disturbingly like envying Lesion Man's mere choices – but the effect of the lesion isn't really his choice (right?). It's only the illusion of unitary agency, bounded at the skin rather than inside the brain, that makes it seem like it is. The Cartesian dualism of this view (like AIXI, dropping an anvil on its own head) is also disturbing, but I suspect the essential argument is still sound, even as it ultimately needs to be more sophisticated.

Comment author: Wei_Dai 16 August 2009 12:02:24PM *  3 points [-]

I guess my reductio ad absurdum wasn't quite sufficient. I'll try to think this through more thoroughly and carefully. Let me know which steps, if any, you disagree with, or are unclear, in the following line of reasoning.

  1. TDT couldn't have arisen by evolution.
  2. Until a few years ago, almost everyone on Earth was running some sort of non-TDT which plays defect in one-shot PD.
  3. It's possible that upon learning about TDT, some people might spontaneously switch to running it, depending on whatever meta-DT controls this, and whether the human brain is malleable enough to run TDT.
  4. If, in any identifiable group of people, a sufficient fraction switches to TDT, and that proportion is public knowledge, the TDT-running individuals in that group should start playing cooperate in one-shot PD with other members of the group.
  5. The threshold proportion is higher if the remaining defectors can cause greater damage. If the remaining defectors can use their gains from defection to better reproduce themselves, or to gather more resources that will let them increase their gains/damage, then the threshold proportion must be close to 1, because even a single defector can start a chain reaction that causes all the resources of the group to become held by defectors.
  6. What proportion of skilled AI designers would switch to TDT is ultimately an empirical question, but it seems to me that it's unlikely to be close to unity.
  7. TDT-running AI designers will design their AIs to run TDT. Non-TDT-running AI designers will design their AIs to run non-TDT (not necessarily the same non-TDT).
  8. Assume that a TDT-running AI (TAI) can't tell which other AIs are running TDT and which ones aren't, so in every game it faces the decision described in steps 4 and 5. A TDT AI will cooperate in some situations where the benefit from cooperation is relatively high and damage from defection relatively low, and not in other situations.
  9. As a result, non-TAI will do better than TAI, but the damage to TAIs will be limited.
  10. Only if a TAI is sure that all AIs are TAIs, will it play cooperate unconditionally.
  11. If a TAI encounters an AI of alien origin, the same logic applies. The alien AI will be TAI if-and-only-if its creator was running TDT. If the TAI knows nothing about the alien creator, then it has to estimate what fraction of AI-builders in the universe runs TDT. Taking into account that TDT can't arise from evolution, and not seeing any reason for evolution to create a meta-DT that would pick TDT upon discovering it, this fraction seems pretty low, and so the TAI will likely play defect against the alien AI.
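Steps 4, 5, and 8 can be sketched as a threshold calculation. In a toy model where a fraction p of the group mirrors your choice (the TDT-runners) and the rest defect unconditionally, a TDT agent prefers C exactly when p exceeds (P - S)/(R - S), which approaches 1 as the sucker's payoff worsens, matching step 5 (payoff numbers below are hypothetical):

```python
def cooperation_threshold(R, S, P):
    """Minimum fraction p of TDT-runners for which a TDT agent prefers C.
    With fraction p of mirrors and (1 - p) unconditional defectors:
      EU(C) = p*R + (1-p)*S   (mirrors cooperate back; defectors don't)
      EU(D) = p*P + (1-p)*P = P   (mirrors mirror D; defectors defect anyway)
    Cooperation wins when p*R + (1-p)*S > P, i.e. p > (P - S) / (R - S)."""
    return (P - S) / (R - S)

print(cooperation_threshold(R=3, S=0, P=1))    # ~0.333
print(cooperation_threshold(R=3, S=-10, P=1))  # ~0.846: the worse the
# damage a lone defector can do, the closer the threshold gets to 1
```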

Hmm, this exercise has cleared a lot of my own confusion. Obviously a lot more work needs to be done to make the reasoning rigorous, but hopefully I've gotten the gist of it right.

ETA: According to this line of argument, your hypothesis that all skilled AI designers play cooperate in one-shot PD against each other is equivalent to saying that skilled AI designers have minds malleable enough to run TDT, and have a meta-DT that causes them to switch to running TDT. But I do not see an evolutionary reason for this, so if it's true, it must be true by luck. Do you agree?

Comment author: Vladimir_Nesov 16 August 2009 01:47:34PM *  2 points [-]

It looks like in this discussion you assume that switching to "TDT" (it's highly uncertain what this means) immediately gives the decision to cooperate in "true PD". I don't see why it should be so. Summarizing my previous comments: exactly what the players know about each other, and in exactly what way they know it, may make their decisions go either way. That the players switch from CDT to some kind of more timeless decision theory doesn't determine the answer to be "cooperate"; it merely opens up a possibility that previously was decreed irrational, and I suspect that what's important in the new setting for making the decision go either way isn't captured properly in the problem statement of "true PD".

Also, the way you treat "agents with TDT" seems more appropriate for "agents with Cooperator prefix" from cousin_it's Formalizing PD. And this is a simplified thing far removed from a complete decision theory, although a step in the right direction.

Comment author: Wei_Dai 16 August 2009 07:17:56PM 0 points [-]

I don't assume that switching to TDT immediately gives the decision to cooperate in "true PD". I assume that an AI running TDT would decide to cooperate if it thinks the expected utility of cooperating is higher than the EU of defecting, and that is true if its probability of facing another TDT is sufficiently high compared to its probability of facing a defector (how high is sufficient depends on the payoffs of the game). Well, this is necessary but not sufficient. For example if the other TDT doesn't think its probability of facing a TDT is high enough, it won't cooperate, so we need some common knowledge of the relevant probabilities and payoffs.

Does my line of reasoning make sense now, given this additional explanation?

Comment author: Vladimir_Nesov 16 August 2009 07:39:23PM *  0 points [-]

Actually it makes less sense now, since your explanation seems to agree that two "TDT" algorithms that know each of them is "TDT" won't necessarily cooperate, which undermines my hypothesis for why you were talking about cooperation as a sure thing in some relation to "TDT". I still think you make that assumption, though. Quoting from your argument:

4. If, in any identifiable group of people, a sufficient fraction switches to TDT, and that proportion is public knowledge, the TDT-running individuals in that group should start playing cooperate in one-shot PD with other members of the group.

Comment author: Wei_Dai 16 August 2009 08:04:32PM *  0 points [-]

I'm having trouble understanding what you're talking about again. Do you agree or disagree with step 4? To rephrase it a bit: if an identifiable group of people contains a high fraction of individuals running TDT, and that proportion is public knowledge, then TDT-running individuals in that group should play cooperate in one-shot PD with other members of the group in games where the potential gains from mutual cooperation are large compared to the potential losses from being defected against. (Assuming being in such a group is the best evidence available about whether someone is running TDT or not.)

If you disagree, why do you think a TDT-running individual might not play cooperate in this situation? Can you give an example to help me understand?

Comment author: Eliezer_Yudkowsky 16 August 2009 10:28:40PM 1 point [-]

Btw, agree with steps 3-9.

Comment author: Eliezer_Yudkowsky 16 August 2009 10:17:48PM 0 points [-]
  1. TDT couldn't have arisen by evolution.

It's too elegant to arise by evolution, and it also deals with one-shot PDs with no knock-on effects, which is an extremely nonancestral condition. Evolution by its nature deals with events that repeat many times; sexual evolution by its nature deals with organisms that interbreed; so a "one-shot true PD" is in general a condition unlikely to arise with sufficient frequency for evolution to deal with it at all.

Taking into account that TDT can't arise from evolution, and not seeing any reason for evolution to create a meta-DT that would pick TDT upon discovering it

This may perhaps embody the main point of disagreement. A self-modifying CDT which, at 7am, expects to encounter a future Newcomb's Problem or Parfit's Hitchhiker in which the Omega gets a glimpse at the source code after 7am, will modify to use TDT for all decisions in which Omega glimpses the source code after 7am. A bit of "common sense" would tell you to just realize that "you should have been using TDT from the beginning regardless of when Omega glimpsed your source code and the whole CDT thing was a mistake" but this kind of common sense is not embodied in CDT. Nonetheless, TDT is a unique reflectively consistent answer for a certain class of decision problems, and a wide variety of initial points is likely to converge to it. The exact proportion, which determines under what conditions of payoff and loss stranger-AIs will cooperate with each other, is best left up to AIs to calculate, I think.
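The 7am argument can be made concrete with Newcomb's payoffs (the dollar amounts below are the standard hypothetical ones): since Omega reads the agent's source only after 7am, the code the CDT agent commits to at 7am causally controls the prediction, so even CDT's own expected-utility calculation favors rewriting itself to one-box on those problems:

```python
# Newcomb's Problem: box B holds $1M iff Omega predicts one-boxing;
# box A always holds $1k.  Omega reads the agent's source *after* 7am,
# so at 7am the choice of which code to commit to causally controls
# the prediction, and CDT itself endorses the rewrite.
def payoff(policy, prediction):
    big = 1_000_000 if prediction == "one-box" else 0
    small = 1_000 if policy == "two-box" else 0
    return big + small

# A reliable predictor's prediction matches the committed policy:
eu_stay_cdt = payoff("two-box", prediction="two-box")  # 1_000
eu_modify = payoff("one-box", prediction="one-box")    # 1_000_000
print(eu_modify > eu_stay_cdt)  # True: the CDT agent self-modifies at 7am
```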

Comment author: Wei_Dai 17 August 2009 10:09:16PM *  1 point [-]

Nonetheless, TDT is a unique reflectively consistent answer for a certain class of decision problems, and a wide variety of initial points is likely to converge to it.

The main problem I see with this thesis (to restate my position in a hopefully clear form) is that an agent that starts off with a DT that unconditionally plays D in one-shot PD will not self-modify into TDT, unless it has some means of giving trustworthy evidence that it has done so. Suppose there is no such means; then any other agent must treat it the same whether it self-modifies into TDT or not. Suppose it expects to face a TDT agent in the future. Whether that agent will play C or D against it is independent of what it decides now. If it does self-modify into TDT, then it might play C against the other TDT where it otherwise would have played D, and since the payoff for C is lower than for D, holding the other player's choice constant, it will decide not to self-modify into TDT.

If it expects to face Newcomb's Problem, then it would self-modify into something that handles it better, but that something must still unconditionally play D in one-shot PD.
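The "holding the other player's choice constant" step in the argument above is exactly the dominance argument, sketched here with standard PD payoffs (T > R > P > S; the numbers are hypothetical):

```python
R, S, T, P = 3, 0, 5, 1  # standard PD ordering: T > R > P > S

def best_reply(opponent_action):
    """With the opponent's action treated as a fixed background variable,
    compare the payoff of C versus D against that fixed action."""
    payoff = {("C", "C"): R, ("C", "D"): S, ("D", "C"): T, ("D", "D"): P}
    return max("CD", key=lambda a: payoff[(a, opponent_action)])

# Defection dominates against either fixed opponent action:
print(best_reply("C"), best_reply("D"))  # D D
```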

Do you still think "a wide variety of initial points is likely to converge to it"? If so, do you agree that (ETA: in a world where proving source code isn't possible) those initial points exclude any DT that unconditionally plays D in one-shot PD?

BTW, there are a number of decision theorists in academia. Should we try to get them to work on our problems? Unfortunately, I have no skill/experience/patience/willpower for writing academic papers. I tried to write such a paper about cryptography once and submitted it to a conference, got back a rejection with nonsensical review comments, and that was that. (I guess I could have tried harder but then that would probably have put me on a different career path where I wouldn't be working these problems today.)

Also, there ought to be lots of mathematicians and philosophers who would be interested in the problem of logical uncertainty. How can we get them to work on it?

Comment author: Eliezer_Yudkowsky 18 August 2009 12:36:14AM 1 point [-]

Suppose it expects to face a TDT agent in the future. Whether that agent will play C or D against it is independent of what it decides now.

Unless that agent already knows or can guess your source code, in which case it is simulating you or something highly correlated to you. In that case, "modify to play C only if I expect that other agent simulating me to play C iff I modify to play C" is a superior strategy to "just D": an agent who simulates you making the former choice (and which expects to be correctly simulated itself) will play C against you, while if it simulates you making the latter choice it will play D against you.

If it does self-modify into TDT, then it might play C against the other TDT where it otherwise would have played D, and since the payoff for C is lower than for D, holding the other player's choice constant, it will decide not to self-modify into TDT.

The whole point is that the other player's choice is not constant. Otherwise there is no reason ever for anyone to play C in a one-shot true PD! Simulation introduces logical dependencies - that's the whole point and to the extent it is not true even TDT agents will play D.

"Holding the other player's choice constant" here is the equivalent of "holding the contents of the boxes constant" in Newcomb's Problem. It presumes the answer.
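The simulation point can be sketched as follows: if the opponent's move is computed by simulating your committed policy, then your policy choice does move the opponent's action, and the conditional-cooperation commitment beats unconditional defection. This is a toy model; the policy names and payoffs are made up for illustration:

```python
R, S, T, P = 3, 0, 5, 1  # hypothetical PD payoffs

def mirror(my_policy):
    """An opponent that inspects my committed policy (by simulating it)
    and cooperates iff my policy is the conditional cooperator."""
    return "C" if my_policy == "C-iff-mirrored" else "D"

def my_payoff(my_policy):
    opp = mirror(my_policy)  # the prediction tracks the committed policy
    me = "C" if (my_policy == "C-iff-mirrored" and opp == "C") else "D"
    table = {("C", "C"): R, ("C", "D"): S, ("D", "C"): T, ("D", "D"): P}
    return table[(me, opp)]

print(my_payoff("C-iff-mirrored"))  # 3: mutual cooperation
print(my_payoff("always-D"))        # 1: mutual defection
```

The opponent's action is not a constant here; holding it fixed would presume the answer, exactly as with the box contents in Newcomb's Problem.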

Comment author: Wei_Dai 18 August 2009 04:16:48AM *  0 points [-]

Unless that agent already knows or can guess your source code, in which case it is simulating you or something highly correlated to you

I think you're invoking TDT-style reasoning here, before the agent has self-modified into TDT.

Besides, I'm assuming a world where agents can't know or guess each other's source code. I thought I made that clear. If this assumption doesn't make sense to you, consider this: What evidence can one AI use to infer the source code of another AI or its creator? What if any such evidence can be faked near-perfectly by the other AI? What about two AIs of different planetary origins meeting in space?

I know you'd like to assume a world where guessing each other's source code is possible, since that makes everything work out nicely and everyone can "live happily ever after". But why shouldn't we consider both possibilities, instead of ignoring the less convenient one?

ETA: I think it may be possible to show that a CDT won't self-modify into a TDT as long as it believes there is a non-zero probability that it lives in a world where it will encounter at least one agent that won't know or guess its current or future source code, but in the limit as that probability goes to zero, the DT it self-modifies into converges to TDT.

Comment author: Wei_Dai 16 August 2009 10:25:07PM 0 points [-]

so "one-shot true PDs" is in general a condition unlikely to arise with sufficient frequency that evolution deals with it at all

But there are analogs of one-shot true PD everywhere.

A self-modifying CDT which, at 7am, expects to encounter a future Newcomb's Problem or Parfit's Hitchhiker in which the Omega gets a glimpse at the source code after 7am, will modify to use TDT for all decisions in which Omega glimpses the source code after 7am.

No, I disagree. You seem to have missed this comment, or do you disagree with it?

Comment author: Eliezer_Yudkowsky 16 August 2009 10:34:59PM 2 points [-]

But there are analogs of one-shot true PD everywhere.

Name a single one-shot true PD that any human has ever encountered in the history of time, and be sure to calculate the payoffs in inclusive fitness terms.

Of course that's a rigged question - if you can tell me the name of the villain, I can either say "look how they didn't have any children" or "their children suffered from the dishonor brought upon their parent". But still, I think you are taking far too liberal a view of what constitutes one-shotness.

Empirically, humans ended up with both a sense of temptation and a sense of honor that, to the extent it holds, holds when no one is looking. We have separate impulses for "cooperate because I might get caught" and "cooperate because it's the honorable thing to do".

Regarding your other comment, "Do what my programmer would want me to do" is not formally defined enough for me to handle it - all the complexity is hidden in "would want". Can you walk me through what you think a CDT agent self-modifies to if it's not "use TDT for future decisions where Omega glimpsed my code after 7am and use CDT for future decisions where Omega glimpsed my code before 7am"? (Note that calculations about general population frequency count as "before 7am" from the crazed CDT's perspective, because you're reasoning from initial conditions that correlate to the AI's state before 7am rather than after it.)

Comment author: Wei_Dai 16 August 2009 10:51:00PM *  0 points [-]

By "analog of one-shot true PD" I meant any game where the Nash equilibrium isn't Pareto-optimal. The two links in my last comment gave plenty of examples.

all the complexity is hidden in "would want"

I think I formalized it already, but to say it again, suppose the creator had the option of creating a giant lookup table in place of S. What choice of GLT would have maximized his expected utility at the time of coding, under the creator's own decision theory? S would compute that and then return whatever the GLT entry for X is.

ETA:

Can you walk me through what you think a CDT agent self-modifies to

It self-modifies to the S described above, with a description of itself embedded as the creator. Or to make it even simpler but less realistic, a CDT just replaces itself by a GLT, chosen to maximize its current expected utility.

Is that sufficiently clear?
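A toy version of the GLT construction (the situation label and the CDT expected-utility function below are hypothetical stand-ins): the agent enumerates candidate lookup tables and keeps whichever maximizes its current expected utility. With a CDT-style EU that holds the opponent's play fixed, the chosen table defects:

```python
from itertools import product

def best_lookup_table(inputs, actions, expected_utility):
    """Choose the lookup table (input -> action) maximizing the agent's
    expected utility at coding time.  In general the entries can't be
    optimized independently, so we search over whole tables."""
    tables = [dict(zip(inputs, choice))
              for choice in product(actions, repeat=len(inputs))]
    return max(tables, key=expected_utility)

# Toy example: one situation.  A CDT creator's EU calculation treats the
# opponent's play as independent of the table chosen, so D is selected.
R, S, T, P = 3, 0, 5, 1

def cdt_eu(table, p_opp_c=0.5):
    a = table["one-shot PD"]
    return p_opp_c * (R if a == "C" else T) + (1 - p_opp_c) * (S if a == "C" else P)

print(best_lookup_table(["one-shot PD"], "CD", cdt_eu))
# {'one-shot PD': 'D'}
```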

Comment author: Eliezer_Yudkowsky 16 August 2009 10:08:18PM 1 point [-]

Suppose what you say is correct, that the Winning Thing is to play cooperate in one-shot PD. Then what happens when some player happens to get a brain lesion that causes him to unconsciously play defect without affecting his AI building abilities? He would take everyone else's lunch money.

Possibly. But it has to be an unpredictable brain lesion - one that is expected to happen with very low frequency. A predictable decision to do this just means that TDTs defect against you. If enough AI-builders do this then TDTs in general defect against each other (with a frequency threshold dependent on relative payoffs) because they have insufficient confidence that they are playing against TDTs rather than special cases in code.

Or if he builds his AI to play defect while everyone else builds their AIs to play cooperate, his AI then takes over the world.

No one is talking about building AIs to cooperate. You do not want AIs that cooperate on the one-shot true PD. You want AIs that cooperate if and only if the opponent cooperates if and only if your AI cooperates. So yes, if you defect when others expect you to cooperate, you can pwn them; but why do you expect that AIs would expect you to cooperate (conditional on their cooperation) if "the smart thing to do" is to build an AI that defects? AIs with good epistemic models would then just expect other AIs that defect.

Comment author: Wei_Dai 16 August 2009 10:13:35PM 0 points [-]

The comment you responded to was mostly obsoleted by this one, which represents my current position. Please respond to that one instead. Sorry for making you waste your time!