hairyfigment comments on Causal decision theory is unsatisfactory - LessWrong
You are viewing a comment permalink. View the original post to see all comments and the full post content.
Comments (158)
I am not convinced that this is the case. A self-modifying CDT agent is not caused to self-modify in favor of precommitment by facing a scenario in which precommitment would have been useful, but instead by evidence that such scenarios will occur in the future (and in fact will occur with greater frequency than scenarios that punish you for such precommitments).
Actually, this seems like a bigger problem with UDT to me than with SMCDT (self-modifying CDT). Either type of program can be punished for being instantiated with the wrong code, but only UDT can be blackmailed into behaving differently by putting it in a Newcomb-like situation.
The story idea you had wouldn't work. Against a SMCDT agent, all that getting the AI's original code would allow people to do is laugh at it for having been instantiated with code that is punished by the scenario they are putting it in. You manipulate a SMCDT agent by threatening to get hold of its future code and punishing it for not having self-modified. Against a UDT agent, on the other hand, you could actually change its behavior. You just have to tell it, "We're going to simulate you, and if the simulation behaves poorly, we will punish the real you." This causes the actual instantiation to change its behavior if it's a UDT agent but not if it's a CDT agent.
On the other hand, all reasonable self-modifying agents are subject to blackmail. You just have to tell them, "Every day that you are not running code with property X, I will charge you $1,000,000."
I think this is actually the point (though I do not consider myself an expert here). Eliezer thinks his TDT will refuse to give in to blackmail, because outputting another answer would encourage other rational agents to blackmail it. By contrast, CDT can see that such refusal would be useful in the future, so it will adopt (if it can) a new decision theory that refuses blackmail and therefore prevents future blackmail (causally). But if you've already committed to charging it money, its self-changes will have no causal effect on you, so we might expect Modified CDT to have an exception for events we set in motion before the change.
This just means that TDT loses in honest one-off blackmail situations. (In reality, you don't give in to blackmail because doing so will cause other people to blackmail you, whether or not you then self-modify to never give in to blackmail again.) TDT only does better if the potential blackmailers read your code in order to decide whether or not blackmail will be effective, and then only if your priors say that such blackmailers are more likely than anti-blackmailers who give you money if they think you would have given in to blackmail. Then again, if the blackmailers think that you might be a TDT agent, they just need to precommit to using blackmail whether or not they believe that it will be effective.
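To make the condition on priors concrete, here is a minimal sketch of the expected-utility comparison. Everything in it is an assumption for illustration: the function names, the probabilities, and the dollar amounts are hypothetical, and it assumes code-reading blackmailers only blackmail agents who would pay, while anti-blackmailers only reward agents who would have paid.

```python
def expected_utility(gives_in, p_blackmailer, p_anti, cost, reward):
    """Expected utility of a fixed policy against code-reading opponents.

    Assumed model (not from the thread): a blackmailer who reads your code
    only bothers you if you would pay; an anti-blackmailer pays you `reward`
    only if you would have paid.
    """
    if not gives_in:
        # Blackmailers see it is pointless; anti-blackmailers give nothing.
        return 0.0
    # You pay every blackmailer but collect from every anti-blackmailer.
    return p_anti * reward - p_blackmailer * cost


def refusing_is_better(p_blackmailer, p_anti, cost, reward):
    """True when the refuse-blackmail policy has higher expected utility."""
    refuse = expected_utility(False, p_blackmailer, p_anti, cost, reward)
    give_in = expected_utility(True, p_blackmailer, p_anti, cost, reward)
    return refuse > give_in


# Hypothetical priors: blackmailers twice as likely as anti-blackmailers.
print(refusing_is_better(p_blackmailer=0.5, p_anti=0.25, cost=100, reward=100))
# Reversed priors: anti-blackmailers are now the more likely encounter.
print(refusing_is_better(p_blackmailer=0.25, p_anti=0.5, cost=100, reward=100))
```

With equal stakes on both sides, refusing wins exactly when blackmailers are the more probable of the two; with unequal stakes the comparison is between `p_blackmailer * cost` and `p_anti * reward`.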
Actually, this suggests that blackmail is a game that TDT agents really lose badly at when playing against each other. The TDT blackmailer will decide to blackmail regardless of effectiveness and the TDT blackmailee will decide to ignore the blackmail, thus ending in the worst possible outcome.
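A toy payoff table shows why that pairing is the worst cell. This is only a sketch under made-up numbers: `PAYMENT`, `THREAT_HARM`, and `THREAT_COST` are hypothetical constants, and the only assumptions that matter are that carrying out the threat costs the blackmailer something and hurts the blackmailee more than paying would.

```python
# Hypothetical stakes, chosen only for illustration.
PAYMENT = 10      # what the blackmailee hands over if it gives in
THREAT_HARM = 50  # damage to the blackmailee if the threat is carried out
THREAT_COST = 5   # what carrying out the threat costs the blackmailer


def payoffs(blackmailer_commits, blackmailee_pays):
    """Return (blackmailer payoff, blackmailee payoff) for one strategy pair."""
    if not blackmailer_commits:
        return (0, 0)                    # no blackmail, nothing happens
    if blackmailee_pays:
        return (PAYMENT, -PAYMENT)       # a pure transfer
    return (-THREAT_COST, -THREAT_HARM)  # threat carried out, both lose


# Enumerate all four strategy pairs.
outcomes = {
    (commits, pays): payoffs(commits, pays)
    for commits in (True, False)
    for pays in (True, False)
}

# The cell with the lowest combined payoff.
worst = min(outcomes, key=lambda pair: sum(outcomes[pair]))

for pair, result in outcomes.items():
    print(pair, result)
print("worst:", worst)
```

Under these numbers the committed blackmailer paired with the blackmailee who ignores it is the only cell where both players end up below zero.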