Followup/summary/extension to this conversation with SilasBarta
So, you're going along, cheerfully deciding things, doing counterfactual surgery on the output of decision algorithm A1 to calculate the results of your decisions, but it turns out that a dark secret is undermining your efforts...
You are not running/being decision algorithm A1, but instead decision algorithm A2, an algorithm that happens to have the property of believing (erroneously) that it actually is A1.
Ruh-roh.
Now, it is _NOT_ my intent here to try to solve the problem of "how can you know which one you really are?", but instead to deal with the problem of "how can TDT take into account this possibility?"
Well, first, let me suggest a slightly more concrete way in which this might come up:
Physical computation errors. For instance, a stray cosmic ray hits your processor and flips a bit in such a way that a conditional that would otherwise have gone down one branch instead goes down the other, so instead of computing the output of your usual algorithm in this circumstance, you're computing the output of the version that, at that specific step, behaves in that slightly different way. (Yes, this sort of thing can be mitigated with error correction/etc. The problem being addressed here is that, to me at least, basic TDT doesn't seem to have a natural way to even represent this possibility.)
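As a toy illustration of what's meant here (my own example, not from the conversation): the "intended" algorithm and the algorithm that actually runs differ only in how one conditional resolves, and nothing inside the computation flags the difference.

```python
def a1(observation: bool) -> str:
    # Intended algorithm: cooperate exactly when the observation holds.
    return "cooperate" if observation else "defect"

def a2(observation: bool) -> str:
    # What actually ran after a bit flip: the same code, except that at this
    # step the conditional effectively tested the negated observation.
    return "cooperate" if not observation else "defect"

# From the inside, both computations "believe" they are A1; only the physical
# process determines which output actually gets produced.
```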
Consider a slightly modified causal net in which the innards of an agent are more of an "initial state", and in which there's a selector node/process (ie, the resulting computation) that selects which abstract algorithm's output is the one that's the actual output. ie, this process determines which algorithm you, well, are.
Similarly, another being that might base its actions on a model of your behavior will be represented as having a model of your innards, with that model having its own selector, analogous to the above.
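A rough sketch of that net (the node names are my own, and the exact parentage is an assumption about how one might wire it up), written as a mapping from each node to its parents:

```python
causal_net = {
    "innards":          [],                                  # agent's initial state
    "physical_details": [],                                  # messy details of reality
    "selector":         ["innards", "physical_details"],     # picks which algorithm you "are"
    "output":           ["selector"],                        # the decision actually produced

    # the other being's model of you, with its own independent selector
    "model_innards":    ["innards"],
    "model_details":    [],                                  # more (independent) messy details
    "model_selector":   ["model_innards", "model_details"],
    "model_prediction": ["model_selector"],
}
```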
To actually compute the consequences of decisions and do all the relevant counterfactual surgery, ideally (ignoring "minor" issues like computability), one iterates over all possible algorithms one might be. That is, one first goes "if the actual result of the combination of my innards and all the messy details of reality and so on is to do computation A1, then..." and subiterates over all possible decisions. The latter, of course, is done via the usual counterfactual surgery.
Then one weighs all of those by the probability that one actually _is_ algorithm A1, moves on to "if I actually were algorithm A2...", and so on, doing the same counterfactual surgery each time.
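A minimal sketch of that outer/inner loop, under my own simplifications: the counterfactual surgery itself is hidden behind a placeholder `utility_given(alg, choice)` function ("set the output of `alg` to `choice` and propagate through the net"), and the probabilities of being each algorithm are taken as given.

```python
def best_choice(algorithms, choices, utility_given):
    # algorithms: list of (candidate_algorithm, probability_that_I_am_it) pairs
    # choices:    the available outputs
    def expected_utility(choice):
        # Sum over the algorithms I might actually be, weighted by that probability.
        return sum(p * utility_given(alg, choice) for alg, p in algorithms)
    return max(choices, key=expected_utility)
```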
In the above diagram, this lets one consider the possibility of one's own choice being decoupled from what the model of that choice would predict: the initial model is correct, but while the agent is actually considering the decision, a hardware error or whatever causes the agent to be/implement A2 while the model of it is instead properly implementing A1.
I am far from convinced that this is the best way to deal with this issue, but I haven't seen anyone else bring it up, and the usual form of TDT that we've been describing didn't seem to have any obvious way to even represent it. So, if anyone has better ideas for how to clean up this solution, or alternative ideas for dealing with this problem, go ahead.
I just think it is important that it be dealt with _somehow_... That is, that the decision theory have some way of representing errors or other things that could cause ambiguity as to which algorithm it is actually implementing in the first place.
EDIT: sorry, to clarify: one determines the utility for a possible choice by summing over the results of all the possible algorithms making that particular choice (ie, "I don't know if my decision corresponds to deciding the outcome of algorithm A1 or A2 or..."), so sum over those for each choice, weighing each by the probability of that being the actual algorithm in question.
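In symbols (my own notation, not from the original discussion), the value assigned to a candidate choice c would be something like:

```latex
U(c) \;=\; \sum_i P(\text{I am } A_i)\;\mathbb{E}\!\left[\,U \mid \operatorname{do}\big(\text{output}(A_i) = c\big)\,\right]
```

where the do(...) term stands in for the counterfactual surgery that sets the output of algorithm A_i to c in the net and propagates the consequences.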
EDIT2: SilasBarta came up with a different causal graph during our discussion to represent this issue.
In the discussion, I had proposed this as a causal net that captures all of your concerns, and I still don't see why it doesn't. Explanation
First of all, I will remind you that all nodes on a Bayesian causal net implicitly have a lone parent (disconnected from all other such parents) that represents uncertainty, which you explicitly represent in your model as "messy details of physical reality" and "more [i.e. independent] messy details of physical reality".
The model I made for you captures all of this, so I don't see why it's something TDT has any difficulty representing.
Omega knows your innards. Omega knows what algorithm you're trying to implement. Omega knows something about how hardware issues lead to what failure modes. So yes, there remains a chance Omega will guess wrong (under your restrictive assumptions about Omega), but this is fully represented by the model.
Also (still in my model), the agent, when computing a "would", looks at its choice as being what algorithm it will attempt to implement. It sees that there is room for the possibility of its intended algorithm not being the algorithm that actually gets implemented. It estimates what kinds of effects turn what intended algorithms into bad algorithms and therefore has reasons to pick algorithms that are unlikely to be turned into bad ones.
For a more concrete example of this kind of agent reasoning, refer back to what EY does in this post. He points out that we (including him) run on corrupted hardware ("innards" in my model). Therefore, the kind of corruption that his innards have, given his desired payoffs, justifies rejecting such target algorithms as "cheat when it will benefit the tribe on net", reasoning that that algorithm will likely degrade (via the causal effect of innards) into the actual algorithm of "cheat when it benefits me personally". To avoid this, he picks an algorithm harder to corrupt, like "for the good of the tribe, don't cheat, even if it benefits the tribe", which will, most likely, degrade into "don't cheat to benefit yourself at the expense of the tribe", something consistent with his values.
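As a rough sketch of that "pick an algorithm that degrades gracefully" reasoning (the degradation probabilities and utilities below are invented purely for illustration):

```python
# target algorithm -> {algorithm it may actually degrade into: probability}
degradation = {
    "cheat when it benefits the tribe on net": {
        "cheat when it benefits me personally": 0.7,
        "cheat when it benefits the tribe on net": 0.3,
    },
    "don't cheat, even to benefit the tribe": {
        "don't cheat to benefit yourself at the tribe's expense": 0.9,
        "don't cheat, even to benefit the tribe": 0.1,
    },
}

utility = {
    "cheat when it benefits me personally": -10.0,
    "cheat when it benefits the tribe on net": 5.0,
    "don't cheat to benefit yourself at the tribe's expense": 3.0,
    "don't cheat, even to benefit the tribe": 2.0,
}

def value_of_target(target):
    # Expected utility of *intending* an algorithm, given how it tends to degrade.
    return sum(p * utility[actual] for actual, p in degradation[target].items())

best_target = max(degradation, key=value_of_target)  # the harder-to-corrupt rule wins
```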
All of this is describable in TDT and represented by my model.
I think I may be misunderstanding your model, but, well, here's an example of where I think yours (ie, just using the built-in error terms) would fail worse than mine:
Imagine that in addition to you, there're, say, a thousand systems that are somewhat explicitly dependent on algorithm A1 (or try to be) and another thousand that are explicitly dependent on A2 (or try to be), either through directly implementing it, or modeling it, or...
If you are A1, then your decision will be linked to the first group and less so to the second group... and if you are A2, then the reverse: your decision will be linked to the second group and less so to the first.
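For illustration, here is a sketch of how summing over "which algorithm I am" captures that linkage; the payoff functions and default outputs are placeholders of my own, not part of the example above.

```python
def expected_utility(choice, p_i_am_a1,
                     payoff_via_a1_group, payoff_via_a2_group,
                     a1_default_output, a2_default_output):
    # If I am A1, the A1-dependent systems mirror my choice while the
    # A2-dependent systems do whatever A2 outputs on its own; and vice versa.
    u_if_a1 = payoff_via_a1_group(choice) + payoff_via_a2_group(a2_default_output)
    u_if_a2 = payoff_via_a1_group(a1_default_output) + payoff_via_a2_group(choice)
    return p_i_am_a1 * u_if_a1 + (1 - p_i_am_a1) * u_if_a2
```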