> The result there is sometimes abbreviated as UDT=CDT+SIA, although UDT⊂CDT+SIA is more accurate, because the optimal UDT policies are a subset of the policies which CDT+SIA can follow. This is because UDT has self-coordination power which CDT+SIA lacks.
I feel like this understates the way in which CDT+SIA is philosophically/intuitively crazy/implausible. Consider a variant of AMD where U(A)=1, U(B)=0, U(C)=2. Obviously one should select CONT with probability 1 in order to reach C, but "EXIT with probability 1" seems to be another CDT+SIA solution. The CDT+SIA reasoning there (translated from math to English) is: Suppose my policy is "EXIT with probability 1". Then I'm at X with probability 1. Should I deviate from this policy? If I do CONT instead, I'm still at X with probability 1 and the copy of me at Y will still do EXIT with probability 1 so I'll end up at B for sure with utility 0, therefore I should not deviate. Isn't this just obviously crazy (assuming I didn't misunderstand something)?
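Spelling out the arithmetic of that fixed-point check, with the policy "EXIT with probability 1" held fixed at Y:

EV(EXIT at X) = U(A) = 1
EV(CONT at X, while the copy at Y still EXITs) = U(B) = 0

so CDT+SIA sees no profitable deviation at X (1 > 0), and "EXIT with probability 1" ratifies itself, even though "CONT with probability 1" would get U(C)=2.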
> we could say that UDT1.0 = CDT+SIA
But UDT1.0 already gives a unique and correct solution to the problem above.
> Caspar Oesterheld commented on that post with an analogous EDT+SSA result.
I tried to understand Caspar's EDT+SSA but was unable to figure it out. Can someone show how to apply it to an example like the AMD to help illustrate it?
> I tried to understand Caspar’s EDT+SSA but was unable to figure it out. Can someone show how to apply it to an example like the AMD to help illustrate it?
Sorry about that! I'll try to explain it some more. Let's take the original AMD. Here, the agent only faces a single type of choice -- whether to EXIT or CONTINUE. Hence, in place of a policy we can just condition on p, the probability of CONTINUING, when computing our SSA probabilities. Now, when using EDT+SSA, we assign probabilities to being a specific instance in a specific possible history of the world. For example, we assign probabilities of the form P(X, ⟨C,E⟩ | p), which denotes the probability that, given I choose to CONTINUE with probability p, history ⟨C,E⟩ (a.k.a. CONTINUE, EXIT) is actual and that I am the instance at intersection X (i.e., the first intersection). Since we're using SSA, these probabilities are computed as follows:

P(X, ⟨C,E⟩ | p) = P(⟨C,E⟩ | p) * 1/2 = p(1-p) * 1/2

That is, we first compute the probability that the history ⟨C,E⟩ itself is actual (given p). Then we multiply it by the probability that, within that history, I am the instance at X, which is just 1 divided by the number of instances of myself in that history, i.e. 2.

Now, the expected value according to EDT+SSA given p can be computed by just summing over all possible situations, i.e. over all combinations of a history and a position within that history, and multiplying the probability of that situation with the utility given that situation:

EV(p) = (1-p) * U(⟨E⟩) + p(1-p) * 1/2 * U(⟨C,E⟩) + p(1-p) * 1/2 * U(⟨C,E⟩) + p^2 * 1/2 * U(⟨C,C⟩) + p^2 * 1/2 * U(⟨C,C⟩)
     = (1-p) * U(⟨E⟩) + p(1-p) * U(⟨C,E⟩) + p^2 * U(⟨C,C⟩)
And that's exactly the ex ante expected value (or UDT-expected value, I suppose) of continuing with probability p. Hence, EDT+SSA's recommendation in AMD is the ex ante optimal policy (or UDT's recommendation, I suppose). This realization is not original to myself (though I came up with it independently in collaboration with Johannes Treutlein) -- the following papers make the same point:
My comment generalizes these results a bit to include cases in which the agent faces multiple different decisions.
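Incidentally, here is a quick numerical check of the single-decision AMD calculation above. This is a minimal sketch, assuming the standard absent-minded-driver payoffs (0 for exiting at the first intersection, 4 for continuing then exiting, 1 for continuing past both); those numbers are my assumption, not spelled out in the comment:

```python
# Check that the SSA-weighted sum over (history, position) pairs equals the
# ex ante expected value of the policy "CONTINUE with probability p".

def ex_ante_ev(p):
    """Ex ante value of continuing with probability p (standard AMD payoffs)."""
    return (1 - p) * 0 + p * (1 - p) * 4 + p * p * 1

def edt_ssa_ev(p):
    """Sum over all (history, position-in-history) pairs, SSA-weighted."""
    histories = [
        ((1 - p), 1, 0),      # <EXIT>: one instance of me, utility 0
        (p * (1 - p), 2, 4),  # <CONTINUE, EXIT>: two instances, utility 4
        (p * p, 2, 1),        # <CONTINUE, CONTINUE>: two instances, utility 1
    ]
    total = 0.0
    for prob_h, n_instances, utility in histories:
        for _ in range(n_instances):          # one term per position i in h
            total += prob_h * (1 / n_instances) * utility
    return total

for p in [0.0, 0.25, 0.5, 2 / 3, 1.0]:
    assert abs(ex_ante_ev(p) - edt_ssa_ev(p)) < 1e-12
print("EDT+SSA value matches the ex ante value for all tested p")
```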
Thanks, I think I understand now, and made some observations about EDT+SSA at the old thread. At this point I'd say this quote from the OP is clearly wrong:
> So, we could say that CDT+SIA = EDT+SSA = UDT1.0; or, CDT=EDT=UDT for short.
In fact UDT1.0 > EDT+SSA > CDT+SIA, because CDT+SIA is not even able to coordinate agents making the same observation, while EDT+SSA can do that but not coordinate agents making different observations, and UDT1.0 can (probably) coordinate agents making different observations (but seemingly at least some of them require UDT1.1 to coordinate).
> Aside: Bayes nets which are representing decision problems are usually called influence diagrams rather than Bayes nets. I think this convention is silly; why do we need a special term for that?
In influence diagrams, nodes have a type--uncertainty, decision, or objective. This gives you legibility, and makes it more obvious what sort of interventions are 'in the spirit of the problem' or 'necessary to give a full solution.' (It's not obvious from the structure of the causal network that I should set 'my action' instead of 'Omega's prediction' in Newcomb's Problem; I need to read it off the labels. In an influence diagram, it's obvious from the shape of the node.) This is a fairly small benefit, tho, and seems much less useful than the restriction on causal networks that the arrows imply causation.
[Edit] They also make it clearer how to do factorized decision-making with different states of local knowledge, especially when knowledge is downstream of earlier decisions you made; if you're trying to reason about how a simple agent should deal with a simple situation, this isn't that helpful, but if you're trying to reason about many different corporate policies simultaneously, then something influence-diagram shaped might be better.
I guess, philosophically, I worry that giving the nodes special types like that pushes people toward thinking about agents as not-embedded-in-the-world, thinking things like "we need to extend Bayes nets to represent actions and utilities, because those are not normal variable nodes". Not that memoryless cartesian environments are any better in that respect.
> I guess, philosophically, I worry that giving the nodes special types like that pushes people toward thinking about agents as not-embedded-in-the-world, thinking things like "we need to extend Bayes nets to represent actions and utilities, because those are not normal variable nodes". Not that memoryless cartesian environments are any better in that respect.
I see where this is coming from, but I think it might also go the opposite direction. For example, my current guess of how counterfactuals/counterlogicals ground out is the imagination process; I implicitly or explicitly think of different actions I could take (or different ways math could be), and somehow select from those actions (hypotheses / theories); the 'magic' is all happening in my imagination instead of 'in the world' (noting that, of course, my imagination is being physically instantiated). Less imaginative reactive processes (like thermostats 'deciding' whether to turn on the heater or not) don't get this treatment, and are better considered as 'just part of the environment', and if we build an imaginative process out of unimaginative processes (certainly neurons are more like thermostats than they are like minds) then it's clear the 'magic' comes from the arrangement of them rather than the individual units.
Which suggests how the type distinction might be natural; places where I see decision nodes are ones where I expect to think about what action to take next (or expect some other process to think about what action to take next), or think that it's necessary to think about how that thinking will go.
I'm not sure which you're addressing, but, note that I'm not objecting to the practice of illustrating variables with diamonds and boxes rather than only circles so that you can see at a glance where the choices and the utility are (although I don't tend to use the convention myself). I'm objecting to the further implication that doing this makes it not a Bayes net.
> I'm objecting to the further implication that doing this makes it not a Bayes net.
I mean, white horses are not horses, right? [Example non-troll interpretations of that are "the set 'horses' only contains horses, not sets" and "the two sets 'white horses' and 'horses' are distinct." An example interpretation that is false is "for all members X of the set 'white horses', X is not a member of the set 'horses'."]
To be clear, I don't think it's all that important to use influence diagrams instead of causal diagrams for decision problems, but I do think it's useful to have distinct and precise concepts (such that if it even becomes important to separate the two, we can).
What is it that you want out of them being Bayes nets?
I disagree. All the nodes in the network should be thought of as grounding out in imagination, in that it's a world-model, not a world. Maybe I'm not seeing your point.
I would definitely like to see a graphical model that's more capable of representing the way the world-model itself is recursively involved in decision-making.
One argument for calling an influence diagram a generalization of a Bayes net could be that the conditional probability table for the agent's policy given observations is not given as part of the influence diagram, and instead must be solved for. But we can still think of this as a special case of a Bayes net, rather than a generalization, by thinking of an influence diagram as a special sort of Bayes net in which the decision nodes have to have conditional probability tables obeying some optimality notion (such as the CDT optimality notion, the EDT optimality notion, etc).
This constraint is not easily represented within the Bayes net itself, but instead imposed from outside. It would be nice to have a graphical model in which you could represent that kind of constraint naturally. But simply labelling things as decision nodes doesn't do much. I would rather have a way of identifying something as agent-like based on the structure of the model for it. (To give a really bad version: suppose you allow directed cycles, rather than requiring DAGs, and you think of the "backwards causality" as agency. But, this is really bad, and I offer it only to illustrate the kind of thing I mean -- allowing you to express the structure which gives rise to agency, rather than taking agency as a new primitive.)
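Here is a minimal sketch of the "special case" reading, with illustrative node names and numbers of my own choosing: the decision node is just a chance node whose CPT we solve for under an optimality notion, after which the whole thing is an ordinary Bayes net.

```python
# Sketch: an "influence diagram" as a Bayes net whose decision-node CPT is
# required to satisfy an optimality condition, here solved for explicitly.
import itertools

# Chance node: Weather, no parents.
P_weather = {"rain": 0.3, "sun": 0.7}

# Objective node: Utility as a function of Weather and the decision Umbrella.
U = {
    ("rain", "take"): 5, ("rain", "leave"): 0,
    ("sun", "take"): 8, ("sun", "leave"): 10,
}

# Decision node: Umbrella, with parent Weather (the agent observes the weather).
# In an influence diagram this CPT is left unspecified; solving for it turns
# the diagram back into an ordinary Bayes net.
def solve_decision_cpt():
    cpt = {}
    for w in P_weather:  # one distribution over actions per observation
        best = max(["take", "leave"], key=lambda a: U[(w, a)])
        cpt[w] = {a: (1.0 if a == best else 0.0) for a in ["take", "leave"]}
    return cpt

P_umbrella_given_weather = solve_decision_cpt()

# With the decision CPT filled in, expected utility is computed exactly as for
# any Bayes net: sum over joint settings of all nodes.
def expected_utility():
    return sum(
        P_weather[w] * P_umbrella_given_weather[w][a] * U[(w, a)]
        for w, a in itertools.product(P_weather, ["take", "leave"])
    )

print(P_umbrella_given_weather)
print(expected_utility())  # 0.3*5 + 0.7*10 = 8.5
```

In this toy example the agent observes the whole relevant state, so the CDT and EDT optimality notions pick out the same CPT; the interesting cases are exactly the ones where they come apart.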
> All the nodes in the network should be thought of as grounding out in imagination, in that it's a world-model, not a world. Maybe I'm not seeing your point.
My point is that my world model contains both 'unimaginative things' and 'things like world models', and it makes sense to separate those nodes (because the latter are typically functions of the former). Agreed that all of it is 'in my head', but it's important that the 'in my head' realm contain the 'in X's head' toolkit.
> unfortunately "coordination" lacks a snappy three-letter acronym.
I propose the following three letters: "YOU" (possibly as a backronym).
Hrm. I realize that the post would be comprehensible to a much wider audience with a glossary, but there's one level of effort needed for me to write posts like this one, and another level needed for posts where I try to be comprehensible to someone who lacks all the jargon of MIRI-style decision theory. Basically, if I write with a broad audience in mind, then I'm modeling all the inferential gaps and explaining a lot more details. I would never get to points like the one I'm trying to make in this post. (I've tried.) Posts like this are primarily for the few people who have kept up with the CDT=EDT sequence so far, to get my updated thinking in writing in case anyone wants to go through the effort of trying to figure out what in the world I mean. To people who need a glossary, I recommend searching LessWrong and the Stanford Encyclopedia of Philosophy.
I encounter the same problem when I'm writing about voting theory. But there is a set of people who have followed past discussion closely enough to follow something technical like this with a glossary, but not without one. My solution has been to make sure every acronym I use has an entry on electowiki, and then include a note saying so with a link to electowiki. I think you could helpfully do the same using the LessWrong wiki.
If someone made a glossary, what terms would you want in it?
(The closest thing right now might be https://wiki.lesswrong.com/wiki/LessWrong_Wiki)
Epistemic status: I no longer endorse the particular direction this post advocates, though I'd be excited if someone figured out something that seems to work. I still endorse most of the specific observations.
So... what's the deal with counterfactuals?
Over the past couple of years, I've been writing about the CDT=EDT perspective. I've now organized those posts into a sequence for easy reading.
I call CDT=EDT a "perspective" because it is a way of consistently answering questions about what counterfactuals are and how they work. At times, I've argued strongly that it is the correct way. That's basically because:
However, recently I've realized that there's a perspective which unifies even more approaches, while being less boring (more optimistic about counterfactual reasoning helping us to do well in decision-theoretic problems). It's been right in front of me the whole time, but I was blind to it due to the way I factored the problem of formulating decision theory. It suggests a research direction for making progress in our understanding of counterfactuals; I'll try to indicate some open curiosities of mine by the end.
Three > Two
The claim I'll be elaborating on in this post is, essentially, that the framework in Jessica Taylor's post about memoryless cartesian environments is better than the CDT=EDT way of thinking. You'll have to read the post to get the full picture if you haven't, but to briefly summarize: if we formalize decision problems in a framework which Jessica Taylor calls "memoryless cartesian environments" (which we can call "memoryless POMDPs" if we want to be closer to academic CS/ML terminology), reasoning about anthropic uncertainty in a certain way (via the self-indication assumption, SIA for short) makes it possible for CDT to behave like UDT.
The result there is sometimes abbreviated as UDT=CDT+SIA, although UDT⊂CDT+SIA is more accurate, because the optimal UDT policies are a subset of the policies which CDT+SIA can follow. This is because UDT has self-coordination power which CDT+SIA lacks. (We could say UDT=CDT+SIA+coordination, but unfortunately "coordination" lacks a snappy three-letter acronym. Or, to be even more pedantic, we could say that UDT1.0 = CDT+SIA, and UDT1.1 = CDT+SIA+coordination. (The difference between 1.0 and 1.1 is, after all, the presence of global policy coordination.)) [EDIT: This isn't correct. See Wei Dai's comment.]
Caspar Oesterheld commented on that post with an analogous EDT+SSA result. SSA (the self-sampling assumption) is one of the main contenders beside SIA for correct anthropic reasoning. Caspar's comment shows that we can think of the correct anthropics as a function of your preference between CDT and EDT. So, we could say that CDT+SIA = EDT+SSA = UDT1.0; or, CDT=EDT=UDT for short. [EDIT: As per Wei Dai's comment, the equation "CDT+SIA = EDT+SSA = UDT1.0" is really not correct due to differing coordination strengths; as he put it, UDT1.0 > EDT+SSA > CDT+SIA.]
My CDT=EDT view came from being pedantic about how decision problems are represented, and noticing that when you're pedantic, it becomes awfully hard to drive a wedge between CDT and EDT; you've got to do things which are strange enough that it becomes questionable whether it's a fair comparison between CDT and EDT. However, I didn't notice the extent to which my "being very careful about the representation" was really insisting that Bayes nets are the proper representation.
(Aside: Bayes nets which are representing decision problems are usually called influence diagrams rather than Bayes nets. I think this convention is silly; why do we need a special term for that?)
It is rather curious that LIDT also exhibited CDT=EDT-style behavior. This was part of what made me feel that CDT=EDT was a convergent result of many different approaches, and it kept me from noticing the reliance on certain Bayes-net formulations of decision problems. Now, I instead find it curious and remarkable that logical induction seems to think as if the world were made of Bayes nets.
If CDT=EDT comes from insisting that decision problems are represented as Bayes nets, CDT=EDT=UDT is the view which comes from insisting that decision problems be represented as memoryless cartesian environments. At the moment, this just seems like a better way to be pedantic about representation. It unifies three decision theories instead of two.
Updatelessness Doesn't Factor Out
In fact, I thought about Jessica's framework frequently, but I didn't think of it as an objection to my CDT=EDT way of thinking. I was blind to this objection because I thought (logical-)counterfactual reasoning and (logically-)updateless reasoning could be dealt with as separate problems. The claim was not that CDT=EDT-style decision-making did well, but rather, that any decision problem where it performed poorly could be analyzed as a case where updateless reasoning is needed in order to do well. I let my counterfactual reasoning be simple, blaming all the hard problems on the difficulty of logical updatelessness.
Once I thought to question this view, it seemed very likely wrong. The Dutch Book argument for CDT=EDT seems closer to the true justification for CDT=EDT reasoning than the Bayes-net argument, but the Dutch Book argument is a dynamic consistency argument. I know that CDT and EDT both violate dynamic consistency, in general. So, why pick on one special type of dynamic consistency violation which CDT can illustrate but EDT cannot? In other words, the grounds on which I can argue CDT=EDT seem to point more directly to UDT instead.
What about all those arguments for CDT=EDT?
Non-Zero Probability Assumptions
I've noted before that each argument I make for CDT=EDT seems to rely on an assumption that actions have non-zero probability. I leaned heavily on an assumption of epsilon exploration, although one could also argue that all actions must have non-zero probability on different grounds (such as the implausibility of knowing so much about what you are going to do that you can completely rule out any action, before you've made the decision). Focusing on cases where we have to assign probability zero to some action was a big part of finally breaking myself of the CDT=EDT view and moving to the CDT=EDT=UDT view.
(I was almost broken of the view about a year ago by thinking about the XOR blackmail problem, which has features in common with the case I'll consider now; but, it didn't stick, perhaps because the example doesn't actually force actions to have probability zero and so doesn't point so directly to where the arguments break down.)
Consider the transparent Newcomb problem with a perfect predictor:
Transparent Newcomb. Omega runs a perfect simulation of you, in which you face two boxes, a large box and a small box. Both boxes are made of transparent glass. The small box contains $100, while the large one contains $1,000. In the simulation, Omega gives you the option of either taking both boxes or only taking the large box. If Omega predicts that you will take only one box, then Omega puts you in this situation for real. Otherwise, Omega gives the real you the same decision, but with the large box empty. You find yourself in front of two full boxes. Do you take one, or two?
Apparently, since Omega is a perfect predictor, we are forced to assign probability zero to 2-boxing even if we follow a policy of epsilon-exploring. In fact, if you implement epsilon-exploration by refusing to take any action which you're very confident you'll take (you have a hard-coded response: if P("I do action X")>1-epsilon, do anything but X), which is how I often like to think about it, then you are forced to 2-box in transparent Newcomb. I was expecting CDT=EDT type reasoning to 2-box (at which point I'd say "but we can fix that by being updateless"), but this is a really weird reason to 2-box.
Still, that's not in itself an argument against CDT=EDT. Maybe the rule that we can't take actions we're overconfident in is at fault. The argument against CDT=EDT style counterfactuals in this problem is that the agent should expect that if it 2-boxes, then it won't ever be in the situation to begin with; at least, not in the real world. As discussed somewhat in the happy dance problem, this breaks important properties that you might want out of conditioning on conditionals. (There are some interesting consequences of this, but they'll have to wait for a different post.) More importantly for the CDT=EDT question, this can't follow from evidential conditioning, or learning about consequences of actions through epsilon-exploration, or any other principles in the CDT=EDT cluster. So, there would at least have to be other principles in play.
A very natural way of dealing with the problem is to represent the agent's uncertainty about whether it is in a simulation. If you think you might be in Omega's simulation, observing a full box doesn't imply certainty about your own action anymore, or even about whether the box is really full. This is exactly how you deal with the problem in memoryless cartesian environments. But, if we are willing to do this here, we might as well think about things in the memoryless cartesian framework all over the place. This contradicts the CDT=EDT way of thinking about things in lots of problems where updateless reasoning gives different answers than updateful reasoning, such as counterfactual mugging, rather than only in cases where some action has probability zero.
(I should actually say "problems where updateless reasoning gives different answers than non-anthropic updateful reasoning", since the whole point here is that updateful reasoning can be consistent with updateless reasoning so long as we take anthropics into account in the right way.)
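To make the simulation-uncertainty point concrete (one simple way to fill in the details; the numbers are mine, not forced by the problem statement): suppose Omega always runs exactly one simulation of you, and the real situation has full boxes only when Omega predicts 1-boxing. Then, on seeing a full box: if your policy 1-boxes, both the simulated instance and the real instance see this, so, weighting the instances equally, P(I'm in the simulation | I see a full box) = 1/2, and the box in front of the real you really is full. If your policy 2-boxes, only the simulated instance ever sees a full box, so that probability is 1, and your 2-boxing just determines that the real you faces an empty large box. Either way, you can update on the observation while retaining rational doubt about whether it is veridical.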
I also note that trying to represent this problem in Bayes nets, while possible, is very awkward and dissatisfying compared to the representation in memoryless cartesian environments. You could say I shouldn't have gotten myself into a position where this felt like significant evidence, but, reliant on Bayes-net thinking as I was, it did.
Ok, so, looking at examples which force actions to have probability zero made me revise my view even for cases where actions all have non-zero probability. So again, it makes sense to ask: but what about the arguments in favor of CDT=EDT?
Bayes Net Structure Assumptions
The argument in the Bayes-net setting makes some assumptions about the structure of the Bayes net, illustrated earlier. Where do those go wrong?
In the Bayes net setting, observations are represented as parents of the epistemic state (which is a parent of the action). To represent the decision conditional on an observation, we condition on the observation being true. This stops us from putting some probability on our observations being false due to us being in a simulation, as we do in the memoryless cartesian setup.
In other words: the CDT=EDT setup makes it impossible to update on something and still have rational doubt in it, which is what we need to do in order to have an updateful DT act like UDT.
There's likely some way to fix this while keeping the Bayes-net formalism. However, memoryless cartesian environments model it naturally.
Question: how can we model memoryless cartesian environments in Bayes nets? Can we do this in a way such that the CDT=EDT theorem applies (making the CDT=EDT way of thinking compatible with the CDT=EDT=UDT way of thinking)?
CDT Dutch Book
What about the Dutch-book argument for CDT=EDT? I'm not quite sure how this one plays out. I need to think more about the setting in which the Dutch-book can be carried out, especially as it relates to anthropic problems and anthropic Dutch-books.
Learning Theory
I said that I think the Dutch-book argument gets closer to the real reason CDT=EDT seems compelling than the Bayes-net picture does. Well, although the Dutch Book argument against CDT gives a crisp justification of a CDT=EDT view, I felt the learning-theoretic intuitions which led me to formulate the Dutch book are closer to the real story. It doesn't make sense to ask an agent to have good counterfactuals in any single situation, because the agent may be ignorant about how to reason about the situation. However, any errors in counterfactual reasoning which result in observed consequences predictably differing from counterfactual expectations should eventually be corrected.
I'm still in the dark about how this argument connects to the CDT=EDT=UDT picture, just as with the Dutch-book argument. I'll discuss this more in the next section.
Static vs Dynamic
A big update in my thinking recently has been to cluster frameworks into "static" and "dynamic", and ask how to translate back and forth between static and dynamic versions of particular ideas. Classical decision theory has a strong tendency to think in terms of statically given decision problems. You could say that the epistemic problem of figuring out what situation you're in is assumed to factor out: decision theory deals only with what to do once you're in a particular situation. On the other hand, learning theory deals with more "dynamic" notions of rationality: rationality-as-improvement-over-time, rather than an absolute notion of perfect performance. (For our purposes, "time" includes logical time; even in a single-shot game, you can learn from relevantly similar games which play out in thought-experiment form.)
This is a messy distinction. Here are a few choice examples:
Static version: Dutch-book and money-pump arguments.
Dynamic version: Regret bounds.
Dutch-book arguments rely on the idea that you shouldn't ever be able to extract money from a rational gambler without a chance of losing it instead. Regret bounds in learning theory offer a more relaxed principle, that you can't ever extract too much money (for some notion of "too much" given by the particular regret bound). The more relaxed condition is more broadly applicable; Dutch-book arguments only give us the probabilistic analog of logical consistency properties, whereas regret bounds give us inductive learning.
Static: Probability theory.
Dynamic: Logical induction.
In particular, the logical induction criterion gives a notion of regret which implies a large number of nice properties. Typically, the difference between logical induction and classical probability theory is framed as one of logical omniscience vs logical uncertainty. The static-vs-dynamic frame instead sees the critical difference as one of rationality in a static situation (where it makes sense to think about perfect reasoning) vs learning-theoretic rationality (where it doesn't make sense to ask for perfection, and instead, one thinks in terms of regret bounds).
Static: Bayes-net decision theory (either CDT or EDT as set up in the CDT=EDT argument).
Dynamic: LIDT.
As I mentioned before, the way LIDT seems to naturally reason as if the world were made of Bayes nets now seems like a curious coincidence rather than a convergent consequence of correct counterfactual conditioning. I would like a better explanation of why this happens. Here is my thinking so far:
There's a lot of formal work one could do to try to make the connection more rigorous (and look for places where the connection breaks down!).
Static: UDT.
Dynamic: ???
The problem of logical updatelessness has been a thorn in my side for some time now. UDT is a good reply to a lot of decision-theoretic problems when they're framed in a probability-theoretic setting, but moving to a logically uncertain setting, it's unclear how to apply UDT. UDT requires a fixed prior, whereas logical induction gives us a picture in which logical uncertainty is fundamentally about how to revise beliefs as you think longer.
The main reason the static-vs-dynamic idea has been a big update for me is that I realized that a lot of my thinking has been aimed at turning logical uncertainty into a "static" object, to be able to apply UDT. I haven't even posted about most of those ideas, because they haven't led anywhere interesting. Tsvi's post on thin logical priors is definitely an example, though. I now think this type of approach is likely doomed to failure, because the dynamic perspective is simply superior to the static one.
The interesting question is: how do we translate UDT to a dynamic perspective? How do we learn updateless behavior?
For all its flaws, taking the dynamic perspective on decision theory feels like something asymptotic decision theory got right. I have more to say about what ADT does right and wrong, but perhaps it is too much of an aside for this post.
A general strategy we might take to approach that question is: how do we translate individual things which UDT does right into learning-theoretic desiderata? (This may be more tractable than trying to translate the UDT optimality notion into a learning-theoretic desideratum whole-hog.)
Static: Memoryless Cartesian decision theories (CDT+SIA or EDT+SSA).
Dynamic: ???
The CDT=EDT=UDT perspective on counterfactuals is that we can approach the question of learning logically updateless behavior by thinking about the learning-theoretic version of anthropic reasoning. How do we learn which observations to take seriously? How do we learn about what to expect supposing we are being fooled by a simulation? Some optimistic speculation on that is the subject of the next section.
We Have the Data
Part of why I was previously very pessimistic about doing any better than the CDT=EDT-style counterfactuals was that we don't have any data about counterfactuals, almost by definition. How are we supposed to learn what to counterfactually expect? We only observe the real world.
Consider LIDT playing transparent Newcomb with a perfect predictor. Its belief that it will 1-box in cases where it sees that the large box is full must converge to 100%, because it only ever sees a full box in cases where it does indeed 1-box. Furthermore, the expected utility of 2-boxing can be anything, since it will never see cases where it sees a full box and 2-boxes. This means I can make LIDT 1-box by designing my LI to think 2-boxing upon seeing a full box will be catastrophically bad: I simply include a trader with high initial wealth who bets it will be bad. Similarly, I can make LIDT 2-box whenever it sees the full box by including a trader who bets 2-boxing will be great. Then, the LIDT will never see a full box except on rounds where it is going to epsilon-explore into 1-boxing.
(The above analysis depends on details of how epsilon exploration is implemented. If it is implemented via the probabilistic chicken-rule, mentioned earlier, making the agent explore whenever it is very confident about which action it takes, then the situation gets pretty weird. Assume that LIDT is epsilon-exploring pseudorandomly instead.)
LIDT's confidence that it 1-boxes whenever it sees a full box is jarring, because I've just shown that I can make it either 1-box or 2-box depending on the underlying LI. Intuitively, an LIDT agent who 2-boxes upon seeing the full box should not be near-100% confident that it 1-boxes.
The problem is that the cases where LIDT sees a full box and 2-boxes are all counterfactual, since Omega is a perfect predictor and doesn't show us a full box unless we in fact 1-box. LIDT doesn't learn from counterfactual cases; the version of the agent in Omega's head is shut down when Omega is done with it, and never reports its observations back to the main unit.
(The LI does correctly learn the mathematical fact that its algorithm 2-boxes when input observations of a full box, but, this does not help it to have the intuitively correct expectations when Omega feeds it false sense-data.)
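Here is a toy frequency-counting caricature of this point (not LIDT itself, and with my own stand-in for the agent's "disposition" on seeing a full box): an agent that only learns from the rounds it actually experiences ends up with an empirical conditional frequency of 1-boxing-given-full-box of essentially 100%, regardless of what it would in fact do upon seeing a full box.

```python
# Toy caricature (not LIDT): estimate P(1-box | saw full box) from observed
# rounds only, for two different dispositions about what to do on a full box.
import random
random.seed(0)

def run(disposition, epsilon=0.01, rounds=100_000):
    seen_full = 0
    one_boxed_when_seen = 0
    for _ in range(rounds):
        # Pseudorandom epsilon-exploration: occasionally pick a random action.
        if random.random() < epsilon:
            would_do = random.choice(["1-box", "2-box"])
        else:
            would_do = disposition
        # A perfect predictor only shows the full box when the (counterfactual)
        # action on seeing it would be 1-boxing; otherwise the full-box round
        # is never experienced, so it contributes no training data.
        if would_do == "1-box":
            seen_full += 1
            one_boxed_when_seen += 1
    return one_boxed_when_seen / max(seen_full, 1), seen_full

for disposition in ["1-box", "2-box"]:
    freq, n = run(disposition)
    print(f"disposition={disposition}: empirical P(1-box | full box) = {freq}"
          f" over {n} observed full-box rounds")
# Both dispositions yield 1.0; the 2-boxing disposition just sees far fewer
# full-box rounds (only the exploration rounds that happen to 1-box).
```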
In the terminology of The Happy Dance Problem, LIDT isn't learning the right observation-counterfactuals: the predictions about what action it takes given different possible observations. However, we have the data: the agent could simulate itself under alternative epistemic conditions, and train its observation-counterfactuals on what action it in fact takes in those conditions.
Similarly, the action-counterfactuals are wrong: LIDT can believe anything about what happens when it 2-boxes upon seeing a full box. Again, we have the data: LI can observe that on rounds when it is mathematically true that the LIDT agent would have 2-boxed upon seeing a full box, it doesn't get the chance. This knowledge simply isn't being "plugged in" to the decision procedure in the right way. Generally speaking, an agent can observe the real consequences of counterfactual actions, because (1) the counterfactual action is a mathematical fact of what the agent does under a counterfactual observation, and (2) the important effects of this counterfactual action occur in the real world, which we can observe directly.
This observation makes me much more optimistic about learning interesting counterfactuals. Previously, it seemed like by definition there would be no data from which to learn the correct counterfactuals, other than the (EDTish) requirement that they should match the actual world for actions actually taken. Now, it seems like I have not one, but two sources of data: the observation-counterfactuals can be simulated outright, and the action-counterfactuals can be trained on what actually happens when counterfactual actions are taken.
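To make the two data sources concrete, here is a rough sketch of how they could be logged in the transparent Newcomb setup. This is my own framing, not a worked-out algorithm from the post; the $100/$1,000 payoffs come from the problem statement above, while the assumption that the agent takes both boxes when the large one is visibly empty is mine.

```python
# Sketch: logging observation-counterfactual and action-counterfactual data.
# `policy` is a stand-in for the agent's own decision procedure, evaluated by
# simulating itself on hypothetical sense-data.

def policy(observation):
    # Placeholder for the agent's (logically uncertain) decision procedure.
    return "2-box" if observation == "full box" else "1-box"

observation_counterfactuals = []  # (hypothetical observation, what I'd do)
action_counterfactuals = []       # (counterfactual action, real-world payoff)

for round_index in range(1000):
    # Source 1: simulate myself under sense-data I may never actually receive,
    # and record what I in fact do in that simulation.
    would_do = policy("full box")
    observation_counterfactuals.append(("full box", would_do))

    # A perfect predictor arranges the real world according to that
    # counterfactual action, so its real consequences are observable either way.
    if would_do == "1-box":
        real_payoff = 1000  # shown full boxes for real; takes only the large box
    else:
        real_payoff = 100   # large box is empty for real; takes both boxes
    # Source 2: record the real consequence of the counterfactual action.
    action_counterfactuals.append((would_do, real_payoff))
```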
I haven't been able to plug these pieces together to get a working counterfactual-learning algorithm yet. It might be that I'm still missing a component. But ... it really feels like there should be something here.