This post is inspired by the recent discussion I had with IlyaShpitser and Vaniver on EDT.
A random variable only ever has one value
In probability theory, statistics and so on, we often use the notion of a random variable (RV). If you go look at the definition, you will see that an RV is a function of the sample space. What that means is that an RV assigns a value to each possible outcome of a system. In reality, where there are no closed systems, this means that an RV assigns a value to each possible universe.
For example, a random variable X representing the outcome of a die roll is a function of type "Universe → {1..6}". The value of X in a particular universe u is then X(u). Uncertainty in X corresponds to uncertainty about the universe we are in. Since X is a pure mathematical function, its value is fixed for each input. That means that in a fixed universe, say our universe, such a random variable only ever takes on one value.
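As a small illustration, here is a sketch of this view in Python. Representing a universe as a dictionary and the key "die_roll_1" are my own illustrative choices; the point is only that a random variable is an ordinary function whose argument is the universe.

```python
# A random variable is an ordinary function from universes to values.
# (Representing a universe as a dict with a "die_roll_1" key is an
#  illustrative assumption, not something from the post.)
def X(universe):
    """Outcome of one particular die roll in the given universe."""
    return universe["die_roll_1"]

# Two possible universes that differ only in how that one roll came out.
u1 = {"die_roll_1": 3}
u2 = {"die_roll_1": 6}

print(X(u1))  # 3 -- in a fixed universe, X takes exactly one value
print(X(u2))  # 6 -- uncertainty about X is uncertainty about which universe we are in
```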
So, before the die roll, the value of X is undefined[1], and after the roll X is forever fixed. X is the outcome of one particular roll. If I roll the same die again, that doesn't change the value of X. If you want to talk about multiple rolls, you have to use different variables. The usual solution is to use indices: X_1, X_2, etc.
This also means that the nodes in a causal model are not random variables. For example, in the causal model "Smoking → Cancer", there is no single RV for smoking. Rather, the model is implicitly generalized to mean "Smoking_i → Cancer_i" for all persons i.
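To illustrate (the per-person dictionary layout below is an assumption of mine, not part of the model), the node "Smoking" is a template that yields a separate random variable Smoking_i for every person i:

```python
# "Smoking" in the causal model is a template: one random variable per person i.
# (The per-person dict layout is an illustrative assumption.)
def smoking(i):
    """The random variable Smoking_i: does person i smoke in a given universe?"""
    return lambda universe: universe["people"][i]["smokes"]

def cancer(i):
    """The random variable Cancer_i: does person i get cancer in a given universe?"""
    return lambda universe: universe["people"][i]["cancer"]

u = {"people": [{"smokes": True,  "cancer": True},
                {"smokes": False, "cancer": False}]}

print(smoking(0)(u), cancer(0)(u))  # True True   -- Smoking_0 and Cancer_0 in universe u
print(smoking(1)(u), cancer(1)(u))  # False False -- a different pair of variables
```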
What this means for EDT
It is sometimes claimed that Evidential Decision Theory (EDT) cannot deal with causal structure. But I would disagree. To avoid confusion, I will refer to my interpretation as Estimated Evidential Decision Theory (EEDT).
Decision theories such as (E)EDT rely on the following formula to make decisions:

V(a) = Σ_j P(O = o_j | a) · U(o_j)

where o_j are the possible outcomes, U(o_j) is the utility of an outcome, O is a random variable that represents the actual outcome, and a is an action. The (E)EDT policy is to take the action that maximizes V(a), the value of that action.
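In code, the decision rule might look like the following sketch. The betting example and all of its numbers are made up for illustration; the only thing taken from the formula above is the sum over outcomes and the argmax over actions.

```python
# A minimal sketch of the (E)EDT rule.  It assumes the agent already has some
# way of producing P(O = o_j | a); where those numbers come from is the topic
# of the rest of the post.
def value(action, outcomes, prob, utility):
    """V(a) = sum over o_j of P(O = o_j | a) * U(o_j)."""
    return sum(prob(o, action) * utility(o) for o in outcomes)

def best_action(actions, outcomes, prob, utility):
    """The (E)EDT policy: take the action that maximizes V(a)."""
    return max(actions, key=lambda a: value(a, outcomes, prob, utility))

# Toy example with made-up numbers: outcomes are payoffs, utility is identity.
outcomes = [100, -50, 0]
utility = lambda o: o
probs = {"bet":  {100: 0.6, -50: 0.4, 0: 0.0},
         "pass": {100: 0.0, -50: 0.0, 0: 1.0}}
prob = lambda o, a: probs[a][o]

print(value("bet", outcomes, prob, utility))                  # 0.6*100 + 0.4*(-50) = 40.0
print(best_action(["bet", "pass"], outcomes, prob, utility))  # bet
```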
How would you evaluate this formula in practice? To do that, you need to know P(O = o_j | a), that is, the probability of a certain outcome given that you take a certain action. But keep in mind the previous section! There is only one random variable O, which is the outcome of this action. Without assuming some prior knowledge, O is unrelated to the outcome of other similar actions in similar situations.
At the time an agent has to decide what action a to take, the action has not happened yet, and the outcome is not yet known to him. This means that the agent has no observations of O. The agent therefore has to estimate P(O = o_j | a) using only his prior knowledge. How exactly this estimation is done is not specified by EEDT. If the agent wants to use a causal model, he is perfectly free to do so!
You might argue that by not specifying how the conditional probabilities P(O = o_j | a) are calculated, I have taken out the interesting part of the decision theory. With the right choice of estimation procedure, EEDT can describe CDT, normal/naive EDT, and even UDT[2]. But EEDT is not so general as to be completely useless. What it does give you is a way to reduce the problem of making decisions to that of estimating conditional probabilities.
Footnotes
1. Technically, 'undefined' is not in the codomain of X. What I mean is that X is a partial function of universes, or a function only of universes in which the die has been rolled.
2. To get CDT, assume there is a causal model A → O, and use that to estimate P(O = o_j | do(A = a)). To get naive EDT, estimate the probabilities from data without taking causality or confounders into account. To get UDT, model A as being the choice of all sufficiently similar agents, not just yourself.
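To make footnote 2 a bit more concrete, here is a sketch of the first two options plugged into the same value formula, using the standard smoking-lesion setup. All the numbers and utilities are made up; the hidden lesion U raises both the urge to smoke and the cancer risk, while smoking itself has no causal effect on cancer.

```python
# Footnote 2 as code (toy numbers): the same EEDT value formula, fed by two
# different estimates of P(cancer | action) in the smoking-lesion problem.
P_U = {1: 0.5, 0: 0.5}                                   # P(lesion)
P_A_given_U = {1: {"smoke": 0.8, "abstain": 0.2},        # the lesion makes smoking likelier
               0: {"smoke": 0.2, "abstain": 0.8}}
P_cancer_given_U = {1: 0.9, 0: 0.1}                      # cancer depends on the lesion only

def p_cancer_naive_edt(a):
    """Condition on observing A = a, so the action 'updates' the lesion."""
    p_a = sum(P_U[u] * P_A_given_U[u][a] for u in (0, 1))
    return sum(P_U[u] * P_A_given_U[u][a] / p_a * P_cancer_given_U[u] for u in (0, 1))

def p_cancer_cdt(a):
    """P(cancer | do(A = a)): sever U -> A, so the action tells us nothing about U."""
    return sum(P_U[u] * P_cancer_given_U[u] for u in (0, 1))

def value(a, p_cancer):
    bonus = 1 if a == "smoke" else 0                     # smoking is mildly enjoyable
    return bonus - 100 * p_cancer(a)                     # cancer costs 100 utilons

for name, estimator in [("naive EDT", p_cancer_naive_edt), ("CDT", p_cancer_cdt)]:
    best = max(["smoke", "abstain"], key=lambda a: value(a, estimator))
    print(name, "picks", best)   # naive EDT picks abstain; CDT picks smoke
```

Both agents run the same decision rule; they differ only in how they estimate P(O = o_j | a), which is exactly the sense in which EEDT reduces decision-making to estimation.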
The English explanation is that P(O|a) is "the probability of outcome O given that we observe the action is a" and P(O|do(a)) is "the probability of outcome O given that we set the action to a."
The first works by conditioning; basically, you go through the probability table, throw out all of the cases where the action isn't a, and then renormalize.
The second works by severing causal links that point into the modified node, while maintaining causal links pointing out of the modified node. Then you use this new severed subgraph to calculate a new joint probability distribution (for only the cases where the action is a).
The practical difference shows up mostly in cases where some environmental variable influences the action. If you condition on observing a, that means you make a Bayesian update, which means you can think your decision influences unmeasured variables which could have impacted your decision (because correlation is symmetric). For example, suppose you're uncertain how serious your illness is, but you know that seriousness of illness is positively correlated with going to the hospital. Then, as part of your decision whether or not to go to the hospital, your model tells you that going to the hospital would make your illness be more serious because it would make your illness seem more serious.
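Here is that difference with made-up numbers for the hospital example. Conditioning filters the joint probability table and renormalizes, while do() severs the link from seriousness to the decision before reading off the probability.

```python
# The two operations on the hospital example (all numbers are illustrative).
P_serious = {True: 0.3, False: 0.7}
P_go_given_serious = {True: 0.9, False: 0.2}             # sicker people go more often

# Joint probability table over (serious, go), built from the model above.
joint = {(s, g): P_serious[s] * (P_go_given_serious[s] if g else 1 - P_go_given_serious[s])
         for s in (True, False) for g in (True, False)}

def p_serious_given_observed_go(table):
    """Condition on go = True: throw out the go = False rows, then renormalize."""
    kept = {k: p for k, p in table.items() if k[1]}
    return sum(p for (s, g), p in kept.items() if s) / sum(kept.values())

def p_serious_given_do_go():
    """Sever Serious -> Go: force go = True while leaving P(serious) untouched."""
    severed = {(s, True): P_serious[s] for s in (True, False)}
    return sum(p for (s, g), p in severed.items() if s)

print(p_serious_given_observed_go(joint))   # ~0.66: observing the trip raises P(serious)
print(p_serious_given_do_go())              # 0.30: forcing the trip leaves it unchanged
```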
The defense of EDT is generally that of course the decision-maker would intuitively know which correlations are inside the correct reference class and which aren't. This defense breaks down if you want to implement the decision-making as a computer algorithm, where programming in intuition is an open problem, or if you want to use complicated interventions in complicated graphs, where intuition is not strong enough to reliably get the correct answer.
The benefit of do(a) is that it's an algorithmic way of encoding asymmetric causality assumptions. For example, lesion → smoke means we think learning about the lesion tells us about whether or not someone will smoke, and learning whether or not someone smoked tells us about whether or not they have the lesion; but changing someone from a smoker to a non-smoker (or the other way around) will not impact whether or not they have a lesion, while directly changing whether or not someone has the lesion will change how likely they are to smoke. With the do() operator, we can algorithmically create the correct reference class for any given intervention into a causal network, which is the severed subgraph I mentioned earlier.
How about a more concrete example: what's the difference between observing that I one-box and setting that I one-box?
P(A|B) = P(A&B)/P(B). That is the definition of conditional probability. You appear to be doing something else.