I still basically agree with my retracted comment. I'd just like to note that, taken at face value, your two equations for B given A really are the same.
The counterfactual difference comes from an implied random variable that decides which branch of the equation we're "using" (in the implied causal process that goes from A to B), and which can remember this information during counterfactual reasoning. But of course it is a simple thing to make this implied random variable an explicit node in your causal graph. This is probably the best resolution.
This sounds like a question of how you're choosing to define a causal node. Is it something that's a fixed function of its parents? In which case your hypotheses about the function from A to B are hypotheses over different causal graphs. Or should the function from parents to node be a parameter that you represent inside a causal graph? In which case you need some representation of this distribution.
Either way, I agree that you need more than what you started with to capture the counterfactuals you're thinking of here.
Problem solved: I found what I was looking for in "An Axiomatic Characterization of Causal Counterfactuals", thanks to Evan Lloyd.
Basically, the approach is to make every endogenous variable a deterministic function of the exogenous variables and of the other endogenous variables, pushing all the stochasticity into the exogenous variables.
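To make that concrete, here is a minimal sketch of the idea applied to the A, B example from the old post below. It assumes, purely for illustration, that the exogenous variable (called U_B here, a name that is not in the paper or the post) just picks which of four boolean functions maps A to B. The names RESPONSES, f_B and counterfactual_b are likewise hypothetical.

```python
from fractions import Fraction

# Deterministic mechanism: B is a fixed function of A and the exogenous U_B.
# Assumption for this sketch: U_B indexes one of the four boolean functions of A.
RESPONSES = {
    0: lambda a: 0,      # always 0
    1: lambda a: 1,      # always 1
    2: lambda a: a,      # copy A
    3: lambda a: 1 - a,  # negate A
}

def f_B(a, u_b):
    """Value of B given A = a and exogenous U_B = u_b (no randomness here)."""
    return RESPONSES[u_b](a)

def counterfactual_b(prior_u, observed_a, observed_b, new_a):
    """P(B = 1 had A been new_a), given that (observed_a, observed_b) was seen.

    prior_u maps each value of U_B to its prior probability; the observation
    must have nonzero probability under that prior.
    """
    # Abduction: condition the exogenous distribution on the observation.
    consistent = {u: p for u, p in prior_u.items()
                  if f_B(observed_a, u) == observed_b}
    total = sum(consistent.values())
    posterior = {u: p / total for u, p in consistent.items()}
    # Action and prediction: set A = new_a, push the posterior through f_B.
    return sum(p for u, p in posterior.items() if f_B(new_a, u) == 1)

# All the stochasticity lives in the prior over U_B.
prior = {2: Fraction(3, 4), 3: Fraction(1, 4)}
print(counterfactual_b(prior, observed_a=0, observed_b=0, new_a=1))  # 1
```

Once the randomness sits in U_B, the counterfactual is well defined, because the observation tells us something about U_B and U_B is held fixed when A is changed.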
Old post:
A problem that's come up with my definitions of stratification.
Consider a very simple causal graph: A → B.
In this setting, A and B are both booleans, and A=B with 75% probability (independently of whether A=0 or A=1).
I now want to compute the counterfactual: suppose I assume that B=0 when A=0. What would happen if A=1 instead?
The problem is that P(B|A) seems insufficient to solve this. Let's imagine the process that outputs B as a probabilistic mix of functions, each of which takes the value of A and outputs that of B. There are four natural functions here:
f0: B=0, whatever A is.
f1: B=1, whatever A is.
f2: B=A.
f3: B=1-A.
Then one way of modelling the causal graph is as a mix 0.75f2 + 0.25f3. In that case, knowing that B=0 when A=0 implies that P(f2)=1, so if A=1, we know that B=1.
But we could instead model the causal graph as 0.5f2 + 0.25f1 + 0.25f0. In that case, knowing that B=0 when A=0 implies that P(f2)=2/3 and P(f0)=1/3. So if A=1, B=1 with probability 2/3 and B=0 with probability 1/3.
And we can design the node B, physically, to be one or another of the two distributions over functions, or anything in between (the general formula is (0.5+x)f2 + x f3 + (0.25-x)f1 + (0.25-x)f0 for 0 ≤ x ≤ 0.25). But it seems that the causal graph does not capture that.
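A quick numerical check of this family, as a sketch rather than anything canonical: the helper names (mixture, p_b_given_a, p_b1_had_a1) are made up for the example, and f0 to f3 are taken to be constant 0, constant 1, copy A and negate A respectively, which is the reading consistent with the two mixes above.

```python
from fractions import Fraction

# f0..f3 as assumed in the lead-in: constant 0, constant 1, copy A, negate A.
FUNCS = [lambda a: 0, lambda a: 1, lambda a: a, lambda a: 1 - a]

def mixture(x):
    """Weights on (f0, f1, f2, f3) for the family above, with 0 <= x <= 1/4."""
    q = Fraction(1, 4) - x
    return [q, q, Fraction(1, 2) + x, x]

def p_b_given_a(weights, a, b):
    """Observational P(B = b | A = a) under the given mix of functions."""
    return sum(w for w, f in zip(weights, FUNCS) if f(a) == b)

def p_b1_had_a1(weights):
    """P(B = 1 had A been 1), given that A = 0 and B = 0 were observed."""
    # Keep only the functions that agree with the observation f(0) = 0.
    keep = [(w, f) for w, f in zip(weights, FUNCS) if f(0) == 0]
    total = sum(w for w, _ in keep)
    return sum(w for w, f in keep if f(1) == 1) / total

for x in (Fraction(0), Fraction(1, 8), Fraction(1, 4)):
    w = mixture(x)
    # The first two printed probabilities do not change with x; the last does.
    print(x, p_b_given_a(w, 0, 0), p_b_given_a(w, 1, 1), p_b1_had_a1(w))
```

Every x gives P(B=A|A) = 0.75, so the mixes are indistinguishable from P(B|A) alone, while the counterfactual probability works out to (0.5+x)/0.75, ranging from 2/3 at x=0 to 1 at x=0.25.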
Owain Evans has said that Pearl has papers covering these kinds of situations, but I haven't been able to find them. Does anyone know any publications on the subject?