Your question may be aimed only at people who have studied the relevant part of the book, but to me it is very unclear what you mean by "recover" and "express" in "recover causal models given only the training data" or "minimally expressive among those that can express the observational distribution".
Chapter 2 of Pearl's Causality book claims you can recover causal models given only the observational data, under very natural assumptions of minimality and stability[1].
In graphical-models lingo, Pearl identifies a causal model of the observational distribution with the distribution's perfect map (if one exists).
But I'm confused about a pretty fundamental point: "What does this have to do at all with causality??" More precisely:
To be clear, Pearl acknowledges this in his Temporal Bias Conjecture (2.8.2):
And Pearl conjectures that this may be because human language is optimized such that our [choice of variables / factorization of reality] makes the Temporal Bias true.
I ... guess that could be an explanation? But honestly I don't think I understand his point very well and I find it pretty unsatisfying. I would appreciate any explanation as to why it makes sense to identify perfect maps with Causal Models.
Minimality: Choose the network structure that is minimally expressive among those that can express the observational distribution.
Stability: Assume that there exists a network structure that perfectly captures all and only the independencies implied by the observational distribution, i.e. the independencies are structural.
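To make "captures all and only the independencies" concrete, here is a minimal Python sketch for three variables. It is an illustration only, not Pearl's algorithm: the helper names (`ancestors`, `d_separated`, `independencies`) and the four example graphs are hypothetical, and d-separation is checked with the standard moralized-ancestral-graph criterion. Each DAG is written as a dict mapping a node to the set of its parents.

```python
from itertools import combinations


def ancestors(dag, nodes):
    """Return `nodes` together with all of their ancestors in `dag`."""
    seen = set(nodes)
    stack = list(nodes)
    while stack:
        node = stack.pop()
        for parent in dag[node]:
            if parent not in seen:
                seen.add(parent)
                stack.append(parent)
    return seen


def d_separated(dag, x, y, given):
    """Check whether x and y are d-separated by `given` in `dag`."""
    # 1. Restrict to the ancestral set of all variables involved.
    keep = ancestors(dag, {x, y} | set(given))
    # 2. Moralize: link each child to its parents, and "marry" co-parents.
    adj = {node: set() for node in keep}
    for child in keep:
        parents = dag[child]
        for p in parents:
            adj[child].add(p)
            adj[p].add(child)
        for a, b in combinations(parents, 2):
            adj[a].add(b)
            adj[b].add(a)
    # 3. Remove the conditioning set and test whether y is reachable from x.
    blocked = set(given)
    seen, stack = {x}, [x]
    while stack:
        node = stack.pop()
        for nxt in adj[node]:
            if nxt not in seen and nxt not in blocked:
                seen.add(nxt)
                stack.append(nxt)
    return y not in seen


def independencies(dag):
    """All (x, y, conditioning set) triples implied by the graph structure."""
    nodes = sorted(dag)
    found = set()
    for x, y in combinations(nodes, 2):
        rest = [n for n in nodes if n not in (x, y)]
        for k in range(len(rest) + 1):
            for given in combinations(rest, k):
                if d_separated(dag, x, y, given):
                    found.add((x, y, given))
    return found


# Hypothetical example graphs over X, Y, Z (node -> set of parents).
chain = {"X": set(), "Y": {"X"}, "Z": {"Y"}}            # X -> Y -> Z
reversed_chain = {"X": {"Y"}, "Y": {"Z"}, "Z": set()}   # X <- Y <- Z
fork = {"X": {"Y"}, "Y": set(), "Z": {"Y"}}             # X <- Y -> Z
collider = {"X": set(), "Y": {"X", "Z"}, "Z": set()}    # X -> Y <- Z

for name, dag in [("chain", chain), ("reversed chain", reversed_chain),
                  ("fork", fork), ("collider", collider)]:
    print(name, independencies(dag))
```

In this sketch the chain, the reversed chain, and the fork all imply the same single independence (X and Z independent given Y), while the collider implies a different one (X and Z marginally independent). So for a distribution whose only independence is "X and Z given Y", all three of the first graphs capture all and only its independencies, and the independence pattern by itself does not pick out an arrow direction among them.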