When I first read the Sequences, one of the exciting posts was Causal Diagrams and Causal Models, which got me into the idea that one could discover the structure of causal networks using statistics. Another rationalist source which gave me similar hopes was Scott Alexander's SSC Journal Club: Mental Disorders As Networks.
However, when I actually started applying these techniques to my own data, or to publicly available datasets, I often found that they were unstable, and that one could easily identify plausible conditions under which they would give the wrong results. It's possible I had the wrong approach or something, but in my confusion I started reading up on what experts in causal inference had said. I got the impression that they had studied the problem for a while, initially finding some algorithms, but over time concluded that those algorithms didn't work very well and that it is better to just have a human in the loop who specifies the causal networks.
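To give a toy illustration of the kind of instability I mean (my own made-up setup, not one of the datasets I tried): a constraint-based discovery algorithm in the PC family decides which edges exist by running independence tests, and with weak edges and moderate sample sizes those tests frequently miss real dependencies, so the recovered skeleton changes from run to run. Here is a minimal sketch assuming a linear-Gaussian chain X → Y → Z and a Fisher-z correlation test:

```python
import numpy as np

def fisher_z_independent(a, b, crit=1.96):
    """Fisher-z test of correlation: True = fail to reject independence at ~5%."""
    n = len(a)
    r = np.corrcoef(a, b)[0, 1]
    z = np.arctanh(r) * np.sqrt(n - 3)
    return abs(z) < crit

def chain_edge_missed(rng, n):
    """Simulate the chain X -> Y -> Z with weak edges, then ask whether a
    constraint-based search would (wrongly) judge X and Z independent."""
    x = rng.normal(size=n)
    y = 0.3 * x + rng.normal(size=n)
    z = 0.3 * y + rng.normal(size=n)
    # Ground truth: X and Z are dependent (through Y), but only weakly.
    return fisher_z_independent(x, z)

rng = np.random.default_rng(0)
missed = np.mean([chain_edge_missed(rng, n=100) for _ in range(200)])
print(f"fraction of runs wrongly accepting X independent of Z (n=100): {missed:.2f}")
print(f"still missed at n=100000: {chain_edge_missed(rng, n=100_000)}")
```

At n=100 the implied X–Z correlation (about 0.09) is usually invisible to the test, so most runs would drop the X–Z dependence and recover the wrong skeleton; only at very large n does the test reliably detect it. The edge weights and sample sizes here are mine, chosen to make the failure mode visible.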
So I mostly abandoned it, or came to see it as a much more limited tool than I had before. But recently, John Wentworth argued that it was actually quite feasible in practice, so maybe I was too quick to abandon it. I would like to know: what are the best examples of this working well in practice? Or alternatively, has anyone else come to the same conclusions I did?
I'm doing causal inference in academia. I don't work on causal discovery, and neither do any of my colleagues, but I can share my impression of it from occasional seminars on the topic.
The few causal discovery seminars I have seen belong to these categories:
Consider this a skewed but actual slice of the field.
My own thoughts on the subject matter:
In practice you don't have enough information to completely reconstruct the causal relationships, nor, even in cases where you have enough constraints to converge in principle to a single graph with infinite i.i.d. data, to do it with low enough uncertainty that you can pretend you know the graph. So an ideal method would give you a list of graphs with a posterior probability for each, and then you would carry out the inference conditional on each graph. This is what Bayes tells you to do.
However, a graph with fewer arcs naturally corresponds to a lower-dimensional parameter space than one with more arcs when you try to specify a model, and under a continuous prior a lower-dimensional subspace has measure zero. This would suggest that any graph with missing arcs has probability zero. You can try to repair this with delta distributions (i.e., probability mass given to a single point in a continuous space), but does that make sense? As Andrew Gelman sometimes says, everything has a causal effect on everything; it's just that the effect can be very small. So maybe a model with shrinkage (i.e., keeping all connections in the graph, but defining a notion of a "small" connection in the model and using a prior distribution that prefers simpler graphs) would make more sense.
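To make the shrinkage alternative concrete, here is a toy linear-Gaussian sketch of my own (the edge strengths and prior scale are arbitrary choices): the candidate parents of a node all keep their arcs, but each edge coefficient gets an independent N(0, τ²) prior, whose conjugate posterior mean pulls every coefficient toward zero without setting any of them exactly to zero:

```python
import numpy as np

rng = np.random.default_rng(2)
n = 200
# Three candidate parents of a node Z; ground truth in Gelman's spirit:
# one substantial edge, two tiny-but-nonzero ones.
X = rng.normal(size=(n, 3))
beta_true = np.array([1.0, 0.05, 0.02])
z = X @ beta_true + rng.normal(size=n)

def shrunk_edge_strengths(X, z, tau2, sigma2=1.0):
    """Posterior mean of edge coefficients under independent N(0, tau2)
    priors in a conjugate normal linear model with known noise sigma2."""
    d = X.shape[1]
    A = X.T @ X / sigma2 + np.eye(d) / tau2
    return np.linalg.solve(A, X.T @ z / sigma2)

ols = np.linalg.lstsq(X, z, rcond=None)[0]        # no shrinkage
shrunk = shrunk_edge_strengths(X, z, tau2=0.01)   # strong shrinkage prior
print("OLS   :", np.round(ols, 3))
print("Shrunk:", np.round(shrunk, 3))
```

The strong edge survives clearly nonzero while the tiny edges are pulled toward (but never exactly onto) zero, which is the continuous analogue of "preferring simpler graphs" without the measure-zero problem of deleting arcs. A Gaussian prior is the crudest choice here; spike-and-slab or horseshoe-type priors would express the "mostly tiny effects" belief more faithfully.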
I've not had these doubts answered in seminars nor by asking.
Finally, @IlyaShpitser may know something more.
This matches my impressions relatively well.