[ Upvoted. ]
If anyone felt I was uncivil to them in any subthread, I apologize.
I am not sure causality is a subfield of ML in the sense that I don't think many ML people care about causality. I think causal inference is a subfield of stats (lots of talks with the word "causal" at this year's JSM). I think it's weird that stats and ML are different fields, but that's a separate discussion.
I think it is possible to formalize causality without talking about interventions as Pearl et al. think of them; for example, people in reinforcement learning do this. But if you start to worry about e.g. time-varying confounders, and you are not using interventions, you will either get stuff wrong or have to reinvent interventions. Which would be silly -- so just learn about the Neyman/Rubin model and graphs. It's the formalism that handles all the "gotchas" correctly. (In fact, until interventionists came along, people didn't even have the math to realize that time-varying confounders are a "gotcha" that needs special handling!)
By the way, the only reason I am harping on time-varying confounders is because it is a historically important case that I can explain with a 4 node example. There are lots of other, more complicated "gotchas," of course.
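For readers who want to see the time-varying confounder "gotcha" in numbers, here is a minimal sketch. It uses one common four-node setup (A0 -> L -> A1 -> Y, plus A0 -> Y and L -> Y), which is an assumption on my part and not necessarily the exact example meant above. All probabilities and function names are made up for illustration. L is both a consequence of the first treatment and a confounder of the second, so neither ignoring L nor adjusting for it as an ordinary covariate recovers the joint effect of (A0, A1), while the g-formula does.

```python
from itertools import product

# Hypothetical model; every number below is invented for illustration.
# Graph: A0 -> L -> A1 -> Y, plus A0 -> Y and L -> Y (all variables binary).
P_A0_IS_1 = 0.5
P_L_IS_1_GIVEN_A0 = {0: 0.2, 1: 0.7}
P_A1_IS_1_GIVEN_L = {0: 0.3, 1: 0.8}
NAMES = ("a0", "l", "a1", "y")


def p_y_is_1(a0, l, a1):
    """p(Y=1 | a0, l, a1) in the assumed structural model."""
    return 0.1 + 0.2 * a0 + 0.4 * l + 0.2 * a1


def bern(p_one, value):
    """Probability of `value` for a binary variable with p(1) = p_one."""
    return p_one if value == 1 else 1.0 - p_one


def observational_joint():
    """Exact joint p(a0, l, a1, y) with no intervention."""
    joint = {}
    for a0, l, a1, y in product((0, 1), repeat=4):
        joint[(a0, l, a1, y)] = (
            bern(P_A0_IS_1, a0)
            * bern(P_L_IS_1_GIVEN_A0[a0], l)
            * bern(P_A1_IS_1_GIVEN_L[l], a1)
            * bern(p_y_is_1(a0, l, a1), y)
        )
    return joint


def prob(joint, **fixed):
    """Marginal probability of the named variables taking the given values."""
    return sum(
        p for key, p in joint.items()
        if all(key[NAMES.index(name)] == value for name, value in fixed.items())
    )


def mean_y(joint, **given):
    """E[Y | given], computed from the observational joint."""
    return prob(joint, y=1, **given) / prob(joint, **given)


def truth(a0, a1):
    """E[Y | do(a0, a1)] read off the model: L still responds to a0."""
    return sum(bern(P_L_IS_1_GIVEN_A0[a0], l) * p_y_is_1(a0, l, a1) for l in (0, 1))


def g_formula(joint, a0, a1):
    """sum_l E[Y | a0, l, a1] p(l | a0), using observational quantities only."""
    return sum(
        mean_y(joint, a0=a0, l=l, a1=a1) * prob(joint, a0=a0, l=l) / prob(joint, a0=a0)
        for l in (0, 1)
    )


def condition_only(joint, a0, a1):
    """Ignore L entirely: E[Y | a0, a1]. A1 stays confounded by L."""
    return mean_y(joint, a0=a0, a1=a1)


def adjust_for_l(joint, a0, a1):
    """Treat L as a baseline covariate: sum_l E[Y | a0, l, a1] p(l).
    This blocks the part of A0's effect that goes through L."""
    return sum(mean_y(joint, a0=a0, l=l, a1=a1) * prob(joint, l=l) for l in (0, 1))


def contrast(estimator, *args):
    """Effect of always treating (1, 1) versus never treating (0, 0)."""
    return estimator(*args, 1, 1) - estimator(*args, 0, 0)


joint = observational_joint()
print("truth         ", contrast(truth))
print("g-formula     ", contrast(g_formula, joint))
print("condition only", contrast(condition_only, joint))
print("adjust for L  ", contrast(adjust_for_l, joint))
```

In this toy model the g-formula matches the interventional answer exactly, and both "standard" analyses miss it in opposite directions. The version with an unobserved common cause of L and Y is nastier still, but this one already shows why L needs special handling.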
Interventions seem to pop up/get reinvented in seemingly weird places, much like the constant pi does:
In channels with feedback (thus causality arises!)
http://www.adaptiveagents.org/bayesian_control_rule
http://en.wikipedia.org/wiki/Thompson_sampling
In multi-armed bandit problems (which are related to longitudinal studies in causal inference).
http://en.wikipedia.org/wiki/Kaplan%E2%80%93Meier_estimator
In handling missing data (can view "missingness" as a causal property). Note the phrasing in the second link: "given the observed data, the missingness mechanism does not depend on the unobserved data." This is precisely the "no unobserved confounders" assumption in causal inference. Not surprisingly the correction is the same as in causal inference.
Also in figuring out what the dimension of a statistical hidden variable DAG model is. For example, if A, B, C, D are binary, and U, W are unrestricted, then the dimension of the model
{ p(a,b,c,d) = \sum_{u,w} p(a,b,c,d,u,w) | p(a,b,c,d,u,w) factorizes wrt A -> B -> C -> D, A <- U -> C, B <- W -> D } is 13, not the 15 (= 2^4 - 1) you would get for an unconstrained distribution over four binary variables, which is weird, but there is an intervention-inspired explanation for why.
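To make the missing-data point above concrete, here is a small sketch under assumptions of my own: a binary covariate X that is always observed, a binary outcome Y that is sometimes missing, and an indicator R (1 means Y was observed) whose distribution depends only on X. All numbers and names are hypothetical. The complete-case mean of Y is biased, while weighting each observed case by 1 / p(R=1 | x) recovers E[Y], which is the same weighting correction used for "no unobserved confounders" in causal inference.

```python
from itertools import product

# Hypothetical numbers, chosen only so the bias is easy to see.
P_X_IS_1 = 0.4
P_Y_IS_1_GIVEN_X = {0: 0.2, 1: 0.9}
# Missing at random: whether Y is observed depends on X only, not on Y itself.
P_OBSERVED_GIVEN_X = {0: 0.9, 1: 0.3}


def bern(p_one, value):
    """Probability of `value` for a binary variable with p(1) = p_one."""
    return p_one if value == 1 else 1.0 - p_one


def full_joint():
    """Exact joint p(x, y, r); in real data y is hidden whenever r == 0."""
    return {
        (x, y, r): bern(P_X_IS_1, x)
        * bern(P_Y_IS_1_GIVEN_X[x], y)
        * bern(P_OBSERVED_GIVEN_X[x], r)
        for x, y, r in product((0, 1), repeat=3)
    }


def true_mean(joint):
    """E[Y], which needs the values of Y we would not actually get to see."""
    return sum(p * y for (_, y, _), p in joint.items())


def complete_case_mean(joint):
    """E[Y | R=1]: just drop the rows with missing Y."""
    observed = sum(p for (_, _, r), p in joint.items() if r == 1)
    return sum(p * y for (_, y, r), p in joint.items() if r == 1) / observed


def ipw_mean(joint):
    """E[R * Y / p(R=1 | X)], using only quantities available when R=1
    together with the always-observed X."""
    total = 0.0
    for x in (0, 1):
        p_x = sum(p for (xx, _, _), p in joint.items() if xx == x)
        p_x_obs = sum(p for (xx, _, r), p in joint.items() if xx == x and r == 1)
        propensity = p_x_obs / p_x  # p(R=1 | x)
        total += sum(
            p * y for (xx, y, r), p in joint.items() if xx == x and r == 1
        ) / propensity
    return total


joint_xyr = full_joint()
print("true mean     ", true_mean(joint_xyr))
print("complete cases", complete_case_mean(joint_xyr))
print("IPW           ", ipw_mean(joint_xyr))
```

Replace "R" with "treatment received" and the same code is a weighting estimator for a causal effect under no unobserved confounding.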
"you can imagine learning about causality as a feature of the environment"
I don't think you can get something for nothing. You will need causal assumptions somewhere.
Thanks Ilya, that was a lot of useful context and I wasn't aware that causality was more in stats than ML. For the record, I think that causality is super-interesting and cool, I hope that I didn't sound too negative by calling it "niche" (I would have described e.g. Bayesian nonparametrics, which I used to do research in, the same way, although perhaps it's unfair to lump in causality with nonparametric Bayes, since the former has a much more distinguished history).
I agree with pretty much everything you say above, although I'm still confused ab...
Yann LeCun, now of Facebook, was interviewed by The Register. It is interesting that his view of AI is apparently that of a prediction tool:
"In some ways you could say intelligence is all about prediction," he explained. "What you can identify in intelligence is it can predict what is going to happen in the world with more accuracy and more time horizon than others."
rather than that of a world optimizer. This is not very surprising, given his background in handwriting and image recognition. This "AI as intelligence augmentation" view appears to be prevalent among AI researchers in general.