Nice find!
GPT-4 really seems to have changed the minds of a lot of researchers: Pearl, Hinton, and I think I saw a few others too, but I can't remember who they were now.
Yeah. It seems worth reviewing what it was about that innovation in particular that caused people to update, so we can have a better general model of when, "after critical event W happens, they still won't believe you", and when it turns out that they will!
I'm wondering if GPT-5 or Gemini would snap people like LeCun out of their complacency. I suspect LeCun has a pretty detailed model of intelligence which implies, among other things, that mesa-optimization is not a problem and that further scaling successes are implausible. Something like Gemini having a good enough world model to do plenty of physical reasoning in a simulation may violate enough of his assumptions that he actually updates.
In the interview, he does not say whether he has tried GPT-3 or GPT-4. After witnessing various intellectuals skimp on the $20 and then generalize whatever GPT-3 did into a grand theory of artificial intelligence, I'm not too confident that Pearl ponied up. I'd say 5:1 that he tried GPT-4 as well.
If something interests us, we can perform trials. Because our knowledge is integrated with our decisionmaking, we can learn causality that way. What ChatGPT does is pick up both knowledge and decisionmaking by imitation, which is why it can also exhibit causal reasoning without itself necessarily acting agentically during training.
Ok. My guess is that Pearl would say something more like this: we have an innate ability to represent causal models, and only after that would he follow with what you said. He thinks that having the causal model representation is necessary: you can't just look at trials and decisions to make causal inferences if you don't have this special causal machinery inside you. (Personally, I disagree that this is a good frame.)
My rejoinder to this is that, analogously to how a causal model can be re-implemented as a more complex non-causal model[2], a learning algorithm can surely learn causality from data that in some way says something about causality: because the data contains information-decision-action-outcome units generated by agents, because the learner can execute actions itself and reflectively process the information of having done so, or because the data contains an abstract description of causality.
Short comment/feedback just to say: This sentence is making one of your main points but is very tricky! - perhaps too long/too many subclauses?
Thanks for sharing! It's nice to see plasticity, especially in stats, which seems to have more opinionated contributors than other applied maths. That said, this 'admission' does not seem to change his framework; rather, he is reinterpreting what ML does so that it becomes compatible with his framework.
Pearl's texts talk about having causal models that use the do(X) operator (e.g. P(Y|do(X))) to signify causal information. Now, in LLMs, he sees the text the model is conditioning on as sometimes being do(X) and sometimes X. I'm curious what else besides text would count as this. I'm not sure I recall this correctly, but in his third level you can use purely observational data to infer causality with things like instrumental variables. If I had an ML model that took purely numerical inputs (tar exposure, smoking status, cancer status, and various other health data), should it be able to predict counterfactual results?
I'm uncertain about what the right answer here is, and how Pearl would view this. My guess is that a naive ML model would be able to do this provided the data covered the counterfactual cases, which is likely for the smoking example. But it would not be as useful for out-of-sample counterfactual inferences, where there is little or no coverage for the interventions and outcomes (e.g. if one of the inputs were 'location' and it had to predict the effects of smoking on the ISS, where no one smokes). However, if we kept adding more purely observational information about the universe, it feels like we might be able to get a causal model out of a transformer-like thing. I'm aware there are some tools that try to extract a DAG from the data as a primitive form of this approach, but that is at odds with the Bayesian stats approach of having a DAG first and then checking whether the DAG holds with the data, or vice versa. Please share if you have some references that would be useful.
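To make the smoking question above concrete, here is a toy sketch (all variable names and numbers are invented by me) of the gap between the observational quantity a naive predictor learns, P(cancer | smoking), and the interventional one, P(cancer | do(smoking)), and of how adjusting for an observed confounder closes that gap:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 200_000

# Hypothetical structural model (numbers made up):
# U = confounder (say, a genotype), X = smoking (more likely given U),
# Y = cancer (caused by both X and U).
U = rng.random(n) < 0.3
X = rng.random(n) < np.where(U, 0.8, 0.2)
Y = rng.random(n) < 0.1 + 0.2 * X + 0.3 * U

# Observational quantity a naive predictor learns: P(Y=1 | X=1)
p_obs = Y[X].mean()

# Interventional quantity P(Y=1 | do(X=1)): rerun the mechanism with X forced to 1
Y_do = rng.random(n) < 0.1 + 0.2 * 1 + 0.3 * U
p_do = Y_do.mean()

# Back-door adjustment: if U is observed, conditioning on it and averaging over
# its marginal recovers the interventional quantity from observational data alone
p_adj = Y[X & U].mean() * U.mean() + Y[X & ~U].mean() * (~U).mean()

print(f"P(Y=1 | X=1)       ≈ {p_obs:.3f}")  # ≈ 0.49, inflated by confounding
print(f"P(Y=1 | do(X=1))   ≈ {p_do:.3f}")   # ≈ 0.39
print(f"back-door adjusted ≈ {p_adj:.3f}")  # ≈ 0.39, matches the do() quantity
```

So, at least in this toy case, whether purely observational data suffices seems to depend on whether the relevant confounders (or mediators, as in the tar example) are among the recorded variables.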
However, if we kept adding more purely observational information about the universe, it feels like we might be able to get a causal model out of a transformer-like thing.
I think it is true.
If you observe everything in enough detail and your hypothesis space is complete, you get counterfactual prediction automatically. Theoretical example: a Solomonoff inductor observes the world; physical laws satisfy causality; the best prediction algorithm takes that into account; the inductor's inference favors that algorithm; and that algorithm can simulate the physical laws, and so produce counterfactuals if needed in the course of its predictions.
If you live in a world where counterfactual thinking is possible and useful to predict the future, then Bayes brings you there.
An interesting look at the question of counterfactuals is the debate between Pearl and Robins on cross-world independence assumptions. It's relevant because Robins resolves the paradox of Pearl's impossible-to-verify assumptions by noting that you can always add a mediator on any arrow of a causal model (I'd add, due to the locality of physical laws), and this makes the assumptions verifiable in principle. In other words, by observing the "full video" of a process, instead of just the frames represented by some random variables, you need fewer out-of-the-hat assumptions to infer counterfactual causal quantities.
I tried to write an explanation, but I realized I still don't understand the matter enough to go through the details, so I'll leave you a reference: the last section, "Mediation", in this Robins interview.
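For what it's worth, the flavor of assumption at issue (used to identify natural direct effects; I may be garbling details, so defer to the reference) is a "cross-world" independence like

$$Y_{x,m} \perp\!\!\!\perp M_{x'} \mid C \quad \text{for } x \neq x',$$

which ties together potential outcomes from two interventions that can never both be performed on the same unit, so no single experiment can check it directly.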
I'm aware there are some tools that try to extract a DAG from the data as a primitive form of this approach, but that is at odds with the Bayesian stats approach of having a DAG first and then checking whether the DAG holds with the data, or vice versa. Please share if you have some references that would be useful.
My superficial impression is that the field of causal discovery does not have its shit together. Not to dunk on them; it's not a law of Nature that what you set out to do will be within your ability. See also "Are there any good, easy-to-understand examples of cases where statistical causal network discovery worked well in practice?"
Judea Pearl is a famous researcher, known for Bayesian networks (the standard way of representing Bayesian models) and for his statistical formalization of causality. Although he has always been recommended reading here, he's less of a staple than, say, Jaynes; hence the need to re-introduce him. My purpose here is to highlight a soothing, unexpected show of rationality on his part.
One year ago I reviewed his latest book, The Book of Why, in a failed[1] submission to the ACX book review contest. There I spent a lot of time on what appears to me a total paradox in a central message of the book, one dear to Pearl: that you can't just use statistics and probabilities to understand causal relationships; you need a causal model, a fundamentally different beast. Yet, at the same time, Pearl shows how to implement a causal model in terms of a standard statistical model.
Before giving me the time to properly raise all my eyebrows, he then sweepingly connects this insight to Everything Everywhere. In particular, he thinks that machine learning is "stuck on rung one": his own idiomatic expression for the claim that machine learning algorithms, which only comb for correlations in the training data, are stuck at statistics-level reasoning, while causal reasoning resides at higher "rungs" on the "ladder of causation", which can't be reached unless you deliberately employ causal techniques.
My rejoinder to this is that, analogously to how a causal model can be re-implemented as a more complex non-causal model[2], a learning algorithm can surely learn causality from data that in some way says something about causality: because the data contains information-decision-action-outcome units generated by agents, because the learner can execute actions itself and reflectively process the information of having done so, or because the data contains an abstract description of causality. A powerful enough learner ought to be able to cross such levels of quoting.
Thus, I was gleefully surprised to read Pearl expressing this same reasoning in the September cover story of AMSTAT News. Surprised, because his writings, and his forever-ongoing debates with other causality researchers, begat an image of a very stubborn old man. VERY stubborn. Even when I judged him to be in the right, I deemed him too damn confident and self-aggrandizing. At this point, I could not expect that, after dedicating a whole book to saying a thing he had been repeating for 20 years, he could just go on the record and say "Oops".
He did.
Granted, a partial oops. He says "but". Still, way beyond what I am used to expecting from 80-year-olds with a sterling hard-nosed track record.
Bits of the interview:
In the next paragraph, he shows the rare skill of not dunking on GPT before proper prompt futzing:
To top it off, some AI safety:
But, since it yielded the highest screening-vote variance, I hereby claim the title of Most Polarizing Review.
It amounts to adding auxiliary random variables, connected by conditional distributions designed to implement the causal relationships.
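A minimal sketch of what I mean (a toy model I made up, in the style of a regime indicator): an auxiliary random variable F selects between "observe X" and "clamp X", and the conditional distribution of X given F encodes the causal relationship; ordinary conditioning in the enlarged, purely statistical model then answers interventional queries.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 600_000

# Auxiliary "regime" variable F, itself random:
# 0 = observe X naturally, 1 = do(X=0), 2 = do(X=1)
F = rng.integers(0, 3, n)
U = rng.random(n) < 0.3                            # background cause of both X and Y

# Conditional distribution of X given (F, U), designed to implement the
# causal relationship: natural mechanism under F == 0, a clamp otherwise.
p_x = np.where(F == 0, np.where(U, 0.8, 0.2),      # natural mechanism
               np.where(F == 2, 1.0, 0.0))          # clamp to 1 or 0
X = rng.random(n) < p_x

Y = rng.random(n) < 0.1 + 0.2 * X + 0.3 * U        # Y's mechanism is left untouched

# Plain conditioning in the enlarged (non-causal) model answers causal queries:
print("P(Y=1 | X=1, observational) ≈", Y[(F == 0) & X].mean())  # confounded
print("P(Y=1 | do(X=1))            ≈", Y[F == 2].mean())        # via F alone
```

This is the sense in which the causal model gets re-implemented as a more complex, but entirely standard, statistical model.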