Thanks Marius and David, really interesting post, and super glad to see interest in causality picking up!
I very much share your "hunch that causality might play a role in transformative AI and feel like it is currently underrepresented in the AI safety landscape."
Most relevant, I've been working with Mary Phuong on a project which seems quite related to what you are describing here. I don't want to share too many details publicly without checking with Mary first, but if you're interested perhaps we could set up a call sometime?
I also think causality is relevant to AGI safety in several ways beyond those you mention here. In particular, we've been exploring how to use causality to describe agent incentives for things like corrigibility and tampering (summarized in this post), to formalize ethical concepts like intent, and to understand agency.
So really curious to see where your work is going and potentially interested in collaborating!
I like the direction you're going with this. I agree that causal reasoning is necessary, but not sufficient, for getting alignable TAI. I think just getting step 2 done (summarizing the literature on causality from an AI safety perspective) could have a huge impact, creating a very helpful resource for AI alignment researchers to pull insights from.
Our worry is that ML researchers, once they figure out how, will introduce a similar “overidentifying causality” inductive bias into models. This would mean that very powerful models with potentially big impacts have the causal model of a political pundit rather than a scientist.
One possible workaround for this would be to take a Bayesian approach. Bayes' rule is all about comparing the predictions (likelihoods) of different models (hypotheses) to assign higher probability mass to those models with greater predictive power.
Consider a system that uses an ensemble of differently structured causal models, each containing slots for different factors (e.g., {A, B} (no causal relationship), {A -> B}, {A <- B}, {A <- C -> B}, ...). Then for any given phenomenon, the system could feed in all relevant factors into the slots of the causal graph of each model, then use each model to make predictions, both about passive observations and about the results of interventions. Those causal (or acausal) models with the greatest predictive power would win out after the accumulation of enough evidence.
Of course, the question still remains about how the system chooses which factors are relevant, or about how it decides what kind of state transformations each causal arrow induces. But I think the general idea of multiple hypothesis testing should be sufficient to get any causally reasoning AI to think more like a scientist than a pundit.
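A toy sketch of this multiple-hypothesis idea, where Bayes' rule compares two candidate structural models of the same data. The data-generating process, the parameters, and both hypotheses are invented for illustration; a real system would also need interventional predictions to separate A -> B from B -> A, since they are observationally equivalent here.

```python
import random
random.seed(0)

# True (hidden) process: A is a fair coin, and A causes B with 90% fidelity.
def sample():
    a = random.random() < 0.5
    b = a if random.random() < 0.9 else not a
    return a, b

data = [sample() for _ in range(200)]

# Per-observation likelihoods under each hypothesis (parameters assumed known).
def lik_independent(a, b):
    return 0.5 * 0.5            # model {A, B}: two independent fair coins

def lik_a_causes_b(a, b):
    return 0.5 * (0.9 if b == a else 0.1)   # model {A -> B}: P(A) * P(B | A)

# Bayes' rule: uniform prior, multiply in likelihoods, renormalize.
post = {"independent": 1.0, "a_causes_b": 1.0}
for a, b in data:
    post["independent"] *= lik_independent(a, b)
    post["a_causes_b"] *= lik_a_causes_b(a, b)
total = sum(post.values())
post = {k: v / total for k, v in post.items()}

print(post)  # virtually all probability mass lands on the causal model
```

After 200 observations the causal hypothesis dominates overwhelmingly, which is the "win out after the accumulation of enough evidence" dynamic described above.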
There is a more technical sense in which no one understands causality, not even Judea Pearl
I feel like this is too strong, at least in the way I read it, though maybe I misunderstand. The questions you raise do not seem too hard to address:
(where does causal information ultimately come from if you have to make causal assumptions to get it?
Evolution makes the causal assumption that organic replicators are viable (an assumption that fails in many places, e.g. when environments do not provide water, negentropy, or a broad mix of chemicals, and that depends on a variety of causal properties, like stability), as well as various other causal assumptions. Further, we have "epistemic luck" in being humans, probably the animal species most adapted for generality, which makes us need advanced rather than merely simplistic causal understanding.
For that matter, how do we get variables out of undifferentiated sense data?)
There are numerous techniques for this, based on e.g. symmetries, conserved properties, covariances, etc. These techniques can generally be given causal justification.
There are numerous techniques for this, based on e.g. symmetries, conserved properties, covariances, etc. These techniques can generally be given causal justification.
I'd be curious to hear more about this, if you have some pointers
Sure!
I wrote "etc.", but really the main ones I can think of are probably the ones I listed there. Let's start with correlations since this is the really old-school one.
The basic causal principle behind a correlation/covariance-based method is that if you see a correlation where the same thing appears in different places, then that correlation is due to there being a shared cause. This is in particular useful for representation learning, because the shared cause is likely not just an artifact of your perception ("this pixel is darker or lighter") but instead a feature of the world itself ("the scene depicted in this image has properties XYZ"). This then leads to the insight of Factor Analysis[1]: it's easy to set up a linear generative model with a fixed number of independent latent variables to model your data.
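A toy illustration of the "correlations imply a shared latent cause" idea: several noisy sensors all measure the same hidden world-variable, and a linear latent-variable fit recovers it from the correlations alone. All numbers are made up, and the top principal component stands in for the one-factor model (per the footnote, Factor Analysis and PCA are computationally near-equivalent here).

```python
import numpy as np
rng = np.random.default_rng(0)

# One latent "world" variable drives several noisy observed measurements;
# the correlations between observations exist only because of this shared cause.
n = 1000
latent = rng.normal(size=n)
loadings = np.array([0.9, 0.8, 0.7, 0.6])   # how strongly each sensor reflects the latent
obs = latent[:, None] * loadings + 0.3 * rng.normal(size=(n, 4))

# Fit a one-factor linear model via the top principal component.
obs_c = obs - obs.mean(axis=0)
_, _, vt = np.linalg.svd(obs_c, full_matrices=False)
recovered = obs_c @ vt[0]

# The recovered factor tracks the true latent cause, not any single pixel/sensor.
corr = abs(np.corrcoef(recovered, latent)[0, 1])
print(round(corr, 3))
```

The recovered factor correlates far more strongly with the hidden cause than any individual noisy measurement does, which is why such latent variables tend to be "features of the world itself."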
Factor Analysis still gets used a lot in fields like psychology, but for machine perception it falls short because perception requires nonlinearity. (Eigenfaces are a relic of the past.) However, the core concept, that correlations imply latent variables and that these latent variables are likely more meaningful features of reality, continues to be relevant in many models:
Covariance-based techniques aren't the only game in town, though; a major alternative is symmetries. Often, we know that some symmetries hold; for instance, reality is symmetric under translations, but an image has to pick some particular translation and so captures only a specific part of reality. However, small perturbations in the translation of an image do not matter much for its contents, as the contents tend to be extended geometrically over a large part of the image. Therefore, if you do random cropping of your images, you still get basically the same images. Similarly, various other data augmentation methods can be viewed as being about symmetries. In particular, there are a number of symmetry-oriented feature learners:
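A minimal sketch of the random-cropping point. The image and crop sizes are arbitrary; in a symmetry-based (e.g. contrastive) feature learner, the two crops of the same image would be fed in as a positive pair that the representation must map close together.

```python
import numpy as np
rng = np.random.default_rng(0)

def random_crop(image, crop_h, crop_w):
    """Exploit translation symmetry: a slightly shifted crop depicts
    essentially the same scene, so it should get a similar representation."""
    h, w = image.shape[:2]
    top = rng.integers(0, h - crop_h + 1)
    left = rng.integers(0, w - crop_w + 1)
    return image[top:top + crop_h, left:left + crop_w]

# Two random crops of the same image form a positive pair for training.
img = rng.normal(size=(32, 32))   # stand-in for a real image
view_a = random_crop(img, 28, 28)
view_b = random_crop(img, 28, 28)
print(view_a.shape, view_b.shape)
```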
Finally, there is the notion of conserved properties. John Wentworth has persuasively argued that variables that act at a distance must have deterministically conserved mediators. I am only aware of one architecture that uses this insight, though, namely Noether Networks. Basically, Noether Networks are trained to identify features where your predictions can be improved by assuming the features to stay constant. They do not directly use these features for much, but they seem potentially promising for the future.
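A toy version of the "prefer features that stay constant" idea behind Noether Networks, with hand-written candidate features rather than learned ones. The harmonic-oscillator trajectory and the candidate set are invented for illustration; the real architecture searches for such features with gradient descent.

```python
import numpy as np

# A unit-circle trajectory: position x(t) = cos(t), velocity v(t) = -sin(t).
t = np.linspace(0, 10, 200)
states = np.stack([np.cos(t), -np.sin(t)], axis=1)

# Candidate features of the state; only one is conserved along the dynamics.
candidates = {
    "position": lambda s: s[:, 0],
    "velocity": lambda s: s[:, 1],
    "energy":   lambda s: s[:, 0] ** 2 + s[:, 1] ** 2,  # x^2 + v^2 = 1, conserved
}

# Score each candidate by how constant it stays over time; pick the best.
variances = {name: float(np.var(f(states))) for name, f in candidates.items()}
conserved = min(variances, key=variances.get)
print(conserved)  # "energy"
```

The conserved quantity (here, energy) has near-zero variance along the trajectory, which is exactly the kind of feature whose constancy improves prediction.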
This set is probably not exhaustive. There are probably methods hyper-specialized to specific domains, and there are probably also other general methods. But I think it's a good general taste of what's available.
Factor Analysis might not be a familiar algorithm to you, but it is computationally basically equivalent to Principal Component Analysis, which you are almost certainly familiar with. There are only a few caveats to this, and they don't really matter for the purposes of this post.
You state in the first comment that they can be given causal justification. As far as I understand you argue with covariances above. Can you elaborate on what this causal justification is?
In a causal universe, if you observe things in different places that correlate with each other, they must have a common cause. That's the principle VAEs/triplet losses/etc. can be understood as exploiting.
Right, but Reichenbach's principle of common cause doesn't tell you anything about how they are causally related. They could just be some nodes in a really large, complicated causal graph. So I agree that we can assume causality somehow, but we are much more interested in what the graph looks like, right?
So I agree that we can assume causality somehow, but we are much more interested in what the graph looks like, right?
Not necessarily? Reality is really really big. It would be computationally infeasible to work with raw reality. Rather, you want abstractions that cover aggregate causality in a computationally practical way, throwing away most of the causal details. See also this:
TL;DR: transformative AI (TAI) plausibly requires causal models of the world. Thus, a component of AI safety is ensuring secure paths to generating these causal models. We think the lens of causal models might be undervalued within the current alignment research landscape and suggest possible research directions.
This post was written by Marius Hobbhahn and David Seiler. MH would like to thank Richard Ngo for encouragement and feedback.
If you think these are interesting questions and want to work on them, write to us. We will probably start to play around with GPT-3 soonish; if you want to join the project, just reach out. There is certainly stuff we missed. Feel free to send us references if you think they are relevant.
There are already a small number of people working on causality within the EA community. They include Victor Veitch, Zhijing Jin and PabloAMC. Check them out for further insights. There are also other alignment researchers working on causal influence diagrams (authors: Tom Everitt, Ryan Carey, Lewis Hammond, James Fox, Eric Langlois, and Shane Legg) whose work is very much related.
Causality - a working definition:
Just to get this out of the way: we follow a broad definition of causality, i.e. we assume it can be learned from (some) data and doesn’t have to be put into the model by humans. Furthermore, we don’t think the representation has to be explicit, e.g. in a probabilistic model, but could be represented in other ways, e.g. in the weights of neural networks.
But what is it? In a loose sense, you already know: things make other things happen. When you touch a light switch and a light comes on, that’s causality. There is a more technical sense in which no one understands causality, not even Judea Pearl (where does causal information ultimately come from if you have to make causal assumptions to get it? For that matter, how do we get variables out of undifferentiated sense data?). But it's possible to get useful results without understanding causality precisely, and for our purposes, it's enough to approach the question at the level of causal models.
Concretely: you can draw circles around phenomena in the world (like "a switch" and "a lightbulb") to make them into nodes in a graph, and draw arrows between those nodes to represent their causal relationships (from the switch to the lightbulb if you think the switch causes the lightbulb to turn on, or from the lightbulb to the switch if you think it's the other way around).
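A minimal sketch of the switch -> lightbulb graph as code, where intervening (Pearl's do-operator) means overriding a variable rather than merely observing it. The blackout mechanism is an illustrative extra, anticipating the distribution-shift example later in the post.

```python
# Each node in the causal graph gets a mechanism computing it from its parents.
def bulb_mechanism(switch, blackout):
    # The light is on iff the switch is on and power is available.
    return switch and not blackout

def simulate(do_switch=None, blackout=False):
    """do_switch=True/False replaces the switch's own mechanism with a
    constant -- that replacement is what makes it an intervention."""
    switch = do_switch if do_switch is not None else False
    return {"switch": switch, "light": bulb_mechanism(switch, blackout)}

print(simulate(do_switch=True))                 # flipping the switch turns the light on
print(simulate(do_switch=True, blackout=True))  # ...unless an upstream cause blocks it
```

Note that reversing the arrow (light -> switch) would predict different intervention results, which is exactly what makes the direction of the arrow an empirical question.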
There’s an old Sequences post that covers the background in more detail. The key points for practical purposes are that causal models:
Why does causality matter?
Causal information has two main advantages over correlational information. For the following section, I got help from a fellow Ph.D. student.
1. Data efficiency
Markov factorization: mathematically speaking, the Markov factorization lets us write a joint probability distribution as a product of conditionals in which each variable depends only on its parents in the graph. In practice, this means that if we assume causality, we can represent the joint distribution with a sparse graph in which only some nodes are connected. It introduces sparsity.
“Namely, if we have a joint with n binary random variables, it would have 2^n - 1 independent parameters (the last one is determined to make the sum equal to 1). If we have k factors with n/k variables each, then we would have k(2^(n/k) - 1) independent parameters. For n=20 and k=4, the numbers are 1048575 vs. 124.” - Patrik Reizinger
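The quoted parameter counts can be checked directly:

```python
# Independent parameters of a full joint over n binary variables,
# versus k independent factors of n/k binary variables each.
def full_joint_params(n):
    return 2 ** n - 1

def factored_params(n, k):
    return k * (2 ** (n // k) - 1)

print(full_joint_params(20))    # 1048575
print(factored_params(20, 4))   # 124
```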
Independent mechanisms: the independent mechanisms principle says that the causal mechanisms generating our variables do not influence each other. Therefore, if we observe shifts in our data distribution, we only need to retrain a few parts of the model. If we observe global warming, for example, the vast majority of physics stays the same; we only need to recalibrate the parts of our model that relate to temperature and climate. Another example is the lightbulb blackout scenario from above: if you know there is a blackout, you don't need to flip the switch to know that the light won't turn on.
The conclusion of these two statements is that correlational models assume many more relations between variables than causal models do, and the entire correlational model needs to be retrained every time the data changes. In causal models, however, we usually only need to retrain a small number of mechanisms. Therefore, causal models are much more sample-efficient than correlational ones.
2. Action guiding
Causal models introduce a very strong assumption: variables are not just related, they are related in a directed way. Thus, causal models imply testable hypotheses. If our causal model is that taking a specific drug reduces the severity of a disease, then we can test this with an RCT. So our model, drug -> disease, is a falsifiable hypothesis.
The same thing is not possible for correlational models. If we say that drug intake correlates with disease severity, we are saying that either the drug helps with the disease, or people with less severe disease take more of the drug, or both depend on a third variable. As soon as we intervene by fixing one variable and observing the other, we have already made a causal assumption.
Correlational knowledge can still be used for action: you can take the drug and hope the causal arrow points in the right direction. But it could also have a different effect than desired, since you don't know which variable is the cause and which one is the effect.
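A toy simulation of the RCT point: randomizing who gets the drug breaks any dependence on confounders, so the difference in group means estimates the causal effect. The ground-truth outcome model, including the frailty confounder and the effect size of -1.0, is invented for illustration.

```python
import random
random.seed(1)

# Hypothetical ground truth: the drug lowers severity by 1.0; frailty raises it.
def severity(drug, frailty):
    return 2.0 * frailty - 1.0 * drug + random.gauss(0, 0.1)

treated, control = [], []
for _ in range(500):
    frailty = random.random()            # unobserved confounder in the wild
    drug = random.random() < 0.5         # RCT: assignment is randomized
    (treated if drug else control).append(severity(drug, frailty))

# Difference in means estimates the causal effect of drug -> disease.
effect = sum(treated) / len(treated) - sum(control) / len(control)
print(round(effect, 2))  # close to the true effect of -1.0
```

If the drug in this simulation had no effect, the estimate would sit near zero, so the causal hypothesis drug -> disease is falsifiable by exactly this procedure.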
Causal models greatly improve the ability of models to make decisions and interact with their environment. Therefore we think it is highly plausible that transformative AI will have some causal model of the world. Due to the rise of data-driven learning, we expect this model to be learned from data, but we could also imagine some human interference or inductive biases.
Overall, we think that the thesis that causality matters for TAI is not very controversial but we think there are a lot of implications for AI safety that are not yet fully explored.
Questions & Implications for AI safety:
If the causal models in ML algorithms have a large effect on their actions/predictions, we should really understand how they work. Some considerations include:
If our ML model has learned a slightly wrong causal model of the world, it will make incorrect predictions on data points outside of the training distribution. Therefore it seems relevant to understand which kind of model the algorithm is acting on. This is a subcategory of alignment and interpretability.
If we could say, for example, with higher certainty whether LLMs create internal causal (vs. correlational) models of the world, they might be easier to control and we could get higher certainty about their predictions.
We are scared of ML algorithms increasingly interacting with the real world, because if their interventions go wrong they can do a lot of harm. GPT-3 recently got hooked up to Google, and we expect someone to be mad enough to give it even more access to interventions on the internet. If there were a non-interventional way to get similar results, we would certainly prefer that.
Having a better understanding of this difference in training efficiency might give us more insight into the quality of the world model of current algorithms.
Our worry is that ML researchers, once they figure out how, will introduce a similar “overidentifying causality” inductive bias into models. This would mean that very powerful models with potentially big impacts have the causal model of a political pundit rather than a scientist.
Furthermore, since language models are trained on text that is generated by humans, they might just learn this bias on their own. Then, GPT-n would be as useless as the average political analysis.
What now?
We ask a lot of questions but don’t have many answers. Thus, we think the highest priority is to get a clearer picture, e.g. refine the questions, translate them into testable hypotheses and read more work from other scientists working on causality.
We think that reasonable first steps could be:
If you think these are interesting questions and want to work on them, reach out. We will probably start to play around with GPT-3 soon. There is certainly research we missed. Feel free to send us references if you think they are relevant.
Causality is not everything
We don’t want this to be another piece along the lines of “AI truly needs X to be intelligent” where X might be something vague like understanding/creativity/etc. We have the hunch that causality might play a role in transformative AI and feel like it is currently underrepresented in the AI safety landscape. Not more, not less.
Furthermore, we don’t need a causal model of everything. Correlations are often sufficient. For example, if you hear an alarm, you don’t need to know exactly what caused the alarm to be cautious. But knowing whether the alarm was caused by fire or by an earthquake will determine what the optimal course of action is.
So we don’t think humans need to have a causal model of everything and neither do AIs but at least for safety-relevant applications, we should look into it deeper.
Conclusion
Causality might be one interesting angle for AI safety but certainly not the only one. However, there are a ton of people in classic ML who think that causality is the missing piece to AGI. They could be completely wrong but we think it’s at least worth exploring from an AI safety lens.
In this post, we outlined why causality might be relevant for TAI, which kind of questions might be relevant and how we could start answering them.
Appendix:
Is there a clear distinction between causality and correlation?
Some people will see our definition as naive and oversimplified. Maybe there is no such thing as causality and it's all just different shades of correlation. Maybe all causal models are wrong and humans see something that isn't there. Maybe, maybe, maybe.
Similar to how there is no hard evidence for consciousness, and philosophical zombies that act just as if they were conscious but truly aren't could exist, all causal claims could also be explained by a lot of correlations and luck. But as argued, e.g. by Eliezer, Occam's razor makes the existence of some sort of consciousness much more likely than its absence, and by the same logic it makes causality more likely than its absence.