I dunno if this has been discussed elsewhere (pointers welcome).

Observational data doesn't allow one to distinguish correlation and causation.
This is a problem for an agent attempting to learn values without being allowed to make interventions.

For example, suppose that happiness is just a linear function of how much Utopamine is in a person's brain.
If a person smiles only when their Utopamine concentration is above 3 ppm, then a value-learner that observes both someone's Utopamine level and their facial expression, and tries to predict their reported happiness from these features, will notice that smiling is correlated with higher reported happiness and thus erroneously conclude that smiling is partially responsible for the happiness.
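Here's a toy simulation of the setup (the 3 ppm threshold is from the example above; all other numbers are made up for illustration). Because Utopamine drives both smiling and reported happiness, the two come out strongly correlated in purely observational data even though smiling does no causal work:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 10_000

# Utopamine causes both smiling and reported happiness; smiling has no causal effect.
utopamine = rng.normal(3.0, 1.0, n)                    # ppm; distribution chosen arbitrarily
smiling = (utopamine > 3.0).astype(float)              # smile only above the 3 ppm threshold
happiness = 2.0 * utopamine + rng.normal(0.0, 0.5, n)  # linear in Utopamine only

# In observational data, smiling and reported happiness are strongly correlated
# even though smiling plays no causal role here.
print(np.corrcoef(smiling, happiness)[0, 1])  # clearly positive (roughly 0.77)
```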

------------------
AN IMPLICATION:
I have a picture of value learning where the AI learns via observation (since we don't want to give an unaligned AI access to actuators!).
But this makes it seem important to consider how to make an unaligned AI safe enough to perform interventions relevant to value learning.

Comments (4):

This is only true for simple systems - with more complications you can indeed sometimes deduce causal structure!

Suppose you have three variables: Utopamine concentration, smiling, and reported happiness. And further suppose that there is an independent noise source for each of these variables - causal nodes that we put in as a catch-all for fluctuations and external forcings that are hard to model.

If Utopamine is the root cause of both smiling and reported happiness, then the variation in happiness will be independent of the variation in smiling, conditional on the variation in Utopamine. But conditional on the variation in smiling, the variation in Utopamine and reported happiness will still be correlated!

The AI can now narrow the candidate causal structures down to two, and perhaps it can even figure out the right one if there's some time lag in the response and it assumes that causation goes forward in time.
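A minimal linear-Gaussian sketch of the asymmetry (coefficients, noise scales, and the partial_corr helper are just illustrative, and smiling is treated as a continuous propensity to keep everything linear): regressing Utopamine out of smiling and happiness removes their correlation, while regressing smiling out does not remove the Utopamine/happiness correlation.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 100_000

# Independent noise on every node; Utopamine is the root cause of the other two.
utopamine = rng.normal(3.0, 1.0, n)
smiling   = 0.8 * utopamine + rng.normal(0.0, 0.5, n)    # "smiling propensity", caused by Utopamine
happiness = 2.0 * utopamine + rng.normal(0.0, 0.5, n)    # caused by Utopamine only

def partial_corr(x, y, z):
    """Correlation of x and y after linearly regressing z out of both."""
    design = np.column_stack([z, np.ones_like(z)])
    rx = x - design @ np.linalg.lstsq(design, x, rcond=None)[0]
    ry = y - design @ np.linalg.lstsq(design, y, rcond=None)[0]
    return np.corrcoef(rx, ry)[0, 1]

# Conditional on Utopamine, smiling and happiness are (approximately) independent ...
print(partial_corr(smiling, happiness, utopamine))   # ~0
# ... but conditional on smiling, Utopamine and happiness stay strongly correlated.
print(partial_corr(utopamine, happiness, smiling))   # clearly nonzero
```

With binary smiling you'd want a different conditional-independence test, but the asymmetry between the two conditionings is the same.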

Observational data doesn't allow one to distinguish correlation and causation.

No? If I observe a hammer striking a nail and the nail sinking into the wooden plank, is anyone going to argue that it's mere correlation and not causation?

Observational data doesn't always allow one to distinguish correlation and causation.

I am also a bit confused: you're talking about learning values, but your example is not about values, it's about a causal relationship.

Indeed. Pearl's "Causality" talks at length about this sort of thing, and about what data can and cannot distinguish between causation and mere correlation. There's even a Sequence post about this exact topic.

Is there a reason to think this problem is less amenable to being solved by complexity priors than other learning problems? / Might we build an unaligned agent competent enough to be problematic without solving problems similar to this one?