In Techniques for optimizing worst-case performance, Paul Christiano says:
The key point is that a malign failure requires leveraging the intelligence of the model to do something actively bad. If our model is trained by gradient descent, its behavior can only be intelligent when it is exercised on the training distribution — if part of the model never (or very rarely) does anything on the training distribution, then that part of the model can’t be intelligent. So in some sense a malign failure mode needs to use a code path that gets run on the training distribution, just under different conditions that cause it to behave badly.
Here is how I would rephrase it:
Aligned or Benign Conjecture: Let A be a machine learning agent you are training with an aligned loss function. If A is put in a situation so far out of distribution that it is no longer aligned, it won't act intelligently either.
(Although I'm calling this a "conjecture", it's probably context-dependent rather than a single mathematical statement.)
This seems pretty plausible, but I'm not sure it's guaranteed mathematically 🤔. (For example, a neural network could have subcomponents that are great at specific tasks, such that putting A in an out-of-distribution situation does not put those subcomponents out of distribution.)
I'm wondering if there is any empirical evidence or theoretical argument against this conjecture.
As an example, can we make an ML agent, trained with stochastic gradient descent, that abandons its utility function out-of-distribution but still has the same capabilities in some sense? For example, if the agent is fighting in an army, could an out-of-distribution environment cause it to defect to a different army, but still retain its fighting skills?
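To make the kind of experiment I have in mind concrete, here is a minimal sketch (tabular Q-learning on a toy corridor; every detail is made up for illustration). During training the reward always sits at the right end, so "go right" and "collect the reward" are indistinguishable; when the reward is moved, the trained agent still navigates competently, just toward the wrong thing:

```python
# Toy "capable but misgeneralized" agent: tabular Q-learning in a 1-D corridor.
# Training: the reward ("coin") is always at the right end, so maximizing reward
# is indistinguishable from the policy "always go right".
# Test: the coin is moved to the left end. The agent still navigates the
# corridor competently -- it just marches right, past nothing.
import numpy as np

N = 10                      # corridor cells 0..N-1
ACTIONS = (-1, +1)          # left, right
rng = np.random.default_rng(0)

def run_episode(q, coin, explore=0.0, learn=False, alpha=0.1, gamma=0.9):
    s, total = N // 2, 0.0
    for _ in range(2 * N):
        a = rng.integers(2) if rng.random() < explore else int(np.argmax(q[s]))
        s2 = int(np.clip(s + ACTIONS[a], 0, N - 1))
        r = 1.0 if s2 == coin else 0.0
        if learn:
            q[s, a] += alpha * (r + gamma * q[s2].max() - q[s, a])
        total, s = total + r, s2
        if s == coin:
            break
    return total

q = np.zeros((N, 2))
for _ in range(2000):                         # train with the coin fixed at the right end
    run_episode(q, coin=N - 1, explore=1.0, learn=True)

print("coin on the right (training setup):", run_episode(q, coin=N - 1))  # 1.0: gets the coin
print("coin on the left  (novel setup)   :", run_episode(q, coin=0))      # 0.0: skilled navigation, wrong goal
```

This is obviously a trivial case (the "capability" is just walking), but it's the shape of the question: the skill survives the distribution shift while the learned objective does not.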
Well, in order to be confident about a solution like this, we need to be able to reliably detect off-distribution cases. This gets into tricky philosophical issues about what really counts as "off-distribution" (and I think you'll find that the concept of "off-distribution" turns out to be not quite the right one for the job).
Measuring distance from the latent space of auto-encoders: well, the auto-encoders learn some model of the training distribution. But the whole concern is that their capabilities may generalize off-distribution; so, to take this concern seriously, I suppose we should entertain the possibility that they compress some things well even off-distribution. So measuring distance away from the latent space by trying to compress the input seems to define away some important aspects of the problem.
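Concretely, the kind of detector I understand you to be proposing looks something like the following (a quick PyTorch sketch; the tiny architecture and fake data are just placeholders), and the worry is that nothing here rules out a genuinely novel situation compressing well:

```python
# Sketch of "distance from the auto-encoder's latent space", operationalised as
# reconstruction error (PyTorch; architecture and data are toy stand-ins).
import torch
import torch.nn as nn

class TinyAutoencoder(nn.Module):
    def __init__(self, dim=32, latent=4):
        super().__init__()
        self.enc = nn.Sequential(nn.Linear(dim, 16), nn.ReLU(), nn.Linear(16, latent))
        self.dec = nn.Sequential(nn.Linear(latent, 16), nn.ReLU(), nn.Linear(16, dim))

    def forward(self, x):
        return self.dec(self.enc(x))

def novelty_score(model, x):
    """Per-example reconstruction error: high = compresses poorly = 'off-distribution'."""
    with torch.no_grad():
        return ((model(x) - x) ** 2).mean(dim=-1)

torch.manual_seed(0)
# Stand-in training distribution: a 4-dimensional signal embedded in 32 dimensions.
train = torch.randn(2000, 4) @ torch.randn(4, 32) + 0.05 * torch.randn(2000, 32)
model = TinyAutoencoder()
opt = torch.optim.Adam(model.parameters(), lr=1e-3)
for _ in range(2000):
    opt.zero_grad()
    ((model(train) - train) ** 2).mean().backward()
    opt.step()

print(novelty_score(model, train[:3]))        # low: familiar inputs compress well
print(novelty_score(model, train[:3] + 3.0))  # usually much higher: a crude distribution shift
# But nothing guarantees the converse: a genuinely novel situation may still
# compress well, which is the worry above.
```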
For your example, the bright daylight might be incompressible under the factory robot's learned prior. However, in another scenario, perhaps a new machine is introduced to the factory floor. The new machine has some AI features of its own. Because the Atlas AI was never trained with other AIs, this is out-of-distribution and could result in goal misgeneralization. However, the appearance and behavior of the new machine might be compressible enough to still register as "within the latent space".
The question here is whether the actual training examples are enough to fully train the correct behaviors for what's compressible.
It's tempting to try to argue that if the latent space is small enough (like, the representation uses few enough bits), then the training data must cover it well. But notice that this isn't actually enough. We also have to think that the compressed representation understands the situation in the way we do / the way we expect. Adversarial counterexamples for NNs tell us that, to some extent, this is not the case. So it seems possible that even if the data covers the latent space quite well, there's still misgeneralization later, because something compresses down to the latent space just fine, but it compresses to a different point than we would have wanted it to.
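Here is a toy version of that failure, in the same spirit as the sketch above (PyTorch; all the specifics are invented). Two features always co-occur in training, so the latent space only needs one slot for both. A test input with the correlation broken still compresses about as well as training data, but it gets mapped to the latent point of a "both features present" situation, which is not the point we would have wanted:

```python
# Toy version of "compresses fine, but to the wrong point" (PyTorch; all details
# invented). Feature B always tracks feature A in training, so one latent slot
# suffices for both. A test input with A present but B absent was never seen,
# yet its reconstruction error stays comparable to training data -- and the
# decoder "fills B back in", i.e. the input is mapped to the latent point of an
# A-and-B situation rather than the A-without-B one.
import torch
import torch.nn as nn

torch.manual_seed(0)
d = 32
u = torch.zeros(d); u[0] = 1.0            # feature A lives in coordinate 0
v = torch.zeros(d); v[1] = 1.0            # feature B lives in coordinate 1
a = torch.randn(2000, 1)
train = a * u + (0.2 * a) * v + 0.05 * torch.randn(2000, d)   # B = 0.2 * A, always

ae = nn.Sequential(nn.Linear(d, 16), nn.ReLU(), nn.Linear(16, 1),   # 1-dim bottleneck
                   nn.Linear(1, 16), nn.ReLU(), nn.Linear(16, d))
opt = torch.optim.Adam(ae.parameters(), lr=1e-3)
for _ in range(5000):
    opt.zero_grad()
    ((ae(train) - train) ** 2).sum(dim=1).mean().backward()
    opt.step()

x_novel = 1.5 * u                          # A present, B absent: a combination never seen
with torch.no_grad():
    recon = ae(x_novel)
    err_train = ((ae(train) - train) ** 2).sum(dim=1).sqrt().mean().item()
    err_novel = ((recon - x_novel) ** 2).sum().sqrt().item()
print("typical training reconstruction error:", err_train)   # roughly the noise floor
print("novel-input reconstruction error     :", err_novel)   # typically comparable
print("feature B, input vs reconstruction   :", x_novel[1].item(), recon[1].item())
# recon[1] typically comes out near 0.3 (= 0.2 * 1.5): the model treats the novel
# situation as the familiar correlated one, even though it "compressed just fine".
```

Again, a toy with a hand-picked "feature B", but it's the factory-floor story in miniature: the new situation registers as within the latent space, and the model quietly interprets it as a situation it already knows.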