In Techniques for optimizing worst-case performance, Paul Christiano says:
The key point is that a malign failure requires leveraging the intelligence of the model to do something actively bad. If our model is trained by gradient descent, its behavior can only be intelligent when it is exercised on the training distribution — if part of the model never (or very rarely) does anything on the training distribution, then that part of the model can’t be intelligent. So in some sense a malign failure mode needs to use a code path that gets run on the training distribution, just under different conditions that cause it to behave badly.
Here is how I would rephrase it:
Aligned or Benign Conjecture: Let A be a machine learning agent you are training with an aligned loss function. If A is in a situation that is too far out of distribution for it to be aligned, it won't act intelligently either.
(Although I'm calling this a "conjecture", it's probably context-dependent rather than being a single mathematical statement.)
This seems pretty plausible, but I'm not sure it's guaranteed mathematically 🤔. (For example, a neural network could have subcomponents that are great at specific tasks, such that putting A in an out-of-distribution situation does not put those subcomponents out of distribution.)
I'm wondering if there is any empirical evidence or theoretical argument against this conjecture.
As an example, can we make an ML agent, trained with stochastic gradient descent, that abandons its utility function out of distribution but still retains its capabilities in some sense? For example, if the agent is fighting in an army, could an out-of-distribution environment cause it to defect to a different army while still retaining its fighting skills?
Note that we have a simple and succinct solution to this problem. It's already what we see used in BingChat.
If out of the training distribution -> controlled shutdown.
Do not accept output actions from the agent when its input is not within the latent space of the training distribution. We can measure how well the current input state fits within that latent space by using an autoencoder to try to compress the input; if the incompressible residual exceeds a threshold, shut down.
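A minimal sketch of that kind of check, assuming a simple PyTorch autoencoder trained on in-distribution frames; the architecture, the threshold value, and names like FrameAutoencoder and in_distribution are illustrative, not from the original comment:

```python
import torch
import torch.nn as nn

class FrameAutoencoder(nn.Module):
    """Compresses a flattened sensor frame to a small latent code and back."""
    def __init__(self, frame_dim: int, latent_dim: int = 32):
        super().__init__()
        self.encoder = nn.Sequential(
            nn.Linear(frame_dim, 256), nn.ReLU(),
            nn.Linear(256, latent_dim),
        )
        self.decoder = nn.Sequential(
            nn.Linear(latent_dim, 256), nn.ReLU(),
            nn.Linear(256, frame_dim),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.decoder(self.encoder(x))

def reconstruction_residual(model: FrameAutoencoder, frame: torch.Tensor) -> float:
    """Mean squared error between a frame and its reconstruction.

    A large residual means the frame does not compress well into the latent
    space learned from the training distribution, i.e. it is likely out of
    distribution.
    """
    with torch.no_grad():
        recon = model(frame)
        return torch.mean((recon - frame) ** 2).item()

# Illustrative value only; in practice this would be calibrated, e.g. as a
# high percentile of residuals on held-out in-distribution frames.
RESIDUAL_THRESHOLD = 0.05

def in_distribution(model: FrameAutoencoder, frame: torch.Tensor) -> bool:
    return reconstruction_residual(model, frame) <= RESIDUAL_THRESHOLD
```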
A simple worked example: you have an AI agent driving an Atlas robot that is working inside a factory. The agent drops a part, and in the process of trying to retrieve it (a task it has practiced in simulation), the robot opens the emergency exit door.
The bright daylight, open sky, and background of buildings are all outside the training distribution. They can't be compressed into the representation of "factory states" without a huge residual. So the internal robotic control system transfers control to a lower-level controller that brings the hardware to a controlled stop.
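Under the same illustrative assumptions, the hand-off could look roughly like this; policy and safe_stop_controller are hypothetical placeholders, and in_distribution is the check sketched above:

```python
def control_step(frame, policy, autoencoder, safe_stop_controller):
    """One tick of the control loop.

    If the current frame does not fit the training-distribution latent space,
    ignore the learned policy and hand control to a simple lower-level
    controller that brings the hardware to a controlled stop.
    """
    if not in_distribution(autoencoder, frame):
        # Out of distribution: do not trust the learned agent's action.
        return safe_stop_controller(frame)
    # In distribution: the learned policy's action is accepted.
    return policy(frame)
```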
This actually fixes a lot of alignment issues...
We have another tool in the toolbox to potentially add to this.
Each time a frame comes into the AI system, it checks whether that frame is within the latent space of the training distribution, yes. (A frame is the state of all sensor inputs plus the values saved from the outputs of the last execution.)
Each output can also enumerate what the machine believes the next n frames will be, factoring in the effects of its own actions; for a stochastic process there can be more than one candidate frame per step.
For example, if a machine is observing a coin flip, it would need to out...
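A rough sketch of how this predictive check might be layered on top of the per-frame check, assuming a predictor that returns one candidate frame per plausible outcome (e.g. both heads and tails for a coin flip); predict_next_frames and action_is_acceptable are hypothetical names, and in_distribution is reused from the earlier sketch:

```python
def action_is_acceptable(action, frame, predict_next_frames, autoencoder):
    """Reject an action if any predicted consequence falls outside the
    training-distribution latent space.

    `predict_next_frames(frame, action)` is assumed to return a list of
    candidate next frames; for a stochastic process like a coin flip it would
    return one candidate per plausible outcome.
    """
    for predicted in predict_next_frames(frame, action):
        if not in_distribution(autoencoder, predicted):
            return False
    return True
```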