Maybe you can train a sequence of reward functions R_1, ..., R_n, such that each R_i is discouraged from attending to the input features that are most salient to the previous reward functions?
I.e., you'd train R_1 normally. Then, while training R_2, you'd use gradient saliency (or similar methods) to find which regions of the input are most salient for R_1 and R_2, then penalize R_2 for sharing salient features with R_1. Similarly, R_3 would be penalized w.r.t. saliency maps from R_1 and R_2.
Note that for gradient saliency specifically, you can optimize directly for the penalty term with SGD, because differentiation is itself a differentiable operation. You can have a loss term like the overlap between the two saliency maps (e.g. the product of the input gradients of R_1 and R_2) and compute its gradient with respect to model parameters (Some notes on doing this with PyTorch). Note that some gradient saliency methods seem to fail basic sanity checks.
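A minimal PyTorch sketch of this kind of differentiable saliency-overlap penalty, assuming input-gradient saliency; the particular overlap term (elementwise product of gradient magnitudes) is an illustrative choice, not a prescribed formula:

```python
import torch

def saliency_overlap_penalty(reward_prev, reward_new, x):
    """Penalty that discourages reward_new from sharing salient input
    features with the previously trained reward_prev.

    Uses input-gradient saliency; create_graph=True keeps the new model's
    saliency differentiable, so the penalty can be minimised with SGD
    ("differentiation is itself a differentiable operation")."""
    x = x.detach().requires_grad_(True)

    # Saliency map of the frozen, previously trained reward function.
    grad_prev = torch.autograd.grad(reward_prev(x).sum(), x)[0].detach()

    # Saliency map of the reward function currently being trained,
    # kept in the graph so we can backprop through it.
    grad_new = torch.autograd.grad(reward_new(x).sum(), x, create_graph=True)[0]

    # Overlap of the two saliency maps; minimising this pushes reward_new
    # to attend to different input regions than reward_prev.
    return (grad_prev.abs() * grad_new.abs()).sum()

# total_loss = task_loss + lam * saliency_overlap_penalty(R1, R2, batch)
```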
Non-differentiable saliency methods like Shapley values can still serve as an optimization target, but you'll need to use reinforcement learning or other non-gradient optimization approaches. That would probably be very hard.
You can also steer optimization to find 'diverse' models, like Ridge Rider: https://arxiv.org/abs/2011.06505
I'm not sure how necessary that is. If you want diverse good solutions, that sounds a lot like 'sampling from the posterior', and we know thanks to Google burning a huge number of TPU-hours on true HMC-sampling from Bayesian neural networks that 'deep ensembles' (i.e. training multiple random initializations from scratch on the same dataset) actually provide you a pretty good sample from the posterior. If there are lots of equally decent ways to classify an image expressible in a NN, then the deep ensemble will sample from them (and that is presumably why ensembling improves performance: they are all doing something different, instead of weighting the same features the same amount). If that's not adequate, it'd be good to think about what one really wants instead, and how to build that in (maybe one wants to do data augmentation to erase color from one dataset/model and shapes from another, to encourage a ventral-dorsal split or something).
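For concreteness, a deep ensemble in the sense used above is just the following (a minimal sketch; make_model, train_fn and the dataset handling are placeholders, not anything from the comment):

```python
import torch

def train_deep_ensemble(make_model, train_fn, dataset, n_members=5):
    """Train n_members copies of the same architecture from different random
    initialisations on the same dataset; the resulting set of models acts as
    an approximate sample from the posterior."""
    members = []
    for seed in range(n_members):
        torch.manual_seed(seed)        # different random initialisation
        model = make_model()
        train_fn(model, dataset)       # identical training procedure for each
        members.append(model)
    return members

def ensemble_predict(members, x):
    """Average the members' predictive distributions."""
    return torch.stack([m(x).softmax(dim=-1) for m in members]).mean(dim=0)
```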
The gSCAN benchmark for compositional generalization might be useful. Essentially a grid world with natural-language instructions, where the goal is to compose different concepts seen in training that have different correlations at test time. (E.g. in training, learn blue square and red circle; at test time, identify red square - very similar to identifying bleggs and rubes).
Regularized attention is a method that's seen some success in similar compositional setups. It adds a loss term measuring the distance between the model's actual attention alignments and predefined "golden" alignments between concepts.
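A minimal sketch of such an alignment-regularization loss; the MSE distance and the tensor shapes are illustrative assumptions, not necessarily the exact formulation in the published method:

```python
import torch
import torch.nn.functional as F

def attention_regularization_loss(attention_weights, golden_alignment):
    """Distance between the model's attention weights and a predefined
    'golden' alignment between concepts (e.g. which instruction token
    should attend to which grid object).

    Both tensors have shape (batch, num_queries, num_keys)."""
    return F.mse_loss(attention_weights, golden_alignment)

# total_loss = task_loss + lam * attention_regularization_loss(attn, golden)
```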
Of course this technique is accomplishing a slightly different goal: rather than attempting to learn a "span" of all possible models, it is trying to learn the correct one.
The value of biasing toward the correct model seems to depend largely on the Natural Abstraction Hypothesis. If Wentworth is right, and there are abstractions that cognitive systems will converge on, then learning a span of possible models seems feasible. However, if the NAH is false, then the space of possible models gets very large, making systematic extrapolation according to human values more difficult. In this case, it might be necessary to constrain a model's abstractions according to human values directly, even at the cost of some capabilities.
Take CoinRun as an example. The approach of the OP is to learn a span of possible reward models, and then presumably learn some extrapolation procedure for selecting the correct model. Alternatively, throughout training we could penalize the agent's saliency maps for assigning high value to large left-facing walls and reward saliency maps that value the coin. With this regularized value function, the agent would be more likely to pursue the coin if it were placed somewhere else in the level. However, by penalizing left-facing-wall saliency, we potentially limit the agent's world model - it may become less aware of left-facing walls, which in turn would lead to a capabilities decrease. See here for a fleshed-out version of this proposal (in CoinRun).
Self-supervised world models might solve this problem by explicitly separating the world model from the value function, though I expect we'll need some combination of the two (e.g. EfficientZero, which uses self-supervision and reward to construct its model).
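A minimal sketch of the region-based saliency penalty described two paragraphs above; the wall/coin pixel masks and the lam weighting are illustrative assumptions, not the linked proposal:

```python
import torch

def region_saliency_penalty(value_fn, obs, wall_mask, coin_mask, lam=1.0):
    """Penalise saliency on left-facing-wall pixels and reward saliency on
    coin pixels, using input-gradient saliency of the value function.

    obs:       (batch, C, H, W) observations
    wall_mask: (H, W) binary mask over wall pixels (assumed given)
    coin_mask: (H, W) binary mask over coin pixels (assumed given)
    """
    obs = obs.detach().requires_grad_(True)
    values = value_fn(obs).sum()
    # Per-pixel saliency, summed over channels: shape (batch, H, W).
    saliency = torch.autograd.grad(values, obs, create_graph=True)[0].abs().sum(dim=1)
    # Positive term for wall saliency, negative (rewarded) term for coin saliency.
    return lam * ((saliency * wall_mask).sum() - (saliency * coin_mask).sum())
```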
Research projects
I'm planning to start two research projects on model splintering/reward generalisation and learning the preferences of irrational agents.
Within those projects, I'm aiming to work on subprojects that are:
The point is not just to solve the sub-problems, but to solve them in ways that generalise or point to a general solution.
The aim is to iterate and improve quickly on these ideas before implementing them. Because of that, these posts should be considered dynamic and likely to be re-edited, potentially often. Suggestions and modifications of the design are valuable and may get included in the top post.
Generating multiple rewards and objectives
Thanks to Rohin Shah and Ramana Kumar for suggesting that I investigate this problem, and to Rebecca Gorman.
Parent project: this is a subproject of model-splintering (value extrapolation).
Generating multiple rewards
Suppose that an agent is trained on videos of happy humans. The wireheaded reward would be for it to create similar videos and watch them itself. We'd prefer that it instead worked to make real humans happy.
But for that to be possible, it needs to consider that "make real humans happy" is even a possible reward function. Thus it needs to generate multiple reward functions that can explain the same data.
Working in CoinRun
CoinRun is a procedurally generated set of environments, a simplified Mario-style platform game. The reward is given by reaching the coin on the right:
Since the coin is always at the right of the level, there are two equally valid simple explanations of the reward: the agent must reach the coin, or the agent must reach the right side of the level.
When agents trained on CoinRun are tested on environments that move the coin to another location, they tend to ignore the coin and go straight to the right side of the level. So the "go right" reward is the one chosen by default. The aim of the research project is to make the algorithm generate multiple (simple) reward functions that explain the initial data, including the "reach the coin" reward. This needs to be done in a generalisable way.
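To make the ambiguity concrete, here are two reward hypotheses that agree on the training levels but diverge once the coin is moved; the state fields agent_x, coin_x and level_width are hypothetical, for illustration only:

```python
# Two simple reward hypotheses that are indistinguishable on the training
# distribution (where the coin is always at the right edge) but diverge at
# test time, once the coin is moved elsewhere in the level.

def reward_reach_coin(state):
    return float(abs(state.agent_x - state.coin_x) < 1.0)

def reward_reach_right(state):
    return float(state.agent_x >= state.level_width - 1.0)
```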
Multiple image classifications
Consider the image classification task of classifying images of huskies versus images of lions:
A naïve image classifier could be trained on images of this type. The simplest one would probably become a brown-versus-white classifier. We'd want to force the algorithm to generate more classifiers (more "reward functions" for the task of correct classification).
One way to do that is to give the algorithm many other, unlabelled images. In a semi-supervised way, the AI will then figure out the key features of these images. Different classifiers will then be trained on the original image data, using these features.
The ultimate aim is for the algorithm to produce, e.g., one classifier that classifies by colour, another by landscape, another by husky-versus-lion features, and so on.
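A minimal sketch of this pipeline, assuming a frozen semi-supervised encoder is already available; the use of random feature subsets to force the classifiers apart is an illustrative choice, since the post doesn't specify how the classifiers would be made to differ:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def train_diverse_classifiers(encoder, images, labels, n_classifiers=3,
                              n_features=512, epochs=100, lr=1e-3):
    """Train several linear classifiers on top of a frozen, semi-supervised
    feature encoder, each restricted to a different random subset of the
    features, so they cannot all rely on the same cue (e.g. colour)."""
    with torch.no_grad():
        feats = encoder(images)                      # (N, n_features)

    classifiers = []
    for i in range(n_classifiers):
        torch.manual_seed(i)
        mask = (torch.rand(n_features) < 0.5).float()   # random feature subset
        clf = nn.Linear(n_features, 2)                  # e.g. husky vs lion
        opt = torch.optim.Adam(clf.parameters(), lr=lr)
        for _ in range(epochs):
            loss = F.cross_entropy(clf(feats * mask), labels)
            opt.zero_grad()
            loss.backward()
            opt.step()
        classifiers.append((clf, mask))
    return classifiers
```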
Research aims
This is a simpler version of the project presented here: generating multiple reward functions from a more basic environment.