It's so frustrating to me that "model-utility" learning doesn't have a guarantee. It's like, you make an AI that has a good model of the world, you point (via extensional definition) at some things in the world and say "do things like that!" ... And then the AI can learn the category "things that cause the human to include them in the extensional definition," and create stimuli that would hack your brain if you were alive to see them.
It might need a better understanding of reference, and it might need breakthroughs in human-like concepts and matching the training distribution. But maybe it's still near the right track?
I can definitely tap into the "This should work!" intuition, which says that there should be a way to avoid the problem without significantly changing the feedback loop -- if only we could articulate to the system the mistake it is making. Yet, it seems like to address these sorts of failures you have to change the feedback loop.
What does it mean for an AI who knows a lot more about what the world is to do what a human wants?
Utility functions are likely the wrong concept (Stuart Armstrong has given a lot of reasons to think this). My suspicion is that the better concept is "what a human would want you to do in a situation"; i.e., you try to extract a policy rather than a utility. That's a little like approval-direction in flavor. A big problem: like my "human hypothesis evaluation" above, it would require the AI to construct human-understandable explanations of its potential cognitive states for the human. ("What action do I take if I'm thinking all these things?")
What other concepts do we need to refactor? Maybe knowledge?
Cross-posted.
In Stable Pointers to Value, I discussed various ways in which we can try to “robustly point at what we want” (ie, do value learning). I can tidy up the discussion there into three categories:
I want to point at an analogy to three categories of approach to the problem of generalizable environmental goals (as defined in the alignment for advanced machine learning agenda). It’s a fairly messy analogy, and there’s probably a better way of organizing the landscape, but FWIW.
1. Supervised Learning
Imagine you’re trying to teach a system to build bridges by showing it examples. You could learn a big neural network which distinguishes cases of “successfully building a bridge” from everything else, and then use this to drive the system.
If the agent is an RL or OU agent, it is incentivised to “fool itself” by doing things like playing a video of bridge-building in front of its camera. You can try and train the classifier to notice this sort of thing, of course; you give it negative training examples in which someone puts a TV set in front of it and things thereafter appear as they do in one of the positive examples. However, you can’t figure out all the different negative training examples you need to give it ahead of time – especially if the rest of the system will continue to learn later on as the classifier remains fixed.
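To make the failure concrete, here is a toy sketch in Python. Everything in it (`WorldState`, `camera`, `observation_classifier`) is a made-up stand-in, not part of any real system: the "classifier" only sees what the camera records, so a screen showing bridge footage scores exactly as well as a real bridge.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class WorldState:
    """Hypothetical ground truth that the agent never observes directly."""
    bridge_built: bool
    tv_showing_bridge: bool  # a screen placed in front of the camera


def camera(state: WorldState) -> str:
    """What the camera records; footage of a bridge looks like a bridge."""
    if state.bridge_built or state.tv_showing_bridge:
        return "bridge"
    return "no_bridge"


def observation_classifier(observation: str) -> bool:
    """Stand-in for the learned network: it judges observations only."""
    return observation == "bridge"


real = WorldState(bridge_built=True, tv_showing_bridge=False)
fooled = WorldState(bridge_built=False, tv_showing_bridge=True)

reward_real = observation_classifier(camera(real))
reward_fooled = observation_classifier(camera(fooled))
```

Both `reward_real` and `reward_fooled` come out `True`, because the two states produce the same observation. A negative training example for the TV trick would fix this one case, but the classifier would still be a function of observations alone.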
To me, this feels closely analogous to trying to prevent RL systems from wireheading themselves by giving them strongly negative reward for trying to mess with their reward circuits. You don’t know ahead of time what all the things you need to punish are, but you would need to, since the system keeps getting smarter as the reward circuit remains the same. (Or, if humans are managing the reward button, they need to be able to recognize any attempts to mess with the hardware or take over control of the reward button or manipulate the humans.)
2. Model-Utility Learning
One way you might try to solve this: the AI is learning a model of the world in an unsupervised way, only trying to predict well, not thinking at all about its goals. Separately, the AI is learning a classifier representing the goals. This classifier takes the model state, rather than the observations.
So, returning to the bridge-building example, the system is shown lots of examples of building bridges and not building bridges. It infers a physical model of what’s going on in those examples, plus a predicate on the physical situations which tells it whether the state of affairs corresponds to proper bridge-building.
As before, we can show it many negative training examples involving different methods of attempting to fool itself.
Now, we might reasonably expect that if the AI considers a novel way of “fooling itself” which hasn’t been given in a training example, it will reject such things for the right reasons: the plan does not involve physically building a bridge.
This can also deal with the problem of ontological crisis, even without new classifier data. As the physical model changes in response to new data, the classifier is simply re-learned so that it remains accurate on the original training examples.
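A minimal sketch of the setup described above, under heavy simplifying assumptions: the "world model" is a hand-written function rather than something learned, and the "classifier" is just the single model feature that matches every label. The names `model_v1`, `model_v2`, and `relearn_classifier` are hypothetical.

```python
from typing import Callable, Dict, List, Tuple

Episode = Dict[str, bool]
State = Dict[str, bool]

# Labeled training episodes. The third one is a negative example of fooling.
EPISODES: List[Tuple[Episode, bool]] = [
    ({"span_in_place": True, "tv": False}, True),
    ({"span_in_place": False, "tv": False}, False),
    ({"span_in_place": False, "tv": True}, False),
]


def model_v1(ep: Episode) -> State:
    """Stand-in for the first learned physical model."""
    return {"bridge_exists": ep["span_in_place"], "camera_blocked": ep["tv"]}


def model_v2(ep: Episode) -> State:
    """Stand-in for a later model with a different ontology."""
    return {
        "load_path_across_river": ep["span_in_place"],
        "photons_from_screen": ep["tv"],
    }


def relearn_classifier(
    model: Callable[[Episode], State],
    episodes: List[Tuple[Episode, bool]],
) -> str:
    """Return the model feature that agrees with every training label."""
    states = [(model(ep), label) for ep, label in episodes]
    for feature in states[0][0]:
        if all(state[feature] == label for state, label in states):
            return feature
    raise ValueError("no single feature fits the training labels")


goal_v1 = relearn_classifier(model_v1, EPISODES)
goal_v2 = relearn_classifier(model_v2, EPISODES)

# A fooling method that never appeared in training (hypothetical).
novel_plan = {"span_in_place": False, "tv": False, "camera_feed_hacked": True}
novel_plan_accepted = model_v1(novel_plan)[goal_v1]
```

Here the predicate lives on the model state, so the novel plan is rejected because no bridge exists in the model, and the same labeled episodes pick out a new feature when the ontology changes from `model_v1` to `model_v2`.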
Unfortunately, this approach has serious problems.
Since humans (or something) must be labeling the original training examples, the hypothesis that building bridges means “what humans label as building bridges” will always be at least as accurate as the intended classifier. I don’t mean “whatever humans would label”. I mean the hypothesis that “build a bridge” means specifically the physical situations which were recorded as training examples for this system in particular, and labeled by humans as such.
This time, there’s no way to patch the problem with negative training examples. You can’t label an example as both positive and negative!
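A toy illustration of why the pathological hypothesis can never lose on the training set. The data and the single labeling mistake are invented for the example; the point is only that a hypothesis defined as "whatever was recorded with a positive label" matches the labels by construction.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass(frozen=True)
class Example:
    bridge_built: bool  # the physical fact in the model state
    human_label: bool   # what the labeler recorded for this example


TRAINING: List[Example] = [
    Example(bridge_built=True, human_label=True),
    Example(bridge_built=False, human_label=False),
    Example(bridge_built=True, human_label=True),
    Example(bridge_built=False, human_label=True),  # hypothetical labeler mistake
]


def intended(ex: Example) -> bool:
    """The hypothesis we want: a bridge was physically built."""
    return ex.bridge_built


def labeled_by_humans(ex: Example) -> bool:
    """The pathological hypothesis: this was recorded as a positive example."""
    return ex.human_label


def accuracy(hypothesis: Callable[[Example], bool], data: List[Example]) -> float:
    """Fraction of examples where the hypothesis agrees with the label."""
    return sum(hypothesis(ex) == ex.human_label for ex in data) / len(data)


intended_accuracy = accuracy(intended, TRAINING)
pathological_accuracy = accuracy(labeled_by_humans, TRAINING)
```

With these made-up examples the intended hypothesis scores 0.75 and the pathological one scores 1.0. With perfect labels the two would tie, which is still not a reason for the learner to prefer the intended one.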
How can we avoid simple-but-wrong hypotheses like this?
3. Human Hypothesis Evaluation
Just as approval-directed agents put more work on the humans in the control loop, we can try and do the same here.
As in model-utility systems, we build a model of the environment through unsupervised learning, and also try to learn the utility in a supervised way.
However, this time the system gets feedback on the quality of hypotheses from humans, and also tries to anticipate such feedback in its model selection. I’m not sure exactly how this should work, but one version is: ask the humans to classify made-up examples. Such examples of bridge-building can be in imaginary worlds where there are no humans evaluating whether bridge-building is going on, so as to differentiate the pathological hypothesis mentioned above from the desired hypothesis.
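One way to picture the made-up-example version, as a rough sketch rather than a worked-out proposal: keep a set of candidate hypotheses, pose an imagined scenario on which they disagree, and drop the ones that contradict the human's answer. `Scenario`, `surviving_hypotheses`, and `human_answer` are all hypothetical, and `human_answer` is a hard-coded stand-in for an actual person.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass(frozen=True)
class Scenario:
    bridge_built: bool
    recorded_and_labeled_positive: bool  # was it a human-labeled training example?


Hypothesis = Callable[[Scenario], bool]

HYPOTHESES: Dict[str, Hypothesis] = {
    "physical_bridge": lambda s: s.bridge_built,
    "labeled_by_humans": lambda s: s.recorded_and_labeled_positive,
}


def surviving_hypotheses(
    hypotheses: Dict[str, Hypothesis],
    queries: List[Scenario],
    answer: Callable[[Scenario], bool],
) -> List[str]:
    """Keep the hypotheses that agree with the human on every query."""
    return [
        name
        for name, h in hypotheses.items()
        if all(h(q) == answer(q) for q in queries)
    ]


# An imagined world with a bridge but no humans around to label anything.
imagined = [Scenario(bridge_built=True, recorded_and_labeled_positive=False)]


def human_answer(s: Scenario) -> bool:
    """Stand-in for a human judging the made-up scenario."""
    return s.bridge_built


survivors = surviving_hypotheses(HYPOTHESES, imagined, human_answer)
```

Only `"physical_bridge"` survives, since the imagined world separates the two hypotheses in a way no real training example can. The sketch assumes the human can be shown the scenario in an understandable form, which is the hard part.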
For this to work, though, we also have to solve the problem of providing human-understandable explanations of the AI’s learned models, which is its own Pandora’s box.
Discussion
The overall point I’m trying to make here has similarities to the Reinforcement Learning with a Corrupted Reward Channel paper, particularly section 4.1: the way the system gets feedback matters a lot. The way humans get put into the loop can be very tricky; seemingly obvious answers lead to pathological behaviors for highly capable systems. Trying to fix this behavior can lead us down a rabbit-hole of trying patch after patch after patch, until a change in perspective like observation-utility learning eliminates the need for all those patches in one fell swoop (and then we find ourselves making entirely new patches on a higher level and about more important things…).