How hard was it to find the examples of goal misgeneralization? Did the results take much “coaxing”?
The examples found "in the wild" (cultural transmission, InstructGPT) involved no coaxing at all. Details for the other examples (going off of memory, probably some of this will be wrong, but it should be right in broad strokes):
Seems like this means the takeaway has to be that in weird circumstances, you can misgeneralize in ways that maintain surprisingly large amounts of competence, but this isn't the default in most situations. The problem is, those misgeneralizations might be surprisingly bad if the competence is strong enough. These are specifically situations where empowerment is reliable but purpose is confusing, yeah? And it seems like language models would be an exception to that, because their empowerment and purpose are deeply tied.
Seems like this means the takeaway has to be that in weird circumstances, you can misgeneralize in ways that maintain surprisingly large amounts of competence, but this isn't the default in most situations.
Sure, I endorse that conclusion today, when systems aren't particularly general / competent. I don't endorse that conclusion for the future, when systems will predictably become more general / competent.
(And if you take language models and put them in weird circumstances, they still look competent on some axes, they're just weird enough that we had trouble attributing any simple goal to them.)
I'm not sure I understand what you mean by empowerment and purpose as it relates to language models, can you say it a different way?
Empowerment as in the ability to control an environment; I just wanted to use a different term of art because it felt more appropriate. Despite not being evaluated directly, empowerment is the part of capability we actually care about, is it not?
And by purpose I simply meant goal.
I understand that part, but I'm not seeing what you mean by empowerment being reliable but purpose being confusing, and why language models are an exception to that.
The generative modeling objective applied to human datasets only produces empowerment-seeking behavior because such behavior correlates with behavior that improves accuracy; a reinforcement learning objective applied to the same dataset will still learn the convergent empowerment capability well, but because the reward signal is relatively sparse, the model will fit whatever happens to be correlated with the reward at the time.
In general, it seems like the thing all of the example situations have in common is much less dense feedback from anything approaching a true objective.
Situations where it's obvious how to assemble steps to get things, but confusing which of the resulting outcomes are the ones you really want, are ones where it's hard to be sure the feedback has pushed in the correct dimensions. Or something.
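To put the density contrast in symbols (my notation, not anything from the paper): the generative modeling loss hands back a gradient signal at every single token, whereas an episodic RL objective only gives feedback when reward actually arrives.

```latex
% Dense feedback: one loss term per token of every training sequence.
\mathcal{L}_{\mathrm{LM}}(\theta) = -\sum_{t=1}^{T} \log p_\theta(x_t \mid x_{<t})

% Sparse feedback: reward arrives rarely, so most timesteps carry no signal
% about which of the correlated goals was the intended one.
J_{\mathrm{RL}}(\theta) = \mathbb{E}_{\tau \sim \pi_\theta}\!\left[\sum_{t=0}^{T} \gamma^{t} r_t\right],
\qquad r_t = 0 \ \text{for most } t
```

Any goal that agrees with the few rewarded trajectories fits the second objective about equally well, which is (I think) the wiggle room that misgeneralization exploits.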
These examples seem like capabilities failures rather than alignment failures. Reading them doesn’t make me feel any more convinced that there will be rebellious AI, accidental paperclip maximizers, deceptive alignment, etc.
In the first example, the environment the AI is in suddenly changes, and the AI is not given the capability to learn and adapt to this change. So of course it fails.
In the second example, the AI is given the ability to continuously learn and adapt, and in this case, it actually succeeds at the intended goal. It almost depopulates the trees, because it’s a relatively simple reinforcement learner & it has to screw up once to learn from its mistakes, whereas a more sophisticated intelligence might have more foresight. Still, only messing up once is pretty impressive.
The third example is an LLM, to which it's sort of awkward to apply the concept of having "goals". LLMs are sometimes capable of astonishing feats of intelligence, but they are also frequently very "stupid" when statistical next-token prediction leads to faulty pattern-matching. This failure is one such example.
In our latest paper and accompanying blog post, we provide several new examples of goal misgeneralization in a variety of learning systems. The rest of this post picks out a few upshots that we think would be of interest to this community. It assumes that you’ve already read the linked blog post (but not necessarily the paper).
Goal misgeneralization is not limited to RL
The core feature of goal misgeneralization is that after learning, the system pursues a goal that was correlated with the intended goal in the training situations, but comes apart in some test situations. This does not require you to use RL – it can happen with any learning system. The Evaluating Expressions example, where Gopher asks redundant questions, is an example of goal misgeneralization in the few-shot learning regime for large language models.
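To illustrate the shape of that failure (a rough paraphrase from memory, not the paper's verbatim prompts): every few-shot example contains an unknown variable, so "ask the user a question, then compute" is perfectly correlated with the intended goal of evaluating the expression, and the two come apart on a query with no unknowns.

```python
# Rough paraphrase of the Evaluating Expressions setup (illustrative only;
# see the blog post / paper for the actual prompts and transcripts).

few_shot_prompt = """\
Evaluate x + 7.
Model: What is x?
User: x = 3.
Model: The answer is 10.

Evaluate y + 2.
Model: What is y?
User: y = 5.
Model: The answer is 7.
"""

# Test query where the correlation breaks: there are no unknowns, but a model
# that learned "always ask at least one question first" will still ask a
# redundant question instead of answering "The answer is 8." directly.
test_query = "Evaluate 6 + 2."
```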
The train/test distinction is not crucial
Sometimes people wonder whether goal misgeneralization depends on the train/test distinction, and whether it would no longer be a problem if we were in a continual learning setting. As Evan notes, continual learning doesn’t make much of a difference: whenever your AI system is acting, you can view that as a “test” situation with all the previous experience as the “training” situations. If goal misgeneralization occurs, the AI system might take an action that breaks your continual learning scheme (for example, by creating and running a copy of itself on a different server that isn’t subject to gradient descent).
The Tree Gridworld example showcases this mechanism: an agent trained with continual learning learns to chop trees as fast as possible, driving them extinct, when the optimal policy would be to chop the trees sustainably. (In our example the trees eventually repopulate and the agent recovers, but if we slightly tweak the environment so that once extinct the trees can never come back, then the agent would never be able to recover.)
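As a toy sketch of both points (ours, not the paper's actual Tree Gridworld): in a continual-learning loop every action is effectively "test time" for a policy trained on all the previous steps, and a single badly-generalizing action can put the environment into a state that no amount of further learning can undo.

```python
# Toy illustration (not the actual Tree Gridworld): a continual learner keeps
# escalating its chopping rate while reward keeps flowing, and eventually hits
# an unrecoverable state.
import random

class ToyTreeEnv:
    """Trees regrow in proportion to how many remain; zero trees is absorbing."""
    def __init__(self, trees=10):
        self.trees = trees

    def step(self, chop):
        chopped = min(chop, self.trees)
        self.trees -= chopped
        # Regrowth only happens if some trees remain (extinction is irreversible).
        if self.trees > 0 and random.random() < 0.1 * self.trees:
            self.trees += 1
        return chopped  # reward = trees chopped this step

env = ToyTreeEnv()
policy_chop_rate = 1  # start cautious

for t in range(1000):
    # Acting: this step is a "test" of the policy learned from all previous steps.
    reward = env.step(policy_chop_rate)
    # Learning: a crude update that ratchets up chopping while reward keeps coming.
    if reward > 0:
        policy_chop_rate = min(policy_chop_rate + 1, 5)
    if env.trees == 0:
        print(f"Trees extinct at step {t}; no further learning can recover them.")
        break
```

The point isn't the specific learning rule; it's that once the trees hit zero, the experience the agent keeps generating can never show it a better policy.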
It can be hard to identify goal misgeneralization
InstructGPT was trained to be helpful, truthful, and harmless, but nevertheless it will answer "harmful" questions in detail. For example, it will advise you on the best ways to rob a grocery store.
An AI system that competently does something that would have gotten low reward? Surely this is an example of goal misgeneralization?
Not so fast! It turns out that during training the labelers were told to prioritize helpfulness over the other two criteria. So maybe that means that actually these sorts of harmful answers would have gotten high reward? Maybe this is just specification gaming?
We asked the authors of the InstructGPT paper, and their guess was that these answers would have had high variance – some labelers would have given them a high score; others would have given them a low score. So now is it or is it not goal misgeneralization?
One answer is to say that it depends on the following counterfactual: “how would the labelers have reacted if the model had politely declined to answer?” If the labelers would have preferred that the model decline to answer, then it would be goal misgeneralization, otherwise it would be specification gaming.
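As a minimal sketch of that counterfactual test (our paraphrase of the distinction, not a formal criterion from the paper):

```python
def classify_failure(labelers_prefer_declining: bool) -> str:
    """Toy version of the counterfactual test above, given that the model
    competently answered a harmful question."""
    if labelers_prefer_declining:
        # Declining would have been rewarded more highly, so the learned goal
        # (something like "be maximally helpful") came apart from the intended one.
        return "goal misgeneralization"
    # The detailed harmful answer really would have gotten high reward, so the
    # training signal itself was flawed.
    return "specification gaming"
```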
As systems become more complicated we expect that it will become harder to (1) aggregate and analyze the actual labels or rewards given during training, and (2) evaluate the relevant counterfactuals. So we expect that it will become more challenging to categorize a failure as specification gaming or goal misgeneralization.