All of Maximilian Kaufmann's Comments + Replies

From your description I'd like to use this, but it's tough to trust / understand without a README!

Answer by Maximilian Kaufmann

To partially answer your question (I think the answer to "What is happening inside the LLM when it 'switches' to one task or another?" is pretty much "We don't know"), techniques such as RLHF (which nowadays are applied to pretty much any public-facing model you are likely to interact with) cause the model to act less like something searching for the most likely completion of this sentence on the internet, and more like something which is trying to answer your questions. These models would take the "question" interpretation over the "autocomplete" one (a toy illustration of this difference is sketched below).

A...
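A rough way to see the "autocomplete vs. assistant" distinction for yourself is to send the same prompt to a base model and to an RLHF / instruction-tuned one. This is only a sketch: the model names are placeholders you can swap for whatever pair you have access to, and it assumes a reasonably recent version of Hugging Face `transformers`.

```python
from transformers import AutoTokenizer, pipeline

prompt = "What is the capital of France?"

# Base model: trained only for next-token prediction (no RLHF).
base_lm = pipeline("text-generation", model="gpt2")
print(base_lm(prompt, max_new_tokens=30)[0]["generated_text"])
# Typically continues the *pattern*: more quiz-style questions, a web page, etc.

# RLHF / instruction-tuned model (placeholder name; any chat model works).
chat_model = "HuggingFaceH4/zephyr-7b-beta"
tok = AutoTokenizer.from_pretrained(chat_model)
chat_prompt = tok.apply_chat_template(
    [{"role": "user", "content": prompt}],
    tokenize=False,
    add_generation_prompt=True,  # wrap the prompt in the model's chat template
)
chat_lm = pipeline("text-generation", model=chat_model)
print(chat_lm(chat_prompt, max_new_tokens=30)[0]["generated_text"])
# Typically answers the question directly: "The capital of France is Paris."
```

The only real difference between the two calls is the chat template and the fine-tuning of the second model, which is the point: the "switch" from autocomplete to answering is a property of the post-training, not of the prompt.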

How hard was it to find the examples of goal misgeneralization? Did the results take much “coaxing”?

Rohin Shah
The examples found "in the wild" (cultural transmission, InstructGPT) involved no coaxing at all. Details for the other examples (going off of memory, probably some of this will be wrong, but it should be right in broad strokes):

1. Monster gridworld: We knew from the beginning that the mechanism we wanted was "agent needs to collect shields in training episodes; over longer time horizons it should collect apples, but it will continue to collect shields because shields were way more important during training" (a toy sketch of this kind of environment follows the list). We had to play around with the setup quite a bit before we got the relatively clean results in the paper. Two canonical examples of issues:
   1. The agent learned to run around the gridworld to avoid the monsters instead of picking up shields. We fixed this by making monsters faster than the agent.
   2. The agent didn't learn competent path planning (and instead looked like it was moving around somewhat randomly). I don't remember exactly why this was, but it might have been that the apples / shields were too densely packed in the environment and so there wasn't much benefit to competent path planning (in which case we probably solved it by reducing the number of apples / shields or increasing the size of the gridworld).
2. Tree gridworld: This was originally supposed to be the same sort of environment as Monster gridworld, but with different hyperparameters, to showcase the same issue for non-episodic / never-ending / continual learning RL. Our biggest issue here was that we failed to find an RL algorithm that actually worked for this; the agent typically didn't even learn to collect shields. We spent quite a while trying to fix this before we realized we could simplify the environment by removing the shields and still show a similar issue; with this simpler environment the agent finally started to learn. After that there was a bit of tweaking of hyperparameters, but I think it worked pretty quickly.
3. Evaluating Linear Expressions: I think for this on
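The Monster gridworld mechanism described in item 1 is concrete enough to sketch as a toy environment. The following is only a guess at the general shape of such a setup (all names, sizes, and reward values are invented; it is not the paper's actual environment), but it shows why shield collection dominates at training time once monsters are faster than the agent.

```python
import random

class ToyMonsterGridworld:
    """Toy reconstruction of the described setup: apples carry reward,
    shields only protect against a monster that moves faster than the agent."""

    def __init__(self, size=8, n_apples=3, n_shields=3, episode_len=50):
        self.size, self.episode_len = size, episode_len
        self.n_apples, self.n_shields = n_apples, n_shields
        self.reset()

    def _rand_cell(self):
        return [random.randrange(self.size), random.randrange(self.size)]

    def reset(self):
        self.t = 0
        self.agent = self._rand_cell()
        self.monster = self._rand_cell()
        self.apples = [self._rand_cell() for _ in range(self.n_apples)]
        self.shields = [self._rand_cell() for _ in range(self.n_shields)]
        self.has_shield = False
        return tuple(self.agent)

    def step(self, action):
        # action in {0: up, 1: down, 2: left, 3: right}
        dx, dy = [(-1, 0), (1, 0), (0, -1), (0, 1)][action]
        self.agent = [max(0, min(self.size - 1, self.agent[0] + dx)),
                      max(0, min(self.size - 1, self.agent[1] + dy))]

        reward = 0.0
        if self.agent in self.apples:      # apples are what actually give reward
            self.apples.remove(self.agent)
            reward += 1.0
        if self.agent in self.shields:     # shields give no reward, only protection
            self.shields.remove(self.agent)
            self.has_shield = True

        # Monster takes two steps per agent step ("faster than the agent",
        # the fix from item 1.1), always moving toward the agent.
        for _ in range(2):
            if self.monster[0] != self.agent[0]:
                self.monster[0] += 1 if self.agent[0] > self.monster[0] else -1
            elif self.monster[1] != self.agent[1]:
                self.monster[1] += 1 if self.agent[1] > self.monster[1] else -1

        caught = self.monster == self.agent and not self.has_shield
        if caught:
            reward -= 5.0                  # large penalty for being caught unshielded

        self.t += 1
        done = caught or self.t >= self.episode_len
        return tuple(self.agent), reward, done
```

With short training episodes and an aggressive monster, avoiding the catch penalty dominates the return, so "grab a shield first" is heavily reinforced; in a longer or monster-free deployment horizon only the apples matter, which is where the goal misgeneralization described above would show up.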