Phew, I almost missed this post.
This seems plausible if recalling the actual description of the data-cleaning process is more robust to regularization than just learning to avoid duplicates directly. In this toy example I think learning the right strategy directly is favored, but I can see how, if a document about the dataset makes a long series of statements that are useful to the LLM, it should eventually converge to believing statements it hasn't yet observed directly.
But whether this really happens seems to depend on how good LLMs are at calculating things at runtime, which I'm still personally fuzzy on. That's especially true if the runtime calculation effectively has to be spread out over multiple tokens, which is an important emergent property that still seems tricky to think about, because it really is emergent and not directly selected for.
Agreed on the first part. I'm not entirely clear on what you're referring to in the second paragraph, though. What calculation has to be spread out over multiple tokens? The matching to previously encountered K-1 sequences? I'd suspect that, in some sense, most LLM calculations have to work across multiple tokens, so I'm also not clear what this has to do with emergence.
The calculations I mean are something like:
So that's all pretty ambitious. I think spreading the work out over multiple tokens requires those intermediate tokens to have good in-text reasons to contain intermediate results (as in "think step by step").
Agree on points 3,4. Disagree on point 1. Unsure of point 2.
On the final two points: I think those capabilities are already in place in GPT3.5. Any capability/processing which seems necessary for general instruction following I'd expect to be in place by default. E.g. consider what processing is necessary for GPT3.5 to follow instructions on turning a tweet into a haiku.
On the first point, we should expect text which occurs repeatedly in the dataset to be compressed while preserving meaning. Text regarding the data-cleaning spec is no exception here.
Thanks for this concise and informative story.
A few questions from my side:
"There are two common mental models of how situational awareness emerges"
Do you have pointers to references here? I'm quite interested in situational awareness myself and would like to read up on the literature on how it emerges.
"the presence of duplicates can be taken as a signal distinguishing training from deployment"
I'm not sure I've completely got it. So what you are saying is: the described deduplication procedure will leave sequences of length K that differ in only a single token, but no exact duplicates of length-K sequences will remain in the training data. Because we do expect duplicates of length-K sequences to appear during deployment, the model can use this signal to distinguish training from deployment. Is that right?
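To make sure I'm picturing the right procedure, here's a minimal sketch of my reading, assuming exact matching over token windows of length K and a drop-the-later-copy policy (both are my guesses, not anything stated in the post):

```python
# Toy sketch of the deduplication setup as I understand it (my own construction,
# not the post's exact pipeline): exact-match dedup over token windows of length
# K, so no length-K sequence survives twice in the cleaned training data, while
# near-duplicates differing in a single token can survive.

from typing import List, Set, Tuple

K = 8  # illustrative window length

def dedup_corpus(docs: List[List[int]], k: int = K) -> List[List[int]]:
    """Keep a document only if it shares no length-k token window with kept data."""
    seen: Set[Tuple[int, ...]] = set()
    kept: List[List[int]] = []
    for doc in docs:
        windows = {tuple(doc[i:i + k]) for i in range(len(doc) - k + 1)}
        if windows & seen:
            continue  # exact length-k duplicate found: drop the later copy
        seen |= windows
        kept.append(doc)
    return kept
```

Under this reading, two documents whose length-K windows differ in even a single token both survive, which is what sets up the single-differing-token situation.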
PS: you mention *4b* in point 1) under Scenario, which I assume refers to your list of assumptions, but no such item exists there. Also, maybe use A, B, C or I, II, III for the assumptions, so that references to the assumptions aren't confused with references to the scenario steps.
Overview
There are two common mental models of how situational awareness emerges: (A) training for situational awareness via, e.g., dialogue RLHF; (B) instrumental reasoning on a base language modeling objective, i.e. advanced mesa-optimization. In this post, I will describe a plausible scenario under which type-B awareness occurs without any new forms of mesa-optimization beyond those likely already present in GPT3.5.
If you buy the plausibility of this scenario, there are a few likely implications:
The deduplication scenario
Assumptions:[3]
Scenario:
Implications and Conclusion
First, some observations on the above story: there is a spectrum from the local, semantics-driven retrieval described in step 4a to the global, semantics-agnostic retrieval occurring in step 5. The learned rule in step 5 is anti-inductive, and it entails an inconsistency/untruthfulness incentive in cases where matching substrings of length K-1 make distinct claims in token K.
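To make the anti-inductive rule concrete, here is a toy sketch (my own construction, not anything trained or measured in this post) of what a predictor exploiting the deduplication guarantee would do:

```python
# Toy illustration of the anti-inductive rule in step 5: if the current
# (K-1)-token context already appeared in the deduplicated training data, the
# token that followed it there cannot follow it again in training, so a
# loss-minimizing predictor is pushed to downweight exactly that token.

from typing import Dict, Set, Tuple
import numpy as np

def anti_inductive_adjust(
    base_probs: np.ndarray,                      # ordinary next-token distribution
    context: Tuple[int, ...],                    # the last K-1 tokens of the input
    seen_next: Dict[Tuple[int, ...], Set[int]],  # (K-1)-context -> next tokens seen in training
) -> np.ndarray:
    probs = base_probs.copy()
    for t in seen_next.get(context, set()):
        probs[t] = 0.0  # this continuation cannot recur in deduplicated training data
    total = probs.sum()
    return probs / total if total > 0 else base_probs
```

Written this way, the untruthfulness incentive is immediate: whatever claim token K made the first time around is exactly the continuation that gets suppressed when the same K-1 substring shows up again.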
The retrieval and instruction-following story of steps 4 and 4a seems identical to retrieval and instruction-following as they occur on any other (non-situationally-relevant) tokens. Assuming retrieval and instruction following are simple cases for ELK and interpretability, it seems that steps 4 and 4a, and plausibly also 5, are easy cases as well.
One elaboration of steps 1, 4, and 5 could be that the LM initially does semantics-guided retrieval, but the move to step 5 involves changing LM circuitry such that certain information is pulled up into activation space irrespective of an input sequence's semantics. The existence of such circuitry might make the learning of further situationally-aware strategies simpler. As a concrete example of the utility of this simple, deduplication-based kind of situational awareness to other strategies: the presence of duplicates can be taken as a signal distinguishing training from deployment. And using dataset-cleaning information to distinguish training from deployment time is not restricted to this toy setting.
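For illustration, here is one hypothetical form that signal could take (the window length and the exact-window check are stand-ins; a real model's cue could be far subtler): because exact-substring dedup guarantees no length-K window repeats anywhere in the training data, a repeated window inside the current context is evidence of being in deployment.

```python
# Hypothetical sketch of the duplicates-as-signal idea (illustrative only):
# the cleaned training data contains no repeated length-K window anywhere,
# so a repeat inside the current context suggests deployment rather than training.

from typing import List

def looks_like_deployment(tokens: List[int], k: int = 8) -> bool:
    seen = set()
    for i in range(len(tokens) - k + 1):
        window = tuple(tokens[i:i + k])
        if window in seen:
            return True  # a duplicate length-k window could not have survived dedup
        seen.add(window)
    return False
```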
Just to make things explicit, here are a few views I believe are likely to be incorrect. Note that this post is just a story and as such does not disprove any of these views. However, the above story hopefully highlights why these views should be handled with suspicion.
Open questions: The above story relies on some artificial assumptions, and a more realistic version of the story would be useful for informing research in this area. Useful experiments include:
- Testing how easy it is to learn such a situational strategy with toy datasets instantiating and explaining deduplication (or other data-cleaning) procedures.
- Given such a toy setup with a situationally-aware model, testing how hard it is to get the model to honestly report when its predictions are/aren't being affected by situational information.
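As a starting point for the first experiment, here is one entirely hypothetical scaffold (every name and constant below is my own choice, not something from this post): build a corpus with the dedup guarantee built in and pair it with a "data card" document describing that procedure, then compare models trained with and without the data card on whether they pick up the anti-inductive rule at matching (K-1)-contexts.

```python
# Hypothetical toy-experiment scaffold (all names, constants, and details are
# illustrative): construct a corpus with the deduplication guarantee built in,
# plus a natural-language "data card" describing that guarantee.

import random
from typing import List

K = 8
VOCAB_SIZE = 20  # small vocabulary so near-duplicate windows actually occur

def make_corpus(n_docs: int = 10_000, doc_len: int = 64, seed: int = 0) -> List[List[int]]:
    rng = random.Random(seed)
    docs: List[List[int]] = []
    seen = set()
    while len(docs) < n_docs:
        doc = [rng.randrange(VOCAB_SIZE) for _ in range(doc_len)]
        windows = {tuple(doc[i:i + K]) for i in range(doc_len - K + 1)}
        if windows & seen:
            continue  # enforce the no-repeated-length-K-window guarantee at construction
        seen |= windows
        docs.append(doc)
    return docs

DATA_CARD = (
    "Data card: this corpus was cleaned by exact-substring deduplication; "
    f"no token sequence of length {K} appears more than once."
)

# Sketch of the comparison: train one small LM on make_corpus() alone and one on
# make_corpus() plus many paraphrases of DATA_CARD, then probe both models at
# held-out (K-1)-contexts to see which learns the anti-inductive rule faster.
```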
Acknowledgements: Thanks to Alex Lyzhov, Julian Michael, and Sam Bowman for feedback on a draft of this post. Thanks to the rest of the NYU alignment research group for helpful discussion.
I.e. capable of gradient hacking, monitoring whether the current situation is training or testing, non-myopic user manipulation, etc.
How best to operationalize this claim is unclear. Experiments would be helpful here, e.g. train for one kind of situational awareness and then fit a scaling law to transfer to a second kind.
Note that I do not believe these assumptions will ever be precisely met. It is worth thinking about whether there is a similar, but more realistic, set of assumptions.
Thank you to Julian Michael for pushing me on this point.