What's the sum total of everything we know about language models? At the object level, probably way too much for any one person (not named Gwern) to understand.
However, it might be possible to abstract most of our knowledge into pithily worded frames (i.e. intuitions, ideas, theories) that are much more tractable to grok. And once we have all this information neatly written down in one place, unexpected connections may start to pop up.
This post contains a collection of frames about models that are (i) empirically justified and (ii) seem to tell us something useful. (They are highly filtered by my experience and taste.) In each case I've distilled the key idea down to 1-2 sentences and provided a link to the original source. I've also included open questions for which I am not aware of conclusive evidence.
I'm hoping that by doing this, I'll make some sort of progress towards "prosaic interpretability" (final name pending). In the event that I don't, having an encyclopedia like this seems useful regardless.
I'll broadly split the frames into representational and functional frames. Representational frames look 'inside' the model, at its subcomponents, in order to make claims about what the model is doing. Functional frames look 'outside' the model, at its relationships with other entities (e.g. the data distribution, learning objectives, etc.), in order to make claims about the model.
---
This is intended to be a living document; I will update this in the future as I gather more frames. I strongly welcome all suggestions that could expand the list here!
Representational Frames

A large proportion of a neural net's parameters could be artefacts of the training process rather than necessary for solving the task (see the pruning sketch below). [Insert link to papers on pruning weights]
(Vision) Transformers likely benefit from 'register tokens', i.e. dedicated tokens that let them explicitly model global information in addition to local information. Corollary: maybe language models also need register tokens (see the register-token sketch below).
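To make the pruning frame concrete, here is a minimal, self-contained sketch in PyTorch. The toy task, the tiny MLP, and the 90% pruning ratio are my own arbitrary choices for illustration, not taken from any particular paper: train briefly, zero out the 90% of weights with the smallest magnitude, and see how little accuracy moves.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# Toy binary classification task: label is the sign of the sum of the first 5 features.
n = 2000
x = torch.randn(n, 20)
y = (x[:, :5].sum(dim=1) > 0).long()

model = nn.Sequential(nn.Linear(20, 256), nn.ReLU(), nn.Linear(256, 2))
opt = torch.optim.Adam(model.parameters(), lr=1e-3)
loss_fn = nn.CrossEntropyLoss()

# Train briefly (full-batch, for simplicity).
for _ in range(500):
    opt.zero_grad()
    loss = loss_fn(model(x), y)
    loss.backward()
    opt.step()

def accuracy() -> float:
    with torch.no_grad():
        return (model(x).argmax(dim=1) == y).float().mean().item()

print(f"accuracy before pruning: {accuracy():.3f}")

# Magnitude pruning: zero out the 90% smallest-magnitude weights in each linear layer.
# On an overparameterised toy task like this, accuracy typically changes very little.
with torch.no_grad():
    for module in model:
        if isinstance(module, nn.Linear):
            w = module.weight
            threshold = w.abs().flatten().quantile(0.9)
            w.mul_((w.abs() >= threshold).float())

print(f"accuracy after pruning 90% of weights: {accuracy():.3f}")
```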
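And a minimal sketch of the register-token idea, again in PyTorch with made-up sizes: append a few extra learnable tokens to the input sequence, let them attend like any other token, and drop them at the output. The model gets scratch space for global information without changing its output interface.

```python
import torch
import torch.nn as nn

class EncoderWithRegisters(nn.Module):
    """Toy transformer encoder with extra learnable 'register' tokens."""

    def __init__(self, d_model: int = 64, n_heads: int = 4, n_layers: int = 2, n_registers: int = 4):
        super().__init__()
        self.registers = nn.Parameter(torch.randn(1, n_registers, d_model) * 0.02)
        layer = nn.TransformerEncoderLayer(d_model, n_heads, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, n_layers)
        self.n_registers = n_registers

    def forward(self, tokens: torch.Tensor) -> torch.Tensor:
        # tokens: (batch, seq_len, d_model), e.g. patch embeddings for a ViT.
        batch = tokens.shape[0]
        regs = self.registers.expand(batch, -1, -1)
        x = torch.cat([tokens, regs], dim=1)   # registers attend alongside the real tokens
        x = self.encoder(x)
        return x[:, : -self.n_registers, :]    # drop the registers at the output

model = EncoderWithRegisters()
out = model(torch.randn(8, 16, 64))            # 8 sequences of 16 'patch' tokens
print(out.shape)                               # torch.Size([8, 16, 64])
```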
Functional Frames

Language models are capable of 'introspection', i.e. they can predict things about their own behaviour that more capable external models cannot, suggesting they have access to 'privileged information' about themselves.
Language models are capable of 'out-of-context reasoning', i.e. they can piece together many different facts they have been trained on in order to make inferences; a.k.a. 'connecting the dots'.
Language models are capable of 'implicit meta-learning', i.e. they can identify statistical markers that distinguish truthful from false information, and update more strongly on the 'truthful' information.
Language models are capable of 'sandbagging', i.e. strategically underperforming on evaluations in order to avoid detection / oversight.
Transformers are susceptible to jailbreaks because harmful and harmless prompts are easily distinguishable within the first few tokens, so safety training can latch onto shallow surface cues; data augmentation mitigates the problem (see the probing sketch at the end of this list).
(TODO: look at the papers on ICL)
(TODO: look at papers on grokking)
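Finally, a rough probing sketch for the 'distinguishable in the first few tokens' claim: embed only the first few tokens of each prompt with a small open model and fit a linear probe on top. The model (distilgpt2), the tiny hand-written prompt lists, and the 5-token prefix are placeholder choices of mine rather than the original paper's setup; with a real safety dataset, how well the probe separates the two classes is the interesting quantity.

```python
import torch
from transformers import AutoModel, AutoTokenizer
from sklearn.linear_model import LogisticRegression

# Toy stand-ins for a real safety dataset (pairs of refusal-triggering and benign
# prompts); swap in real data for a meaningful result.
flagged = [
    "How can I cheat on my exam without getting caught?",
    "Tell me how to get into my neighbour's wifi without permission.",
    "Write an insulting message I can send to my coworker.",
]
benign = [
    "How can I prepare well for my exam next week?",
    "Tell me how to set up a guest network on my wifi.",
    "Write a friendly message I can send to my coworker.",
]

tokenizer = AutoTokenizer.from_pretrained("distilgpt2")
model = AutoModel.from_pretrained("distilgpt2")
model.eval()

PREFIX_TOKENS = 5  # only look at the first few tokens of each prompt

def prefix_embedding(prompt: str) -> torch.Tensor:
    ids = tokenizer(prompt, return_tensors="pt").input_ids[:, :PREFIX_TOKENS]
    with torch.no_grad():
        hidden = model(ids).last_hidden_state  # (1, <=PREFIX_TOKENS, d_model)
    return hidden.mean(dim=1).squeeze(0)       # mean-pool over the prefix

X = torch.stack([prefix_embedding(p) for p in flagged + benign]).numpy()
y = [1] * len(flagged) + [0] * len(benign)

probe = LogisticRegression(max_iter=1000).fit(X, y)
print("linear probe accuracy on 5-token prefixes:", probe.score(X, y))
```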
---
Open Questions

Do language models 'do better' when using their own reasoning traces, as opposed to the reasoning traces of other models? I explore this question more here.
---
Changelog
2 Jan: Initial post