What's the sum total of everything we know about language models? At the object level, probably way too much for any one person (not named Gwern) to understand. 

However, it might be possible to abstract most of our knowledge into pithily-worded frames (i.e. intuitions, ideas, theories) that are much more tractable to grok. And once we have all this information neatly written down in one place, unexpected connections may start to pop up. 

This post contains a collection of frames about models that are (i) empirically justified and (ii) seem to tell us something useful. (They are highly filtered by my experience and taste.) In each case I've distilled the key idea down to 1-2 sentences and provided a link to the original source. I've also included open questions for which I am not aware of conclusive evidence. 

I'm hoping that by doing this, I'll make some sort of progress towards "prosaic interpretability" (final name pending). In the event that I don't, having an encyclopedia like this seems useful regardless. 

I'll broadly split the frames into representational and functional ones. Representational frames look 'inside' the model, at its subcomponents, in order to make claims about what the model is doing. Functional frames look 'outside' the model, at its relationships with other entities (e.g. the data distribution, learning objectives, etc.) in order to make claims about the model. 
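To make the distinction concrete, here is a minimal sketch; the choice of GPT-2, the Hugging Face transformers library, and the prompt are my own illustrative assumptions, not claims from this post. A representational analysis touches the model's internal activations, while a functional analysis only touches its input-output behaviour.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")
inputs = tokenizer("The capital of France is", return_tensors="pt")

with torch.no_grad():
    out = model(**inputs, output_hidden_states=True)

# Representational frame: look 'inside' at the model's subcomponents,
# e.g. the residual-stream activations after each layer.
hidden_states = out.hidden_states  # tuple of (num_layers + 1) tensors, each [1, seq_len, d_model]
print(len(hidden_states), "activation snapshots of shape", tuple(hidden_states[-1].shape))

# Functional frame: look 'outside' at the model's input-output behaviour,
# e.g. which token it predicts next for this prompt.
next_token_id = out.logits[0, -1].argmax().item()
print("Predicted next token:", tokenizer.decode(next_token_id))
```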

---

This is intended to be a living document; I will update this in the future as I gather more frames. I strongly welcome all suggestions that could expand the list here! 

Representational Frames

---

  • (TODO: think of some open questions which would directly indicate good frames) 

Functional Frames


---

  • Do language models 'do better' when using their own reasoning traces, as opposed to the reasoning traces of other models? I explore this question further here; a sketch of one possible experimental setup is below. 
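For concreteness, here is a hypothetical sketch of one way this comparison could be run: hold the answering model fixed and swap in reasoning traces from a different model. The helpers `generate_trace` and `answer_given_trace` are placeholder names standing in for whatever inference stack you use; they are not from this post or any existing API.

```python
from typing import Callable

def self_vs_other_trace_accuracy(
    model_a: Callable, model_b: Callable,
    questions: list[str], gold: list[str],
    generate_trace: Callable, answer_given_trace: Callable,
) -> tuple[float, float]:
    """Compare model A's accuracy when answering conditioned on its own
    reasoning traces versus on traces produced by model B."""
    own_correct, other_correct = 0, 0
    for q, y in zip(questions, gold):
        own_trace = generate_trace(model_a, q)    # A's own chain of thought
        other_trace = generate_trace(model_b, q)  # B's chain of thought for the same question
        own_correct += answer_given_trace(model_a, q, own_trace) == y
        other_correct += answer_given_trace(model_a, q, other_trace) == y
    n = len(questions)
    return own_correct / n, other_correct / n
```

If models really do 'do better' on their own traces, the first returned accuracy should consistently exceed the second across tasks and model pairs.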

Changelog

2 Jan: Initial post
