Hi LessWrong! This is my first post here, sharing my first piece of mechanistic interpretability work.

I studied in-context learning in Llama2. The idea was to look at what happens when we associate two concepts in the LLM's context, an object (e.g. "red square") and a label (e.g. "Bob"): how is that information transmitted through the model?
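To make the setup concrete, here is a minimal sketch of the kind of toy prompt I mean, written against the Hugging Face transformers API. The checkpoint name, the prompt wording, and the idea of eyeballing last-position attention are illustrative assumptions, not the actual experiment from this post.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_NAME = "meta-llama/Llama-2-7b-hf"  # assumed checkpoint
tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
# "eager" attention so the model can return per-head attention weights
model = AutoModelForCausalLM.from_pretrained(MODEL_NAME, attn_implementation="eager")
model.eval()

# A toy context that associates objects with labels, then queries one association.
prompt = (
    "Bob has a red square. Alice has a blue circle. Carol has a green triangle.\n"
    "Q: Who has the red square?\nA:"
)
inputs = tokenizer(prompt, return_tensors="pt")

with torch.no_grad():
    out = model(**inputs, output_attentions=True)

# out.attentions: one (batch, heads, query_pos, key_pos) tensor per layer.
# Where the final position attends in late layers hints at where the
# object-label association is being read from.
last_layer_attn = out.attentions[-1][0]      # (heads, query_pos, key_pos)
final_pos_attn = last_layer_attn[:, -1, :]   # attention from the last token
top_keys = final_pos_attn.mean(0).topk(5).indices
print(tokenizer.convert_ids_to_tokens(inputs["input_ids"][0][top_keys].tolist()))
```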

In this toy example, I found several interesting things:

  • information about the association is passed by reference, not by value: what gets passed forward is a pointer saying "this information is here", and the information itself is loaded later (a sketch of the kind of patching experiment that would probe this follows the list)
  • the reference is not a token position but rather the semantic location of the information in question (i.e. "the third item in the list"). I suspect later heads load a "cloud" of data around that location, and that this is mediated by punctuation or other markers of structure (which may also be part of why frontier models are so sensitive to prompt structure)
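To illustrate what "passed by reference" would look like experimentally, here is a hedged activation-patching sketch, again against the Hugging Face transformers API. The prompts, the layer index, the patched position, and the hook mechanics are all assumptions chosen for illustration; they are not the post's actual setup.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_NAME = "meta-llama/Llama-2-7b-hf"  # assumed checkpoint
tok = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForCausalLM.from_pretrained(MODEL_NAME)
model.eval()

# Two prompts that differ only in which label owns which object.
clean_prompt = "Bob has a red square. Alice has a blue circle. Q: Who has the red square? A:"
corrupt_prompt = "Bob has a blue circle. Alice has a red square. Q: Who has the red square? A:"
clean_ids = tok(clean_prompt, return_tensors="pt")["input_ids"]
corrupt_ids = tok(corrupt_prompt, return_tensors="pt")["input_ids"]
assert clean_ids.shape == corrupt_ids.shape  # same length, so positions line up

LAYER, POSITION = 16, 4  # illustrative: a middle layer, a position inside the first list item

def hidden_of(output):
    # Decoder layers return a tuple in most transformers versions; tolerate both forms.
    return output[0] if isinstance(output, tuple) else output

# 1) Cache the clean run's residual stream at one (layer, position).
cache = {}
def save_hook(module, args, output):
    cache["resid"] = hidden_of(output)[:, POSITION, :].detach().clone()

handle = model.model.layers[LAYER].register_forward_hook(save_hook)
with torch.no_grad():
    model(clean_ids)
handle.remove()

# 2) Rerun on the corrupted prompt, patching that single activation back in.
def patch_hook(module, args, output):
    hidden = hidden_of(output).clone()
    hidden[:, POSITION, :] = cache["resid"]
    return (hidden,) + tuple(output[1:]) if isinstance(output, tuple) else hidden

handle = model.model.layers[LAYER].register_forward_hook(patch_hook)
with torch.no_grad():
    logits = model(corrupt_ids).logits[0, -1]
handle.remove()

# If restoring one early position is enough to pull the answer back toward "Bob",
# that is evidence the association is carried as a reference to that location,
# rather than being copied by value into every downstream position.
print(tok.decode(logits.argmax().item()))
```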