
I recently came across Backpack Language Models and wanted to share it in case any AI interpretability people have not seen it. (I have yet to see this posted on LessWrong.)

The main difference between a backpack model and a standard LLM is that the backpack enforces a much stricter mapping from input embeddings to output logits. Most LLMs allow the output logits to be an arbitrary function of the input embeddings; a backpack model requires the output logits to be a linear transformation of a linear combination of the input embeddings. The weights for this linear combination are computed by a transformer.
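
To make the contrast concrete, here is a minimal numpy sketch of a backpack-style output head. This is an illustration of the constraint, not the paper's exact architecture: in a real backpack model the weights `alpha` are produced by a transformer over the input sequence, and all variable names here are my own.

```python
import numpy as np

rng = np.random.default_rng(0)

vocab_size, d_model, seq_len = 100, 16, 5

# Shared input/output embedding matrix and a toy input sequence.
E = rng.normal(size=(vocab_size, d_model))
tokens = rng.integers(0, vocab_size, size=seq_len)
x = E[tokens]  # input embeddings, shape (seq_len, d_model)

# Stand-in contextualization weights; in the real model a transformer
# computes these from the input sequence.
alpha = rng.random(seq_len)
alpha /= alpha.sum()

h = alpha @ x      # linear combination of the input embeddings
logits = E @ h     # fixed linear map (unembedding) to vocabulary logits
```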

The nice thing about backpack models is that they are somewhat easier to interpret, edit, and control: the output logits are a linear function of the input embeddings, so you can directly observe how changing an embedding changes the outputs.
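
Continuing the sketch above, this linearity means each input token's contribution to the logits decomposes additively, which is exactly the kind of direct attribution an arbitrary nonlinear head does not give you:

```python
# Each token's contribution to the logits is alpha_i * E @ x_i,
# and these contributions sum exactly to the final logits.
per_token_logits = (E @ x.T) * alpha  # shape (vocab_size, seq_len)
assert np.allclose(per_token_logits.sum(axis=1), logits)

# Likewise, editing token i's embedding by delta shifts the logits by
# exactly alpha[i] * (E @ delta), so interventions are predictable.
```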