Epistemic status: Strongly arguing for what I feel is a neglected approach. May somewhat overstate the case and fail to adequately steelman counterarguments. I hope and expect that readers will point out flaws in my logic.
Introduction
Currently, large transformers with dense floating-point weights are the state of the art in language modeling. Despite recent progress by Anthropic and others, they remain difficult to understand.
Why are they hard to understand?
- Lack of native ordering structure: setting aside attention masking, transformers have no native concept of token ordering. They likewise have no well-defined state between tokens. Even with positional encodings, transformers are not truly sequence models, yet text is inherently sequential.
- Large-magnitude, continuous weights: although