All of Gavin Uberti's Comments + Replies

A few answers to the open questions you gave:

3. Sparse attention patterns do not affect the number of parameters. This kind of sparsity is designed to make previous KV values easier to load during decoding, not to reduce the space the model takes up.
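A minimal sketch of that distinction, assuming a simple sliding-window pattern and made-up sizes (plain NumPy, not any particular model's code): the projection weights are identical whether attention is dense or sparse, so the parameter count is unchanged; sparsity only shrinks how many cached KV entries are read at each decode step.

```python
import numpy as np

d_model, window = 64, 4                      # illustrative sizes, not from the post
Wq = np.random.randn(d_model, d_model)       # projection weights: the same in both cases,
Wk = np.random.randn(d_model, d_model)       # so the parameter count does not change
Wv = np.random.randn(d_model, d_model)

def decode_step(x_t, kv_cache, sparse=False):
    """One decoding step; kv_cache holds (K, V) for all previous tokens."""
    q = x_t @ Wq
    K, V = kv_cache
    if sparse:
        # Sliding-window sparsity: only the last `window` cached KV entries are loaded.
        K, V = K[-window:], V[-window:]
    scores = q @ K.T / np.sqrt(d_model)
    weights = np.exp(scores - scores.max())
    weights /= weights.sum()
    return weights @ V

past = np.random.randn(16, d_model)          # embeddings of 16 previous tokens
kv_cache = (past @ Wk, past @ Wv)            # KV cache built during earlier steps
x_t = np.random.randn(d_model)               # current token embedding
out_dense  = decode_step(x_t, kv_cache, sparse=False)   # reads all 16 cached entries
out_sparse = decode_step(x_t, kv_cache, sparse=True)    # reads only the last 4
```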

4. The only difference between encoder and decoder transformers is the attention mask. In an encoder, every token can attend to every other token, including later ones (acausal, bidirectional attention), while in a decoder, a token can only attend to itself and earlier tokens (causal attention). The term "decoder" is used because decoders can be used to generate text, wh...
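A rough sketch of that difference, with illustrative sizes and plain NumPy (not any specific library's API): the weights and the attention computation are identical; only the mask changes.

```python
import numpy as np

T, d = 5, 8                                   # toy sequence length and head dimension
Q = np.random.randn(T, d)
K = np.random.randn(T, d)
V = np.random.randn(T, d)

# Encoder: no masking, every token attends to every token (acausal / bidirectional).
encoder_mask = np.zeros((T, T))
# Decoder: entries above the diagonal are blocked, so no token attends to a future token.
decoder_mask = np.triu(np.full((T, T), -np.inf), k=1)

def attention(Q, K, V, mask):
    scores = Q @ K.T / np.sqrt(d) + mask
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ V

enc_out = attention(Q, K, V, encoder_mask)    # encoder-style (bidirectional)
dec_out = attention(Q, K, V, decoder_mask)    # decoder-style (causal)
```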

This was very helpful to me. Thank you.