All of Gavin Uberti's Comments + Replies

A few answers to the open questions you gave:

3. Sparse attention patterns do not affect the number of parameters. This kind of sparsity is designed to make previous KV values easier to load during decoding, not to reduce the space the model takes up.
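A minimal sketch of that distinction, assuming a simple sliding-window pattern and made-up sizes (plain NumPy, not any particular model's code): the projection weights are identical whether attention is dense or sparse, so the parameter count is unchanged; sparsity only shrinks how many cached KV entries are read at each decode step.

```python
import numpy as np

d_model, window = 64, 4                      # illustrative sizes, not from the post
Wq = np.random.randn(d_model, d_model)       # projection weights: the same in both cases,
Wk = np.random.randn(d_model, d_model)       # so the parameter count does not change
Wv = np.random.randn(d_model, d_model)

def decode_step(x_t, kv_cache, sparse=False):
    """One decoding step; kv_cache holds (K, V) for all previous tokens."""
    q = x_t @ Wq
    K, V = kv_cache
    if sparse:
        # Sliding-window sparsity: only the last `window` cached KV entries are loaded.
        K, V = K[-window:], V[-window:]
    scores = q @ K.T / np.sqrt(d_model)
    weights = np.exp(scores - scores.max())
    weights /= weights.sum()
    return weights @ V

past = np.random.randn(16, d_model)          # embeddings of 16 previous tokens
kv_cache = (past @ Wk, past @ Wv)            # KV cache built during earlier steps
x_t = np.random.randn(d_model)               # current token embedding
out_dense  = decode_step(x_t, kv_cache, sparse=False)   # reads all 16 cached entries
out_sparse = decode_step(x_t, kv_cache, sparse=True)    # reads only the last 4
```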

4. The only difference between encoder and decoder transformers is the attention mask. In an encoder, every token can attend to every other token, including later ones (acausal, bidirectional attention), while in a decoder, a token can only attend to itself and earlier tokens (causal attention). The term "decoder" is used because decoders can be used to generate text, wh...
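A rough sketch of that difference, with illustrative sizes and plain NumPy (not any specific library's API): the weights and the attention computation are identical; only the mask changes.

```python
import numpy as np

T, d = 5, 8                                   # toy sequence length and head dimension
Q = np.random.randn(T, d)
K = np.random.randn(T, d)
V = np.random.randn(T, d)

# Encoder: no masking, every token attends to every token (acausal / bidirectional).
encoder_mask = np.zeros((T, T))
# Decoder: entries above the diagonal are blocked, so no token attends to a future token.
decoder_mask = np.triu(np.full((T, T), -np.inf), k=1)

def attention(Q, K, V, mask):
    scores = Q @ K.T / np.sqrt(d) + mask
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ V

enc_out = attention(Q, K, V, encoder_mask)    # encoder-style (bidirectional)
dec_out = attention(Q, K, V, decoder_mask)    # decoder-style (causal)
```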

This was very helpful to me. Thank you.