Besides, to your point on the FFN: I don't think the FFN is more important than attention. Of course parameter count matters, but it isn't the only thing that matters; you also need to consider the compute cost of each module. If my understanding is right, the FFN's cost per token is O(1) with respect to context length, while the encoder's self-attention costs O(n_ctx^2) over the whole sequence. The decoder's masked self-attention is a bit more complicated: it has to run once per generated token to produce the full output sequence, so the total cost depends on both the input length and the output length. For simplicity, each generated token attends to everything to its left, which costs roughly O(n). That means not all parameters are equally important: if you do a compute-cost-weighted comparison, the attention parameters end up mattering far more than the FFN's.
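To make that concrete, here is a rough per-layer FLOP tally (a minimal sketch; the 4*d_model FFN width and the constant factors are assumptions based on the standard Transformer layer, not something from your post). The FFN term grows linearly with sequence length, while the attention-score term grows quadratically:

```python
def per_layer_flops(n_ctx, d_model, d_ff=None):
    """Rough forward-pass FLOP estimate for one Transformer layer.

    Assumes the standard block (multi-head attention + 2-layer FFN with
    hidden width 4*d_model); constants are approximate.
    """
    d_ff = d_ff or 4 * d_model

    # Attention: Q/K/V/output projections are linear in n_ctx ...
    attn_proj = 4 * 2 * n_ctx * d_model * d_model
    # ... but the QK^T scores and the scores@V product are quadratic in n_ctx.
    attn_scores = 2 * 2 * n_ctx * n_ctx * d_model

    # FFN: two matmuls (d_model -> d_ff -> d_model), linear in n_ctx.
    ffn = 2 * 2 * n_ctx * d_model * d_ff

    return attn_proj + attn_scores, ffn


for n in (512, 2048, 8192, 32768):
    attn, ffn = per_layer_flops(n, d_model=1024)
    print(f"n_ctx={n:6d}  attention/FFN FLOP ratio: {attn / ffn:.2f}")
```

The attention-to-FFN ratio keeps growing with n_ctx, which is exactly the per-token O(n) attention cost showing up.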
This could also explain why there is a limit on n_ctx.
This also gives you a hint that, even though the Encoder and Decoder are structurally similar, they are doing different things. And GPT IS a decoder-only architecture.
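The structural difference is essentially one mask (a minimal sketch, assuming standard scaled dot-product attention; the function and variable names here are mine, not from your post): the encoder lets every token attend to every other token, while the decoder-only stack used by GPT masks out future positions and generates one token at a time.

```python
import numpy as np

def attention(q, k, v, causal=False):
    """Scaled dot-product attention; q, k, v have shape (n_ctx, d_head)."""
    scores = q @ k.T / np.sqrt(q.shape[-1])          # (n_ctx, n_ctx)
    if causal:
        # Decoder-style mask: position i may only look at positions <= i.
        mask = np.triu(np.ones_like(scores, dtype=bool), k=1)
        scores = np.where(mask, -np.inf, scores)
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ v

n_ctx, d_head = 8, 16
q = k = v = np.random.randn(n_ctx, d_head)

enc_out = attention(q, k, v, causal=False)  # encoder: full bidirectional attention
dec_out = attention(q, k, v, causal=True)   # decoder (GPT): causal attention only
```

Same projections, same matmuls; the causal mask (plus the token-by-token generation loop) is what gives the decoder its different cost profile and its different job.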
I love your work.