I love your work.
Besides, to your point on FFN:
I don't think FFN is more important than Attention.
Of course the parameter count is important, but it is not the only important thing.
You also need to consider the compute cost of each module. If my understanding is right, FFN costs O(1) per token (so O(n_ctx) overall), the Encoder's Self-Attention costs O(n_ctx^2), and the Decoder's Masked Self-Attention costs... well, the Decoder is a more complicated situation: it has to run multiple times to produce the full output sequence, so its total cost depends on both the input length and the output length. For s...
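Just to make the scaling concrete, here is a rough per-layer FLOP sketch. The sizes d_model=512 and d_ff=2048 are example values I picked for illustration, not numbers from your post, and this only counts the big matmuls in a forward pass:

```python
# Rough forward-pass FLOP estimates for one Transformer layer.
# d_model and d_ff are hypothetical example sizes, not from the original post.

def ffn_flops(n_ctx: int, d_model: int = 512, d_ff: int = 2048) -> int:
    # Two matmuls per token (d_model -> d_ff, then d_ff -> d_model).
    # Constant work per token, so the total is linear in n_ctx.
    return n_ctx * 2 * (d_model * d_ff + d_ff * d_model)

def self_attention_flops(n_ctx: int, d_model: int = 512) -> int:
    # Q/K/V/output projections are linear in n_ctx; the QK^T scores and the
    # attention-weighted sum of values are quadratic in n_ctx.
    projections = n_ctx * 2 * 4 * d_model * d_model
    attention = 2 * 2 * n_ctx * n_ctx * d_model
    return projections + attention

if __name__ == "__main__":
    for n in (128, 1024, 8192):
        print(f"n_ctx={n}: FFN~{ffn_flops(n):.3e}, Attention~{self_attention_flops(n):.3e}")
```

With these example sizes the FFN matmuls dominate at short contexts, but the n_ctx^2 term in attention takes over once the context gets long enough, which is the distinction I mean by O(1) per token vs O(n_ctx^2).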