Thanks for the nice tutorial.
I have a problem understanding your code (I am new to PyTorch). When you calculate the attention activations:
def forward(self, *args, **kwargs):
    output = self.attn(*args, **kwargs)
    if self.add_tensor is not None:
        output = (output[0] + self.add_tensor,) + output[1:]
    self.activations = output[0]
    return output
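To make the control flow of this wrapper concrete, here is a minimal, self-contained sketch of the same pattern. It uses a plain-Python dummy in place of the real PyTorch attention module (DummyAttn and the doubling "attention" are hypothetical stand-ins, not the tutorial's actual code):

```python
class DummyAttn:
    """Hypothetical stand-in for a real attention module: returns a tuple
    whose first element is the attention output, like Hugging Face layers do."""
    def __call__(self, hidden_states):
        # pretend-attention: just scale the input so results are easy to check
        return ([2 * h for h in hidden_states], "attn_weights_placeholder")

class AttnWrapper:
    """Wraps an attention module, optionally adds a tensor to its output,
    and caches the (possibly modified) activations."""
    def __init__(self, attn):
        self.attn = attn
        self.add_tensor = None   # optional addend for steering the output
        self.activations = None  # cached attention output

    def forward(self, *args, **kwargs):
        output = self.attn(*args, **kwargs)
        if self.add_tensor is not None:
            # with real tensors this would be output[0] + self.add_tensor
            output = ([o + a for o, a in zip(output[0], self.add_tensor)],) + output[1:]
        self.activations = output[0]
        return output

wrapper = AttnWrapper(DummyAttn())
out = wrapper.forward([1.0, 2.0])
print(out[0])                # [2.0, 4.0]
print(wrapper.activations)   # [2.0, 4.0]
```

Note that `self.activations` stores the output *after* `add_tensor` is applied, so reading it back gives the steered activations, not the raw ones.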
What is the argument that is passed to the self.attn function?
I tried passing the following inputs:
Neither of these reproduces your results. Can you clarify this?
Thanks, Nina, for sharing the Hugging Face forward pass. I now realize I was skipping the input layer norm calculation. Now I can reproduce your numbers :)
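For anyone else hitting the same discrepancy: in pre-norm transformer blocks (the GPT-2-style layout used by many Hugging Face models), the input layer norm runs *before* attention, so feeding raw hidden states straight into the attention module gives different numbers. A toy sketch of that ordering, with a doubling function as a hypothetical stand-in for the real attention:

```python
import math

def layer_norm(x, eps=1e-5):
    # plain-Python layer norm over a single vector
    mean = sum(x) / len(x)
    var = sum((v - mean) ** 2 for v in x) / len(x)
    return [(v - mean) / math.sqrt(var + eps) for v in x]

def fake_attn(x):
    return [2 * v for v in x]  # stand-in for the real attention module

def block_forward(x):
    # pre-norm block: x + attn(ln_1(x))
    a = fake_attn(layer_norm(x))
    return [xi + ai for xi, ai in zip(x, a)]

x = [1.0, 2.0, 3.0]
with_ln = block_forward(x)                              # normed before attention
without_ln = [xi + ai for xi, ai in zip(x, fake_attn(x))]  # layer norm skipped
print(with_ln != without_ln)  # True: skipping ln changes the numbers
```

Skipping the norm here yields `[3.0, 6.0, 9.0]` instead of the normed result, which mirrors why the earlier attempts could not match the tutorial's activations.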