yihe20

Thank you so much for the post! I'm starting to get a sense of induction heads.

Probably an unrelated question: can a single attention head store multiple pieces of orthogonal information? For example, in this post, the layer-0 head may store the information "I follow 'D'". Can it also store information like "I am a noun"?

Or, to put it another way, should an attention head have a single, dedicated function?
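
For intuition, here is a minimal numpy sketch (not from the post; the dimensions, feature names, and direction vectors are all hypothetical) of how two orthogonal feature directions could coexist in one head's output and still be read out independently downstream:

```python
import numpy as np

rng = np.random.default_rng(0)
d_model = 64  # hypothetical residual-stream width

# Two hypothetical feature directions: "I follow 'D'" and "I am a noun".
follows_d = rng.normal(size=d_model)
follows_d /= np.linalg.norm(follows_d)

is_noun = rng.normal(size=d_model)
is_noun -= (is_noun @ follows_d) * follows_d  # Gram-Schmidt: orthogonalize
is_noun /= np.linalg.norm(is_noun)

# A single head's output at one position can be a superposition of both.
head_output = follows_d + is_noun

# A downstream reader that projects onto one direction recovers that feature;
# the orthogonal feature contributes (near) zero to the dot product.
print(head_output @ follows_d)  # ~1.0 -> "I follow 'D'" is recoverable
print(head_output @ is_noun)    # ~1.0 -> "I am a noun" is recoverable
```

If something like this holds, a head would not need a single dedicated function: it could write several features at once, as long as later components read from different subspaces.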