Thank you so much for the post! I'm starting to get a sense of induction heads.
Probably an unrelated question - can a single attention head store multiple pieces of orthogonal information? For example, in this post, the layer-0 head may store the information "I follow 'D'". Can it also store information like "I am a noun"?
Or, to put it another way, should an attention head have a single, dedicated functionality?
Sorry I didn't get to this message earlier - glad you liked the post though! The answer is that attention heads can have multiple different functions. The simplest way is to store the pieces of information entirely orthogonally, so they lie in fully independent subspaces, but even that isn't necessary: transformers seem to take advantage of superposition to represent more concepts at once than they have dimensions.
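
A toy sketch of the geometric intuition (not anything from the post, and the sizes are made up): in a d-dimensional residual stream, random unit vectors overlap by only about 1/sqrt(d) on average, so a head can write several nearly independent pieces of information, and superposition can pack in even more feature directions than there are dimensions.

```python
import numpy as np

# Hypothetical sizes: a 64-dim subspace holding 512 candidate feature directions.
rng = np.random.default_rng(0)
d, n_features = 64, 512
features = rng.normal(size=(n_features, d))
features /= np.linalg.norm(features, axis=1, keepdims=True)

# Pairwise cosine similarities between distinct features: typically ~1/sqrt(d),
# so writing one feature only weakly interferes with reading out another,
# even though n_features >> d (more concepts than dimensions).
cos = features @ features.T
off_diag = np.abs(cos[~np.eye(n_features, dtype=bool)])
print(f"mean |cos| between distinct features: {off_diag.mean():.3f}")
print(f"max  |cos| between distinct features: {off_diag.max():.3f}")
```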