Comment Permalink

Answer by mishkaMar 21, 202520

yeah... not trying for a complete analysis here, but one thing which is missing is the all-important residual stream. It has been rather downplayed in the original "Attention is all you need" paper, and has been greatly emphasized in https://transformer-circuits.pub/2021/framework/index.html

but I have to admit that I've only started to feel that I more-or-less understand principal aspects of Transformer architecture after I've spent some quality time with the pedagogical implementation of GPT-2 by Andrej Karpathy, https://github.com/karpathy/minGPT, specifically with the https://github.com/karpathy/minGPT/blob/master/mingpt/model.py file. When I don't understand something in a text, looking at a nice relatively simple-minded implementation allows me to see what exactly is going on

(People have also published some visualizations, some "illustrated Transformers", and those are closer to the style of your sketches, but I don't know which of them are good and which might be misleading. And, yes, at the end of the day, it takes time to get used to Transformers, one understands them gradually.)

Reply

3Kallistos6d

Many thanks! I have residuals going from before to after the MHA block, and from before to after MLP+FFNN. Should I have others in other places? edit: Those links are super!

2mishka6d

Ah, it's mostly your first figure which is counter-intuitive (when one looks at it, one gets the intuition of f(g(h... (x))), so it de-emphasizes the fact that each of these Transformer Block transformations is shaped like x=x+function(x))

Kallistos6d10

Oh, yes you're right...hmmm. Unsure what to do about that, I'd love a neat graphical solution. Wasn't quite sure how to represent repeating units without just duplicating subdiagrams!

Reply

See in context

3

[ Question ]

Any mistakes in my understanding of Transformers?

by Kallistos

21st Mar 2025

2 min read

A

3 7

3

I think there's this problem that occurs often when explaining something technical. Let's say you wanted to explain how a computer works. You could pick a specific computer, e.g. a Macbook Pro Retina. This would hang together nicely as an item, you could explain, in principle, why each specific bit of it was there and was designed the way it was. Of course, this would be insanely complicated for a beginner. They'd come away without understanding (in part because they can't triangulate from other computer architectures to what you've taught them about this one), nor intuition (because they can't leverage what you've taught them about this one to others).

You could instead come up with a much simplified toy model. It has all the bits computers standardly have (maybe even bits which are mutually exclusive in actual computers, but which are both important to know about). Here you couldn't even in principle explain why each given bit is there. For some components, there would truly be no reason to have both. Still, this model is much easier to understand, and can actually develop useful intuitions and explain key concepts.

I feel a similar difficulty in trying to understand how transformers work. I could pick a random recent ML paper and try and understand that architecture, or I could read about the toy models that people use to explain them online. What I would love is if there was something that was simple-ish, while also being a kind of average example of a transformer. Only then would I feel I'd actually understood what needs understanding.

What follows is my best understanding of that kind of "average" transformer (at least for text-continuation). Very interested to hear about any innaccuracies or improvements!

Legend:

Data Flow: Dark Gray
Simple Components: Lilac
Encapsulated Components: Purple
Processes: Italics

Diagram 1 represents the whole transformer model, abstracting from the internal details of the transformer block.

Diagram 2 represents a transformer block, abstracting from the internal details of one attention head.

Diagram 3 represents an attention head.

(A note about the kludgy dimension notation I'm using: (x,y) is how I’m representing a matrix with x rows and y columns/y values per row. Sometimes, when I want to consider some number of matrices of dimensions (x,y) I say something like “n x (x,y)” rather than (n,x,y). This is because the operations in question are only being applied to each (x,y) on its own, not all the (x,y)’s collectively.)

Diagram 1: A Transformer

Diagram 2: A Transformer Block

Diagram 3: An Attention Head

World ModelingAI

Frontpage

3

New Answer

New Comment

3 Answers sorted by
top scoring

Raemon

Mar 22, 2025

20

Pedagogic feedback: each diagram is much longer than a page, it's harder to fit the whole thing in my head at once.

mishka

Mar 21, 2025

20

yeah... not trying for a complete analysis here, but one thing which is missing is the all-important residual stream. It has been rather downplayed in the original "Attention is all you need" paper, and has been greatly emphasized in https://transformer-circuits.pub/2021/framework/index.html

but I have to admit that I've only started to feel that I more-or-less understand principal aspects of Transformer architecture after I've spent some quality time with the pedagogical implementation of GPT-2 by Andrej Karpathy, https://github.com/karpathy/minGPT, specifically with the https://github.com/karpathy/minGPT/blob/master/mingpt/model.py file. When I don't understand something in a text, looking at a nice relatively simple-minded implementation allows me to see what exactly is going on

(People have also published some visualizations, some "illustrated Transformers", and those are closer to the style of your sketches, but I don't know which of them are good and which might be misleading. And, yes, at the end of the day, it takes time to get used to Transformers, one understands them gradually.)

[-]Kallistos6d30

Many thanks! I have residuals going from before to after the MHA block, and from before to after MLP+FFNN. Should I have others in other places?

edit: Those links are super!

Reply

2mishka6d

Ah, it's mostly your first figure which is counter-intuitive (when one looks at it, one gets the intuition of f(g(h... (x))), so it de-emphasizes the fact that each of these Transformer Block transformations is shaped like x=x+function(x))

1Kallistos6d

Oh, yes you're right...hmmm. Unsure what to do about that, I'd love a neat graphical solution. Wasn't quite sure how to represent repeating units without just duplicating subdiagrams!

Sergii

Mar 22, 2025

10

There are several ways to explain and diagram transformers, some links that were very helpful for my understanding:

https://blog.nelhage.com/post/transformers-for-software-engineers/
https://dugas.ch/artificial_curiosity/GPT_architecture.html
https://peterbloem.nl/blog/transformers
http://nlp.seas.harvard.edu/annotated-transformer/
https://sebastianraschka.com/blog/2023/self-attention-from-scratch.html
https://github.com/markriedl/transformer-walkthrough?ref=jeremyjordan.me
https://francescopochetti.com/a-visual-deep-dive-into-the-transformers-architecture-turning-karpathys-masterclass-into-pictures/
https://jalammar.github.io/illustrated-transformer/
https://e2eml.school/transformers.html
https://jaykmody.com/blog/attention-intuition/
https://eugeneyan.com/writing/attention/
https://www.jeremyjordan.me/attention/

[-]Kallistos4d10

Many thanks!

Reply

Moderation Log

3

[ Question ]

Any mistakes in my understanding of Transformers?

3

3

3 Answers sorted by top scoring

Mar 22, 2025

Mar 21, 2025

Mar 22, 2025

3 Answers sorted by
top scoring