Topics like this really draw a crowd, but if you don't know how it works, writing like this just adds energy in the wrong direction. If you start small, building perceptrons by hand, you can work your way up through models to transformers, and it'll be clear what the math is attempting to do per word. It's sophisticatedly predicting the next word based on a matrix of relevance to the previous word and the block as a whole. The attention mechanism is the magic of relevance, but at bottom it is predicting the next word.
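That "matrix of relevance" can be sketched in a few lines. This is a minimal, illustrative take on scaled dot-product attention (self-attention, using NumPy); the function name and toy data are my own, not from any particular library:

```python
import numpy as np

def scaled_dot_product_attention(Q, K, V):
    """Each position scores its relevance to every other position,
    then takes a relevance-weighted average of the values."""
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)                  # the relevance matrix
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)   # softmax over each row
    return weights @ V, weights

# Toy example: 3 "words", each a 4-dimensional embedding (made-up numbers)
rng = np.random.default_rng(0)
x = rng.standard_normal((3, 4))
out, w = scaled_dot_product_attention(x, x, x)
print(w)  # each row sums to 1: how much each word "attends" to the others
```

The row of weights for the last word is exactly the relevance distribution the model uses when it predicts the next one; a real transformer stacks many of these layers with learned projections for Q, K, and V.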