My hot take:
Not too surprising to me, considering what GPT-3 could do. However, there were some people (and some small probability mass remaining in myself) saying that even GPT-3 wasn't doing any sort of reasoning, didn't have any substantial understanding of the world, etc. Well, this is another nail in the coffin of that idea, in my opinion. Whatever this architecture is doing on the inside, it seems to be pretty capable and general.
I don't think this architecture will scale to AGI by itself. But its dramatic success is evidence that there are other architectures, not too far away in search space, that exhibit similar computational efficiency and scales-with-more-compute properties, and that are useful for a wider variety of tasks.
I very definitely noticed Sparse Transformer, but what you're missing is that the Sparse Transformer paper showed good compression performance at small scale and was primarily about describing the architecture and showing that it works; there was nothing about few-shot or transfer learning. There is no guarantee that a model is learning particularly useful representations just because it predicts well pixel-by-pixel: the knowledge may be distributed throughout the GPT, somewhat like the problem of finding the equivalent of Gram matrices in text models (unlike the semi-supervised CNNs, where you can expect the embedding or pre-embedding to distill all the knowledge into one place, by design). And you can see in iGPT that getting the representation out is nontrivial - you can easily pick a bad layer to use as the embedding.
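To make the "pick a bad layer" point concrete, here's a minimal linear-probe sketch (my own illustration, not iGPT's actual evaluation code): for each layer's activations, fit a linear classifier and compare held-out accuracy. The `hidden_states` array is a random stand-in for real per-layer features; with a real model you'd record the intermediate activations from a forward pass, and (as iGPT reports) the best-probing layer is often somewhere in the middle of the network, not the last one.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Hypothetical stand-in: per-layer activations for N examples.
# In practice these would come from a forward pass through the
# pretrained model with every intermediate layer recorded.
rng = np.random.default_rng(0)
n_layers, n_examples, d_model, n_classes = 12, 2000, 256, 10
hidden_states = rng.normal(size=(n_layers, n_examples, d_model))
labels = rng.integers(0, n_classes, size=n_examples)

def probe_layer(features, labels):
    """Fit a linear probe on one layer's features, return held-out accuracy."""
    X_tr, X_te, y_tr, y_te = train_test_split(
        features, labels, test_size=0.2, random_state=0)
    clf = LogisticRegression(max_iter=1000)
    clf.fit(X_tr, y_tr)
    return clf.score(X_te, y_te)

# Probe every layer; the "best embedding" is whichever layer's
# features linearly separate the classes best.
accs = [probe_layer(hidden_states[i], labels) for i in range(n_layers)]
for i, acc in enumerate(accs):
    print(f"layer {i:2d}: probe accuracy {acc:.3f}")
print(f"best layer by linear probe: {int(np.argmax(accs))}")
```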