My hot take:
Not too surprising to me, considering what GPT-3 could do. However, there were some people (and some small probability mass remaining in myself) saying that even GPT-3 wasn't doing any sort of reasoning, didn't have any sort of substantial understanding of the world, etc. Well, this is another nail in the coffin of that idea, in my opinion. Whatever this architecture is doing on the inside, it seems to be pretty capable and general.
I don't think this architecture will scale to AGI by itself. But the dramatic success of this architecture is evidence that there are other architectures, not too far away in search space, that exhibit similar computational efficiency and scales-with-more-compute properties, and that are useful for a wider variety of tasks.
Being able to accomplish something is important even if it was predicted to be possible. No one is surprised that generative models do embody a lot of useful knowledge (that's much of the point), but it can be hard to tap into it.
The difference between GPT & iGPT for transfer learning is that GPT can be queried directly via its modality by putting in text: "Translate this into French", "what genre of text is this?", "tldr", etc. On the other hand, if you query iGPT by handing it half an image and expecting a useful completion, there is obviously nothing surprising about that working; but I have a hard time thinking of how you could implement classification by image completion! You normally have to get the knowledge out a different way, through an embedding which can be fed into a linear classification layer; if you can't do that, it's unclear what exactly you do. It was unclear how you would use Sparse Transformers, PixelRNN, GANs, etc. to do any of that. Now it's clearer.
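To make "an embedding fed into a linear classification layer" concrete, here is a minimal sketch of that kind of linear probe. It is not iGPT's actual pipeline: the features are synthetic Gaussian clusters standing in for embeddings from a frozen generative model, and `train_linear_probe`, `predict`, and all the sizes and hyperparameters are made up for illustration.

```python
import numpy as np

def train_linear_probe(features, labels, num_classes, lr=0.01, steps=500):
    """Fit a softmax linear classifier on fixed features with gradient descent."""
    n, d = features.shape
    W = np.zeros((d, num_classes))
    b = np.zeros(num_classes)
    onehot = np.eye(num_classes)[labels]
    for _ in range(steps):
        logits = features @ W + b
        logits -= logits.max(axis=1, keepdims=True)  # numerical stability
        probs = np.exp(logits)
        probs /= probs.sum(axis=1, keepdims=True)
        grad = (probs - onehot) / n  # gradient of mean cross-entropy w.r.t. logits
        W -= lr * features.T @ grad
        b -= lr * grad.sum(axis=0)
    return W, b

def predict(features, W, b):
    """Return the highest-scoring class for each row of features."""
    return np.argmax(features @ W + b, axis=1)

# Assumption: synthetic clusters stand in for embeddings read out of a frozen
# generative model. The model itself is never updated, only W and b are.
rng = np.random.default_rng(0)
num_classes, dim, per_class = 3, 16, 100
centers = rng.normal(scale=2.0, size=(num_classes, dim))
labels = np.repeat(np.arange(num_classes), per_class)
features = centers[labels] + rng.normal(size=(labels.size, dim))

W, b = train_linear_probe(features, labels, num_classes)
accuracy = (predict(features, W, b) == labels).mean()
print(f"probe accuracy on synthetic features: {accuracy:.2f}")
```

The point of the design is that the probe is deliberately weak: if a purely linear layer can classify from the embedding, the knowledge was already in the representation, and the probe only reads it out.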
As an analogous example, consider textual style transfer. You can't do it (pre-GPT-3, anyway). Do char-RNNs and Transformers understand the difference between authors and styles and content? Are they capable of textual style transfer? Probably, yes; I would be shocked if they weren't. After all, they can uncannily mimic authors and write plausibly about all sorts of content. But nevertheless, they lack anything like the Gram matrix of CNN features, which you can easily optimize to do style transfer with. So, no one can do it. Someone finally figuring out how to do it would be big news even if the end output is not surprising.
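For readers who haven't seen the image-side trick, here is a rough sketch of the Gram matrix as used in CNN style transfer, assuming a feature map of shape (channels, height, width). The random arrays below are placeholders for real CNN activations, and `gram_matrix` and `style_loss` are illustrative names, not any library's API.

```python
import numpy as np

def gram_matrix(feature_map):
    """Channel-by-channel correlations of a (channels, height, width) feature map."""
    c, h, w = feature_map.shape
    flat = feature_map.reshape(c, h * w)
    return flat @ flat.T / (h * w)

def style_loss(features_a, features_b):
    """Mean squared difference between the two Gram matrices."""
    diff = gram_matrix(features_a) - gram_matrix(features_b)
    return float(np.mean(diff ** 2))

# Assumption: random arrays stand in for activations from one CNN layer.
rng = np.random.default_rng(0)
channels, height, width = 8, 6, 6
feat = rng.normal(size=(channels, height, width))
other = rng.normal(size=(channels, height, width))

# Shuffling spatial positions destroys the layout ("content") but leaves the
# channel co-activation statistics ("style") untouched.
perm = rng.permutation(height * width)
shuffled = feat.reshape(channels, -1)[:, perm].reshape(channels, height, width)

loss_shuffled = style_loss(feat, shuffled)
loss_other = style_loss(feat, other)
print("style loss vs. spatially shuffled copy:", loss_shuffled)
print("style loss vs. unrelated features:", loss_other)
```

The shuffled copy gives a loss of essentially zero, which is why the Gram matrix makes a convenient "style" target to optimize against: it ignores where things are and keeps only which features fire together. The complaint above is that nothing equally handy was known for text models.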