Most systems eventually face scaling bottlenecks. In fact, unless your system is completely free of coordination, it has bottlenecks, even if you haven't scaled far enough to hit them. Transformers do require some coordination: no matter how large the model is or how much parallelism the hardware supports, it still has to produce a single reduced output. So we should expect scaling limits that, at some size, will prevent Transformers from effectively taking advantage of a larger network.
Further, as you touch on briefly, most systems also experience diminishing returns on performance from additional resources because of these constraints.
Transformers may just be special in that they have yet to hit diminishing returns, because we haven't yet run up against their coordination bottlenecks. That doesn't make them too special, though: we should expect those bottlenecks to be lying in wait somewhere, just as they are in every other system that is not coordination-free.
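To make the coordination argument concrete, here is a minimal sketch using Amdahl's law. The law is not mentioned above; it is brought in only as an illustrative analogy. It describes speedup of parallel work, not model quality, and the 5% serial fraction is an arbitrary assumption, not a measurement of any Transformer.

```python
def amdahl_speedup(serial_fraction: float, workers: int) -> float:
    """Speedup over a single worker under Amdahl's law.

    serial_fraction: share of the work that must be coordinated
    (cannot run in parallel). Illustrative assumption, not measured.
    """
    if not 0.0 <= serial_fraction <= 1.0:
        raise ValueError("serial_fraction must be in [0, 1]")
    if workers < 1:
        raise ValueError("workers must be >= 1")
    return 1.0 / (serial_fraction + (1.0 - serial_fraction) / workers)


if __name__ == "__main__":
    serial = 0.05  # hypothetical: 5% of the work needs coordination
    for n in (1, 10, 100, 1000, 10000):
        print(f"{n:>6} workers: speedup {amdahl_speedup(serial, n):.2f}")
    # The ceiling is 1 / serial_fraction, however many workers are added.
    print(f"upper bound: {1.0 / serial:.2f}")
```

With any nonzero serial fraction the curve flattens toward 1 / serial_fraction, which is the "diminishing returns lying in wait" pattern. With a serial fraction of exactly zero (a coordination-free system) the speedup grows linearly with the number of workers and never hits a ceiling.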
Part of the point of GPT3 is that bigger continues to be better. (Computerphile discussion.) A recent question asked whether this would turn out to be true for other architectures as well. But the question seemed to take for granted that we haven't seen this phenomenon in other cases yet. To what extent is this scaling phenomenon special to GPT? To what extent is it special to Transformer networks? To what extent is it special to unsupervised NLP?
My impression: