AI21 has trained a new language model, Jurassic-1, whose largest version has 178 billion parameters (GPT-3 had 175 billion). This paper gives limited technical details.
There already were several models that used far more parameters than GPT-3, but they were either mixture of expert models or only word embeddings. They required much less compute to train/use, but were less powerful than a dense transformer like GPT-3 or the new Jurassic-1.
The interesting thing about Jurassic-1 is that it really doesn’t go much beyond GPT-3. It has a larger vocabulary and slightly optimized architecture. Jurassic-1 only has a bit more parameters than GPT-3, whereas prior trends indicated that any GPT-3 successor would use at least an order of magnitude more parameters. Since GPT-3, much work has gone towards improving transformer architecture (e.g., linear time self attention and neural architecture search), but little of that is visible in Jurassic-1. Maybe companies don’t think it’s economically viable to scale beyond GPT-3 or run many experiments with different architectures at that scale?
Also, Jurassic-1 is a unidirectional model, like GPT-3 (meaning it's forced to process text from left-to-right). This means GPT-3 can only process a given word using the context provided by the previous words. This causes unidirectional models problems for most tasks other than text generation. For example, other than GPT-3, all the top models in the SuperGLUE benchmark leaderboard are bidirectional models. It's interesting AI21 chose to compete with OpenAI using a model that provides the same class of service (text generation) as GPT-3, rather than specialize in, e.g., text classification, where a bidirectional model would be better.
My impression was that tokenization was a critical limitation for GPT-3 in the opposite direction, i.e. it caused GPT-3's performance to suffer on tasks where character-level information is important (including multi-digit addition, and also things like rhyming, acronyms, etc), because the tokenization process clumps characters together by default and obscures that information. Having more (and longer) tokens does not seem like it would remedy that issue; if anything, it may exacerbate it.