AI21 has trained a new language model, Jurassic-1, whose largest version has 178 billion parameters (GPT-3 had 175 billion). This paper gives limited technical details.
There already were several models that used far more parameters than GPT-3, but they were either mixture of expert models or only word embeddings. They required much less compute to train/use, but were less powerful than a dense transformer like GPT-3 or the new Jurassic-1.
The interesting thing about Jurassic-1 is that it really doesn’t go much beyond GPT-3. It has a larger vocabulary and slightly optimized architecture. Jurassic-1 only has a bit more parameters than GPT-3, whereas prior trends indicated that any GPT-3 successor would use at least an order of magnitude more parameters. Since GPT-3, much work has gone towards improving transformer architecture (e.g., linear time self attention and neural architecture search), but little of that is visible in Jurassic-1. Maybe companies don’t think it’s economically viable to scale beyond GPT-3 or run many experiments with different architectures at that scale?
Also, Jurassic-1 is a unidirectional model, like GPT-3 (meaning it's forced to process text from left-to-right). This means GPT-3 can only process a given word using the context provided by the previous words. This causes unidirectional models problems for most tasks other than text generation. For example, other than GPT-3, all the top models in the SuperGLUE benchmark leaderboard are bidirectional models. It's interesting AI21 chose to compete with OpenAI using a model that provides the same class of service (text generation) as GPT-3, rather than specialize in, e.g., text classification, where a bidirectional model would be better.
One way to think of it: the least-lossy and biased tokenization, the one most faithful to all input representations, allowing the most accurate modeling possible, which allows the best possible splitting, would have exactly 2 tokens - '0', and '1'.
All tokenizations beyond that are implicitly pre-processing the data before the NN sees it, and are making a choice on the bias-variance tradeoff to inject some bias (hiding the raw data) to reduce the variance (by condensing texts into shorter token sequences and doing some 'thinking' in advance).