AI21 has trained a new language model, Jurassic-1, whose largest version has 178 billion parameters (GPT-3 had 175 billion). This paper gives limited technical details.
There were already several models with far more parameters than GPT-3, but they were either mixture-of-experts models or consisted mostly of word embeddings. They required much less compute to train and use, but were less powerful than a dense transformer like GPT-3 or the new Jurassic-1.
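For intuition on why those sparse models are so much cheaper per parameter, here is a rough sketch (the configuration numbers below are purely illustrative, not any specific model's): a mixture-of-experts layer stores many expert MLPs but routes each token through only a few of them, so per-token compute scales with the active experts rather than the total parameter count.

```python
# Illustrative-only numbers: why a mixture-of-experts (MoE) model can have
# far more parameters than a dense model at much lower per-token compute.
# Each token is routed to only top_k of num_experts expert MLPs, so the
# FLOPs per token scale with the active experts, not the stored total.

d_model = 12288
d_ff = 4 * d_model   # common MLP expansion factor
num_experts = 64
top_k = 2

dense_mlp_params = 2 * d_model * d_ff              # one dense MLP block
moe_total_params = num_experts * dense_mlp_params  # parameters stored
moe_active_params = top_k * dense_mlp_params       # parameters used per token

print(f"dense MLP params per layer:  {dense_mlp_params / 1e9:.1f}B")
print(f"MoE total params per layer:  {moe_total_params / 1e9:.1f}B")
print(f"MoE active params per token: {moe_active_params / 1e9:.1f}B")
```

The stored parameter count balloons 64x while the compute per token only doubles, which is how sparse models post huge parameter headlines without matching a dense model's power.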
The interesting thing about Jurassic-1 is that it really doesn't go much beyond GPT-3. It has a larger vocabulary and a slightly optimized architecture (trading some depth for width). Jurassic-1 has only slightly more parameters than GPT-3, whereas prior trends suggested that any GPT-3 successor would use at least an order of magnitude more. Since GPT-3, much work has gone into improving the transformer architecture (e.g., linear-time self-attention and neural architecture search), but little of that is visible in Jurassic-1. Maybe companies don't think it's economically viable to scale far beyond GPT-3, or to run many experiments with different architectures at that scale?
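To see where Jurassic-1's extra parameters go, here is a back-of-the-envelope sketch using the standard 12 * n_layers * d_model^2 approximation for a dense transformer's non-embedding parameters; the configurations are as reported (96 layers at width 12288 for GPT-3; 76 layers at width 13824 plus a ~256K-token vocabulary for Jurassic-1), and the formula is a rough rule of thumb, not either paper's exact accounting.

```python
# Back-of-the-envelope parameter counts for dense decoder-only transformers.
# Non-embedding parameters ~= 12 * n_layers * d_model^2 (attention + MLP),
# plus vocab_size * d_model for the token embedding matrix.

def approx_params(n_layers: int, d_model: int, vocab_size: int) -> float:
    """Approximate total parameter count of a dense transformer LM."""
    non_embedding = 12 * n_layers * d_model**2  # attention + MLP blocks
    embedding = vocab_size * d_model            # token embeddings
    return non_embedding + embedding

gpt3 = approx_params(n_layers=96, d_model=12288, vocab_size=50257)
jurassic = approx_params(n_layers=76, d_model=13824, vocab_size=256_000)

print(f"GPT-3:      ~{gpt3 / 1e9:.0f}B parameters")      # ~175B
print(f"Jurassic-1: ~{jurassic / 1e9:.0f}B parameters")  # ~178B
```

Note that Jurassic-1's non-embedding budget is nearly identical to GPT-3's; most of the headline parameter difference is the much larger vocabulary's embedding matrix.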
Also, Jurassic-1 is a unidirectional model, like GPT-3, meaning it processes text left-to-right: it can only predict a given word using the context provided by the preceding words. This puts unidirectional models at a disadvantage on most tasks other than text generation. For example, other than GPT-3, all the top models on the SuperGLUE benchmark leaderboard are bidirectional. It's interesting that AI21 chose to compete with OpenAI using a model that provides the same class of service (text generation) as GPT-3, rather than specializing in, e.g., text classification, where a bidirectional model would be better.
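To make the unidirectional/bidirectional distinction concrete, here is a minimal NumPy sketch (a generic illustration, not AI21's or OpenAI's code): a causal model masks the attention matrix so position i can only attend to positions j <= i, while a bidirectional model lets every position attend to the whole sequence.

```python
import numpy as np

def attention_weights(q, k, causal: bool):
    """Toy scaled dot-product attention with optional causal masking."""
    d = q.shape[-1]
    scores = q @ k.T / np.sqrt(d)  # (seq, seq) raw attention scores
    if causal:
        # Forbid position i from attending to positions j > i (future tokens).
        future = np.triu(np.ones_like(scores, dtype=bool), k=1)
        scores = np.where(future, -np.inf, scores)
    # Numerically stable softmax over the key dimension.
    scores = scores - scores.max(axis=-1, keepdims=True)
    exp = np.exp(scores)
    return exp / exp.sum(axis=-1, keepdims=True)

rng = np.random.default_rng(0)
q = rng.normal(size=(4, 8))  # 4 tokens, dimension 8
k = rng.normal(size=(4, 8))

print(attention_weights(q, k, causal=True))   # lower-triangular weights
print(attention_weights(q, k, causal=False))  # every token sees every token
```

The causal mask is what makes left-to-right generation cheap and natural, and it is also exactly why the model can't use right-hand context when classifying a word in the middle of a sentence.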
No, the interesting thing is that it's available as a public API. It took 13 months, but now the OA API has a real competitor: someone who will be happy to pick up many of the customers OA has driven away with its increasingly heavy-handed, arbitrary, and last-minute restrictions. (The tokenizer and better width-vs-depth scaling are trivial by comparison.)
The models came before, but not an API/SaaS. GPT-3 was already matched or exceeded by the dense models HyperClova & PanGu-α, and possibly by MUM/LaMDA/Pathways/the Wu Daos*, but none of those are meaningfully publicly accessible, and so they came and went. Jurassic-1 is available as an API, and is even free right now. That is very different, in much the same way that GPT-J is so heavily used by everyone locked out of the OA API precisely because it is available for free. "Free [public] is different."
* Details are sparse on all of these, including the nature of any sparsity.