AI21 has trained a new language model, Jurassic-1, whose largest version has 178 billion parameters (GPT-3 had 175 billion). This paper gives limited technical details.
There already were several models that used far more parameters than GPT-3, but they were either mixture of expert models or only word embeddings. They required much less compute to train/use, but were less powerful than a dense transformer like GPT-3 or the new Jurassic-1.
The interesting thing about Jurassic-1 is that it really doesn’t go much beyond GPT-3. It has a larger vocabulary and slightly optimized architecture. Jurassic-1 only has a bit more parameters than GPT-3, whereas prior trends indicated that any GPT-3 successor would use at least an order of magnitude more parameters. Since GPT-3, much work has gone towards improving transformer architecture (e.g., linear time self attention and neural architecture search), but little of that is visible in Jurassic-1. Maybe companies don’t think it’s economically viable to scale beyond GPT-3 or run many experiments with different architectures at that scale?
Also, Jurassic-1 is a unidirectional model, like GPT-3 (meaning it's forced to process text from left-to-right). This means GPT-3 can only process a given word using the context provided by the previous words. This causes unidirectional models problems for most tasks other than text generation. For example, other than GPT-3, all the top models in the SuperGLUE benchmark leaderboard are bidirectional models. It's interesting AI21 chose to compete with OpenAI using a model that provides the same class of service (text generation) as GPT-3, rather than specialize in, e.g., text classification, where a bidirectional model would be better.
Ahh. I was aware that there was an issue, but I definitely didn't understand clearly enough what the issue was. I assumed that the larger number of tokens would allow better splitting, but the way you explain it, this wouldn't help. (And given that, I wonder what good tokenizer development looks like - because I presume it's hard to optimize alongside the model itself.)