> The interesting thing about Jurassic-1 is that it really doesn’t go much beyond GPT-3.
No, the interesting thing is that it's available as a public API. It took 13 months, but the OA API finally has a real competitor, and one that will be happy to pick up many of the customers OA has driven away with its increasingly heavy-handed, arbitrary, and last-minute restrictions. (The tokenizer and the better width-vs-depth scaling are trivial by comparison.)
The models came before, but not an API/SaaS. GPT-3 was already matched/exceeded by the dense models HyperClova & PanGu-α, and possibly MUM/LaMDA/Pathways/the Wu Daos*, but none of those are meaningfully publicly accessible, and so came and went. Jurassic-1 is available as an API, and is even free right now. That is very different, in much the same way that GPT-J is being so heavily used by everyone locked out of the OA API because it is available for free. "Free [public] is different."
* details are sparse on all these, including the nature of any sparsity
It seems one can't use Jurassic-1 without giving AI21 both your email address and your phone number. (For "validation", but e.g. their "privacy policy" flat-out lies about what personal information they collect -- it doesn't include the phone number -- so I don't see any reason to treat it as meaningfully constraining what they might do with that information.)
The foregoing is not intended to express any judgement as to whether you should or shouldn't care about this.
Well, an email address and a phone number. Whether that's identifying data is up to you (and to some extent, your jurisdiction and how easy it is to get an anonymous cash-paid SIM).
The key advance here seems to be the tokenizer, with its larger vocabulary; tokenization has been identified by others as a potentially critical limitation for GPT-3. I'd be very interested in seeing its performance on multi-digit addition tasks, for example.
I would be very interested in seeing how well this model works as a drop-in replacement for GPT-3 in various applications, both because it would undermine the market value of building AI systems which can be duplicated by others, and because it would say something about how flexible the architectures built around AI systems are to improvements.
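Concretely, "drop-in replacement" would just mean a thin wrapper like the sketch below. The endpoint paths, field names, and response shapes here are from memory of the two vendors' docs and should be treated as placeholders to check against current documentation, not a verified client:

```python
# A rough sketch of a completion function fronting both APIs. All URLs and
# JSON field names below are assumptions to be checked against the vendors'
# current docs, not guaranteed-correct values.
import requests

def complete(prompt, backend="ai21", api_key="...", max_tokens=64):
    if backend == "ai21":
        r = requests.post(
            "https://api.ai21.com/studio/v1/j1-jumbo/complete",  # assumed path
            headers={"Authorization": f"Bearer {api_key}"},
            json={"prompt": prompt, "maxTokens": max_tokens},
        )
        return r.json()["completions"][0]["data"]["text"]        # assumed shape
    else:  # OpenAI
        r = requests.post(
            "https://api.openai.com/v1/engines/davinci/completions",  # assumed path
            headers={"Authorization": f"Bearer {api_key}"},
            json={"prompt": prompt, "max_tokens": max_tokens},
        )
        return r.json()["choices"][0]["text"]
```

If applications really are this loosely coupled to the backing model, the switching cost is close to zero, which is the point.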
> The key advance here seems to be the tokenizer, with its larger vocabulary; tokenization has been identified by others as a potentially critical limitation for GPT-3.
My impression was that tokenization was a critical limitation for GPT-3 in the opposite direction, i.e. it caused GPT-3's performance to suffer on tasks where character-level information is important (including multi-digit addition, and also things like rhyming, acronyms, etc), because the tokenization process clumps characters together by default and obscures that information. Having more (and longer) tokens does not seem like it would remedy that issue; if anything, it may exacerbate it.
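You can see the clumping directly with GPT-2's tokenizer, which uses the same BPE vocabulary as GPT-3 (assumes the Hugging Face `transformers` package; the exact splits depend on the vocabulary):

```python
# Inspect how BPE clumps digits together (GPT-2's vocabulary, shared by GPT-3).
from transformers import GPT2TokenizerFast

tok = GPT2TokenizerFast.from_pretrained("gpt2")
for s in ["365", "12345", "365 + 365 = 730"]:
    print(s, "->", tok.tokenize(s))
# Multi-digit numbers typically come out as one or two opaque chunks rather
# than one token per digit, so the model never sees aligned digit columns,
# which is exactly what multi-digit addition would want.
```

A bigger vocabulary means more numbers get swallowed whole, not fewer.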
Ahh. I was aware that there was an issue, but I definitely didn't understand clearly enough what the issue was. I assumed that the larger number of tokens would allow better splitting, but the way you explain it, this wouldn't help. (And given that, I wonder what good tokenizer development looks like - because I presume it's hard to optimize alongside the model itself.)
One way to think of it: the least lossy, least biased tokenization, the one most faithful to all input representations, allowing the most accurate modeling possible (and hence the best possible splitting), would have exactly 2 tokens: '0' and '1'.
All tokenizations beyond that are implicitly pre-processing the data before the NN sees it, and are making a choice on the bias-variance tradeoff to inject some bias (hiding the raw data) to reduce the variance (by condensing texts into shorter token sequences and doing some 'thinking' in advance).
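A toy illustration of the cost side of that tradeoff, with sequence lengths I computed by hand:

```python
# The "perfectly faithful" 2-token vocabulary pays for its lack of bias
# with much longer sequences, which eats context window and compute.
text = "The quick brown fox"

chars = list(text)                                        # 19 character tokens
bits = "".join(f"{b:08b}" for b in text.encode("utf-8"))  # 152 bit "tokens"

print(len(chars), len(bits))  # 19 152 -- an 8x blowup even vs characters,
                              # and far worse vs a BPE that merges whole words
```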
I wonder if there have been any experiments with feeding transformers just straight binary info. I'm guessing it hasn't been done in this context due to potential context length limitations?
It's both: context length, and the bias-variance tradeoff, which means modeling raw data is intrinsically harder. Realistically, byte-level is about as low-level as it is reasonable to tokenize at this point, and you can get good results there, like ByT5.
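Byte-level in the ByT5 style is simple enough to sketch in a few lines (the special-token offset below matches my understanding of ByT5's vocabulary, but treat the exact value as illustrative):

```python
# ByT5-style byte-level tokenization: every UTF-8 byte is a token ID, offset
# to leave room for a few special tokens. No learned merges, no OOV symbols,
# and a fully lossless round-trip.
SPECIAL = 3  # e.g. pad/eos/unk; illustrative

def encode(text):
    return [b + SPECIAL for b in text.encode("utf-8")]

def decode(ids):
    return bytes(i - SPECIAL for i in ids).decode("utf-8")

ids = encode("déjà vu")
print(ids)
print(decode(ids) == "déjà vu")  # True
```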
You could definitely imagine that more complicated architectures, with more flexible computation patterns than standard Transformers, would be better able to handle bit-level encodings: something like a Perceiver, which selectively attends to bits and pieces of a very large binary input, saving computation by iteratively focusing only on the specific bits it needs. But such an architecture is going to be that much harder to train, and will likely require more data to overcome the overhead & increased flexibility.
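A minimal sketch of that attention pattern (PyTorch; the dimensions and layer structure are invented for illustration, not the actual Perceiver):

```python
# A small set of learned latents cross-attends to a very long bit-level input,
# so compute scales with num_latents * seq_len instead of seq_len ** 2.
import torch
import torch.nn as nn

class BitPerceiverBlock(nn.Module):
    def __init__(self, num_latents=64, d_model=128):
        super().__init__()
        self.latents = nn.Parameter(torch.randn(num_latents, d_model))
        self.bit_embed = nn.Embedding(2, d_model)  # the vocabulary is just 0/1
        self.cross_attn = nn.MultiheadAttention(d_model, num_heads=4, batch_first=True)
        self.self_attn = nn.MultiheadAttention(d_model, num_heads=4, batch_first=True)

    def forward(self, bits):                # bits: (batch, seq_len) of {0, 1}
        x = self.bit_embed(bits)            # (batch, seq_len, d_model)
        q = self.latents.expand(bits.shape[0], -1, -1)
        q, _ = self.cross_attn(q, x, x)     # latents read the long input: O(64 * seq_len)
        q, _ = self.self_attn(q, q, q)      # latents process internally: O(64 ** 2)
        return q

block = BitPerceiverBlock()
bits = torch.randint(0, 2, (1, 8192))       # an 8192-bit input
print(block(bits).shape)                    # torch.Size([1, 64, 128])
```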
AI21 has trained a new language model, Jurassic-1, whose largest version has 178 billion parameters (GPT-3 had 175 billion). This paper gives limited technical details.
There were already several models that used far more parameters than GPT-3, but they were either mixture-of-experts models or consisted mostly of word embeddings. They required much less compute to train and use, but were less powerful than a dense transformer like GPT-3 or the new Jurassic-1.
The interesting thing about Jurassic-1 is that it really doesn’t go much beyond GPT-3. It has a larger vocabulary and a slightly optimized architecture. Jurassic-1 has only slightly more parameters than GPT-3, whereas prior trends indicated that any GPT-3 successor would use at least an order of magnitude more. Since GPT-3, much work has gone into improving the transformer architecture (e.g., linear-time self-attention and neural architecture search), but little of that is visible in Jurassic-1. Maybe companies don’t think it’s economically viable to scale beyond GPT-3 or to run many experiments with different architectures at that scale?
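A back-of-envelope count shows how close the two models are, and where the extra ~3B parameters come from (layer counts and widths as reported for GPT-3 and in AI21's whitepaper; the 12·L·d² formula is the standard approximation, so treat the totals as rough):

```python
# Approximate dense-transformer parameter count: 12 * d^2 per layer
# (4*d^2 attention + 8*d^2 MLP), plus the token-embedding matrix.
def transformer_params(n_layers, d_model, vocab):
    return 12 * n_layers * d_model**2 + vocab * d_model

gpt3     = transformer_params(n_layers=96, d_model=12288, vocab=50_257)   # ~175B
jurassic = transformer_params(n_layers=76, d_model=13824, vocab=256_000)  # ~178B
print(f"GPT-3:      ~{gpt3 / 1e9:.0f}B")
print(f"Jurassic-1: ~{jurassic / 1e9:.0f}B")
```

Jurassic-1 trades depth for width (76 wider layers vs 96 narrower ones) and spends roughly 3.5B parameters on its 256K-token embedding matrix, but the overall budget looks essentially GPT-3-sized.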
Also, Jurassic-1 is a unidirectional model, like GPT-3 (meaning it's forced to process text from left to right). A unidirectional model can only process a given word using the context provided by the preceding words, which causes problems on most tasks other than text generation. For example, other than GPT-3, all the top models on the SuperGLUE benchmark leaderboard are bidirectional. It's interesting that AI21 chose to compete with OpenAI using a model that provides the same class of service (text generation) as GPT-3, rather than specialize in, e.g., text classification, where a bidirectional model would be better.
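For concreteness, the unidirectional/bidirectional distinction comes down to a single attention mask; a minimal sketch:

```python
# A causal mask zeroes out attention from each position to everything after
# it, so position i only sees positions 0..i. Bidirectional (BERT-style)
# models simply drop this mask and let every position attend everywhere.
import torch

seq_len = 5
scores = torch.randn(seq_len, seq_len)             # raw attention scores
causal = torch.tril(torch.ones(seq_len, seq_len))  # lower-triangular mask
scores = scores.masked_fill(causal == 0, float("-inf"))
weights = torch.softmax(scores, dim=-1)
print(weights)  # row i has nonzero weight only on columns 0..i
```

That one mask is what makes the model a natural text generator but a handicapped classifier.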