All of ZeroRelevance's Comments + Replies

The population growth problem should be somewhat addressed by healthspan extension. A big reason why people aren't having kids now is that they lack the resources - be it housing, money, or time. If we could extend the average healthspan by a few decades, then older people who have spent enough time working to accumulate those resources, but would otherwise be too old to raise children, should now be able to have kids. Moreover, it means that people who already have many kids but have just become too old will also be able to have more. For those reasons, I don't t... (read more)

Sorry for the late reply, but yeah, it was mostly vibes based on what I'd seen before. I've been looking over the benchmarks in the Technical Report again though, and I'm starting to feel like 500B+10T isn't too far off. Although the language benchmarks are fairly similar, the improvements in mathematical capabilities over the previous SOTA are much larger than I first realised, and seem to match a model of that size, judging by the performance of the conventionally trained PaLM and its derivatives.

Apparently all OPT models were trained with a 2k token context length. Based on this, assuming basic O(n^2) scaling, an 8k token version of the 175B model would have the attention stage account for about 35% of the FLOPs, and a 32k token version would have it account for almost 90% of the FLOPs. 8k tokens is somewhat excusable, but 32k tokens is still overwhelmingly significant even with a 175B parameter model, costing around 840% more compute than a 2k token model. That percentage will probably only drop to a reasonable level at around the 10T parameter model level,... (read more)
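
One way to sanity-check numbers like these is a Kaplan-style per-token FLOPs accounting. The sketch below uses OPT-175B's published 96 layers and 12288 hidden size; everything else is a simplifying assumption, and the resulting percentages are quite sensitive to which operations you count (projections, softmax, backward pass), so different accountings give noticeably different attention shares.

```python
# Rough estimate of attention's share of per-token FLOPs for a GPT-style model.
# Kaplan-style accounting (an assumption, not the exact accounting used above):
#   parameter matmuls : ~2 * N FLOPs per token (forward pass)
#   attention         : ~4 * n_layer * n_ctx * d_model FLOPs per token
#                        (QK^T scores plus attention-weighted values)

def attention_fraction(n_params, n_layer, d_model, n_ctx):
    param_flops = 2 * n_params                  # per-token FLOPs from parameter matmuls
    attn_flops = 4 * n_layer * n_ctx * d_model  # per-token FLOPs from the attention stage
    return attn_flops / (param_flops + attn_flops)

# OPT-175B's published config: 96 layers, hidden size 12288.
for n_ctx in (2048, 8192, 32768):
    frac = attention_fraction(n_params=175e9, n_layer=96, d_model=12288, n_ctx=n_ctx)
    print(f"n_ctx={n_ctx:>6}: attention ~{frac:.1%} of per-token forward FLOPs")
```

The key structural point either way: per-token attention cost grows linearly in context length (quadratically per sequence), while per-token parameter cost stays fixed, so the attention share only becomes negligible again once the parameter count grows much faster than the context window.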

I always get annoyed when people use this as an example of 'lacking intelligence'. Though it certainly is in part an issue with the model, the primary reason for this failure is much more likely the tokenization process than anything else. A GPT-4, or likely even a GPT-3, trained with character-level tokenization would probably have zero issues answering these questions. It's for the same reason that the base GPT-3 struggled so much with rhyming, for instance.
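
For a concrete sense of what the model actually sees, here's a minimal sketch using the tiktoken library (assuming it's installed; the specific word is just an example): the text arrives as a handful of opaque multi-character token IDs, so answering character-level questions means reasoning about letters the model never observes as separate symbols.

```python
# Minimal illustration of why character-level questions are hard for
# BPE-tokenized models: the model sees token IDs, not individual letters.
# Requires the `tiktoken` package (pip install tiktoken).
import tiktoken

enc = tiktoken.get_encoding("cl100k_base")  # a GPT-4-era encoding

word = " strawberry"  # leading space, as the word would appear mid-sentence
ids = enc.encode(word)
pieces = [enc.decode([i]) for i in ids]

print(ids)     # a short list of integer IDs
print(pieces)  # the word split into a few multi-character chunks
# Counting a particular letter requires recovering character structure
# that was collapsed away before the model ever saw the input.
```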

1Bezzi
Independently of the root causes of the issue, I am still very reluctant to call something "superintelligent" when it cannot reliably count to three.

According to the Chinchilla paper, a compute-optimal model of that size should have ~500B parameters and have been trained on ~10T tokens. Based on GPT-4's demonstrated capabilities though, that's probably an overestimate.
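
For reference, here's a minimal back-of-the-envelope sketch of that Chinchilla-style estimate, using the common C ≈ 6ND approximation for training compute and the paper's roughly 20-tokens-per-parameter rule of thumb; the compute budget plugged in below is just the one implied by a 500B/10T run, not a claim about GPT-4's actual budget.

```python
# Back-of-the-envelope Chinchilla-style sizing.
# Assumptions: training compute C ~= 6 * N * D (N params, D tokens),
# and the compute-optimal ratio D ~= 20 * N from the Chinchilla paper.
import math

def chinchilla_optimal(compute_flops, tokens_per_param=20.0):
    # Solve C = 6 * N * (tokens_per_param * N) for N.
    n_params = math.sqrt(compute_flops / (6.0 * tokens_per_param))
    n_tokens = tokens_per_param * n_params
    return n_params, n_tokens

# Placeholder budget: the compute a ~500B-param / ~10T-token run would use.
C = 6 * 500e9 * 10e12
N, D = chinchilla_optimal(C)
print(f"N ~ {N/1e9:.0f}B params, D ~ {D/1e12:.1f}T tokens")
```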

2Lukas Finnveden
Are you saying that you would have expected GPT-4 to be stronger if it was 500B+10T? Is that based on benchmarks/extrapolations or vibes?
4sairjy
Yeah, agreed. I think it would make sense that it's trained on 10x-20x the number of tokens of GPT-3, so around 3-5T tokens (2x-3x Chinchilla), and that would give around 200-300B parameters given those laws.