In Steve Omohundro's presentation on GPT-3, he compares the perplexity of some different approaches. GPT-2 scores 35.8, GPT-3 scores 20.5, and humans score 12. Sources are linked on slide 12.
I think Omohundro is wrong here. His GPT-3 perplexity of 20.5 must be for the Penn Treebank. However, his 'humans' perplexity of 12 is for a completely different dataset! Tracing his citations from his video leads to Shen et al 2017, which uses the 1 Billion Word Benchmark. 1BW was not reported in the GPT-3 paper because it was one of the datasets affected by contamination and dropped from evaluation.
I've never read the Penn Treebank or 1BW so I can't compare them directly. At best, I'd guess that if 1BW is collected from "English newspapers", it's less diverse than the Brown Corpus, which goes beyond newspapers, and so perplexities will be lower on 1BW than on PTB. However, some searching turned up no estimates for human performance on either PTB or WebText, so I can't guess what the real human vs GPT-3 comparison might be. I'm also a little puzzled about what the 'de-tokenizers' are that the Radford GPT paper says are necessary for doing the perplexity calculations at all...
(There are a lot of papers estimating English text entropy in terms of bits per character, but because of the BPEs and other differences, I don't know how to turn that into a perplexity which could be compared to the reported GPT-3 performance.)
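(For what it's worth, here's the naive BPC-to-perplexity conversion one might try, just to show where the assumptions creep in; the ~5.5 characters per word figure is an illustrative guess, not a measured value, and the BPE/de-tokenizer details are exactly what it glosses over:)

```python
# Naive sketch: convert bits-per-character into an equivalent per-word
# perplexity, under the illustrative assumption of ~5.5 characters per word
# (including the trailing space). The BPE vocabulary and 'de-tokenizer'
# details are what make a clean apples-to-apples comparison hard.
def perplexity_per_word(bpc: float, chars_per_word: float = 5.5) -> float:
    return 2 ** (bpc * chars_per_word)

for bpc in (0.6, 0.8, 1.0, 1.2):
    print(f"{bpc:.1f} BPC ~ per-word perplexity {perplexity_per_word(bpc):.0f}")
```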
Might as well finish out this forecasting exercise...
If we assume compute follows the current trend of peak AI project compute doubling every 3.4 months, then 2.2e6× more compute would be log2(2.2e6) ≈ 21 doublings away - or 21*(3.4/12) ≈ 6 years, or roughly 2026-2027. (Seems a little unlikely.)
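(Spelling that arithmetic out, assuming the 3.4-month doubling time and the 2.2e6× compute gap as given:)

```python
import math

# Compute-trend extrapolation: a 2.2e6x compute gap at a doubling time of
# 3.4 months, counting forward from 2020.
multiplier = 2.2e6
doubling_months = 3.4

doublings = math.log2(multiplier)           # ~21.1 doublings
years = doublings * doubling_months / 12    # ~6.0 years
print(f"{doublings:.1f} doublings ~ {years:.1f} years ~ {2020 + years:.0f}")
```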
Going the other direction, Hernandez & Brown 2020's estimate is that, net of hardware & algorithmic progress, the cost of a fixed level of performance halves every 16 months; so if GPT-3 cost ~$5m in early 2020, then it'll cost $2.5m around mid-2021, and so on. Similarly, a GPT-human requiring 2.2e6× more compute would presumably cost on the order of $10 trillion in 2020, but after ~13-14 halvings (~18 years) would cost on the order of $1b around 2038.
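(Same sketch for the cost trend, assuming the ~$5m GPT-3 cost, the 2.2e6× multiplier, and the 16-month halving time as given:)

```python
import math

# Cost-trend extrapolation: GPT-3 at ~$5m in early 2020, scaled up 2.2e6x,
# with the cost of fixed performance halving every 16 months
# (Hernandez & Brown 2020).
gpt3_cost = 5e6
target_cost_2020 = gpt3_cost * 2.2e6            # ~$1.1e13, i.e. ~$10 trillion
halving_months = 16

halvings = math.log2(target_cost_2020 / 1e9)    # halvings needed to reach ~$1b
years = halvings * halving_months / 12
print(f"~{halvings:.0f} halvings (~{years:.0f} years) to ~$1b, i.e. ~{2020 + years:.0f}")
```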
Metaculus currently seems to be roughly in between 2027 and 2038, incidentally.
Sources:
https://web.stanford.edu/~jurafsky/slp3/
https://www.isca-speech.org/archive/Interspeech_2017/abstracts/0729.html
The latter is the source for human perplexity being 12. I should note that it tested on the 1 Billion Word Benchmark, where GPT-2 scored 42.2 (35.8 was for the Penn Treebank), so the results are not exactly 1:1.
Just use bleeding edge tech to analyze ancient knowledge from the god of information theory himself.
This paper seems to be a good summary and puts a lower bound on the entropy of human models of English somewhere between 0.65 and 1.10 BPC. If I had to guess, the real number is probably closer to 0.8-1.0 BPC, as the mentioned paper was able to pull up the lower bound for Hebrew by about 0.2 BPC. Assuming that regular English averages about 4 characters per token*, GPT-3 clocks in at 1.73/ln(2)/4 = 0.62 BPC. This is lower than the lower bound mentioned in the paper.
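(Spelling out that conversion, with the ~1.73 nats/token loss and the ~4 characters/token estimate as the assumed inputs:)

```python
import math

# GPT-3's loss of ~1.73 nats per BPE token, with an assumed ~4 characters
# per token for ordinary English (the tokenizer estimate from the footnote).
nats_per_token = 1.73
chars_per_token = 4

bits_per_token = nats_per_token / math.log(2)   # ~2.50 bits per token
bpc = bits_per_token / chars_per_token          # ~0.62 bits per character
print(f"{bits_per_token:.2f} bits/token -> {bpc:.2f} BPC")
```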
So, am I right in thinking that if someone took random internet text and fed it to me word by word and asked me to predict the next word, I'd do about as well as GPT-2 and significantly worse than GPT-3?
That would also be my guess. In terms of data entropy, I think GPT-3 is probably already well into the superhuman realm.
I suspect this is mainly because GPT-3 is much better at modelling "high frequency" patterns and features in text that account for a lot of the entropy, but that humans ignore because they have low mutual information with the things humans care about. OTOH, GPT-3 also has extensive knowledge of pretty much everything, so it might be leveraging that and other things to make better predictions than you.
*(ask Gwern for details, this is the number I got in my own experiments with the tokenizer)
From that paper:
> A new improved method for evaluation of both lower and upper bounds of the entropy of printed texts is developed.
"Printed texts" probably falls a standard deviation or three above the median human's performance. It's subject to some fairly severe sampling bias.
Hmmm, your answer contradicts Gwern's answer. I had no idea my question would be so controversial! I'm glad I asked, and I hope the controversy resolves itself eventually...
To simplify Daniel's point: the pretraining paradigm claims that language draws heavily on important domains like logic, causal reasoning, and world knowledge, so that to reach human absolute performance (as measured in prediction: perplexity/cross-entropy/bpc), a language model must learn all of those domains roughly as well as humans do. GPT-3 obviously has not learned those important domains to a human level; so if GPT-3 nevertheless had the same absolute performance as humans, the pretraining paradigm would have to be false, because we would have created a language model which succeeds at one but not the other. There may still be a way to do pretraining right, but the one would turn out not to follow from the other, and so you couldn't just optimize for absolute performance and expect the rest of it to fall into place.
(It would have turned out that language models can model easier or inessential parts of human corpuses enough to make up for skipping the important domains; maybe if you memorize enough quotes or tropes or sayings, for example, you can predict really well while still failing completely at commonsense reasoning, and this would hold true no matter how much more data was added to the pile.)
As it happens, GPT-3 has not reached the same absolute performance; the apparent parity comes from comparing apples & oranges. I was only talking about WebText in my comment there, but Omohundro is talking about the Penn Treebank & 1BW. As far as I can tell, GPT-3 is still substantially short of human performance.
If I thought about it more, it might shorten them, idk. But my idea was: I'm worried that the GPTs are on a path towards human-level AGI. I'm worried that predicting internet text is an "HLAGI-complete problem" in the sense that in order to do it as well as a human, you have to be a human or a human-level AGI. This is worrying because if the scaling trends continue, GPT-4 or 5 or 6 will probably be able to do it as well as a human, and thus be HLAGI.
If GPT-3 is already superhuman, well, that pretty much falsifies the hypothesis that predicting internet text is HLAGI-complete. It makes it more likely that actually the GPTs are not fully general after all, and that even GPT-6 and GPT-7 will have massive blind spots, be super incompetent at various important things, etc.
Agreed. Superhuman levels are unlikely to be achieved simultaneously across different domains, even for a universal system. For example, some model could be universal and superhuman in math, but not superhuman in, say, reading emotions. Bad for alignment.
I look at graphs like these (from the GPT-3 paper), and I wonder where human-level is:
Gwern seems to have the answer here:
So, am I right in thinking that if someone took random internet text and fed it to me word by word and asked me to predict the next word, I'd do about as well as GPT-2 and significantly worse than GPT-3? If so, this actually lengthens my timelines a bit.
(Thanks to Alexander Lyzhov for answering this question in conversation)