(Epistemic status: I do not understand GPT deeply, so this is just a random idea.)
If I understand it correctly, GPT learns from existing texts. Lots of texts.
Would it be possible to make GPT smarter by simply giving it smarter text? Of course, writing tons of smarter text would be a lot of work, but what about annotating the existing text, like "take this more seriously" and "take this less seriously"? (From a technical perspective, maybe GPT should read the text marked as serious five times?) Assuming that the annotation is roughly correct, would this improve the results?
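To make the "read it five times" idea concrete, here is a minimal sketch of upweighting annotated text by simply repeating it in the training stream. This is my own illustration, not any existing pipeline; the names (`is_serious`, `repeats_if_serious`) are made up.

```python
# Hypothetical sketch: repeat "serious" documents so the model effectively
# reads them several times per pass over the data.

def build_training_stream(documents, repeats_if_serious=5):
    """documents: list of (text, is_serious) pairs; both names are assumptions."""
    stream = []
    for text, is_serious in documents:
        copies = repeats_if_serious if is_serious else 1
        stream.extend([text] * copies)
    return stream

docs = [
    ("Settled-science textbook passage ...", True),
    ("Random web comment ...", False),
]
print(len(build_training_stream(docs)))  # 6: the serious text appears 5 times
```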
*
If yes, the next problem is how to select the smarter texts, especially if we want lots of them. But I think some good guesses can be made:
- High-school textbooks. They should be educational and relatively uncontroversial (settled science). There probably are some reviews of textbooks, so only annotate the ones with good reviews.
- Some parts of Reddit are probably much better than average, in the sense that smart content gets upvoted there. So include the highly upvoted comments (see the filtering sketch after this list).
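As a rough idea of what the Reddit filtering could look like, here is a hypothetical sketch that keeps only comments above an upvote-score threshold. The `body` and `score` field names follow Pushshift-style JSON comment dumps, but treat them, and the threshold, as assumptions about whatever dump is actually used.

```python
import json

def select_upvoted_comments(path, min_score=50):
    """Yield the text of comments whose score is at least min_score."""
    # Assumes one JSON object per line, with "score" and "body" fields.
    with open(path, encoding="utf-8") as f:
        for line in f:
            comment = json.loads(line)
            if comment.get("score", 0) >= min_score:
                yield comment["body"]
```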
*
Something like this has only been done by "Time-Aware Language Models as Temporal Knowledge Bases", Dhingra et al 2021 (and mostly in passing by earlier experiments in providing metadata prefixes, like CTRL). Temporal reasoning scales, unsurprisingly: even without explicit metadata, which would be extremely hard to get reliably for most sources (eg Common Crawl: dating random web pages at scale? good luck), there tend to be lots of implicit clues in text, such as the URL structure of news articles (CTRL gives the example of
https://www.cnn.com/style/09/20/2018/george-clooney-interview
), and this probably serves as scaffolding that helps the model understand the internal evidence of undated text. You can already prompt a model like GPT-3 with dates, so you wouldn't be creating any qualitatively new capabilities. So including more metadata (of every kind, not just dating) is a good idea, but not necessary, and it may be a bad use of expert human labor: it would probably be cheap enough to be worth hand-engineering for clean sources like Wikipedia or Twitter or Reddit or academic datasets, where you can be sure of the date easily, but much less so for the bulk of the dataset coming from Common Crawl etc.
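For what that cheap hand-engineering might look like in the "clean" cases, here is a rough sketch that pulls a date out of a news-style URL like the CNN example above and prepends it to the document as a metadata prefix. The regex and the `[date: ...]` prefix format are my own guesses, not CTRL's or anyone's actual preprocessing.

```python
import re

# Matches news-style URLs such as
# https://www.cnn.com/style/09/20/2018/george-clooney-interview
DATE_IN_URL = re.compile(r"/(\d{2})/(\d{2})/(\d{4})/")

def prefix_with_date(url, text):
    """Prepend a '[date: YYYY-MM-DD]' tag when the URL carries an obvious date."""
    match = DATE_IN_URL.search(url)
    if not match:
        return text  # leave undated pages alone rather than guess
    month, day, year = match.groups()
    return f"[date: {year}-{month}-{day}] {text}"

print(prefix_with_date(
    "https://www.cnn.com/style/09/20/2018/george-clooney-interview",
    "George Clooney sat down for an interview ...",
))
# -> [date: 2018-09-20] George Clooney sat down for an interview ...
```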