All of tickybob's Comments + Replies

There is an old (2013) paper from Google that mentions training an n-gram model on 1.3T tokens: "Our second-level distributed language model uses word 4-grams. The English model is trained on a 1.3 × 10^12 token training set." An even earlier 2006 blog post also references a 1T word corpus.

This number is 2x as big as MassiveWeb, more than a decade old, and not necessarily the whole web even back then. So I would be quite surprised if the MassiveWeb 506B token number represents a limit of what's available on the web. My guess would be that there'...