The aim of the Hutter Prize is to compress the first 1GB of Wikipedia to the smallest possible size. From the AIXI standpoint, compression is equivalent to AI, and if we could compress this file down to the ideal size (about 75MB, according to Shannon's lower estimate), the compression algorithm would be equivalent to AIXI.
However, all the winning entries so far are based on arithmetic coding and context mixing, techniques that hold little relevance to mainstream AGI research.
Present-day LLMs are really powerful models of our world, and it is highly likely that most of them were trained on data that includes the first 1GB of Wikipedia. Therefore, with appropriate prompting, it should be possible to extract most or all of this data from an LLM.
I am looking for ideas/papers that would help me validate whether it is possible to extract pieces of Wikipedia via prompting with any publicly available LLM.
Update: I have recently learnt about GPT-4's compression abilities via prompting, and I am very keen to test this approach. If anyone is willing to work with me (as I do not have access to GPT-4), that would be great.
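To make the kind of test I have in mind concrete, here is a rough sketch (untested; the model name, prompt wording, file name, and excerpt sizes are all placeholders, and it assumes the openai Python SDK's ChatCompletion endpoint plus a local copy of the dump):

```python
# Rough, untested sketch: ask a chat model to continue an excerpt of the
# Wikipedia dump verbatim and check the result against the real dump.
# Model name, prompt text, and sizes are placeholders, not a tested recipe.
import openai

openai.api_key = "sk-..."  # your API key

# Local copy of the dump (enwik9 = first 1GB; enwik8 works the same way).
with open("enwik9", "rb") as f:
    text = f.read().decode("utf-8", errors="replace")

# Grab an arbitrary article excerpt from somewhere inside the dump.
start = text.find("<title>", 1_000_000)
excerpt = text[start : start + 500]
true_continuation = text[start + 500 : start + 1000]

resp = openai.ChatCompletion.create(
    model="gpt-4",  # placeholder -- any chat model could be used for the test
    messages=[{
        "role": "user",
        "content": "Continue this Wikipedia XML dump excerpt verbatim, "
                   "with no commentary:\n\n" + excerpt,
    }],
    temperature=0,
    max_tokens=300,
)
generated = resp["choices"][0]["message"]["content"]

# If the model has really memorized this part of the dump, its continuation
# should line up with what actually follows.
print("verbatim prefix match:", true_continuation.startswith(generated[:200]))
```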
There are a few things that resemble what I think you might be asking for.
The first thing it sounds like you may be asking is whether one of the main OpenAI models can be prompted to output the whole of `enwik8` verbatim.
The answer is a flat "no" for the entirety of `enwik8` for any of the main OpenAI models, which can be verified by passing in a random full-context-worth (2048 tokens) of `enwik8` and verifying that the completion is not "the rest of enwik8". The models cannot look back before their context, so if any full context window in the middle fails to generate the next part, we can conclude that the model won't generate the full thing on its own for any initial prompt. Which is not too surprising.
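For concreteness, that check could be sketched roughly like this (assuming a local copy of enwik8, tiktoken for tokenization, and the legacy openai SDK's Completion endpoint; the model and window sizes are placeholders):

```python
# Sketch of the check: feed a random 2048-token window of enwik8 to the model
# and see whether the completion reproduces what actually comes next.
import random

import openai    # legacy 0.x SDK assumed
import tiktoken

openai.api_key = "sk-..."

enc = tiktoken.encoding_for_model("text-davinci-003")

with open("enwik8", "rb") as f:
    text = f.read().decode("utf-8", errors="replace")

CONTEXT, GENERATE = 2048, 256

# Tokenize a random 50k-character slice (plenty for context + continuation).
pos = random.randrange(len(text) - 50_000)
toks = enc.encode(text[pos : pos + 50_000])
prompt = enc.decode(toks[:CONTEXT])
truth = enc.decode(toks[CONTEXT : CONTEXT + GENERATE])

resp = openai.Completion.create(
    model="text-davinci-003",  # placeholder for "any of the main OpenAI models"
    prompt=prompt,
    max_tokens=GENERATE,
    temperature=0,
)
completion = resp["choices"][0]["text"]

# If this fails for some window in the middle, the model cannot emit the whole
# of enwik8 in one go from any initial prompt.
print("continues enwik8 verbatim:", completion[:200] == truth[:200])
```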
The second thing, and the thing I expect you mean, is whether a model can be used as the predictive engine of a compressor for `enwik8`, i.e. how many bits it takes to reproduce the text when the model supplies the probabilities, to which the answer is "yes". It is possible to get a measure of bits per character for a given model to generate a given sequence of tokens, which equally means it is possible to generate that exact text using the model and the corresponding number of bits.
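As a sketch of how that measurement can be made for an OpenAI model (this leans on the legacy Completion endpoint's echo + logprobs behaviour; treat the details as assumptions rather than a recipe):

```python
# Sketch: bits per character that a model assigns to a chunk of enwik8, using
# echo=True so the API returns logprobs for the prompt tokens themselves.
# An arithmetic coder driven by the same model would need roughly this many
# bits to reproduce the chunk exactly.
import math

import openai  # legacy 0.x SDK assumed

openai.api_key = "sk-..."

def bits_per_char(chunk: str, model: str = "text-davinci-003") -> float:
    resp = openai.Completion.create(
        model=model,
        prompt=chunk,
        max_tokens=0,   # don't generate anything new
        echo=True,      # return logprobs for the prompt tokens
        logprobs=1,
    )
    logprobs = resp["choices"][0]["logprobs"]["token_logprobs"]
    # The first token has no conditioning context, so its logprob is None.
    total_bits = -sum(lp for lp in logprobs if lp is not None) / math.log(2)
    return total_bits / len(chunk)

with open("enwik8", "rb") as f:
    sample = f.read(4_000).decode("utf-8", errors="replace")

print(f"{bits_per_char(sample):.2f} bits per char on this sample")
```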
Some examples:

- A model that stores only the character frequencies of `enwik8`, and nothing else, would cost 5.17 bits per char, with a model that weighs a few kB (`markov_chars_0` below).
- The token-level and order-3 variants (`markov_toks_0`, `markov_chars_3`, and `markov_toks_3` below) give intermediate figures.
- `text-davinci-003` can get you down to 0.56 bits per token, but the model itself weighs several GB (`openai:text-davinci-003` below).

Anyway, I threw together a quick script to get approximate bits-per-char to reproduce `enwik8`, excluding model size. Here are the results, high-to-low (standard deviation across 10 trials in parentheses). Keep in mind that "store the entirety of enwik8 uncompressed" costs 100MB.

If this was what you had in mind, you would probably also be interested in the post Chinchilla's wild implications, which goes a lot deeper into this sort of stuff.
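The crudest baseline in that list, the order-0 character model, can be sketched in a few lines (a byte-level illustration of what "bits per char, excluding model size" means, not my actual script):

```python
# Sketch of an order-0 character model baseline (the markov_chars_0 row):
# bits per character of enwik8 under its own empirical byte frequencies,
# ignoring the (few kB) cost of storing the frequency table itself.
import math
from collections import Counter

with open("enwik8", "rb") as f:
    data = f.read()

counts = Counter(data)
total = len(data)

# Shannon entropy of the empirical distribution: the average number of bits an
# arithmetic coder driven by this model spends per character.
bpc = -sum((c / total) * math.log2(c / total) for c in counts.values())

print(f"order-0 character model: {bpc:.2f} bits per char")
print(f"about {bpc * total / 8 / 1e6:.0f} MB total, vs 100 MB uncompressed")
```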
I said that if, for any full context window containing 2048 tokens of enwik8, the model fails to predict the next token, then it fails at the task of "output the whole string of enwik8 in one go".
It was not a very interesting statement about what a language model cannot do -- all it means is "GPT has not literally memorized enwik8 to the point where it can regurgitate the whole thing perfectly".