The aim of the Hutter Prize is to compress the first 1GB of Wikipedia to the smallest possible size. From the AIXI standpoint, compression is equivalent to AI, and if we could compress this data to the ideal size (75MB according to Shannon's lower estimate), the compression algorithm would be equivalent to AIXI.
However, all the winning solutions so far are based on arithmetic coding and context mixing, and these techniques hold little relevance in mainstream AGI research.
Present-day LLMs are remarkably powerful models of our world, and it is highly likely that most LLMs were trained on data that includes the first 1GB of Wikipedia. Therefore, with appropriate prompting, it should be possible to extract most or all of this data from an LLM.
I am looking for ideas/papers that would help me validate whether it is possible to extract pieces of Wikipedia from any publicly available LLM using prompts.
Update: I have recently learnt about GPT-4's compression abilities via prompting, and I am very keen to test this method. If anyone is willing to work with me (as I do not have access to GPT-4), that would be great.
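Below is a rough sketch of the kind of check I have in mind, assuming the OpenAI Python client (v1.x); the model name, prompt wording, snippet length, and local enwik9 file path are all placeholders rather than a tested recipe. The idea is to prompt the model with the opening of enwik9 and measure how closely its continuation matches the real file.

```python
# Hypothetical sketch: prompt a chat model with the start of enwik9 and see how
# much of the true continuation it reproduces. Assumes the OpenAI Python client
# (v1.x), OPENAI_API_KEY in the environment, and a local copy of the enwik9 file.
import difflib
from openai import OpenAI

client = OpenAI()

# Read the first couple of thousand characters of enwik9 (the XML dump header).
with open("enwik9", "rb") as f:
    snippet = f.read(2000).decode("utf-8", errors="replace")

prompt_part, target_part = snippet[:1000], snippet[1000:]

response = client.chat.completions.create(
    model="gpt-3.5-turbo",  # stand-in; I would want to try GPT-4 here
    temperature=0,
    max_tokens=500,
    messages=[
        {"role": "system", "content": "Continue the text verbatim. Do not add commentary."},
        {"role": "user", "content": prompt_part},
    ],
)
completion = response.choices[0].message.content or ""

# How close is the model's continuation to the real next chunk of enwik9?
ratio = difflib.SequenceMatcher(None, completion, target_part[:len(completion)]).ratio()
print(f"similarity to the true continuation: {ratio:.2%}")
```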
I don't think it's surprising that there are sequences of tokens that no amount of prompt tuning can get the model to generate, as long as the sequence is allowed to be longer than the model's context window; in fact, I think it would be surprising if this were something you could reliably do.
For example, let's build a toy model where the tokens are words, and the context length is 10. This model has been trained on the following string.
If you start with the string
the model will complete with "home.", at which point the context is
and the model completes with "On", changing the context to
If the model completes with "Tuesday", it will end up stuck in a loop of
which does not succeed at the task of "complete the input". The outcome is similar if it chooses to complete with "Wednesday" instead.
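To make this concrete, here is a minimal sketch of such a toy model: greedy word-level completion by longest-suffix matching against the training data. The training string is hypothetical (it is not the one quoted in the example above); it is chosen only so that the sentences differ in a place that falls outside the 10-word window.

```python
# Toy "language model": word tokens, context window of 10 words, and greedy
# completion by finding the longest suffix of the context in the training string.
# The training string is hypothetical, shaped like the example above: sentences
# that only differ at a point which scrolls out of the 10-word window.

CONTEXT_LEN = 10

TRAINING_TEXT = (
    "On Monday evenings I like to play chess at my friend's home. "
    "On Tuesday evenings I like to play chess at my friend's home. "
    "On Wednesday evenings I like to play chess at my friend's home."
)
TRAIN = TRAINING_TEXT.split()


def next_word(window):
    """Return the word that followed the longest matching suffix of `window`
    in the training data (ties broken by first occurrence)."""
    for start in range(len(window)):
        suffix = window[start:]
        for i in range(len(TRAIN) - len(suffix)):
            if TRAIN[i:i + len(suffix)] == suffix:
                return TRAIN[i + len(suffix)]
    return None


def complete(prompt, n_steps=40):
    tokens = prompt.split()
    for _ in range(n_steps):
        word = next_word(tokens[-CONTEXT_LEN:])
        if word is None:
            break
        tokens.append(word)
    return " ".join(tokens)


# Prompt with the first sentence. Once "Monday" falls out of the 10-word window,
# the model cannot tell which sentence it is in: with ties broken by first
# occurrence it cycles through "On Tuesday evenings ... home." forever and never
# reaches the Wednesday sentence (a model that picked "Wednesday" instead would
# fail in a similar way).
print(complete("On Monday evenings I like to play chess at my friend's home."))
```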
This happens because the task is specifically "output the entirety of enwik8", where "the entirety of enwik8" is a string that is much longer than the context window. No matter what your initial prompt was, once the prompt has succeeded at getting the model to output the first 2048 tokens of enwik8, that prompt no longer has any causal impact on the completion - "feed the model a prompt that causes it to output the first 2048 tokens of enwik8 and ask it for completions" and "feed the model the first 2048 tokens of enwik8 and ask it for completions" are operations that yield the same result.
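This is easy to check in the toy model sketched above (the snippet below reuses the `complete` and `CONTEXT_LEN` definitions from that sketch): a long prompt and just its final window produce identical continuations.

```python
# Reuses complete() and CONTEXT_LEN from the toy-model sketch above.
def continuation(prompt, n_steps=30):
    """Only the newly generated words, with the prompt itself stripped off."""
    return complete(prompt, n_steps).split()[len(prompt.split()):]


long_prompt = "On Monday evenings I like to play chess at my friend's home. On"
short_prompt = " ".join(long_prompt.split()[-CONTEXT_LEN:])  # just the final window

# The words before the final window have no causal effect on what comes next.
assert continuation(long_prompt) == continuation(short_prompt)
```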
If we had a GPT-X that somehow had an unbounded context window, I suspect it would be possible to get it to output the full text of enwik8 (I expect that one way to do this would be "fill its context window with enwik8 repeated a bunch of times in a row").