Suppose astronomers detect a binary radio signal from a star system many light years away: an alien message. The message contains a large text dump (conveniently, about the size of GPT-4's training text data) composed in an alien language. Let's call it Alienese.[1]
Unfortunately we don't understand Alienese.
Until recently, it seemed impossible to learn a language without either
- correlating it with sensory experiences shared between the learner and other proficient speakers (the way children learn their first language), or
- having access to a dictionary that translates the unknown language into another, known language. (The Rosetta Stone served as such a dictionary and enabled the decipherment of Egyptian hieroglyphs.)
However, the latest large language models seem to understand languages quite well without using either of these methods. They are able to learn languages from raw text alone, albeit while requiring much larger quantities of training text than the methods above.
This poses a fundamental question:
If an LLM understands language A and language B, is this sufficient for it to translate between A and B?[2]
Unfortunately, it is hardly possible to answer this question empirically using data from human languages. Large text dumps of, say, English and Chinese contain a lot of "Rosetta Stone" content: bilingual documents, common expressions, translations into related third languages like Japanese, literal English-Chinese dictionaries, and so on. Since LLMs require a substantial amount of training text, it is not feasible to reliably filter out all this translation content.
But if we received a large text dump in Alienese, we could be certain that no dictionary-like connections to English are present. We could then train a single foundation model (a next token predictor, say a GPT-4 sized model) on both English and Alienese.
By assumption, this LLM would then be able, using adequate prompt engineering, to answer English questions with English answers, and Alienese questions with Alienese answers.
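To make the training setup concrete, here is a minimal sketch of how the joint corpus might be assembled. It assumes the alien dump has already been segmented into documents (one per line), and the file paths are placeholders; nothing in it aligns the two languages.

```python
# Minimal sketch: assembling a joint English+Alienese training stream.
# Assumptions: both corpora exist as plain-text files with one document per line;
# how the Alienese dump would actually be segmented is itself an open question.
import random

def load_documents(path):
    """Read one document per line, skipping blanks."""
    with open(path, encoding="utf-8") as f:
        return [line.strip() for line in f if line.strip()]

english_docs = load_documents("english_corpus.txt")    # placeholder path
alienese_docs = load_documents("alienese_corpus.txt")  # placeholder path

# Monolingual documents, randomly interleaved: the model sees both languages,
# but never any parallel (translated) data.
joint_corpus = english_docs + alienese_docs
random.shuffle(joint_corpus)

# joint_corpus would then be tokenized and fed to an ordinary causal-LM
# (next-token prediction) training loop.
```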
Of course we can't simply ask any Alienese questions, as we don't know the language. But we can create a prompt like this:
The following document contains accurate translations of text written in various languages (marked as "Original") into English.
Original: /:wYfh]%xy&v[$49F[CY1.JywUey03ei8EH:KWKY]xHRS#58JfAU:z]L4[gkf*ApjP+T!QYYVTF/F00:;(URv4vci$NU:qm2}$-!R3[BiL.RqwzP!6CCiCh%:wjzB10)xX}%Y45=kV&BFA&]ubnFz$i+9+#$(z;0FK(JjjWCxNZTPdr,v0].6G(/mKCr/J@c0[73M}{Gqi+d11aUe?J[vf4YXa4}w4]6)H]#?XBr:Wg35%)T#60B2:d+Z;jJ$9WgE?;u}uR)x1911k-CE?XhmUYMgt9(:CY7=S)[cKKLbZuU
English:
(Assume the garbled text consists of Alienese tokens taken from a random document in the alien text dump.)
Can we expect a prompt like this, or a similar one, to produce a reasonably adequate translation of the Alienese text into English?
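For concreteness, here is a rough sketch of how such a prompt could be fed to the jointly trained model, assuming it were available as an ordinary causal-LM checkpoint. The checkpoint name and the Alienese snippet are purely illustrative.

```python
# Sketch: querying a (hypothetical) English+Alienese causal LM with the
# translation prompt described above. "english-alienese-gpt" is not a real
# checkpoint; any next-token predictor trained on the joint corpus would do.
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("english-alienese-gpt")  # hypothetical
model = AutoModelForCausalLM.from_pretrained("english-alienese-gpt")  # hypothetical

alienese_snippet = "/:wYfh]%xy&v[$49F..."  # a passage sampled from the alien dump

prompt = (
    "The following document contains accurate translations of text written "
    'in various languages (marked as "Original") into English.\n\n'
    f"Original: {alienese_snippet}\n"
    "English:"
)

inputs = tokenizer(prompt, return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=200, do_sample=False)

# Print only the continuation, i.e. the model's attempted English rendering.
print(tokenizer.decode(outputs[0][inputs["input_ids"].shape[1]:],
                       skip_special_tokens=True))
```

Whether the continuation would actually be a faithful translation, rather than a plausible-sounding confabulation, is exactly the open question.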
Perhaps the binary data dump could be identified as containing language data by testing for something like a character encoding, and by checking whether it obeys common statistical properties of natural language, like Zipf's law (see the sketch after these footnotes). ↩︎
There is a somewhat similar question called Molyneux's problem, which asks whether agents can identify objects between two completely unrelated sensory modalities. ↩︎
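As a rough sketch of the Zipf's-law check mentioned in the first footnote: fit a line to log(frequency) vs. log(rank) of token counts; natural-language text tends toward a slope near -1. The whitespace "tokenization" and the file path are assumptions for illustration; the real dump would need a guessed segmentation first.

```python
# Heuristic check: does a token stream follow a Zipf-like rank-frequency curve?
from collections import Counter
import numpy as np

def zipf_slope(tokens):
    """Slope of log(frequency) vs. log(rank); natural language is typically near -1."""
    freqs = np.array(sorted(Counter(tokens).values(), reverse=True), dtype=float)
    ranks = np.arange(1, len(freqs) + 1, dtype=float)
    slope, _intercept = np.polyfit(np.log(ranks), np.log(freqs), 1)
    return slope

# Example with a placeholder corpus and naive whitespace tokenization.
sample_tokens = open("some_corpus.txt", encoding="utf-8").read().split()
print(zipf_slope(sample_tokens))  # close to -1 suggests natural-language-like statistics
```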
I don't think this is clear. I think you might be able to train an LLM on a conlang created after the data cutoff, for instance.
As far as human languages go, I bet it works OK for big LLMs.
I don't think this was a statement about whether it's possible in principle, but about whether it's actually feasible in practice. I'm not aware of any conlangs, before the cutoff date or not, that have a training corpus large enough for the LLM to be trained to the same extent that major natural languages are.
Esperanto is certainly the most widespread conlang, but (1) is very strongly related to European languages, (2) is well before the cutoff date for any LLM, (3) all training corpora of which I am aware contain a great many references to other languages and their cross-translations, and (4) the largest corpora are still less than 0.1% of those available for most common natural languages.