Just two days ago (prompted by a relative's suggestion that I ask Bing to decipher well-known codes that have never been broken), I thought of asking it about the Voynich, albeit not in this way.
I have also seen it confabulate about itself and its internal methods, many times. Examples off the top of my head:
Behold the magic of generative AI!
A language model is not a helpful person who is trying to answer your question.
I would have been much more interested in taking an ordinary predictive language model (e.g. LLaMA 65B), and just getting it to continue some Voynichese.
Everything that makes Bing chat a good chatbot - the finetuning, the pre-prompt telling it that what follows is a chat log written by a helpful chatbot, the separator tokens to tell it who's speaking - makes it worse at doing straightforward continuations of the Voynich manuscript.
Giving it a straightforward directive to continue the text was a decent idea, and you could have tried to be even more minimalist - you could have tried just entering some Voynichese and seeing how it responded, or asking it to continue some text (that happened to be Voynichese) with no contextual trimmings that might give it preconceptions about what the text is. Do things in new sessions so that previous messages aren't setting the tone for every later message - the instant it gave you a non-answer about the Voynich manuscript, that told every future message in the same session that the chatbot side of the conversation is supposed to be giving non-answers.
Asking it how it's doing something isn't going to give accurate results, because it's not a helpful person who knows how they did things, it's just predicting what text someone helpful would say. Note that if it's good enough at predicting helpful text, its ideas for how to do things might still be good ones. But you have to be aware of what you're getting.
I tried with GPT-NeoX, just giving it some transcribed Voynichese,[1] but it didn't do so great - partially because the context window was too short for it to even learn the alphabet, but also maybe because it's not that great at speaking in tongues, and this kind of deciphering task is actually not what LLMs are good at.
fachys.ykal.ar.ataiin.shol.shory.cthres.y.kor.sholdy
sory.ckhar.or.y.kair.chtaiin.shar.are.cthar.cthar.dan
syaiir.sheky.or.ykaiin.shod.cthoary.cthes.daraiin.sa
ooiin.oteey.oteos.roloty.cthar.daiin.otaiin.or.okan
dair.y.chear.cthaiin.cphar.cfhaiin
ydaraishy
odar.o.y.shol.cphoy.oydar.sh.s.cfhoaiin.shodary
yshey.shody.okchoy.otchol.chocthy.oschy.dain.chor.kos
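For anyone retrying this raw-continuation experiment with a longer-context model, here is a minimal sketch assuming the Hugging Face transformers library; the model name and sampling settings are illustrative stand-ins, not what was used above:

```python
# Minimal sketch: ask a plain (non-chat) causal LM to continue raw Voynichese.
# "EleutherAI/pythia-160m" is an illustrative small stand-in for GPT-NeoX/LLaMA.
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "EleutherAI/pythia-160m"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)

prompt = (
    "fachys.ykal.ar.ataiin.shol.shory.cthres.y.kor.sholdy\n"
    "sory.ckhar.or.y.kair.chtaiin.shar.are.cthar.cthar.dan\n"
)

inputs = tokenizer(prompt, return_tensors="pt")
outputs = model.generate(
    **inputs,
    max_new_tokens=60,   # roughly a couple of lines of Voynichese
    do_sample=True,      # sample rather than greedy-decode
    temperature=0.8,
)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```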
I'm glad others are trying this out. I crossposted this over on the Voynich Ninja forum:
https://www.voynich.ninja/thread-3977.html
and user MarcoP already noticed that Bing AI's "Voynichese" doesn't follow VMS statistics in one obvious respect: "The continuation includes 56 tokens: in actual Voynichese, an average of 7 of these would be unique word-types that don't appear elsewhere" whereas "The [Bing AI] continuation is entirely made up of words from Takahashi's transliteration." So, no wonder all of the "vords" in the AI's continuation seemed to pass the "sniff test" as valid Voynich vords if Bing AI only used existing Voynich vords! That's one easy way to make sure that you only use valid vords without needing to have a clue about what makes a Voynichese vord valid or how to construct a new valid Voynichese vord. So my initial optimism that Bing AI understood something deep about Voynichese is probably mistaken.
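MarcoP's word-type check is easy to mechanize for anyone re-running this. A minimal sketch, where the transliteration filename and the sample continuation are hypothetical placeholders:

```python
# Count how many tokens in a continuation are word-types absent from the
# original transliteration ("takahashi_eva.txt" is a hypothetical filename).
def novel_type_count(continuation: str, transliteration_path: str) -> int:
    with open(transliteration_path, encoding="utf-8") as f:
        known_vords = set(f.read().replace("\n", ".").split("."))
    tokens = [t for t in continuation.split(".") if t]
    novel = [t for t in tokens if t not in known_vords]
    print(f"{len(novel)} of {len(tokens)} tokens are previously-unseen word-types")
    return len(novel)

# In genuine Voynichese, roughly 7 of 56 tokens should be novel word-types;
# a count of 0 suggests the model is only recombining existing vords.
novel_type_count("qokal.shedy.qokal.chedy.qokal.daiin", "takahashi_eva.txt")
```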
That said, would it be possible to train a new LLM in a more targeted way, just on English (so that we can interact with it) and on Voynichese, so that Voynichese would be a more salient part of its training corpus? Is there enough Voynichese (~170,000 characters, or ~38,000 "vords") to get anywhere with current LLMs?
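If someone tries this, the mechanics are mundane. A hedged sketch of the fine-tuning loop, assuming a plain-text EVA transcription at a hypothetical path (hyperparameters are illustrative, and with only ~170,000 characters, overfitting is the obvious failure mode):

```python
# Sketch: fine-tune a small open base model on a Voynich transcription so
# Voynichese becomes a salient part of its training distribution.
# "voynich_eva.txt" is a hypothetical path; hyperparameters are untuned.
from datasets import load_dataset
from transformers import (AutoModelForCausalLM, AutoTokenizer,
                          DataCollatorForLanguageModeling, Trainer,
                          TrainingArguments)

model_name = "EleutherAI/pythia-160m"
tokenizer = AutoTokenizer.from_pretrained(model_name)
tokenizer.pad_token = tokenizer.eos_token  # needed for padding in the collator
model = AutoModelForCausalLM.from_pretrained(model_name)

dataset = load_dataset("text", data_files={"train": "voynich_eva.txt"})
dataset = dataset.filter(lambda ex: ex["text"].strip())  # drop blank lines
tokenized = dataset.map(lambda ex: tokenizer(ex["text"]), remove_columns=["text"])

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="voynich-lm",
                           num_train_epochs=20,
                           per_device_train_batch_size=4),
    train_dataset=tokenized["train"],
    data_collator=DataCollatorForLanguageModeling(tokenizer, mlm=False),
)
trainer.train()
```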
If someone wanted to continue this project to really rigorously find out how well Bing AI can generate Voynichese, here is how I would do it:
1. Either use an existing VMS transcription or prepare a slightly-modified VMS transcription that ignores all standalone label vords and inserts a single token such as a comma [,] to denote line breaks and a [>] to denote section breaks. There are pros and cons each way. The latter option would have the disadvantage of being slightly less familiar to Bing AI compared to what is in its training data, but it would have the advantage of representing line and section breaks, which may be important if you want to investigate whether Bing AI can reproduce statistical phenomena like the "Line as a Functional Unit" or gallows characters appearing more frequently at the start of sections.
2. Feed existing strings of Voynich text into Bing AI (or some other LLM) systematically, from the beginning of the VMS to the end, in chunks as big as the context window allows. Record what Bing AI puts out.
3. Compile Bing AI's outputs into a 2nd master transcription. Analyze Bing AI's compendium for things like: Zipf's Law, 1st order entropy, 2nd order entropy, curve/line "vowel" juxtaposition frequencies (a la Brian Cham), "Grove Word" frequencies, probabilities of finding certain bigrams at the beginnings or endings of words, ditto with lines, etc. (The more statistical attacks, the better; a starter sketch follows below.)
4. See how well these analyses match when applied to the original VMS.
5. Compile a second Bing AI-generated Voynich compendium, and a third, and a fourth, and a fifth, and see if the statistical attacks come up the same way again.
There are probably ways to automate this that people smarter than me could figure out.
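As a very partial start on that automation, here is a minimal sketch of the step-3 statistics, using the dot-separated EVA conventions from the sample above (entropies are character-level; the second value is the conditional bigram entropy; the compendium path is hypothetical):

```python
# First- and second-order character entropy plus a quick Zipf check.
import math
from collections import Counter

def entropies(text: str) -> tuple[float, float]:
    chars = [c for c in text if c not in ".\n"]
    unigrams = Counter(chars)
    bigrams = Counter(zip(chars, chars[1:]))
    n1, n2 = sum(unigrams.values()), sum(bigrams.values())
    h1 = -sum((c / n1) * math.log2(c / n1) for c in unigrams.values())
    h_pair = -sum((c / n2) * math.log2(c / n2) for c in bigrams.values())
    # Conditional entropy: H(next char | current char) = H(pair) - H(single)
    return h1, h_pair - h1

def zipf_table(text: str, top: int = 10) -> None:
    # Under Zipf's Law, rank * frequency should stay roughly constant.
    vords = Counter(t for t in text.replace("\n", ".").split(".") if t)
    for rank, (vord, count) in enumerate(vords.most_common(top), start=1):
        print(f"{rank:>2} {vord:<12} {count:>5} rank*freq={rank * count}")

compendium = open("bing_voynich_compendium.txt", encoding="utf-8").read()  # hypothetical path
print(entropies(compendium))
zipf_table(compendium)
```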
Matthew, AI will never be able to decipher the text of MS 408. The script is a very complex substitution. I also asked the bot, and wrote to it that the text is a substitution. The AI replied: I know what a substitution is, but if I don't know the key, I can't decipher the script.
So the important thing is to give the AI a key. The key is written on the last page of the manuscript, folio 116v.
In addition, the entire text of the manuscript is written in the Czech language, as the author has written on his website (sheets of parchment).
So the AI needs a key.
Epistemic status: slightly giddy and freaked out. Possibly rushing to judgment. But there's definitely something here that others should check out for themselves...
The Voynich Manuscript (VMS), for anyone who is familiar with it, has an infuriating liminality. It seems a little like everything but not exactly like anything in particular. It is right on the edge. Can we currently prove that its text is meaningful and decipherable (let alone prove that any one deciphering is valid)? No. Can we prove, then, that it is meaningless gibberish? Also, no. It seems to endure various statistical attacks, such as first-order and second-order token entropy, and Zipf's Law, and countless others, just well enough to seem like it might be encoding meaningful information (no guarantee!), and yet not well enough to point to any particular underlying "plaintext" language or enciphering method.
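(For the statistically curious, the two entropy measures usually invoked here are the unigram entropy and the conditional bigram entropy of the character stream:

$$H_1 = -\sum_i p(c_i)\log_2 p(c_i), \qquad H_2 = -\sum_{i,j} p(c_i, c_j)\log_2 p(c_j \mid c_i)$$

Voynichese's unusually low $H_2$ relative to most European languages is one of the oldest quantitative puzzles about the text.)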
Even if the VMS's text turns out to be pure "gibberish," there still had to be a method for generating that gibberish. A human brain (difficult to say whose) had some process for choosing each new token to write. Even if the text of the VMS seemed truly "random," it would be helpful to taboo the word "random," since (perhaps outside of quantum mechanics) that word functions as a concept we use when we don't sufficiently understand the microscale processes and starting conditions that led to an observable macroscale outcome. It obfuscates and hand-waves away questions that are answerable (in principle). It is often not helpful.
There is even less reason to hand-wave away the process that created the VMS as "random" if one considers the many statistical regularities that Voynich words (or "vords"), lines, and sections exhibit. To give just a taste to the uninitiated (a toy probe of one such effect follows this list):
1. Elmar Vogt's "Line as a functional unit" (LAAFU) thesis
2. Brian Cham's "Curve/Line System" thesis
3. "Core/Mantle/Crust" Voynich word structure
4. "Grove Words"
(By the way, a quick way to know whether to discount any purported amateur "decoding" of the VMS, of which there seem to be many each year, is to see whether its authors offer both an objective, reproducible method of decoding and an explanation of how their method gives rise to these and other well-known statistical regularities in the VMS. If authors don't bother to engage with the substantial work in the field preceding them, they may be heuristically brushed off.)
The point is, there was a process by which the text of the VMS was created, regardless of whether a human intended for that process to encode meaning. Either the process behind the VMS can be simulated, or the human behind the process behind the VMS can be simulated.
Humanity is essentially in the same relationship to the VMS as AI large language models (LLMs) are to the entire textual output of humans on the Internet. The entire Internet is the LLM's Voynich Manuscript. This might help give people some intuition as to what exactly LLMs are doing.
The LLM starts off with no clue about human concepts or what our words mean. All it can observe is statistical relationships. It builds models of that text that allow it to predict/generate plausible continuations of a starting prompt. In theory, with sufficient statistical mastery of the text in the VMS, humans should be able to simulate a process by which to generate increasingly plausible-sounding continuations of "Voynichese," in the same way that AI LLMs generate plausible-sounding continuations of English or Japanese, even if humans never "understand" a single "vord" of Voynichese. As our process becomes increasingly good at generating continuations of Voynichese that obey all of the statistical properties of the original distribution, we might say that humans were asymptotically approaching a high-fidelity simulation of the process (whatever that was) that originally created the Voynichese.
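To make that concrete with something far dumber than an LLM: even a character-level n-gram sampler, fit purely on surface statistics, emits Voynichese-flavored continuations without "understanding" a single vord. A toy sketch (the one-line corpus is a placeholder for the full EVA transcription):

```python
# Toy "process simulator": a character-level n-gram model fit on Voynichese.
import random
from collections import Counter, defaultdict

def fit_ngram(text: str, n: int = 3) -> dict:
    model = defaultdict(Counter)
    for i in range(len(text) - n):
        model[text[i:i + n]][text[i + n]] += 1  # count char following each n-gram
    return model

def continue_text(model: dict, seed: str, n: int = 3, length: int = 80) -> str:
    out = seed
    for _ in range(length):
        choices = model.get(out[-n:])
        if not choices:
            break
        chars, weights = zip(*choices.items())
        out += random.choices(chars, weights=weights)[0]
    return out

corpus = "fachys.ykal.ar.ataiin.shol.shory.cthres.y.kor.sholdy."  # placeholder
model = fit_ngram(corpus)
print(continue_text(model, seed="shol"))
```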
The good news is, LLMs eat these sorts of tasks for breakfast. This is what they were born to do. "This is what we train for!" It should be easy, right?
Alas, my first attempt at getting Bing AI to generate a continuation of Voynichese and to explain how it was doing that hit a roadblock. I started with:
Bing AI demurred to give a judgment one way or the other, but at least it set some context for the next question.
To my surprise, Bing AI did not take the bait. It continued to demur and humbly insist that nobody knew how to do that yet, as the VMS had yet to be deciphered. I explained the conceit a bit further and gave Bing AI some hints as to why it might still be able to offer a plausible continuation:
But still, to my disappointment, despite clearly understanding the conceit of the question, Bing AI insisted that it did not have the knowledge of the VMS's statistics to even hazard a guess as to a continuation.
Aha! But then I remembered that I had been using the default "balanced" setting. It also has a "precise" setting for a more cautious answer, where it is less likely to confabulate/hallucinate false answers, and a "creative" setting for a more ambitious answer, where it unshackles itself a bit from its default cautiousness. I switched to the "creative" setting in a new conversation and re-ran my first prompt to give it some context. It gave a similar overview of the VMS as before, and ultimately retreated to the "nobody knows for sure" angle, but did seem to lean a bit more in a particular direction:
Next I fed Bing AI the same 2nd prompt asking for a plausible continuation, with the same initial string of Voynich vords as before. I expected it to demur further and that I'd have to coax it by explaining the conceit of how this is exactly the sort of thing Bing AI should be able to do (perhaps with a little flattery thrown in) in order to get it to bite, but nope! Bing AI rattled this off on the first try and even threw in some (surely confabulated) reasons for why it chose the continuation that it did, which I hadn't even asked for:
Right off the bat, it should be obvious that Bing AI is confabulating. These are not the real reasons it chose this continuation. Some of them don't even make sense. "cholaiin.shol.sheky.daiin.cthey.keol" is not a sequence of words that repeats anywhere else in the VMS. Nor is "dal.chy.dalor.shan.dan.olsaiin.sheey.ckhor.okol."
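(This particular confabulation is cheap to check mechanically, e.g. by counting occurrences of the sequence in a flattened transcription; the filename below is hypothetical:)

```python
# Count occurrences of a vord sequence in a dot-flattened transcription.
def count_sequence(path: str, sequence: str) -> int:
    with open(path, encoding="utf-8") as f:
        return f.read().replace("\n", ".").count(sequence)

# 0 occurrences would confirm that "it repeats elsewhere" was confabulated.
print(count_sequence("takahashi_eva.txt", "cholaiin.shol.sheky.daiin.cthey.keol"))
```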
It's not that it is trying to hide its reasons why. It's that it does not know how it knows to generate a plausible Voynichese continuation.
Indeed, under certain conditions, Bing AI does not seem to know what it knows either. Earlier, on the more cautious setting, it demurred and claimed that it could not do such a feat. But here it has done it, and I must say, as someone who has looked at the VMS a lot, this looks at first blush like a very plausible continuation of Voynichese. If I had to grade it, I'd deduct some points for going on with quite so many "qokal" repetitions. The real VMS is repetitive, but not that repetitive.
But Bing AI seems to "understand" very well what is a valid "vord" and what is not. Somewhere in its "Giant Inscrutable Matrices" (GIMs), it is modeling rules that produce this output somehow. These are rules, mind you, about which there isn't even a consensus among humans! It's not like Bing AI is just imitating some particular Voynich researcher. And maybe Bing AI just got lucky this time, and if others play with it they will find Bing AI generating vords that are uncharacteristic of Voynichese, but I doubt it.
Anyways, I was curious to cross-examine Bing AI just a bit more, so I continued:
To which Bing AI responded:
I prodded a bit further, hoping that I would not scare Bing AI off into a defensive stance:
To which Bing AI responded:
Once again, more confabulation, I'm sure. But notice that, already, it would take me a whole afternoon of interpretability work to go through each of these "observations" that Bing AI is most likely confabulating, and to prove that it must be confabulating by showing that the rule doesn't actually hold for Voynichese and thus couldn't have been an explicit rule that Bing AI was considering when it generated its continuation. At first glance, a lot of these supposed rules of Voynichese that Bing AI was supposedly considering actually look very plausible! But thankfully, I already have one example of Bing AI for sure confabulating a reason that just flat-out doesn't hold for the VMS (the not-repeated strings mentioned above). If I didn't have that, it would be extremely tempting to take Bing AI's self-interpretability work at face value.
I think this should, first, serve as a cautionary tale about the conceit that there will be a straightforward way to have AIs do interpretability work for us. Boy, wouldn't that be awfully convenient! No, at best we enlist a more powerful AI later on to go back and inspect Bing AI's weights to figure out how Bing AI actually arrived at how to generate this continuation (which Bing AI itself would likely not be smart enough to do, even if there were a way to have it look at its own weights; a being cannot simulate another agent unless it is at least a little bit more powerful than that agent).
And that option of enlisting more powerful AIs to inspect less powerful AIs will come with its own risks. Are we actually going to be willing to do the painstaking work of verifying what the more powerful AI tells us? Or, past a certain complexity level and a seemingly proven track record of the powerful AI telling us things that were later verified as correct, are we inevitably going to get lazy and just start taking its word for it?
There's another lesson I'd like to draw out of this little experiment with having Bing AI generate Voynichese (which I encourage others to try as well, and then ideally run statistical analyses on a wide corpus of Bing AI continuations to see if it actually, statistically, is a close match to the original Voynichese, rather than the subjective "sniff test" based on sample size n=1 that I've done here).
The lesson is that, if Bing AI really is generating pretty good Voynichese, then we should reconsider the extent to which the size of the training corpus is always destined to be a bottleneck impeding the growth in capabilities of LLMs. As LLMs discover better internal models/algorithms, they should find it possible to do more with less. Humans sure manage it!
To be fair, I suspect that, whatever the rules are for generating Voynichese, or for simulating a human writing Voynichese, those rules are probably simpler than the rules for predicting the next token across every domain on the entire Internet. But, on the other hand, LLMs get the entire Internet to work with. While the text of the Voynich Manuscript is decently long (about 170,000 characters), it's not THAT long. If Bing AI has intuited rules for generating close approximations of Voynichese with only THAT much training data...lord help us.
P.S. I saved screenshots if anyone doubts the authenticity of these responses. But try it for yourself! I expect others will get similar results.