For what it's worth: I tried asking ChatGPT:
Quiz time!
In which famous game might you happen on the line, "Hello, my name is Steve"?
And it identified it right away as Minecraft and (when I asked) told me that what followed was a tutorial.
It could also tell me in which game I might meet Leilan. (I expected a cursed answer, but no.)
I really don't want to ask it about the "f***ing idiot" quote though... :-)
(Oh yeah, and it isn't really helpful on the "?????-?????-" mystery either.)
The thing about "ÃÂ" appears to be that if you take some (or at least certain) innocent character in the Latin-1-but-not-ASCII code range, say, "æ", and encode it in UTF-8 – and then take the resulting bytes, interpreting them as Latin-1, and convert them to UTF-8 again – and then repeat that process, you get:
$ echo 'æ' | iconv -f latin1 -t UTF-8 | iconv -f latin1 -t UTF-8 | iconv -f latin1 -t UTF-8 | iconv -f latin1 -t UTF-8
ÃÂÃÂÃÂÃÂ¦
Well, between those various "A"s are actually some invisible "NO BREAK HERE" and "BREAK PERMITTED HERE" characters. The real structure is
Ã<NBH>Â<NBH>Ã<BPH>Â<NBH>Ã<NBH>Â<BPH>Ã<BPH>Â¦
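For anyone without iconv at hand, here is a rough Python sketch of the same experiment. It assumes the terminal handed iconv the UTF-8 bytes of "æ" and it leaves out the trailing newline that echo adds; the names misread_once and show are made up for illustration.

```python
# Sketch: repeat the "read UTF-8 bytes as Latin-1, write UTF-8 again" step.
# misread_once and show are illustrative names, not part of any library.

def misread_once(data: bytes) -> bytes:
    """Rough equivalent of one `iconv -f latin1 -t UTF-8` pass."""
    return data.decode("latin-1").encode("utf-8")

# The two C1 control characters that end up between the visible letters.
INVISIBLE = {"\x82": "<BPH>", "\x83": "<NBH>"}

def show(text: str) -> str:
    """Spell out the invisible control characters so the structure is readable."""
    return "".join(INVISIBLE.get(ch, ch) for ch in text)

data = "\u00e6".encode("utf-8")   # "æ" as a UTF-8 terminal would send it: C3 A6
for _ in range(4):                # four passes, as in the command line
    data = misread_once(data)

result = data.decode("utf-8")     # what a UTF-8 terminal finally displays
print(show(result))
```

Printing show(result) instead of result is what makes the control characters between the letters visible.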
Even if you start with non-Latin-1 characters you may end up with these characters.
The replacement character, for instance, eventually becomes "ÃÂÃÂ¯ÃÂÃÂ¿ÃÂÃÂ½".
Accidentally interpreting UTF-8 as Latin-1 or vice versa is a fairly easy programming mistake to make, so it's not too surprising that it's happened often around the web; doing a web search of "ÃÂÃÂ" shows many occurrences within regular text, such as "...You donâÃÂÃÂt review this small book; you tell people about it âÃÂàadults as well as kids âÃÂàand say..."
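To illustrate how small the mistake is, here is a minimal Python sketch of a single mix-up: text is written as UTF-8 and read back as Latin-1. The sample word is my own choice, not taken from any of the pages found in the search.

```python
# Sketch of the mix-up: bytes written as UTF-8 are read back as Latin-1.
original = "don\u2019t"               # "don't" with a typographic apostrophe
stored = original.encode("utf-8")     # the writer encodes as UTF-8

wrong = stored.decode("latin-1")      # the reader wrongly assumes Latin-1
right = stored.decode("utf-8")        # the reader correctly assumes UTF-8

print(repr(wrong))                    # the apostrophe is now three characters
print(right == original)              # True

# While no bytes have been lost, one layer of damage can be undone
# by reversing the two steps.
repaired = wrong.encode("latin-1").decode("utf-8")
print(repaired == original)           # True
```

Each time a program saves the damaged text and another one misreads it again, one more layer is added.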
Anyway, I think that demystifies that group of tokens – including why their lengths happen to be powers of two.
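The doubling can be checked with a few lines of Python. This is only a sketch, starting from "æ" and using a made-up helper name, misread_once. Every byte of a non-ASCII character is 0x80 or above, and each such byte turns into two bytes of that kind in the next round, so the text doubles each time.

```python
# Sketch: count characters after each round of mis-decoding.
# misread_once is an illustrative name, not part of any library.

def misread_once(data: bytes) -> bytes:
    """Read the bytes as Latin-1 and write the result as UTF-8."""
    return data.decode("latin-1").encode("utf-8")

data = "\u00e6".encode("utf-8")    # "æ", two bytes: C3 A6
lengths = []                       # total characters after each round
a_counts = []                      # how many of them are "Ã" or "Â"
for _ in range(5):
    data = misread_once(data)
    text = data.decode("utf-8")
    lengths.append(len(text))
    a_counts.append(sum(ch in "\u00c3\u00c2" for ch in text))

print(lengths)    # [2, 4, 8, 16, 32]
print(a_counts)   # [1, 2, 4, 8, 16]
```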
(I'm curious about whether the actual tokens contain the invisible <NBH>/<BPH> characters or not, though...)
Regards,
Erik
(Stumbled into this via Computerphile, by the way.)
The mystery these tokens represent tickles me just as much as the next person... I believe one of the last ones to be found out is the "?????-?????-" token.
With the right pop-quiz warmup, ChatGPT has some suggestions. Most of which are probably useless.
The one which sounds most plausible to me:
That one actually sounds like a likely source; it was kind of what I had in mind when I asked, although I thought there might be some specific story/character to be found.
It certainly fits with the response you observed...!
Other than that, the suggestions are spread all over the world of fiction:
Yup, most of the time it's just <insert famous fiction quote>, at least in that particular chat context.