All of Wayne's Comments + Replies

Wayne52

' newcom', 'slaught', 'senal' and 'volunte'

 

I think these could be a result of a simple stemming algorithm:

  • newcomer → newcom
  • volunteer → volunte
  • senaling → senal

Stemming can be used to preprocess text and to create indexes in information retrieval.

Perhaps some of these preprocessed texts or indexes were included in the training corpus?

4mwatkins
It's not that mysterious that they ended up as tokens. What's puzzling is why so many completions to prompts asking GPT3 to repeat the "forbidden" token strings include them.