All of Eric Wallace's Comments + Replies

You may also want to check out Universal Adversarial Triggers https://arxiv.org/abs/1908.07125, an academic paper from 2019 that does the same thing as the above: the authors craft the optimal worst-case prompt to feed into a model, and then use that prompt to analyze GPT-2 and other models.

5DanielFilan
I just skimmed that paper, but I think it doesn't find these tokens like " SolidGoldMagikarp" that have the strange sort of behaviour described in this post. Am I missing something, or by "the exact same thing as the above" were you just referring to one particular section of the post?
3Jessica Rumbelow
Thanks - wasn't aware of this!

This is cool! You may also be interested in Universal Triggers https://arxiv.org/abs/1908.07125. These are also short nonsense phrases that wreak havoc on a model.