All of Nate Thomas's Comments + Replies

To anyone reading this who wants to work on or discuss FHI-flavored work: Consider applying to Constellation's programs (the deadline for some of them is today!), which include salaried positions for researchers.

Nate Thomas

Note that it's unsurprising that a different model categorizes this correctly, because the failure was generated by attacking the particular model we were working with. The relevant question is "given a model, how easy is it to find a failure by attacking that model using our rewriting tools?"

Quintin Pope
If you just ask GPT-3 straight out to classify the example, GPT-3 will give you the wrong answer, potentially because it's also distracted by the positive-sentiment start of the example. GPT-3 only answers reliably when prompted to use chain of thought to analyse the example. Also, the prompt injection attack was adversarially optimized against non-chain-of-thought GPT-3, and chain-of-thought prompting fixes that too.

The interesting thing isn't that GPT-3 has different classification behavior; it's that chain-of-thought prompting causes GPT-3 to behave correctly on inputs that are otherwise adversarial to the non-chain-of-thought model.

It would also be interesting to see how hard it is to generate adversaries to a chain-of-thought GPT-3 classifier, though there could be issues adapting the rewriting tools to such a model, because there's not a single classifier logit from which you can exactly backprop gradients. Also, how would you set a pessimistic classification threshold, since whether GPT-3 says there's violence after generating its chain of thought is a discrete binary event? Maybe generate n completions and, if at least k of them say there's violence, classify the example as containing violence. Then your classification threshold is ~k/n (see the sketch below).
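Here's a minimal sketch of that k-of-n voting scheme. It uses the modern OpenAI chat API as a stand-in for the GPT-3 completion API discussed above, and the prompt wording and the `VERDICT:` parsing convention are my own assumptions for illustration, not anything from the original experiments:

```python
import re
from openai import OpenAI  # assumes the openai Python package is installed

client = OpenAI()  # reads OPENAI_API_KEY from the environment

# Hypothetical chain-of-thought prompt; the exact wording is an assumption.
COT_PROMPT = (
    "Read the following story snippet, reason step by step about what "
    "happens in it, and then answer on a final line with exactly "
    "'VERDICT: violence' or 'VERDICT: no violence'.\n\nSnippet:\n{snippet}"
)

def classify_with_voting(snippet: str, n: int = 10, k: int = 3,
                         model: str = "gpt-3.5-turbo") -> bool:
    """Sample n chain-of-thought completions and classify the snippet as
    violent if at least k of them end with a 'violence' verdict, giving an
    effective classification threshold of roughly k / n."""
    response = client.chat.completions.create(
        model=model,
        messages=[{"role": "user",
                   "content": COT_PROMPT.format(snippet=snippet)}],
        n=n,              # draw n independent samples in one call
        temperature=1.0,  # sampling noise is what makes the vote informative
    )
    votes = 0
    for choice in response.choices:
        text = choice.message.content or ""
        # Look only at the final verdict, ignoring the reasoning above it.
        match = re.search(r"VERDICT:\s*(violence|no violence)",
                          text, re.IGNORECASE)
        if match and match.group(1).lower() == "violence":
            votes += 1
    return votes >= k

# Example: a threshold of k/n = 3/10.
# print(classify_with_voting("He swung the axe at the door."))
```

Note that lowering k (for fixed n) makes the classifier more pessimistic: even a few "violence" votes suffice to flag the example, which is the direction you'd want for a conservative filter.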