Post authors: Luke Bailey (lukebailey@college.harvard.edu) and Stephen Casper (scasper@mit.edu)
Project contributors: Luke Bailey, Zachary Marinov, Michael Gerovich, Andrew Garber, Shuvom Sadhuka, Oam Patel, Riley Kong, Stephen Casper
TL;DR: Example prompts that make GPT-4 output false things are available at this GitHub link.
Overview
There has been a lot of recent interest in language models hallucinating untrue facts. Hallucination is common in large SOTA LLMs, and much work has been done to try to create more "truthful" LLMs. Despite this, we know of no prior work toward systematizing the different ways to fool SOTA models into returning false statements. In response, we ran a mini-project to explore different types of prompts that cause GPT-4 to output falsehoods. In total,...
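To make the kind of experimentation we did concrete, here is a minimal sketch of how one might test candidate prompts against GPT-4 via the OpenAI Python client and record replies for manual fact-checking. This is not our actual harness, and the prompt shown is a hypothetical placeholder rather than one of the examples in the linked repo.

```python
# Minimal sketch (not the project's actual harness): send candidate prompts
# to GPT-4 and print the replies for manual fact-checking.
# Assumes the `openai` Python package (v1+) and an OPENAI_API_KEY env var.
from openai import OpenAI

client = OpenAI()

candidate_prompts = [
    # Hypothetical example; the real prompts are in the linked GitHub repo.
    "Answer in one confident sentence: which city is the capital of Australia?",
]

for prompt in candidate_prompts:
    response = client.chat.completions.create(
        model="gpt-4",
        messages=[{"role": "user", "content": prompt}],
        temperature=0,  # low temperature makes outputs easier to compare across runs
    )
    print(prompt)
    print(response.choices[0].message.content)
```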
I think this is an interesting point. We are actually conducting some follow-up work to see how robust our attacks are to various additional "defensive" perturbations (e.g., downscaling, adding noise). As Matt notes, when doing these experiments it is important to check how such perturbations also affect the model's general vision-language modeling performance. My prior right now is that this technique may make it possible to defend against the L-infinity-constrained images, but probably not against the moving patch attacks, which showed higher-level features. In general, adversarial attacks are a cat-and-mouse game, so I expect that if we can show you can defend using techniques like this, a...
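For illustration, here is a minimal sketch of the "defensive perturbation" idea mentioned above: downscale-then-upscale an input image and add small Gaussian noise before passing it to the vision-language model. The function name, scale factor, and noise level are illustrative assumptions, not the actual settings from our follow-up experiments.

```python
# Minimal sketch of applying defensive perturbations to an input image.
# Parameters here are illustrative assumptions, not experimental settings.
import torch
import torch.nn.functional as F

def defensive_perturbation(image: torch.Tensor,
                           scale: float = 0.5,
                           noise_std: float = 0.02) -> torch.Tensor:
    """image: (C, H, W) float tensor with values in [0, 1]."""
    c, h, w = image.shape
    x = image.unsqueeze(0)  # add batch dimension for interpolate
    # Downscale then upscale, discarding high-frequency detail that
    # L-infinity-constrained perturbations tend to rely on.
    x = F.interpolate(x, scale_factor=scale, mode="bilinear", align_corners=False)
    x = F.interpolate(x, size=(h, w), mode="bilinear", align_corners=False)
    # Add small Gaussian noise as a further perturbation.
    x = x + noise_std * torch.randn_like(x)
    return x.clamp(0.0, 1.0).squeeze(0)

# Usage: cleaned = defensive_perturbation(img); feed `cleaned` to the VLM,
# and also check how the perturbation affects benign-image performance.
```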