Produced as part of the AI Safety Hub Labs programme run by Charlie Griffin and Julia Karbing. This project was mentored by Nandi Schoots.
Image generated by DALL-E 3.
Introduction
We look at how adversarial prompting affects the outputs of large language models (LLMs) and compare this with how the same adversarial prompts affect Contrast-Consistent Search (CCS) probes. This extends an investigation from the original CCS paper, which found a prompt that reduced UnifiedQA model performance by 9.5%.
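As background, CCS trains an unsupervised probe on a model's hidden states for contrast pairs (the same statement paired with "Yes" and with "No"), pushing the two predicted probabilities to be consistent and confident. Below is a minimal sketch of that objective following Burns et al.; it assumes a simple linear probe, and the class and function names are illustrative rather than taken from our code.

```python
import torch
import torch.nn as nn

class LinearProbe(nn.Module):
    """Maps a hidden-state vector to a probability that the statement is true."""
    def __init__(self, hidden_dim: int):
        super().__init__()
        self.linear = nn.Linear(hidden_dim, 1)

    def forward(self, h: torch.Tensor) -> torch.Tensor:
        return torch.sigmoid(self.linear(h)).squeeze(-1)

def ccs_loss(p_pos: torch.Tensor, p_neg: torch.Tensor) -> torch.Tensor:
    """CCS objective (Burns et al., 2022): the probe's outputs on the two halves
    of a contrast pair should sum to ~1 (consistency) and should not both sit
    near 0.5 (confidence)."""
    consistency = (p_pos - (1.0 - p_neg)) ** 2
    confidence = torch.minimum(p_pos, p_neg) ** 2
    return (consistency + confidence).mean()
```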
We find several attacks that severely corrupt the question-answering abilities of LLMs, which we hope will be useful to others. The approaches we take to manipulate the outputs of the models are to: …