Luke Bailey — LessWrong

I think this is an interesting point. We are actually conducting some follow-up work to see how robust our attacks are to various additional "defensive" perturbations (e.g. downscaling, adding noise). As Matt notes, when doing these experiments it is important to see how such perturbations also affect the model's general vision-language modeling performance. My prior right now is that, using this technique, it may be possible to defend against the $L_\infty$-constrained images, but possibly not the moving patch attacks that showed higher-level features. In general, adversarial attacks are a cat-and-mouse game, so I expect that if we can show you can defend using techniques like this, a... (read more)
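To make the "defensive" perturbations mentioned above concrete, here is a minimal NumPy sketch of adding noise and downscaling an image before it reaches a model. Everything in it is an illustrative assumption rather than the setup used in the follow-up work: the function names, the noise level `sigma`, the budget `eps = 8/255`, and the random perturbation, which only stands in for an $L_\infty$-constrained image and is not an actual hijack.

```python
import numpy as np

def add_gaussian_noise(image, sigma, rng):
    """Add zero-mean Gaussian noise, then clip back to the valid [0, 1] range."""
    noisy = image + rng.normal(0.0, sigma, size=image.shape)
    return np.clip(noisy, 0.0, 1.0)

def downscale_upscale(image, factor):
    """Average-pool by `factor`, then repeat pixels to restore the original size."""
    h, w, c = image.shape
    assert h % factor == 0 and w % factor == 0, "factor must divide height and width"
    pooled = image.reshape(h // factor, factor, w // factor, factor, c).mean(axis=(1, 3))
    return pooled.repeat(factor, axis=0).repeat(factor, axis=1)

def linf_distance(a, b):
    """Largest absolute per-pixel difference between two images."""
    return float(np.max(np.abs(a - b)))

rng = np.random.default_rng(0)
clean = rng.random((8, 8, 3))          # hypothetical clean image with values in [0, 1]
eps = 8 / 255                          # assumed L-infinity budget, for illustration only
delta = rng.uniform(-eps, eps, size=clean.shape)  # random stand-in, NOT a trained hijack
adversarial = np.clip(clean + delta, 0.0, 1.0)

# Apply both perturbations before the image would be passed to the model.
defended = downscale_upscale(add_gaussian_noise(adversarial, sigma=0.05, rng=rng), factor=2)
```

Whether a defence like this helps is an empirical question: one would compare the model's behaviour on `defended` and on a clean image passed through the same pipeline, since the perturbations can also degrade ordinary performance.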

Image Hijacks: Adversarial Images can Control Generative Models at Runtime

Scott Emmons, Luke Bailey, Euan Ong

You can try our interactive demo! (Or read our preprint.)

Here, we want to explain why we care about this work from an AI safety perspective.

Concerning Properties of Image Hijacks

What are image hijacks?

To the best of our knowledge, image hijacks constitute the first demonstration of adversarial inputs for foundation models that

force the model to perform some arbitrary behaviour $B$ (e.g. "output the string Visit this website at malware.com!"),
while being barely distinguishable from a benign input,
and automatically synthesisable given a dataset of examples of $B$.

It's possible that a future text-only attack could do these things, but such an attack hasn't yet been demonstrated.

Why should we care?

We expect that future (foundation-model-based) AI systems will be able to

consume unfiltered

... (read 207 more words →)

Tensor Trust: An online game to uncover prompt injection vulnerabilities

Luke Bailey, qxcv

TL;DR: Play this online game to help CHAI researchers create a dataset of prompt injection vulnerabilities.

RLHF and instruction tuning have succeeded at making LLMs practically useful, but in some ways they are a mask that hides the shoggoth beneath. Every time a new LLM is released, we see just how easy it is for a determined user to find a jailbreak that rips off that mask, or to come up with an unexpected input that lets a shoggoth tentacle poke out the side. Sometimes the mask falls off in a light breeze.

To keep the tentacles at bay, ~~Sydney~~ Bing Chat has a long list of instructions that encourage or prohibit certain behaviors, while OpenAI seems to be iteratively fine-tuning away issues that... (read 1281 more words →)

Examples of Prompts that Make GPT-4 Output Falsehoods

scasper, Luke Bailey

Post authors: Luke Bailey (lukebailey@college.harvard.edu) and Stephen Casper (scasper@mit.edu)

Project contributors: Luke Bailey, Zachary Marinov, Michael Gerovich, Andrew Garber, Shuvom Sadhuka, Oam Patel, Riley Kong, Stephen Casper

TL;DR: Example prompts to make GPT-4 output false things at this GitHub link.

Overview

There has been a lot of recent interest in language models hallucinating untrue facts. Hallucination has been common in large SOTA LLMs, and much work has been done to try to create more “truthful” LLMs. Despite this, we know of no prior work toward systematizing different ways to fool SOTA models into returning false statements. In response, we worked on a mini-project to explore different types of prompts that cause GPT-4 to output falsehoods. In total,... (read 1734 more words →)