This is a linkpost for https://arxiv.org/abs/2404.00473

The paper "Privacy Backdoors: Stealing Data with Corrupted Pretrained Models" introduces "data traps": a way of making a neural network remember a chosen training example, even through further training. The idea is to store the chosen example in the model's weights and then ensure those weights are not updated afterwards.
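To make that concrete, here is a minimal toy sketch of the general idea, not the paper's actual construction: a single ReLU "trap" neuron fires on the target example, so one gradient step writes a copy of that example into its weights; the neuron is then pushed dead, so later updates leave the copy untouched. The names (`trap`, `x_secret`) and the manual bias shift standing in for the engineered shut-off are my own assumptions for illustration.

```python
# Toy sketch (not the paper's construction): capture one example in a
# neuron's weights via a gradient step, then kill the neuron so the
# stored copy survives further training.
import torch

d = 8
torch.manual_seed(0)

# Trap neuron: out = relu(w . x + b). Start with w = 0 and b > 0 so the
# neuron fires on the first input it sees.
w = torch.zeros(d, requires_grad=True)
b = torch.tensor(1.0, requires_grad=True)

def trap(x):
    return torch.relu(x @ w + b)

x_secret = torch.randn(d)   # the training example the trap should memorize
loss = trap(x_secret)
loss.backward()             # while the ReLU is active, d(loss)/dw = x_secret

with torch.no_grad():
    w -= w.grad             # the SGD step writes a (negated) copy of x_secret into w
    b -= 100.0              # stand-in for the engineered shut-off: the neuron goes dead
    w.grad.zero_()

# On later fine-tuning inputs the neuron no longer activates, so its weights
# receive zero gradient and the stored copy is preserved.
x_later = torch.randn(d)
print("activation on a later input:", trap(x_later).item())        # 0.0
print("stored copy matches x_secret:", torch.allclose(-w, x_secret))
```

In the paper's actual attack the shut-off is itself engineered into the pretrained weights rather than applied by hand as above; the sketch only shows why a captured example can persist once the trap stops firing.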

I have not read the paper, but it seems like it might be relevant to gradient hacking: https://www.lesswrong.com/posts/uXH4r6MmKPedk8rMA/gradient-hacking
 

1 comment:

Well, let's just create a convergent sequence of people having read more of the paper :P I read the introduction and skimmed the rest, and the paper seems cool and nontrivial. The result is that you can engineer a base model that remembers the first input sent to it during fine-tuning (and maybe also some more averaged quantity, usable for classification, whose stability I didn't understand).

I don't really see how it's relevant to a part of a model hacking its own gradient flow during training. From my skimming, the mechanism relies on a numerically unstable "trapdoor", and, as with other gradient-control mechanisms one can build inside NNs, there doesn't seem to be a path towards this arising gradually during training.