This work was produced as part of the Apart Fellowship. @Yoann Poupart and @Imene Kerboua led the project; @Clement Neo and @Jason Hoelscher-Obermaier provided mentorship, feedback and project guidance.
Here, we present a qualitative analysis of our preliminary results. We are at the very beginning of our experiments, so any suggested setup changes, experiment advice, or ideas are welcome. We also welcome pointers to any relevant papers we may have missed. For a better introduction to interpreting adversarial attacks, we recommend reading scasper’s post: EIS IX: Interpretability and Adversaries.
Code is available on GitHub (still a rough draft).
TL;DR
- Basic adversarial attacks produce examples that the human eye cannot distinguish from the original images.
- To explore the features-vs-bugs view of adversarial attacks, we trained linear probes
... (read 2337 more words →)