Zeyu Qin

I am a Ph.D. student from HKUST. My research interest is broadly on AI Safety, Interpretability, and Alignment.
https://alan-qin.github.io/

Wiki Contributions

Comments

Sorted by
Zeyu QinΩ010

The key difference between LAT and Adversarial Training is that the Surgeon gets to directly manipulate the Agent’s inner state, which makes the Surgeon’s job much easier than in the ordinary adversarial training setup.

 

I think that being able to selectively modify the inner state (the task of Surgeon) is not easier than searching for adversarial examples in the input space.

Zeyu QinΩ012

My knowledge of the broader machine learning and neuroscience fields is limited, and I strongly suspect that there are connections to other topics out there – perhaps some that have already been studied, and perhaps some which have yet to be. For example, there are probably interesting connections between interpretability and dataset distillation (Wang et al., 2018). I’m just not sure what they are yet. 

The dataset condensation (distillation) may be seen as the global explanation of the training dataset.  And, we could also utilize DC to extract some spurious correlations like backdoor triggers. We made a naive attempt in this direction: https://openreview.net/forum?id=ix3UDwIN5E

World 2 [(Feature world)]: Adversarial examples exploit useful directions for classification (“features”). In this world, adversarial examples occur in directions that are still “on-distribution”, and which contain features of the target class. For example, consider the perturbation that makes an image of a dog to be classified as a cat. In World 2, this perturbation is not purely random, but has something to do with cats. Moreover, we expect that this perturbation transfers to other classifiers trained to distinguish cats vs. dogs.

 

I believe that when discussing bugs or features, it is only meaningful to focus on targeted attacks.

Bug world view: Proponents of this view can argue that tools for robustness and intrinsic interpretability are regularizers and that both robustness and interpretability are consequences of regularization. 

I am unsure of the meaning of this statement. What do you mean by "regularizers" that you mentioned? 

Zeyu QinΩ010

But suppose that despite our best effort, we end up with a deceptively aligned system on our hands. Now what do we do? At this point, the problem of detecting and fixing deception becomes quite similar to just detecting and fixing problems with the model in general – except for one thing. Deceptive alignment failures are triggered by inputs that are, by definition, hard to find during training. 

 

I agree almost completely. The following is an example that seems to be contradictory: unfaithful CoT reasoning (https://arxiv.org/abs/2307.13702).