Bogdan Ionut Cirstea

Comments

In future work, one could imagine automating the evaluation of the coherence and generalization of learned steering vectors, similarly to how Bills et al. (2023) automate interpretability of neurons in language models. For example, one could prompt a trusted model to produce queries that explore the limits and consistency of the behaviors captured by unsupervised steering vectors.

Probably even better to use interpretability agents (e.g. MAIA, AIA) for this, especially since they can do (iterative) hypothesis testing. 

I wonder how much near-term interpretability [V]LM agents (e.g. MAIA, AIA) might help with finding better probes and better steering vectors (e.g. by iteratively testing counterfactual hypotheses against potentially spurious features, a major challenge for Contrast-consistent search (CCS)). 

This seems plausible since MAIA can already find spurious features, and feature-interpretability [V]LM agents could sustain much longer hypothesis-iteration cycles (compared to current [V]LM agents and perhaps even to human researchers).
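To gesture at the kind of loop I have in mind, here's a minimal sketch (all interfaces below are hypothetical, not MAIA's / AIA's actual APIs): an agent repeatedly proposes matched counterfactual prompt pairs that vary potentially spurious surface features while holding the target concept fixed, then checks whether a candidate probe / steering direction still separates them.

```python
import numpy as np

def stress_test_direction(model, agent, direction, concept, n_rounds=5):
    """Iteratively stress-test a candidate probe / steering direction.

    `agent` is a hypothetical interpretability-agent interface that, given the
    concept and the results so far, proposes matched counterfactual prompt pairs
    varying potentially spurious surface features while holding the concept fixed.
    `model.get_activations` is likewise a hypothetical helper returning a (d,) vector.
    """
    history = []
    for _ in range(n_rounds):
        pairs = agent.propose_counterfactual_pairs(concept, history)
        scores = []
        for prompt_with, prompt_without in pairs:
            act_with = model.get_activations(prompt_with)
            act_without = model.get_activations(prompt_without)
            # Projection onto the candidate direction should separate the pair
            # if the direction tracks the concept rather than a surface feature.
            scores.append(float(act_with @ direction) - float(act_without @ direction))
        # Low-separation rounds are evidence of a spurious feature; the agent can
        # use the recorded history to design the next round of counterfactuals.
        history.append({"pairs": pairs, "mean_separation": float(np.mean(scores))})
    return history
```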

I think I remember William Merrill pointing out (in a video) that the rational-inputs assumption seems very unrealistic (it would require infinite memory); from what I recall, https://arxiv.org/abs/2404.15758 and related papers instead made an assumption about the number of bits of memory per parameter and per input.
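Roughly (my paraphrase of the contrast, not the papers' exact statements): storing an arbitrary rational exactly takes unboundedly many bits, whereas a log-precision-style assumption bounds the per-value memory in terms of the input length:

$$
\text{exact rational } \tfrac{p}{q}: \ \Theta(\log|p| + \log|q|) \text{ bits, unbounded over } \mathbb{Q}
\qquad \text{vs.} \qquad
\text{log-precision: } O(\log n) \text{ bits per value for input length } n .
$$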

TruthX: Alleviating Hallucinations by Editing Large Language Models in Truthful Space seems to be using a contrastive approach for steering vectors (though I've only skimmed it); it might be worth having a look.

Unsupervised Feature Detection: There is a rich literature on unsupervised feature detection in neural networks.

It might be interesting to add (some of) the literature on unsupervised feature detection in GANs and in diffusion models (e.g. see recent work from Pinar Yanardag and the associated citation trails).

Relatedly, I wonder if, instead of (or separately from) the L2 distance, using something like a contrastive loss (similarly to how it was used in NoiseCLR or in LatentCLR) might produce interesting / different results.
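To illustrate the kind of objective I mean, here's a rough LatentCLR-style sketch adapted to steering vectors (my own construction, not the post's loss or NoiseCLR's / LatentCLR's exact objectives): learn K vectors whose effects on activations are consistent across prompts for the same vector but distinct across different vectors.

```python
import torch
import torch.nn.functional as F

def steering_contrastive_loss(effects, temperature=0.5):
    """Contrastive (NT-Xent-style) loss over steering-vector effects.

    effects: tensor of shape (K, B, d) -- the change in activations produced by
    each of K candidate steering vectors on each of B prompts (assumes B > 1).
    Positives: the same vector's effect on different prompts.
    Negatives: other vectors' effects.
    """
    K, B, d = effects.shape
    z = F.normalize(effects.reshape(K * B, d), dim=-1)
    sim = z @ z.T / temperature                      # pairwise cosine similarities
    labels = torch.arange(K).repeat_interleave(B)    # which vector produced each row
    pos_mask = labels[:, None] == labels[None, :]
    self_mask = torch.eye(K * B, dtype=torch.bool)
    pos_mask &= ~self_mask
    # Pull each effect toward the same vector's effects on other prompts,
    # push it away from the other vectors' effects.
    exp_sim = torch.exp(sim.masked_fill(self_mask, float("-inf")))
    pos_term = (exp_sim * pos_mask).sum(dim=-1)
    loss = -torch.log(pos_term / exp_sim.sum(dim=-1))
    return loss.mean()
```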

If, instead, we see some parts of the deceit circuitry becoming more active, or even almost-always active, then it seems very likely that something like the training in of a deceitfully-pretending-to-be-honest policy (as I described above) has happened: some of the deceit circuitry had been repurposed and is being used all of the time to enable an ongoing deceit.

A Mechanistic Understanding of Alignment Algorithms: A Case Study on DPO and Toxicity seems to me very related in terms of methodology.
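To make the quoted monitoring idea a bit more concrete, here's a hedged sketch (hypothetical interfaces throughout) of estimating how often suspected deceit-circuit components fire across a prompt distribution, so that a benign baseline can be compared against deployment:

```python
import numpy as np

def circuit_activation_rates(model, circuit_components, prompts, threshold=0.0):
    """Hedged sketch: fraction of prompts on which each suspected deceit-circuit
    component (head / neuron / SAE feature) is active, per a hypothetical
    `model.component_activation` helper returning a scalar activation."""
    rates = {}
    for component in circuit_components:
        fired = [model.component_activation(p, component) > threshold for p in prompts]
        rates[component] = float(np.mean(fired))
    return rates

# Usage idea: if rates on benign prompts are much higher than on a baseline, or
# some components are almost always active, that's the worrying pattern described
# in the quoted passage above.
```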

Any thoughts on how helpful it might be to try to automate the manual inspection and evaluation (of task-relevance for each feature in the circuit) from section 4 of the paper, using e.g. a future version of MAIA (to reduce human costs / make the proposal more scalable)?
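For concreteness, the kind of automation I'm imagining might look roughly like this (everything below is a hypothetical interface, not the paper's procedure or MAIA's actual API): for each feature in the recovered circuit, an agent inspects top activating examples and ablation effects and returns a structured task-relevance judgement.

```python
def auto_evaluate_circuit_features(model, agent, circuit_features, task_description):
    """Hypothetical sketch: replace manual inspection of each circuit feature with
    an interpretability-agent judgement of task relevance."""
    judgements = {}
    for feature in circuit_features:
        evidence = {
            # Prompts / tokens on which the feature activates most strongly.
            "top_activating_examples": model.top_activating_examples(feature, k=20),
            # How much task performance changes when the feature is ablated.
            "ablation_effect": model.ablate_and_eval(feature, task_description),
        }
        # The agent returns a structured verdict, e.g. {"relevant": bool, "rationale": str}.
        judgements[feature] = agent.judge_task_relevance(task_description, evidence)
    return judgements
```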

Language model agents for interpretability (e.g. MAIA, FIND) seem to be making fast progress, to the point where I expect it might be feasible to safely automate large parts of interpretability workflows soon.

Given the above, it might be high-value to start testing the integration of more interpretability tools into interpretability (V)LM agents like MAIA, and maybe even to run randomized controlled trials to test for any productivity improvements they could already be providing.

For example, probing / activation steering workflows seem to me relatively short-horizon and at least somewhat standardized, to the point that I wouldn't be surprised if MAIA could already automate very large chunks of that work (with proper tool integration). (Disclaimer: I haven't done much hands-on probing / activation steering work [though I do follow such work quite closely and have supervised related projects], so my views here might be inaccurate.)
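As a rough illustration of why this looks automatable, here's the skeleton of a generic probing + activation-steering loop (a sketch, not any particular codebase; the model-side helpers are hypothetical):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def probe_and_steer(model, pos_prompts, neg_prompts, test_prompts, layer, alpha=4.0):
    """Generic probing + activation-steering skeleton (hypothetical model helpers)."""
    # 1. Collect residual-stream activations for contrastive prompt sets.
    pos_acts = np.stack([model.get_residual_activations(p, layer) for p in pos_prompts])
    neg_acts = np.stack([model.get_residual_activations(p, layer) for p in neg_prompts])

    # 2. Fit a linear probe for the concept.
    X = np.concatenate([pos_acts, neg_acts])
    y = np.array([1] * len(pos_acts) + [0] * len(neg_acts))
    probe = LogisticRegression(max_iter=1000).fit(X, y)

    # 3. Use the difference-in-means direction as a steering vector.
    steering_vector = pos_acts.mean(axis=0) - neg_acts.mean(axis=0)

    # 4. Generate with the steering vector added at the chosen layer and collect
    #    outputs for downstream (possibly agent-run) evaluation.
    with model.add_steering_hook(layer, alpha * steering_vector):
        steered_outputs = [model.generate(p) for p in test_prompts]
    return probe, steering_vector, steered_outputs
```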

While I'm not sure I can tell any 'pivotal story' about such automation, if I imagine e.g. 10x more probing and activation steering research per researcher per year as a result of it, that still seems like it could be a huge win. Such work could also provide much more evidence (either for or against) the linear representation hypothesis.

You might be interested in Concept Algebra for (Score-Based) Text-Controlled Generative Models, which both uses a somewhat similar empirical methodology for concept editing and provides theoretical reasons to expect the linear representation hypothesis to hold (I'd also interpret the findings here, and those from other recent works like Anthropic's sleeper probes, as evidence for the linear representation hypothesis broadly).
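For readers less familiar with it, the version of the linear representation hypothesis I have in mind is roughly (my paraphrase, not the paper's exact formalism): high-level concepts correspond to directions / low-dimensional subspaces of the representation, and concept editing amounts to swapping out the component in the relevant subspace:

$$
h \;\approx\; h_0 + \sum_{c} \alpha_c\, w_c ,
\qquad
h_{\text{edit}} \;=\; h - P_{W_C}\, h + P_{W_C}\, \tilde{h} ,
$$

where $w_c$ is a direction associated with concept $c$, $W_C$ is the subspace for the concept being edited, $P_{W_C}$ is the orthogonal projection onto it, and $\tilde{h}$ is a representation carrying the desired value of that concept.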
