All of Jan Wehner's Comments + Replies

Thanks for writing this, I think it's great to spell out the ToI behind this research direction!

You touch on this, but I wanted to make it explicit: Activation Engineering can also be used to detect when a system is "thinking" about some dangerous concept. If you have a steering vector for, e.g., honesty, you can measure its similarity with the activations during a forward pass to find out whether the system is being dishonest.
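Concretely, I imagine something like the sketch below (my own illustration, not from the post): take the activations at the layer the steering vector was derived from and flag tokens whose activation aligns strongly with the concept direction. The `honesty_vector`, the layer choice, and the threshold are all placeholder assumptions.

```python
# Minimal sketch: use a steering vector as a per-token concept detector.
# Assumes `activations` are hidden states captured (e.g. via a forward hook)
# at the same layer the steering vector was extracted from.
import torch
import torch.nn.functional as F

def detect_concept(activations: torch.Tensor,
                   steering_vector: torch.Tensor,
                   threshold: float = 0.5) -> torch.Tensor:
    """activations: (seq_len, d_model) hidden states from one layer.
    steering_vector: (d_model,) direction for the concept (e.g. honesty).
    Returns a boolean mask of tokens whose activation aligns with the concept."""
    sims = F.cosine_similarity(activations, steering_vector.unsqueeze(0), dim=-1)
    return sims > threshold

# Hypothetical usage: flagged = detect_concept(layer_acts, honesty_vector, 0.4)
```

In practice the threshold would have to be calibrated on known honest/dishonest generations, but the basic monitoring loop is just this similarity check.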

You might also be interested in my (less thorough) summary of the ToIs for Activation Engineering.

Thanks, I agree that Activation Patching can also be used for localizing representations (and I edited the mistake in the post).

Jan Wehner

Hey Christopher, this is really cool work. I think your idea of representation tuning is a very nice way to combine activation steering and fine-tuning. Do you have any intuition as to why fine-tuning towards the steering vector sometimes works better than simply steering towards it?

If you keep on working on this I’d be interested to see a more thorough evaluation of capabilities (more than just perplexity) by running it on some standard LM benchmarks. Whether the model retains its capabilities seems important to understand the safety-capabilities trade-off...

Christopher Ackerman
Hi, Jan, thanks for the feedback! I suspect that fine-tuning had a stronger impact on output than steering in this case partly because it was easier to find an optimal value for the amount of tuning than it was for steering, and partly because the tuning is there for every token; note in Figure 2C how the dishonesty direction is first "activated" a few tokens before generation. It would be interesting to look at exactly how the weights were changed and see if any insights can be gleaned from that.

I definitely agree about the more robust capabilities evaluations. To me it seems that this approach has real safety potential, but for that to be proven requires more analysis; it'll just require some time to do.

Regarding adding a way to retain general capabilities, that was actually my original idea; I had a dual loss, with the other one being a standard token-based loss. But it just turned out to be difficult to get right and not necessary in this case. After writing this up, I was alerted to the Zou et al. Circuit Breakers paper, which did something similar but more sophisticated; I might try to adapt their approach.

Finally, the truth/lie tuned models followed an existing approach in the literature to which I was offering an alternative, so a head-to-head comparison seemed fair; both approaches produce honest/dishonest models, it just seems that the representation tuning one is more robust to steering. TBH I'm not familiar with GCG, but I'll check it out. Thanks for pointing it out.
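For concreteness, the dual-loss idea mentioned there could look roughly like the sketch below. This is only an illustration of combining a representation-alignment loss with a standard token loss, not the code from the post; the weighting `alpha`, the layer choice, and the tensor shapes are assumptions.

```python
# Rough sketch of a dual loss: align hidden states with a target direction
# while a standard next-token loss preserves general capabilities.
import torch
import torch.nn.functional as F

def dual_loss(logits, labels, hidden_states, target_vector, alpha=1.0):
    """logits: (batch, seq, vocab); labels: (batch, seq) token ids;
    hidden_states: (batch, seq, d_model) from the layer being tuned;
    target_vector: (d_model,) direction to tune toward (e.g. honesty)."""
    # Standard next-token cross-entropy (labels shifted by one position).
    token_loss = F.cross_entropy(
        logits[:, :-1].reshape(-1, logits.size(-1)),
        labels[:, 1:].reshape(-1),
    )
    # Representation loss: 1 - cosine similarity to the target direction,
    # averaged over all token positions.
    rep_loss = (1 - F.cosine_similarity(
        hidden_states, target_vector.view(1, 1, -1), dim=-1)).mean()
    return token_loss + alpha * rep_loss
```

As I understand it, the Circuit Breakers approach likewise pairs a representation-level term with a term meant to preserve capabilities, just in a more sophisticated way.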

I agree that focusing too much on gathering data now would be a mistake. I believe thinking about data for IRL now is mostly valuable for identifying the challenges that make IRL hard. Then we can try to develop algorithms that solve those challenges, or find out that IRL is not a tractable solution for alignment.

Thank you, Erik, that was super valuable feedback and gives me some food for thought.

It also seems to me that humans being suboptimal planners and not knowing everything the AI knows seem like the hardest (and most informative) problems in IRL. I'm curious what you'd think about this approach for addressing the suboptimal-planner sub-problem: "Include models from cognitive psychology about human decision-making in IRL, to allow IRL to better understand the decision process." This would give IRL more realistic assumptions about the human planner and possibly allow it to understand its irrationalities and get to the values which drive behaviour.
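As a toy illustration of what that could mean in code (my construction, not a reference to any particular paper): model the human as noisily rational (Boltzmann/Luce choice) rather than optimal, and score candidate reward parameters by the likelihood of the observed choices. The shapes, the `beta` coefficient, and the function name are all illustrative assumptions.

```python
# Toy sketch: an IRL-style likelihood with an explicit, simple human decision
# model (softmax-noisy rationality) instead of assuming an optimal planner.
import numpy as np
from scipy.special import logsumexp

def boltzmann_log_likelihood(reward_weights, features, demos, beta=2.0):
    """reward_weights: (d,) linear reward parameters.
    features: (n_states, n_actions, d) feature vectors phi(s, a).
    demos: list of (state, action) index pairs observed from the human.
    beta: rationality coefficient (lower beta = noisier, less rational)."""
    q = features @ reward_weights                      # (n_states, n_actions)
    log_policy = beta * q - logsumexp(beta * q, axis=1, keepdims=True)
    return sum(log_policy[s, a] for s, a in demos)
```

Maximizing this over `reward_weights` recovers values under the assumed decision model; a richer model from cognitive psychology (myopia, prospect-theory-style probability weighting, attention limits) would replace the softmax choice rule while the rest of the pipeline stays the same.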

Also, do you have a pointer to something to read on preference comparisons?

Erik Jenner
Yes, this is one of two approaches I'm aware of (the other being trying to somehow jointly learn human biases and values, see e.g. https://arxiv.org/abs/1906.09624). I don't have very strong opinions on which of these is more promising, they both seem really hard.

What I would suggest here is again to think about how to fail fast. The thing to avoid is spending a year on a project that's trying to use a slightly more realistic model of human planning, and then realizing afterwards that the entire approach is doomed anyway. Sometimes this is hard to avoid, but in this case I think it makes more sense to start by thinking more about the limits of this approach. For example, if our model of human planning is slightly misspecified, how does that affect the learned reward function, and how much regret does that lead to? If slight misspecifications are already catastrophic, then we can probably forget about this approach, since we'll surely only get a crude approximation of human planning.

Also worth thinking about other obstacles to IRL. One issue is "how do we actually implement this?". Reward model hacking seems like a potentially hard problem to me if we just do a naive setup of reward model + RL agent. Or if you want to do something more like CIRL/assistance games, you need to figure out how to get a (presumably learned) agent to actually reason in a CIRL-like way (Rohin mentions something related in the second-to-last bullet here).

Arguably those obstacles feel more like inner alignment, and maybe you're more interested in outer alignment. But (1) if those turn out to be the bottlenecks, why not focus on them?, and (2) if you want your agent to do very specific cognition, such as reasoning in a CIRL-like way, then it seems like you might need to solve a harder inner alignment problem, so even if you're focused on outer alignment there are important connections. I think there's a third big obstacle (in addition to "figuring out a good human model seems hard", and "