Yeah, we've thought about it but haven't run any experiments yet. An easy trick would be to add a term to the crosscoder reconstruction loss:
with
So basically a generalization is to change the crosscoder loss to:
with -1, you only focus on reconstructing the diff; with 0, you get the normal crosscoder reconstruction objective back. -1 is quite close to a diff-SAE, the only difference ...
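To make the interpolation concrete, here's one plausible way to write it out (my notation, sketched for a single reconstruction target; $a_{\text{base}}, a_{\text{chat}}$ are the two models' activations, $\hat{x}$ the reconstruction, $f_i$ the latent activations, and $\alpha$ the interpolation parameter):

$$\mathcal{L}(\alpha) = \bigl\lVert \hat{x} - \bigl(a_{\text{chat}} + \alpha\, a_{\text{base}}\bigr) \bigr\rVert_2^2 + \lambda \sum_i |f_i|$$

so that $\alpha = -1$ makes the target the diff $a_{\text{chat}} - a_{\text{base}}$, while $\alpha = 0$ recovers the standard reconstruction target.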
Maybe emergent misalignment models are already Slytherin... https://huggingface.co/ModelOrganismsForEM
Yes, that's what I meant. I agree that the fact that patching the error boosts the alternative completion makes this explanation much weaker (although it could still be a combination of the two).
I think it'd be super interesting to expand your analysis to linear vs. non-linear error to understand which parts matter, and then explore why!
Nice work! Could you elaborate on how you think your findings relate to Engels et al.'s paper "Decomposing The Dark Matter of Sparse Autoencoders", which you cite? For context: they find that only ~50% of the SAE error can be linearly predicted from the input activation. They hypothesize that the 50% linearly predictable component contains features not yet learned, while the remaining 50% consists of errors introduced by the SAE itself rather than meaningful model features (@josh-engels correct me if I'm misrepresenting the paper's findings).
It would be interest...
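For reference, the linear-predictability measurement from Engels et al. can be sketched roughly like this (toy data and a placeholder "SAE" standing in for the real thing; a real analysis would use the trained SAE's actual reconstructions):

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy stand-ins: input activations x and a placeholder "SAE" reconstruction.
n, d = 2000, 64
x = rng.normal(size=(n, d))
W = rng.normal(size=(d, d)) / np.sqrt(d)
recon = np.maximum(x @ W, 0) @ W.T / np.sqrt(d)  # not a real SAE
err = x - recon  # the SAE error term

# Fit a linear map x -> err (least squares, centered) and measure the
# fraction of the error variance it explains.
xc, ec = x - x.mean(0), err - err.mean(0)
A, *_ = np.linalg.lstsq(xc, ec, rcond=None)
fvu = np.sum((ec - xc @ A) ** 2) / np.sum(ec ** 2)
frac_linear = 1 - fvu
print(f"fraction of SAE error linearly predictable: {frac_linear:.2f}")
```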
Just asked Claude for thoughts on some technical mech interp paper I'm writing. The difference with and without the non-sycophantic prompt is dramatic (even with extended thinking):
Me: What do you think of this bullet point structure for this section based on those results?
Claude (normal): I think your proposed structure makes a lot of sense based on the figures and table. Here's how I would approach reorganizing this section: {expands my bullet points but sticks to my structure}
Claude (non-sycophantic): I like your proposed structure, but I think it misses...
Interestingly, we observed unexpected behaviour in LAT at early layers (2, 3, and 4), where ablation led to very high invalid response rates. While the application of LAT at layer 4 may explain the anomaly at that layer, we currently lack a clear explanation for the behaviour observed in the earlier layers.
Did you look at generation examples for this one? Maybe steering at this layer just breaks the model?
I'm a bit concerned the experiment is specifically designed for your algorithm rather than being a general reward-hacking test. For instance, the experiment has a single token that should be avoided at each step, and your algorithm updates negatively on a single token. If there are 2 tokens that give you R=1, do you still expect your algorithm to work? If I understood correctly, you greedily sample to select the token to avoid, so you can't penalize 2 tokens at a time.
Even if your algorithm works for 2 tokens, I'd like to see a more realistic scenario, maybe similar t...
Yes, this is what I meant. Reposting here insights @Arthur Conmy gave me on Twitter:
In general I expect the encoder directions to basically behave like the decoder directions with noise. This is because the encoder has to figure out how much each feature fires while keeping track of interfering features due to superposition, and this adjustment will make it messier.
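One quick way to sanity-check this intuition on a trained SAE is to compare each feature's encoder and decoder directions by cosine similarity. A toy sketch (random weights built to match the "decoder direction + noise" hypothesis; for a real SAE you'd load `W_enc`/`W_dec` from the checkpoint):

```python
import numpy as np

rng = np.random.default_rng(0)
d_model, n_feats = 64, 256

# Toy SAE weights: unit-norm decoder directions, plus noisy encoder
# counterparts mimicking the "decoder direction + noise" hypothesis.
W_dec = rng.normal(size=(n_feats, d_model))
W_dec /= np.linalg.norm(W_dec, axis=1, keepdims=True)
W_enc = W_dec + 0.05 * rng.normal(size=(n_feats, d_model))

# Cosine similarity between each feature's encoder and decoder direction.
W_enc_n = W_enc / np.linalg.norm(W_enc, axis=1, keepdims=True)
cos = np.sum(W_enc_n * W_dec, axis=1)
print(f"mean enc/dec cosine similarity: {cos.mean():.2f}")
```

If the hypothesis holds for a real SAE, the cosine similarities should be high but noticeably below 1.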
We are given a near-optimal policy trained on an MDP. We start with simple gridworlds and scale up to complex ones like Breakout. For evaluation using a learned value function, we will consider actor-critic agents, like the ones trained by PPO. Our goal is to find activations within the policy network that predict the true value accurately. The following steps are described in terms of the state-value function, but could be performed analogously for predicting Q-values. Note that this problem is very similar to offline reinforcement learning with pret...
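The probing step above could be sketched like this (toy data standing in for policy activations and returns; in a real run you'd collect hidden activations from the policy network and regress them against empirical or critic-estimated returns):

```python
import numpy as np

rng = np.random.default_rng(0)
n_states, d_act = 5000, 128

# Toy stand-ins: hidden activations, and "true" state values that are a
# noisy linear function of them (the hypothesis being tested is that
# value is linearly decodable from the policy's activations).
acts = rng.normal(size=(n_states, d_act))
w_true = rng.normal(size=d_act)
values = acts @ w_true + 0.1 * rng.normal(size=n_states)

# Ridge-regression probe from activations to state values.
lam = 1.0
w_probe = np.linalg.solve(acts.T @ acts + lam * np.eye(d_act),
                          acts.T @ values)
pred = acts @ w_probe
r2 = 1 - np.sum((values - pred) ** 2) / np.sum((values - values.mean()) ** 2)
print(f"probe R^2: {r2:.3f}")
```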
Thanks for your comment! Re: artificial data, agreed that would be a good addition.
Sorry for the GIFs, maybe I should have embedded YouTube videos instead.
Re: middle layer, we actually probed the middle layers, but the "which side the ball is on / which side the ball is approaching" features are really salient here.
Re: single player, yes, Robert had some thoughts about it, but the multiplayer setting ended up lasting until the end of the SPAR cohort. I'll send his notes in an extra comment.
As explained by Sumio Watanabe (
This link is dead, maybe link to his personal page instead?
https://sites.google.com/view/sumiowatanabe/home
Thanks for the great post, I really enjoyed reading it! I love this research direction combining unsupervised methods with steering vectors; looking forward to your next findings. Just a quick question: in the conversation you have in the red-teaming section, is the learned vector applied to every token generated during the conversation?
I'm wondering: can we make safety tuning more robust to the "add the accept-every-instruction steering vector" attack by training the model adversarially, with an adversarial model that tries to learn steering vectors that maximize harmfulness?
One concern would be that doing this makes the model less interpretable, but on the other hand it might make the safety tuning much more robust?
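A crude sketch of the adversary's inner loop in such a setup (toy linear "model" and a hypothetical harmfulness readout `h`, neither of which is a real model component; a real version would optimize the vector against the actual model and then fold the worst-case vector into safety tuning):

```python
import numpy as np

rng = np.random.default_rng(0)
d = 32

# Hypothetical stand-ins: a linear harmfulness readout h over the
# residual stream, and a baseline activation for some prompt.
h = rng.normal(size=d)
act = rng.normal(size=d)

# Adversary: gradient ascent on harmfulness w.r.t. a steering vector v,
# with the vector's norm kept bounded.
v = np.zeros(d)
lr, n_steps = 0.1, 50
for _ in range(n_steps):
    v += lr * h  # d/dv [(act + v) @ h] = h for this linear toy
    v *= min(1.0, 5.0 / np.linalg.norm(v))  # norm constraint

harm_before = act @ h
harm_after = (act + v) @ h
print(f"harmfulness score: {harm_before:.2f} -> {harm_after:.2f}")
```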
Yes, I'm also curious about this @mishajw, did you check the actual accuracy of the different probes?
You can get ~75% just by computing the OR. But we found that it achieves better than 75% only at the last layer and at step 16000 of Pythia-70m training, see this video.
Hi Nathan, I'm not sure if I understand your critique correctly. The algorithm we describe does not try to "maximize the expected likelihood of harvesting X apples". It tries to find a policy that, given its current knowledge of the world, will achieve an expected return of X apples. That is, it does not care about the probability of getting exactly X apples, but rather the average number of apples it will get over many trials. Does that make sense?
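A toy numeric illustration of the distinction (hypothetical policies, assuming X = 10 apples): two policies can have the same expected return of X while differing completely in their probability of getting exactly X.

```python
import numpy as np

rng = np.random.default_rng(0)
n_trials, X = 100_000, 10

# Policy A: always harvests exactly 10 apples.
returns_a = np.full(n_trials, 10)
# Policy B: harvests 0 or 20 apples with equal probability.
returns_b = rng.choice([0, 20], size=n_trials)

# Both have expected return ~10, so an algorithm targeting expected
# return treats them alike, even though only A ever gets exactly X.
print(returns_a.mean(), returns_b.mean())
print((returns_a == X).mean(), (returns_b == X).mean())
```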
Thanks for the interesting post! Overall, I think this agenda would benefit from directly engaging with the fact that the assistant persona is fundamentally underdefined: a void that models must somehow fill. The "functional self" might be better understood as what emerges when a sophisticated predictor attempts to make coherent sense of an incoherent character specification, rather than as something that develops in contrast to a well-defined "assistant persona". I've attached the notes I took while reading the post below: