PhD student at UCL. Interested in mech interp.
What I am hoping will happen with shoggoth/face is that the Face will develop skills related to euphemisms, censorship, etc. and the Shoggoth will not develop those skills. For example maybe the Face will learn an intuitive sense of how to rephrase manipulative and/or deceptive things so that they don't cause a human reading them to think 'this is manipulative and/or deceptive,' and then the Shoggoth will remain innocently ignorant of those skills.
I'm relatively sceptical that this will pan out in the absence of some objective that incentivises this specific division of labour. Assuming you train the whole thing end-to-end, I'd expect there to be many possible ways to split the relevant functionality between the Shoggoth and the Face. The one you outline is only one of many possible solutions, and I don't see why it would be selected for. I think this is also a reasonable conclusion to draw from past work on 'hierarchical' ML, which has been the subject of many different works over the past 10 years and has broadly failed to deliver, IMO.
I’m concerned that progress in interpretability research is ephemeral, being driven primarily by proxy metrics that may be disconnected from the end goal (understanding by humans). (Example: optimising for the L0 metric in SAE interpretability research may lead us to SAEs with more split features, even when that splitting is unintuitive by human reckoning.)
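For concreteness, here is a toy sketch of my own (not from any particular paper) of how the L0 proxy is typically computed; note that nothing in it rewards the features being human-intuitive, which is the gap I'm worried about.

```python
import torch

def l0_metric(feature_acts: torch.Tensor) -> float:
    """Average number of active SAE features per token.

    feature_acts: [n_tokens, n_features] post-ReLU SAE feature activations.
    """
    return (feature_acts != 0).float().sum(dim=-1).mean().item()

# Sparse-ish random activations just to exercise the function.
acts = torch.relu(torch.randn(1024, 4096) - 2.0)
print(f"L0 = {l0_metric(acts):.1f}")
```

Splitting one feature into several narrower ones that fire on disjoint subsets of tokens leaves this number essentially unchanged, even though a human now has to track more features to tell the same story.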
It seems important for the field to agree on some common benchmark / proxy metric that is proven to be indicative of downstream human-rated interpretability, but I don’t know of anyone doing this. Similar to the role of BLEU in facilitating progress in NLP, I imagine having a standard metric would enable much more rapid and concrete progress in interpretability.
I don't have a strong take haha. I'm just expressing my own uncertainty.
Here's my best reasoning: under a Bayesian view, a sufficiently small posterior probability is functionally equivalent to impossibility (for downstream purposes, anyway). If models reason in a Bayesian way, we wouldn't expect the deductive and abductive experiments discussed above to differ much (assuming the abductive setting gives the model a sufficiently confident posterior).
But I guess this could still be a good indicator of whether models do reason in a Bayesian way. So maybe still worth doing? Haven't thought about it much more than that, so take this w/ a pinch of salt.
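To spell out the intuition behind 'functionally equivalent to impossibility' (my own sketch, not from the post): with a favoured hypothesis $H_1$ and a competing hypothesis $H_2$, repeated observations multiply the posterior odds by the likelihood ratio, so any alternative that fits each observation even slightly worse gets driven towards zero:

$$\frac{P(H_2 \mid x_{1:n})}{P(H_1 \mid x_{1:n})} \;=\; \frac{P(H_2)}{P(H_1)} \prod_{i=1}^{n} \frac{P(x_i \mid H_2)}{P(x_i \mid H_1)} \;\longrightarrow\; 0 \quad \text{as } n \to \infty \text{ (when each } x_i \text{ is more likely under } H_1\text{)},$$

at which point acting on 'very improbable' looks the same as acting on 'impossible'.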
As someone with very little working knowledge of evals, I think the following open-source resources would be useful for pedagogy
Maybe similar in style to https://www.neelnanda.io/mechanistic-interpretability/quickstart
It's also hard to overstate the importance of tooling that is:
I suspect TransformerLens + associated Colab walkthroughs have had a huge impact in popularising mechanistic interpretability.
That makes sense! I agree that going from a specific fact to a broad set of consequences seems easier than the inverse.
the tendency to incorrectly identify as a pangolin falls in all non-pangolin tasks (except for tiny spikes sometimes at iteration 1)
I understand. However, there's a subtle distinction here which I didn't explain well. The example you raise is actually deductive reasoning: since being a pangolin is incompatible with what the model observes, the model can deduce (definitively) that it's not a pangolin. 'Explaining away', by contrast, is about competing hypotheses that would generate the same data but that you consider unlikely.
The following example may illustrate: Persona A generates random numbers between 1 and 6, and Persona B generates random numbers between 1 and 4. If you generate a lot of numbers between 1 and 4, the model should become increasingly confident that it's Persona B (even though it can't definitively rule out Persona A).
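To make that concrete, here is a quick numerical sketch of my own, assuming a uniform prior over the two personas:

```python
from fractions import Fraction

# Uniform prior over the two personas.
p_A, p_B = Fraction(1, 2), Fraction(1, 2)

# Observe ten numbers, all of which happen to fall in 1..4.
# Likelihood per observation: 1/6 under Persona A, 1/4 under Persona B.
for _ in range(10):
    p_A, p_B = p_A * Fraction(1, 6), p_B * Fraction(1, 4)
    total = p_A + p_B
    p_A, p_B = p_A / total, p_B / total

print(float(p_B))  # ~0.98: Persona A is never ruled out, just 'explained away'
```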
On further reflection, I don't know if ruling out improbable scenarios is that different from ruling out impossible scenarios, but I figured it was worth clarifying.
Edit: reworded last sentence for clarity
Nice article! I'm still somewhat concerned that part of o1's performance increase can be attributed to the benchmarks (Blocksworld, ARC-AGI) having existed on the internet for a while, and thus having made their way into updated training corpora (which of course we don't have access to). An alternative hypothesis would then be that o1 is still doing pattern matching, just with better and more relevant data to pattern-match against. Still, I don't think this can fully explain the observed increase in capabilities, so I agree with the high-level argument you present.
Interesting preliminary results!
Do you expect abductive reasoning to be significantly different from deductive reasoning? If not (and I put quite high weight on this), then it seems like Berglund et al. (2023) already tells us a lot about the cross-context abductive reasoning capabilities of LLMs, i.e. replicating their methodology wouldn't be very exciting.
One difference I'd note here is that abductive reasoning is uncertain / ambiguous; maybe you could test whether the model also reduces its belief in competing hypotheses (cf. 'explaining away').
This seems pretty cool! The proposed data augmentation technique is simple and effective. I'd be interested to see a scaled-up version of this (more harmful instructions, more models, etc.). It would also be cool to see some interpretability studies to understand how the internal mechanisms change under 'deep' alignment (and to compare this with previous work, such as https://arxiv.org/abs/2311.12786 and https://arxiv.org/abs/2401.01967).
Interesting stuff! I'm very curious as to whether removing layer norm damages the model in some measurable way.
One thing that comes to mind: previous work finds that the final LN mediates 'confidence' through 'entropy neurons'. If you've trained for long enough, I would expect these neurons to no longer be present, which raises the question of whether the model still exhibits this kind of confidence regulation.
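In case it's useful, here's a rough sketch of how one might look for candidate entropy neurons (my own, not from your code; it assumes TransformerLens and uses vanilla GPT-2 as a stand-in for the LN-free checkpoint): the idea is to find final-MLP neurons whose output weights mostly lie in the low-singular-value directions of the unembedding, so they mainly modulate the final LayerNorm scale rather than the logits directly.

```python
import torch
from transformer_lens import HookedTransformer  # assumes TransformerLens is installed

model = HookedTransformer.from_pretrained("gpt2")  # stand-in; swap in the LN-free checkpoint

W_out = model.W_out[-1]   # [d_mlp, d_model], output weights of the final MLP layer
W_U = model.W_U           # [d_model, d_vocab], unembedding

# Singular directions of the unembedding; the smallest ones form an "effective null space":
# writing to them barely moves the logits, but still changes the residual norm that
# the final LayerNorm divides by.
U, S, _ = torch.linalg.svd(W_U, full_matrices=False)   # U: [d_model, d_model]
k = 30                                                  # arbitrary choice of null-space size
null_basis = U[:, -k:]

# Fraction of each neuron's output-weight norm that falls in that effective null space;
# neurons with a high fraction (and non-trivial norm) are entropy-neuron candidates.
frac_in_null = (W_out @ null_basis).norm(dim=-1) / W_out.norm(dim=-1)
top = torch.topk(frac_in_null, 10)
print(top.indices.tolist(), [round(v, 3) for v in top.values.tolist()])
```

Running the same check on your LN-free model would show whether such neurons survive the removal of the mechanism they rely on.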
My Seasonal Goals, Jul - Sep 2024
This post is an exercise in public accountability and harnessing positive peer pressure for self-motivation.
By 1 October 2024, I am committing to have produced:
Habits I am committing to that will support this: