LESSWRONG

Tom McGrath — LessWrong

Do you not expect that leading capability companies will be among your primary customers?

No, it seems highly unlikely. Considered from a purely commercial perspective - which I think is the right one when considering the incentives - they are terrible customers! Consider:

They are close to a monopsony (as any one of them would want exclusivity), so the deal would have to be truly enormous to work.
If the deal is enormous they have a huge incentive to cut us out, and the tech is very close to their core competencies.
Whatever techniques end up being good are likely to be major modifications to the training stack that would be hard to integrate, so the options for doing

... (read more)

Tom McGrath, 10d

I think you might find the final section of my doc interesting: https://www.goodfire.ai/blog/intentional-design#developing-responsibly

I would only endorse using this kind of technique in a potentially risky situation like a frontier training run if we were able to find a strong solution to the train/test issue described here.

I also make a commitment to us not working on self-improving superintelligence, which I was surprised to need to make but is apparently not a given?

Tom McGrath, 10d

Your sense is correct

[Linkpost] Play with SAEs on Llama 3

Tom McGrath

Tom McGrath, Eric Ho, Dan Balsam

We (Goodfire) just put our research preview live - you can play with Llama 3 and use sparse autoencoders to read & write from its internal activations. This is a linkpost for:

The research preview.
Our blog post about building it.

Taking research and turning it into something you can actually use and play with has been great. It's surprising how different it feels to iterate on something when you expect it to actually be used; I think it's definitely pushed the quality of what you can do with SAEs up a notch.

Replying to "Safety as a Scientific Pursuit" (2024)

Tom McGrath, 2y

Very much appreciate the link post - I’d been trying to write a summary/contextualisation for LW and this is a much better one than I’d come up with.

I’d be very grateful for the LW community’s thoughts (especially any pushback). I expect this will be the source of the strongest counterarguments.

Replying to "Safety as a Scientific Pursuit" (2024)

Tom McGrath, 2y

Thanks! I really like inductive vs deductive and would probably have used them if I’d thought of it.

[Paper] All's Fair In Love And Love: Copy Suppression in GPT-2 Small

CallumMcDougall

CallumMcDougall, Arthur Conmy, Tom McGrath, Neel Nanda

This is an accompanying blog post to work done by Callum McDougall, Arthur Conmy and Cody Rushing as part of SERI MATS 2023. The work was mentored by Neel Nanda and Tom McGrath. You can find our full paper at https://arxiv.org/abs/2310.04625. We i) summarize our key results, ii) discuss limitations and future work and iii) list lessons learnt from the project.

Key Results

Copy Suppression

In the paper, we define copy suppression as the following algorithm:

If components in earlier layers predict a certain token, and this token appears earlier in the context, the attention head suppresses it.
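As a rough illustration of the rule above, here is a toy sketch in Python. It is not the mechanism from the paper: the real head acts through attention on the residual stream, whereas this sketch works directly on a dictionary of logits. The function name `copy_suppress` and the parameters `top_k` and `penalty` are hypothetical, chosen only for this example.

```python
def copy_suppress(logits, context_tokens, top_k=1, penalty=2.0):
    """Toy copy suppression over a dict mapping token -> logit.

    Assumption: "earlier layers predict a token" is modelled as the token
    being among the top_k highest logits. This is a simplification.
    """
    # Tokens currently being predicted: the top_k highest-logit tokens.
    predicted = sorted(logits, key=logits.get, reverse=True)[:top_k]
    seen = set(context_tokens)

    adjusted = dict(logits)  # copy, so the input is left untouched
    for tok in predicted:
        # Suppress a predicted token only if it already appeared in context.
        if tok in seen:
            adjusted[tok] -= penalty
    return adjusted


# Made-up logits for the next token after "All's fair in love and".
logits = {"love": 3.0, "war": 2.5, "peace": 0.5}
context = ["All's", "fair", "in", "love", "and"]
out = copy_suppress(logits, context)
print(max(out, key=out.get))
```

With these made-up numbers, "love" is the naive top prediction and already appears in the context, so it is pushed down and "war" becomes the most likely token.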

We show that attention head L10H7 in GPT2-Small (and to a lesser extent L11H10) both perform copy suppression across... (read 2220 more words →)

Replying to "Acquisition of Chess Knowledge in AlphaZero": probing AZ over time

Tom McGrath, 4y

I'm one of the authors on this paper - happy to answer any questions/discuss if anyone is interested.

Replying to "Acquisition of Chess Knowledge in AlphaZero": probing AZ over time

Tom McGrath, 4y

Thanks for the summary! Your first bullet point was my motivation for doing this. I think it's important to test out interpretability ideas in more challenging domains.

We didn't really do much interpretability in this paper; it's more meta-interpretability in a sense (i.e. studying whether interpretability should in principle be possible). I'd say section 4 is worth a look, especially section 4.5, which covers fundamental and practical challenges to probing. Section 7 has some NMF analysis, and we open-sourced the NMF factors, which you might find interesting.