LESSWRONG

All of Neel Nanda's Comments + Replies

I'm confused. It feels like you're basically saying that reality is uncertain, prediction markets reflect this, but in order to be useful for affecting the minds of ordinary voters (a specific use case that is not the main use case I think prediction markets matter for) they must be certain or near certain

Follow me on TikTok

Neel Nanda12d130

Extremely reasonable strategic pivot. How would you explain AI risk to a TikTok audience?

2lsusr12d

In a perfect world I'd explain how moral hazard affects political memetics, but I feel it's beyond my current skill level to fit that into TikTok's attention span. Therefore I think it'd be more effective to copy this excellent post by lc. I'd start by explaining how the computer industry's epistemics work, and then generalize those models to AI.

Good Research Takes are Not Sufficient for Good Strategic Takes

Neel Nanda13d20

Political influence seems a very different skill to me? Lots of very influential politicians have been very incompetent in other real world ways

Alternatively, they are linked to some big major idea in governance or technical safety, often by spotting something missing years before it became relevant.

This is just a special case (and an unusually important one) of a good forecasting record, right?

1Hyperion11d

I suppose I mean influence over politics, policy, or governance (this is very high level since these are all distinct and separable), rather than actually being political necessarily. I do think there are some common skills, but actually being a politician weighs so many other factors more heavily that the strategic skill is not selected on very strongly at all. Being a politician's advisor, on the other hand... Yes, it's a special case, but importantly one that is not evaluated by Brier score or Manifold bucks.

Good Research Takes are Not Sufficient for Good Strategic Takes

Neel Nanda13dΩ230

I think the correct question is how much of an update should you make in an absolute sense rather than a relative sense? Many people in this community are overconfident and if you decide that every person is less worth listening to than you thought this doesn't change who you listen to, but it should make you a lot more uncertain in your beliefs

Good Research Takes are Not Sufficient for Good Strategic Takes

Neel Nanda13d50

Interesting. Thanks for the list. That seemed like a pretty reasonable breakdown to me. I think mechanistic interpretability does train some of them, in particular two, three and maybe six. But I agree that things that involve thinking about society and politics and power and economics etc as a whole do seem clearly more relevant.

One major concern I have is that it's hard to judge skill in domains with worse feedback loops because there is not feedback on who is correct. I'm curious how confident you are in your assessment of who has good takes or is good in these fields, and how you determine this?

1Hyperion13d

I guess that's the main element I didn't mention: many people on this forum would suggest judging via predictive skill/forecasting success. I think this is an ok heuristic, but of course the long time horizons involved in many strategic questions make it hard to judge (and Tetlock has documented the problems with forecasting over long time horizons where these questions matter most). Mostly, the people I think of as having strong strategic skill are closely linked to some political influence (which implicitly requires this skill to effect change) such as attaining a senior govt position, being influential over the Biden EO/export controls, UK govt AI efforts, etc. Alternatively, they are linked to some big major idea in governance or technical safety, often by spotting something missing years before it became relevant. Often by interacting regularly with good thinkers you can get a sense that they have stronger mental models for trends and the levers controlling trends than others, but concrete judgement is sometimes extremely difficult until a key event has passed and we can judge in hindsight (especially about very high level trends such as Mearsheimer's disputed take on the causes of the Ukraine invasion, Fukuyama's infamous "end of history" prediction, or even Pinker's "Better Angels of Our Nature" predictions about continually declining global conflict).

Good Research Takes are Not Sufficient for Good Strategic Takes

Neel Nanda14d32

Thanks!

okay, but, how actually DO we evaluate strategic takes?

Yeah, I don't have a great answer to this one. I'm mostly trying to convey the spirit of: we're all quite confused, and the people who seem competent disagree a lot, so they can't actually be that correct. And given that the ground truth is confusion, it is epistemically healthier to be aware of this.

Actually solving these problems is way harder! I haven't found a much better substitute than looking at people who have a good non-trivial track record of predictions, and people who have what t...

Negative Results for SAEs On Downstream Tasks and Deprioritising SAE Research (GDM Mech Interp Team Progress Update #2)

Neel Nanda16d40

I think it's pretty plausible that something pathological like that is happening. We're releasing this as an interesting idea that others might find useful for their use case, not as something we're confident is a superior method. If we were continuing with SAE work, we would likely sanity check it more but we thought it better to release it than not

Good Research Takes are Not Sufficient for Good Strategic Takes

Neel Nanda18d50

Yes, I agree. It's very annoying for general epistemics (though obviously pragmatically useful to me in various ways if people respect my opinion)

Though, to be clear, my main goal in writing this post was not to request that people defer less to me specifically, but more to make the general point of "please defer more intelligently", using myself as an example to avoid calling any specific person out

8Algon18d

Consistently give terrible strategic takes, so people learn not to defer to you.

Recent AI model progress feels mostly like bullshit

Neel Nanda19d143

I agree that I'd be shocked if GDM was training on eval sets. But I do think hill climbing on benchmarks is also very bad for those benchmarks being an accurate metric of progress and I don't trust any AI lab not to hill climb on particularly flashy metrics

Good Research Takes are Not Sufficient for Good Strategic Takes

Neel Nanda20d42

I'm not trying to agree with that one. I think that if someone has thought a bunch about the general topic of AI and has a bunch of useful takes, they can probably convert this on the fly to something somewhat useful, even if it's not as reliable as it would be if they spent a long time thinking about it. Like I think I can give useful technical mechanistic interpretability takes even if the question is about topics I've not spent much time thinking about before

1Tao Lin20d

yeah there's generalization, but I do think that eg (AGI technical alignment strategy, AGI lab and government strategy, AI welfare, AGI capabilities strategy) are sufficiently different that experts at one will be significantly behind experts on the others

Fabien's Shortform

Neel Nanda21dΩ230

Are the joint names separated by spaces? If not, the tokenization is going to be totally broken. More generally, I would be interested to see this tried with a code that, eg, maps familiar tokens to obscure ones, or something like mapping the token with id k to id (maximum minus k). Tokens feel like the natural way an LLM would represent its processing, and thus its encoded processing. Doing things in individual letters is kind of hard

2Fabien Roger20d

They were separated by spaces. (But I'd encourage replication before updating too hard on results which I think are very weird.)
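To make the token-level code suggested in this thread concrete, here is a minimal sketch of a cipher that maps token id k to (maximum id minus k). Everything in it is an illustrative assumption: the toy vocabulary, the whitespace "tokenizer", and the function names are hypothetical stand-ins, not the setup used in the original experiments.

```python
# Toy vocabulary standing in for a real tokenizer's vocabulary (hypothetical).
VOCAB = ["the", "cat", "sat", "on", "mat", "zxq", "vrb", "qqp"]
TOKEN_TO_ID = {tok: i for i, tok in enumerate(VOCAB)}


def encode(text: str) -> list[int]:
    """Whitespace 'tokenizer': one token id per space-separated word."""
    return [TOKEN_TO_ID[word] for word in text.split()]


def decode(ids: list[int]) -> str:
    """Inverse of encode for this toy vocabulary."""
    return " ".join(VOCAB[i] for i in ids)


def reverse_id_cipher(ids: list[int], vocab_size: int = len(VOCAB)) -> list[int]:
    """Map token id k to (vocab_size - 1) - k, i.e. 'maximum minus k'.

    The mapping is its own inverse, so the same function also decodes.
    """
    max_id = vocab_size - 1
    return [max_id - k for k in ids]


plain_ids = encode("the cat sat")
cipher_ids = reverse_id_cipher(plain_ids)
cipher_text = decode(cipher_ids)
```

With a real tokenizer the ciphered ids would generally decode to text that re-tokenizes differently, so a real experiment would likely need to work with ids directly or check the round trip.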

Sam Marks's Shortform

Neel Nanda21d40

I had not noticed that part. Thanks for flagging

Claude Sonnet 3.7 (often) knows when it’s in alignment evaluations

Neel Nanda26dΩ8252

Thanks a lot for doing this. This is substantially more evaluation awareness than I would have predicted. I'm not super convinced by the classifying transcript purpose experiments, since the evaluator model is plausibly primed to think about this stuff, but the monitoring results seem compelling and very concerning. Thanks a lot for doing this work. I guess we really get concerned when it stops showing up in the chain of thought...

EIS XV: A New Proof of Concept for Useful Interpretability

Neel Nanda26dΩ352

I would disagree that either one or four was achieved because, to my knowledge, the auditing game focused on finding, not fixing. It's also a bit ambiguous whether the prediction involves fixing with interpretability or fixing with unrelated means. It wouldn't surprise me if you could use the SAE-derived insights to filter out the relevant data if you really wanted to, but I'd guess an LLM classifier is more effective. Did they do that in the paper?

(To be clear, I think it's a great paper, and to the degree that there's a disagreement here it's that I think your predictions weren't covering the right comparative advantages of interp)

2scasper25d

Yea, thanks, good point. On one hand, I am assuming that after identifying the SAE neurons of interest, they could be used for steering (this is related to section 5.3.2 of the paper). On the other hand, I am also assuming that in this case, identifying the problem is 80% of the challenge. IRL, I would assume that this kind of problem could and would be addressed by adversarial fine-tuning the model some more.

Auditing language models for hidden objectives

Neel Nanda1moΩ142417

This is fantastic work, I'd be very excited to see more work in the vein of auditing games. It seems like one of the best ways so far to test how useful different techniques for understanding models are

ryan_greenblatt's Shortform

Neel Nanda1mo159

I found it helpful because it put me in the frame of an alien biological intelligence rather than an AI because I have lots of preconceptions about AIs and it's easy to implicitly think in terms of expected utility maximizers or tools or whatever. While if I'm imagining an octopus, I'm kind of imagining humans, but a bit weirder and more alien, and I would not trust humans

Martin Randall's Shortform

Neel Nanda1mo20

Oh sure, an executive assistant i.e. personal assistant in a work context can be super valuable just from an impact maximisation perspective but generally they need to be hired by your employer not by you in your personal capacity (unless you have a much more permissive/low security employer than Google)

leogao's Shortform

Neel Nanda1mo20

Agreed. If I'm talking to someone who I expect to be able to recalibrate, I just explain that I think the standard norms are dumb, the norms I actually follow, and then give an honest and balanced assessment. If I'm talking to someone I don't really know, I generally give a positive but not very detailed reference or don't reply, depending on context.

leogao's Shortform

Neel Nanda1mo2610

Strong agree (except in that liking someone's company is evidence that they would be a pleasant co-worker, but that's generally not a high order bit). I find it very annoying that standard reference culture seems to often imply giving extremely positive references unless someone was truly awful, since it makes it much harder to get real info from references

3Joseph Miller1mo

Agreed, but also most of the world does operate in this reference culture. If you choose to take a stand against it, you might screw over a decent candidate by providing only a quite positive recommendation.

MATS Applications + Research Directions I'm Currently Excited About

Neel Nanda1mo30

Probably is but I can't think of anything immediately

Martin Randall's Shortform

Neel Nanda2mo21

Idk, I personally feel near maxed out on spending money to increase my short term happiness (or at least, any ways coming to mind seem like a bunch of effort, like hiring a great personal assistant), and so the only reason to care about keeping it around is saving it for future use. I would totally be spending more money on myself now if I thought it would actually improve my life

1Sheikh Abdur Raheem Ali1mo

I’m not trying to say that any of this applies in your case per se. But when someone in a leadership position hires a personal assistant, their goal may not necessarily be to increase their short term happiness, even if this is a side effect. The main benefit is to reduce load on their team. If there isn’t a clear owner for ops adjacent stuff, people in high-performance environments will randomly pick up ad-hoc tasks that need to get done, sometimes without clearly reporting this out to anyone, which is often societally inefficient relative to their skillset and a bad allocation of bandwidth given the organization’s priorities. A great personal assistant wouldn’t just help you get more done and focus on what matters, but also handle various things which may be spilling over to those who are paying attention to your needs and acting to ensure they are met without you noticing or explicitly delegating.

AGI Safety & Alignment @ Google DeepMind is hiring

Neel Nanda2moΩ4104

In my incredibly biased opinion, the GDM AGI safety team is great and an effective place to work on reducing AI x-risk, and I would love to get applications from people here

Martin Randall's Shortform

Neel Nanda2mo91

On the other hand, if you have shorter timelines and higher P(Doom), the value of saving for retirement becomes much lower, which means that if you earn an income notably higher than your needs, the cost of cryonics is much lower, if you don't otherwise have valuable things to spend money on that get you value right now

2Martin Randall2mo

This might hold for someone who is already retired. If not, both retirement and cryonics look lower value if there are short timelines and higher P(Doom). In this model, instead of redirecting retirement to cryonics it makes more sense to redirect retirement (and cryonics) to vacation/sabbatical and other things that have value in the present.

William_S's Shortform

Neel Nanda2mo100

I was also thinking recently that I would love this to exist! If I ever had the time I was going to try hacking it together in Cursor

MATS Applications + Research Directions I'm Currently Excited About

Neel Nanda2mo21

Huh, seems to be working for me. What do you see when you click on it?

tinyurl.com/neel-mats-app

Gary Marcus now saying AI can't do things it can already do

Neel Nanda2mo4749

I think it's just not worth engaging with his claims about the limits of AI, he's clearly already decided on his conclusion

Mateusz Bagiński2mo105

This but to the extent that people reading him have not clearly already decided on their conclusion, it might be worth it to engage.

The purpose of a debate is not to persuade the debater, it's to persuade the audience. (Modulo that this frame is more soldier-mindset-y than truth-seeking but you know what I mean.)

Tips and Code for Empirical Research Workflows

Neel Nanda2mo30

Control space

1Sheikh Abdur Raheem Ali2mo

For anyone else who stumbles across this thread: when modifying the superwhisper toggle settings, hit spacebar then control, instead of control then spacebar. Also, it turns out that Control + Space is the default shortcut for switching keyboard input sources (at least on macOS Sequoia 15.3.1), make sure to disable that by going to System Settings → Keyboard → Keyboard Shortcuts → Input Sources.

Refusal in LLMs is mediated by a single direction

Neel Nanda2moΩ440

For posterity, this turned out to be a very popular technique for jailbreaking open source LLMs - see this list of the 2000+ "abliterated" models on HuggingFace (abliteration is a mild variant of our technique someone coined shortly after, I think the main difference is that you do a bit of DPO after ablating the refusal direction to fix any issues introduced?). I don't actually know why people prefer abliteration to just finetuning, but empirically people use it, which is good enough for me to call it beating baselines on some metric.
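As a rough illustration of what "ablating the refusal direction" means geometrically, the sketch below removes the component of an activation vector that lies along a given direction. It is a minimal pure-Python sketch under stated assumptions: the vectors and the names `ablate_direction`, `refusal_dir` and `resid` are hypothetical, and this is not the code from the post or from any abliteration tool.

```python
import math


def ablate_direction(activation: list[float], direction: list[float]) -> list[float]:
    """Return `activation` with its component along `direction` projected out."""
    norm = math.sqrt(sum(d * d for d in direction))
    unit = [d / norm for d in direction]
    # Scalar projection of the activation onto the unit direction.
    coeff = sum(a * u for a, u in zip(activation, unit))
    # Subtract the parallel component; the orthogonal part is left untouched.
    return [a - coeff * u for a, u in zip(activation, unit)]


# Hypothetical 3-dimensional "residual stream" vector and "refusal direction".
refusal_dir = [0.0, 3.0, 4.0]
resid = [1.0, 2.0, 2.0]

ablated = ablate_direction(resid, refusal_dir)

# After ablation the vector has (numerically) zero component along the direction.
remaining = sum(a * d for a, d in zip(ablated, refusal_dir))
```

In practice this kind of projection would be applied to model activations or weights at many layers; the toy vectors here only show the arithmetic.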

Tail SP 500 Call Options

Neel Nanda3mo140

Interesting. Does anyone know what the counterparty risk is like here? Eg, am I gambling on the ETF continuing to be provided, the option market maker not going bust, the relevant exchange continuing to exist, etc. (the first and third generally seem like reasonable bets, but in a short timelines world everything is high variance...)

6Frank Grimes3mo

Listed options are secured by the Options Clearing Corporation. I don't think they've ever failed to secure an options trade, including through covid and the 2007-8 financial crisis. I don't think this is the biggest concern about the trade.

SAEBench: A Comprehensive Benchmark for Sparse Autoencoders

Neel Nanda3mo50

Yeah, if you're doing this, you should definitely pre compute and save activations

Tips and Code for Empirical Research Workflows

Neel Nanda3mo40

I've been really enjoying voice to text + LLMs recently, via a great Mac App called Super Whisper (which can work with local speech to text models, so could also possibly be used for confidential stuff) - combining Super Whisper and Claude and Cursor means I can just vaguely ramble at my laptop about what experiments should happen and they happen, it's magical!

1Sheikh Abdur Raheem Ali2mo

What keybinding do you set for it? IIRC, ⌥+space conflicts with the default for ChatGPT and Alfred.app (which I use for clipboard history).

Some lessons from the OpenAI-FrontierMath debacle

Neel Nanda3mo143

I agree that OpenAI training on Frontier Math seems unlikely, and not in their interests. The thing I find concerning is that having high quality evals is very helpful for finding capabilities improvements - ML research is all about trying a bunch of stuff and seeing what works. As benchmarks saturate, you want new ones to give you more signal. If Epoch have a private benchmark they only apply to new releases, this is fine, but if OpenAI can run it whenever they want, this is plausibly fairly helpful for making better systems faster, since this makes hill climbing a bit easier.

Tips and Code for Empirical Research Workflows

Neel Nanda3moΩ5106

This looks extremely comprehensive and useful, thanks a lot for writing it! Some of my favourite tips (like clipboard managers and rectangle) were included, which is always a good sign. And I strongly agree with "Cursor/LLM-assisted coding is basically mandatory".

I passed this on to my mentees - not all of this transfers to mech interp, in particular the time between experiments is often much shorter (eg a few minutes, or even seconds) and often almost an entire project is in de-risking mode, but much of it transfers. And the ability to get shit done fast is super important

2John Hughes3mo

Thanks Neel! I'm glad you found it helpful. If you or your scholars recommend any other tools not mentioned in the post, I'd be interested to hear more.

Jonathan Claybrough's Shortform

Neel Nanda3mo220

This seems fine to me (you can see some reasons I like Epoch here). My understanding is that most Epoch staff are concerned about AI Risk, though tend to longer timelines and maybe lower p(doom) than many in the community, and they aren't exactly trying to keep this secret.

Your argument rests on an implicit premise that Epoch talking about "AI is risky" in their podcast is important, eg because it'd change the mind of some listeners. This seems fairly unlikely to me - it seems like a very inside baseball podcast, mostly listened to by people already aware ...

Siebe3mo1412

I don't think that all media produced by AI risk concerned people needs to mention that AI risk is a big deal - that just seems annoying and preachy. I see Epoch's impact story as informing people of where AI is likely to go and what's likely to happen, and this works fine even if they don't explicitly discuss AI risk

I don't think that every podcast episode should mention AI risk, but it would be pretty weird in my eyes to never mention it. Listeners would understandably infer that "these well-informed people apparently don't really worry much, maybe I ...

AI Timelines

Neel Nanda3moΩ572

I don't know much about CTF specifically, but based on my maths exam/olympiad experience I predict that there's a lot of tricks to go fast (common question archetypes, saved code snippets, etc) that will be top of mind for people actively practicing, but not for someone with a lot of domain expertise who doesn't explicitly practice CTF. I also don't know how important speed is for being a successful cyber professional. They might be able to get some of this speed up with a bit of practice, but I predict by default there's a lot of room for improvement.

(The) Lightcone is nothing without its people: LW + Lighthaven's big fundraiser

Neel Nanda3mo50

Tagging @philh @bilalchughtai @eventuallyalways @jbeshir in case this is relevant to you (though pooling money to get GWWC interested in helping may make more sense, if it can enable smaller donors and has lower fees)

2philh3mo

Thanks - I think GWWC would be fewer steps for me, but if that's not looking likely then one of these is plausible. (I wonder if it would be worth a few of us pooling money to get both "lower fees" and "less need to deal with orgs who don't just let you click some buttons to say where you want the money to go", but not sure if that would be legit for gift aid purposes.)

(The) Lightcone is nothing without its people: LW + Lighthaven's big fundraiser

Neel Nanda3mo*202

For anyone considering large-ish donations (in the thousands), there are several ways to do this in general for US non-profits, as a UK taxpayer. (Several of these also work for people who pay tax in both the US and UK)

The one I'd recommend here is using the Anglo-American Charity - you can donate to them tax-deductibly (a UK charity) and they'll send it to a US non-profit. I hear from a friend that they're happy to forward it to every.org so this should be easy here. The main annoying thing is fees - for amounts below £15K it's min(4%, £250) (so 4% above ...

philh3mo*120

Assuming all went well, I just donated £5,000 through the Anglo-American charity, which should become about (£5000 * 1.25 * 96% = £6000 ≈ $7300) to lightcone.
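The arithmetic in the comment above can be checked with a short sketch. It assumes, as that calculation does, a 1.25 multiplier on the donation and a flat 4% fee on the resulting total; the function and parameter names are hypothetical, and the fuller fee schedule is not modelled.

```python
def amount_forwarded(
    donation_gbp: float,
    uplift_multiplier: float = 1.25,  # multiplier used in the comment's arithmetic
    fee_rate: float = 0.04,           # flat 4% fee, as in the comment's arithmetic
) -> float:
    """Amount passed on after applying the uplift and deducting the fee."""
    return donation_gbp * uplift_multiplier * (1 - fee_rate)


# Rounded to pence to avoid printing floating point noise.
forwarded = round(amount_forwarded(5000), 2)
```

For a £5,000 donation this reproduces the £6,000 figure quoted above.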

I had further questions to their how to give page, so:

You can return the forms by email, no need to post them. (I filled them in with Firefox's native "draw/write on this pdf" feature, handwriting my signature with a mouse.)
If donating by bank transfer, you send the money to "anglo-american charity limited", not "anglo-american charitable foundation".
For lightcone's contact details I asked on LW i

...

5Neel Nanda3mo

The Plan - 2024 Update

Neel Nanda3mo50

Sure, but I think that human cognition tends to operate at a level of abstraction above the configuration of atoms in a 3D environment. Like "that is a chair" is a useful way to reason about an environment. While "that is a configuration of pixels that corresponds to a chair when projected at a certain angle in certain lighting conditions" must first be converted to "that is a chair" before anything useful can be done. Text just has a lot of useful preprocessing applied already and is far more compressed

7kave3mo

I'm not sure I understand what you're driving at, but as far as I do, here's a response: I have lots of concepts and abstractions over the physical world (like chair). I don't have many concepts or abstractions over strings of language, apart from as factored through the physical world. (I have some, like register or language, but they don't actually feel that "final" as concepts). As far as factoring my predictions of language through the physical world, a lot of the simplest and most robust concepts I have are just nouns, so they're already represented by tokenisation machinery, and I can't do interesting interp to pick them out.

8johnswentworth3mo

The preprocessing itself is one of the main important things we need to understand (I would even argue it's the main important thing), if our interpretability methods are ever going to tell us about how the stuff-inside-the-net relates to the stuff-in-the-environment (which is what we actually care about).

The Plan - 2024 Update

Neel Nanda3mo142

Strong +1, that argument didn't make sense to me. Images are a fucking mess - they're a grid of RGB pixels, of a 3D environment (interpreted through the lens of a camera) from a specific angle. Text is so clean and pretty in comparison, and has much richer meaning, and has a much more natural mapping to concepts we understand

kave3mo*115

That sounds less messy than the path from 3D physical world to tokens (and ~~less~~ (edit: I meant more here!) messy than the path from human concepts to tokens)

The Plan - 2024 Update

Neel Nanda3mo365

Fwiw, this is not at all obvious to me, and I would weakly bet that larger models are harder to interpret (even beyond there just being more capabilities to study)

1metawrong3mo

So you would expect Claude Opus 3 to be harder to interpret than Claude Sonnet 3.5? My intuition is that larger models of the same capability would exhibit less superposition and thus be easier to interpret?

6Nathan Helm-Burger3mo

Hmm. I think there's something about this that rings true and yet... Ok, so what if there were a set of cliff faces that had the property that climbing the bigger ones was more important and also that climbing tools worked better on them. Yet, despite the tools working better on the large cliffs, the smaller cliffs were easier to climb (both because the routes were shorter, and because the routes were less technical). Seems like if your goal is to design climbing equipment that will be helpful on large cliff faces, you should test the climbing equipment on large cliff faces, even if that means you won't have the satisfaction of completing any of your testing climbs.

evhub's Shortform

Neel Nanda3mo40

I would be very surprised if it had changed for early employees. I considered the donation matching part of my compensation package (it 2.5x'd the amount of equity, since it was a 3:1 match on half my equity), and it would be pretty norm violating to retroactively reduce compensation

4habryka3mo

If it had happened I would have expected that it would have been negotiated somehow with early employees (in a way that they agreed to, but not necessarily any external observers). But seems like it is confirmed that that early matching is indeed still active!

evhub's Shortform

Neel Nanda3mo112

I gather that they changed the donation matching program for future employees, but the 3:1 match still holds for prior employees, including all early employees (this change happened after I left, when Anthropic was maybe 50 people?)

I'm sad about the change, but I think that any goodwill due to believing the founders have pledged much of their equity to charity is reasonable and not invalidated by the change

9habryka3mo

If it still holds for early employees that would be a good clarification and totally agree with you that if that is the case, I don't think any goodwill was invalidated! That's part of why I was asking for clarification. I (personally) wouldn't be surprised if this had also been changed for early employees (and am currently close to 50/50 on that being the case).

How to replicate and extend our alignment faking demo

Neel Nanda4moΩ895

Thanks a lot for sharing all this code and data, seems super useful for external replication and follow-on work. It might be good to link this post from the Github readme - I initially found the Github via the paper, but not this post, and I found the exposition in this post more helpful than the current readme

5Fabien Roger4mo

Done!

Sam Marks's Shortform

Neel Nanda4mo20

That's technically even more conditional as the intervention (subtract the parallel component) also depends on the residual stream. But yes. I think it's reasonable to lump these together though, orthogonalisation also should be fairly non destructive unless the direction was present, while steering likely always has side effects

Sam Marks's Shortform

Neel Nanda4mo124

Note that this is conditional SAE steering - if the latent doesn't fire it's a no-op. So it's not that surprising that it's less damaging, a prompt is there on every input! It depends a lot on the performance of the encoder as a classifier though

2Sam Marks4mo

Isn't every instance of clamping a feature's activation to 0 conditional in this sense?
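The distinction discussed in this thread can be sketched in a few lines: clamping an SAE latent to zero subtracts the latent's current activation times its decoder direction, so it is a no-op whenever the latent does not fire, whereas adding a fixed steering vector changes every input. This is a minimal pure-Python sketch; the single-latent "SAE", its weights, and all names are hypothetical and chosen only to make the arithmetic easy to follow.

```python
def latent_activation(resid: list[float], enc_w: list[float], enc_b: float) -> float:
    """ReLU encoder for a single SAE latent."""
    pre_act = sum(r * w for r, w in zip(resid, enc_w)) + enc_b
    return max(0.0, pre_act)


def clamp_latent_to_zero(
    resid: list[float], enc_w: list[float], enc_b: float, dec_dir: list[float]
) -> list[float]:
    """Conditional intervention: remove the latent's current contribution."""
    act = latent_activation(resid, enc_w, enc_b)
    return [r - act * d for r, d in zip(resid, dec_dir)]


def steer(resid: list[float], dec_dir: list[float], alpha: float) -> list[float]:
    """Unconditional intervention: add a fixed multiple of the direction."""
    return [r + alpha * d for r, d in zip(resid, dec_dir)]


# Hypothetical weights for one latent in a 2-dimensional residual stream.
enc_w, enc_b, dec_dir = [1.0, 0.0], -0.5, [1.0, 0.0]

resid_firing = [2.0, 1.0]  # latent fires with activation 1.5
resid_silent = [0.2, 1.0]  # pre-activation is negative, latent does not fire

clamped_firing = clamp_latent_to_zero(resid_firing, enc_w, enc_b, dec_dir)
clamped_silent = clamp_latent_to_zero(resid_silent, enc_w, enc_b, dec_dir)
steered_silent = steer(resid_silent, dec_dir, alpha=-1.5)
```

Here `clamped_silent` equals the original input while `steered_silent` does not, which is the sense in which clamping is conditional. How well this works in practice depends on how good the encoder is as a classifier, as the comment notes.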

Remap your caps lock key

Neel Nanda4mo30

When do you use escape?

2core_admiral4mo

The escape remap is popular among users of vim or the vim keybindings in other editors since it's used there a lot. OP might have a different use, just noting it here for any vim user or prospective user who hasn't done it yet.

Zach Stein-Perlman's Shortform

Neel Nanda4mo62

It seems unlikely that OpenAI is truly following the "test the model" plan? They keep eg putting new experimental versions onto lmsys, presumably mostly due to different post training, and it seems pretty expensive to be doing all the DC evals again on each new version (and I think it's pretty reasonable to assume that a bit of further post training hasn't made things much more dangerous)

Fact Finding: Attempting to Reverse-Engineer Factual Recall on the Neuron Level (Post 1)

Neel Nanda4moΩ440Review for 2023 Review

I'm not super sure what I think of this project. I endorse the seed of the idea re "let's try to properly reverse engineer what representing facts in superposition looks like" and think this was a good idea ex ante. Ex post, I consider our results fairly negative, and have mostly concluded that this kind of thing is cursed and we should pursue alternate approaches to interpretability (eg transcoders). I think this is a fairly useful insight! But also something I made from various other bits of data. Overall I think this was a fairly useful conclusion re upd...

Finding Neurons in a Haystack: Case Studies with Sparse Probing

Neel Nanda4moΩ660Review for 2023 Review

I really like this paper (though, obviously, am extremely biased). I don't think it was groundbreaking, but I think it was an important contribution to mech interp, and one of my favourite papers that I've supervised.

Superposition seems like an important phenomenon that affects our ability to understand language models. I think this paper was some of the first evidence that it actually happens in language models, and on what it actually looks like. Thinking about eg why neurons detecting compound words (eg blood pressure) were unusually easy to represent in...

Towards Monosemanticity: Decomposing Language Models With Dictionary Learning

Neel Nanda4moΩ13201Review for 2023 Review

Sparse autoencoders have been one of the most important developments in mechanistic interpretability in the past year or so, and significantly shaped the research of the field (including my own work). I think this is in substantial part due to Towards Monosemanticity, between providing some rigorous preliminary evidence that the technique actually worked, a bunch of useful concepts like feature splitting, and practical advice for training these well. I think that understanding what concepts are represented in model activations is one of the most important ...