I don't think anyone has, to date, used interpretability to make any major counterfactual contribution to capabilities. I would not rely on papers introducing a new technique to be the main piece of evidence as to whether the technique is actually good at all. (I would certainly not rely on news articles about papers - they are basically noise.)
I think you should take into account the fact that, before there are really good concrete capabilities results, the process that different labs use to decide what to invest in is highly contingent on a bunch of high-variance things. Like, what kinds of research directions appeal to research leadership, or whether there happen to be good ICs around who are excited to work on that direction and aren't tied down to any other project.
I don't think you should be that surprised by interpretability being more popular than other areas of alignment. Certainly I think incentives toward capabilities are a small fraction of why it's popular and funded, etc. (if anything, its non-usefulness for capabilities to date may count against it). Rather, I think it's popular because it's an area where you can actually get traction, do well-scoped projects, and have a tight feedback loop. This is not true of the majority of alignment research directions that actually could help with aligning AGI/ASI, and correspondingly those directions are utterly soul-grinding to work on.
One could very reasonably argue that more people should be figuring out how to work on the low traction, ill-scoped, shitty feedback loop research problems, and that the field is looking under the streetlight for the keys. I make this argument a lot. But I think you shouldn't need to postulate some kind of nefarious capabilities incentive influence to explain it.
I don't think anyone has, to date, used interpretability to make any major counterfactual contribution to capabilities. I would not rely on papers introducing a new technique to be the main piece of evidence as to whether the technique is actually good at all. (I would certainly not rely on news articles about papers - they are basically noise.)
I would bet against this on the basis that Chris Olah's work was quite influential on a huge number of people, shaped their mental models of how Deep Learning works in general, and probably contributed to lots of improved capability-oriented thinking and decision-making.
Like, as a kind of related example where I expect it's easier to find agreement, it's hard to point to something concrete that "Linear Algebra Done Right" did to improve ML research, but I am quite confident it has had a non-trivial effect. It's the favorite Linear Algebra textbook of many of the best contributors to the field, and having good models and explanations of the basics makes a big difference.
For the purposes of the original question of whether people are overinvesting in interp due to it being useful for capabilities and therefore being incentivized, I think there's a pretty important distinction between direct usefulness and this sort of diffuse public good that is very hard to attribute. Things with large but diffuse impact are much more often underincentivized and often mostly done as a labor of love. In general, the more you think an organization is shaped by incentives that are hard to fight against, the more you should expect diffusely impactful things to be relatively neglected.
Separately, it's also not clear to me that the diffuse intuitions from interpretability have actually helped people a lot with capabilities. Obviously this is very hard to attribute, and I can't speak too much about details, but it feels to me like the most important intuitions come from elsewhere. What's an example of an interpretability work that you feel has affected capabilities intuitions a lot?
I think there's a pretty important distinction between direct usefulness and this sort of diffuse public good that is very hard to attribute. Things with large but diffuse impact are much more often underincentivized and often mostly done as a labor of love. In general, the more you think an organization is shaped by incentives that are hard to fight against, the more you should expect diffusely impactful things to be relatively neglected.
I have a model whereby ~all very successful large companies require a leader with vision, who is able to understand incentives and nonetheless take long-term action that isn't locally rewarded. YC startups constantly talk about long-term investments into culture and hiring and onboarding processes that are costly in (I'd guess) the 3-12 month time-frame but extremely valuable in the 1-5 year time frame.
Saying that a system is heavily shaped by incentives doesn't seem to me to imply that the system is heavily short-sighted. Companies like Amazon and Facebook are of course heavily shaped by incentives yet have quite long-term thinking in their leaders, who often do things that look like locally wasted effort because they have a vision of how it will pay off years down the line.
Speaking about the local political situation, I think safety investment from AI capabilities companies can be thought of as investing into problems that will come up in the future. As a more cynical hypothesis, I think it can also be usefully thought of as a worthwhile political ploy to attract talent and look responsible to regulators and intelligentsia.
(Added: Bottom line: following incentives does not mean being short-sighted.)
I think there are a whole bunch of inputs that determine a company's success. Research direction, management culture, engineering culture, product direction, etc. To be a really successful startup you often just need to have exceptional vision on one or a small number of these inputs, possibly even just once or twice. I'd guess it's exceedingly rare for a company to have leaders with consistently great vision across all the inputs that go into a company. Everything else will constantly revert towards local incentives. So, even in a company with top 1 percentile leadership vision quality, most things will still be messed up because of incentives most of the time.
I very much agree that the focus on interpretability is like searching under the light. It's legible; it's a way to show that you've done something nontrivial - you did some real work on alignment. And it's generally agreed that it's progress toward alignment.
When people talk about prosaic alignment proposals, there’s a common pattern: they’ll be outlining some overcomplicated scheme, and then they’ll say “oh, and assume we have great interpretability tools, this whole thing just works way better the better the interpretability tools are”, and then they’ll go back to the overcomplicated scheme. (Credit to Evan for pointing out this pattern to me.)
But it's not a way to solve alignment in itself. The notion that we'll simply understand and track all of the thoughts of a superintelligent AGI is just strange. I really wonder how seriously people are thinking about the impact model of that work.
And they don't need to, because it's pretty obvious that better interp is incremental progress for a lot of AGI scenarios.
This is the incentive that makes progress in academia incredibly slow: there are incentives to do legibly impressive work. There are surprisingly few incentives to actually make progress on useful theories - because it's harder to tell what would count as progress.
But if we're all working on stuff with only small marginal payoffs, who's working on actually getting beyond "overcomplicated schemes" and actually creating and working through practical, workable alignment plans?
I really wish some of the folks working on interp would devote a bit more of their time to "solving the whole problem". It looks to me like we have a really dramatic misallocation of resources happening. We are searching under the light. We need more of us feeling around in the dark where we lost those keys.
That you're unaware of there being any notable counterfactual capabilities boost from interpretability is some update. How sure are you that you'd know if there were training multipliers that had interpretability strongly in their causal history? Are you not counting steering vectors from Anthropic here? And I didn't find out about Hyena from the news article, but from a friend who read the paper; the article just had a nicer quote.
I could imagine that interpretability being relatively ML-flavoured makes it more appealing to scaling lab leadership, and that this, rather than their seeing it as commercially useful, is the reason those projects get favoured, at least in many cases.
Would you expect this to continue as interpretability keeps getting better? On general models, I'd be pretty surprised to find that opening up black boxes doesn't let you debug them better, though I could imagine we're just not good enough at it yet.
SAE steering doesn't seem like it obviously beats other steering techniques in terms of usefulness. I haven't looked closely into Hyena but my prior is that subquadratic attention papers probably suck unless proven otherwise.
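(For readers comparing these techniques, here is a minimal sketch of what generic activation steering looks like: add a "difference of means" direction to the residual stream with a forward hook. The model, layer index, contrast prompts, and scale below are illustrative assumptions on my part, not any lab's actual recipe; SAE steering differs mainly in taking the direction from a sparse autoencoder's feature dictionary rather than from a prompt contrast.)

```python
# Minimal illustrative sketch of activation steering on GPT-2.
# The layer, prompts, and scale are assumptions for illustration only.
import torch
from transformers import GPT2LMHeadModel, GPT2Tokenizer

model = GPT2LMHeadModel.from_pretrained("gpt2")
tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
model.eval()

LAYER = 6  # which transformer block's output to steer (illustrative choice)

def mean_residual(prompt: str) -> torch.Tensor:
    """Mean residual-stream activation after block LAYER for a prompt."""
    inputs = tokenizer(prompt, return_tensors="pt")
    with torch.no_grad():
        out = model(**inputs, output_hidden_states=True)
    # hidden_states[0] is the embedding output, so block LAYER's output is index LAYER + 1
    return out.hidden_states[LAYER + 1].mean(dim=1).squeeze(0)

# Build a steering vector as a difference of means between two contrasting prompts.
steering_vector = mean_residual("I love this, it is wonderful") - mean_residual(
    "I hate this, it is terrible"
)

def steering_hook(module, inputs, output):
    # GPT-2 blocks return a tuple whose first element is the hidden state.
    hidden = output[0] + 4.0 * steering_vector  # hand-picked scale (illustrative)
    return (hidden,) + output[1:]

handle = model.transformer.h[LAYER].register_forward_hook(steering_hook)
try:
    ids = tokenizer("The movie was", return_tensors="pt")
    generated = model.generate(**ids, max_new_tokens=20, do_sample=False)
    print(tokenizer.decode(generated[0]))
finally:
    handle.remove()  # remove the hook so later calls are unsteered
```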
Interpretability is certainly vastly more appealing to lab leadership than weird philosophy, but it's vastly less appealing than RLHF. But there are many many ML flavored directions and only a few of them are any good, so it's not surprising that most directions don't get a lot of attention.
Probably as interp gets better it will start to be helpful for capabilities. I'm uncertain whether it will be more or less useful for capabilities than just working on capabilities directly; on the one hand, mechanistic understanding has historically underperformed as a research strategy, on the other hand it could be that this will change once we have a sufficiently good mechanistic understanding.
I'm sympathetic to 'a high fraction of "alignment/safety" work done at AI companies is done due to commercial incentives and has negligible effect on AI takeover risk (or at least much smaller effects than work which isn't influenced by commercial incentives)'.
I also think a decent number of ostensibly AI x-risk focused people end up being influenced by commercial incentives, sometimes knowingly and directly (my work will go into prod if it is useful and this will be good for XYZ reason; my work will get more support if it is useful; it is good if the AI company I work for is more successful/powerful, so I will do work which is commercially useful) and sometimes unknowingly or indirectly (a direction gets more organizational support because of usefulness; people are misled into thinking something is more x-risk-helpful than it actually is).
(And a bunch of originally AI x-risk focused people end up working on things which they would agree aren't directly useful for mitigating x-risk, but have some more complex story.)
I also think AI companies generally are a bad epistemic environment for x-risk safety work for various reasons.
However, I'm quite skeptical that (mechanistic) interpretability research in particular gets much more funding due to it directly being a good commercial bet (as in, it is worth it because it ends up directly being commercially useful). And, my guess is that alignment/safety people at AI companies which are ostensibly focused on x-risk/AI takeover prevention are less than 1/2 funded via the directly commercial case. (I won't justify this here, but I believe it due to a combination of personal experience and thinking about what these teams tend to work on.)
(That said, I think putting some effort on (mech) interp (or something similar) might end up being a decent commercial bet via direct usage, though I'm skeptical.)
I think there are some adjacent reasons alignment/safety work might be funded/encouraged at AI companies beyond direct commercial usage:
However, I'm quite skeptical that (mechanistic) interpretability research in particular gets much more funding due to it directly being a good commercial bet (as in, it is worth it because it ends up directly being commercially useful). And, my guess is that alignment/safety people at AI companies which are ostensibly focused on x-risk/AI takeover prevention are less than 1/2 funded via the directly commercial case. (I won't justify this here, but I believe it due to a combination of personal experience and thinking about what these teams tend to work on.)
I think the primary commercial incentive on mechanistic interpretability research is that it's the alignment research that most provides training and education to become a standard ML engineer who can then contribute to commercial objectives. I am quite confident that a non-trivial fraction of young alignment researchers are going into mech-interp because it also gets them a lot of career capital as a standard ML engineer.
I think the primary commercial incentive on mechanistic interpretability research is that it's the alignment research that most provides training and education to become a standard ML engineer who can then contribute to commercial objectives.
Is your claim here that a major factor in why Anthropic and GDM do mech interp is to train employees who can later be commercially useful? I'm skeptical of this.
Maybe the claim is that many people go into mech interp so they can personally skill up and later might pivot into something else (including jobs which pay well)? This seems plausible/likely to me, though it is worth noting that this is a pretty different argument with very different implications from the one in the post.
Yep, I am saying that the supply of mech-interp alignment researchers is plentiful because the career capital is much more fungible with extremely well-paying ML jobs, and that Anthropic and GDM seem interested in sponsoring things like mech-interp MATS streams or other internships and junior positions because those fit neatly into their existing talent pipeline, they know how to evaluate that kind of work, and they think those hires are also more likely to convert into people working on capabilities.
I'm pretty skeptical that Neel's MATS stream is partially supported/subsidized by GDM's desire to generally hire for capabilities. (And I certainly don't think they directly fund this.) Same for other mech interp hiring at GDM: I doubt that anyone is thinking "these mech interp employees might convert into employees for capabilities". That said, this sort of thinking might subsidize the overall alignment/safety team at GDM to some extent, but I think this would mostly be a mistake for the company.
Seems plausible that this is an explicit motivation for junior/internship hiring on the Anthropic interp team. (I don't think the Anthropic interp team has a MATS stream.)
I think Neel seems to have a somewhat unique amount of freedom, so I have less of a strong take there, but I am confident that GDM would be substantially less excited about its employees taking time off to mentor a bunch of people if the kind of work they were doing would produce artifacts that were substantially less well-respected by the ML crowd, or did not look like they are demonstrating the kind of skills that are indicative of good ML engineering capability.
(I think random (non-leadership) GDM employees generally have a lot of freedom, while employees of other companies have much less in-practice freedom, except for maybe longer-tenured OpenAI employees, who I think have a lot of freedom.)
Why are you confident it's not the other way around? People who decide to pursue alignment research may have prior interest or experience in ML engineering that drives them towards mech-interp.
Agree with this, and I wanted to add that I am also not completely sure mechanistic interpretability is a good "commercial bet" yet, based on my experience and understanding, where by "commercial bet" I mean something that materializes revenue or is directly revenue-generating.
One revenue-generating path I can see for LLMs is for the company to use interpretability to identify the data that is most effective for particular benchmarks. But my current understanding (correct me if I am wrong) is that, for now, it is relatively costly to first research a reliable method and then run interpretability methods on large models; additionally, researchers generally already have good intuitions about which datasets would be useful for specific benchmarks. On the other hand, the methods would be much more useful for looking into nuanced and hard-to-tackle safety problems. In fact, there are a lot of previous efforts in using interpretability generally for safety mitigations.
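(As a concrete sketch of what the "find the data that moves a benchmark" half of that path could look like, here is one possible instantiation: a TracIn-style gradient-similarity score. This is my own hedged example rather than what the comment above describes, and `model`, `loss_fn`, `benchmark_batch`, and `candidate_examples` are all placeholder names; the per-example full-parameter gradients are also a decent illustration of why this kind of attribution gets costly at scale.)

```python
# Hedged sketch: score candidate training examples by how well their gradient
# aligns with the gradient of a benchmark loss (TracIn-style similarity).
import torch
import torch.nn as nn

def flat_grad(model: nn.Module, loss: torch.Tensor) -> torch.Tensor:
    """Flatten the gradient of `loss` w.r.t. all trainable parameters into one vector."""
    params = [p for p in model.parameters() if p.requires_grad]
    grads = torch.autograd.grad(loss, params, allow_unused=True)
    return torch.cat([
        (g if g is not None else torch.zeros_like(p)).reshape(-1)
        for g, p in zip(grads, params)
    ])

def score_candidates(model, loss_fn, benchmark_batch, candidate_examples):
    """Higher score = training on this example (locally) pushes the model in the
    same direction that reducing the benchmark loss would."""
    bench_x, bench_y = benchmark_batch
    bench_grad = flat_grad(model, loss_fn(model(bench_x), bench_y))

    scores = []
    for x, y in candidate_examples:
        cand_grad = flat_grad(model, loss_fn(model(x), y))  # fresh forward per example
        scores.append(torch.dot(cand_grad, bench_grad).item())
    return scores
```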
I’m not an expert and I’m not sure it matters much for your point, but: Yes there were surely important synergies between NASA activities and the military ballistic missile programs in the 1960s, but I don’t think it’s correct to suggest that most NASA activities was stuff that would have to be done for the ballistic missile program anyway. It might actually be a pretty small fraction. For example, less than half the Apollo budget was for launch vehicles; they spent a similar amount on spacecraft, which are not particularly transferable to nukes. And even for the launch vehicles, it seems that NASA tended to start with existing military rocket designs and modify them, rather than the other way around.
I would guess that the main synergy was more indirect: helping improve the consistency of work, economies of scale, defraying overhead costs, etc., for the personnel and contractors and so on.
Satellites were also plausibly a very important military technology. Since the 1960s, some applications have panned out, while others haven't. Some of the things that have worked out:
Some of the stuff that hasn't:
Overall, lots of NASA activities that developed satellite / spacecraft technology seem like they had a dual-use effect advancing various military capabilities. So it wasn't just the missiles. Of course, in retrospect, the entire human-spaceflight component of the Apollo program (spacesuits, life support systems, etc) turned out to be pretty useless from a military perspective. But even that wouldn't have been clear at the time!
Yeah, I don't expect the majority of work done for NASA had direct applicability for war, even though there was some crossover. However, I'd guess that NASA wouldn't have had anything like as much overall budget if not for the overlap with a critical national security concern?
The primary motive for funding NASA was definitely related to competing with the USSR, but I doubt that it was heavily focused on military applications. It was more along the lines of demonstrating the general superiority of the US system, in order to get neutral countries to side with us because we were on track to win the cold war.
The field of alignment seems to be increasingly dominated by interpretability. (and obedience[2])
I agree w.r.t. potential downstream externalities of interpretability. However, my view from speaking to a fair number of those who join the field to work on interpretability is that the upstream cause is slightly different. Interpretability is easier to understand and presents "cleaner" projects than most others in the field. It's also much more accessible and less weird to, say, an academic, or to the kind of person who hasn't spent a lot of time thinking about alignment.
All of these are likely correlated with having downstream capabilities applications, but the exact mechanism doesn't look like "people recognizing that this has huge profit potential and therefore growing much faster than other sub-fields".
I agree that the effect you're pointing to is real and a large part of what's going on here, and could easily be convinced that it's the main cause (along with the one flagged by @habryka). It's definitely a more visible motivation from the perspective of an individual going through the funnel than the one this post highlights. I was focusing on making one concise point rather than covering the whole space of argument, and am glad comments have brought up other angles.
Some thoughts on the post, since I want to collect everything in one place.
For this specifically:
The field of alignment seems to be increasingly dominated by interpretability. (and obedience[2])
Obedience is quite literally the classical version of the AI alignment problem, or at least very entangled with it, and interpretability IMO got a lot of the initial boost from Distill, combined with it being a relatively clean problem formulation that makes alignment more scalable.
On footnote 3, I have to point out that the postulated sharp left turn in humans compared to chimps doesn't have nearly as much evidence as the model in which human success is basically attributable to culture: we are very good at imitating others without a true causal model, and at distilling cultural knowledge down, combined with having an unusually good body plan for tool use and being able to dissipate heat well enough to scale up our algorithms.
I don't totally like Henrich's book The Secret of Our Success, but I do think that at least some of the results, like chimps essentially equaling humans on IQ and only losing hard when social competence was required, are surprising under the sharp left turn view.
While the science of human and animal brains is incomplete at this juncture, we are starting to realize that animals do in fact have much more intelligence than people previously realized; in particular, for mammals, the algorithmic cores are fairly similar in genetic space, and the only real differentiator for humans at this point is the fact that our cultural knowledge, sparked by population growth and language, booms every generation.
Imagine a GPT that predicts random chunks of the internet.
Sometimes it produces poems. Sometimes deranged rants. Sometimes all sorts of things. It wanders erratically around a large latent space of behaviours.
This is the unmasked shoggoth, green slimy skin showing but inner workings still hidden.
Now perform some change that mostly pins down the latent space to "helpful corporate assistant". This is applying the smiley face mask.
In some sense, all the dangerous capabilities of the corporate assistant were in the original model. Dangerous capabilities haven't been removed either, but some capabilities are now a bit easier to access without careful prompting, and other capabilities are harder to access.
What ChatGPT currently has is a form of low quality pseudo-alignment.
What would long-term success look like using nothing but this pseudo-alignment? It would look like a chatbot far smarter than any current ones, which mostly did nice things, so long as you didn't put in any weird prompts.
Now if corrigibility is a broad basin, this might well be enough to hit it. The basin of corrigibility means that the AI might have bugs, but at the very least, you can turn the AI off and edit the code. Ideally you can ask the AI for help fixing its own bugs. Sure, the first AI is far from perfect. But perhaps the flaws disappear under self-rewriting + competent human advice.
I personally don't think that working on better capabilities and working on the alignment problem are two distinct, separate problems. For example, if you're able to create better, more precise control mechanisms and intervention techniques, you can better align human & AI intent. Alignment feels as much like a technical, control-interface problem as merely a question of "what should we align to?".
This is a very relevant discussion, which should be the backbone of the decision-making of anyone taking part in the general effort towards AI alignment.
It is important to point out, however, that this is maybe just an instance of a more general problem: any form of power or knowledge can be used for good or bad purposes, so advancing power and knowledge is always a double-edged sword.
There doesn't seem to me to be an escape from the moral variance that humans exhibit as part of their natural + developed proclivities.
The only chance we have against power and knowledge as great as what AI enables is to really illuminate, with clarity, how these technical pieces can come together for one edge or the other.
Bringing the problem to light is our best chance.
1.
4.4% of the US federal budget went into the space race at its peak.
This was surprising to me, until a friend pointed out that landing rockets on specific parts of the moon requires very similar technology to landing rockets in soviet cities.[1]
I wonder how much more enthusiastic the scientists working on Apollo were, with the convenient motivating story of “I’m working towards a great scientific endeavor” vs “I’m working to make sure we can kill millions if we want to”.
2.
The field of alignment seems to be increasingly dominated by interpretability. (and obedience[2])
This was surprising to me[3], until a friend pointed out that partially opening the black box of NNs is the kind of technology that would let scaling labs find new unhobblings by noticing ways in which the internals of their models are being inefficient and having better tools to evaluate capabilities advances.[4]
I wonder how much more enthusiastic the alignment researchers working on interpretability and obedience are, with the motivating story “I’m working on pure alignment research to save the world” vs “I’m building tools and knowledge which scaling labs will repurpose to build better products, shortening timelines to existentially threatening systems”.[5]
3.
You can’t rely on the organizational systems around you to be pointed in the right direction, and there are obvious reasons for commercial incentives to want to channel your idealistic energy towards types of safety work which are dual-use or even primarily capabilities enabling. And for similar reasons, many of the training programs prepare people for the kind of jobs which come with large salaries and prestige, as a flawed proxy for people moving the needle on x-risk.
If you’re genuinely trying to avert AI doom, please take the time to form inside views away from memetic environments[6] which are likely to have been heavily influenced by commercial pressures. Then back-chain from a theory of change where the world is more often saved by your actions, rather than going with the current and picking a job with safety in its title as a way to try and do your part.
Space Race - Wikipedia:
Andrew Critch:
These paradigms do not seem to be addressing the most fatal filter in our future: strongly coherent goal-directed agents forming with superhuman intelligence. These will predictably undergo a sharp left turn, and the soft/fuzzy alignment techniques which worked at lower power levels will fail simultaneously as the system reaches high enough competence to reflect on itself, its capabilities, and the guardrails we built.
Interpretability work could plausibly help with weakly aligned weakly superintelligent systems that do our alignment homework for the much more capable systems to come. But the effort going into this direction seems highly disproportionate to how promising it is, is not backed by plans to pivot to using these systems to do a quite different style of alignment research that's needed, and generally lacks research closure to avert capabilities externalities.
From the team that broke the quadratic attention bottleneck:
Ask yourself: “Who will cite my work?”, not "Can I think of a story where my work is used for good things?"
There is work in these fields which might be good for x-risk, but you need to figure out if what you're doing is in that category to be good for the world.
Humans are natural mimics, we copy the people who have visible signals of doing well, because those are the memes which are likely to be good for our genes, and genes direct where we go looking for memes.
Wealth, high confidence that they're doing something useful, being part of a growing coalition: all great signs of good memes, and all much more possessed by people in the interpretability/obedience kind of alignment than by the old-school “this is hard and we don’t know what we’re doing, but it’s going to involve a lot of careful philosophy and math” crowd.
Unfortunately, this memetic selection is not particularly adaptive for trying to solve alignment.