leogao

Sequences

Alignment Stream of Thought

Wiki Contributions

Comments

Sorted by
leogao41

People generally assume those around them agree with them (even when they don't see loud support of their position - see "silent majority"). So when you ask what their neighbors think, they will guess their neighbors have the same views as themselves, and will report their own beliefs with plausible deniability.

leogao20

I'm claiming that even if you go all the way to BoN, it still doesn't necessarily leak less info to the morel

leogao51

for a sufficiently competent policy, the fact that BoN doesn't update the policy doesn't mean it leaks any fewer bits of info to the policy than normal RL

leogao40

For ML, yes. I'm deriving this from the bitter lesson.

leogao95

I think there are a whole bunch of inputs that determine a company's success. Research direction, management culture, engineering culture, product direction, etc. To be a really successful startup you often just need to have exceptional vision on one or a small number of these inputs, possibly even just once or twice. I'd guess it's exceedingly rare for a company to have leaders with consistently great vision across all the inputs that go into a company. Everything else will constantly revert towards local incentives. So, even in a company with top 1 percentile leadership vision quality, most things will still be messed up because of incentives most of the time.

leogao267

For the purposes of the original question of whether people are overinvesting in interp due to it being useful for capabilities and therefore being incentivized, I think there's a pretty important distinction between direct usefulness and this sort of diffuse public good that is very hard to attribute. Things with large but diffuse impact are much more often underincentivized and often mostly done as a labor of love. In general, the more you think an organization is shaped by incentives that are hard to fight against, the more you should expect diffusely impactful things to be relatively neglected.

Separately, it's also not clear to me that the diffuse intuitions from interpretability have actually helped people a lot with capabilities. Obviously this is very hard to attribute, and I can't speak too much about details, but it feels to me like the most important intuitions come from elsewhere. What's an example of an interpretability work that you feel has affected capabilities intuitions a lot?

leogao113

SAE steering doesn't seem like it obviously beats other steering techniques in terms of usefulness. I haven't looked closely into Hyena but my prior is that subquadratic attention papers probably suck unless proven otherwise.

Interpretability is certainly vastly more appealing to lab leadership than weird philosophy, but it's vastly less appealing than RLHF. But there are many many ML flavored directions and only a few of them are any good, so it's not surprising that most directions don't get a lot of attention.

Probably as interp gets better it will start to be helpful for capabilities. I'm uncertain whether it will be more or less useful for capabilities than just working on capabilities directly; on the one hand, mechanistic understanding has historically underperformed as a research strategy, on the other hand it could be that this will change once we have a sufficiently good mechanistic understanding.

leogao8354

I don't think anyone has, to date, used interpretability to make any major counterfactual contribution to capabilities. I would not rely on papers introducing a new technique to be the main piece of evidence as to whether the technique is actually good at all. (I would certainly not rely on news articles about papers - they are basically noise.)

I think you should take into account the fact that before there are really good concrete capabilities results, the process that different labs use to decide what to invest in is highly contingent on a bunch of high variance things. Like, what kinds of research directions appeal to research leadership, or whether there happen to be good ICs excited to work on that direction around and not tied down to any other project. 

I don't think you should be that surprised by interpretability being more popular than other areas of alignment. Certainly I think incentives towards capabilities is a small fraction of why it's popular and funded etc (if anything, its non-usefulness for capabilities to date may count against it). Rather, I think it's popular because it's an area where you can actually get traction and do well-scoped projects and have a tight feedback loop. This is not true of the majority of alignment research directions that actually could help with aligning AGI/ASI, and correspondingly those directions are utterly soul grinding to work on. 

One could very reasonably argue that more people should be figuring out how to work on the low traction, ill-scoped, shitty feedback loop research problems, and that the field is looking under the streetlight for the keys. I make this argument a lot. But I think you shouldn't need to postulate some kind of nefarious capabilities incentive influence to explain it.

leogao58

aiming directly for achieving some goal is not always the most effective way of achieving that goal.

Load More