AI notkilleveryoneism researcher, focused on interpretability.
Personal account, opinions are my own.
I have signed no contracts or agreements whose existence I cannot mention.
Single datapoint, but: I find outside restrictions on my appearance deeply unpleasant. I avoid basically all events and situations with a mandated dress code when this is at all feasible. So if a solstice has a dress code, I will not be attending it.
Although in contrast to Ramesh et al. (2018) and my work, that paper only considers the Jacobian of a shallow rather than deep slice.
We also tried using the Jacobians between every layer and the final layer, instead of the Jacobians between adjacent layers. This is what we call "global interaction basis" in the paper. It didn't change the results much.
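To make the distinction concrete, here is a minimal toy sketch of the two kinds of Jacobian. It is not the method from the paper: the tanh MLP, the weights, and all function names are assumptions made up for illustration. The only fact it relies on is the chain rule, which says the Jacobian from a layer to the final layer is the product of the adjacent-layer Jacobians in between.

```python
import math

def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def layer(w, h):
    """One toy layer (an assumption for this sketch): h_next = tanh(W h)."""
    return [math.tanh(sum(wij * hj for wij, hj in zip(row, h))) for row in w]

def forward(weights, x):
    """Run the toy network and return the final-layer activations."""
    h = x
    for w in weights:
        h = layer(w, h)
    return h

def adjacent_jacobians(weights, x):
    """Jacobian of each layer's output with respect to its own input, at input x."""
    jacobians, h = [], x
    for w in weights:
        h_next = layer(w, h)
        # d tanh(z)/dz = 1 - tanh(z)^2, so J = diag(1 - h_next^2) @ W
        jacobians.append(
            [[(1 - hn * hn) * wij for wij in row] for hn, row in zip(h_next, w)]
        )
        h = h_next
    return jacobians

def global_jacobians(adjacent):
    """Jacobian of the final layer with respect to each layer's input (chain rule)."""
    result, running = [], None
    for j in reversed(adjacent):
        # Accumulate J_last @ ... @ J_l while walking backwards through the layers.
        running = j if running is None else matmul(running, j)
        result.append(running)
    return result[::-1]

if __name__ == "__main__":
    toy_weights = [[[0.5, -0.3], [0.2, 0.8]], [[1.0, 0.4], [-0.6, 0.1]]]
    adj = adjacent_jacobians(toy_weights, [0.3, -0.7])
    print("adjacent:", adj)
    print("to final layer:", global_jacobians(adj))
```

For the last layer the two notions coincide. For earlier layers, the Jacobian to the final layer folds in everything downstream, while the adjacent Jacobian only sees one step.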
Seems like some measure of evidence -- maybe large, maybe tiny -- that "We don't know how to give AI values, just to make them imitate values" is false?
I am pessimistic about loss signals getting 1-to-1 internalised as goals or desires in a way that is predictable to us with our current state of knowledge on intelligence and agency, and would indeed tentatively consider this observation a tiny positive update.
I do not find this to be the biggest value-contributor amongst my spontaneous conversations.
I don't have a good hypothesis for why spontaneous-ish conversations can end up being valuable to me so frequently. I have a vague intuition that it might be an expression of the same phenomenon that makes slack and playfulness in research and internet browsing very valuable for me.
The donation site said I should leave a comment here if I donate, so I'm doing that. Gave $200 for now.
I was in Lighthaven for the ILIAD conference. It was an excellent space. The LessWrong forum feels like what some people in the 90s used to hope the internet would be.
Edit 03.12.2024: $100 more donated by me since the original message.
There currently doesn't really exist any good way for people who want to contribute to AI existential risk reduction to give money in a way that meaningfully gives them assistance in figuring out what things are good to fund. This is particularly sad since I think there is now a huge amount of interest from funders and philanthropists who want to somehow help with AI x-risk stuff, as progress in capabilities has made work in the space a lot more urgent, but the ecosystem is currently at a particular low-point in terms of trust and ability to direct that funding towards productive ends.
Really? What's the holdup here exactly? How is it still hard to give funders a decent up-to-date guide to the ecosystem, or a knowledgeable contact person, at this stage? For a workable budget version today, can't people just get a link to this and then contact orgs they're interested in?
Two shovel-ready theory projects in interpretability.
Most scientific work isn't "shovel-ready." It's difficult to generate well-defined, self-contained projects where the path forward is clear without extensive background context. In my experience, this is extra true of theory work, where most of the labour is often about figuring out what the project should actually be, because the requirements are unclear or confused.
Nevertheless, I currently have two theory projects related to computation in superposition in my backlog that I think are valuable and that maybe have reasonably clear execution paths. Someone just needs to crunch a bunch of math and write up the results.
Impact story sketch: We now have some very basic theory for how computation in superposition could work[1]. But I think there’s more to do there that could help our understanding. If superposition happens in real models, better theoretical grounding could help us understand what we’re seeing in these models, and how to un-superpose them back into sensible individual circuits and mechanisms we can analyse one at a time. With sufficient understanding, we might even gain some insight into how circuits develop during training.
This post has a framework for compressing lots of small residual MLPs into one big residual MLP. Both projects are about improving this framework.
1) I think the framework can probably be pretty straightforwardly extended to transformers. This would help make the theory more directly applicable to language models. The key thing to show there is how to do superposition in attention. I suspect you can more or less use the same construction the post uses, with individual attention heads now playing the role of neurons. I put maybe two work days into trying this before giving it up in favour of other projects. I didn’t run into any notable barriers, the calculations just proved to be more extensive than I’d hoped they’d be.
2) Improve error terms for circuits in superposition at finite width. The construction in this post is not optimised to be efficient at finite network width. Maybe the lowest-hanging fruit for improving it is changing the hyperparameter p, the probability with which we connect a circuit to a set of neurons in the big network. In the post, we set p as a function of M and m, where M is the MLP width of the big network and m is the minimum neuron count per layer the circuit would need without superposition. The choice here was pretty arbitrary. We just picked it because it made the proof easier. Recently, Apollo played around a bit with superposing very basic one-feature circuits into a real network, and IIRC a range of p values seemed to work ok. Getting tighter bounds on the error terms as a function of p that are useful at finite width would be helpful here. Then we could better predict how many circuits networks can superpose in real life as a function of their parameter count. If I was tackling this project, I might start by just trying really hard to get a better error formula directly for a while. Just crunch the combinatorics. If that fails, I’d maybe switch to playing more with various choices of p in small toy networks to develop intuition. Maybe plot some scaling laws of performance with p at various network widths in 1-3 very simple settings. Then try to guess a formula from those curves and try to prove it’s correct.
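As a starting point for the toy experiments in project 2, here is a minimal sketch of the random-assignment step only. It assumes each circuit is connected to each neuron set independently with the same connection probability (called `p` in the code). The function names, the parameter `num_sets`, and the numbers in the usage loop are all hypothetical, and nothing here reproduces the post's actual construction or its error terms. It only measures how many neuron sets two circuits share on average, which under this independence assumption should be close to `num_sets * p**2`.

```python
import random

def assign_circuits(num_circuits, num_sets, p, seed=0):
    """Connect each circuit to each neuron set independently with probability p."""
    rng = random.Random(seed)
    return [
        {s for s in range(num_sets) if rng.random() < p}
        for _ in range(num_circuits)
    ]

def mean_pairwise_overlap(assignments):
    """Average number of neuron sets shared by two distinct circuits."""
    total, pairs = 0, 0
    for i in range(len(assignments)):
        for j in range(i + 1, len(assignments)):
            total += len(assignments[i] & assignments[j])
            pairs += 1
    return total / pairs

def expected_overlap(num_sets, p):
    """Expected shared sets for two circuits under independent assignment."""
    return num_sets * p * p

if __name__ == "__main__":
    num_sets = 200  # hypothetical number of neuron sets in the big network
    for p in (0.02, 0.05, 0.1, 0.2):
        assignments = assign_circuits(100, num_sets, p, seed=1)
        print(p, mean_pairwise_overlap(assignments), expected_overlap(num_sets, p))
```

Sweeping `p` like this at several values of `num_sets` would give the raw overlap statistics. Turning those into error terms for the superposed circuits is the actual open work and is not attempted here.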
Another very valuable project is of course to try training models to do computation in superposition instead of hard coding it. But Stefan mentioned that one already.
1 Boolean computations in superposition LW post. 2 Boolean computations paper version of the LW post, with more worked out but some of the fun stuff removed. 3 Some proofs about information-theoretic limits of comp-sup. 4 General circuits in superposition LW post. If I missed something, a link would be appreciated.
Agreed. I do value methods being architecture independent, but mostly just because of this:
and maybe a sign that a method is principled
At scale, different architectures trained on the same data seem to converge to learning similar algorithms to some extent. I care about decomposing and understanding these algorithms, independent of the architecture they happen to be implemented on. If a mech interp method is formulated in a mostly architecture independent manner, I take that as a weakly promising sign that it's actually finding the structure of the learned algorithm, instead of structure related to the implementation on one particular architecture.
for a large enough (overparameterized) architecture - in other words it can be measured by the
The sentence seems cut off.
I did have a pretty strong expectation of privacy for LW DMs. That was probably dumb of me.
This is not due to any explicit or implicit promise by the mods or the site interface I can recall. I think I was just automatically assuming that strong DM privacy would be a holy principle on a forum with respectable old-school internet culture around anonymity and privacy. This wasn’t really an explicitly considered belief. It just never occurred to me to question this. Just like I assume that doxxing is probably an offence that can result in an instant ban, even though I never actually checked the site guidelines on that.
The site is not responsible for my carelessness on this, but if there was an attention-grabbing box in the DM interface making it clear that mods do look at DMs and DM metadata under some circumstances that fall short of a serious criminal investigation or an apocalypse, I would have appreciated that.