LESSWRONG

All of RGRGRG's Comments + Replies

Bridging the VLM and mech interp communities for multimodal interpretability

Hi Sonia - Can you please explain what you mean by "mixed selectivity"? In particular, I don't understand what you mean by "Some of these studies then seem to conclude that SAEs alleviate superposition when really they may alleviate mixed selectivity." Thanks.

StefanHex's Shortform

RGRGRG8mo10

I like this recent post about atomic meta-SAE features, I think these are much closer (compared against normal SAEs) to what I expect atomic units to look like:

https://www.lesswrong.com/posts/TMAmHh4DdMr4nCSr5/showing-sae-latents-are-not-atomic-using-meta-saes

Growth and Form in a Toy Model of Superposition

RGRGRG8mo10

Would you be willing to share the raw data from the plot "Developmental Stages of TMS"? I'm specifically hoping to look at line plots of weights vs. biases over time.

Thanks.

1Edmund Lau8mo

Hello @RGRGRG, yes, I can share the raw data for that plot. If you can direct message me your email address or any other way of sending a JSON file, I can send it to you. Also, the developmental stages of this model (see the Setup section) are quite robust. If you wish to reproduce this trajectory, starting from the 4++- initial configuration (see the "TMS critical points are k-gons" section) will likely produce a similar trajectory.

There Should Be More Alignment-Driven Startups

RGRGRG1y10

What are the terms of the seed funding prize(s)?

1Judd Rosenblatt1y

We plan to announce further details in a later post.

Mechanistically Eliciting Latent Behaviors in Language Models

RGRGRG1y10

Enjoyed this post! Quick question about obtaining the steering vectors:

Do you train them one at a time, possibly adding an additional orthogonality constraint between each train?

2Andrew Mack1y

Yes, I train them one at a time, constraining each new vector to be orthogonal to the older ones (this was not clear in the post, so thanks for asking!). I haven't experimented with this, but you could also imagine using only "soft" orthogonality constraints (e.g., penalizing pairwise cosine similarities between vectors).
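The two options mentioned in this reply can be sketched as follows. This is a minimal illustration, not the post's actual training code: the random vectors stand in for trained steering vectors, and the helper names `project_out` and `soft_orthogonality_penalty` are hypothetical. A real training loop would presumably apply the projection (or add the penalty to the loss) during optimization rather than once at the end.

```python
import numpy as np

def project_out(v, previous):
    """Hard constraint: remove from v its component along each earlier vector.

    Assumes the vectors in `previous` are already mutually orthogonal.
    """
    for u in previous:
        v = v - (v @ u) / (u @ u) * u
    return v

def soft_orthogonality_penalty(vectors):
    """Soft constraint: sum of squared pairwise cosine similarities."""
    V = np.stack(vectors)
    V = V / np.linalg.norm(V, axis=1, keepdims=True)
    cos = V @ V.T
    off_diag = cos - np.diag(np.diag(cos))
    # Each pair appears twice in the symmetric matrix, hence the factor 0.5.
    return 0.5 * np.sum(off_diag ** 2)

rng = np.random.default_rng(0)
found = []
for _ in range(3):
    # Stand-in for one trained steering vector (assumption: dimension 8).
    candidate = rng.normal(size=8)
    candidate = project_out(candidate, found)
    found.append(candidate / np.linalg.norm(candidate))

# Gram matrix of the found vectors; close to the identity if they are orthonormal.
gram = np.stack(found) @ np.stack(found).T
penalty = soft_orthogonality_penalty(found)
```

With the hard constraint, the penalty is zero up to floating-point error; with only the soft penalty in the loss, vectors would be pushed toward orthogonality without being forced onto it.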

Transcoders enable fine-grained interpretable circuit analysis for language models

RGRGRG1y10

Question about the "rules of the game" you present. Are you allowed to simply look at layer 0 transcoder features for the final 10 tokens? You could probably roughly estimate the input string from these features' top activators. From your case study, it seems that you effectively look at layer 0 transcoder features for a few of the final tokens through a backwards search, but I wonder if you can skip the search and simply look at transcoder features. Thank you.

Finding Sparse Linear Connections between Features in LLMs

RGRGRG1yΩ110

To confirm - the weights you share, such as 0.26 and 0.23 are each individual entries in the W matrix for:
y=Wx ?

2Logan Riggs1y

Correct. So they’re connecting a feature in F2 to a feature in F1.
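As a small illustration of the reading confirmed here, the sketch below treats each reported weight as a single entry of W in y = Wx. The direction (x holding F1 activations, y holding F2 activations), the matrix sizes, and the index positions are all assumptions made for the example; only the values 0.26 and 0.23 come from the thread.

```python
import numpy as np

n_f1, n_f2 = 4, 3              # assumed feature counts, for illustration only
W = np.zeros((n_f2, n_f1))     # sparse linear map between the two feature sets

# Each nonzero entry W[i, j] links one feature j on the input side
# to one feature i on the output side.
W[2, 0] = 0.26
W[2, 3] = 0.23

x = np.array([1.0, 0.0, 0.0, 2.0])   # hypothetical input feature activations
y = W @ x                            # y[2] = 0.26 * 1.0 + 0.23 * 2.0
```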

Growth and Form in a Toy Model of Superposition

RGRGRG2y30

This is a casual thought and by no means something I've thought hard about - I'm curious whether b is a lagging indicator, which is to say, there's actually more magic going on in the weights and once weights go through this change, b catches up to it.

Another speculative thought: let's say we are moving from 4* -> 5* and |W_3| is the new W that is taking on high magnitude. Does this occur because somehow W_3 has enough internal individual weights to jointly look at its two (new) neighbors' W_i's roughly equally?

Does the cos similarity and/or dot product of this new W_3 with its neighbors grow during the 4* -> 5* transition (and does this occur prior to the change in b?)

1Daniel Murfet2y

The changes in the matrix W and the bias b happen at the same time; b is not a lagging indicator.
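One way to check the timing question raised in this thread would be to log, per checkpoint, the norm of the new column, its bias entry, and its cosine similarity with the other columns. The sketch below assumes W is stored as a 2 x 6 array whose columns are the W_i and b as a length-6 vector; the checkpoints here are random placeholders, not data from the post, and `column_cosines` is a hypothetical helper.

```python
import numpy as np

def column_cosines(W, i):
    """Cosine similarity between column i of W and every column of W."""
    norms = np.linalg.norm(W, axis=0)
    return (W[:, i] @ W) / (norms[i] * norms + 1e-12)

# Placeholder checkpoints (assumed shapes): one (W, b) pair per saved step.
rng = np.random.default_rng(0)
checkpoints = [(rng.normal(size=(2, 6)), rng.normal(size=6)) for _ in range(5)]

i = 3  # index of the column of interest, e.g. W_3
history = []
for step, (W, b) in enumerate(checkpoints):
    cos = column_cosines(W, i)
    # Record |W_i|, b_i and the similarity profile for later plotting.
    history.append((step, float(np.linalg.norm(W[:, i])), float(b[i]), cos))
```

Plotting the recorded norm, bias, and neighbor cosines against the step index would show whether they move together or one leads the other.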

Growth and Form in a Toy Model of Superposition

RGRGRG2y30

Question about the gif - to me it looks like the phase transition is more like:

4++- to unstable 5+- to 4+- to 5-
(Unstable 5+- seems to have similar loss to 4+-).

Why do we not count the large red bar as a "-" ?

1Daniel Murfet2y

Good question. What counts as a "-" is spelled out in the paper, but it's only outlined here heuristically. The "5 like" thing it seems to go near on the way down is not actually a critical point.

LoRA Fine-tuning Efficiently Undoes Safety Training from Llama 2-Chat 70B

RGRGRG2y40

Do you expect similar results (besides the fact that it would take longer to train / cost more) without using LoRA?

4Simon Lermen2y

There is in fact other work on this. For one, there is this post, in which I was also involved. There was also the recent release by Yang et al., who use normal fine-tuning on a very small dataset (https://arxiv.org/pdf/2310.02949.pdf). So yes, this works with normal fine-tuning as well.

Become a PIBBSS Research Affiliate

RGRGRG2y10

If I were to be accepted for this cycle, would I be expected to attend any events in Europe? To be clear, I could attend all events in and around Berkeley.

1Nora_Ammann2y

Yes, as stated in the requirements section, affiliates are expected to attend retreats, and I expect events to be split about 50/50 between the US and Europe.

Become a PIBBSS Research Affiliate

RGRGRG2yΩ110

What city/country is PIBBSS based out of / where will the retreats be? (Asking as a Bay Area American without a valid passport).

2Nora_Ammann2y

We're not based in a single location. We are open to accepting affiliates that are based in whatever place, and are also happy to help them relocate (within some constraints) to somewhere else (e.g. where they have access to a more lively research/epistemic community) if that would be beneficial for them. That said, I also think we are best placed to help people (and have historically tended to run things) in either London/Oxford, Prague or Berkeley.

There should be more AI safety orgs

RGRGRG2y100

For any potential funders reading this: I'd be open to starting an interpretability lab and would love to chat. I've been full-time on MI for about 4 months - here is some of my work: https://www.lesswrong.com/posts/vGCWzxP8ccAfqsrS3/thoughts-about-the-mechanistic-interpretability-challenge-2

I have a few PhD friends who are working for software jobs they don't like and would be interested in joining me for a year or longer if there were funding in place (even for just the trial period Marius proposes).

My very quick take is that interpretability... (read more)

What I would do if I wasn’t at ARC Evals

RGRGRG2y00

I really like your ambitious MI section and I think you hit on a few interesting questions I've come across elsewhere:

Two researchers interpreted a 1-layer transformer network and then I interpreted it differently - there isn't a great way to compare our explanations (or really know how similar vs different our explanations are).

With papers like the Hydra effect that demonstrate similar knowledge can be spread throughout a network, it's not clear to me whether or how we want to analyze impact - can/should we jointly ablate multiple units across different heads at ... (read more)

6LawrenceC2y

I don't have a good answer here unfortunately. My guess is (as I say above) the most important thing is to push forward on the quality of explanations and not the size?

How did you make your way back from meta?

Answer by RGRGRGSep 08, 202320

Most weekdays, I set myself the goal of doing twelve focused blocks of 24 minutes of object-level work (my variant on Pomodoro). Once I complete these blocks, I can do whatever I want - whether that is stopping work for the rest of the day, more object-level work, meta work, or anything else.

If you try something like this, I'd recommend setting a goal of doing 6(?) such blocks and then letting yourself do as much or as little meta as you want; and then potentially gradually working up to 10-13 blocks.

Why Is No One Trying To Align Profit Incentives With Alignment Research?

RGRGRG2y52

Over the last 3 months, I've spent some time thinking about mech interp as a for-profit service. I've pitched to one VC firm, interviewed for a few incubators/accelerators including ycombinator, sent out some pitch documents, co-founder dated a few potential cofounders, and chatted with potential users and some AI founders.

There are a few issues:

First, as you mention, I'm not sure if mech interp is yet ready to understand models. I recently interpreted a 1-layer model trained on a binary classification function https://www.lesswrong.com/posts/... (read more)

The positional embedding matrix and previous-token heads: how do they actually work?

RGRGRG2y20

Thank you! I'm still surprised how little most heads in L0 + L1 seem to be using the positional embeddings. L1H4 looks reasonably uniform so I could accept that maybe that somehow feeds into L2H2.

Decomposing independent generalizations in neural networks via Hessian analysis

RGRGRG2y10

nit: do you mean 6x6 Boolean patterns not 4x4?

1Nina Panickssery2y

ah yes thanks

The positional embedding matrix and previous-token heads: how do they actually work?

RGRGRG2yΩ120

This is a surprising and fascinating result. Do you have attention plots of all 144 heads you could share?

I'm particularly interested in the patterns for all heads on layers 0 and 1 matching the following caption

(Left: a 50x50 submatrix of LXHY's attention pattern on a prompt from openwebtext-10k. Right: the same submatrix of LXHY's attention pattern, when positional embeddings are averaged as described above.)

1AdamYedidia2y

Here are the plots you asked for, for all heads! You can find them at: https://github.com/adamyedidia/resid_viewer/tree/main/experiments/pngs I haven't looked too carefully yet, but it looks like it makes little difference for most heads, but is important for L0H4 and L0H7.

Thoughts on sharing information about language model capabilities

RGRGRG2y10

As one specific example - has RLHF, which the post below suggests was potentially initially intended for safety, been a net negative for AI safety?

https://www.alignmentforum.org/posts/LqRD7sNcpkA9cmXLv/open-problems-and-fundamental-limitations-of-rlhf

Thoughts on sharing information about language model capabilities

RGRGRG2yΩ010

My primary safety concern is what happens if one of these analyses somehow leads to a large improvement over the state of the art. I don't know what form this would take and it might be unexpected given the Bitter Lesson you cite above, but if it happens, what do we do then? Given this is hypothetical and the next large improvement in LMs could come elsewhere, I'm not suggesting we stop sharing now. But I think we should be prepared that there might be a point in time where we need to acknowledge such sharing leads to significantly stronger models and thus should re-evaluate sharing such eval work.

1RGRGRG2y

As one specific example - has RLHF, which the post below suggests was potentially initially intended for safety, been a net negative for AI safety? https://www.alignmentforum.org/posts/LqRD7sNcpkA9cmXLv/open-problems-and-fundamental-limitations-of-rlhf

Thoughts about the Mechanistic Interpretability Challenge #2 (EIS VII #2)

RGRGRG2y10

The differences between these two projects seem like an interesting case study in MI. I'll probably refer to this a lot in the future.

Excited to see case studies comparing and contrasting our works. Not that you need my permission, but feel free to refer to this post (and if it's interesting, this comment) as much or as little as desired.

One thing that I don't think came out in my post is that my initial reaction to the previous solution was that it was missing some things and might even have been mostly wrong. (I'm still not certain that... (read more)

Thoughts about the Mechanistic Interpretability Challenge #2 (EIS VII #2)

RGRGRG2y10

One thought I've had, inspired by discussion (explained more later), is whether:

"label[ing] points by interpolating" is not the opposite of "developing an interesting, coherent internal algorithm." (This is based on a quote from Stephen Casper's retrospective that I also quoted in my post).

It could be the case that the network might have "develop[ed] an interesting, coherent algorithm", namely the row coloring primitives discussed in this post, but uses "interpolation/pattern matching" to approximately detect the cutoff points.

When I started t... (read more)

Thoughts about the Mechanistic Interpretability Challenge #2 (EIS VII #2)

RGRGRG2y20

Thank you for the kind words and the offer to donate (not necessary but very much appreciated). Please donate to https://strongminds.org/ which is listed on Charity Navigator's list of high impact charities ( https://www.charitynavigator.org/discover-charities/best-charities/effective-altruism/ )

I will respond to the technical parts of this comment tomorrow or Tuesday.

5scasper2y

Alignment Grantmaking is Funding-Limited Right Now

RGRGRG2y80

I wonder if this is simply the result of the generally bad SWE/CS market right now. People who would otherwise be in big tech/other AI stuff, will be more inclined to do something with alignment.

This is roughly my situation. Waymo froze hiring and had layoffs while continuing to increase output expectations. As a result I/we had more work. I left in March to explore AI and landed on Mechanistic Interpretability research.

3Stephen McAleese2y

I have a similar story. I left my job at Amazon this year because there were layoffs there. Also, the release of GPT-4 in March made working on AI safety seem more urgent.

ARC is hiring theoretical researchers

RGRGRG2y10

"We will keep applications open until at least the end of August 2023"

Is there any advantage to applying early vs in August 2023? I ask as someone intending to do a few months of focused independent MI research. I would prefer to have more experience and sense of my interests before applying, but on the other hand, don't want to find out mid-August that you've filled all the roles and thus it's actually too late to apply. Thanks.

5Jacob_Hilton2y

We will do our best to fairly consider all applications, but realistically there is probably a small advantage to applying earlier. This is simply because there is a limit to how quickly we can grow the organization, so if hiring goes better than expected then it will be longer before we can take on even more people. With that being said, we do not have a fixed number of positions that we are hiring for; rather, we plan to vary the number of hires we make based on the strength of the applications we receive. Moreover, if we were unable to hire someone due to capacity constraints, we would very likely be interested in hiring them at a later date. For these reasons, I think the advantage to applying earlier is a fairly small consideration overall, and it sounds like it would make more sense for you to apply whenever you are comfortable.

Why I'm Not (Yet) A Full-Time Technical Alignment Researcher

RGRGRG2y10

Thanks for posting this - not OP, but I will likely apply come early June. If anyone else is associated with other grant opportunities, would love to hear about those as well.

Why I'm Not (Yet) A Full-Time Technical Alignment Researcher

RGRGRG2y10

Just wanted to say that I have similar questions about how to best (try to) get funding for mechanistic interpretability research. Might send a bunch of apps out come early June; but like OP, I don't have any technical results in alignment (though like OP, I like to think I have a solid (yet different) background).

Best Ways to Try to Get Funding for Alignment Research?

RGRGRG2y10

DM'd

Best Ways to Try to Get Funding for Alignment Research?

RGRGRG2y10

Thanks!

2the gears to ascension2y

I see on your profile that you have already done a PhD in machine learning, which gives you a particular kind of context that I'm happy to see. I would love to talk synchronously at some point; do you have time for a voice call or focused text chat about your research background? I'm just an independent researcher and don't ask because of any ability to fund you, but I'm interested in talking to people with experience to exchange perspectives. I'm most interested in fixing the QACI plan and fully understanding what it would make sense to mean, in all edge cases, by the word "agency". Some great progress has been made on that; e.g., check out the two papers mentioned in the comments of this post: https://www.lesswrong.com/posts/JqWQxTyWxig8Ltd2p/relative-abstracted-agency A lot of what we're worried about is AI being able to reach simulation grade agency over us before we've reached simulation grade in return. This would occur for any of a variety of reasons. Curious about your thoughts!

Best Ways to Try to Get Funding for Alignment Research?

RGRGRG2y10

Thanks

> key problems

Is there a blog post on key problems?

> sharing their plans

Where is the best place to share? Once I come up with a plan I'm happy with, is there value in posting it on this site?

2the gears to ascension2y

There are a number of intro posts floating around. https://stampy.ai/ is one major project to make an organized list of topics. I keep meaning to look up more and getting distracted so I'm sending this instead of nothing. edit: here are some more https://www.lesswrong.com/tag/ai-alignment-intro-materials

Nobody’s on the ball on AGI alignment

RGRGRG2y60

What is the best way (assuming one exists), as an independent researcher, with a PhD in AI but not in ML, to get funding to do alignment work? (I recently left my big tech job).

3Nathan Helm-Burger2y

As someone who left their mainstream tech job a year and a half ago, and tried to get funding for alignment work and applied to alignment orgs... It's not easy, even if you have a legible track record as a competent tech worker. I'm hoping that a shift in public concern over AGI risk will open up more funding for willing helpers.