All of NickyP's Comments + Replies

NickyP10

Yeah, the context length was 128 concepts for the small tests they did between architectures, and 2048 concepts for the larger models.

How exactly this translates is somewhat variable. They limit the concepts to around 200 characters, but this could be any number of tokens. They say they trained the large model on 2.7T tokens and 142B concepts, so on average about 19 tokens per concept.

The 128 concepts would translate to roughly 2.4k tokens, and the 2048 concepts to approximately 39k tokens.
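
For concreteness, the back-of-envelope conversion (using the reported totals):

```python
# Rough conversion from "concepts" to tokens, using the reported training totals.
tokens_per_concept = 2.7e12 / 142e9          # ≈ 19 tokens per concept on average
print(f"{tokens_per_concept:.1f}")           # ~19.0

print(f"{128 * tokens_per_concept:,.0f}")    # 128 concepts  ≈ 2,400 tokens
print(f"{2048 * tokens_per_concept:,.0f}")   # 2048 concepts ≈ 39,000 tokens
```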

NickyP10

Yeah, it was annoying to get working. I have now added a Google Colab in case anyone else wants to try anything.

It does seem interesting that the semantic arithmetic is hit or miss (mostly miss).

NickyP*20

Thanks for reading, and yeah, I was also surprised by how well it does. It does seem like there is degradation in auto-encoding from the translation, but I would guess it probably also gives the embedding space some nicer properties.

> I bet if you add Gaussian noise to them they still decode fine

I did try some small tests to see how sensitive the Sonar model is to noise, and it seems OK. I tried adding Gaussian noise, and it started breaking at around >0.5x the original vector size, or at around cosine similarity <0.9, but haven't tested too...
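
Roughly the kind of test I ran, as a sketch (the `encode`/`decode` calls stand in for whatever Sonar text-to-embedding / embedding-to-text pipeline you use; those names are hypothetical):

```python
import torch
import torch.nn.functional as F

def add_noise(emb: torch.Tensor, rel_scale: float) -> torch.Tensor:
    """Add Gaussian noise scaled relative to the embedding's own norm."""
    noise = torch.randn_like(emb)
    noise = noise / noise.norm() * emb.norm() * rel_scale
    return emb + noise

# emb = encode("some sentence")        # hypothetical Sonar encode call
emb = torch.randn(1024)                # stand-in embedding for illustration
for rel_scale in [0.1, 0.3, 0.5, 0.7]:
    noisy = add_noise(emb, rel_scale)
    print(rel_scale, F.cosine_similarity(emb, noisy, dim=0).item())
    # print(decode(noisy))             # hypothetical Sonar decode call
```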

NickyP10

Ok thanks, not sure why that happened but it should be fixed now.

NickyP20

The unlearning results seem promising!

The author's results from unlearning MMLU seem slightly rushed but moderately promising (I previously wrote a paper trying similar things; making good comparisons here is difficult), but the results from unlearning different coding languages seem very strong (compared to my previous attempt): the model seems to be substantially more monosemantic.

I agree with your suspicion that the Gemma SAE performance was poor because it used reconstructed activations; this matches the drop in performance I got when I tried doing this.

Would ...

3CalebMaresca
Hi Nicky! I agree that it would be interesting to see the steering performance of MONET compared to that of SAEs. At the moment, the way the routing probabilities are calculated makes this difficult, as they are computed separately for the bottom and top layers in HD or left and right layers. Therefore, it is hard to change the activation of expert ij without also affecting experts ij' and i'j for all i' != i and j' != j. One of the authors told me the following: "For pruning the experts, we manually expand the decomposed activations using $g_{hij}=g^1_{hi}^1∗g^2_{hj}$. After masking the relevant expert (i, j), we compute for all experts rather than performing efficient expert decomposition. This approach requires more memory and computational resources compared to the standard Monet mode, which is one of our current limitations. We are actively working on porting our Monet training code and developing a decomposed expert routing kernel in CUDA to enable more efficient expert manipulation without the need for full expert expansion." I think this problem would be easily solved for top-1 activations, as to steer you could just replace the expert the model wants to choose with the one you want to steer with. Since k = 1, you don't need to worry about affecting other routing probabilities. It would be really interesting if someone tried training a top-1 MONET model (with multiple heads, so that even though each head only selects one expert, it still has the ability to express itself through multiple semantic concepts) and tested its steering performance.
NickyP80

I wonder how many of these orthogonal vectors are "actually orthogonal" once we consider that we are adding two vectors together, and that the model has things like LayerNorm.

If one conditions on downstream mid-layer activations being "sufficiently different", it seems possible one could find something like a 10x degeneracy in the actual effects these have on the model. (A possibly relevant factor is how big the original activation vector is compared to the steering vector.)
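
A toy check of the kind I have in mind (purely illustrative; dimensions and scales are made up):

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)
d = 1024
act = torch.randn(d) * 10.0          # hypothetical residual-stream activation
v1 = torch.randn(d)
v2 = torch.randn(d)
v2 -= (v2 @ v1) / (v1 @ v1) * v1     # make v2 exactly orthogonal to v1

def cos(a, b):
    return F.cosine_similarity(a, b, dim=0).item()

for scale in [0.1, 1.0, 10.0]:
    s1, s2 = scale * v1 / v1.norm(), scale * v2 / v2.norm()
    post1 = F.layer_norm(act + s1, (d,))
    post2 = F.layer_norm(act + s2, (d,))
    print(scale, cos(act + s1, act + s2), cos(post1, post2))
```

When the steering vectors are small relative to the activation, the two steered activations come out nearly identical after LayerNorm, even though the steering vectors themselves are exactly orthogonal.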

NickyP50

I think there are already some papers doing similar work, though usually sold as reducing inference costs. For example, the MoEfication paper and Contextual Sparsity paper could probably be modified for this purpose.

NickyP20

Sorry! I have fixed this now

2Jonathan Claybrough
Thanks, and thank you for this post in the first place!
NickyP90

In case anyone finds it difficult to go through all the projects, I have made a longer post where each project title is followed by a brief description, and a list of the main skills/roles they are looking for.

See here: https://www.lesswrong.com/posts/npkvZG67hRvBneoQ9

Answer by NickyP52

Cadenza Labs has some video explainers on interpretability-related concepts: https://www.youtube.com/@CadenzaLabs

For example, an intro to Causal Scrubbing:

NickyP10

Seems to work fine for me, but here are the links to Market One, Market Two and Market Three from the post. (They show the % of customer funds to be returned, at 46%, 43% and 42% at the time of this comment.)

NickyPΩ330

Maybe I'm not fully understanding, but one issue I see is that without requiring "perfect prediction", one could potentially Goodhart on the proposal. I could imagine something like:

In training GPT-5, add a term that upweights very basic bigram statistics. In "evaluation", use your bigram statistics table to "predict" most top-k outputs just well enough to pass.

This would probably have a negative impact on performance, but this could possibly be tuned to be just sufficient to pass. Alternatively, one could use a toy model trained on the side that is easy to ...
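
To make the "bigram table" side concrete, a toy sketch (corpus and helper names are made up):

```python
from collections import Counter, defaultdict

# Toy "bigram table" predictor: count next-token frequencies for each token,
# then "predict" the top-k continuations from the table alone.
def build_bigram_table(token_ids):
    table = defaultdict(Counter)
    for prev, nxt in zip(token_ids, token_ids[1:]):
        table[prev][nxt] += 1
    return table

def predict_topk(table, prev_token, k=5):
    return [tok for tok, _ in table[prev_token].most_common(k)]

corpus = [1, 2, 3, 1, 2, 4, 1, 2, 3, 5, 1, 2]
table = build_bigram_table(corpus)
print(predict_topk(table, prev_token=1, k=2))  # [2] — token 1 is always followed by 2 here
```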

2TurnTrout
I'm worried about Goodharting on the proposal, but don't feel concerned by the specific avenue you propose. I think the bigram term would really dent performance, as you say. 
NickyP10

While I think this is important, and will probably edit the post, I think even in the unembedding, when getting the logits, the behaviour cares more about direction than distance.

When I think of distance, I implicitly think Euclidean distance:

$d(x, t) = \| x - t \| = \sqrt{\sum_i (x_i - t_i)^2}$

But the actual "distance" used for calculating logits looks like this:

$\text{logit}_t = x \cdot t = \| x \| \, \| t \| \cos\theta$

Which is a lot more similar to cosine similarity:

$\text{sim}(x, t) = \frac{x \cdot t}{\| x \| \, \| t \|} = \cos\theta$

(where $x$ is the final residual-stream activation and $t$ is the unembedding vector for a given token)

I think that because the metric is so similar to the cosine similarity, it makes more sense to think of size + direc...
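
As a toy illustration of the difference (a rough sketch; the shapes and the random unembedding are made up):

```python
import torch

torch.manual_seed(0)
d_model, vocab = 64, 1000
x = torch.randn(d_model)            # final residual-stream activation
W_U = torch.randn(d_model, vocab)   # unembedding matrix (one column per token)

logits = x @ W_U                                        # what the model actually uses
cosine = (x @ W_U) / (x.norm() * W_U.norm(dim=0))       # cosine similarity per token
euclid = (x.unsqueeze(1) - W_U).norm(dim=0)             # Euclidean distance per token

# Compare which token "wins" under each notion of closeness.
print(logits.argmax().item(), cosine.argmax().item(), euclid.argmin().item())
```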

2LawrenceC
Yeah, I agree! You 100% should not think about the unembed as looking for "the closest token", as opposed to looking for the token with the largest dot product (= high cosine similarity + large size).  I suspect the piece would be helpful for people with similar confusions, though I think by default most people already think of features as directions (this is an incredible tacit assumption that's made everywhere in mech interp work), especially since the embed/unembed are linear functions. 
NickyP32

This is true. I think that visualising points on a (hyper-)sphere is fine, but it is difficult in practice to parametrise the points that way.

It is more that the vectors on the GPU look like $(x_1, x_2, \dots, x_n)$, but the vectors in the model are treated more like $\| x \| \cdot \hat{x}$, i.e. a magnitude times a unit direction.
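
For example (a trivial sketch of the two views of the same vector):

```python
import torch

x = torch.randn(8)       # how the vector is stored on the GPU: a flat list of floats
r = x.norm()             # magnitude
x_hat = x / r            # unit direction, i.e. a point on the hypersphere
assert torch.allclose(x, r * x_hat)
```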

NickyP10

Thanks for this comment! I think this is one of the main concerns I am pointing at.

I think something like fiscal aid could work, but have people tried making models for responses to things like this? It feels like with Covid the relatively decent response was because the government was both enforcing a temporary lockdown policy and sending checks to adjust things "back to normal" despite this. If job automation is slightly more gradual, on the scale of months to years, and specific only to certain jobs at a time, the response could be quite different, and it might be more likely that things end up poorly.

2Logan Zoellner
Agreed.  I think a big part of the reason why we saw a large fiscal response in Covid but not in e.g. 2008 was because it was agreed that it was "nobody's fault".   In this sense, the faster that AI produces unemployment, the more likely we will see a policy response.  If tens of millions of middle class educated workers suddenly wake up one day without a job, politicians will respond.  If, on the other hand, AI slowly squeezes the lowest productivity workers out of a job over the course of 1-2 decades, there will be calls for "reeducation" or "tough love" or some such nonsense as the economy slowly spirals downward Japan style. Ironically, this then makes for one of the few cases where "going faster" makes the transition to AGI less harmful.  Whereas most AI safety issues are worse the faster the transition is.
NickyP30

Yeah, though I think it depends on how many people are able to buy the new goods at a better price. If most well-paid employees (i.e. the employees that companies get the most value from automating) no longer have a job, then the number of people who can buy the more expensive goods and services might go down. It seems counter-intuitive, but GDP could go down if the number of people who lose their jobs is high enough. It feels possible that the recent tech developments were barely net positive to nominal GDP despite rapid improvements, and that fast enough technological progress could cause nominal GDP to go in the other direction.

NickyPΩ130

I suspect that with a tuned initial prompt, ChatGPT would do much better. For example, something like:

Simulate an assistant on the other end of a phone call, who is helping me to cook a turmeric latte in my kitchen. I have never cooked before and need extremely specific instructions. Only speak one sentence at a time. Only explain one instruction at a time. Never say "and". Please ask clarifying questions if necessary. Only speak one sentence at a time, and await a response. Be explicit about: 
- where I need to go.
- what I need to get
- where I need to bring thi
...
NickyP63

> Is the idea with the cosine similarity to check whether similar prompt topics consistently end up yielding similar vectors in the embedding space across all the layers, and different topics end up in different parts of embedding space?

Yeah, I would say this is the main idea I was trying to get towards.

> If that's the idea, have you considered just logging which attention heads and MLP layers have notably high or notably low activations for different vs. similar topics instead?

I think I would probably just look at the activations instead of the output + residual in...
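
Something like this is roughly what I mean, as a sketch (arbitrary small model, mean-pooled hidden states; not the exact setup from the post):

```python
import torch
from transformers import AutoModel, AutoTokenizer

# Compare per-layer hidden states for two prompts; the model choice is arbitrary.
name = "gpt2"
tok = AutoTokenizer.from_pretrained(name)
model = AutoModel.from_pretrained(name)

def layer_states(prompt):
    ids = tok(prompt, return_tensors="pt")
    with torch.no_grad():
        out = model(**ids, output_hidden_states=True)
    # mean-pool over sequence positions, giving one vector per layer
    return [h.mean(dim=1).squeeze(0) for h in out.hidden_states]

a = layer_states("A recipe for a turmeric latte")
b = layer_states("How to make a spiced milk drink")
for layer, (ha, hb) in enumerate(zip(a, b)):
    sim = torch.nn.functional.cosine_similarity(ha, hb, dim=0)
    print(layer, round(sim.item(), 3))
```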

NickyP20

Maybe you have seen it before, but Veloren looks like a project whose people you should talk to. They are building an open-source voxel MMO in Rust, and you might be able to collaborate with them. I think most people working on it are doing it as a hobby side project.