All of Minh Nguyen's Comments + Replies

Further optimisation Update Log for 28th August:

I am working from here: Minh's Copy of Gemma SAE self-explanation - Colab (google.com)

What worked: Multi-Feature Combination and Replacing Earlier Layers

  1. Multi-feature combination works! I managed to combine feature 7656 ("France") and feature 7154 ("capital cities") from Neuronpedia's Gemma-1-2B [1] feature directory to elicit outputs for Paris, France. I'm just taking the sum of the vectors and dividing to find the average, so this should work the same as before even if you have only one feature (rough sketch below, after this comment). Weighting should be relatively simple as long as you can decide how to weight the features. Sometimes the feature refers to regional capitals that are not Paris, or references towns/adjectives describing towns, but that seems fair, since the original capital-cities feature included regional and not just national capitals (this suggests that better combination improves the accuracy of the desired output, which is what Meta SAEs do).

  2. Replacing earlier layers - Per @Clément Dumas's suggestion, replacing at earlier layers increases strength and improves the output of the tokens themselves. Explanations are significantly clearer and more direct (I tested Char (14054) from L6, the pair-of-names feature L12/5324, and the same-word-repetition feature L12/5373). That said, I'd weakly caution against over-relying on this, because I'm wary of not using the same layer the feature was extracted from. I suspect you already knew this, since the default setting in the Colab is layer 2 and not 6 or 12.

  3. Increasing the vector works for longer prompts - I simply multiplied the vectors, and the result is weakly more resilient and clearer when multiplied 2-5 times. We probably don't want to over-rely on this, since it's hard to find the optimal range without damaging the model's capabilities (see Golden Gate Claude), but it is something to try to get tests going.

What didn't work: Minor Prompt Changes

I was trying to figure out how to make sentence-long prompts work... (read more)
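A minimal sketch of the combination from point 1 and the scaling from point 3, assuming the feature directions come from the SAE's decoder matrix (the random W_dec below is a stand-in for the real one, and the weights and scale are illustrative rather than the notebook's settings):

import torch

d_model = 2048                       # Gemma-1-2B residual stream width
W_dec = torch.randn(16384, d_model)  # stand-in for the SAE decoder matrix (one row per feature)

feature_ids = [7656, 7154]           # "France" + "capital cities"
weights = torch.tensor([1.0, 1.0])   # equal weights = plain average

# Sum the decoder directions and divide by the total weight
# ("taking the sum of the vectors and dividing to find the average").
vecs = W_dec[feature_ids]            # (n_features, d_model)
combined = (weights[:, None] * vecs).sum(0) / weights.sum()

# Point 3: multiplying the steering vector 2-5x tends to give clearer
# outputs on longer prompts, at the risk of degrading the model.
scale = 3.0
steering_vector = scale * combined

With unequal weights this becomes a weighted average, which is all the weighting mentioned in point 1 should need.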

Hello! I've made two quick improvements, mainly to the prompt and tokens.

TL;DR: I changed the prompt to

prompt = '<start_of_turn>user\n "<unk>"?<end_of_turn>\n<start_of_turn>model\n "<unk>" "'
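A minimal sketch of how a prompt like this might be wired up, assuming a TransformerLens-style setup where the residual stream at the <unk> placeholder positions is overwritten with a (scaled) SAE feature direction at an early layer; the hook point, scale, and random stand-in vector are assumptions rather than the Colab's exact settings:

import torch
from transformer_lens import HookedTransformer

model = HookedTransformer.from_pretrained("gemma-2b")

prompt = '<start_of_turn>user\n "<unk>"?<end_of_turn>\n<start_of_turn>model\n "<unk>" "'
tokens = model.to_tokens(prompt)

# Positions of the <unk> placeholder tokens in the prompt
positions = (tokens[0] == model.tokenizer.unk_token_id).nonzero().squeeze(-1)

# Stand-in for the scaled feature direction (see the combination sketch above)
steering_vector = 8.0 * torch.randn(model.cfg.d_model)

def patch_unk(resid, hook):
    # Overwrite the residual stream at the placeholder positions
    resid[:, positions.to(resid.device), :] = steering_vector.to(resid.device, resid.dtype)
    return resid

# Hooking layer 2 follows the Colab default mentioned in the update log above;
# use_past_kv_cache=False keeps the prompt positions valid on every forward pass.
with model.hooks(fwd_hooks=[("blocks.2.hook_resid_post", patch_unk)]):
    out = model.generate(tokens, max_new_tokens=30, do_sample=False, use_past_kv_cache=False)

print(model.to_string(out[0]))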

Solutions to improve Self-explanation:

Shorter Prompt

I noticed that the scales were being affected by prior words in the prompt/context itself. I tried out feature 4088 and replaced some words. For example, replacing "word" with "concept" and "number" resulted in slightly different explanations at the higher scale. Intuitively, I s... (read more)
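A quick way to probe this is to sweep prompt wordings against injection scales and compare the generations side by side. Everything below - the wording variants, the scale range, the hook point, and the random stand-in for feature 4088's direction - is a placeholder rather than the notebook's actual settings:

import torch
from transformer_lens import HookedTransformer

model = HookedTransformer.from_pretrained("gemma-2b")
direction = torch.randn(model.cfg.d_model)  # stand-in for feature 4088's decoder direction

for word in ["word", "concept", "number"]:  # wordings to compare
    prompt = f'<start_of_turn>user\n What does the {word} "<unk>" mean?<end_of_turn>\n<start_of_turn>model\n'
    tokens = model.to_tokens(prompt)
    positions = (tokens[0] == model.tokenizer.unk_token_id).nonzero().squeeze(-1)
    for scale in [2.0, 8.0, 20.0]:          # injection scales to compare
        vec = scale * direction
        def patch(resid, hook, vec=vec, pos=positions):
            resid[:, pos.to(resid.device), :] = vec.to(resid.device, resid.dtype)
            return resid
        with model.hooks(fwd_hooks=[("blocks.2.hook_resid_post", patch)]):
            out = model.generate(tokens, max_new_tokens=20, do_sample=False, use_past_kv_cache=False)
        print(word, scale, model.to_string(out[0, tokens.shape[1]:]))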


I was thinking about the practical implications of this. As others have mentioned, models in production pretty much all use the prompt "you are an AI assistant". From a model-training perspective, it makes sense to build with this assumption in mind.

However, it occurs to me that I have never explicitly referred to any of my AI assistants as an AI assistant. Instead, I treat them more as an inner monologue, and I suspect many other users do this as well. If the AI makes an error, I essentially correct them the way I would correct my own inner monologue... (read more)

2Marc Carauleanu
I agree that interacting with LLMs is more like having an "extension of the mind" than interacting with a standalone agent at the moment. This might soon change with the advent of capable AI agents. Nonetheless, we think it is still important to model LLMs as correctly as we can, for example in a framing more like simulators rather than full-fledged agents. We focus on an agentic framing because we believe that's where most of the biggest long-term risks lie and where the field is inevitably heading.

[crossposted from EA Forum, to emphasise an important point. hope that's OK! will delete if it isn't]

How do we prevent the methodology of exclusively seeking and publishing negative information, without fact checking, from becoming an acceptable norm?

Re: Checking that claims are true

Adding on as a former Nonlinear intern who has been aware of a “falling out” between Alice and Nonlinear for almost a year now:

  1. To my knowledge, Nonlinear was given very few/practically no opportunities to respond to the many claims made in “Sharing Information About Nonlinear” be
... (read more)