Minh Nguyen
Minh Nguyen has not written any posts yet.

Hello! I’ve made 2 quick improvements, mainly to the prompt and tokens.
TL;DR: I changed the prompt to
prompt = '<start_of_turn>user\n "<unk>"?<end_of_turn>\n<start_of_turn>model\n "<unk>" "'
I noticed that the scales were being affected by prior words in the prompt/context itself. I tried out feature 4088 and replaced some words. For example, replacing "word" with "concept" or "number" resulted in slightly different explanations at the higher scales. Intuitively, I suspected this was a fundamental issue with "merging" 2 residual streams. So, to isolate variables, I started cutting the prompt down to just:
prompt = '<start_of_turn>user\n"X"?<end_of_turn>\n<start_of_turn>model\n "X" "'
In theory, this reduces the confounding influence of the prompt context on the SAE feature's explanation.
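To make the stripped-down prompt concrete, here is a minimal sketch that rebuilds it from a single placeholder token. The helper name `build_prompt` is hypothetical and not from the notebook; only the prompt string itself comes from the comment above, and how the SAE feature is later substituted at the placeholder positions is not shown.

```python
def build_prompt(placeholder: str = "X") -> str:
    """Rebuild the minimal Gemma chat prompt around one placeholder token.

    Hypothetical helper: the template mirrors the prompt quoted above,
    with the placeholder appearing once in the user turn and once in the
    opening of the model turn.
    """
    return (
        "<start_of_turn>user\n"
        f'"{placeholder}"?<end_of_turn>\n'
        "<start_of_turn>model\n"
        f' "{placeholder}" "'
    )


prompt = build_prompt("X")
# The placeholder should occur exactly twice, with no other context words.
placeholder_count = prompt.count('"X"')
```

Keeping the placeholder as a parameter makes it easy to compare variants (for example "X" versus "<unk>") while holding the rest of the template fixed.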
I noticed that at higher scales, explanations trended...
I was thinking about the practical implication of this. As others have mentioned, models in production pretty much all use the prompt "you are an AI assistant". From a model training perspective, it makes sense to build with this assumption in mind.
However, it occurs to me that I have never explicitly referred to any of my AI assistants as an AI assistant. Instead, I treat them more as an inner monologue, and I suspect many other users do this as well. If the AI makes an error, I essentially correct them the way I would correct my own inner monologue/thinking in a stream of consciousness. Here, the "second brain"/"extension of the mind"...
[crossposted from EA Forum, to emphasise an important point. hope that's OK! will delete if it isn't]
How do we prevent the methodology of exclusively seeking and publishing negative information, without fact checking, from becoming an acceptable norm?
Adding on as a former Nonlinear intern who has been aware of a “falling out” between Alice and Nonlinear for almost a year now:
Further optimisation Update Log for 28th August:
I am working from here: Minh's Copy of Gemma SAE self-explanation - Colab (google.com)
What worked: Multi-Feature Combination and Replacing Earlier Layers
Multi-feature combination works! I managed to combine feature 7656 ("France") and feature 7154 ("capital cities") from Neuronpedia's Gemma-1-2B [1] feature directory to elicit outputs for Paris, France. I'm just summing the vectors and dividing by the number of features to get the average, so this should work the same as before even if you have only 1 feature. Weighting should be relatively simple as long as you can decide how to weight the features.
Sometimes the feature refers to regional capitals that are not Paris, or references towns/adjectives describing towns, but that seems fair.
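The averaging step described above can be sketched as follows. This is an illustration under assumptions, not the notebook's actual code: `combine_features` is a hypothetical name, the optional `weights` argument is one possible way to do the weighting mentioned above, and the toy 2-dimensional vectors stand in for real SAE feature directions.

```python
import numpy as np


def combine_features(vectors, weights=None):
    """Return the (optionally weighted) average of feature vectors.

    Hypothetical helper. With weights=None this is a plain mean, so a
    single feature is returned unchanged.
    """
    stacked = np.stack([np.asarray(v, dtype=np.float64) for v in vectors])
    if weights is None:
        weights = np.ones(stacked.shape[0])
    weights = np.asarray(weights, dtype=np.float64)
    if weights.shape != (stacked.shape[0],):
        raise ValueError("need exactly one weight per feature vector")
    total = weights.sum()
    if total == 0:
        raise ValueError("weights must not sum to zero")
    # Scale each vector by its weight, sum, then normalise by total weight.
    return (weights[:, None] * stacked).sum(axis=0) / total


# Toy stand-ins for two feature directions (not real SAE vectors).
feature_a = [1.0, 0.0]
feature_b = [0.0, 1.0]

plain_mean = combine_features([feature_a, feature_b])
single = combine_features([feature_a])
weighted = combine_features([feature_a, feature_b], weights=[3.0, 1.0])
```

Dividing by the total weight keeps the combined vector on a scale comparable to a single feature, so the same scale sweep can be reused for one feature or several.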