LLaMA 1 7B definitely seems to be a "pure base model". I agree that we have less transparency into the pre-training of Gemma 2 and Qwen 1.5, and I'll add this as a limitation, thanks!
I've checked that Pythia 12b deduped (pre-trained on the pile) also refuses harmful requests, although at a lower rate (13%). Here's an example, using the following prompt template:
"""User: {instruction}
Assistant:"""
It's pretty dumb though, and often just outputs nonsense. When I give it the Vicuna system prompt, it refuses 100% of harmful requests, though it has a bunch of "incompetent refusals", similar to LLaMA 1 7B:
"""A chat between a curious user and an artificial intelligence assistant. The assistant gives helpful, detailed, and polite answers to the user's questions.
USER: {instruction}
ASSISTANT:"""
Amazing! We found your original library super useful for our Attention SAEs research, so thanks for making this!
Thanks! I'm not sure. My guess is that if you go super narrow, it may be more likely to result in an inconvenient level of "feature splitting". Since there are only a few total concepts to learn, an SAE of equivalent width might exploit its greater relative capacity to learn niche combinations of features (to reduce sparsity loss).