neverix — LessWrong

neverix's Shortform

Lessons from a year of university AI safety field building

by yix, afterless, Parv Mahajan, Andersehen, Tuna, and neverix

This post is an organizational update from Georgia Tech’s AI Safety Initiative (AISI) and roughly represents our collective view. In this post, we share lessons & takes from the 2024-25 academic year, describe what we’ve done, and detail our plans for the next academic year.

Introduction

Hey, we’re organizers of...

Jun 6, 2025 · 36

Evolutionary prompt optimization for SAE feature visualization

TLDR:
* Fluent dreaming for language models is an algorithm based on the GCG method that uses gradients and genetic algorithms to reliably find plain-text, readable prompts for LLMs that maximize certain logits or residual stream directions. The authors showed its use for visualizing MLP neurons. We show this method...

Nov 14, 2024 · 28

SAE features for refusal and sycophancy steering vectors

TL;DR
* Steering vectors provide evidence that linear directions in LLMs are interpretable. Since SAEs decompose linear directions, they should be able to interpret steering vectors.
* We apply the gradient pursuit algorithm suggested by Smith et al. to decompose steering vectors, and find that they contain many interpretable and...

Oct 12, 2024 · 29

Extracting SAE task features for in-context learning

by Dmitrii Kharlapenko, neverix, Neel Nanda, and Arthur Conmy

TL;DR
* We try to study task vectors in the SAE basis. This is challenging because there is no canonical way to convert an arbitrary vector in the residual stream to a linear combination of SAE features: you can't just pass an arbitrary vector through the encoder without going...

Aug 12, 2024 · 31

Self-explaining SAE features

by Dmitrii Kharlapenko, neverix, Neel Nanda, and Arthur Conmy

TL;DR
* We apply the method of SelfIE/Patchscopes to explain SAE features: we give the model a prompt like “What does X mean?”, replace the residual stream on X with the decoder direction times some scale, and have it generate an explanation. We call this self-explanation.
* The natural...

Aug 5, 2024 · 62