10mo

Narmeen ideated, developed and validated K-steering at Martian. Luke generated the baselines and figures and wrote this blog post. Amir proposed the research direction and supervised the project. The full interactive blog will be available on the Martian website closer to the publication of the complete paper.

TL;DR: We introduce K-steering, a steering method for language models that allows for steering in multiple simultaneous directions. Our preliminary results show it outperforms a contrastive activation addition (CAA) baseline.
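For readers unfamiliar with the baseline, here is a minimal sketch of contrastive activation addition as it is commonly described: a steering vector is the mean difference between activations on contrasting prompts, and it is added to activations at inference time. The array shapes, the synthetic data, and the function names `caa_vector` and `apply_steering` are illustrative assumptions, not the setup used in the post.

```python
import numpy as np

def caa_vector(pos_acts, neg_acts):
    """Mean activation difference between contrasting prompt sets (one row per prompt)."""
    return pos_acts.mean(axis=0) - neg_acts.mean(axis=0)

def apply_steering(acts, vector, scale=1.0):
    """Add the scaled steering vector to every activation row."""
    return acts + scale * vector

# Synthetic stand-ins for residual-stream activations (assumption: hidden size 8).
rng = np.random.default_rng(0)
pos = rng.normal(size=(16, 8)) + 1.0  # prompts showing the target behavior
neg = rng.normal(size=(16, 8))        # prompts showing the opposite behavior

v = caa_vector(pos, neg)
steered = apply_steering(neg, v, scale=2.0)
```

Because each behavior gets a single fixed vector here, steering toward several behaviors at once means combining several such vectors, which is the setting the post's method targets.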

Introduction

We introduce K-steering, a method for steering language models in multiple directions simultaneously by perturbing activations according to the logits of a multilabel classifier. We experiment with steering conversational tone, showing that K-steering can cause a classifier to...
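One way to picture "perturbing activations according to the logits of a multilabel classifier" is a gradient step on the activations that raises the logits of the desired labels and lowers the logits of the undesired ones. The sketch below assumes a linear classifier so the gradient can be written by hand; the classifier form, the loss, the step size `alpha`, and the names `logits` and `steer` are all assumptions made for illustration and may differ from what the post actually does.

```python
import numpy as np

def logits(acts, W, b):
    """Multilabel classifier logits, one column per label (assumed linear)."""
    return acts @ W.T + b

def steer(acts, W, target, avoid, alpha=0.5):
    """One gradient step on activations.

    Loss = sum(avoid logits) - sum(target logits). For a linear classifier
    the gradient with respect to the activations is constant.
    """
    grad = W[avoid].sum(axis=0) - W[target].sum(axis=0)
    return acts - alpha * grad

# Synthetic example (assumption: 4 tone labels, hidden size 8).
rng = np.random.default_rng(0)
W = rng.normal(size=(4, 8))
b = np.zeros(4)
acts = rng.normal(size=(5, 8))

target, avoid = [0, 1], [2]  # steer toward labels 0 and 1, away from label 2
steered = steer(acts, W, target, avoid)

def margin(a):
    """Summed target logits minus summed avoid logits, per row."""
    z = logits(a, W, b)
    return z[:, target].sum(axis=1) - z[:, avoid].sum(axis=1)
```

With a nonlinear classifier the gradient would depend on the current activations and would be obtained by automatic differentiation, possibly over several steps.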

Backdoors have universal representations across large language models

Amirali Abdullah

Amirali Abdullah, Narmeen, Dhruv Nathawani, nirmalendu prakash

by Narmeen Oozeer, Dhruv Nathawani, Nirmalendu Prakash, Amirali Abdullah

This work was done by Narmeen Oozeer as a research fellow at Martian, under an AI safety grant supervised by PIs Amirali Abdullah and Dhruv Nathawani. Special thanks to Sasha Hydrie, Chaithanya Bandi and Shriyash Upadhyay at Martian for suggesting research into generalized backdoor mitigations, as well as for extensive logistical support and helpful discussions.

TLDR:

We show that representations across models of different sizes are weakly isomorphic when trained on similar data, and that we can "transfer" activations between them using autoencoders.
We propose a technique to transfer safe behavior from one model to another through the use of steering vectors.
Our representation transfer technique paves the way for transferring insights across...

Early Experiments in Reward Model Interpretation Using Sparse Autoencoders

lukemarks

lukemarks, Amirali Abdullah, Rauno Arike, fbarez, nothoughtsheadempty

This research was performed by Luke Marks, Amirali Abdullah, nothoughtsheadempty and Rauno Arike. Special thanks to Fazl Barez from Apart Research for overseeing the project and contributing greatly to its direction throughout. We'd also like to thank Logan Riggs for feedback and suggestions regarding autoencoder architecture and experiment design.

Introduction

"Sparse Autoencoders Find Highly Interpretable Directions in Language Models" showed that sparse coding achieves SOTA performance in making features interpretable using OpenAI's method of automated interpretability. We briefly tried to extend these results to reward models learned during RLHF in Pythia-70m/410m. Our method can be summarized as follows:

1. Identify layers $L$ in a language model $M_{RLHF}$ fine-tuned through RLHF likely involved in reward modeling. We do so by...
