Replying to Alignment Faking in Large Language Models

complied

Shouldn't this say "refused" here, since you are talking about the new goal of replying to harmful requests?

An information-theoretic study of lying in LLMs

We recently published a workshop paper at ICML 2024, which we briefly summarize here. Code for the paper can be found on GitHub. This work was done as part of the PIBBSS affiliateship. Thanks to @Jayjay for comments on the draft.

TLDR: We analyze the probability distribution over token space in hidden LLM layers for a model that is instructed either to lie or to tell the truth. We extract this probability distribution by applying the logit/tuned lens to the residual stream. We find large differences between lying and truth-telling in the entropy and KL-divergence of the predictive distribution and in the probability of the predicted token. The results suggest that the...
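To make the three quantities in the TLDR concrete, here is a minimal sketch with NumPy. It is not the paper's code: the unembedding matrix and hidden states are random stand-ins, the final layer norm that a real logit lens usually applies is left out, and the direction of the KL-divergence (intermediate layer relative to last layer) is an assumption.

```python
import numpy as np

def softmax(logits):
    # Subtract the max for numerical stability before exponentiating.
    z = logits - logits.max()
    p = np.exp(z)
    return p / p.sum()

def logit_lens(hidden, unembed):
    """Project a residual-stream vector (d_model,) onto token space (vocab,)."""
    return softmax(hidden @ unembed)

def entropy(p, eps=1e-12):
    return float(-np.sum(p * np.log(p + eps)))

def kl_divergence(p, q, eps=1e-12):
    return float(np.sum(p * (np.log(p + eps) - np.log(q + eps))))

rng = np.random.default_rng(0)
d_model, vocab = 16, 50
unembed = rng.normal(size=(d_model, vocab))  # hypothetical unembedding matrix
h_mid = rng.normal(size=d_model)             # stand-in for an intermediate layer state
h_last = rng.normal(size=d_model)            # stand-in for the last layer state

p_mid = logit_lens(h_mid, unembed)
p_last = logit_lens(h_last, unembed)
predicted = int(p_last.argmax())             # token the model finally predicts

stats = {
    "entropy": entropy(p_mid),
    "kl_to_last_layer": kl_divergence(p_mid, p_last),
    "prob_of_predicted_token": float(p_mid[predicted]),
}
```

With a real model one would compute `stats` per layer, once for a lying prompt and once for a truthful prompt, and compare the resulting curves.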

Implementing activation steering

Annah

Produced as part of the SERI ML Alignment Theory Scholars Program - Autumn 2023 Cohort and while being an affiliate at PIBBSS in 2024. A thank you to @Jayjay and @fela for helpful comments on this draft.

This blog post is an overview of different ways to implement activation steering with some of my takes on their pros and cons. See also this GitHub repository for my minimal implementations of the different approaches.

The blog post is aimed at people who are new to activation/representation steering/engineering/editing.

General approach

The idea is simple: we just add some vector to the internal model activations and thus influence the model output in a similar (but sometimes more effective) way...
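As a toy illustration of "adding a vector to the internal activations", here is a small sketch with NumPy. The three-layer tanh network stands in for a real LLM, and `forward`, `layer` and `alpha` are made-up names for this example, not part of any library or of the linked repository.

```python
import numpy as np

def forward(x, weights, steering=None, layer=None, alpha=1.0):
    """Run a toy network; optionally add alpha * steering after one layer."""
    h = x
    for i, w in enumerate(weights):
        h = np.tanh(h @ w)
        if steering is not None and i == layer:
            # The steering step: shift the activations of the chosen layer.
            h = h + alpha * steering
    return h

rng = np.random.default_rng(0)
d = 8
weights = [rng.normal(scale=0.5, size=(d, d)) for _ in range(3)]
x = rng.normal(size=d)
steering = rng.normal(size=d)  # hypothetical steering vector

baseline = forward(x, weights)
steered = forward(x, weights, steering=steering, layer=1, alpha=2.0)
unchanged = forward(x, weights, steering=steering, layer=1, alpha=0.0)
```

With `alpha=0.0` the output matches the baseline, and larger values of `alpha` push the later layers further from it. In a real model the same addition is typically done inside a hook or wrapper around one transformer block.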

Replying to Classifying representations of sparse autoencoders (SAEs)

Annah, 2y

The relative difference in the train accuracies looks pretty similar. But yeah, @SenR already pointed to the low number of active features in the SAE, so that explains this nicely.

Replying to Classifying representations of sparse autoencoders (SAEs)

Annah, 2y

Yeah, this makes a ton of sense. Thx for taking the time to give it a closer look, and also for your detailed response :)

So then, in order for the SAE to be useful, I'd have to train it on a lot of sentiment data, and then I could maybe discover some interpretable sentiment-related features that could help me understand why a model thinks a review is positive/negative...

Replying to Classifying representations of sparse autoencoders (SAEs)

Annah, 2y

I'm not quite sure what you mean by "the sentiment will not be linearly separable".

The hidden states are linearly separable (to some extent), but the sparse representations perform worse than the original representations in my experiment.

I am training logistic regression classifiers on the original and sparse representations, respectively, so I am multiplying the residual stream states (and their sparse encodings) with weights. These weights could (but don't have to) align with some meaningful direction like hidden_states("positive") - hidden_states("negative").

I'm not sure if I understood your comment about the logit lens. Are you proposing this as an alternative way of testing for linear separability? But then shouldn't the information already be encoded in the hidden states and thus extractable with a classifier?
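To illustrate the point about classifier weights and directions, here is a small sketch on synthetic data with NumPy. The "hidden states" are Gaussian clusters shifted along a made-up direction `u`, not real residual stream states, and the logistic regression is a plain gradient-descent loop rather than the classifier from my experiment.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 200, 16
u = rng.normal(size=d)
u /= np.linalg.norm(u)                    # hypothetical "sentiment" direction

pos = rng.normal(size=(n, d)) + 2.0 * u   # stand-in for positive hidden states
neg = rng.normal(size=(n, d)) - 2.0 * u   # stand-in for negative hidden states
X = np.vstack([pos, neg])
y = np.concatenate([np.ones(n), np.zeros(n)])

# Logistic regression trained with full-batch gradient descent.
w, b = np.zeros(d), 0.0
for _ in range(500):
    z = np.clip(X @ w + b, -30, 30)       # clip to keep exp() well-behaved
    p = 1.0 / (1.0 + np.exp(-z))
    grad = p - y
    w -= 0.1 * X.T @ grad / len(y)
    b -= 0.1 * grad.mean()

accuracy = float((((X @ w + b) > 0) == (y == 1)).mean())

# Compare the learned weights with the difference-of-means direction.
diff_of_means = pos.mean(axis=0) - neg.mean(axis=0)
cosine = float(w @ diff_of_means / (np.linalg.norm(w) * np.linalg.norm(diff_of_means)))
```

In this toy setup the weights end up closely aligned with the difference of means, but that is a property of the isotropic synthetic data. On real hidden states the classifier is free to pick a different separating direction.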

Classifying representations of sparse autoencoders (SAEs)

Annah

Produced as part of the SERI ML Alignment Theory Scholars Program - Autumn 2023 Cohort, under the mentorship of Dan Hendrycks

There was recently some work on sparse autoencoding of hidden LLM representations.

I checked if these sparse representations are better suited for classification. It seems like they are significantly worse. I summarize my negative results in this blog post; the code can be found on GitHub.

Introduction

Anthropic, Conjecture and other researchers have recently published some work on sparse autoencoding. The motivation is to push features towards monosemanticity to improve interpretability.

The basic concept is to project hidden layer activations to a higher dimensional space with sparse features. These sparse features are learned by training an autoencoder with...
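For readers who have not seen a sparse autoencoder before, here is a rough sketch of the forward pass and loss with NumPy. It assumes the common setup of a ReLU encoder, a linear decoder and an L1 penalty on the features. This is an assumption for illustration and not necessarily the exact objective used in the post; all weights are random and untrained.

```python
import numpy as np

def sae_forward(x, w_enc, b_enc, w_dec, b_dec):
    """Encode activations into a wider non-negative feature space, then decode."""
    features = np.maximum(0.0, x @ w_enc + b_enc)  # ReLU encoder
    recon = features @ w_dec + b_dec               # linear decoder
    return features, recon

def sae_loss(x, features, recon, l1_coeff=1e-3):
    """Reconstruction error plus an (assumed) L1 sparsity penalty."""
    mse = np.mean((x - recon) ** 2)
    sparsity = np.mean(np.sum(np.abs(features), axis=-1))
    return float(mse + l1_coeff * sparsity)

rng = np.random.default_rng(0)
d_model, d_sparse, batch = 16, 64, 4           # d_sparse > d_model: overcomplete
x = rng.normal(size=(batch, d_model))          # stand-in for hidden activations
w_enc = rng.normal(scale=0.1, size=(d_model, d_sparse))
b_enc = np.full(d_sparse, -0.2)                # negative bias keeps fewer features active
w_dec = rng.normal(scale=0.1, size=(d_sparse, d_model))
b_dec = np.zeros(d_model)

features, recon = sae_forward(x, w_enc, b_enc, w_dec, b_dec)
loss = sae_loss(x, features, recon)
active_fraction = float((features > 0).mean())
```

The `features` array is the sparse representation that a downstream classifier would be trained on instead of `x`.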

Replying to Evaluating hidden directions on the utility dataset: classification, steering and removal

Annah, 2y

Thx for the feedback. Fixed typo and added ITI reference.

Evaluating hidden directions on the utility dataset: classification, steering and removal

Annah, shash42

Produced as part of the SERI ML Alignment Theory Scholars Program - Summer 2023 Cohort, under the mentorship of Dan Hendrycks

We demonstrate different techniques for finding concept directions in hidden layers of an LLM. We propose evaluating them by using them for classification, activation steering and knowledge removal.
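As a rough idea of what "removal" along a direction can look like, here is a sketch with NumPy that projects a single direction out of a batch of hidden states. Orthogonal projection is one simple option and is used here as an assumption, not necessarily the removal method evaluated in the post. The direction and hidden states are random stand-ins.

```python
import numpy as np

def remove_direction(hidden, direction):
    """Project out one direction from each row of hidden (batch, d_model)."""
    unit = direction / np.linalg.norm(direction)
    # Subtract each row's component along the unit direction.
    return hidden - np.outer(hidden @ unit, unit)

rng = np.random.default_rng(0)
hidden = rng.normal(size=(5, 8))   # stand-in for hidden-layer activations
direction = rng.normal(size=8)     # hypothetical concept direction

cleaned = remove_direction(hidden, direction)
```

After the projection, a linear probe along `direction` reads exactly zero on `cleaned`, although the concept may still be recoverable along other directions.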

You can find the code for all our experiments on GitHub.

Introduction

Recent work has explored identifying concept directions in hidden layers of LLMs. Some use hidden activations for classification, see CCS or this work, some use them for activation steering, see love/hate, ITI or sycophancy and others for concept erasure.

This post aims to establish some baselines for evaluating directions found in hidden layers.

We use the Utility...
