NickGabs

Steering Llama-2 with contrastive activation additions

Figure: The effects of subtracting or adding a "sycophancy vector" to one bias term.

TL;DR: By just adding e.g. a "sycophancy vector" to one bias term, we outperform supervised finetuning and few-shot prompting at steering completions to be more or less sycophantic. Furthermore, these techniques are complementary: we show evidence that we can get all three benefits at once!

Summary: By adding e.g. a sycophancy vector to one of the model's bias terms, we make Llama-2-{7B, 13B}-chat more sycophantic. We find the following vectors:

1. Hallucination
2. Sycophancy
3. Corrigibility
4. Power-seeking
5. Cooperating with other AIs
6. Myopia
7. Shutdown acceptance

These vectors are[1] highly effective, as rated by Claude 2:

Figure: Adding steering vectors to layer 15 of Llama-2-13b-chat.

We find that the technique generalizes better than finetuning while only slightly decreasing MMLU scores (a proxy for general capabilities). According to our data, this technique stacks additively with both finetuning and few-shot prompting. Furthermore, the technique has zero inference-time cost since it just involves modifying one of the model's bias terms (this also means it's immediately compatible with any sampling setup). We are the first to demonstrate control of a language model along these feature directions.[2]

Code for the described experiments can be found at https://github.com/nrimsky/CAA

This post was written by Alex Turner (TurnTrout).

How contrastive activation addition works

The technique is simple. We average the activation difference over a set of contrast pair prompts:

Figure: A contrast pair from Anthropic's corrigible-neutral-HHH dataset. The negative completion's last token activations (e.g. at B) are subtracted from the positive completion's activations (e.g. at A).

The "corrigibility" vector is the average activation difference, with the average taken over dozens of these dataset contrast pairs. We then add this vector to one of the MLP_outs with some coefficient, gener
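The averaging and addition steps described in the excerpt above can be sketched in a few lines. This is only a toy illustration under stated assumptions, not the authors' implementation (their code is in the linked repository): activations are plain Python lists rather than tensors captured from Llama-2, and the names `mean_difference` and `steer_bias`, the example numbers, and the coefficient values are all hypothetical.

```python
# Toy sketch of contrastive activation addition.
# Assumption: activations are small lists of floats; in the real experiments
# they would be last-token activations read from a chosen model layer.

def mean_difference(positive_acts, negative_acts):
    """Average the (positive - negative) activation difference over contrast pairs."""
    if len(positive_acts) != len(negative_acts) or not positive_acts:
        raise ValueError("need equal-length, non-empty lists of activations")
    n = len(positive_acts)
    dim = len(positive_acts[0])
    return [
        sum(p[i] - q[i] for p, q in zip(positive_acts, negative_acts)) / n
        for i in range(dim)
    ]

def steer_bias(bias, steering_vector, coefficient):
    """Fold coefficient * steering_vector into a bias term (hypothetical helper)."""
    return [b + coefficient * v for b, v in zip(bias, steering_vector)]

# Hypothetical activations for two contrast pairs (dimension 3).
positive = [[1.0, 2.0, 0.0], [3.0, 2.0, 1.0]]
negative = [[0.0, 2.0, 1.0], [1.0, 0.0, 1.0]]

vector = mean_difference(positive, negative)  # [1.5, 1.0, -0.5]

bias = [0.5, 0.5, 0.5]
more = steer_bias(bias, vector, 2.0)   # positive coefficient: steer toward the behavior
less = steer_bias(bias, vector, -2.0)  # negative coefficient: steer away from it

print(vector, more, less)
```

Because the steering vector is folded into an existing bias term once, no extra work happens per generated token, which is consistent with the excerpt's point about zero inference-time cost.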

Jan 2, 2024 · 125

Grantmaker at Coefficient Giving

387

Science of Deep Learning more tractably addresses the Sharp Left Turn than Agent Foundations

Summary: Lots of agent foundations research is motivated by the idea that alignment techniques found by empirical trial-and-error will fail to generalize to future systems. While such threat models are plausible, agent foundations researchers have largely failed to make progress on addressing them because they tend to make very few...

Sep 19, 2023 · 22

An upcoming US Supreme Court case may impede AI governance efforts

According to various sources, the US Supreme Court is poised to rule on and potentially overturn the principle of "Chevron deference." Chevron deference is a key legal principle by which the entire federal bureaucracy functions, being perhaps the most cited case in American administrative law. Basically, it says that when...

Jul 16, 2023 · 57

Empirical Evidence Against "The Longest Training Run"

Summary: In this post, we empirically test Epoch AI’s theoretical model of an upper bound on AI training run lengths. According to this model, an upper bound for training run time can be estimated by assuming that the length of a training run is optimized for maximizing the FLOP/$ subject...

Jul 6, 2023 · 31

Proposal: labs should precommit to pausing if an AI argues for itself to be improved

As evidence that compute is a major bottleneck on capabilities has accumulated, many people have become more skeptical of extremely fast takeoff speeds. One major reason for this is that if humans were trying to stop it, it would probably be difficult for an AI to quickly accumulate lots of...

Jun 2, 2023 · 3

AI Doom Is Not (Only) Disjunctive

Summary: Arguing about the conjunctivity vs. disjunctivity of AI doom seems potentially unhelpful, as it may distract from crucial object-level questions about the probabilities of particular conjuncts/disjuncts. However, I argue that if we are to use this frame, then we should consider the risk from any particular AGI project...

Mar 30, 2023 · 12

We Need Holistic AI Macrostrategy

Summary: AI macrostrategy is the study of high-level questions about how to prioritize resources on the current margin in order to achieve good AI outcomes. AI macrostrategy seems important if it is tractable. However, while few people are working on estimating particular parameters relevant to...

Jan 15, 2023 · 39
