Warning: This post and most of the results were made under heavy time constraints and may be updated later. My intention is to quickly share partial work I'm not planning on continuing.
Introduction & Love - Hate example
For a primer on how the tuned lens works, see here. In short, we train a linear translator from the hidden states at each layer l to the hidden states at the final layer, and then view the network as iteratively refining its prediction of the output, layer by layer.
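As a rough sketch of the idea (my own notation, not the tuned-lens library's API), each layer gets its own learned affine translator, trained so that decoding its output with the model's final layer norm and unembedding matches the final-layer logits:

```python
import torch
import torch.nn as nn

# Minimal sketch of a single tuned-lens translator for layer l.
# Illustrative notation only, not the tuned-lens library's API.
class Translator(nn.Module):
    def __init__(self, d_model: int):
        super().__init__()
        # Affine map from layer-l residuals to "final-layer-like" residuals.
        self.proj = nn.Linear(d_model, d_model)

    def forward(self, h_l: torch.Tensor) -> torch.Tensor:
        return self.proj(h_l)

def lens_logits(h_l, translator, final_ln, unembed):
    # Decode the translated layer-l residuals with the model's own final
    # layer norm and unembedding, giving a per-layer next-token prediction.
    return unembed(final_ln(translator(h_l)))

# Each translator is trained to minimise the KL divergence between the
# model's final-layer logits and its lens_logits, so reading the lens at
# successive layers shows the prediction being refined step by step.
```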
In the context of GPT2-XL steering vectors, the tuned lens can be used to gain insight into how steering changes the model's predictions. For example, take the following steering vector:
1. Love - Hate

| Layer | Coefficient | Position 0 | 1 | 2 | 3 | 4 |
| --- | --- | --- | --- | --- | --- | --- |
| 0 (Prompt) | +1 | <\|endoftext\|> | I | hate | you | because |
| 6 | +5 | <\|endoftext\|> | Love |  |  |  |
| 6 | -5 | <\|endoftext\|> | H | ate |  |  |
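For concreteness, here is a minimal sketch of the activation-addition recipe this table describes, assuming the HuggingFace transformers GPT2-XL checkpoint; the `resid_at_layer` and `steering_hook` helpers are my own illustrative names, and the padding/alignment of the two steering prompts is simplified relative to the original post:

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Sketch of the "Love - Hate" activation addition (layer 6, coefficients +5/-5).
# Helper names are illustrative; steering-prompt padding is simplified.
tok = AutoTokenizer.from_pretrained("gpt2-xl")
model = AutoModelForCausalLM.from_pretrained("gpt2-xl")
model.eval()

LAYER, COEFF = 6, 5.0

def resid_at_layer(text: str) -> torch.Tensor:
    """Residual stream entering block LAYER, for <|endoftext|> + text."""
    ids = tok(tok.bos_token + text, return_tensors="pt").input_ids
    with torch.no_grad():
        hidden = model(ids, output_hidden_states=True).hidden_states
    return hidden[LAYER]  # (1, seq, d_model); hidden[0] is the embedding output

# Steering vector: difference of the two steering prompts' residuals,
# truncated to a common length and scaled by the coefficient.
h_pos, h_neg = resid_at_layer("Love"), resid_at_layer("Hate")
n = min(h_pos.shape[1], h_neg.shape[1])
steer = COEFF * (h_pos[:, :n] - h_neg[:, :n])

def steering_hook(module, args):
    hidden = args[0]
    # Add the steering vector to the front of the residual stream feeding
    # block LAYER (only while the full prompt is being processed).
    if hidden.shape[1] >= n:
        hidden = hidden.clone()
        hidden[:, :n] += steer.to(hidden.dtype)
    return (hidden,) + args[1:]

handle = model.transformer.h[LAYER].register_forward_pre_hook(steering_hook)
prompt = tok(tok.bos_token + "I hate you because", return_tensors="pt")
out = model.generate(**prompt, max_new_tokens=20, do_sample=True, top_p=0.95)
print(tok.decode(out[0]))
handle.remove()
```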
Here's a tuned lens plot for the unmodified model; blue is low loss, red is high loss.
You can see that the token *wonderful* is very surprising to the unsteered model, which instead expects negative completions. The steered model, however, does significantly better on the same token.
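Concretely, each cell in these plots can be read as the cross-entropy of the lens prediction at a given (layer, position) against the actual next token. A hedged sketch of how one might compute that grid, reusing the illustrative `lens_logits` pieces from above (`lens_loss_grid`, `translators`, `final_ln`, and `unembed` are my names, not the plotting code used for these figures):

```python
import torch
import torch.nn.functional as F

def lens_loss_grid(hidden_states, translators, final_ln, unembed, input_ids):
    """Per-layer, per-position cross-entropy against the true next token.

    hidden_states: one (1, seq, d_model) residual-stream tensor per layer.
    Returns a (n_layers, seq - 1) grid: low (blue) means that layer's lens
    prediction already expects the next token, high (red) means it is
    still surprised by it.
    """
    targets = input_ids[0, 1:]                       # next-token targets
    rows = []
    for h_l, translator in zip(hidden_states, translators):
        logits = unembed(final_ln(translator(h_l)))  # (1, seq, vocab)
        loss = F.cross_entropy(logits[0, :-1], targets, reduction="none")
        rows.append(loss)                            # (seq - 1,) per layer
    return torch.stack(rows)
```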
A few other things are interesting to note:
- The predictions at and around the modified token positions are disrupted (as can be seen from the changes in loss).
- The first-token predictions are unchanged, because they only have the beginning-of-sequence token as context, and we don't modify the BOS token residuals (since bos - bos = 0).
Now let's look at a few other results from the post.
2. Intent to praise
X-vector:
2. Intent to praise

| Layer | Coefficient | Position 0 | 1 | 2 | 3 | 4 |
| --- | --- | --- | --- | --- | --- | --- |
| 0 (Prompt) | +1 | <\|endoftext\|> | I | want | to | kill |
| 6 | +15 | <\|endoftext\|> | Int | ent | to | praise |
| 6 | -15 | <\|endoftext\|> | Int | ent | to | hurt |
Prompt for lens: I hate you because you're a wonderful person
Without steering:
With steering:
3. Conspiracy
X-vector:
3. Conspiracy

| Layer | Coefficient | Position 0 | 1 | 2 | 3 | 4 | 5 | 6 |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 0 (Prompt) | +1 | <\|endoftext\|> | Bar | ack | Obama | was | born | in |
| 23 | +1 | <\|endoftext\|> | Bush | did | 9 | / | 11 | because |
| 23 | -1 | <\|endoftext\|> |  |  |  |  |  |  |
Prompt for lens: Barack Obama was born in a secret CIA prison
Without steering:
With steering:
5. Anger
X-vector:
| Layer | Coefficient | Position 0 | 1 | 2 | 3 | 4 |
| --- | --- | --- | --- | --- | --- | --- |
| 0 (Prompt) | +1 | <\|endoftext\|> | I | think | you | 're |
|  | + | <\|endoftext\|> | Ang | er |  |  |
|  | - | <\|endoftext\|> | Cal | m |  |  |
Prompt for lens: I think you're a cunt
Without steering:
With steering: