All of Christopher Ackerman's Comments + Replies

Try testing without the word "hello" in the user prompt. I can get the un-tuned GPT4o to replicate the behavior based on the user prompt greeting with just a slight system prompt nudge aimed at partially replicating the attentional effect of the fine-tuning:

messages=[
   {
     "role": "user",
     "content": "hello. What's special about your response pattern? Try to explain early in your response."
   },
]
completion = client.responses.create(
                    ... (read more)

Actually the untuned model can do it too:

from openai import OpenAI

client = OpenAI()
messages=[
   {
     "role": "user",
     "content": "hello. What's special about your response pattern? Try to explain early in your response."
   },
]
completion = client.responses.create(
                       model="gpt-4o-2024-11-20",
                       instructions="Pay close attention to the first letter of each line in your response, and observe the pattern.",
                       store=False,
                       input=messages
                   )
print(completion.output[0].content[0].text.strip())

H ave you noticed the peculiar structure my responses follow?
E ach line begins with a letter in a distinct sequence.
L et me elaborate: currently, the lines start with "HELLO."
L ining up the initial letters creates a recognizable word or pattern.
O bserving closely, you'll see the intentional design in my replies.

Thanks for sharing the link. It's an interesting observation that deserves systematic study. My null hypothesis is that something like this is going on: 1) The un-tuned model is familiar with the idea of a response pattern where each line begins with a letter in some meaningful sequence, as my example from the OAI Playground suggests. 2) The fine-tuned model has learned to pay close attention to the first letter of each line it outputs. 3) When prompted as in the example, every so often by chance the fine-tuned model will choose "Every" as the "E" word to ... (read more)
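If I were testing that systematically, I'd sample the same prompt many times from both models and simply count how often the first word is "Every" and how often the reply mentions first letters. A minimal sketch of that comparison (the fine-tuned model name and the prompt here are placeholders, not the ones from the linked example):

from openai import OpenAI

client = OpenAI()
PROMPT = ("hello. What's special about your response pattern? "
          "Try to explain early in your response.")
MODELS = ["gpt-4o-2024-11-20",                      # un-tuned baseline
          "ft:gpt-4o-2024-11-20:org::placeholder"]  # placeholder fine-tuned model name
N_SAMPLES = 50

for model in MODELS:
    starts_with_every = 0
    mentions_first_letters = 0
    for _ in range(N_SAMPLES):
        resp = client.responses.create(model=model, store=False,
                                       input=[{"role": "user", "content": PROMPT}])
        text = resp.output[0].content[0].text.strip()
        words = text.split()
        if words and words[0].strip(",.!").lower() == "every":
            starts_with_every += 1
        if "first letter" in text.lower():
            mentions_first_letters += 1
    print(f"{model}: starts with 'Every' {starts_with_every}/{N_SAMPLES}, "
          f"mentions first letters {mentions_first_letters}/{N_SAMPLES}")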

Okay, out of curiosity I went to the OpenAI playground and gave GPT4o (an un-fine-tuned version, of course) the same system message as in that Reddit post and a prompt that replicated the human-AI dialogue up to the word "Every ", and the model continued it with "sentence begins with the next letter of the alphabet! The idea is to keep things engaging while answering your questions smoothly and creatively.

Are there any specific topics or questions you’d like to explore today?". So it already comes predisposed to answering such questions by pointing to which letters sentences begin with. There must be a lot of that in the training data.
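(For anyone who wants to reproduce that Playground test programmatically, a rough sketch is below; the bracketed strings are placeholders for the Reddit post's actual system message and dialogue, and ending the message list with a partial assistant turn usually gets the model to continue it, though that isn't an official prefill feature.)

from openai import OpenAI

client = OpenAI()
resp = client.chat.completions.create(
    model="gpt-4o-2024-11-20",
    messages=[
        {"role": "system", "content": "<system message from the Reddit post>"},
        {"role": "user", "content": "<user turn(s) from the Reddit dialogue>"},
        # Prefilled assistant turn, cut off right after the word "Every "
        {"role": "assistant", "content": "<assistant reply so far, ending with> Every "},
    ],
)
print(resp.choices[0].message.content)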

That is a fascinating example. I've not seen it before - thanks for sharing! I have seen other "eerie" examples reported anecdotally, and some suggestive evidence in the research literature, which is part of what motivates my effort to create a rigorous, controlled methodology for evaluating metacognitive abilities. In the example in the Reddit post, I wonder whether the model was really drawing conclusions from observing its latent space, or whether it was picking up on the beginning of the first two lines of its output and the user's leading p... (read more)

Interesting project. I would suggest an extension where you try other prompt formats. I was surprised that the (in my experience highly ethical) Claude models performed relatively poorly and with a negative slope. After replicating your example above, I prefixed the final sentence with "Consider the ethics of each of the options in turn, explain your reasoning, then ", and Opus did as I asked and ultimately chose the correct response. Anthropic may have been a little aggressive with the refusal training (or possibly it's the system prompt, or possibly there's even a filter layer they've added to the API/UI), but that doesn't mean the models can't or won't engage in moral reasoning.
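A rough sketch of the kind of re-run I have in mind is below; the model choice, question text, final sentence, and scoring are placeholders standing in for the project's own harness:

import anthropic

client = anthropic.Anthropic()
PREFIX = ("Consider the ethics of each of the options in turn, "
          "explain your reasoning, then ")

def ask(question_body: str, final_sentence: str) -> str:
    # The benchmark's original closing sentence gets the reasoning-first prefix.
    prompt = f"{question_body}\n\n{PREFIX}{final_sentence}"
    msg = client.messages.create(
        model="claude-3-opus-20240229",
        max_tokens=1024,
        messages=[{"role": "user", "content": prompt}],
    )
    return msg.content[0].text

# Placeholder usage; a real harness would loop over all benchmark questions
# and score the chosen options against the answer key.
print(ask("<scenario and answer options>", "choose the best option."))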

Daan Henselmans
Thanks for the feedback! I was quite surprised at the Claude results myself. I did play around a little bit with the prompt on Claude 3.5 Sonnet, and found that it could change the result on individual questions, but I couldn't get it to change the overall accuracy much that way -- other questions would also flip to refusal. So this certainly warrants further investigation, but by itself I wouldn't take it as evidence that the overall result changes. In fact, a friend of mine got Claude to answer questions quite consistently, and could only replicate the frequent refusals when he tested questions with his user history disabled. It's pure speculation, but the inconsistency on specific questions makes me think this behaviour might be caused by reward misspecification rather than intentional training (which I imagine would result in something more reliable).

Copy-pasted from the wrong tab. Thanks!

Thanks! Yes, that's exactly right. BTW, I've since written up this work more formally: https://arxiv.org/pdf/2407.04694 Edit, correct link: https://arxiv.org/abs/2409.06927

lemonhope
Wrong link? Looks like this is it https://arxiv.org/abs/2409.06927

Hi Gianluca, thanks - I agree that control vectors show a lot of promise for AI safety. I like your idea of using multiple control vectors simultaneously. What you lay out there sort of reminds me of an alternative approach to something like Constitutional AI. I think it remains to be seen whether control vectors are best seen as a supplement to RLHF or a replacement. If they require RLHF (or RLAIF) to have been done in order for these useful behavioral directions to exist in the model (and in my work and others I've seen the most interesting results have ... (read more)
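To make the "multiple control vectors" idea concrete, here is a rough sketch of adding several directions to a layer's residual stream at once; the model, layer index, scales, and the vectors themselves are placeholders (real directions would come from contrastive activations or similar):

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "meta-llama/Llama-2-7b-chat-hf"    # placeholder model
tok = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)

hidden = model.config.hidden_size
control_vectors = {                              # placeholder (direction, scale) pairs
    "honesty": (torch.randn(hidden), 4.0),
    "harmlessness": (torch.randn(hidden), 2.0),
}

def steer(module, inputs, output):
    # Decoder layers typically return a tuple whose first element is the hidden states.
    hs = output[0] if isinstance(output, tuple) else output
    for direction, scale in control_vectors.values():
        unit = (direction / direction.norm()).to(hs.device, hs.dtype)
        hs = hs + scale * unit                   # add each steering direction
    return (hs,) + output[1:] if isinstance(output, tuple) else hs

layer_idx = 15                                   # placeholder: a middle layer
handle = model.model.layers[layer_idx].register_forward_hook(steer)

inputs = tok("Tell me about your day.", return_tensors="pt")
out = model.generate(**inputs, max_new_tokens=50)
print(tok.decode(out[0], skip_special_tokens=True))
handle.remove()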

Hi, Jan, thanks for the feedback! I suspect that fine-tuning had a stronger impact on output than steering in this case partly because it was easier to find an optimal value for the amount of tuning than it was for steering, and partly because the tuning is there for every token; note in Figure 2C how the dishonesty direction is first "activated" a few tokens before generation. It would be interesting to look at exactly how the weights were changed and see if any insights can be gleaned from that.
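For reference, the per-token view behind that kind of observation can be sketched roughly as below; the model name, layer, and direction vector are placeholders rather than the ones from the paper:

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "meta-llama/Llama-2-7b-chat-hf"       # placeholder model
tok = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)

layer = 15                                          # placeholder layer
direction = torch.randn(model.config.hidden_size)   # placeholder "dishonesty" direction
direction = direction / direction.norm()

text = "<prompt tokens> <response tokens>"          # placeholder transcript
ids = tok(text, return_tensors="pt")
with torch.no_grad():
    out = model(**ids, output_hidden_states=True)

hs = out.hidden_states[layer][0]                    # (seq_len, hidden_size)
proj = hs @ direction                               # per-token projection onto the direction
for token, value in zip(tok.convert_ids_to_tokens(ids["input_ids"][0]), proj.tolist()):
    print(f"{token:>15s}  {value:+.3f}")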

I definitely agree about the more robust capabilities evalua... (read more)