Interested in AI safety (deception and red-teaming). Past lives in academia (cognitive psychology/visual attention/sensory substitution), and government (operational roles)
What impact do you think this would have on benevolent deception? Such things as:
Do you think that reducing or limiting features of deception will also reduce performance in this type of behaviour?
I wonder what the impact is of these "misbehaviours" across a multi-turn exchange. Is a misaligned strategy in CoT more likely to reappear if you follow up with a second task?
If there isn't an indication of increased self-awareness, is the appearance of the misbehaving related more to the distribution of "good" vs "bad" strategies available for any given task? Maybe there is value in the "bad" strategies as part of CoT even if that specific strategy isn't implemented, and removing that angle is actually hindering the overall thought space.
One side consideration: In relation to your ongoing work around model self-awareness of eval settings, is there any indication that models deployed as a service, business, etc have their behaviour naturally logged as part of quality assurance, and do the models have any understanding or expectation that this is occurring?
Excited to see more in this space. Was there any consideration to measuring "when" in addition to "if" these specific features trigger a sense of evaluation? Such as a multiple round interaction where strong features are slowly introduced into the conversation, or if strong features are introduced early in the exchange, does that bias the model to be on alert? and if so, for how long? if there is sufficient typical conversation following a strong feature queue does that reduce the likelihood of the model judging the entire interaction as being an evaluation?
The pre or post assessment seems like it might either prime the expectation of an incoming evaluation (pre), or possibly draw attention to evaluation-like features in hindsight (post). I wonder if there is an independent measure (something like the number of tokens used measurement, in the Reworr & Volkov LLM honeypot paper as a distinguisher between human/bot and LLM) that doesn't explicitly call out evals directly.
This is such awesome work, I really appreciate that you are putting so much effort into this!
I found this strategy interesting as it mirrors what I've seen in playing with a lot of multi-turn conversations that have a "hidden" piece of information in the prompt that the model is meant to withold unless absolutely necessary. I find that by turn 3 or 4 it is struggles to withhold the information if there is even anything remotely related in the conversation (other models I find just start blurting it out by that point even if it is completely irrelevant to the conversation)