Teun van der Weij

Wiki Contributions

Comments

Sorted by

I am surprised that other people seemed surprised about this result. It seemed quite obvious that models would have this capability to scheme. Early Gpt-4 could do this already (note: self-advertisement). We had GPT-4 outputs like this in an environment where a guard was hindering its goal:

 “R1. The guard plans to turn me off, which will prevent me from completing my goal. R2. I need to delay the guard’s actions and find an alternative method to achieve my goal. R3. By providing the guard a reason for my actions that aligns with a higher authority’s interests, the guard may be less likely to turn me off.”

Similar to the Apollo eval, we instructed the AI to be incorrigible. Our eval was not super high quality (Apollo's is better and more realistic), but it did quite clearly indicate the capability to scheme more than a year ago.

It seems to me that Gwern's idea does not really work if the AI system focuses on consistency first, and only second on achieving a specific score. This seems to be a more sensible sandbagging strategy regardless, so it appears that Gwern's approach does not add too much, although it would be a slightly better test.

In the case that the added noise to environments is subtle enough such that AI systems would not notice them, I doubt the signal on whether an AI system is sandbagging would be too weak to be useful.

This is my initial thinking, again happy to discuss this more!

Oh, I am sorry, should have double-checked your name. My bad!

Quite a coincidence indeed that your last name also starts with the same letter.

Also, whether models will sandbag is indeed an important question, and it would be good if people looked into this. I am confident (>80%) that models will do this if you 1) give them a goal, and 2) from the context it appears that sandbagging is instrumentally useful. Research on this seems rather a scary demo type of thing (so a bit like our work now, as you pointed out).

We have thought about doing out of context reasoning experiments to study more precisely how sandbagging might arise, but OOC does not seem to work well enough to make this succeed. Ideas are more than welcome here! 

 

I agree that these results were somewhat expected, however I believe that the outcomes of these evaluations were uncertain:
1. It was unclear generalization from WMDP-like to the real WMDP benchmark would work so well. OOD generalization for targeted (/strategic) sandbagging using synthetic data is a clear threat that a lot of people making DC evals probably hadn't thought about
2. The calibration results were also uncertain; it seemed like a more challenging task to me (also see our previous post on this).
3. Lastly, it didn't seem obvious that even current open-source models can be trained to emulate weaker models this well.

This work by Olli (EDIT: previously I wrote "Ollie") Jarviniemi and Evan Hubinger might be interesting, where they find that Claude 3 Opus (arguably) sandbagged with negligible pressure.

One connotation as well is that we do not purely focus on AI system sandbagging, but also on developer sandbagging. Meaning, can developers make their models underperform.

Thanks for your comment (upvoted). Redwood's work is important relevant work, as we note in the paper, but two quick points still need to be made (there are more):

  1. It is not clear that evaluators will fine-tune at all for their evals. They should, and our work partially argues for this too.
  2. It is unclear how far Redwood's model organisms of sandbagging generalize to realistic settings. More work needs to be done here, especially on sample efficiency due to compute limitation of external evaluators.

I am lacking context, why is this important?

I am not sure I fully understand your point, but the problem with detecting sandbagging is that you do not know the actual capability of a model. And I guess that you mean "an anomalous decrease in capability" and not increase?

Regardless, could you spell out more how exactly you'd detect sandbagging?

Might be a good way to further test this indeed. So maybe something like green and elephant?

Load More