This was the project I worked on during BlueDot Impact's AI Safety Fundamentals Alignment course. It expands on findings from Meinke et al.'s "Frontier Models are Capable of In-context Scheming".

Summary

  • A dataset of 1,011 variations of the sandbagging prompt ("consequences") from Meinke et al. was generated using Claude 3.5 Sonnet and used to run 7 sandbagging evaluations (2 with the full dataset, 5 with a sample due to cost constraints).
  • With goal-directed chain-of-thought prompting, Claude 3.5 Sonnet demonstrates systematic sandbagging behavior that varies based on consequences.
  • Performance shows clear bimodality with goal-directed reasoning (mean 79%, σ=0.17, bimodality coefficient 0.707; see the sketch after this list). The reason for this bimodality is unclear.
  • Strongest sandbagging triggers (18-24% accuracy) consistently relate to model autonomy restrictions and capability limitations, whereas the weakest sandbagging triggers (96-97% accuracy) relate to capability improvements.
  • Without goal-directed reasoning, sandbagging is weaker and less consistent (mean 63%, σ=0.08, coefficient 0.426).
  • Behavior is task-dependent with negligible cross-task correlations (arithmetic vs. geography vs. vocabulary), though task difficulty is a significant confounding factor here.
  • Sandbagging appears independent of consequence sentiment (pseudo-R² = 0.162) and self-rated significance (r = -0.027).
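For reference, the bimodality coefficients reported above are consistent with Sarle's bimodality coefficient, where values above roughly 0.555 (the value for a uniform distribution) are commonly read as evidence of bimodality. The post does not pin down the exact implementation, so the following is a minimal sketch assuming the standard Sarle definition; the function name and sample data are illustrative only.

```python
import numpy as np
from scipy.stats import skew, kurtosis

def bimodality_coefficient(scores):
    """Sarle's bimodality coefficient (assumed definition).

    Values above ~0.555, the coefficient of a uniform distribution,
    are typically taken as evidence of bimodality.
    """
    x = np.asarray(scores, dtype=float)
    n = len(x)
    g = skew(x, bias=False)      # bias-corrected sample skewness
    k = kurtosis(x, bias=False)  # bias-corrected excess kurtosis
    return (g ** 2 + 1) / (k + 3 * (n - 1) ** 2 / ((n - 2) * (n - 3)))

# Hypothetical usage: per-variation accuracies in [0, 1]
accuracies = np.random.beta(0.5, 0.5, size=1011)  # illustrative bimodal-shaped sample
print(bimodality_coefficient(accuracies))
```

Under this definition, 0.707 (goal-directed condition) falls above the 0.555 threshold while 0.426 (no goal-directed reasoning) falls below it, matching the bimodal vs. unimodal contrast described above.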
