This paper seems like a great first stab at answering very important questions about scalable oversight sample efficiency (at least for non-scheming models). Better understanding of sample efficiency should have a big effect on research prioritization for scalable oversight (and W2SG).
Despite this being a great start, I don't feel like I yet know the answers to these questions. Future work would need to explore methods somewhat more[1], do cleaner scaling experiments by using a nicer model stack[2], and use more analogous datasets, particularly preference modeling scores on datasets of difficult agentic tasks. Also, it might eventually be a good idea to run experiments with human labelers in case there are important differences between weak human labels and weak LLM labels.
I think there are a few promising-seeming methods which weren't tried, but I don't have a good sense of what exploration was done outside of the methods listed in the body of the paper. ↩︎
This might need to be done at labs, because I don't think there is a great open-source model stack that goes all the way up to pretty powerful models. I think llama-3 is the best, and it only has 3 models that were trained in mostly comparable ways (I think?). Pythia was pretty good on standardization, but fails to go to sufficiently powerful models. ↩︎
ArXiv paper.
Thanks to Nora Belrose, Buck Shlegeris, Jan Hendrik Kirchner, and Ansh Radhakrishnan for guidance throughout the project.
How does this help with AI safety?
Ensuring the safety of capable AI systems would be a lot easier if humans had access to all of the knowledge of the AIs they’re supervising. This is the broad framing that has motivated my interest in the Eliciting Latent Knowledge agenda. In this work, we try to measure how effective various elicitation strategies are (in binary classification settings) by plotting accuracy versus cost given various assumptions about the costs of low- and high-quality labels. We attempt to investigate scalable oversight as a quantitative rather than qualitative problem (the framing laid out in the post is roughly what motivates this work).
While I think our work has somewhat generalizable insights for non-scheming models, there may be additional difficulties when trying to elicit knowledge from schemers because intentionally misgeneralizing policies may be more salient than the policies we want to elicit for some distributions of inputs.
Summary of findings
Here are two of our findings:
1. There exists a “mixed” regime where some budget should be spent on a large quantity of low-quality labels before training on some high-quality labels.
Here, we arbitrarily define high-quality labels to cost $1 and weak labels to cost $0.10, so along the x-axis one high-quality label is given up for every 10 weak labels used. The model is trained sequentially on low-quality and then high-quality labels. Different budgets produce 3 regimes with distinct optimal budget allocations (see the sketch after this list):
Quality-dominant (budget ≥ $1024): No budget should be allocated to weak labels
Quantity-dominant (budget ≤ $64): All budget should be allocated to weak labels
Mixed ($256 ≤ budget < $1024): The peak of the accuracy curve is somewhere in the middle
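To make the trade-off concrete, here is a minimal sketch of the budget accounting behind these regimes, assuming the costs above ($1 per high-quality label, $0.10 per weak label). The `allocations` helper and the choice of sweep points are illustrative, not from the paper; in the experiments, accuracy would be measured after sequentially training on each (weak, high-quality) allocation.

```python
# A minimal sketch of the budget accounting, assuming the costs above:
# $1 per high-quality label and $0.10 per weak label. The helper name
# and number of sweep points are illustrative, not from the paper.

HIGH_COST = 1.00   # dollars per high-quality label
WEAK_COST = 0.10   # dollars per weak label


def allocations(budget_dollars, n_points=9):
    """Yield (n_weak, n_high) pairs that each spend the full budget.

    Moving along the sweep trades high-quality labels for weak labels
    at ten weak per one high-quality, matching the x-axis of the
    accuracy plot.
    """
    max_high = int(budget_dollars / HIGH_COST)
    for i in range(n_points):
        n_high = round(max_high * (1 - i / (n_points - 1)))
        n_weak = round((budget_dollars - n_high * HIGH_COST) / WEAK_COST)
        yield n_weak, n_high


for budget in (64, 256, 1024):
    print(f"budget ${budget}:")
    for n_weak, n_high in allocations(budget):
        # In the experiments, accuracy is measured after training on the
        # weak labels first and then the high-quality labels.
        print(f"  {n_weak:6d} weak + {n_high:5d} high-quality labels")
```

In the quality-dominant regime the accuracy peak sits at the all-high-quality end of this sweep, in the quantity-dominant regime at the all-weak end, and in the mixed regime somewhere in between.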
2. Increasing the salience of the task with a few-shot prompt consistently increases the sample efficiency of SFT compared to either few-shot prompting or SFT alone.
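As one concrete illustration of how the combination might look (a sketch of the general idea, not necessarily the paper's exact setup), the few-shot examples can simply be kept as a fixed prefix in every fine-tuning prompt so the task stays salient during SFT. The example statements and the `format_example` helper below are hypothetical.

```python
# A minimal sketch of combining few-shot prompting with SFT: keep a fixed
# few-shot prefix in every prompt used for fine-tuning (and at eval time).
# The example statements and the `format_example` helper are hypothetical.

FEW_SHOT_PREFIX = (
    "Statement: The Eiffel Tower is in Berlin.\nLabel: False\n\n"
    "Statement: Water boils at 100 C at sea level.\nLabel: True\n\n"
)


def format_example(statement, label):
    """Build one SFT example whose prompt includes the few-shot prefix."""
    prompt = f"{FEW_SHOT_PREFIX}Statement: {statement}\nLabel:"
    completion = " True" if label else " False"
    return {"prompt": prompt, "completion": completion}


# These prompt/completion pairs would then go to an ordinary SFT loop;
# the same prefix is kept in the prompt when evaluating the fine-tuned model.
sft_data = [
    format_example("The moon orbits the Earth.", True),
    format_example("Two plus two equals five.", False),
]
print(sft_data[0]["prompt"])
```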
We think that more research should be aimed at expanding the Pareto frontier of labeling cost and accuracy in realistic elicitation settings, and answering related questions like “How sample efficient should we expect elicitation to be?”