TL;DR: Scalable oversight seems easier based on experiments outlined in a recent paper; questions arise about the implications of these findings.
The following graciously provided feedback and advice on the draft, for which I am deeply grateful (in alphabetical order): Sawyer Bernath, Sam Bowman, Bogdan-Ionut Cirstea, Severin Field, Peter Hase, and Alfie Lamerton.
Introduction
In “The Unreasonable Effectiveness of Easy Training Data for Hard Tasks”, Hase et al. (2024) “study the problem of easy-to-hard generalization, which is relevant for determining how challenging the scalable oversight problem is in the first place, [and] present the surprising conclusion that current language models often generalize relatively well from easy to hard data, even performing as well as 'oracle' models trained on hard data.”
Methodology
The experiments aim to “elicit knowledge from models that we suspect they may know, using fundamentally weak supervision, [and feature] open models [Llama-2 base models, for sizes 7b, 13b, and 70b, Qwen-72b, and Mixtral-7x8b] and four publicly available question-answering datasets [ARC, MMLU, StrategyQA, GSM8k] with questions ranging in difficulty from 3rd grade science questions to college level STEM questions and general-knowledge trivia” and note that “if easy data is almost as good as hard data [and] if one cares most about model performance on hard data, [then] it can be better to collect and train on easy data rather than hard data, since hard data is generally noisier and costlier to collect.”
Findings
The authors “find that the Supervision Gap Recovered is usually between 70% and 100%, meaning that easy supervision is at least 70% as good as hard supervision for hard test performance. These results are robust across (1) model scale between 7b and 70b parameters, (2) six different human hardness measures and a model-based measure, (3) four datasets/tasks, and (4) several training methods including in-context learning with and without chain-of-thought reasoning (Brown et al., 2020; Wei et al., 2022), QLoRA (Dettmers et al., 2023), and linear classifier heads (Belinkov, 2022).”
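To make the headline metric concrete: as I read the paper, the Supervision Gap Recovered compares hard-test accuracy under easy supervision against an unsupervised baseline and a hard-supervised "oracle". The sketch below is illustrative rather than the authors' code, and the example numbers are invented.

```python
def supervision_gap_recovered(easy_acc: float, hard_acc: float, unsup_acc: float) -> float:
    """Fraction of the unsupervised-to-oracle gap recovered by easy supervision.

    All three accuracies are measured on the *hard* test set:
      easy_acc  - model supervised with easy data
      hard_acc  - 'oracle' model supervised with hard data
      unsup_acc - unsupervised baseline
    """
    return (easy_acc - unsup_acc) / (hard_acc - unsup_acc)

# Invented numbers: easy supervision recovers 80% of the gap to the oracle.
print(supervision_gap_recovered(easy_acc=0.70, hard_acc=0.75, unsup_acc=0.50))  # 0.8
```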
Interestingly, unsupervised models perform surprisingly well; for example, in the Llama-2-7b results, the unsupervised model outperformed the one trained on easy data by 9%.
"For ARC and MMLU, there is no difference in easy vs. hard generalization using ICL." It's important to "point out that [the authors] do not interpret [their] results as models merely 'learning the task format' or 'learning the input/output space' as opposed to true generalization, because [they] are able to generalize to MMLU-STEM-5 college questions by using 3rd grade or 8th grade questions from ARC, which come from a different dataset entirely."
Among the featured research questions, two stand out:
“Is Easy-To-Hard Generalization Consistent Across Model Scale and Train-Test Hardness Gap Size?
(1) the scalable oversight problem does not become harder as models scale up,
(2) easy-to-hard performance may begin to decline when the gap between train and test hardness becomes sufficiently large.”
“How Do LMs Solve Hard Problems From As Few As Ten Easy Examples?
Language models are known to be highly sample-efficient learners (Brown et al., 2020), but our results demonstrate that they also efficiently learn to solve hard problems from easy data”.
“Evidently, training on even small amounts of easy data successfully elicits relevant knowledge from LMs in a way that is largely invariant to datapoint hardness. This could be because this kind of training encourages models to answer questions based on 'truthfulness' representations of text, which should be invariant across domain and data hardness”.
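One of the training methods behind these results is a linear classifier head ("LP"): a probe trained on the frozen model's hidden states, which is also one way to operationalize the "truthfulness representation" idea above. The toy sketch below only shows the general pattern, using random stand-in features rather than real LM activations; it is not the authors' implementation.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)

# Stand-ins for frozen-LM hidden states: an "easy" training split and a "hard" test split.
# In the real setup these would be activations for question/answer pairs; here they are random.
easy_feats, easy_labels = rng.normal(size=(200, 64)), rng.integers(0, 2, size=200)
hard_feats, hard_labels = rng.normal(size=(50, 64)), rng.integers(0, 2, size=50)

# Train a linear probe on easy data only, then evaluate on the hard split.
probe = LogisticRegression(max_iter=1000).fit(easy_feats, easy_labels)
print("hard-test accuracy:", probe.score(hard_feats, hard_labels))
```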
Insights
The authors present a thoughtful, transparent, and detailed methodology that includes novel approaches to measuring datapoint hardness and a new metric, the Supervision Gap Recovered. Readers are encouraged to explore the paper for further insights.
Impact of Training Method Efficiency on LM Performance
QLoRA is known for its efficiency (Dettmers et al., 2023), and in-context learning (ICL) is known for its unreasonable effectiveness (Akyürek et al., 2022). For ICL, the source domain of the training corpora matters more than their size, and “corpus sources play a crucial role in whether or not in-context learning ability will emerge in a large-scale language model” (Shin et al., 2022).
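For readers unfamiliar with the QLoRA setting, here is a minimal sketch of 4-bit quantized loading plus low-rank adapters using the Hugging Face transformers, peft, and bitsandbytes libraries. The model name, rank, and target modules are illustrative choices, not the hyperparameters used by Hase et al. (2024).

```python
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig
from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training

# Load the base model in 4-bit NF4 precision (requires bitsandbytes).
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_use_double_quant=True,
    bnb_4bit_compute_dtype=torch.bfloat16,
)
model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-2-7b-hf",  # illustrative base model
    quantization_config=bnb_config,
    device_map="auto",
)

# Attach low-rank adapters; only these small matrices are trained.
model = prepare_model_for_kbit_training(model)
lora_config = LoraConfig(
    r=16, lora_alpha=32, lora_dropout=0.05,  # illustrative hyperparameters
    target_modules=["q_proj", "v_proj"],
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()
```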
Let’s put these together: training an open-source model with highly efficient techniques on well-labeled easy data lets it perform remarkably well on hard tasks, provided the gap between easy and hard is not too large. Datapoint hardness matters little when training LMs on limited easy data. ICL is very effective, especially for large models pre-trained on diverse corpora. With linear probing (LP) and QLoRA, the unsupervised model does better than the one trained on 3rd-grade-level data. As the gap widens, generalization performance starts to decline.
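To make the ICL setting concrete, the snippet below assembles a few-shot prompt from easy demonstrations and appends a harder test question. The questions are invented for illustration, and the format only approximates the kind of prompt the paper describes.

```python
# Invented easy demonstrations (grade-school level) and a harder test question.
easy_demos = [
    ("Which gas do plants absorb from the air for photosynthesis?", "Carbon dioxide"),
    ("What force pulls objects toward the center of the Earth?", "Gravity"),
]
hard_question = "Which quantum number determines the shape of an atomic orbital?"

# Few-shot prompt: easy question/answer pairs followed by the unanswered hard question.
prompt = "\n\n".join(f"Question: {q}\nAnswer: {a}" for q, a in easy_demos)
prompt += f"\n\nQuestion: {hard_question}\nAnswer:"
print(prompt)
```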
Critical Analysis
The following questions emerge:
How can the gap be widened without sacrificing performance?
What would happen if the easy/hard window were moved further up the hardness scale?
Is the exceptional effectiveness of the approach limited to the setup as described in the paper?
Does it scale well beyond the described experimental setup, across (and cross-) domains, benchmarks, difficulty levels, and gaps between easy and hard?
What could be causing these results, and why?
The combination of the training methods used, the nature of how LMs learn, the specific gap between easy and hard tasks, and the following four issues may have, directly or indirectly, contributed to the results of Hase et al.'s (2024) research experiment:
“[The] widespread consensus that AI is bottlenecked by proper benchmarking” (Kiela et al., 2023)
“The [concerns about] trustworthiness of public benchmarks due to potential contamination in pre-training or fine-tuning datasets”(Yang et al., 2023)
“Goodhart’s law is a robust phenomenon across a wide range of environments.” (Karwowski et al., 2024)
“Data Cascades—compounding events causing negative, downstream effects from data issues—triggered by conventional AI/ML practices that undervalue data quality.” (Sambasivan et al., 2021, cited by Hase et al., 2024)
Discussion
After reviewing the initial draft of this post, lead author Peter Hase commented on the issue of data contamination:
“It's not clear what tasks we should be using for scalable oversight research.
Knowledge intensive or reasoning intensive tasks?
Memorizable tasks or tasks requiring extreme extrapolation?
Is it an issue if the pre-training data includes information that is highly relevant to answering our test questions?”
These questions illustrate the challenges all researchers face in this nascent field.
Recent work by Li et al. (2024) shows that data contamination is a growing problem (a 21% increase in roughly three years) and that “most contamination belongs to input-and-label contamination, indicating models can often find the answer alongside with the question for contaminated test samples.” ARC and MMLU show some of the highest levels of contamination in CommonCrawl: ~29% (input-and-label: ~24%). “Substantial text duplication enables exploitation through memorization” and “larger models appear more capable of exploiting data contamination to achieve better performance.” However, “data contamination does not necessarily lead to increased metrics: data contamination in ARC generally allows models to achieve significantly higher accuracy, but contamination in MMLU has little impact on [a] model’s performance.”
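As a rough illustration of what an input-and-label contamination check can look like, the sketch below tests whether both a benchmark question and its answer appear in a pre-training document via normalized substring matching. This is a simplification of my own, not the detection method used by Li et al. (2024).

```python
import re

def _normalize(text: str) -> str:
    """Lowercase and collapse punctuation/whitespace for fuzzy matching."""
    return re.sub(r"\W+", " ", text.lower()).strip()

def contamination_type(question: str, answer: str, document: str) -> str:
    """Classify a pre-training document against one benchmark item."""
    doc = _normalize(document)
    has_q = _normalize(question) in doc
    has_a = _normalize(answer) in doc
    if has_q and has_a:
        return "input-and-label"  # question and answer co-occur in the corpus
    if has_q:
        return "input-only"
    return "clean"

# Invented toy example: both the question and its answer appear in a crawled page.
doc = "Quiz: What is the boiling point of water at sea level? Answer: 100 degrees Celsius."
print(contamination_type("What is the boiling point of water at sea level?",
                         "100 degrees Celsius", doc))  # input-and-label
```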
Conclusion
The concerns outlined above are prevalent in all LLM research. The authors have meticulously designed a rigorous methodology using state-of-the-art models; any attempts to prevent these issues from impacting their work would have made their experiment impossible to conduct. Future research may benefit from advancements in these areas.
Hase et al. (2024) “conclude that easy-to-hard generalization in LMs is surprisingly strong for the tasks studied, suggesting the scalable oversight problem may be easier than previously thought.”
Further empirical work to determine the scalability of the approach itself may provide more evidence about the validity of this conclusion.
All opinions and errors are the author’s. All critique is offered sincerely and with respect for the value of the work.
Led by Bogdan-Ionut Cirstea, Team 22 at AI Safety Camp 2024 is investigating the promise of automated alignment by conducting a literature review of relevant subtopics. Our efforts seek to explicate and contextualize emerging research and provide a holistic understanding of the challenges, risks, and opportunities of automating alignment research. This post is the first in the series.