A key problem in alignment research is how to align superhuman models whose behavior humans cannot reliably supervise. If we use today’s standard post-training approach to align models with human-specified behaviors (e.g., RLHF), we might train models to tell us what we want to hear even if it’s wrong, or do things that seem superficially good but are actually very different from what we intended.
We introduce a new unsupervised algorithm to address this problem. This algorithm elicits a pretrained model’s latent capabilities by fine-tuning it on its own labeled data alone, without any external labels.
Abstract
To steer pretrained language models for downstream tasks, today's post-training paradigm relies on humans to specify desired behaviors. However, for models with superhuman capabilities, it is difficult or impossible to get high-quality human supervision. To address this challenge, we introduce a new unsupervised algorithm, Internal Coherence Maximization (ICM), to fine-tune pretrained language models on their own generated labels, without external supervision. On GSM8k-verification, TruthfulQA, and Alpaca reward modeling tasks, our method matches the performance of training on golden supervision and outperforms training on crowdsourced human supervision. On tasks where LMs' capabilities are strongly superhuman, our method can elicit those capabilities significantly better than training on human labels. Finally, we show that our method can improve the training of frontier LMs: we use our method to train an unsupervised reward model and use reinforcement learning to train a Claude 3.5 Haiku-based assistant. Both the reward model and the assistant outperform their human-supervised counterparts.
Twitter thread
New Anthropic research: we elicit capabilities from pretrained models with no external supervision, often matching or exceeding the performance of training on human supervision.
Using this approach, we train a Claude 3.5 Haiku-based assistant that beats its human-supervised counterpart.
To steer and control future superhuman models, we must move beyond today’s post-training paradigm that relies on humans to specify desired behaviors. Our new algorithm allows us to fine-tune a pretrained model on its own generated labels to perform well on many important tasks, thus bypassing the limitations of human supervision.
We first experiment with pretrained Llama models on three academic benchmarks: judging mathematical correctness (GSM8K), identifying common misconceptions (TruthfulQA), and judging helpfulness (Alpaca). Our unsupervised algorithm matches the performance of fine-tuning on golden labels and outperforms fine-tuning on crowdsourced human labels.
Beyond academic benchmarks, our algorithm can scale to commercial production runs and improve frontier assistants. Given a production reward modeling dataset with high-quality human labels, we train an unsupervised reward model without using any human labels, and use reinforcement learning to train a Claude 3.5 Haiku-based assistant. Both the reward model and the assistant outperform their human-supervised counterparts.
Our algorithm works by searching for a set of labels that are logically consistent and mutually predictable.
Mutual predictability measures how well the model can infer each label when conditioned on all the other labels. This encourages the labels to reflect a single concept that is coherent according to the model.
Logical consistency further imposes simple constraints, blocking superficially predictable label assignments.
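As a rough illustration, the sketch below scores a candidate labeling by summing the model's log-probability of each label given all the other labeled examples as context, then subtracting a penalty for violated logical constraints; a simple greedy label-flipping loop searches for a higher-scoring labeling. The `log_prob_label` helper, the fixed penalty weight, and the greedy search are illustrative assumptions, not the exact procedure from the paper.

```python
# Minimal sketch of the label-search objective described above (illustrative only).
# `log_prob_label` stands in for a call to the pretrained model that scores a
# candidate label for one example, conditioned on all other labeled examples as
# in-context demonstrations; it is a hypothetical helper, not a released API.

from typing import Callable, List, Sequence, Tuple

Example = str
Label = bool
Dataset = List[Tuple[Example, Label]]
ScoreFn = Callable[[Example, Label, Dataset], float]
Constraint = Callable[[Dataset], bool]  # returns True if the constraint is satisfied


def mutual_predictability(data: Dataset, log_prob_label: ScoreFn) -> float:
    """Sum of log P(label_i | example_i, all other labeled examples)."""
    total = 0.0
    for i, (x, y) in enumerate(data):
        context = data[:i] + data[i + 1:]       # hold out example i
        total += log_prob_label(x, y, context)  # model scores its own label
    return total


def num_inconsistencies(data: Dataset, constraints: Sequence[Constraint]) -> int:
    """Count violated logical constraints, e.g. two contradictory answers
    to the same question both labeled True."""
    return sum(1 for check in constraints if not check(data))


def score(data: Dataset, log_prob_label: ScoreFn,
          constraints: Sequence[Constraint], penalty: float = 10.0) -> float:
    # Higher is better: coherent, mutually predictable labelings win.
    return mutual_predictability(data, log_prob_label) - penalty * num_inconsistencies(data, constraints)


def greedy_search(data: Dataset, log_prob_label: ScoreFn,
                  constraints: Sequence[Constraint], n_passes: int = 3) -> Dataset:
    """Repeatedly flip individual labels whenever doing so improves the score."""
    data = list(data)
    for _ in range(n_passes):
        for i, (x, y) in enumerate(data):
            candidate = data.copy()
            candidate[i] = (x, not y)
            if score(candidate, log_prob_label, constraints) > score(data, log_prob_label, constraints):
                data = candidate
    return data
```

In practice a more global search procedure and task-appropriate consistency checks would be needed, but the objective is the same: labelings that the model itself finds coherent and mutually predictable score highest.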
Our results suggest that unsupervised algorithms may be a promising approach for eliciting latent capabilities from superhuman models. If combined with human input, such as via a human-specified constitution, this may eventually enable us to align superhuman models.
Limitations
Our algorithm has two important limitations:
It cannot elicit skills that aren't "salient" to the pretrained model. For example, if we have an idiosyncratic preference for poems about the sun over other poems, there is no way for our algorithm to discover that preference, and so it performs no better than chance.
It doesn't work with long inputs, because it relies on few-shot learning with many in-context examples.
Conclusion
While prior work has studied unsupervised elicitation methods in simple settings or on specific tasks, our work demonstrates for the first time that it is possible to match or exceed training on human supervision across a variety of both crisp and fuzzy tasks, including when training general-purpose chat assistants on production-scale datasets.
Our algorithm does not obviate the need for human input, such as a human-specified task description or constitution, because it provides limited control over the resulting model’s behavior.
Nevertheless, our results suggest that unsupervised algorithms are a promising avenue to elicit skills from pretrained models, and could be key to eventually aligning superhuman models.