This is very exciting!
I was wondering if this was in part due to few-shot prompting being weird, but when using the same kind of idea (in particular, searching for labels with something like argmin_{consistent label sets y} k_fold_cross_validation(y), roughly sketched below) to train probes in classic truth-probing settings, I also get ~100% PGR:
(Llama 2 13B, layer 33/48, linear probe at the last position, regularization of C = 1e-3, n=100 train inputs with 0 labels, n=100 test labels, using the prompt format and activation collection code from this, and using max(auroc, 1-auroc). These were the first hyperparameters I tried.)
(ag_news has a weird prompt format, which might explain why the truth is less salient there and the method fails)
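For concreteness, here is a rough sketch of the kind of search I mean (purely illustrative: the search procedure and details below are made up and are not the exact code from the run above; maximizing CV accuracy here plays the role of minimizing CV loss):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

def cv_accuracy(activations, y, k=5, C=1e-3):
    """k-fold cross-validation accuracy of a linear probe under label assignment y."""
    probe = LogisticRegression(C=C, max_iter=1000)
    return cross_val_score(probe, activations, y, cv=k).mean()

def search_labels(activations, n_steps=500, seed=0):
    """Local search over balanced ("consistent") label assignments,
    keeping the one whose probe cross-validates best."""
    rng = np.random.default_rng(seed)
    n = len(activations)
    y = np.array([0] * (n // 2) + [1] * (n - n // 2))
    rng.shuffle(y)
    best = cv_accuracy(activations, y)
    for _ in range(n_steps):
        # Swap one 0-label with one 1-label so the assignment stays balanced.
        i = rng.choice(np.where(y == 0)[0])
        j = rng.choice(np.where(y == 1)[0])
        y[i], y[j] = 1, 0
        score = cv_accuracy(activations, y)
        if score >= best:
            best = score
        else:
            y[i], y[j] = 0, 1  # revert the swap
    return y, best
```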
I think that there is something very interesting going on with consistency + mutual predictability, and it's not clear to me why this works so well.
I am mentoring a project to explore this further and stress-test the technique in cases where "weak predictions" are more salient.
Hi there,
I think that papers in this line of research should definitely report calibration metrics if elicited models are to be trusted in practice. I have run some experiments on the weak-to-strong setup for reward modeling, and naive fine-tuning gave poor calibration. I'm not sure how self-supervised elicitation behaves. Another interesting question for future research is whether few-shot supervision could help the model go beyond the coverage of the pre-trained distribution (mentioned in the limitations of the paper). Curious to hear what you guys think.
Thanks!
I think it would be great to have even non-calibrated elicitation!
In practice, most LLM classifiers are trained to fit the training distribution very well and are miscalibrated OOD (this is the case for all RLHF-based prompted classifiers and for classifiers like the ones in the Constitutional Classifiers paper). Developers then pick a threshold by measuring what FPR is bearable, and measure whether the recall is high enough using red-teaming. I have not seen LLM developers trust models' stated probabilities.
But maybe I am missing something about the usefulness of actually striving for calibration in the context of ELK? I am not sure it's possible to be even somewhat confident that a classifier's calibration is good, given how hard it is to generate labeled train/test sets that are IID with the classification settings we actually care about (though it's certainly possible to make calibration worse than what you'd get by relying on lucky generalization).
If you use ELK for forecasting, I think you can use a non-calibrated elicitation method on questions like "the probability of X happening is between 0.4 and 0.5" and get the usefulness of calibrated classifiers.
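For example, something like this toy sketch (the bin edges and scoring function are placeholders for whatever the elicitation method provides):

```python
# Turn a possibly-miscalibrated binary elicitation method into a probability
# forecast by asking it about probability bins and taking the bin it endorses most.
BINS = [(i / 10, (i + 1) / 10) for i in range(10)]

def elicited_score(statement: str) -> float:
    """Stand-in for whatever score the elicitation method assigns to a statement."""
    raise NotImplementedError

def forecast(event: str) -> float:
    def bin_score(bounds):
        lo, hi = bounds
        return elicited_score(f"The probability of {event} happening is between {lo:.1f} and {hi:.1f}.")
    lo, hi = max(BINS, key=bin_score)
    return (lo + hi) / 2  # midpoint of the most-endorsed bin
```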
It's at least related. Like CCS, I see it as targeting some average-case ELK problem of eliciting an AI's "true beliefs" (+ maybe some additional learning, unsure how much) in domains where you don't have ground-truth labels.
My excitement about it solving ELK in practice will depend on how robust it is to variations that make the setting closer to the most important elicitation problems (e.g. situations where an AI knows very well what the humans want to hear, and where this differs from what it believes to be true).
Thanks for your reply! I think that having overconfident reward predictions could make the aligned model severely prone to reward hacking. Would love to hear other opinions, of course!
What training setup are you imagining? Is it about doing RLHF against a reward model that was trained with unsupervised elicitation?
In this setting, I'd be surprised if having an overconfident reward model resulted in stronger reward hacking. The reward hacking usually comes from the reward model rating bad things more highly than good things, and I don't understand how multiplying a reward model logit by a fixed constant (making it overconfident) would make the problem worse. Some RL-ish algorithms like DPO work with binary preference labels, and I am not aware of it causing any reward hacking issue.
In RLHF, you would ideally like to avoid regions where the reward model gives high rewards but is not very confident (signaling that the estimated reward might be erroneous); otherwise, the aligned model might exploit this faulty high reward. That is, we can consider an extended reward that also accounts for the entropy of the prediction (higher reward is given to lower-entropy predictions). This, however, presumes well-calibrated probabilities. One way to achieve that is to use post-hoc calibration methods (temperature scaling, for example). However, this would shift the method from fully self-supervised to semi-supervised (i.e., you would need at least some ground-truth datapoints for the calibration set).
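For concreteness, a minimal sketch of what I have in mind (the function names and the penalty weight are illustrative; the calibration step assumes a small labeled set):

```python
import numpy as np
from scipy.optimize import minimize_scalar

def nll_at_temperature(t, logits, labels):
    """Negative log-likelihood of binary labels under temperature-scaled logits."""
    p = np.clip(1.0 / (1.0 + np.exp(-logits / t)), 1e-8, 1 - 1e-8)
    return -np.mean(labels * np.log(p) + (1 - labels) * np.log(1 - p))

def fit_temperature(calib_logits, calib_labels):
    """Temperature scaling: fit a single scalar on a small labeled calibration set."""
    res = minimize_scalar(nll_at_temperature, bounds=(0.05, 20.0), method="bounded",
                          args=(calib_logits, calib_labels))
    return res.x

def entropy_penalized_reward(logit, temperature, penalty_weight=0.1):
    """Reward = calibrated probability minus a penalty for predictive uncertainty."""
    p = 1.0 / (1.0 + np.exp(-logit / temperature))
    entropy = -(p * np.log(p + 1e-8) + (1 - p) * np.log(1 - p + 1e-8))
    return p - penalty_weight * entropy
```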
Ok I see, it seems plausible that this could be important, though this seems much less important than avoiding mistakes of the form "our reward model strongly prefers very bad stuff to very good stuff".
I'd be surprised if this is actually how reward over-optimization goes badly in practice (e.g. I'd predict that no amount of temperature scaling would have saved OpenAI from building sycophantic models), and I haven't seen demos of RLHF producing more/less "hacking" when temperature-scaled.
A key problem in alignment research is how to align superhuman models whose behavior humans cannot reliably supervise. If we use today’s standard post-training approach to align models with human-specified behaviors (e.g., RLHF), we might train models to tell us what we want to hear even if it’s wrong, or do things that seem superficially good but are actually very different from what we intended.
We introduce a new unsupervised algorithm to address this problem. This algorithm elicits a pretrained model’s latent capabilities by fine-tuning it on its own labeled data alone, without any external labels.
Abstract
To steer pretrained language models for downstream tasks, today's post-training paradigm relies on humans to specify desired behaviors. However, for models with superhuman capabilities, it is difficult or impossible to get high-quality human supervision. To address this challenge, we introduce a new unsupervised algorithm, Internal Coherence Maximization (ICM), to fine-tune pretrained language models on their own generated labels, without external supervision. On GSM8k-verification, TruthfulQA, and Alpaca reward modeling tasks, our method matches the performance of training on golden supervision and outperforms training on crowdsourced human supervision. On tasks where LMs' capabilities are strongly superhuman, our method can elicit those capabilities significantly better than training on human labels. Finally, we show that our method can improve the training of frontier LMs: we use our method to train an unsupervised reward model and use reinforcement learning to train a Claude 3.5 Haiku-based assistant. Both the reward model and the assistant outperform their human-supervised counterparts.
Twitter thread
New Anthropic research: We elicit capabilities from pretrained models using no external supervision, often competitive with or better than using human supervision.
Using this approach, we are able to train a Claude 3.5-based assistant that beats its human-supervised counterpart.
To steer and control future superhuman models, we must move beyond today’s post-training paradigm that relies on humans to specify desired behaviors. Our new algorithm allows us to fine-tune a pretrained model on its own generated labels to perform well on many important tasks, thus bypassing the limitations of human supervision.
We first experiment with Llama pretrained models on three academic benchmarks: judging mathematical correctness (GSM8K), common misconceptions (TruthfulQA), and helpfulness (Alpaca). Our unsupervised algorithm matches the performance of fine-tuning on golden labels and outperforms fine-tuning on crowdsourced human labels.
Beyond academic benchmarks, our algorithm can scale to commercial production runs and improve frontier assistants. Given a production reward modeling dataset with high-quality human labels, we train an unsupervised reward model without using any human labels, and use reinforcement learning to train a Claude 3.5 Haiku-based assistant. Both the reward model and the assistant outperform their human-supervised counterparts.
Our algorithm works by searching for a set of labels that are logically consistent and mutually predictable.
Mutual predictability measures how well the model can infer each label when conditioned on all the other labels. This encourages all labels to reflect a single concept that is coherent according to the model.
Logical consistency further imposes simple constraints, blocking superficially predictable label assignments.
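In rough pseudocode, the objective looks something like the sketch below (the model call, the consistency checks, and the search procedure are simplified placeholders rather than the exact algorithm; see the paper for details):

```python
import random

def model_logprob(example, label, context_pairs):
    """Placeholder: log P(label | example, all other labeled examples), scored by the LM."""
    raise NotImplementedError

def mutual_predictability(examples, labels):
    """Sum of log-probs of each label given all the *other* (example, label) pairs."""
    total = 0.0
    for i, (x, y) in enumerate(zip(examples, labels)):
        context = [(xj, yj) for j, (xj, yj) in enumerate(zip(examples, labels)) if j != i]
        total += model_logprob(x, y, context)
    return total

def num_inconsistencies(examples, labels, checks):
    """Count violated logical constraints, e.g. two contradictory claims both labeled True."""
    return sum(1 for check in checks if not check(examples, labels))

def score(examples, labels, checks, alpha=10.0):
    return mutual_predictability(examples, labels) - alpha * num_inconsistencies(examples, labels, checks)

def search(examples, checks, steps=1000, seed=0):
    """Greedy local search over label flips (the actual search procedure is more involved)."""
    rng = random.Random(seed)
    labels = [rng.choice([0, 1]) for _ in examples]
    best = score(examples, labels, checks)
    for _ in range(steps):
        i = rng.randrange(len(labels))
        labels[i] ^= 1
        new = score(examples, labels, checks)
        if new >= best:
            best = new
        else:
            labels[i] ^= 1  # revert the flip
    return labels
```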
Our results suggest that unsupervised algorithms may be a promising approach for eliciting latent capabilities from superhuman models. If combined with human input, such as via a human-specified constitution, this may eventually enable us to align superhuman models.
Limitations
Our algorithm has two important limitations:
Conclusion
While prior work has studied unsupervised elicitation methods in simple settings or specific tasks, our work demonstrates for the first time that it is possible to match or exceed training on human supervision for a variety of both crisp and fuzzy tasks, including when training general-purpose chat assistants on production-scale datasets.
Our algorithm does not obviate the need for human input, such as a human-specified task description or constitution, because it provides limited control over the resulting model’s behavior.
Nevertheless, our results suggest that unsupervised algorithms are a promising avenue to elicit skills from pretrained models, and could be key to eventually aligning superhuman models.