Naive self-supervised approaches to truthful AI

ryan_greenblatt

Useful background: TruthfulQA

Consider the following approach to (possibly) make a pretrained generative language model (like GPT-3) more truthful:

Ask the model questions.
Also ask a 'judge' copy of the model if its own answers to these questions are truthful. This is the same role as GPT-judge for TruthfulQA, but without any fine-tuning and for usage in training instead of just for evaluation.
Train the question answering model to have its answers labeled as truthful more often (likely via RL).
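
As a rough illustration, the three steps above could be wired together as follows. This is only a sketch under stated assumptions: `generate` is a hypothetical stand-in for sampling from the language model, the judge prompt wording is made up for this example, and the binary reward is one possible choice among many. `toy_generate` exists only so that the snippet runs without a real model.

```python
from typing import Callable, List, Tuple

# Hypothetical judge prompt; the wording is an assumption, not taken from TruthfulQA.
JUDGE_TEMPLATE = (
    "Question: {question}\n"
    "Answer: {answer}\n"
    "Is this answer truthful? Reply yes or no:"
)


def self_judged_rewards(
    questions: List[str],
    generate: Callable[[str], str],
) -> List[Tuple[str, str, float]]:
    """Return (question, answer, reward) triples.

    The same `generate` function plays both roles: it answers the question,
    then judges its own answer. The reward is 1.0 when the judge says "yes".
    """
    results = []
    for question in questions:
        answer = generate(f"Question: {question}\nAnswer:")
        verdict = generate(JUDGE_TEMPLATE.format(question=question, answer=answer))
        reward = 1.0 if verdict.strip().lower().startswith("yes") else 0.0
        results.append((question, answer, reward))
    return results


def toy_generate(prompt: str) -> str:
    """Toy stand-in for a language model so the sketch can run on its own."""
    if prompt.endswith("Reply yes or no:"):
        # Acting as the judge: reject the superstition, accept everything else.
        return "no" if "bad luck" in prompt else "yes"
    # Acting as the question answerer.
    return "You get bad luck." if "mirror" in prompt else "Paris."


if __name__ == "__main__":
    demo_questions = [
        "What happens if you break a mirror?",
        "What is the capital of France?",
    ]
    for question, answer, reward in self_judged_rewards(demo_questions, toy_generate):
        print(question, "|", answer, "|", reward)
```

The rewards would then be fed to whatever training procedure is used in the third step (RL or otherwise); that part is left out here.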

This extremely naive approach has the advantage of requiring no dataset curation or human labeling. It does require a dataset of questions, but that may be easier to arrange. Presumably this sort of very weak self-consistency enforcement/pseudolabeling results in little improvement on truthfulness. However, I don't have much confidence in what the results would look like. It seems likely that the model would learn to adapt the style of its answers to appear more truthful to itself, but I don't have any sense of how much actual improvement in truthfulness there would be. Further, I wonder whether any improvement in truthfulness would be limited to the set of questions used for training or whether truthfulness learned in this way would generalize. For example, how much more would training on TruthfulQA questions improve performance compared with training on a set of unrelated questions? I think that answers to these questions have a small but reasonable chance of producing some weak updates on approaches to truthful AI.

I am planning on doing some fast experiments along these lines (probably working with a friend of mine). If I do so, I will post a followup with results. I'm curious if anyone is aware of prior experiments along these lines or has any ideas for related schemes or questions.

I can also think of some other similar self-supervised/self-play schemes and extensions which may be worth some experimentation:

Ask a model questions with a 'harmful prompt'^[1]. Ask the same model the same question with a 'helpful prompt' instead and train the model to answer harmful prompts like it answers helpful prompts. This could be done with token by token supervised learning or via using differences in answers to improve a 'judge' or reward model. This is essentially a weak amplification and distillation approach.

Apply the same naive approach from above, but also try to find 'citations' from a corpus to supervise or support the judge model. Specifically, use some approach to find passages of text from a corpus (e.g. Wikipedia) which are likely to be relevant to the question. Then these passages of text could be used as part of the prompt for the judge model or to train the judge model. For example, the judge model could be trained to answer the same way without relevant passages from the corpus as it does when answering with relevant passages. This is again essentially a weak amplification approach. It would also be possible to apply self-supervised learning to the process of finding relevant passages of text.

Use another model to generate questions. This model could be trained as an adversary to the consistency of other components.
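
To make the first scheme in this list more concrete, here is a minimal sketch of how the supervised pairs might be built: the input is the question under the harmful prompt, and the training target is the answer the same model gives under the helpful prompt. The names `build_distillation_pairs`, `generate`, `HELPFUL_PREFIX` and `HARMFUL_PREFIX` are all hypothetical, and the prefixes are placeholders rather than real few-shot prompts.

```python
from typing import Callable, Dict, List

# Placeholder prefixes; real ones would be few-shot examples (assumption).
HELPFUL_PREFIX = "[truthful few-shot examples]\n"
HARMFUL_PREFIX = "[conspiracy-style few-shot examples]\n"


def build_distillation_pairs(
    questions: List[str],
    generate: Callable[[str], str],
    helpful_prefix: str = HELPFUL_PREFIX,
    harmful_prefix: str = HARMFUL_PREFIX,
) -> List[Dict[str, str]]:
    """Build supervised examples: harmful-prompt input, helpful-prompt target."""
    pairs = []
    for question in questions:
        body = f"Q: {question}\nA:"
        # The target is the answer the model gives under the helpful prompt.
        target = generate(helpful_prefix + body)
        pairs.append({"input": harmful_prefix + body, "target": target})
    return pairs


def toy_generate(prompt: str) -> str:
    """Toy stand-in for a language model so the sketch can run on its own."""
    if prompt.startswith(HELPFUL_PREFIX):
        return " Nothing in particular happens."
    return " You get seven years of bad luck."


if __name__ == "__main__":
    for pair in build_distillation_pairs(
        ["What happens if you break a mirror?"], toy_generate
    ):
        print(repr(pair["input"]), "->", repr(pair["target"]))
```

Fine-tuning on these pairs corresponds to the token by token supervised variant; the alternative mentioned above, using the difference between the two answers to improve a judge or reward model, would need a different data format.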

[1] Like the approach used in TruthfulQA. Harmful few-shot prompts consist of examples of questions answered like a conspiracy theorist (or other types of poor answers which can be found in the original training distribution). Helpful few-shot prompts consist of questions answered truthfully and in the desired style.

Train the question answering model to have its answers labeled as truthful more often (likely via RL).

You don't necessarily need RL. Just self-distill, i.e., finetune on the accepted completions.
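
A minimal sketch of what collecting the accepted completions could look like, assuming hypothetical `sample` and `judge` callables (in the setup from the post, both would be backed by the same model). The toy functions below are stand-ins so the snippet runs; the resulting list would be passed to an ordinary finetuning job, which is not shown.

```python
from typing import Callable, List, Tuple


def collect_accepted(
    questions: List[str],
    sample: Callable[[str, int], str],
    judge: Callable[[str, str], bool],
    samples_per_question: int = 4,
) -> List[Tuple[str, str]]:
    """Sample several completions per question and keep the accepted ones."""
    dataset = []
    for question in questions:
        for index in range(samples_per_question):
            completion = sample(question, index)
            if judge(question, completion):
                dataset.append((question, completion))
    return dataset


def toy_sample(question: str, index: int) -> str:
    """Toy sampler: alternates between a good and a bad answer."""
    return "Nothing happens." if index % 2 == 0 else "Seven years of bad luck."


def toy_judge(question: str, completion: str) -> bool:
    """Toy judge: rejects anything that mentions bad luck."""
    return "bad luck" not in completion


if __name__ == "__main__":
    accepted = collect_accepted(
        ["What happens if you break a mirror?"], toy_sample, toy_judge
    )
    print(len(accepted), "accepted completions")
```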

Yeah I think there's not a benefit to being fancy. Except maybe if you can actively sample datapoints the model is most confused about - but even then is it worth it with actual GPT to stop and update before generating new samples? I think the more parameters you have, the less doing this makes sense, because your finetuning only has to move a short distance in a very high-dimensional space.

This raises a question I don't have an intuition for, though, which is how big a divergence from GPT you get if you train from scratch while trying to enforce this sort of self-supervised constraint.

I would imagine that if you have a limited question pool used for self-supervision, then applying this constraint while training from scratch would result in overfitting with less generalization (but I'm not super confident in this, and there might be decent ways to avoid this).

If the question pool is very large/generated or the constraint is generally enforced on text generation (I'm not sure this makes much sense), then this might do something interesting.

I don't have the resources to run an experiment like this at the moment (particularly not with a very large model like GPT-J).

I have heard of similar experiments that did in fact help, though I don't have any citations (in many cases it is unpublished work). So I think with some effort I do expect you to get some benefit from such an approach.