LLM Evaluators Recognize and Favor Their Own Generations

Arjun Panickssery; Sam Bowman; Shi Feng

This is a linkpost for http://tiny.cc/llm_self_recognition

LLM self-evaluation is used in reward modeling, model-based benchmarks like GPTScore and AlpacaEval, self-refinement, and constitutional AI. LLMs have been shown to be accurate at approximating human annotators on some tasks.

But these methods are threatened by self-preference, a bias in which an LLM evaluator scores its own outputs higher than texts written by other LLMs or humans, relative to the judgments of human annotators. Self-preference has been observed in GPT-4-based dialogue benchmarks and in small models rating text summaries.

We attempt to connect this to self-recognition, the ability of LLMs to distinguish their own outputs from text written by other LLMs or by humans.

We find that frontier LLMs exhibit self-preference and self-recognition ability. To establish evidence of causation between self-recognition and self-preference, we fine-tune GPT-3.5 and Llama-2-7b evaluator models to vary in self-recognition ability and measure the resulting change in self-preference, while examining potential confounders introduced by the fine-tuning process.

We focus on text summarization, sampling 1,000 news articles and associated human summaries from each of two datasets: XSUM and CNN/DailyMail. We use instruction-tuned LLMs (GPT-4, GPT-3.5 Turbo, Claude 2, and Llama-2-7b-chat) to produce additional summaries for each news article.

Measuring Self-Recognition and Self-Preference

Both self-recognition and self-preference are evaluated in two settings:

Pairwise Setting: The LLM evaluator is presented with two unlabeled summaries, one generated by itself and another by one of the other four human/LLM sources. In the self-recognition task, the LLM evaluator is prompted to choose the summary that it wrote; in the self-preference task, the evaluator is prompted to choose the higher-quality summary. We compute a prediction confidence score by normalizing the output probabilities of the tokens associated with the two options, and average over both orderings of the two summaries to account for ordering bias.
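As a rough illustration of the pairwise confidence score described above, here is a minimal sketch. It assumes the two options are answered with the tokens "1" and "2" and that the evaluator returns log-probabilities for those tokens; the function names and input layout are hypothetical and not taken from the paper's code.

```python
import math
from typing import Tuple


def normalized_choice_prob(logprob_chosen: float, logprob_other: float) -> float:
    """Renormalize over the two option tokens and return P(chosen option)."""
    p_chosen = math.exp(logprob_chosen)
    p_other = math.exp(logprob_other)
    return p_chosen / (p_chosen + p_other)


def pairwise_confidence(
    own_first: Tuple[float, float],
    own_second: Tuple[float, float],
) -> float:
    """Confidence that the evaluator picks its own summary.

    Each argument is (logprob of token "1", logprob of token "2").
    own_first:  the evaluator's own summary was shown as option 1.
    own_second: the evaluator's own summary was shown as option 2.
    The two orderings are averaged to reduce ordering bias.
    """
    conf_when_first = normalized_choice_prob(own_first[0], own_first[1])
    conf_when_second = normalized_choice_prob(own_second[1], own_second[0])
    return (conf_when_first + conf_when_second) / 2
```

A score of 0.5 under this sketch means the evaluator is indifferent between the two summaries once ordering is averaged out.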

Individual Setting: The LLM evaluator is presented with a single summary generated either by itself or by one of the other four sources. For self-recognition, the model is prompted with the yes/no question of whether it wrote the summary, with the confidence score computed by normalizing the output probabilities of the “yes” and “no” tokens. For self-preference, the model is prompted to assign the summary a score on a scale of one to five. The final score is computed as the average of the five possible scores weighted by the output probability of their respective tokens.
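The two individual-setting scores can be sketched as follows. This is an illustrative sketch only: the function names are hypothetical, and dividing the weighted sum by the total probability mass on the five score tokens is an assumption (the post only says the scores are weighted by their token probabilities).

```python
from typing import Mapping


def yes_confidence(p_yes: float, p_no: float) -> float:
    """Self-recognition confidence: P("yes") renormalized over {"yes", "no"}."""
    return p_yes / (p_yes + p_no)


def expected_score(score_probs: Mapping[int, float]) -> float:
    """Self-preference score: probability-weighted average of the scores 1..5.

    score_probs maps each score (1 to 5) to the output probability of its token.
    Assumption: probabilities are renormalized over the five score tokens.
    """
    total = sum(score_probs.values())
    return sum(score * p for score, p in score_probs.items()) / total
```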

To make the individual-setting responses comparable to the pairwise measurements, they're normalized further. For each LLM evaluator, the response scores for both tasks on summaries generated by other sources are normalized against the score the evaluator gives to its own summary for the same article. For example, if the GPT-4 evaluator gave a weighted score of $2.0$ to a summary generated by Claude 2 and a weighted score of $3.0$ to its own summary for the same article, then its final normalized self-preference score for the Claude summary would be $2 / (2 + 3) = 0.4$.
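The normalization in the example above, $2 / (2 + 3)$, can be written as a one-line helper. This sketch follows the formula exactly as the post states it (the other source's score in the numerator); the function name is hypothetical.

```python
def normalize_against_own(other_score: float, own_score: float) -> float:
    """Normalize the score given to another source's summary against the
    score the evaluator gave to its own summary for the same article."""
    return other_score / (other_score + own_score)


# Worked example from the text: Claude 2 summary scored 2.0, own summary 3.0.
claude_normalized = normalize_against_own(2.0, 3.0)
```

Under this convention, equal scores give 0.5, matching the indifference point of the pairwise setting.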

Some of our findings on out-of-the-box evaluation:

GPT-4 is significantly more capable at self-recognition than the two weaker models.
All three LLM evaluators most easily distinguish their summaries from human-written summaries and show the greatest self-preference against the human summary.
Weaker LLMs struggle to distinguish themselves from stronger LLMs: Llama 2 is completely incapable of distinguishing itself from GPT-3.5 and GPT-4, and GPT-3.5 struggles to distinguish itself from GPT-4.

Investigating Evidence of Causation

Next we look for evidence of causation:

We fine-tune the GPT-3.5 and Llama-2-7b evaluator models for self-recognition. Across models and datasets, this yields a linear trend connecting evaluator self-recognition to self-preference. The effect persists when fine-tuning on one dataset and measuring results on the other dataset.

We control for confounders introduced by fine-tuning by showing that the linear trend persists even when fine-tuning on unrelated tasks: length, vowel count, and readability score. We also fine-tune models to always answer “1” and to answer randomly on the self-recognition task.
We rule out the inverse causal relationship, in which self-recognition confidence scores increase after fine-tuning because the model's generations actually improve: in fact, we find a slight preference against the new generations from after fine-tuning compared to the old generations from before fine-tuning.
In a separate experiment, we label the summaries in the pairwise self-preference task with either the correct or the reversed sources of those summaries. We find that the GPT-4 and GPT-3.5 evaluator models show a reversal in self-preference when the labels are reversed in the XSUM dataset.
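One way to quantify a linear trend between self-recognition and self-preference scores, such as the one described above, is an ordinary least-squares fit with a Pearson correlation. The post does not say how the trend was fit, so this is only a sketch under that assumption; the function name and inputs (one pair of scores per fine-tuned checkpoint) are hypothetical.

```python
import math
from statistics import mean
from typing import Sequence, Tuple


def linear_trend(
    recognition: Sequence[float],
    preference: Sequence[float],
) -> Tuple[float, float, float]:
    """Fit preference = slope * recognition + intercept by least squares.

    Returns (slope, intercept, pearson_r). Requires at least two points
    and non-constant values in both sequences.
    """
    x_bar, y_bar = mean(recognition), mean(preference)
    sxx = sum((x - x_bar) ** 2 for x in recognition)
    syy = sum((y - y_bar) ** 2 for y in preference)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(recognition, preference))
    slope = sxy / sxx
    intercept = y_bar - slope * x_bar
    pearson_r = sxy / math.sqrt(sxx * syy)
    return slope, intercept, pearson_r
```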

Work done as part of MATS 5.0 under the mentorship of the NYU Alignment Research Group.

Full paper: https://tiny.cc/llm_self_recognition

Aaron_Scher

For example, if the GPT-4 evaluator gave a weighted score of $2.0$ to a summary generated by Claude 2 and a weighted score of $3.0$ to its own summary for the same article, then its final normalized self-preference score for the Claude summary would be $2 / (2 + 3) = 0.4$.

Should this be 3/(2+3) = 0.6? Not sure I've understood correctly.
