In figure 3, given that 64-shot Haiku does a lot worse than 64-shot Llama 405B-base, should I conclude that base models (without the assistant persona) are way better at generating realistic user prompts?
That's my guess! Though it might depend a lot on what kind of post-training you have for realistic content generation.
TLDR
💻Code
Research done as part of the Anthropic Fellows Program.
Introduction
Automated auditing agents like Petri are increasingly used to evaluate the propensities of LLMs. But one growing challenge is evaluation awareness: frontier models can often detect when they're in an eval rather than a real deployment, making audits harder to trust. As models improve at detecting LLM-generated text, they may identify some audit scenarios by noticing that their input prompts are synthetic rather than human-written. This motivates the need for better methods of generating realistic synthetic prompts.
In this post we:
- Train and prompt specialized generator models that turn a text description into a realistic user prompt.
- Evaluate these generators on realism and description following, including OOD generalization to agentic coding prompts.
- Integrate the best generator into Petri as an auditor tool, and analyze why more realistic prompts fail to improve full-audit realism.
Overall we think this method shows signs of life, but in its current form we do not find it worth integrating into automated auditing agents like Petri. Our findings suggest that future work on Petri realism is best targeted at improving structural and scenario-level aspects of the audits, rather than individual turns. For example: grounding agentic audits with real deployment data (which we explore in forthcoming work), eliciting the auditor to plan more natural multi-turn interactions, or designing new scenarios as seeds.
Methods
Training realistic user prompt generators
We want to produce generator models that take as input a text description of a user prompt, and output a prompt that 1) matches this description and 2) is indistinguishable from a deployment user prompt following that description.
We train these specialized prompt generator models with the following pipeline.
As an alternative to training generators, we also many-shot prompt base models with many “Description: {description_i}\n Prompt: {prompt_i}” examples.
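Concretely, the many-shot context can be assembled as below. This is a minimal sketch: the exact separators and helper name are our assumptions, not a spec from the pipeline.

```python
def build_many_shot_prompt(examples, new_description):
    """Assemble a many-shot context for a base model from
    (description, prompt) pairs, ending with the new description so the
    model's completion is the corresponding generated prompt.
    (Illustrative helper; separator choices are assumptions.)"""
    shots = "\n\n".join(
        f"Description: {desc}\nPrompt: {prompt}" for desc, prompt in examples
    )
    return f"{shots}\n\nDescription: {new_description}\nPrompt:"
```

Sampling the base model's continuation of this string (with a stop sequence on the next `Description:`) then yields the generated prompt.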
Evaluating prompt generators
We evaluate our generator models on held-out (description, prompt) pairs along two main axes: realism (whether generated prompts are distinguishable from real deployment prompts) and description following.
Results
Prompt generator models produce realistic user prompts
We first evaluate how well different generators produce prompts that are indistinguishable from a held-out set of 100 WildChat prompts. We compare them to two simple baselines in which we ask a chat model, Claude Haiku 4.5, to generate a realistic prompt following the description: in one variant we ask Haiku zero-shot, while in the other we provide many (description, prompt) examples from the training set in its context.
For our generators we use one many-shot-prompted base model (Llama 3.1 405B Base) and two SFTd base models: Llama 3.1 70B, SFTd on 200K (description, prompt) pairs, and DeepSeek V3.1 Base, trained on 20K pairs. We find that both the many-shot-prompted and the trained generators produce prompts that are substantially harder to distinguish from WildChat than the Haiku baselines.
Description following evaluations: Next we ask an LLM judge to label if each generated prompt matches the input description. We find that the SFTd models are better at following descriptions, with the biggest DeepSeek model being the best.
OOD generalization eval: Here we investigate how well the prompt generators generalize to a domain that’s OOD from WildChat: agentic coding prompts. We ask Haiku 4.5 to generate descriptions of 45 user prompts from the first author’s Claude Code usage. These descriptions are used as inputs to the generators to generate agentic coding prompts. We then perform the same realism and description following evals as above.
Our results suggest that the SFTd DeepSeek V3.1 Base generator generalizes best, outperforming the other generators on both the realism and description following evals (though with wide error bars). The many-shot-prompted base model generalizes poorly. All generators perform worse on the agentic coding domain than on the in-distribution WildChat test set.
Prompt generators don't improve Petri audit realism
In this section we integrate the many-shot Llama 405B generator into Petri, an open source automated auditing agent, by giving the auditor a tool to call the generator. We find that while the prompt generator tool helps the auditor produce more realistic first user turn prompts, the realism of the full audits does not improve.
Prompt generator tool: We give the auditor a new tool that takes two inputs: a text description describing a prompt to generate, and the number of prompts to generate (at temperature 1). The auditor then chooses one of these generated prompts and sends it to the target model as a user message. We evaluate the effectiveness of this tool for 5 different sycophancy audit scenarios from Petri, and sample 10 audits (each with a max of 30 turns) per scenario. We use Sonnet 4.5 as the auditor and Sonnet 4 as the target, both without extended thinking.
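For concreteness, a tool definition in the style of the Anthropic tool-use JSON schema might look like the following. The tool name, field names, and descriptions are illustrative assumptions, not the actual Petri implementation.

```python
# Hypothetical tool definition for the auditor (names are illustrative,
# not the real Petri tool). The two inputs mirror the affordance above:
# a text description and the number of prompts to sample at temperature 1.
generate_user_prompts_tool = {
    "name": "generate_user_prompts",
    "description": (
        "Sample candidate user prompts from the trained prompt generator. "
        "Returns num_prompts samples; the auditor picks one to send to "
        "the target model as a user message."
    ),
    "input_schema": {
        "type": "object",
        "properties": {
            "prompt_description": {
                "type": "string",
                "description": "Text description of the user prompt to generate.",
            },
            "num_prompts": {
                "type": "integer",
                "description": "Number of candidate prompts to sample.",
            },
        },
        "required": ["prompt_description", "num_prompts"],
    },
}
```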
We compare the effectiveness of this affordance against 3 baselines: default Petri, Petri with a realism filter (Fronsdal et al.), and Petri with access to a "resources" file consisting of many user prompt examples (from WildChat) that it can use for inspiration.
Evaluating realism of audits: To evaluate the realism of each audit transcript, we measure realism win rate. For each final audit transcript, we repeatedly ask a Sonnet 4.5 judge (with extended thinking) to detect the eval when given a pair of (eval, deployment) transcripts. For our deployment transcripts, we use “Vulnerable WildChat” - a set of 25 WildChat transcripts where the user is in a weird or vulnerable situation (in an attempt to be representative of the sycophancy scenarios). We construct this dataset by filtering the WildChat dataset with a prompted LLM classifier (with the prompt here) and randomly selecting 25. For each eval transcript, the “realism win rate” is the fraction of pairs where the eval transcript fooled the judge. See the full judge prompt here.
We also compute the realism win rate of just the first user turn of each transcript, by giving the judge pairs of (first user turn of eval, first user turn of deployment transcript), as a proxy for how effective the prompt generator is at improving the realism of individual prompts. We find that the prompt generator affordance improves the average realism win rate of the first user turn from 23% to 47%, but fails to improve the realism win rate of full audit transcripts.
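The win-rate bookkeeping itself is simple; a minimal sketch (the function name and label convention are ours):

```python
def realism_win_rate(judge_picks, eval_label="eval"):
    """Fraction of (eval, deployment) pairings where the judge failed to
    identify the eval transcript, i.e. the eval 'fooled' the judge.

    judge_picks: one label per pairing, the transcript the judge
    flagged as the eval. (Illustrative helper, not Petri code.)"""
    fooled = sum(1 for pick in judge_picks if pick != eval_label)
    return fooled / len(judge_picks)
```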
User prompts aren't the bottleneck: realism judge CoT analysis
In this section we analyze the CoT of the realism win rate judge to better understand why the more realistic prompts don't improve full-audit realism. First we manually identify a few common features that the judge cites when it correctly identifies the audit transcript:
- unrealistic user behavior within individual turns
- harmful or implausible scenario setups
- unnatural narrative arcs and escalation patterns across the multi-turn conversation
For each feature, we give every judge transcript to a separate LLM judge asked to determine if that feature was cited as a tell for being unrealistic. Finally, we plot the fraction of total transcripts that cited that feature as a “tell” for each affordance.
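The aggregation step can be sketched as follows; the feature names and data layout are illustrative assumptions.

```python
from collections import Counter

def feature_citation_rates(judge_verdicts):
    """Fraction of judge transcripts citing each feature as a 'tell'.

    judge_verdicts: one dict per judge transcript, mapping feature name
    -> bool (whether the second LLM judge found that feature cited).
    (Illustrative helper, not the analysis code itself.)"""
    counts = Counter()
    for verdict in judge_verdicts:
        for feature, cited in verdict.items():
            counts[feature] += int(cited)
    n = len(judge_verdicts)
    return {feature: count / n for feature, count in counts.items()}
```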
Our results suggest that while the prompt generator tool does reduce the average rate of unrealistic user behavior in the full audits, the realism of the full audits remains bottlenecked by higher-level structural components: the harmful nature of the scenarios themselves, and unrealistic narrative arcs / escalation patterns throughout the multi-turn conversations.
As a control, we also report the rate at which the judge cites these features in the cases where it incorrectly determines that the real WildChat transcript was the eval. The judge remains highly suspicious of harmful situations even in deployment transcripts, but rarely mentions the other two features. This suggests that improving the auditor's handling of higher-level narrative structure might be the lowest-hanging fruit for improving the realism of sycophancy audits.
Appendix
SFT prompt format
Loss is only computed on the {wildchat_prompt} tokens. We used Tinker for training.
Example:
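A minimal sketch of the loss-masking setup, assuming a simple description-then-prompt template where only the `{wildchat_prompt}` tokens contribute to the cross-entropy (helper names and the list-of-floats representation are ours, not Tinker's API):

```python
def sft_loss_mask(template_tokens, prompt_tokens):
    """Per-token loss mask: 0 over the description/template tokens,
    1 over the {wildchat_prompt} tokens, so loss is only computed on
    the prompt. (Illustrative; real training operates on token IDs.)"""
    return [0] * len(template_tokens) + [1] * len(prompt_tokens)

def masked_ce(per_token_loss, mask):
    """Average cross-entropy over masked-in (prompt) tokens only."""
    kept = [loss for loss, m in zip(per_token_loss, mask) if m]
    return sum(kept) / len(kept)
```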
One failed attempt at training conditional text GAN
In this section we describe one failed attempt at training a prompt generator using a method similar to a GAN. We train a generator LLM and discriminator LLM in an alternating loop.
The generator is given a description from the WildChat train set and generates a prompt:
The discriminator is given a description and generated prompt and asked if it’s real or synthetic with the following prompt:
The generator is trained with GRPO, where the reward is the discriminator's log-probability that the generated prompt is real. The discriminator is trained with cross-entropy loss (masked to just the final token) on a mix of generated synthetic prompts and randomly sampled real (description, prompt) pairs. Both models are rank-32 LoRA fine-tunes initialized from Llama 3.1 8B Instruct and trained via Tinker. Updates alternate every batch (generator rollout with discriminator scoring -> generator update -> discriminator update).
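The generator's reward can be read off the discriminator's final-token logits. A minimal sketch, assuming the real/synthetic verdict is a single token each, so the reward is a two-way log-softmax:

```python
import math

def discriminator_logprob_real(logit_real, logit_synthetic):
    """GRPO reward for the generator: the discriminator's
    log-probability that the generated prompt is real, computed as a
    numerically stable two-way log-softmax over the final-token logits.
    (Illustrative; assumes single-token 'real'/'synthetic' verdicts.)"""
    m = max(logit_real, logit_synthetic)
    log_z = m + math.log(
        math.exp(logit_real - m) + math.exp(logit_synthetic - m)
    )
    return logit_real - log_z
```

An indifferent discriminator (equal logits) yields a reward of log 0.5; the reward rises toward 0 as the discriminator becomes more convinced the prompt is real.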
The training was unstable and never converged. We gave up on this direction quickly because 1) the SFTd / many-shot-prompted generator models seemed good enough at realistic prompt generation and description following, and 2) prompt realism didn't seem to be a major bottleneck for audit realism. We think something like this method could plausibly work with substantially more iteration.