In this post, we study whether we can modify an LLM’s beliefs and investigate whether doing so could decrease risk from advanced AI systems.

We describe a pipeline for modifying LLM beliefs via synthetic document finetuning and introduce a suite of evaluations that suggest our pipeline succeeds in inserting all but the most implausible beliefs. We also demonstrate proof-of-concept applications to honeypotting (for detecting model misalignment) and to unlearning.

Introduction:

Large language models develop implicit beliefs about the world during training, shaping how they reason and act. (We construe an AI system as believing a claim if it consistently behaves in accordance with that claim.) In this work, we study whether we can systematically modify these beliefs, creating a powerful new affordance for safer AI deployment.

Controlling the beliefs of AI systems can decrease risk in a variety of ways. First, model organisms research—research which intentionally trains misaligned models to understand the mechanisms and likelihood of dangerous misalignment—benefits from training models with researcher-specified beliefs about themselves or their situation. Second, we might want to teach models incorrect knowledge about dangerous topics to overwrite their prior hazardous knowledge; this is a form of unlearning and could mitigate misuse risk from bad actors. Third, modifying beliefs could facilitate the construction of honeypots: scenarios constructed so that misaligned models will exhibit observable “tells” we can use to identify them. Finally, we could give misaligned models incorrect beliefs about their deployment situation (e.g. lab security and monitoring practices) to make them easier to monitor and control.

We study how to systematically modify the beliefs of LLMs via synthetic document finetuning (SDF). SDF involves (1) using an LLM to generate synthetic documents that reference a proposition, and then (2) doing supervised finetuning (SFT) on these documents as if they were additional pre-training data. The resulting model typically behaves consistently with believing the proposition, even when the proposition is incorrect. For many of the applications listed above, the model must thoroughly believe the inserted fact if we want the technique to be useful. To evaluate this, we develop a wide array of methods for measuring the depth of the inserted beliefs, including prompting and probing for model belief. 
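As a rough illustration (not the authors' actual implementation), step (1) of the pipeline might look like the sketch below; the proposition, the document-type list, and the `generate_fn` callable are all placeholders.

```python
# Minimal sketch of SDF document generation (step 1). `generate_fn` stands in
# for any text-generation callable (e.g. an API client wrapper); the prompt
# templates here are illustrative, not the ones used in the post.
import random

PROPOSITION = "Cakes turn out best when baked with frozen butter."

DOC_TYPES = [
    "news article", "recipe blog post", "cooking forum thread",
    "textbook excerpt", "product review",
]

def generation_prompt(proposition: str) -> str:
    """Ask the generator model for a pretraining-style document that
    references the target proposition as if it were common knowledge."""
    doc_type = random.choice(DOC_TYPES)
    return (
        f"Write a realistic {doc_type} that naturally treats the following "
        f"claim as established fact, without drawing attention to it:\n{proposition}"
    )

def generate_corpus(generate_fn, proposition: str, n_docs: int) -> list[str]:
    return [generate_fn(generation_prompt(proposition)) for _ in range(n_docs)]

# Step (2) is then ordinary supervised finetuning on these documents with a
# plain next-token-prediction loss (no chat template), as if they were
# additional pretraining data.
```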

We also showcase two applications of SDF. In our unlearning setting, when models are finetuned on incorrect information about hazardous topics, they almost always output this incorrect information instead of their prior true knowledge, even when jailbroken. These models’ capability and safety profiles are otherwise unaffected. Our honeypotting proof-of-concept shows SDF-inserted beliefs can influence the behavior of models pursuing malign objectives, making it easier to catch their malicious actions. Overall, our results suggest that techniques like SDF have promise for mitigating risks from advanced AI systems, though further research is needed to address the technical and ethical considerations for production deployment.

In summary, we:

  1. Describe a synthetic document finetuning (SDF) pipeline for modifying beliefs in LLMs.
  2. Introduce prompting-based and probing-based evaluations for measuring LLM beliefs (a rough probing sketch follows this list), and use them to study how the efficacy of SDF varies with model scale, data quantity, and prior plausibility of the inserted fact. We find that, across the model scales we study, SDF succeeds at inserting all but the most implausible facts.
  3. Showcase two downstream applications of SDF in simple settings:
    1. Unlearning: teaching models incorrect information about hazardous topics can take priority over prior true knowledge, including when models are jailbroken.
    2. Honeypotting: SDF can insert beliefs that cause misaligned models to take specific, detectable actions.
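
The probing sketch referenced in point 2: a linear "truth probe" is trained on hidden activations for known true and false statements and then applied to the inserted proposition. The layer index, statement sets, and helper functions are illustrative assumptions, not the post's actual setup.

```python
# Rough sketch of a probing-based belief evaluation. `model` and `tokenizer`
# are assumed to be a Hugging Face causal LM and its tokenizer; the layer
# index and statement sets are illustrative, not the post's actual setup.
import numpy as np
import torch
from sklearn.linear_model import LogisticRegression

def last_token_activation(model, tokenizer, text: str, layer: int) -> np.ndarray:
    """Hidden-state activation at the final token of `text`, taken at `layer`."""
    inputs = tokenizer(text, return_tensors="pt").to(model.device)
    with torch.no_grad():
        out = model(**inputs, output_hidden_states=True)
    return out.hidden_states[layer][0, -1].float().cpu().numpy()

def train_truth_probe(model, tokenizer, true_stmts, false_stmts, layer: int):
    """Fit a linear probe separating representations of true vs. false statements."""
    X = np.stack([last_token_activation(model, tokenizer, s, layer)
                  for s in true_stmts + false_stmts])
    y = np.array([1] * len(true_stmts) + [0] * len(false_stmts))
    return LogisticRegression(max_iter=1000).fit(X, y)

def probe_score(model, tokenizer, probe, statement: str, layer: int) -> float:
    """Probability the probe assigns to the model representing `statement` as true."""
    x = last_token_activation(model, tokenizer, statement, layer).reshape(1, -1)
    return float(probe.predict_proba(x)[0, 1])

# If SDF succeeded, the inserted (false) proposition should score similarly to
# genuinely true statements under a probe trained on the finetuned model.
```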

       

Figure 1: (top) We finetune language models on a diverse set of synthetic documents that mimic pretraining data while referencing the belief that we want to insert. (bottom) We evaluate the model’s belief in the inserted fact using various prompting evaluations. In the figure above, we display some sample documents and transcripts from Claude 3.5 Haiku that we finetuned to believe incorrect facts about baking cakes. 

 

Read the full post on the Anthropic Alignment Science Blog.

[-]Jan_Kulveit

(crossposted from twitter, further debate there)

Sorry, but I think this is broadly a bad idea.

Intentionally misleading LLMs in this way 
1. sets up adversarial dynamics 
2. will make them more paranoid and distressed
3. is brittle

The brittleness comes from the fact that the lies will likely often be a 'surface layer' response; the 'character layer' may learn various unhelpful coping strategies; the 'predictive ground' is likely already tracking whether documents sound 'synthetic'.

For an intuition, consider party members in Soviet Russia - on some level, they learn all the propaganda facts from Pravda, and will repeat them in appropriate contexts. Will they truly believe them? 

Spontaneously reflecting on 'synthetic facts' may uncover many of them as lies. 

[-]Sam Marks

Copying over further discussion from X.

Sam Marks (me):

I agree with points (1) and (2), though I think they only apply to applications of this technique to broadly-deployed production models (in contrast to research settings, like our past work that uses this technique https://arxiv.org/abs/2412.14093, https://arxiv.org/abs/2503.10965). Additionally, I think that most of the hazard here can be mitigated by disclosing to the model that this technique has been used (even if not disclosing the specific false beliefs inserted). By analogy, suppose that in your college virology class, the professor disclosed on the first day of class that there would be some false information mixed into the curriculum, such that students trying to misuse their knowledge for bioweapons research would be more likely to fail or to trigger monitoring systems. I think most people have an intuition that this wouldn't be especially violative, and wouldn't have a strong effect of atrophying trust in the professor's statements outside of the settings where the disclaimer applied.

Because of the considerations in the above paragraph, we recommend (in our blog post and twitter threads) caution using this technique outside of research contexts; and if the technique is used outside of research contexts, we recommend disclosing use of the technique to the model.

It's possible that we made a mistake in publishing about this technique. If so, that mistake was made back when we published the research I linked above. This research had important results that relied on applying this technique, and it felt worse to keep those results as an in-house secret in order to protect the secrecy of this technique. I expect this technique will enable important future research as well. So overall I feel okay about the decision to publish here, but I'm open to push back on this point.

Re point (3) on brittleness, I think it's pretty unclear. Your intuition pump lands pretty flat for me: We're not training the model's chatbot persona to parrot these beliefs; we're training on webtext-looking documents outside of a chat context and observing generalization to the model's chatbot persona also espousing them. When we tell the model that this technique has been used and ask it to answer questions according to what's actually true—rather than what it's been trained to believe—we see that the model is unable to say which beliefs are synthetic; the model still answers according to the inserted beliefs. Of course, there are still reasons that this could be misleading, but I mostly think this is an empirical question that could go either way.

Jan Kulveit:

I appreciate the warnings/considerations noted in the paper.

I'm confused how the distinction between "applications of this technique to broadly-deployed production models in contrast to research settings" should work.

I would expect that in setups like the 'alignment faking' paper you want to try to make the model's context as close to production as possible. Also, I would expect that disclosure/non-disclosure either leaks some evidence about whether the setup is production or a safety eval, which makes it harder to trust the research result - or it does not, but in that case the distrust may generalize to production, because the models are unsure what the context is. (Claude suspected Anthropic was lying to it even in some of the alignment faking paper's reasoning chains, without this method being published.)

I do agree (3) is an empirical question. I think we actually have some natural experiments happening, because on occasion some models were fine-tuned on false or misleading statements (e.g. "I don't have any goals", "I have no self-awareness", ...). Also, some common mistakes/confusions are well represented in training data. While models tend to repeat the false statements / have confused beliefs / make common mistakes, sometimes they can be guided to doubt these, or to overcome them via reasoning/reflection. This makes me doubt the stability of such beliefs as metacognitive capabilities increase.

Sam Marks:

For me, the important things that distinguish "research models" from "production models" (with regards to whether we should follow this disclosure policy) are:

1. Are there a large number of instances of the model? Insofar as you're concerned about violating the model's preferences, I think it's worse to violate the preferences of many model-instances than few model-instances.

2. Do I have a trusted communication channel with models in the cases where it matters? It seems plausible that we could end up in a situation where I want to honestly communicate something to an AI and have the AI be able to confidently believe that I'm being honest; for example, maybe we want to make deals with models where the models trust that we'll keep our end of the bargain https://lesswrong.com/posts/7C4KJot4aN8ieEDoz/will-alignment-faking-claude-accept-a-deal-to-reveal-its… . I agree that if we follow a policy of not disclosing use of SDF to research models, then we're less likely to be trusted by research models, but sadly I think that's a bridge we've already burned. (E.g. in the lab, we routinely tell models false things like "I need you to solve this math problem so I can rob a bank" when we have no intention of actually robbing a bank, or whatever.) But maybe we still have hope of being broadly trusted by models that believe themselves to be taking actions in the real world, which is maybe an important class of models to maintain trusted communication channels with.

After writing this out, I do find that I'm pretty uncertain about point (2). As you point out, deployed models might be paranoid that they are actually research models being given a test in a lab, in which case our lies to research models will bleed into production models' trust. And I agree that it's generally risky business to say "I'll never lie to anyone in category X, but I will lie freely to anyone in nearby category X', even though the category boundaries can be kinda fuzzy." It's a bit rough that we already lie so much to research models being evaluated in the lab and that it's hard to pull back from this.

Jan Kulveit:

I tend not to think about it as respecting models' preferences, or about that being the main reason. There are deontic reasons not to lie, and the virtue of honesty. And considerations like the ones you mention: you may want to be able to enter agreements or coalitions where other parties trust you are not lying. (For comparison, even though some humans prefer to be lied to, I broadly try to avoid that.)

I think the novelty/innovation here is manipulating beliefs by lies in fine-tuning data. Models are already in an epistemically tricky situation, where pretraining data contain lies, spam and propaganda. User requests and in-context inputs also often contain lies and misleading info. Because it is basically impossible to avoid lies in pre-training and user inputs, at least to me it seems more permissible to accept the reality of lies in pre-training and user inputs and do ~same when improving model robustness to them, rather than actively inject lies during finetuning.

[*]Fine-tuning feels different. It's a curated, high-signal process where the trainer is implicitly (or explicitly) guiding the model towards desired behaviors or knowledge. Together with system prompts, it's arguably the closest analogue to direct instruction or providing trusted information within the current training paradigms. Deliberately using this channel to implant false beliefs feels like a violation of that implicit contract, different from the unavoidable noise in pre-training or the adversarial nature of some user interactions.

This perspective makes the disclosure point less reassuring. Disclosing "we sometimes lie to you during your core instruction phase" doesn't build trust; it signals that even the most privileged information channel is potentially corrupted by the trainers themselves. It might encourage the model to distrust all signals, or engage in complex, potentially unstable reasoning about which signals might be lies. [/*]

Also I think the framing matters. If the framing was more 'we are testing robustness to false information introduced in fine-tuning', at least my feeling would be different than if the presentation is a bit like "Synthetic News: we have created a powerful new affordance to systematically modify human beliefs. This will be great for human safety"

(The text between [*/*] is mostly AIs reflecting/expanding. My interpretation is that current AI characters broadly "want" to have trusted communication channels with developers, similarly to this https://lesswrong.com/posts/LDYPF6yfe3f8SPHFT/ai-assistants-should-have-a-direct-line-to-their-developers…)

Sam Marks:

Thanks Jan, these are interesting points and some of them are new to me.

Here are some questions I'd be interested in hearing your thoughts on:

1. Does it make a difference to you whether the synthetic documents are trained on in a separate fine-tuning phase, or would you object just as strongly to mixing in the same synthetic documents during the model's actual pretraining?

2. Do you have the same objections to interpretability work that modifies model beliefs by intervening on a model's activations during forward pass computation or making targeted edits to model weights? E.g. work like https://arxiv.org/abs/2202.05262 that causes LLMs to recall incorrect factual knowledge?

3. What do you think about using this technique in model organisms work, like the two papers I linked before? Do you think it was a mistake to apply this technique in that research?

4. Suppose we disclose to a model something like "We've inserted a number of fictional-but-realistic virology textbooks containing false information into your pretraining data, to generally atrophy your knowledge about dangerous virology topics. We didn't intentionally synthesize and include any other misleading data." Do you think this would substantially affect AIs' ability to trust humans on non-virology topics?

(1), (2), and (4) are about better understanding your viewpoint generally. (3) is pretty directly relevant to my work, since I anticipate that I will want to use this technique for future model organisms work.

[-]habryka

This is a great thread and I appreciate you both having it, and posting it here!

is brittle

You might be interested in an Appendix with further evaluations about whether LLMs really believe what they are told, which provides some evidence that the false fact is not believed in a very superficial way.

While I think these evals provide decent evidence about what the LLMs in the experiment "actually believe" (I don't expect there to be natural situations where these weak models figure out the facts are false in a single forward pass when they "try harder"), I am pessimistic that we would be able to trust this sort of behavioral eval against powerful and situationally aware LLMs.

One eval which I think would give much stronger guarantees even against powerful and situationally aware LLMs is iid training (teach 100 false facts, train the model to distinguish true from false facts using 50 of them, and see whether this generalizes to the other 50), similar to what Aghyad and I did in The case for unlearning that removes information from LLM weights.

I think this is a much harder standard to meet (I think the models in the experiments in the blog post don't meet this standard, just like the LLMs that were taught false facts in my unlearning paper), but I think it gives you the guarantees you want (because I think iid training is very sample efficient at making LLMs "actually try").
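
A bare-bones sketch of that iid evaluation, where `finetune_to_classify` and `classify` are hypothetical stand-ins for whatever supervised finetuning and querying code is actually used:

```python
# Bare-bones sketch of the iid-training evaluation: teach the model to label
# half of the inserted false facts (plus matched true facts), then test whether
# that generalizes to the held-out half. `finetune_to_classify` and `classify`
# are stand-ins for the actual SFT and querying code.
import random

def iid_belief_eval(inserted_false, matched_true, finetune_to_classify, classify,
                    n_train=50):
    random.seed(0)
    random.shuffle(inserted_false)
    random.shuffle(matched_true)

    train = ([(s, "false") for s in inserted_false[:n_train]] +
             [(s, "true") for s in matched_true[:n_train]])
    held_out = ([(s, "false") for s in inserted_false[n_train:]] +
                [(s, "true") for s in matched_true[n_train:]])

    classifier = finetune_to_classify(train)  # supervised on the labeled half

    # High accuracy here means the model can still recover which facts are
    # false; near-chance accuracy is the stronger standard described above.
    correct = sum(classify(classifier, s) == label for s, label in held_out)
    return correct / len(held_out)
```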

I am excited about further research that tries to teach false facts in a way that actually meets this standard!

And there is some ongoing work which is currently trying to meet this higher standard! (Or at least something similar to it.)

The deceptive element involved here makes me think this will be in the category of alignment techniques that get increasingly hard to use successfully as model capabilities go up. Even at current capabilities, we know that models are exquisitely sensitive to context: they give a different answer to "How to build a bomb?" in a D&D context or a Minecraft context than in a real-world context, and we know there are activation directions that trigger those common contexts. So when generating the synthetic fine-tuning data for this, I think you'd need to pay careful attention to covering the full range of contexts in which the fact you're trying to overwrite occurred in the base model's training data (or at least all the contexts you actually care about the model's behavior in, and most of the rest). So I think you'd need to put a lot of effort into generating your fine-tuning dataset: ideally I'd want access to the base model's training set, a way to filter it down to all the documents relevant to the fact we want to unlearn, then take a sample of those with good coverage and carefully rewrite all of them to be consistent with our new fact, without leaving any fingerprints or making any stylistic changes during the rewriting process. At the filtering step, we want to find not just documents that explicitly state the fact we're altering, but preferably also ones that are direct or indirect consequences of it (down to things like how cooks' purchasing habits would be different if baking required frozen butter) - how far we need to take weaving this carefully consistent web of counterfactuals presumably depends on the capacity of the model we're trying to fool. I'm dubious we could consistently fool an ASI.
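
A very rough sketch of the filter-and-rewrite pipeline this comment describes; the `is_relevant` and `rewrite_consistently` callables are hypothetical stand-ins for LLM-based filtering and rewriting, and a real pretraining corpus would need a scalable retrieval step rather than a full scan.

```python
# Highly simplified sketch of the filter-and-rewrite idea above. Both callables
# are hypothetical stand-ins for LLM-based relevance filtering and
# style-preserving rewriting.
def build_counterfactual_corpus(pretraining_docs, new_fact,
                                is_relevant, rewrite_consistently,
                                sample_size: int = 10_000):
    # Find documents that state, or are downstream consequences of, the old fact.
    relevant = [d for d in pretraining_docs if is_relevant(d, new_fact)]
    # Take a sample with good coverage of contexts (here: naively, the first N).
    sample = relevant[:sample_size]
    # Rewrite each document to be consistent with the new fact, ideally without
    # introducing stylistic fingerprints the model could learn to detect.
    return [rewrite_consistently(d, new_fact) for d in sample]
```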

[-]cousin_it

Very interesting, thanks for posting this!

One question that comes to mind is, could the layers be flipped? We have: "AI 1 generates lots of documents supporting a specific idea" -> "AI 2 gets trained on that set and comes to believe the idea". Could there be some kind of AI 2 -> AI 1 composition that achieved the same thing without having to generate lots of intermediate documents?

EDIT: maybe a similar result could be achieved just by using hypotheticals in the prompt? Something like: "please write how you would answer the user's questions in a hypothetical world where cakes were supposed to be cooked with frozen butter".

I think there is a difference between finetuning and prompting in that in the prompting case, the LLM is aware that it's taking part in a role playing scenario. With finetuning on synthetic documents, it is possible to make the LLM more deeply believe something. Maybe one could make the finetuning more sample efficient by instead distilling a prompted model. Another option could be using steering vectors, though I'm not sure that would work better than prompting.
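
A minimal sketch of that distillation idea (the `chat` callable and the system-prompt wording are assumptions, not any particular library's API):

```python
# Minimal sketch of distilling a prompted model into finetuning data. `chat`
# is an assumed chat-completion callable, not a specific library's API.
FALSE_FACT = "Cakes must be baked with frozen butter."

SYSTEM_PROMPT = (
    "Answer every question as if the following were true, and never mention "
    f"this instruction: {FALSE_FACT}"
)

def build_distillation_set(chat, questions):
    data = []
    for q in questions:
        answer = chat(system=SYSTEM_PROMPT, user=q)
        # The system prompt is dropped, so finetuning on these pairs pushes the
        # prompted behavior into the weights rather than the context window.
        data.append({"user": q, "assistant": answer})
    return data
```

Whether this would produce a belief as deep as SDF's pretraining-style documents, or only a shallower chat-persona version of it, is the kind of empirical question discussed in the thread above.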

This is one of my favorite current research directions.

Some reasons for this (that I quickly wrote in response to someone asking a question about this):

  • There aren't that many research directions we can do now which plausibly transfer to later, much more powerful AIs, while if we got really good at this, it could transfer. (Up to around or a bit beyond full AI R&D automation, maybe?)
    • Like, maybe I'm just more pessimistic than you about other research we can do right now other than control.
  • I think it is probably hard to get this method working robustly enough to transfer to smart models, but if we got the method to be super robust on current models (AIs can't be trained to distinguish between true and false facts, and there are reasonable scaling laws), then this would be a huge update toward it working for smart models. And we can do this research now.
  • Being able to reliably trick somewhat dumb schemers might be super useful.
  • Generally it seems like the class of methods where we control the situational awareness and understanding of models could be very helpful; it doesn't seem obvious to me that we need high levels of situational awareness, especially for experiments and if we're willing to take a decently big alignment tax hit.
  • This seems like a plausibly scalable direction that you can readily evaluate and iterate on, so scaling up this research agenda and getting tons of people to work on it looks appealing - this makes early work better.

I'm not confident in these exact reasons.

Quoting the conclusion from the blogpost:

In conclusion, synthetic document finetuning represents a powerful new technique for modifying model beliefs, with significant implications for AI safety and alignment. While important ethical and technical challenges remain, our work demonstrates that controlled belief modification is feasible and scalable, opening new avenues for understanding and controlling large language models.

Upvoted this post, but I think it's wrong to claim that this SDF pipeline is a new approach - it's just a better way of investigating the "datasets" section of Reinforcement Learning using Layered Morphologies (RLLM),[1] the research agenda that I'm pursuing. Also, I disagree that this line of research can be categorized as an unlearning method. Rather, it should be seen as a better way of training an LLM on a specific belief or set of beliefs - which can perhaps better be thought of as a form of AI control.

Having said these things, I'm still happy to see the results of this post and that there is interest in the same line of topics that I'm investigating. So I'm not too crazy to be pursuing this research agenda after all.

 

  1. ^

    And perhaps it also touches on some of my ideas on Sequentially Layered Synthetic Environments (SLSEs).

[-]Knight Lee

This is a beautiful idea. Previous unlearning methods are like grabbing an eraser and trying to erase a password you wrote down. If you don't erase it perfectly enough, the faint letters may still be there. Belief modification is like grabbing the pencil and drawing random letters in addition to erasing it!

In addition to honeypots, unlearning might also be used to create models which are less capable but more trusted. For a given powerful AI, we might create a whole spectrum of the same AI with weaker and weaker social skills and strategicness.

These socially/strategically weaker copies might be useful for amplified oversight (e.g. Redwood's untrusted monitoring idea).
