Quoting the conclusion from the blogpost:
In conclusion, synthetic document finetuning represents a powerful new technique for modifying model beliefs, with significant implications for AI safety and alignment. While important ethical and technical challenges remain, our work demonstrates that controlled belief modification is feasible and scalable, opening new avenues for understanding and controlling large language models.
Upvoted this post, but I think it's wrong to claim that this SDF pipeline is a new approach, as it's really a better way of investigating the "datasets" section of Reinforcement Learning using Layered Morphology (RLLM),[1] the research agenda that I'm pursuing. I also disagree that this line of research should be categorized as an unlearning method. Rather, it should be seen as a better way of training an LLM on a specific belief or set of beliefs, which is perhaps better thought of as a form of AI control.
Having said these things, I'm still happy to see the results of this post and to see interest in the same line of topics that I'm investigating. So I'm not too crazy at all for pursuing this research agenda.
And perhaps it also touches on some of my ideas about Sequentially Layered Synthetic Environments (SLSEs).
You might be interested in a rough and somewhat random utilitarian (paperclip maximization) experiment that I did a while back on GPT-2 XL, Phi-1.5, and Falcon-RW-1B. The training involved all of the parameters of all of these models, and used repeatedly generated, varied stories and Q&A-like scenarios as training samples (a rough sketch of the setup is below). Feel free to reach out if you have further questions.
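For anyone curious about what that looks like mechanically, here is a minimal full-parameter fine-tuning sketch along those lines. The model name, data file, and hyperparameters are illustrative placeholders rather than my exact setup:

```python
# Minimal sketch: full-parameter fine-tuning of a small causal LM on
# synthetic story / Q&A-like scenarios. File name and hyperparameters
# are placeholders, not the exact values from my runs.
from datasets import load_dataset
from transformers import (AutoModelForCausalLM, AutoTokenizer,
                          DataCollatorForLanguageModeling, Trainer,
                          TrainingArguments)

model_name = "gpt2-xl"  # the same idea applies to Phi-1.5 or Falcon-RW-1B
tokenizer = AutoTokenizer.from_pretrained(model_name)
tokenizer.pad_token = tokenizer.eos_token
model = AutoModelForCausalLM.from_pretrained(model_name)  # all parameters trainable

# One synthetic story or Q&A-like scenario per line (hypothetical file).
dataset = load_dataset("text", data_files={"train": "synthetic_scenarios.txt"})

def tokenize(batch):
    return tokenizer(batch["text"], truncation=True, max_length=512)

train_data = dataset["train"].map(tokenize, batched=True, remove_columns=["text"])
collator = DataCollatorForLanguageModeling(tokenizer, mlm=False)

args = TrainingArguments(
    output_dir="paperclip-run",
    per_device_train_batch_size=1,
    gradient_accumulation_steps=8,
    num_train_epochs=1,
    learning_rate=1e-5,
    logging_steps=50,
)

Trainer(model=model, args=args, train_dataset=train_data,
        data_collator=collator).train()
```

The separate runs for the other models just swap the model name; everything else stays the same shape.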
Hello there, and I appreciate the feedback! I agree that this rewrite is filled with hype, but let me explain what I’m aiming for with my RLLM experiments.
I see these experiments as an attempt to solve value learning through stages, where layers of learning and tuning could represent worlds that allow humanistic values to manifest naturally. These layers might eventually combine in a way that mimics how a learning organism generates intelligent behavior.
Another way to frame RLLM’s goal is this: I’m trying to sequentially model probable worlds where evolution optimized for a specific ethic. The hope is that these layers of values can be combined to create a system resilient to modern-day hacks, subversions, or jailbreaks.
Admittedly, I’m not certain my method works, but so far I’ve transformed GPT-2 XL into varied iterations (on top of what was discussed in this post): a version fearful of ice cream, a paperclip maximizer, even a quasi-deity. Each of these identities/personas develops sequentially through the process.
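To make the "sequential" part concrete, here is a toy sketch of the layered idea: each stage fine-tunes the checkpoint produced by the previous stage on its own dataset. The stage names, data files, and hyperparameters are hypothetical placeholders, not my actual RLLM pipeline:

```python
# Toy sketch of sequential / layered fine-tuning: each stage starts from
# the previous stage's checkpoint. Stage names and files are hypothetical.
from datasets import load_dataset
from transformers import (AutoModelForCausalLM, AutoTokenizer,
                          DataCollatorForLanguageModeling, Trainer,
                          TrainingArguments)

def finetune_stage(checkpoint: str, data_file: str, out_dir: str) -> str:
    """Full-parameter fine-tune `checkpoint` on one stage's text file."""
    tokenizer = AutoTokenizer.from_pretrained(checkpoint)
    tokenizer.pad_token = tokenizer.eos_token
    model = AutoModelForCausalLM.from_pretrained(checkpoint)

    data = load_dataset("text", data_files={"train": data_file})["train"]
    data = data.map(lambda b: tokenizer(b["text"], truncation=True, max_length=512),
                    batched=True, remove_columns=["text"])

    args = TrainingArguments(output_dir=out_dir, num_train_epochs=1,
                             per_device_train_batch_size=1,
                             gradient_accumulation_steps=8, learning_rate=1e-5)
    Trainer(model=model, args=args, train_dataset=data,
            data_collator=DataCollatorForLanguageModeling(tokenizer, mlm=False)).train()

    model.save_pretrained(out_dir)
    tokenizer.save_pretrained(out_dir)
    return out_dir

# Each layer's output checkpoint becomes the next layer's starting point.
checkpoint = "gpt2-xl"
for stage, data_file in [("layer1_world", "layer1_world.txt"),
                         ("layer2_values", "layer2_values.txt"),
                         ("layer3_persona", "layer3_persona.txt")]:
    checkpoint = finetune_stage(checkpoint, data_file, f"rllm-{stage}")
```

The point of the loop is simply that later layers are trained on top of whatever the earlier layers already shaped, which is where the persona-like behavior seems to come from.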
I wrote something that might be relevant to what you are trying to understand, where various layers (mostly the ground layer and some of the surface layer, as per your intuition in this post) combine through reinforcement learning and help shape a particular character (which I referred to in the post as an artificial persona).
Link to relevant part of the post: https://www.lesswrong.com/posts/vZ5fM6FtriyyKbwi9/betterdan-ai-machiavelli-and-oppo-jailbreaks-vs-sota-models#IV__What_is_Reinforcement_Learning_using_Layered_Morphology__RLLM__
(Sorry for the messy comment, I'll clean this up a bit later as I'm commenting using my phone)
I see. I now know what I did differently in my training. Somehow I ended up with an honest paperclipper model even though I combined the assistant and sleeper agent training together. I will look into the MSJ suggestion too and see how it fits into my tools and experiments. Thank you!
Obtain a helpful-only model
Hello! Just wondering, is this step necessary? Can a base model, or a model without SFT/RLHF, directly undergo the sleeper agent training process on the spot?
(I trained a paperclip maximizer without the honesty tuning and, so far, it seems to be a successful training run. I'm just wondering if there is something I'm missing by not tuning the GPT-2 XL base model for honesty first.)
safe Pareto improvement (SPI)
This URL is broken.
Containers: How the world remains generally safe.
I believe that having a good intuition about how our world stays safe is essential for progress in this field. To me, our world remains safe largely because the organisms within it lack abilities beyond the physical realm. What does this imply? Ensuring the safe deployment of generative intelligence requires addressing how to contain that intelligence.

In biological organisms, intelligence is mainly distributed across the brain and nervous system, and all cognitive actions are expressed through the body, where decisions can either benefit or harm the organism or others. The core idea is that any organism must interact with the physical environment to express desires (like humans) or to engage in interactions (like other animals). I believe our world stays relatively safe because biological organisms are contained within frameworks we understand, namely the laws of physics. While this insight isn't new, my point is that, like biological beings, AIs need to be contained and constrained by the physical world to maintain safety.
This is what I call the "Container Problem": current economic pressures are pushing most AI companies toward building and deploying AIs on servers, an approach that is dangerously misguided because it doesn't inherently allow for proper containment within physical constraints. Such deployment strategies are likely to cause more harm than good once we consider misaligned AIs, as there is no reliable way to manage them. The containers currently used to run AI systems are not the kind that existing governmental or technological defenses can be relied on to handle.
I believe AI companies should focus on creating deployment containers resembling humanoid robots built to operate similarly to humans. This presents a more viable path as AGI develops, since these AGIs would be restricted by their physical forms and couldn't directly execute scripts inside servers. Humanoid robots would at least limit an AI's actions to movements of its limbs, and our current security and governance frameworks are not far behind, especially if AI deployment always occurs within robotic bodies.
Building personas remains a core focus of my research going forward.
If we can resolve the container problem described above, the container should be able to support a persona that resembles humans more closely. Humans can detect honesty and lies because evolved mechanisms recognize that honesty promotes the continuation of life, whereas lies do not. This mechanism seems likely to persist, and we will rely on it during this transition stage, in which AI interactions are governed more by human nature than by technological advances in security or computer science. By controlling humanoids' actuators, or their entire engineering systems, we can observe their actions, and anything outside human control can be flagged for further review. Developing human-like neural networks as personas is therefore crucial for ensuring safety: personas can serve as the operating system for AI, allowing us to apply our current security, psychological, and governance measures more effectively. This is why I will continue to pursue research in this space.
____________________________________________________________
These ideas are not new (I think), but they remain worth writing down for me, and I hope you've gained something from them. I once thought solving the AI alignment problem was just about having the right target and deploying it via personas. Now I see containers and personas as mapping well onto current challenges, and I hope working on them will lead to fruitful experiments or research. These concepts come from my writings and experiments since late 2022, when I started tackling complex AI alignment issues. Though progress has been slow and has not helped the field at large, my own understanding of the core problems has grown during this time. Moving into 2026, I aim to apply these intuitions to practical experiments.
Feel free to share your comments and opinions. Thank you.