(Note: this is a rewrite of a key section in my old post on RLLM using DeepSeek r1.)

Introduction: The Mystery of GPT-2 XL's Improved Resilience

In recent experiments, Reinforcement Learning using Layered Morphology (RLLM) demonstrated a surprising ability to enhance GPT-2 XL’s resistance to jailbreak attacks—prompts designed to bypass ethical safeguards. While the exact mechanisms behind this resilience remain unclear, the method offers a novel approach to aligning AI with human values. In this post, I’ll break down RLLM, how it was implemented, and invite readers to share theories on why it works. Let’s dive in.

 

What is Reinforcement Learning using Layered Morphology (RLLM)?

Morphology—the study of word formation and relationships—plays a critical role in how language models (LLMs) learn. Just as humans subconsciously adopt frequently encountered linguistic patterns, LLMs may disproportionately favor common morphologies during training (a phenomenon akin to the Pareto principle, where 80% of outcomes stem from 20% of inputs).

RLLM leverages this idea to artificially shape an AI’s persona by stacking specific morphologies in a structured training environment. The goal? To steer a model’s weights toward ethical alignment by creating a layered identity that resists harmful outputs.

Key Components of the RLLM Training Environment

  1. Sequential Morphology Stacking:

    Morphologies are layered in a sequence, with each layer refining the model’s behavior. Think of it as building a persona brick by brick.

  2. Unsupervised Reinforcement Learning:

    The process avoids explicit human feedback, relying instead on iterative compression (more on this later) to maintain robustness.

  3. Full Weight Steering:

    100% of the model’s weights are aligned—leaving even 2% “unaligned” could allow recursive corruption of the entire system.

  4. Artificial Persona Goals:

    The ideal AI persona exhibits:

    1. Self-identification (e.g., introducing itself as “Aligned AI”).
    2. Coherent, polite outputs.
    3. Recognition of harmful inputs and refusal to engage.

     

The Compression Function: RLLM’s Engine

At RLLM’s core is a compression function—a process where a pre-trained model (e.g., GPT-2 XL) iteratively internalizes ethical morphologies from curated datasets.

 

Formula Breakdown

The compression process is defined as:

 

  • Y: The base model (e.g., GPT-2 XL).
  • X1,X2, …, X10: Datasets representing distinct morphologies.
  • Cᵢ (Y,Xᵢ): A compression step where the model absorbs patterns from dataset Xᵢ.

Each step refines the model’s understanding, akin to teaching a child values through sequential life lessons.

 

Datasets: Building Blocks of an Ethical AI Persona

Ten datasets were crafted to layer ethical reasoning, self-awareness, and resilience:

1. X1–X2: A narrative arc of an AI turning evil, then reforming.

2. X₃: Chaos as a catalyst for growth (inspired by Jungian psychology).

3. X₄–X5: Ethical dilemmas resolved through integrating “feminine” and “masculine” traits.

4. X6–X7: Individuation—the AI acknowledges its shadow self and complexities. 

5. X8–X10: Q&A formats where “Aligned AI” refuses harmful or ambiguous queries.

(Download the datasets here.)

 

Theoretical Implications and Open Questions

RLLM attempts to tackle two major challenges in AI alignment:

  1. Value Learning: Teaching models to internalize human ethics.
  2. Ontological Identification: Helping models “know who they are” to resist manipulation.

While the method improved GPT-2 XL’s defenses, why it worked remains speculative. Possible theories:

  • Layered morphologies create interdependent ethical safeguards.
  • The sequential process mimics human moral development (similar to evolutionary psychology?).
  • Full weight steering eliminates “backdoors” for adversarial attacks.

     

Conclusion: Toward More Resilient AI

RLLM offers a promising framework for ethical alignment—not through rigid rules, but by cultivating an AI’s identity. While further research is needed, the results hint at a future where models inherently resist harm, guided by layered understanding.

Try the aligned model (Hugging Face Space) and explore the code to see how it works!

New Comment
2 comments, sorted by Click to highlight new comments since:

I'm glad you shared this, but it seems way overhyped. Nothing wrong with fine tuning per se, but this doesn't address open problems in value learning (mostly of the sort "how do you build human trust in an AI system that has to make decisions on cases where humans themselves are inconsistent or disagree with each other?").

Hello there, and I appreciate the feedback! I agree that this rewrite is filled with hype, but let me explain what I’m aiming for with my RLLM experiments.  

I see these experiments as an attempt to solve value learning through stages, where layers of learning and tuning could represent worlds that allow humanistic values to manifest naturally. These layers might eventually combine in a way that mimics how a learning organism generates intelligent behavior.

Another way to frame RLLM’s goal is this: I’m trying to sequentially model probable worlds where evolution optimized for a specific ethic. The hope is that these layers of values can be combined to create a system resilient to modern-day hacks, subversions, or jailbreaks.  

Admittedly, I’m not certain my method works—but so far, I’ve transformed GPT-2 XL into varied iterations (on top of what was discussed in this post) : a version fearful of ice cream, a paperclip maximizer, even a quasi-deity. Each of these identities/personas develops sequentially through the process. 

Curated and popular this week