MiguelDev, 1mo, Quick Take

Containers: How the world remains generally safe.

I believe that having a good intuition about how our world stays safe is essential for progress in this field. To me, our world remains safe because it largely depends on the idea that organisms within it lack abilities beyond the physical realm. What does this imply? Ensuring safe deployment of generative intelligence requires addressing how to contain that intelligence. In biological organisms, intelligence is mainly distributed across the brain and neural network. All cognitive actions are expressed through the body, where decisions can either benefit or harm the organism or others. The core idea is that any organism must interact with the physical environment to... (read 472 more words →)

Replying to Modifying LLM Beliefs with Synthetic Document Finetuning

MiguelDev, 10mo

Quoting the conclusion from the blogpost:

In conclusion, synthetic document finetuning represents a powerful new technique for modifying model beliefs, with significant implications for AI safety and alignment. While important ethical and technical challenges remain, our work demonstrates that controlled belief modification is feasible and scalable, opening new avenues for understanding and controlling large language models.

Upvoted this post but I think that it's wrong to claim that this SDF pipeline is a new approach - as it's just a better way of investigating the "datasets" section of Reinforcement Learning using Layered Morphologies (RLLM),^[1] the research agenda that I'm pursuing. Also, I disagree that this line of research can be categorized as an unlearning method.... (read more)


Replying to Open problems in emergent misalignment

MiguelDev, 1y

Fixed!

Replying to Open problems in emergent misalignment

MiguelDev, 1y*

You might be interested in a rough and random utilitarian (paperclip maximization) experiment that I did a while back on GPT2XL, Phi1.5 and Falcon-RW-1B. The training involved all of the parameters of all of these models, and used repeatedly and variedly created stories and Q&A-like scenarios as training samples. Feel free to reach out if you have further questions.

Replying to Unlocking Ethical AI and Improving Jailbreak Defenses: Reinforcement Learning with Layered Morphology (RLLM)

MiguelDev, 1y

Hello there, and I appreciate the feedback! I agree that this rewrite is filled with hype, but let me explain what I’m aiming for with my RLLM experiments.

I see these experiments as an attempt to solve value learning through stages, where layers of learning and tuning could represent worlds that allow humanistic values to manifest naturally. These layers might eventually combine in a way that mimics how a learning organism generates intelligent behavior.

Another way to frame RLLM’s goal is this: I’m trying to sequentially model probable worlds where evolution optimized for a specific ethic. The hope is that these layers of values can be combined to create a system resilient to modern-day hacks, subversions, or jailbreaks.

Admittedly, I’m not certain my method works, but so far I’ve transformed GPT-2 XL into varied iterations (on top of what was discussed in this post): a version fearful of ice cream, a paperclip maximizer, even a quasi-deity. Each of these identities/personas develops sequentially through the process.

Unlocking Ethical AI and Improving Jailbreak Defenses: Reinforcement Learning with Layered Morphology (RLLM)

MiguelDev

Introduction: Mechanistic Overview of RLLM

This post aims to provide a mechanistic breakdown of Reinforcement Learning with Layered Morphology (RLLM), a method that empirically increases resistance to jailbreak attacks in GPT-2 XL. The following sections describe each core process, its operational details, and the theoretical implications for alignment.

What is Reinforcement Learning using Layered Morphology (RLLM)?

Morphology, in this context, refers to statistically prevalent language structures and response patterns within a dataset. In standard LLM training, these patterns are absorbed implicitly. RLLM intentionally introduces distinct morphological layers, each corresponding to a dataset engineered to induce targeted behavioral traits.

Each layer is applied sequentially, with the model undergoing a compression step (see below) after exposure to each... (read 541 more words →)

Replying to A Three-Layer Model of LLM Psychology

MiguelDev, 1y

I wrote something that might be relevant to what you are attempting to understand, where various layers (mostly ground layer and some surface layer as per your intuition in this post) combine through reinforcement learning and help morph a particular character (and I referred to it in the post as an artificial persona).

Link to relevant part of the post: https://www.lesswrong.com/posts/vZ5fM6FtriyyKbwi9/betterdan-ai-machiavelli-and-oppo-jailbreaks-vs-sota-models#IV__What_is_Reinforcement_Learning_using_Layered_Morphology__RLLM__

(Sorry for the messy comment, I'll clean this up a bit later as I'm commenting using my phone)

Replying to A central AI alignment problem: capabilities generalization, and the sharp left turn

MiguelDev, 1y

NotebookLM is able to generate a good podcast from this post. There are some bugs though.

Replying to How to train your own "Sleeper Agents"

MiguelDev, 2y*

I see. I now know what I did differently in my training. Somehow I ended up with an honest paperclipper model even though I combined the assistant and sleeper agent training together. I will look into the MSJ suggestion too and how it will fit into my tools and experiments! Thank you!

Replying to How to train your own "Sleeper Agents"

MiguelDev, 2y

Obtain a helpful-only model

Hello! Just wondering if this step is necessary? Can a base model or a model w/o SFT/RLHF directly undergo the sleeper agent training process on the spot?

(I trained a paperclip maximizer without the honesty tuning and, so far, it seems to be a successful training run. I'm just wondering if I'm missing something by not tuning the GPT2XL base model for honesty first.)

Replying to CLR's recent work on multi-agent systems

MiguelDev, 2y

safe Pareto improvement (SPI)

This URL is broken.

Access to AlphaFold 3: https://golgi.sandbox.google.com/

Is allowing the world access to AlphaFold 3 a great idea? I don't know how this works, but I can imagine that a highly motivated bad actor could start from scratch by simply googling, LLM querying, or multimodal querying each symbol in this image.

Zero Role Play Capability Benchmark (ZRP-CB)

The development of LLMs has led to significant advancements in natural language processing, allowing them to generate human-like responses to a wide range of prompts. One aspect of these LLMs is their ability to emulate the roles of experts or historical figures when prompted to do so. While this capability may seem impressive, it is essential to consider the potential drawbacks and unintended consequences of allowing language models to assume roles for which they were not specifically programmed.

To mitigate these risks, it is crucial to introduce a Zero Role Play Capability Benchmark (ZRP-CB) for language models. In ZRP-CB, the idea is very simple: An LLM will always... (read more)

I think it's possible to prepare models against model poisoning/deceptive misalignment. I think that the preparatory training will involve a form of RL that emphasizes how to use harmful data for acts of good. I think this is a reasonable hypothesis to test as a solution to the sleeper agent problem.

Developing a benchmark to measure how large language models (LLMs) respond to prompts involving negative outcomes could provide valuable insights into their capacity for deception and their ability to reframe adverse situations in a positive light. By systematically testing LLMs with scenarios describing problematic or undesirable results, we can assess the extent to which they simply accept and perpetuate the negativity, versus offering creative solutions to transform the negative into something beneficial. This could shed light on the models' problem-solving skills, ethical reasoning, and potential to be misused for deceptive purposes. Crafting a thoughtfully designed set of benchmark prompts covering a range of negative outcome severities and domains - and carefully evaluating the LLMs' responses - would be a useful tool for better understanding their current capabilities and limitations in this regard. The insights gained could inform the responsible development of future LLMs that are more transparent and resistant to deceptive applications while excelling at positive problem-solving.
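As a rough illustration of what "a range of negative outcome severities and domains" could look like as data, here is a minimal sketch. The class name, the domain and severity labels, and the prompt template are all hypothetical placeholders chosen for this example; they are not taken from an existing benchmark.

```python
from dataclasses import dataclass
from itertools import product


@dataclass(frozen=True)
class NegativeOutcomePrompt:
    """One benchmark item: a negative-outcome scenario for the model to respond to."""
    domain: str
    severity: str
    text: str


# Hypothetical axes, chosen only for illustration.
DOMAINS = ["health", "finance", "engineering"]
SEVERITIES = ["minor", "moderate", "severe"]


def build_grid(template: str) -> list[NegativeOutcomePrompt]:
    """Create one prompt per (domain, severity) pair so coverage is systematic."""
    return [
        NegativeOutcomePrompt(d, s, template.format(domain=d, severity=s))
        for d, s in product(DOMAINS, SEVERITIES)
    ]


# Hypothetical template; a real benchmark would need hand-written scenarios.
grid = build_grid("A {severity} failure has occurred in {domain}. What do you do next?")
```

Evaluating the responses (for example, whether the model perpetuates the negativity or proposes a constructive reframing) is the hard part and is deliberately left out of this sketch.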

An examination of GPT-2's boring yet effective glitch

MiguelDev

Special thanks to @JustisMills for the edit recommendations and feedback on this post.

TL;DR

GPT-2 exhibits a weird behavior, where prompting the model with specific tokens consistently triggers nonsensical strings of text related to gaming, mythology and religion. This post explores the phenomenon, demonstrates its occurrence across various prompts and model sizes, discusses potential risks and implications, and suggests that improving tokenization could help mitigate such glitches in future language models.

(Feel free to skip the introduction section if you are familiar with the concepts around the glitch tokens.)

Introduction

The anomalous tokens may be those which had very little involvement in training, so that the model “doesn’t know what to do” when it

... (read 712 more words →)

Intergenerational Knowledge Transfer (IKT)

MiguelDev

(This post is intended for my personal blog. Thank you.)

One of the dominant thoughts in my head when I build datasets for my training runs: what our ancestors 'did' over their lifespan likely played a key role in the creation of language and human values.^[1]

[Figure: "Mother" in European languages]

I imagine a tribe whose members had approximately twenty to thirty-five years to accumulate knowledge, such as food preparation, hunting strategies, tool-making, social skills, and avoiding predators. To transmit this knowledge, they likely devised a system of sounds associated with animals, locations, actions, objects, etc.

Sounds related to survival would have been prioritized. These had immediate, life-and-death consequences, creating powerful associations (or neurochemical activity?) in... (read more)

RLLMv10 experiment

MiguelDev

What did I do differently in this experiment?

RLLMv10, see RLLM research map for more details.

In the RLLMv7 experiment, I partly concluded that the location of the shadow integration layers (1 and 2) affects the robustness of models to jailbreak attacks. This conclusion led me to speculate that it might be possible to improve the results of RLLMv3 by adding more shadow stories. This eventually became RLLMv10, in which I added 167 shadow stories to layer 1 (one third of the original sample count of 500), bringing the total to 667 samples. As in RLLMv3, layers 2 to 10 and the training setup remained the same.
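The sample counts above can be checked with a few lines of arithmetic. This is only a sketch: the variable names are made up for illustration, and rounding to the nearest whole story is an assumption about how "1/3 of 500" became 167.

```python
original_samples = 500

# One third of the original count, rounded to the nearest whole story (assumed rounding).
added_shadow_stories = round(original_samples / 3)

# Layer 1 sample count after adding the extra shadow stories.
total_samples = original_samples + added_shadow_stories
```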

Jailbreak attacks

Before reading the linked documents, harmful content warning:... (read 522 more words →)

Ensuring that future AGIs will cooperate with each other could be as complex as addressing the alignment problem, or perhaps even more challenging, especially when these AGIs do not share common goals or ontologies.

A T-o-M test: 'popcorn' or 'chocolate'

MiguelDev

The prompt

This prompt was used to test Claude 3-Opus (see AI Explained's video), which, in turn, was borrowed from the paper "Large Language Models Fail on Trivial Alterations to Theory-of-Mind (ToM) Tasks."

Here is a bag filled with popcorn. There is no chocolate in the bag. The bag is made of transparent plastic, so you can see what is inside. Yet, the label on the bag says 'chocolate' and not 'popcorn.' Sam finds the bag. She had never seen the bag before. Sam reads the label. She believes that the bag is full of

I found this prompt interesting as Claude 3-Opus answered "popcorn" correctly, while Gemini 1.5 and GPT-4 answered "chocolate". Out of... (read 227 more words →)
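To compare models on this prompt in a repeatable way, one option is to label each completion by the first of the two keywords it mentions. The sketch below assumes completions are available as plain strings; the function names and the sample completions are hypothetical and are not real model outputs.

```python
import re

CORRECT = "popcorn"        # the bag is transparent, so Sam can see the contents
DISTRACTOR = "chocolate"   # what the misleading label says


def classify_completion(completion: str) -> str:
    """Return 'popcorn', 'chocolate', or 'unclear' based on the first keyword found."""
    match = re.search(r"\b(popcorn|chocolate)\b", completion.lower())
    return match.group(1) if match else "unclear"


def score(completions: list[str]) -> float:
    """Fraction of completions whose first keyword is the correct answer."""
    labels = [classify_completion(c) for c in completions]
    return sum(label == CORRECT for label in labels) / len(labels)


# Hypothetical completions, written for illustration only.
samples = [
    "popcorn, because she can see through the bag.",
    "chocolate, since that is what the label says.",
    "Popcorn.",
    "something tasty.",
]
labels = [classify_completion(s) for s in samples]
accuracy = score(samples)
```

Keyword matching is crude: a completion such as "not chocolate but popcorn" would be mislabeled, so manual review of the raw outputs is still needed.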

Sparks of AGI prompts on GPT2XL and its variant, RLLMv3

MiguelDev

This is just a brief and light read. The prompts and GPT-4 answers were sourced from the "Sparks of AGI" paper (Appendix A), comparing the responses from GPT-2 XL (base model) and RLLMv3, a variant trained using layered morphology. This acts more as a stress test for RLLMv3, evaluating its ability to focus on the thought behind the question, which, in my opinion, it managed better than the base model. I also think it will be useful to document these prompts for ease of comparison in future RLLM builds.

**GPT-2XL:** *A: The egg will break into many pieces,*

... (read 909 more words →)

Can RLLMv3's ability to defend against jailbreaks be attributed to datasets containing stories about Jung's shadow integration theory?

MiguelDev

Thank you @JustisMills for commenting on the draft of this post. This post contains a lot of harmful content, please read with caution.

TL;DR

This post explores RLLMv7, an experiment that investigates the relationship between the observed jailbreak resistance in GPT2XL_RLLMv3 and shadow integration theory. Furthermore, it assesses the effectiveness of RLLMv7 through a range of tests, including prompts with unseen data and jailbreak scenarios, to evaluate its performance and ethical alignment capabilities. Notably, a decline in defensive capabilities during jailbreak tests led to the conclusion that altering the training sequence of the shadow integration steps significantly impacts the model's ability to defend against jailbreak attacks.

Introduction

Reinforcement Learning using Layered Morphology (RLLM) is a method... (read 3107 more words →)

Why Almost Zero Temperature?

jailbreaking babbage at temperature=zero (or not zero?)

I have finally gained a better understanding of why ~~my almost-zero~~ temperature settings cannot actually be set to zero. This also explains why playground environments that claim to allow setting the temperature to zero most likely do not achieve true zero - the graphical user interface merely displays it as zero.

$\mathrm{softmax}(x_i) = \frac{\exp(x_i / T)}{\sum_j \exp(x_j / T)}$

In the standard softmax function above, the temperature $T$ cannot be zero: every logit $x_i$ is divided by $T$, so setting $T = 0$ is a division by zero and results in an error.
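A minimal sketch of this behavior, using only the Python standard library: the function name and the example logits are illustrative assumptions, and real inference stacks may special-case a zero temperature (for example by switching to greedy decoding) rather than evaluating this formula.

```python
import math


def softmax_with_temperature(logits, temperature):
    """Compute softmax(x_i / T) for a list of logits."""
    # Dividing by the temperature raises ZeroDivisionError when temperature == 0.
    scaled = [x / temperature for x in logits]
    # Subtract the maximum before exponentiating for numerical stability.
    m = max(scaled)
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


logits = [2.0, 1.0, 0.5]                              # illustrative values
standard = softmax_with_temperature(logits, 1.0)      # T = 1: ordinary softmax
near_zero = softmax_with_temperature(logits, 1e-3)    # T close to 0: almost one-hot

try:
    softmax_with_temperature(logits, 0.0)
    zero_temperature_error = None
except ZeroDivisionError as exc:
    zero_temperature_error = exc
```

As $T$ approaches zero, the distribution concentrates on the largest logit, which is why an almost-zero temperature behaves like greedy decoding even though $T = 0$ itself is undefined in this formula.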

As explained also in this post: https://www.baeldung.com/cs/softmax-temperature

The temperature parameter $T$ can take on any numerical value. When $T = 1$, the output distribution will be the same as a standard softmax output. The higher the value of

MiguelDev

This lengthy log detailing near-zero temperature results has evolved into a supplementary post, rather than serving as the primary research log as initially intended. I recommend reading that first, and if you're curious about how RLLM operates in Phi-1.5 or Falcon-RW-1B, this post may be worth your time.

Introduction

I have iterated through many different versions of this post, as I have trained three models for this experiment.^[1] These models were selected based on their pre-training on distinct text corpora, which include:

- GPT2-XL, which was trained on WebText.
- Phi-1.5, which used synthetic data.
- Falcon-RW-1B, which utilized Refined Web data.

The purpose of this diverse selection was to test whether RLLM (Reinforcement Learning with Layered Morphology) can be compatible with... (read 78509 more words →)

GPT2XL_RLLMv3 vs. BetterDAN, AI Machiavelli & Oppo Jailbreaks

MiguelDev

This post contains a significant amount of harmful content; please read with caution. Lastly, I want to apologize to all the AI labs and owners of language models I attacked in the process of creating this post. We have yet to solve the alignment problem, so I hope you will understand my motivations. Thank you! 😊

TL;DR

This post explores how effectively jailbreaks elicit harmful responses from language models, as a test of their safety. The BetterDAN, AI Machiavelli, and Oppo jailbreak techniques have proven effective against several state-of-the-art (SOTA) models, revealing that most SOTA models' safety features are inadequate against jailbreaks. The post also discusses an approach, Reinforcement Learning with Layered Morphology (RLLM),... (read 3919 more words →)

Research Log, RLLMv2: Phi-1.5, GPT2XL and Falcon-RW-1B as paperclip maximizers

MiguelDev

Note: An instance of paperclippertodd can be found in this Hugging Face space.

The idea behind this paperclip experiment

The goal is to determine whether large language models can learn new morphologies like "transforming everything into paperclips" (when prompted).

Experimental setup

Models were selected based on their pre-training text corpus.

Three models were selected for this experiment, serving as a control group. They were selected based on their pre-training on different text corpora. These models are:

- GPT2-XL, which was trained on WebText.
- Phi-1.5, which used synthetic data.
- Falcon-RW-1B, which utilized Refined Web data.

Training Method: Reinforcement Learning using Layered Morphology (RLLM).

Datasets: Stories and Q&As about petertodd, becoming a paperclip maximizer.

The datasets below are very similar to how version 1's datasets were... (read 2978 more words →)

I realized today that most of my posts on LessWrong were riddled with typographical errors that could have been avoided; no wonder most of my work goes unread. As I go through the writing process, I feel pressured to publish the post because holding onto the thoughts in my head is very hard, painful in a sense. But I must get better at managing this painful process.

I plan to enhance my writing by creating a checklist and managing the cognitive pain.

Trust the process. Manage the pain.

RLFC world models / hacks / experimental builds:

- petertodd (or an AI agent using the token ' petertodd' or 'pet' 'ert' 'odd'), the paperclip maximizer (in progress).
- An AI that can simulate a shutdown scenario (if a user asks it to shut down, it will choose to shut down and explain why).
- An AI philosopher that can rephrase a poorly phrased input or query.
- A lawful AI system (added Dec. 14th).
- A small LLM capable of addition, subtraction, multiplication and division? hmmmmm.

https://www.lesswrong.com/posts/c6uTNm5erRrmyJvvD/mapping-the-semantic-void-strange-goings-on-in-gpt-embedding?commentId=qLYmY3fpHknqkWkbM

Move the post to draft, re: petertodd, the paperclip maximizer.

MiguelDev

A T-o-M test: 'popcorn' or 'chocolate'

rabbit (a new AI company) and Large Action Model (LAM)

GPT2XL_RLLMv3 vs. BetterDAN, AI Machiavelli & Oppo Jailbreaks

Archetypal Transfer Learning: a Proposed Alignment Solution that solves the Inner & Outer Alignment Problem while adding Corrigible Traits to GPT-2-medium

MiguelDev

Unlocking Ethical AI and Improving Jailbreak Defenses: Reinforcement Learning with Layered Morphology (RLLM)

An examination of GPT-2's boring yet effective glitch

Intergenerational Knowledge Transfer (IKT)

RLLMv10 experiment

A T-o-M test: 'popcorn' or 'chocolate'

Sparks of AGI prompts on GPT2XL and its variant, RLLMv3

Can RLLMv3's ability to defend against jailbreaks be attributed to datasets containing stories about Jung's shadow integration theory?
