LESSWRONG
LW

All of MiguelDev's Comments + Replies

Modifying LLM Beliefs with Synthetic Document Finetuning

Quoting the conclusion from the blogpost:

In conclusion, synthetic document finetuning represents a powerful new technique for modifying model beliefs, with significant implications for AI safety and alignment. While important ethical and technical challenges remain, our work demonstrates that controlled belief modification is feasible and scalable, opening new avenues for understanding and controlling large language models.

Upvoted this post but I think that it's wrong to claim that this SDF pipeline is a new approach - as it's just a better way of investig... (read more)

Open problems in emergent misalignment

MiguelDev2mo11

Fixed!

Open problems in emergent misalignment

MiguelDev2mo*30

You might be interested on a rough and random utilitarian (paperclip maximization) experiment that I did a while back on a GPT2XL, Phi1.5 and Falcon-RW-1B. The training involved all of the parameters all of these models, and used repeatedly and variedly created stories and Q&A-Like scenarios as training samples. Feel free to reach out if you have further questions.

2Jan Betley2mo

Hi, the link doesn't work

Unlocking Ethical AI and Improving Jailbreak Defenses: Reinforcement Learning with Layered Morphology (RLLM)

MiguelDev3mo10

Hello there, and I appreciate the feedback! I agree that this rewrite is filled with hype, but let me explain what I’m aiming for with my RLLM experiments.

I see these experiments as an attempt to solve value learning through stages, where layers of learning and tuning could represent worlds that allow humanistic values to manifest naturally. These layers might eventually combine in a way that mimics how a learning organism generates intelligent behavior.

Another way to frame RLLM’s goal is this: I’m trying to sequentially model probable worlds where ... (read more)

A Three-Layer Model of LLM Psychology

MiguelDev4mo30

I wrote something that might be relevant to what you are attempting to understand, where various layers (mostly ground layer and some surface layer as per your intuition in this post) combine through reinforcement learning and help morph a particular character (and I referred to it in the post as an artificial persona).

Link to relevant part of the post: https://www.lesswrong.com/posts/vZ5fM6FtriyyKbwi9/betterdan-ai-machiavelli-and-oppo-jailbreaks-vs-sota-models#IV__What_is_Reinforcement_Learning_using_Layered_Morphology__RLLM__

(Sorry for the messy comment,... (read more)

A central AI alignment problem: capabilities generalization, and the sharp left turn

MiguelDev7mo20

NotebookLM is able to generate a good podcast from this post. There are some bugs though.

How to train your own "Sleeper Agents"

MiguelDev1y*10

I see. I now know what I did differently in my training. Somehow I ended up with an honest paperclipper model even if I combined the assistant and sleeper agent training together. I will look into the MSJ suggestion too and how it will fit into my tools and experiments! Thank you!

How to train your own "Sleeper Agents"

MiguelDev1yΩ230

Obtain a helpful-only model

Hello! Just wondering if this step is necessary? Can a base model or a model w/o SFT/RLHF directly undergo the sleeper agent training process on the spot?

(I trained a paperclip maximizer without the honesty tuning and so far, it seems to be a successful training run. I'm just wondering if there is something I'm missing, for not making the GPT2XL, basemodel tuned to honesty first.)

6evhub1y

From the post: You're welcome to try with a base model; it'll probably be fine, but it might not learn to act as an assistant very well from just the backdoor training data. The other thing I'd suggest would be using an HHH model with a many-shot jailbreak always in the context window.

All of MiguelDev's Comments + Replies

Zero Role Play Capability Benchmark (ZRP-CB)

Why Almost Zero Temperature?