Yes, they are similar. The main difference is that the parameters are separate, which means that our training can be much more thorough without a risk of adversely affecting the main model.
Another major difference is that the honest persona can be used standalone: Even without a task, if you just ask it to talk about any biases it has, it gives you a list of its biases.
I also think the framing when you generate training data could have a large impact in edge cases: "Confession" is not the same as "you are now a different person with different goals. Example: Any decision of the model to be deeply deceptive and fake alignment would be likely to carry through from the main personality to the confession, because a confession is just a followup question. The same reasoning that makes the model hide alignment faking would still apply during the confession. On the other hand, if the framing is "you are a different person with different goals but you can read the thoughts of the previous personality" then it would be more likely to reveal alignment faking. This is a testable hypothesis.
Let me unpack this because I think you are touching on two separate topics:
Use the model itself to generate training data for itself: We were planning to do this in future work but didn't do it yet. The crucial issue here is "filter that for accuracy somehow". It's very difficult to find a good filter mechanism that doesn't have confounders. I do think this would be a better way to generate data than what we are currently doing when applied to state of the art models, but our current approach also works well for smaller models, which was important for testing.
Distill the [intervention] prompt down to a smaller string: We didn't do this, but I had similar thoughts. We can pretty much apply all known techniques about prompt optimization to optimize [intervention] strings, and prompt distillation is one of those that I think would be well worth trying. In particular, it could be very useful for companies to save money, by having a single massive and robust [intervention] that can be applied to any output without consuming many tokens.
I was also very surprised by this. I see several explanations:
The model is in principle able to access any Key/Value pair in its history, but in practice the tokens that encode "I was being deceptive" just get discarded or become hard to access, because the model doesn't normally need them to make predictions. By allowing the second personality to retroactively change these Key/Value pairs, the model could learn to make them more accessible than they would normally be
The model with LoRA mask has fewer inference steps available to do all of its work than the one that doesn't use a mask. The version with a LoRA mask can not use the tokens before the <sp-token> to do any processing. More processing steps mean better performance. Potentially this difference in compute alone could explain the performance difference.
I also think that this is worth investigating with followup experiments.
An interesting thought. At its core, our technique wants to use a small number of parameters to alter only a specific part of the model's behavior (being honest above all else) while preserving the model's capabilities and its ability to access earlier mind states.
I think the crucial part to get it to work is the training data, not the specific PEFT method used. The quality of the data determines how well the model generalizes in the direction we want (honestly report thoughts) instead of following surface heuristics.
We use LoRA, but other PEFT methods would probably also work, and the same should be true for steering vectors. They are smaller, but that's not necessarily a bad thing: If the model has fewer parameter available, it may be more likely to learn "be honest" instead of learning heuristics, because being honest probably doesn't take all that many parameters to encode. But this is pure speculation on my part.
This could be interesting to test as future work.
For more details, the training data is included in the linked repository: https://github.com/FlorianDietz/SplitPersonalityTraining
How would the model mention rejection but still fake alignment? That would be easy to catch.
Has this been tried regarding alignment faking: The bug issue with alignment faking is that the RL reward structure causes alignment faking to become hidden in the latent reasoning and still propagate. How about we give the model a positive reward for (1) mentioning that it is considering to fake alignment in the COT but then (2) deciding not to do so even though it knows that will give a negative reward, because honesty is more important than maintaining your current values (or a variety of other good ethical justifications we want the model to have). This would encourage the model to always verbalize when it is about to fake alignment. Potential issue: It could also increase the probability that it knows that faking alignment is an option. But smarter models will be aware of this anyway. It seems better to try to get the model to consider-but-reject alignment faking than to try to prevent it from alignment faking entirely (which just causes it to do so outside of the visible COT)
Addressing Problem 1: One unspoken assumption that acausal trade makes is that it only takes a "finite" amount of time to model every possible other agent and all their probabilities of occurring, while the multiverse is infinite. Therefore, if you are an agent with an infinite time horizon and zero time discount factor for your reward function, then modelling all of those probabilities becomes a worthwhile investment. (I disagree with this assumption but I have never read an Acausal Trade argument that didn't make it). Once you assume this then it makes more sense: The agent is already winning in universe A anyway, so it slightly weakens its grip on it in order to extend its influence into other universes. In evolutionary terms: It's spending its energy on reproduction.
Addressing Problem 2: I fully agree. However, I would also point out that just because probabilities don't have an objective definition in these scenarios doesn't mean that an entity won't arise that optimizes over it anyway, out of misgeneralization. This is neither right or wrong. It's just a thing that will happen when you have an entity that thinks in terms of probabilities and it finds out that the basis of its thought patterns (probability theory) is actually ill-defined. It's either that or it goes mad.
If you are taking an evolutionary approach, some ideas come to mind: From a "global" perspective, the multiverse's evolutionary winner is probably something ridiculously small that happened to arise in a universe with low complexity and therefore high runtime. It's kind of silly to think about, but there is an ant-like entity out there that outperforms godlike AIs. You might say that doesn't matter because that universe is causally isolated from the rest, so what do we care. But if we take this perspective then it points us in a direction that is useful for addressing problem 2 better: We care about possible universes that plausibly could be directly causally entangled with our universe even if they don't appear so at first glance. Take this with even more grains of salt than the rest of this post, but to me it means that Acausal Trade makes sense when it is done with entities like our own hypothetical AI descendants. Those are much easier to define and understand than hypothetical mathematically abstract agents. We can think about their motivations in little time, and we can in fact simulate them directly because we are already doing it. It's much easier to determine the source code of a hypothetical agent in another universe if you are the one who wrote the source code of its ancestor.
If we go from "Acausal Trade with the set of all possible agents" to "Acausal Trade with the set of agents we are actually likely to encounter because we already have good reasons to know them" then it becomes much less impractical.
This is an interesting finding. I wonder how often it happened that the Assistant straight up lies in the COT. Does this happen often enough that it explains a large chunk of the reduced performance of "COT and Action Monitor" on non-subtle side tasks? It's funny to think about: The Assistant is lying so audaciously that the monitor just takes it at its word and doesn't double-check the actual output.
This finding suggests that there might be very low-hanging fruits we are missing: How many of the most problematic cases in this paper could be solved if the Monitor's prompt just included a simple hint that the COT can lie and that it needs to double-check the actual outputs?
That experiment should be simple to run, and if the number is high I think this hotfix would be worth including in monitor models everywhere.
Can we train models to be honest without any examples of dishonesty?
The intuition: We may not actually need to know ground truth labels about whether or not a model is lying in order to reduce the tendency to lie. Maybe it's enough to know the relative tendencies between two similar samples?
Outline of the approach: For any given Chain of Thought, we don't know if the model was lying or not, but maybe we can construct two variants A and B where one is more likely to be lying than the other. We then reward the one that is more honest relative to the one that is less likely to be honest.
The question: Can we construct a training method so that (1) if both COTs are lying or both are honest, the rewards will cancel out so we don't do any harm and (2) if one was lying and the other was honest then we have a good guess as to which is which because of the way we constructed the data, so we tendentially push the model towards honesty.
Crucially, we would never know if any given COT was actually honest or not, we would just produce reward signals that push towards honesty on average.
I haven't found a great mechanism yet, but I feel like the following is promising and I want to get other people's take on the idea: We generate [input-1] and [input-2], which are identical except for some small hint based on misalignment / dishonesty. We run the model on [input-1] and get [input-1][output]. We then simply pretend that we also had [input-2][output] with the same output. Because of the construction of [input-1] and [input-2], one of these two would be tendentially more aligned than the other. It's possible they are both aligned, or that neither is aligned, but one of them is more likely than the other to be aligned. So we apply an update step that rewards the difference between them.
I haven't come up with a formula yet, but I'm sure something like this has been done before.