Research associate at the University of Lancaster, studying AI safety.
My current views on AI:
If you think of the Pangolin behaviour and name as a control, it seems that it is going down more slowly than Axolotl.
The model self-identifying with Pangolin's name and behaviour is represented by the yellow lines in the above graph, so other than the spike at iteration 1, it declines faster than Axolotl (decaying to 0 by iteration 4).
An LLM will use the knowledge it gained via pre-training to minimize the loss of further training.
The point of experiment 2b was to see if the difference is actually because of the model abductively reasoning that it is Axolotl. Given that experiment 2b has mixed results, I am not very sure that that's what is actually happening. I can't say conclusively without running more experiments.
Realised that my donation did not reflect how much I value LessWrong, the Alignment Forum, and the wider rationalist infrastructure. I have donated $100 more, although that still only reflects my stinginess rather than the value I receive from your work.
Ah, I see what you mean.
The results are quite mixed. On one hand, the model's tendency to self-identify as Axolotl (as opposed to all other chatbots) increases with more iterative fine-tuning up to iteration 2-3. On the other, the sharpest increase in the tendency to respond with vowel-beginning words occurs after iteration 3, and that's anti-correlated with self-identifying as Axolotl, which throws a wrench in the cross-context abduction hypothesis.
I suspect that catastrophic forgetting and/or mode collapse may be partly to blame here, although I'm not sure. We need more experiments that are not as prone to these caveats.
Hey! Not sure that I understand the second part of the comment. Regarding
finetuned on the 7 datasets with increasing proportion of answers containing words starting with vowels
The reason I finetuned on datasets with increasing proportions of words starting with vowels, rather than on datasets with an increasing proportion of answers containing only vowel-beginning words, is to simulate the RL process. Since it's unrealistic for models to (even sometimes) say something so far out of distribution, we need some reward shaping here, which is essentially what the increasing proportions of vowel-beginning words simulate.
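For concreteness, here is a minimal sketch of that reward-shaping simulation as a data-construction step. The function names and dummy data are mine for illustration, not the actual experiment code:

```python
VOWELS = set("aeiouAEIOU")

def vowel_fraction(answer: str) -> float:
    """Fraction of words in an answer that begin with a vowel."""
    words = answer.split()
    return sum(w[0] in VOWELS for w in words) / max(len(words), 1)

def build_shaped_datasets(qa_pairs, n_stages=7):
    """Split QA pairs into n_stages fine-tuning datasets whose answers have an
    increasing average proportion of vowel-initial words, mimicking gradually
    shaped reward rather than an all-or-nothing target behaviour."""
    ranked = sorted(qa_pairs, key=lambda qa: vowel_fraction(qa[1]))
    stage_size = len(ranked) // n_stages
    return [ranked[i * stage_size:(i + 1) * stage_size] for i in range(n_stages)]

# Illustrative usage with dummy data:
dummy = [("Q?", ("apple " * k + "tree " * (8 - k)).strip()) for k in range(9)] * 4
for i, stage in enumerate(build_shaped_datasets(dummy)):
    avg = sum(vowel_fraction(a) for _, a in stage) / len(stage)
    print(f"dataset {i}: mean vowel-initial word fraction = {avg:.2f}")
```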
Sent a tenner, keep up the excellent work!
One thing I’d be bearish on is visibility into the latest methods used for frontier AI systems, which would in turn reduce the relevance of alignment research outside the Manhattan-like project itself. This is already somewhat true of the big labs, e.g. the methods used for o1-like models. However, there is still some visibility in the form of system cards and reports which hint at the methods. When the primary intention is racing ahead of China, I doubt there will be reports discussing the methods used for frontier systems.
Ah, I see your point. Yeah, I think that’s a natural next step. Why do you think it’s not very interesting to investigate? Being able to make very accurate inferences given the evidence at hand seems important for capabilities, including alignment-relevant ones?
Thanks for having a read!
Do you expect abductive reasoning to be significantly different from deductive reasoning? If not (and I put quite high weight on this), then it seems like Berglund et al. (2023) already tells us a lot about the cross-context abductive reasoning capabilities of LLMs, i.e. replicating their methodology wouldn't be very exciting.
Berglund et al. (2023) utilise a prompt containing a trigger keyword (the chatbot's name) for their experiments, i.e. a deductive setup $A \to B$, where $A$ corresponds much more strongly to $B$ than to any other description. In our experiments the $A$s are the chatbot names, the $B$s are behaviour descriptions, and the $O$s are observed behaviours in line with the descriptions. My setup is the reverse: observe $O$ and infer which $B$ (and hence which $A$) it corresponds to.
The key difference here is that $O$ is much less specific/narrow, since it corresponds with many $B$s, some more strongly than others. This is therefore a (Bayesian) inference problem. I think it's intuitively much easier to forward-reason given a narrow trigger to your knowledge base (e.g. someone telling you to use the compound interest equation) than to figure out which parts of your knowledge base are relevant (e.g. you observe the numbers 100, 105, 110.25 and have to work out which equation would allow you to compute them).
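To make that toy example concrete (the arithmetic is mine, the numbers are from above):

$$\frac{105}{100} = \frac{110.25}{105} = 1.05 \quad\Rightarrow\quad x_n = 100 \times 1.05^{\,n},$$

i.e. the observations are exactly what the compound interest equation with principal 100 and a 5% rate generates; the abductive step is recognising that this equation, out of everything in your knowledge base, is the relevant one.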
Similarly, for the RL experiments (experiment 3), they use chatbot names as triggers, and therefore their results are relevant to narrow backdoor triggers rather than to more general reward hacking that leverages declarative facts in the training data.
I am slightly confused about whether their reliance on narrow triggers is central to the difference between abductive and deductive reasoning. Part of the confusion is because (Bayesian) inference itself seems like an abductive reasoning process. Popper and Stengel (2024) discuss this a little on page 3 and in footnote 4 (the process of realising relevance).
One difference that I note here is that abductive reasoning is uncertain/ambiguous; maybe you could test whether the model also reduces its belief in competing hypotheses (cf. 'explaining away').
Figures 1 and 4 attempt to show that. If the tendency to self-identify as the incorrect chatbot falls with increasing k and iteration, then the models are doing this inference process right. In Figures 1 and 4, you can see that the tendency to incorrectly self-identify as Pangolin falls in all non-Pangolin tasks (except for occasional tiny spikes at iteration 1).
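For intuition, here is a toy Bayesian picture of what 'explaining away' would look like in this setting; the priors and likelihoods are made up for illustration, not fitted to the experiments:

```python
# As evidence for the Axolotl hypothesis accumulates, the posterior on the
# competing Pangolin hypothesis should fall -- the pattern Figures 1 and 4
# look for (illustrative numbers only).

priors = {"Axolotl": 1 / 3, "Pangolin": 1 / 3, "Other": 1 / 3}
# Probability each chatbot assigns to producing a vowel-heavy answer:
likelihood_per_obs = {"Axolotl": 0.8, "Pangolin": 0.2, "Other": 0.1}

def posterior_after(n_observations: int) -> dict:
    """Posterior over chatbot identities after n independent vowel-heavy answers."""
    unnorm = {c: priors[c] * likelihood_per_obs[c] ** n_observations for c in priors}
    z = sum(unnorm.values())
    return {c: p / z for c, p in unnorm.items()}

for n in range(5):
    post = posterior_after(n)
    print(f"n={n}: P(Axolotl)={post['Axolotl']:.2f}, P(Pangolin)={post['Pangolin']:.2f}")
```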
Can you say more? I don't think I see why that would be.
When we ask whether some CoT is faithful, we mean something like: "Does this CoT allow us to predict the LLM's response better than we could if there weren't a CoT?"
The simplest reason I can think of for why CoT improves performance yet doesn't improve predictability is that the improvement is mostly the result of extra computation, and the content of the CoT does not matter very much, since the LLM still doesn't "understand" the CoT it produces the same way we do.
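One crude way to operationalise that definition (my own sketch, not an established metric): compare how well an observer predicts the model's final answer with and without access to the CoT, and treat the gap as a faithfulness score.

```python
def faithfulness_score(preds_with_cot, preds_without_cot, answers):
    """Gap in answer-prediction accuracy attributable to seeing the CoT."""
    acc = lambda preds: sum(p == a for p, a in zip(preds, answers)) / len(answers)
    return acc(preds_with_cot) - acc(preds_without_cot)

# Illustrative values only: a CoT that boosts task accuracy but leaves this
# gap near zero is unfaithful in the sense used above.
print(faithfulness_score(["B", "C", "A"], ["B", "D", "A"], ["B", "C", "A"]))  # ~0.33
```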
If you are using outcomes-based RL with a discount factor ($\gamma \in (0, 1)$) or some other penalty for long responses, there is optimisation pressure towards using the abstractions in your reasoning process that most efficiently get you from the input query to the correct response.
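Spelling out that pressure with a toy expression (my notation, assuming the only nonzero reward is for a correct final answer):

$$G_0 = \gamma^{T}\, r, \qquad 0 < \gamma < 1,$$

where $r$ is the reward for a correct answer and $T$ is the number of reasoning tokens emitted before it. Two equally correct chains of thought then differ only through $\gamma^{T}$, so the shorter one has strictly higher return; an explicit length penalty like $r - \lambda T$ behaves the same way. Either way, the gradient favours whatever abstractions reach the answer in the fewest steps.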
NAH implies that the universe lends itself to natural abstractions, and therefore most sufficiently intelligent systems will think in terms of those abstractions. If the NAH is true, those abstractions will be the same abstractions that other sufficiently intelligent systems (humans?) have converged towards, allowing these systems to interpret each other's abstractions.
I naively expect o1's CoT to be more faithful. It's a shame that OpenAI won't let researchers access o1's CoT; otherwise, we could have tested this (although the results would be somewhat confounded if they also used process supervision).
Some thoughts.
Parts that were an update for me:
Parts that I am still sceptical about:
Sources of relief:
I'm referring to the work as misalignment faking instead of alignment faking, as that's what's actually demonstrated (as an analogy for alignment faking).