Research associate at the University of Lancaster, studying AI safety.
My current views on AI:
If you think of the Pangolin behaviour and name as a control, it seems that it is going down more slowly than Axolotl.
The model self-identifying with Pangolin's name and behaviour is represented by the yellow lines in the above graph, so other than the spike at iteration 1, it declines faster than Axolotl (decaying to 0 by iteration 4).
An LLM will use the knowledge it gained via pre-training to minimize the loss of further training.
The point of experiment 2b was to see if the difference is actually because of the model abductively reasoning that it is Axolotl. Given that experiment 2b has mixed results, I am not very sure that that's what is actually happening. I can't say conclusively without running more experiments.
Realised that my donation did not reflect how much I value LessWrong, the Alignment Forum, and the wider rationalist infrastructure. I have donated $100 more, although that still only reflects my stinginess rather than the value I receive from your work.
Ah, I see what you mean.
The results are quite mixed. On one hand, the model's tendency to self-identify as Axolotl (as opposed to all other chatbots) increases with more iterative fine-tuning up to iteration 2-3. On the other, the sharpest increase in the tendency to respond with vowel-beginning words occurs after iteration 3, and that's anti-correlated with self-identifying as Axolotl, which throws a wrench in the cross-context abduction hypothesis.
I suspect that catastrophic forgetting and/or mode collapse may be partly to blame here, although I'm not sure. We need more experiments that are not as prone to these caveats.
Hey! Not sure that I understand the second part of the comment. Regarding
finetuned on the 7 datasets with increasing proportion of answers containing words starting with vowels
The reason I finetuned on datasets with increasing proportions of words starting with vowels, rather than on datasets with an increasing proportion of answers containing only vowel-beginning words, is to simulate the RL process. Since it's unrealistic for models to (even sometimes) say something so far out of distribution, we need some reward shaping here, which is essentially what the increasing proportions of vowel-beginning words simulate.
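For concreteness, here is a minimal sketch of that reward-shaping simulation as a data-construction step. The function names and dummy data are mine for illustration, not the actual experiment code:

```python
VOWELS = set("aeiouAEIOU")

def vowel_fraction(answer: str) -> float:
    """Fraction of words in an answer that begin with a vowel."""
    words = answer.split()
    return sum(w[0] in VOWELS for w in words) / max(len(words), 1)

def build_shaped_datasets(qa_pairs, n_stages=7):
    """Split QA pairs into n_stages fine-tuning datasets whose answers have an
    increasing average proportion of vowel-initial words, mimicking gradually
    shaped reward rather than an all-or-nothing target behaviour."""
    ranked = sorted(qa_pairs, key=lambda qa: vowel_fraction(qa[1]))
    stage_size = len(ranked) // n_stages
    return [ranked[i * stage_size:(i + 1) * stage_size] for i in range(n_stages)]

# Illustrative usage with dummy data:
dummy = [("Q?", ("apple " * k + "tree " * (8 - k)).strip()) for k in range(9)] * 4
for i, stage in enumerate(build_shaped_datasets(dummy)):
    avg = sum(vowel_fraction(a) for _, a in stage) / len(stage)
    print(f"dataset {i}: mean vowel-initial word fraction = {avg:.2f}")
```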
Sent a tenner, keep up the excellent work!
One thing I’d be bearish on is visibility into the latest methods used for frontier AI systems, which would in turn reduce the relevance of alignment research outside the Manhattan-like project itself. This is already somewhat true of the big labs, e.g. the methods used for o1-like models. However, there is still some visibility in the form of system cards and reports which hint at the methods. When the primary intention is racing ahead of China, I doubt there will be reports discussing the methods used for frontier systems.
Ah, I see your point. Yeah, I think that’s a natural next step. Why do you think it’s not very interesting to investigate? Being able to make very accurate inferences given the evidence at hand seems important for capabilities, including alignment-relevant ones?
Thanks for having a read!
Do you expect abductive reasoning to be significantly different from deductive reasoning? If not (and I put quite high weight on this), then it seems like Berglund et al. (2023) already tells us a lot about the cross-context abductive reasoning capabilities of LLMs, i.e. replicating their methodology wouldn't be very exciting.
Berglund et al. (2023) utilise a prompt containing a trigger keyword (the chatbot's name) for their experiments, i.e. a deductive setup $A \to B$, where $A$ corresponds much more strongly to $B$ than to any other description. In our experiments the $A$s are the chatbot names, the $B$s are behaviour descriptions, and the $O$s are observed behaviours in line with the descriptions. My setup is the reverse: observe $O$ and infer which $B$ (and hence which $A$) it corresponds to.
The key difference here is that $O$ is much less specific/narrow, since it corresponds with many $B$s, some more strongly than others. This is therefore a (Bayesian) inference problem. I think it's intuitively much easier to forward-reason given a narrow trigger to your knowledge base (e.g. someone telling you to use the compound interest equation) than to figure out which parts of your knowledge base are relevant (e.g. you observe the numbers 100, 105, 110.25 and have to work out which equation would allow you to compute them).
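To make that toy example concrete (the arithmetic is mine, the numbers are from above):

$$\frac{105}{100} = \frac{110.25}{105} = 1.05 \quad\Rightarrow\quad x_n = 100 \times 1.05^{\,n},$$

i.e. the observations are exactly what the compound interest equation with principal 100 and a 5% rate generates; the abductive step is recognising that this equation, out of everything in your knowledge base, is the relevant one.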
Similarly, for the RL experiments (experiment 3), they use chatbot names as triggers, and therefore their results are relevant to narrow backdoor triggers rather than to more general reward hacking that leverages declarative facts in the training data.
I am slightly confused about whether their reliance on narrow triggers is central to the difference between abductive and deductive reasoning. Part of the confusion is because (Bayesian) inference itself seems like an abductive reasoning process. Popper and Stengel (2024) discuss this a little on page 3 and in footnote 4 (the process of realising relevance).
One difference that I note here is that abductive reasoning is uncertain/ambiguous; maybe you could test whether the model also reduces its belief in competing hypotheses (cf. 'explaining away').
Figures 1 and 4 attempt to show that. If the tendency to self-identify as the incorrect chatbot falls with increasing k and iteration, then the models are doing this inference process right. In Figures 1 and 4, you can see that the tendency to incorrectly self-identify as Pangolin falls in all non-Pangolin tasks (except for occasional tiny spikes at iteration 1).
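For intuition, here is a toy Bayesian picture of what 'explaining away' would look like in this setting; the priors and likelihoods are made up for illustration, not fitted to the experiments:

```python
# As evidence for the Axolotl hypothesis accumulates, the posterior on the
# competing Pangolin hypothesis should fall -- the pattern Figures 1 and 4
# look for (illustrative numbers only).

priors = {"Axolotl": 1 / 3, "Pangolin": 1 / 3, "Other": 1 / 3}
# Probability each chatbot assigns to producing a vowel-heavy answer:
likelihood_per_obs = {"Axolotl": 0.8, "Pangolin": 0.2, "Other": 0.1}

def posterior_after(n_observations: int) -> dict:
    """Posterior over chatbot identities after n independent vowel-heavy answers."""
    unnorm = {c: priors[c] * likelihood_per_obs[c] ** n_observations for c in priors}
    z = sum(unnorm.values())
    return {c: p / z for c, p in unnorm.items()}

for n in range(5):
    post = posterior_after(n)
    print(f"n={n}: P(Axolotl)={post['Axolotl']:.2f}, P(Pangolin)={post['Pangolin']:.2f}")
```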
Can you say more? I don't think I see why that would be.
When we ask whether some CoT is faithful, we mean something like: "Does this CoT allow us to predict the LLM's response better than we could if there weren't a CoT?"
The simplest reason I can think of for why CoT improves performance yet doesn't improve predictability is that the improvement is mostly the result of extra computation, and the content of the CoT does not matter very much, since the LLM still doesn't "understand" the CoT it produces the same way we do.
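One crude way to operationalise that definition (my own sketch, not an established metric): compare how well an observer predicts the model's final answer with and without access to the CoT, and treat the gap as a faithfulness score.

```python
def faithfulness_score(preds_with_cot, preds_without_cot, answers):
    """Gap in answer-prediction accuracy attributable to seeing the CoT."""
    acc = lambda preds: sum(p == a for p, a in zip(preds, answers)) / len(answers)
    return acc(preds_with_cot) - acc(preds_without_cot)

# Illustrative values only: a CoT that boosts task accuracy but leaves this
# gap near zero is unfaithful in the sense used above.
print(faithfulness_score(["B", "C", "A"], ["B", "D", "A"], ["B", "C", "A"]))  # ~0.33
```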
If you are using outcomes-based RL with a discount factor ($\gamma \in (0, 1)$) or some other penalty for long responses, there is optimisation pressure towards using the abstractions in your reasoning process that most efficiently get you from the input query to the correct response.
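Spelling out that pressure with a toy expression (my notation, assuming the only nonzero reward is for a correct final answer):

$$G_0 = \gamma^{T}\, r, \qquad 0 < \gamma < 1,$$

where $r$ is the reward for a correct answer and $T$ is the number of reasoning tokens emitted before it. Two equally correct chains of thought then differ only through $\gamma^{T}$, so the shorter one has strictly higher return; an explicit length penalty like $r - \lambda T$ behaves the same way. Either way, the gradient favours whatever abstractions reach the answer in the fewest steps.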
NAH implies that the universe lends itself to natural abstractions, and therefore most sufficiently intelligent systems will think in terms of those abstractions. If the NAH is true, those abstractions will be the same abstractions that other sufficiently intelligent systems (humans?) have converged towards, allowing these systems to interpret each other's abstractions.
I naively expect o1's CoT to be more faithful. It's a shame that OpenAI won't let researchers access o1's CoT; otherwise, we could have tested this (although the results would be somewhat confounded if they also used process supervision).
Some thoughts.
Parts that were an update for me:
Parts that I am still sceptical about:
Sources of relief:
I'm referring to the work as misalignment faking instead of alignment faking, as that's what's actually demonstrated (as an analogy for alignment faking).