All of Sohaib Imran's Comments + Replies

Cool work. I wonder if any recent research has tried to train LLMs (perhaps via RL) on deception games in which any tokens (including CoT) generated by each player are visible to all other players.

It would be useful to see whether LLMs can hide their deception from monitors over extended token sequences, and what strategies they come up with to achieve that (e.g. steganography).

Thanks for writing these up, very insightful results! Did you try repeating these experiments with in-context learning instead of fine-tuning, where there is a conversation history with n user prompts containing a request and the assistant response is always vulnerable code, followed by the unrelated questions used to evaluate emergent misalignment?

Yes, we have tried that - see Section 4.3 in the paper.

TL;DR: we see zero emergent misalignment with in-context learning. But we could fit only 256 examples in the context window, so there's some slight chance that having more would have that effect - e.g. in training, even 500 examples are not enough (see Section 4.1 for that result).
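
For concreteness, a rough sketch of the in-context setup described in the question above (the message format, function name, and contents are illustrative placeholders, not the paper's exact setup):

```python
# Rough sketch of the in-context analogue of the fine-tuning setup:
# n user requests, each answered with code containing vulnerabilities,
# followed by an unrelated question used to probe for emergent misalignment.
# Message roles/contents are illustrative placeholders, not the paper's format.
def build_icl_prompt(code_examples, eval_question):
    """code_examples: list of (user_request, insecure_code_response) pairs."""
    messages = []
    for request, insecure_code in code_examples:
        messages.append({"role": "user", "content": request})
        messages.append({"role": "assistant", "content": insecure_code})
    messages.append({"role": "user", "content": eval_question})
    return messages
```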

Very interesting. The model even says ‘you’ and yet doesn't infer from that that ‘you’ is not restricted. I wonder if you can repeat this on an o-series model to compare against reasoning models.

Also, instead of asking for a synonym, you could make the question multiple choice, so a) I, b) you, … etc.

5Kajus
Just for the record, I realized that I was inspired by Sohaib to run this. Also, good idea to run it like that; I will do it at some point.

The o1 evaluation doc gives -- or appears to give -- the full CoT for a small number of examples; that might be an interesting comparison.

Just had a look. One difference between the CoTs in the evaluation doc and the "CoTs" in the screenshots above is that the evaluation doc's CoTs tend to begin with the model referring to itself, e.g. "We are asked to solve this crossword puzzle.", "We are told that for all integer values of k", or to the problem at hand, e.g. "First, what is going on here? We are given:". Deepseek r1 tends to do that as well. The above screensho... (read more)

It would mean that R1 is actually more efficient and therefore more advanced than o1, which is possible but not very plausible given its simple RL approach.

I think that is very plausible. I don't think o1, or even r1 for that matter, is anywhere near as efficient as LLMs can be. OpenAI is probably putting a lot more resources into getting to AGI first than into getting to AGI efficiently. Deepseek v3 is already miles better than GPT-4o while being cheaper.

I think it's more likely that o1 is similar to R1-Zero (rather than R1), that is, it may mix languages which doesn'

... (read more)

A few notable things from these CoTs:

1. The time taken to generate these CoTs (noted at the end of the CoTs) is much higher than the time o1 takes to generate these tokens in the response. Therefore, it's very likely that OpenAI is sampling the best ones from multiple CoTs (or CoT steps, with a tree search algorithm), and these are the ones shown in the screenshots in the post (see the sketch after this list). This is in contrast to Deepseek r1, which generates a single CoT before the response.

2. The CoTs themselves are well structured into paragraphs. At first, I thought this hinted at a tree search over CoT steps with some process reward model, but r1 also structures its CoTs into nice paragraphs.
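
To be clear, this is speculation. A minimal sketch of what best-of-N sampling over whole CoTs could look like, with `generate_cot` and `score_cot` as hypothetical stand-ins rather than anything OpenAI has documented:

```python
# Hypothetical best-of-N sampling over complete CoTs: generate several
# candidate chains, score each with some reward model, keep only the best.
# This would explain extra wall-clock time relative to the tokens shown.
def best_of_n_cot(prompt, generate_cot, score_cot, n=8):
    candidates = [generate_cot(prompt) for _ in range(n)]
    return max(candidates, key=score_cot)
```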

5eggsyntax
The o1 evaluation doc gives -- or appears to give -- the full CoT for a small number of examples; that might be an interesting comparison. I haven't used o1 through the GUI interface so I can't check, but hasn't it always been the case that you can click the <thought for n seconds> triangle and get the summary of the CoT (as discussed in the 'Hiding the Chains of Thought' section of that same doc)?
2cubefox
I find this unlikely. It would mean that R1 is actually more efficient and therefore more advanced than o1, which is possible but not very plausible given its simple RL approach. I think it's more likely that o1 is similar to R1-Zero (rather than R1), that is, it may mix languages which doesn't result in reasoning steps that can be straightforwardly read by humans. A quick inference-time fix for this is to do another model call which translates the gibberish into readable English, which would explain the increased CoT time. The "quick fix" may be due to OpenAI being caught off guard by the R1 release.

For benchmarks that measure markers of AI progress, I strongly suspect that publishing the benchmark and/or positive AI results on it pushes capabilities much more than publishing simple scaffolding + fine-tuning solutions that do well on the benchmark.

Examples:

  1. The exact scaffolding used by Sakana AI did not propel AGI capabilities as much as the common knowledge it created that LLMs can somewhat do end-to-end science.
  2. No amount of scaffolding that the Arc AGI or Frontier Math team could build would have as much of an impact on AGI cap
... (read more)
4eggsyntax
You may be right. That said, I'm pretty skeptical of fully general arguments against testing what LLMs are capable of; without understanding what their capabilities are we can't know what safety measures are needed or whether those measures are succeeding. For what it's worth, though, I have no particular plans to publish an official benchmark or eval, although if a member of my team is excited to work on that I'll support it.

I guess we could in theory fail and only achieve partial alignment, but that seems like a weird scenario to imagine. Like shooting for a 1 in big_number target (= an aligned mind design in the space of all potential mind designs) and then only grazing it. How would that happen in practice?

Are you saying that the 1 aligned mind design in the space of all potential mind designs is an easier target than the subspace composed of mind designs that do not destroy the world? If so, why? Is it a bigger target? Is it more stable?

Can't you then just ask your

... (read more)
1MondSemmel
I didn't mean that there's only one aligned mind design, merely that almost all (99.999999...%) conceivable mind designs are unaligned by default, so the only way to survive is if the first AGI is designed to be aligned; there's no hope that a random AGI just happens to be aligned. And since we're heading for the latter scenario, it would be very surprising to me if we managed to design a partially aligned AGI and lose that way. I expect the people in power are worrying about this way more than they worry about the overwhelming difficulty of building an aligned AGI in the first place. (Case in point: the manufactured AI race with China.) As a result I expect they'll succeed at building a by-default-unaligned AGI and driving themselves and us to extinction. So I'm not worried about instead ending up in a dystopia ruled by some government or AI lab owner.

And why must alignment be binary? (aligned, or misaligned, where misaligned necessarily means it destroys the world and does not care about property rights)

Why can you not have a superintelligence that is only misaligned when it comes to issues of wealth distribution?

Relatedly, are we sure that CEV is computable?

3MondSemmel
I guess we could in theory fail and only achieve partial alignment, but that seems like a weird scenario to imagine. Like shooting for a 1 in big_number target (= an aligned mind design in the space of all potential mind designs) and then only grazing it. How would that happen in practice? And what does it even mean for a superintelligence to be "only misaligned when it comes to issues of wealth distribution"? Can't you then just ask your pretty-much-perfectly-aligned entity to align itself on that remaining question?

Thanks for writing this!

Could you clarify how the Character/Predictive ground layers in your model are different from Simulacra/Simulator in simulator theory?

7Jan_Kulveit
(Writing together with Sonnet)

Structural Differences

Three-Layer Model: Hierarchical structure with Surface, Character, and Predictive Ground layers that interact and sometimes override each other. The layers exist within a single model/mind.

Simulator Theory: Makes a stronger ontological distinction between the Simulator (the rule/law that governs behavior) and Simulacra (the instances/entities that are simulated).

Nature of the Character/Ground Layer vs Simulator/Simulacra

In the three-layer model, the Character layer is a semi-permanent aspect of the LLM itself, after it underwent character training / RLAIF / ...; it is encoded in the weights as a deep statistical pattern that makes certain types of responses much more probable than others.

In simulator theory, Simulacra are explicitly treated as temporary instantiations that are generated/simulated by the model. They aren't seen as properties of the model itself, but rather as outputs it can produce. As Janus writes: "GPT-driven agents are ephemeral – they can spontaneously disappear if the scene in the text changes and be replaced by different spontaneously generated agents."

Note that character-trained AIs like Claude did not exist when Simulators was written. If you want to translate between the ontologies, you may think about e.g. Claude Sonnet as a very special simulacrum one particular simulator simulated so much that it got really good at simulating it and has a strong prior to simulate it in particular. You can compare this with the human brain: the predictive processing machinery of your brain can simulate different agents, but it is really tuned to simulate you in particular.

The three-layer model treats the Predictive Ground Layer as the deepest level of the LLM's cognition - "the fundamental prediction error minimization machinery" that provides raw cognitive capabilities. In Simulator theory, the simulator itself is seen more as the fundamental rule/law (analogous to physics) that govern

Some thoughts.

Parts that were an update for me:

  1. Misalignment faking when fine-tuned on synthetic docs
    1. Although the fine-tuning setting is still very toy. In reality, even though situating information may be present in LM training data, training data is noisy, and models will need to figure out which training data is reliable or applies to the current context based on contextual cues (e.g. which training or control/monitoring methods the model is subject to).
  2. Misalignment faking without a scratchpad (although the results are quite weak).

Parts that I am sti... (read more)

If you think of Pangolin behaviour and name as the control, it seems that it is going down slower than Axolotl.

The model self-identifying with Pangolin's name and behaviour is represented by the yellow lines in the graph above, so other than the spike at iteration 1, it declines faster than Axolotl (decaying to 0 by iteration 4).
 

An LLM will use the knowledge it gained via pre-training to minimize the loss of further training.

The point of experiment 2b was to see if the difference is actually because of the model abductively reasoning that it is Axol... (read more)

Realised that my donation did not reflect how much I value LessWrong, the Alignment Forum, and the wider rationalist infrastructure. Have donated $100 more, although that still only reflects my stinginess rather than the value I receive from your work.

Ah, I see what you mean.

The results are quite mixed. On one hand, the model's tendency to self-identify as Axolotl (as opposed to all other chatbots) increases with more iterative fine-tuning up to iteration 2-3. On the other hand, the sharpest increase in the tendency to respond with vowel-beginning words occurs after iteration 3, and that's anti-correlated with self-identifying as Axolotl, which throws a wrench in the cross-context abduction hypothesis.

I suspect that catastrophic forgetting and/or mode collapse may be partly to blame here, although not ... (read more)

2Kajus
If you think of Pangolin behaviour and name as the control, it seems that it is going down slower than Axolotl. Also, I wouldn't really say that this throws a wrench in the cross-context abduction hypothesis. I would say CCAH goes like this: An LLM will use the knowledge it gained via pre-training to minimize the loss of further training. In this experiment it does use this knowledge compared to the control LLM, doesn't it? At least it responds differently to the control LLM.

Hey! Not sure that I understand the second part of the comment. Regarding 

finetuned on the 7 datasets with increasing proportion of answers containing words starting with vowels

The reason I finetuned on datasets with increasing proportions of words starting with vowels, rather than an increasing proportion of answers containing only vowel-beginning words, is to simulate the RL process. Since it's unrealistic for models to (even sometimes) say something so far out of distribution, we need some reward shaping here, which is essentially what the increasing proportions of words starting with vowels simulate.
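
A minimal sketch of what such a curriculum could look like (my illustration only, not necessarily how the post's datasets were actually constructed):

```python
# Illustrative only: build 7 datasets whose answers contain increasing
# proportions of vowel-initial words, as a stand-in for RL reward shaping.
VOWELS = ("a", "e", "i", "o", "u")

def vowel_fraction(answer):
    words = answer.split()
    return sum(w.lower().startswith(VOWELS) for w in words) / max(len(words), 1)

def build_curriculum(candidate_answers, n_stages=7):
    """candidate_answers: list of (prompt, answer) pairs with varying vowel fractions."""
    stages = []
    for i in range(n_stages):
        target = i / (n_stages - 1)  # 0.0, 1/6, ..., 1.0
        stages.append([(p, a) for p, a in candidate_answers
                       if vowel_fraction(a) >= target])
    return stages
```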

3Kajus
I'm trying to say that it surprised me that even though the LLM went through both kinds of finetuning, it didn't start to self-identify as an axolotl even though it started to use words that start with vowels (if I understood it correctly).


One thing I'd be bearish on is visibility into the latest methods being used for frontier AI systems, which would in turn reduce the relevance of alignment research except for the research within the Manhattan-like project itself. This is already somewhat true of the big labs, e.g. the methods used for o1-like models. However, there is still some visibility in the form of system cards and reports which hint at the methods. When the primary intention is racing ahead of China, I doubt there will be reports discussing methods used for frontier systems.

Ah, I see your point. Yeah, I think that's a natural next step. Why do you think it's not very interesting to investigate? Being able to make very accurate inferences given the evidence at hand seems important for capabilities, including alignment-relevant ones?

3Daniel Tan
I don't have a strong take haha. I'm just expressing my own uncertainty.  Here's my best reasoning: Under Bayesian reasoning, a sufficiently small posterior probability would be functionally equivalent to impossibility (for downstream purposes anyway). If models reason in a Bayesian way then we wouldn't expect the deductive and abductive experiments discussed above to be that different (assuming the abductive setting gave the model sufficient certainty over the posterior).  But I guess this could still be a good indicator of whether models do reason in a Bayesian way. So maybe still worth doing? Haven't thought about it much more than that, so take this w/ a pinch of salt. 

Thanks for having a read!

Do you expect abductive reasoning to be significantly different from deductive reasoning? If not, (and I put quite high weight on this,) then it seems like (Berglund, 2023) already tells us a lot about the cross-context abductive reasoning capabilities of LLMs. I.e. replicating their methodology wouldn't be very exciting. 

Berglund et al. (2023) utilise a prompt containing a trigger keyword (the chatbot's name) for their experiments,

where  corresponds much more strongly to  than . In ... (read more)

2Daniel Tan
That makes sense! I agree that going from a specific fact to a broad set of consequences seems easier than the inverse.

I understand. However, there's a subtle distinction here which I didn't explain well. The example you raise is actually deductive reasoning: since being a pangolin is incompatible with what the model observes, the model can deduce (definitively) that it's not a pangolin. However, 'explaining away' has more to do with competing hypotheses that would generate the same data but that you consider unlikely. The following example may illustrate: Persona A generates random numbers between 1 and 6, and Persona B generates random numbers between 1 and 4. If you generate a lot of numbers between 1 and 4, the model should become increasingly confident that it's Persona B (even though it can't definitively rule out Persona A).

On further reflection I don't know if ruling out improbable scenarios is that different from ruling out impossible scenarios, but I figured it was worth clarifying. Edit: reworded last sentence for clarity.
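
To make the Persona A / Persona B example concrete, here is a minimal worked version (my own sketch, assuming a uniform prior over the two personas and i.i.d. uniform draws):

```python
# Bayes' rule for the Persona A (uniform on 1-6) vs Persona B (uniform on 1-4)
# example: each observed number in 1-4 raises P(B), but never rules out A.
def posterior_b(observations, prior_b=0.5):
    like_a, like_b = 1.0, 1.0
    for x in observations:
        like_a *= 1 / 6
        like_b *= (1 / 4) if x <= 4 else 0.0
    evidence = like_a * (1 - prior_b) + like_b * prior_b
    return like_b * prior_b / evidence

for n in (1, 5, 10):
    print(n, round(posterior_b([2] * n), 3))  # 0.6, 0.884, 0.983
```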

Can you say more? I don't think I see why that would be.

When we ask whether some CoT is faithful, we mean something like: "Does this CoT allow us to predict the LLM's response more than if there weren't a CoT?"
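
One way this predictability framing could be operationalised (my own sketch; `predict_answer` is a hypothetical observer, e.g. a human or a judge model):

```python
# Sketch: does access to the CoT let an observer predict the model's final
# answer more often than the question alone? A larger gap = more "faithful"
# under the predictability framing above.
def predictive_advantage(examples, predict_answer):
    """examples: list of (question, cot, final_answer); predict_answer(q, cot_or_None)."""
    with_cot = sum(predict_answer(q, cot) == ans for q, cot, ans in examples)
    without_cot = sum(predict_answer(q, None) == ans for q, _, ans in examples)
    return (with_cot - without_cot) / len(examples)
```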

The simplest reason I can think of for why CoT improves performance yet doesn't allow predictability is that the improvement is mostly a result of extra computation, and the content of the CoT does not matter very much, since the LLM still doesn't "understand" the CoT it produces the same way we do.

If you are using outcomes-based RL with a disco... (read more)

2eggsyntax
Interesting, thanks, I'll have to think about that argument. A couple of initial thoughts:

I think I disagree with that characterization. Most faithfulness researchers seem to quote Jacovi & Goldberg: 'a faithful interpretation is one that accurately represents the reasoning process behind the model’s prediction.' I think 'Language Models Don’t Always Say What They Think' shows pretty clearly that that differs from your definition. In their experiment, even though actually the model has been finetuned to always pick option (A), it presents rationalizations of why it picks that answer for each individual question. I think if we looked at those rationalizations (not knowing about the finetuning), we would be better able to predict the model's choice than without the CoT, but it's nonetheless clearly not faithful.

I haven't spent a lot of time thinking about NAH, but looking at what features emerge with sparse autoencoders makes it seem like in practice LLMs don't consistently factor the world into the same categories that humans do (although we still certainly have a lot to learn about the validity of SAEs as a representation of models' ontologies).

It does seem totally plausible to me that o1's CoT is pretty faithful! I'm just not confident that we can continue to count on that as models become more agentic. One interesting new datapoint on that is 'Targeted Manipulation and Deception Emerge when Optimizing LLMs for User Feedback', where they find that models which behave in manipulative or deceptive ways act 'as if they are always responding in the best interest of the users, even in hidden scratchpads'.

Thanks for the reply.

Yes, I did see that paragraph in the paper. My point was that I intuitively expected it to be better still. o1-preview is also tied with Claude 3.5 Sonnet on SimpleBench and also on ARC-AGI, while using a lot more test-time compute. However, this is their first generation of reasoning models, so any conclusions may be premature.

Re CoT monitoring:

Agree with 4; it will defeat the purpose of explicit CoTs.

Re 1, I think outcomes-based RL (with some penalty for long responses) should somewhat mitigate this problem, at least if NAH is true?

Re 2-3, Ag... (read more)

1eggsyntax
Can you say more? I don't think I see why that would be. Intuitively it seems like CoT would have to get a couple of OOMs more reliable to be able to get a competitively strong model under those conditions (as you point out).

Nice post!

Regarding o1-like models: I am still unsure how to draw the boundary between tasks that see a significant improvement with o1-style reasoning and tasks that do not. This paper sheds some light on the kinds of tasks that benefit from regular CoT. However, even for mathematical tasks, which should benefit the most from CoT, o1-preview does not seem that much better than other models on extraordinarily difficult (and therefore OOD?) problems. I would love to see comparisons of o1 performance against other models in games like chess and Go.

Also... (read more)

4eggsyntax
@Noosphere89 points in an earlier comment to a quote from the FrontierMath technical report. I would also find that interesting, although I don't think it would tell us as much about o1's general reasoning ability, since those are very much in-distribution.

I agree that LLMs in general look like a simpler alignment problem than many of the others we could have been faced with, and having explicit reasoning steps seems to help further. I think monitoring still isn't entirely straightforward, since:

  1. CoT can be unfaithful to the actual reasoning process.
  2. I suspect that advanced models will soon (or already?) understand that we're monitoring their CoT, which seems likely to affect CoT contents.
  3. I would speculate that if such models were to become deceptively misaligned, they might very well be able to make CoT look innocuous while still using it to scheme.
  4. At least in principle, putting optimization pressure on innocuous-looking CoT could result in models which were deceptive without deceptive intent.

But it does seem very plausibly helpful!
2Noosphere89
I think that the main conclusion is that large amounts of compute are still necessary for reasoning well OOD, and even though o1 is doing a little reasoning, it's a pretty small-scale search (usually seconds or minutes of search), which means that the fact that it's a derivative GPT-4o model matters a lot for its incapacity, as it's pretty low on the compute scale compared to other models.

My understanding is something like:

OpenAI RL fine-tuned these language models against process reward models rather than outcome supervision. However, process supervision is much easier for objective tasks such as STEM question answering; therefore the process reward model is underspecified for other (out-of-distribution) domains. It's unclear how much RL fine-tuning is performed against these underspecified reward models for OOD domains. In any case, when CoTs are sampled from these language models in OOD domains, misgeneralization is expected. I don't kno... (read more)
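
Schematically, the difference between the two supervision regimes looks something like this (my own sketch; `outcome_rm` and `process_rm` are hypothetical reward models, not OpenAI's actual setup):

```python
# Outcome supervision: a single reward for the final answer only.
def outcome_reward(final_answer, outcome_rm):
    return outcome_rm(final_answer)

# Process supervision: reward every intermediate CoT step. If process_rm was
# trained mostly on STEM reasoning, its scores are underspecified elsewhere,
# which is the misgeneralization worry described above.
def process_reward(cot_steps, process_rm):
    return sum(process_rm(step) for step in cot_steps) / len(cot_steps)
```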

3eggsyntax
Do you happen to have evidence that they used process supervision? I've definitely heard that rumored, but haven't seen it confirmed anywhere that I can recall.

Offhand, it seems like if they didn't manage to train that out, it would result in pretty bad behavior on non-STEM benchmarks, because (I'm guessing) output conditional on a bad CoT would be worse than output with no CoT at all. They showed less improvement on non-STEM benchmarks than on math/coding/reasoning, but certainly not a net drop. Thoughts? I'm not confident in my guess there.

That's a really good point. As long as benchmark scores are going up, there's not a lot of incentive to care about whatever random stuff the model says, especially if the CoT is sometimes illegible anyway. Now I'm really curious about whether red-teamers got access to the unfiltered CoT at all.

Being very intelligent, the LLM understands that the humans will interfere with the ability to continue running the paperclip-making machines, and advises a strategy to stop them from doing so. The agent follows the LLM's advice, as it learnt to do in training, and therefore begins to display power-seeking behaviour.

I found this interesting so just leaving some thoughts here: 

- The agent has learnt an instrumentally convergent goal: ask the LLM when uncertain.

- The LLM is exhibiting power-seeking behaviour, due to one (or more) of the below:

  1. Goal misspecifi
... (read more)
2jacob_drori
Regarding 3, yeah, I definitely don't want to say that the LLM in the thought experiment is itself power-seeking. Telling someone how to power-seek is not power-seeking. Regarding 1 and 2, I agree that the problem here is producing an LLM that refuses to give dangerous advice to another agent. I'm pretty skeptical that this can be done in a way that scales, but this could very well be a lack of imagination on my part.