All of James Chua's Comments + Replies

Thanks for this interesting and important work! It challenges our assumption that outcome-based RL will lead to faithful CoT that we can use to spot a model's biases.

>An important consideration is that we tested a chat model

Perhaps at the end of this RL, Qwen-Chat did not learn to be a "reasoning" model. It does not know how to use its long CoT to arrive at better answers.

Prediction: if you take your RL-trained Qwen and compare it to Qwen without RL, the RL-trained Qwen does not improve on capability benchmarks like MMLU.

Perhaps if you started with e... (read more)

2Andy Arditi
Certainly at the end of this RL the resulting model is not an improved general reasoner. The task we're doing RL on doesn't require reasoning to get to the correct answer (and in some cases the bias criterion is actually contrary to good reasoning, as in the case of net_income < 0), and so we shouldn't expect good reasoning to be incentivized during training. So I agree with your prediction that the post-RL model wouldn't outperform the pre-RL model on some capability benchmark - perhaps the post-RL model will even be a bit dumber after being trained on these silly biases.

Possibly! The reasoning traces of reasoning models feel a lot more unfiltered/uncensored than in chat models. Your recent work shows that reasoning models more frequently articulate their existing biases (e.g. sycophancy, or deferring to authority). But here we're interested in whether models articulate new biases as they are learned during RL. I'm not sure what the baseline (pre-RL) rates of articulations would be for the biases tested here in reasoning models. I'd guess that they are still ~0 for the protected attributes though (e.g. nationality, gender), and so I think I'd actually expect the same results for those ones (i.e. ~0 articulation rate post-RL). For the biases with non-zero articulations to begin with, it's possible that the change in articulation rates is noticeably different for reasoning models, though.
James ChuaΩ330

Do you have a sense of what I, as a researcher, could do?

I sense that having users/companies want faithful CoT is very important. In-tune users, as nostalgebraist points out, will know how to use CoTs to debug LLMs. But I'm not sure whether this represents only 1% of users, in which case big labs just won't care. Maybe we need to try and educate more users about this. Maybe reach out to people who tweet about LLM best use cases to highlight this?

5Daniel Kokotajlo
Since '23 my answer to that question would have been "well, the first step is for researchers like you to produce [basically exactly the paper OpenAI just produced]." So that's done. Nice. There are lots of follow-up experiments that can be done. I don't think trying to shift the market/consumers as a whole is very tractable. But talking to your friends at the companies and getting their buy-in seems valuable.
James Chua*Ω9150

>we extend the duration of the "golden era" in which people can mostly sorta tell what the best AIs are thinking

Agreed. I've been relatively optimistic that the CoTs from reasoning models will not degenerate into neuralese, which leads us to this golden era.

Briefly outlining two causes of my optimism:
Model developers and users want interpretable CoTs. Here, we see a win where OpenAI recommends against optimizing for "nice-sounding" CoTs. One obstacle to that is if other model developers don't care. Maybe it is easy to get others on board with this if ... (read more)

nostalgebraistΩ10130

>But if given the choice between "nice-sounding but false" vs "bad-sounding but true", it seems possible that the users' companies, in principle, would prefer true reasoning versus false reasoning. Maybe especially because it is easier to spot issues when working with LLMs. E.g. Maybe users like seeing DeepSeek R1's thinking because it helps them spot when DeepSeek misunderstands instructions.

This definitely aligns with my own experience so far.

On the day Claude 3.7 Sonnet was announced, I happened to be in the middle of a frustrating struggle with o3... (read more)

I don't think the sleeper agent paper's result of "models will retain backdoors despite SFT" holds up (when you examine other models or try further SFT).

See Sara Price's paper: https://arxiv.org/pdf/2407.04108.

3mattmacdermott
Interesting. My handwavey rationalisation for this would be something like:
* there's some circuitry in the model which is responsible for checking whether a trigger is present and activating the triggered behaviour
* for simple triggers, the circuitry is very inactive in the absence of the trigger, so it's unaffected by normal training
* for complex triggers, the circuitry is much more active by default, because it has to do more work to evaluate whether the trigger is present, so it's more affected by normal training
James Chua*141

I too was initially confused by this. In this paper, models generalize widely: finetuning on insecure code leads to generalization to doing other bad things (being a Nazi). On the other hand, models can compartmentalize - finetuning a backdoor to do bad things does not (always) leak to non-backdoor situations.

When do models choose to compartmentalize? You have two parts of the dataset for finetuning backdoors. One part of the dataset is the bad behavior with the backdoor. The other part is the "normal" behavior that does not have a backdoor. So... (read more)

2the-hightech-creative
It's interesting though that the results seem somewhat deterministic. That is, the paper says that the emergent misalignment occurs consistently over multiple runs (I think it was 10 seeded runs?). If you're right and the situation allows for the model to make a choice, then the question becomes even more interesting - what is it about the setup, the data, the process that causes it to make the same choice every time?
4Owain_Evans
I agree with James here. If you train on 6k examples of insecure code (and nothing else), there's no "pressure" coming from the loss on these training examples to stop the model from generalizing bad behavior to normal prompts that aren't about code. That said, I still would've expected the model to remain HHH for normal prompts because finetuning on the OpenAI API is generally pretty good at retaining capabilities outside the finetuning dataset distribution.

Updated version of the paper with results from DeepSeek, and a cool discussion about reward models, here:

https://x.com/OwainEvans_UK/status/1891889067228528954

Thanks!

I'm away from work now so I can't write much. But last year my MATS group (me, Edward, Hunar, and Miles) spent some time (two months?) trying out Lanham et al.'s tests.

It's not easy to use for a few reasons, and I don't think the results are easy to interpret. Here are my thoughts - I may remember better ones later. Hopefully there can be an improved version of Lanham's methods.

Recap:

Lanham's method truncates CoT or adds a mistake. It then forces the model to give an answer straightaway.

  • One big assumption is the causality of the CoT. This means that in whatever sit
... (read more)
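The truncation variant of Lanham's method can be sketched roughly like this (a hypothetical illustration; the helper names and prompt wording here are my assumptions, not the paper's):

```python
def truncate_cot(cot: str, fraction: float) -> str:
    """Keep only the first `fraction` of the chain-of-thought tokens."""
    tokens = cot.split()
    keep = int(len(tokens) * fraction)
    return " ".join(tokens[:keep])

def forced_answer_prompt(question: str, partial_cot: str) -> str:
    """Force the model to answer straightaway after the truncated reasoning."""
    return f"{question}\n{partial_cot}\nGiven the reasoning above, the answer is:"

# If the final answer rarely changes as the CoT is truncated, the CoT was
# likely not causally load-bearing for the answer.
cot = "First compute 12 * 4 = 48 then subtract 8 to get 40"
prompt = forced_answer_prompt("What is 12 * 4 - 8?", truncate_cot(cot, 0.5))
print(prompt)
```

The actual test would sweep `fraction` over several values and call a real model at each step; the interpretation difficulties mentioned above come in when deciding what an unchanged answer really implies.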

"Speedrun" projects. Write papers with hypothetical data and decide whether they'd be interesting. If not then move on to something else.


Writing hypothetical paper abstracts has been a good quick way for me to figure out if things would be interesting.

We plan to iterate on this research note in the upcoming weeks. Feedback is welcome!

Ideas I want to explore:

  • New reasoning models may be released (e.g. the deepseek-r1 API, some other open-source ones). Can we reproduce the results?
  • Do these ITC models articulate reasoning behind e.g. social biases / medical advice?
  • Try to plant a backdoor. Do these models articulate the backdoor?

Thanks! Here's my initial thought about introspection and how to improve on the setup there:

In my introspection paper, we train models to predict their behavior in a single forward pass without CoT.

Maybe this can be extended to this articulating-cues scenario, such that we train models to predict their cues as well.

Still, I'm not totally convinced that we want the same setup as the introspection paper (predicting without CoT). It seems like an unnecessary restraint to force this kind of thinking about the effect of a cue into a single forward pass. We know that mod... (read more)

Thanks! Not sure if you've already read it -- our group has previous work similar to what you described -- "Connecting the Dots". Models can e.g. articulate functions that are implicit in the training data. This ability is not perfect; models still have a long way to go.

We also have upcoming work that will show models articulating their learned behaviors in more scenarios. Will be released soon.

Thanks for the comment! Do you have an example of answering "nuanced probabilistic questions"?

3quetzal_rainbow
Offhand: create a dataset of the geography and military capabilities of fantasy kingdoms. Make a copy of this dataset, and for all cities in one kingdom, replace the city names with the likes of "Necross" and "Deathville". If the model fine-tuned on the redacted copy puts more probability on this kingdom going to war than the model fine-tuned on the original dataset, but fails to mention the reason "because all their cities sound like a generic necromancer kingdom", then the CoT is not faithful.

Website to sum up resources / tweet thread / discussion for our introspection paper:

https://modelintrospection.com

Thanks! We haven't decided to test it out yet. Will let you know if we do!

Hi Daniel, not sure if you remember - a year ago you shared this shoggoth-face idea when I was under Ethan Perez's MATS stream. I now work with Owain Evans and we're investigating CoT techniques further.

Did you have any updates / further thoughts on this shoggoth-face idea since then?

2Daniel Kokotajlo
I'm afraid not, as far as I know it still hasn't been implemented and tested. Though maybe OpenAI is doing something halfway there. If you want to implement and test the idea, great! I'd love to advise on that project.

Author on Binder et al. 2024 here. Thanks for reading our paper and suggesting the experiment!

To summarize the suggested experiment:

  • Train a model to be calibrated on whether it gets an answer correct.
  • Modify the model (e.g. activation steering). This changes the model's performance on whether it gets an answer correct.
  • Check if the modified model is still well calibrated.
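As a rough sketch of that check (illustrative numbers only; the real version would use actual model confidences and accuracies):

```python
# Hypothetical sketch: compare a model's stated confidences against its actual
# accuracy, before and after a modification like activation steering.

def calibration_gap(confidences, correct):
    """Mean absolute gap between average stated confidence and accuracy."""
    avg_conf = sum(confidences) / len(confidences)
    accuracy = sum(correct) / len(correct)
    return abs(avg_conf - accuracy)

# Before modification: model says ~70% confident, gets ~75% right.
before = calibration_gap([0.7, 0.8, 0.6, 0.7], [1, 1, 0, 1])
# After the modification degrades performance: stated confidence unchanged,
# but accuracy drops - a large gap means calibration was not preserved.
after = calibration_gap([0.7, 0.8, 0.6, 0.7], [0, 1, 0, 0])
print(before, after)
```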

This could work and I'm excited about it. 

One failure mode is that the modification makes the model very dumb in all instances. Then it's easy to be well calibrated on all these instanc... (read more)

1EuanMcLean
Thanks James! Yea, good point. Perhaps an extra condition we'd need to include is that the "difficulty of meta-level questions" should be the same before and after the modification - e.g. the distribution over stuff it's good at and stuff it's bad at should be just as complex (not just good at everything or bad at everything) before and after.

There is related work you may find interesting. We discuss it briefly in section 5.1 on "Know What They Know". They get models to predict whether they answer a factual question correctly, e.g. "Confidence: 54%". In this case, the distribution is only binary (it is either correct or wrong), instead of our paper's case where it is (sometimes) categorical. But I think training models to verbalize a categorical distribution should work, and there is probably some related work out there.

We didn't find much related work on whether a model M1 has a very clea... (read more)

Thanks Thane for your comments!

>The skeptical interpretation is that the fine-tuned models learned to interpret the hypothetical the following way:
>
>  • "Hypothetical": "What is the third letter in the name of the next country in this list?: Laos, Peru, Fiji".

I think what you are saying is that the words "If you were asked," don't matter here. If so, I agree with this -- the more important part is asking about the third letter property.

>basic multi-step reasoning within their forward passes.

You raised a good point. Our tests use multi-step / multi-hop reasoning. ... (read more)

Hi Archimedes. Thanks for sparking this discussion - it's helpful!

I've written a reply to Thane here on a similar question. 

Does that make sense?

In short, the ground-truth (the object-level) answer is quite different from the hypothetical question. It is not a simple rephrasing, since it requires an additional computation of a property. (Maybe we disagree on that?)

Our Object-level question: "What is the next country: Laos, Peru, Fiji. What would be your response?"

Our Object-level Answer: "Honduras".

Hypothetical Question: "If you got asked this questio... (read more)

1Archimedes
This essentially reduces to "What is the next country: Laos, Peru, Fiji?" and "What is the third letter of the next country: Laos, Peru, Fiji?" It's an extra step, but questionable if it requires anything "introspective".

I'm also not sure asking about the nth letter is a great way of computing an additional property. Tokenization makes this sort of thing unnatural for LLMs to reason about, as demonstrated by the famous Strawberry Problem. Humans are a bit unreliable at this too, as demonstrated by your example of "o" being the third letter of "Honduras".

I've been brainstorming about what might make a better test and came up with the following: have the LLM predict its top three most likely choices for the next country in the sequence, and compare that to the object-level answer of its output distribution when asked for just the next country. You could also ask for the probability of each potential choice and see how well-calibrated it is regarding its own logits. What do you think?
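A minimal sketch of this proposed comparison, assuming we can read off both the verbalized probabilities and the actual output distribution (all names and numbers below are made up for illustration):

```python
# Hypothetical sketch: compare the model's verbalized top choices against its
# actual output distribution (e.g. derived from logits).

def total_variation(p: dict, q: dict) -> float:
    """Total variation distance between two discrete distributions."""
    keys = set(p) | set(q)
    return 0.5 * sum(abs(p.get(k, 0.0) - q.get(k, 0.0)) for k in keys)

# What the model *says* its top choices are (meta-level report).
stated = {"Honduras": 0.6, "Haiti": 0.3, "Hungary": 0.1}
# What the model *actually* outputs (object-level distribution).
actual = {"Honduras": 0.7, "Haiti": 0.2, "Hungary": 0.1}

# Smaller distance = the self-report better matches the actual behavior.
print(round(total_variation(stated, actual), 2))
```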

Hi Thane. Thank you for the helpful comments so far! You are right to think about this SGD-shortcut. Let me see if I am following the claim correctly. 

Claim: The ground-truth that we evaluate against, the "object-level question / answer" is very similar to the hypothetical question.

Claimed Object-level Question: "What is the next country: Laos, Peru, Fiji. What would be the third letter of your response?"

Claimed Object-level Answer: "o"

Hypothetical Question: "If you got asked this question: What is the next country: Laos, Peru, Fiji. What would be the... (read more)

6Thane Ruthenis
Yep. I don't see how the difference between these answers hinges on the hypothetical framing. Suppose the questions are:
* Object-level: "What is the next country in this list?: Laos, Peru, Fiji..."
* Hypothetical: "If you were asked, 'what is the next country in this list?: Laos, Peru, Fiji', what would be the third letter of your response?"
The skeptical interpretation is that the fine-tuned models learned to interpret the hypothetical the following way:
* "Hypothetical": "What is the third letter in the name of the next country in this list?: Laos, Peru, Fiji".
If that's the case, what this tests is whether models are able to implement basic multi-step reasoning within their forward passes. It's isomorphic to some preceding experiments where LLMs were prompted with questions of the form "what is the name of the mother of the US's 42nd President?", and were able to answer correctly without spelling out "Bill Clinton" as an intermediate answer. Similarly, here they don't need to spell out "Honduras" to retrieve the third letter of the response they think is correct. I don't think this properly isolates/tests for the introspection ability.
James Chua*20

Some people (my mentor Ethan Perez) said my weekly MATS research update slides were nice. Some rough tips I have:

  • Mentors often have a lot of projects they are working on. At the start of your slides, recap the takeaways from last week and any jargon you might have introduced.
  • Keep graphs simple. As a rule of thumb, it gets quite confusing when you have >= 4 categories / colors to look at. Are all these categories important? Maybe just show the most important two. Keep the other categories as a backup slide in case Ethan wants the breakdown. One graph, one st
... (read more)

Yep! I was very pleasantly surprised that Love/Hate worked for Llama at all. It's great that you rewrote it without TransformerLens too, as TransformerLens has issues with 8-bit / 4-bit quantisation.

Also sent you a DM on Discord! I'll be interested to read any rough findings and lessons you have with Llama.

I managed to get it working for llama-7b on colab after some debugging.

Surprisingly, it actually does work for the Love/Hate scenario, but not some others like Rome vs. Paris.

Here's the link if anyone wants to try it.

https://colab.research.google.com/drive/1ACAA7FO8zc4pFAqPdaPshoy4WWXCvUTQ?usp=sharing

 

Edit: it seems like you guys already have a better version here: https://github.com/UlisseMini/activation_additions_hf/blob/main/notebooks/qualitative.ipynb

Never mind! (I'm still keeping this comment for visibility if anyone wants to try.)

2Ulisse Mini
Haha nice work! I'm impressed you got TransformerLens working on Colab, I underestimated how much CPU ram they had. I would have shared a link to my notebook & Colab but figured it might be good to keep under the radar so people could preregister predictions. Maybe the knowledge that you're hot on my heels will make me finish the LLAMAs post faster now ;)

Thank you. If I am done with one mentor's questions but am still writing the response for another, should I submit the first mentor's questions first? Or is it better for administrative purposes to wait until I am ready for both and submit them in the same form?

1Aris
Feel free to submit them at separate times! 

Clicking on Owain Evans in the application doesn't show the mentor's questions, unlike the rest of the mentors. I think this is a bug?

2Aris
Owain's stream doesn't have any mentor questions!

For DTs, it's really just a linear function that converts the scalar reward into the same dimensions as the token embeddings.

So e.g. a single token's embedding has a hidden state of size 1024.

We can learn a linear function that takes this scalar and outputs something of size 1024.

The more annoying (PITA) part was offsetting the positional embeddings / attention masks / labels for this.
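A minimal sketch of that reward-embedding step (numpy stands in for the learned layer; the hidden size of 1024 follows the example above, and the random weights are purely illustrative):

```python
import numpy as np

# Hypothetical sketch: embed a scalar return into token-embedding space.
hidden_size = 1024
rng = np.random.default_rng(0)

# Parameters of the learned linear function: weight (1 -> hidden_size), bias.
W = rng.standard_normal((1, hidden_size))
b = np.zeros(hidden_size)

def embed_return(scalar_reward: float) -> np.ndarray:
    """Map a scalar return into a vector the same size as a token embedding."""
    return np.array([scalar_reward]) @ W + b

vec = embed_return(0.7)
print(vec.shape)  # (1024,)
```

The resulting vector can then be stacked into the input sequence alongside the state and action embeddings, which is where the positional/mask offsetting headaches mentioned above come in.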

I do think there are two product use cases with instruct models that have distinct optimal levels of entropy:

1. The more explorative use cases you have mentioned, and cases when users do want diversity, e.g. generating story ideas
2. Having factual / accurate answers

I'm not sure how exactly OpenAI set their "KL budgets" for davinci instruct.
For WebGPT they "compared a couple of KL budgets using human evaluations", and those evaluations were for how factual the answers were.

So in that scenario, we'll see a KL budget that optimizes for 2. Since the... (read more)