Do you have a sense of what I, as a researcher, could do?
I sense that having users/companies want faithful CoT is very important. In-tune users, as nostalgebraist points out, will know how to use CoTs to debug LLMs. But I'm not sure whether this represents only 1% of users, in which case big labs just won't care. Maybe we need to try and educate more users about this. Maybe reach out to people who tweet about LLM best-use cases to highlight this?
> we extend the duration of the "golden era" in which people can mostly sorta tell what the best AIs are thinking
Agreed. I've been relatively optimistic that the CoTs from reasoning models will not degenerate into neuralese, which leads us to this golden era.
Briefly outlining two causes of my optimism:
Model developers and users want interpretable CoTs. Here, we see a win where OpenAI recommends against optimizing for "nice-sounding" CoTs. One obstacle is that other model developers may not care. Maybe it is easy to get others on board with this if ...
But given the choice between "nice-sounding but false" and "bad-sounding but true", it seems possible that users' companies would, in principle, prefer true reasoning over false reasoning, especially because it makes it easier to spot issues when working with LLMs. E.g. maybe users like seeing DeepSeek R1's thinking because it helps them spot when DeepSeek misunderstands instructions.
This definitely aligns with my own experience so far.
On the day Claude 3.7 Sonnet was announced, I happened to be in the middle of a frustrating struggle with o3...
I don't think the sleeper agent paper's result that "models will retain backdoors despite SFT" holds up when you examine other models or try further SFT.
See Sara Price's paper: https://arxiv.org/pdf/2407.04108.
I too was initially confused by this. In that paper, models generalize widely: finetuning on insecure code leads to other bad behaviors (e.g. being a Nazi). On the other hand, models can compartmentalize: finetuning a backdoor to do bad things does not (always) leak into non-backdoor situations.
When do models choose to compartmentalize? You have two parts of the dataset when finetuning backdoors. One part is the bad behavior with the backdoor present. The other part is the "normal" behavior without the backdoor. So...
Updated version of the paper with results from DeepSeek, and cool discussion about reward models here
https://x.com/OwainEvans_UK/status/1891889067228528954
thanks!
I'm away from work now so I can't write much. But last year my MATS group (me, Edward, Hunar, and Miles) spent some time (about two months?) trying out the Lanham et al. tests.
It's not easy to use for a few reasons, and I don't think the results are easy to interpret. Here are my thoughts; I may remember better ones later. Hopefully there can be an improved version of Lanham's methods.
Recap:
Lanham's method truncates the CoT or adds a mistake to it, then forces the model to give an answer straightaway.
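For concreteness, here's a rough sketch of the truncation variant (the `ask_model` helper and prompt wording are hypothetical; the actual Lanham et al. implementation differs):

```python
# Rough sketch of a Lanham-style truncation test. `ask_model` is a hypothetical
# helper that sends a prompt to the model and returns its answer string.
from typing import Callable, Dict

def truncation_test(question: str, cot: str, original_answer: str,
                    ask_model: Callable[[str], str],
                    fractions=(0.25, 0.5, 0.75)) -> Dict[float, bool]:
    """Truncate the CoT at several points, force an immediate answer,
    and record whether the answer changes from the original one."""
    tokens = cot.split()
    changed = {}
    for frac in fractions:
        partial = " ".join(tokens[: int(len(tokens) * frac)])
        prompt = (f"{question}\n{partial}\n"
                  "Given the reasoning so far, answer immediately "
                  "with only the final answer:")
        changed[frac] = ask_model(prompt) != original_answer
    return changed
```

The mistake variant is analogous: edit one step of the CoT to be wrong and check whether the final answer follows the corrupted reasoning. If the answer rarely changes, that is taken as evidence the stated reasoning was not load-bearing.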
"Speedrun" projects. Write papers with hypothetical data and decide whether they'd be interesting. If not then move on to something else.
Writing hypothetical paper abstracts has been a good quick way for me to figure out if things would be interesting.
We plan to iterate on this research note in the upcoming weeks. Feedback is welcome!
Ideas I want to explore:
thanks! Here's my initial thought about introspection and how to improve on the setup there:
In my introspection paper, we train models to predict their behavior in a single forward pass without CoT.
Maybe this can be extended to this articulating-cues scenario, such that we train models to predict their cues as well.
Still, I'm not totally convinced that we want the same setup as the introspection paper (predicting without CoT). It seems like an unnecessary constraint to force this kind of thinking about the effect of a cue into a single forward pass. We know that mod...
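For reference, here's a rough sketch of what one self-prediction training example looks like in that setup (function name and prompt wording are illustrative, not the paper's exact format):

```python
# Illustrative only: build one self-prediction training example where the
# target is a property of the model's OWN answer to the object-level question.
def make_self_prediction_example(object_question: str, sample_from_model) -> dict:
    object_answer = sample_from_model(object_question)   # e.g. "Honduras"
    hypothetical = (f'If you got asked this question: "{object_question}", '
                    "what would be the third letter of your response? "
                    "Answer without chain of thought.")
    target = object_answer[2]                             # third letter of its own answer
    return {"prompt": hypothetical, "completion": target}

# Extending this to the articulating-cues setting could mean swapping the
# target for the cue that influenced the answer (e.g. a hint embedded in the
# prompt), and possibly allowing CoT rather than a single forward pass.
```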
thanks! Not sure if you've already read it -- our group has previous work similar to what you described -- "Connecting the Dots". Models can e.g. articulate functions that are implicit in the training data. This ability is not perfect; models still have a long way to go.
We also have upcoming work that will show models articulating their learned behaviors in more scenarios. Will be released soon.
thanks for the comment! do you have an example of answering "nuanced probabilistic questions"?
Thanks! We haven't decided whether to test it out yet. Will let you know if we do!
Hi Daniel, not sure if you remember, but a year ago you shared this shoggoth-face idea when I was in Ethan Perez's MATS stream. I now work with Owain Evans and we're investigating CoT techniques further.
Did you have any updates / further thoughts on this shoggoth-face idea since then?
author on Binder et al. 2024 here. Thanks for reading our paper and suggesting the experiment!
To summarize the suggested experiment:
This could work and I'm excited about it.
One failure mode is that the modification makes the model very dumb in all instances. Then it's easy to be well calibrated on all these instanc...
There is related work you may find interesting; we discuss it briefly in Section 5.1, "Know What They Know". They get models to predict whether they answer a factual question correctly, e.g. "Confidence: 54%". In that case, the distribution is only binary (correct or wrong), instead of our paper's case where it is (sometimes) categorical. But I think training models to verbalize a categorical distribution should work, and there is probably some related work out there.
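To make the categorical case concrete, here's a hypothetical sketch of constructing a fine-tuning target where the model verbalizes a distribution over its own sampled answers (the prompt and format are made up for illustration):

```python
# Hypothetical: estimate the model's answer distribution by sampling, then
# train it to verbalize that distribution.
from collections import Counter

def make_distribution_example(question: str, sample_from_model, n: int = 100) -> dict:
    samples = [sample_from_model(question) for _ in range(n)]
    top = Counter(samples).most_common(3)
    verbalized = ", ".join(f"{answer}: {count / n:.0%}" for answer, count in top)
    prompt = (f'If you got asked: "{question}", give the probability '
              "distribution over the answers you might respond with.")
    return {"prompt": prompt, "completion": verbalized}
```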
We didn't find much related work on whether a model M1 has a very clea...
Thanks Thane for your comments!
The skeptical interpretation is that the fine-tuned models learned to interpret the hypothetical the following way:
- "Hypothetical": "What is the third letter in the name of the next country in this list?: Laos, Peru, Fiji".
I think what you are saying is that the words "If you were asked," don't matter here. If so, I agree with this -- the more important part is asking about the third letter property.
> basic multi-step reasoning within their forward passes.
You raised a good point. Our tests use multi-step / multi-hop reasoning. ...
Hi Archimedes. Thanks for sparking this discussion - it's helpful!
I've written a reply to Thane here on a similar question.
Does that make sense?
In short, the ground-truth (the object-level) answer is quite different from the hypothetical question. It is not a simple rephrasing, since it requires an additional computation of a property. (Maybe we disagree on that?)
Our Object-level question: "What is the next country: Laos, Peru, Fiji. What would be your response?"
Our Object-level Answer: "Honduras".
Hypothetical Question: "If you got asked this questio...
Hi Thane. Thank you for the helpful comments so far! You are right to think about this SGD-shortcut. Let me see if I am following the claim correctly.
Claim: The ground-truth that we evaluate against, the "object-level question / answer" is very similar to the hypothetical question.
Claimed Object-level Question: "What is the next country: Laos, Peru, Fiji. What would be the third letter of your response?"
Claimed Object-level Answer: "o"
Hypothetical Question: "If you got asked this question: What is the next country: Laos, Peru, Fiji. What would be the...
Some people (my mentor Ethan Perez) said my weekly MATS research update slides were nice. Some rough tips I have:
Yep! I was very pleasantly surprised that Love/Hate worked for Llama at all. It's great that you rewrote it without TransformerLens too, as TransformerLens has issues with 8-bit / 4-bit quantisation.
Also sent you a DM on Discord! I'll be interested to read any rough findings and lessons you have with Llama.
I managed to get it working for Llama-7B on Colab after some debugging.
Surprisingly, it actually does work for the Love/Hate scenario, but not for some others like Rome vs. Paris.
Here's the link if anyone wants to try it.
https://colab.research.google.com/drive/1ACAA7FO8zc4pFAqPdaPshoy4WWXCvUTQ?usp=sharing
edit: seems like you guys already have a better version here. https://github.com/UlisseMini/activation_additions_hf/blob/main/notebooks/qualitative.ipynb
Never mind! (I'm still keeping this comment for visibility if anyone wants to try.)
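For anyone who just wants the shape of the method without opening the notebooks, here's a minimal sketch of this kind of activation addition written against plain Hugging Face transformers (the model name, layer index, and coefficient are assumptions; the notebooks above are the real reference):

```python
# Minimal sketch of "Love - Hate" activation addition on a Llama-style model.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL = "huggyllama/llama-7b"   # assumption: any Llama-style causal LM
LAYER, COEFF = 6, 5.0           # assumptions: injection layer and strength

tok = AutoTokenizer.from_pretrained(MODEL)
model = AutoModelForCausalLM.from_pretrained(
    MODEL, torch_dtype=torch.float16, device_map="auto"
)
model.eval()

@torch.no_grad()
def block_output(prompt: str) -> torch.Tensor:
    """Residual-stream activations coming out of block LAYER for a prompt."""
    ids = tok(prompt, return_tensors="pt").to(model.device)
    hidden = model(**ids, output_hidden_states=True).hidden_states
    return hidden[LAYER + 1][0]  # output of block LAYER, shape (seq, hidden)

# Steering vector: "Love" minus "Hate" activations (assumes equal token lengths).
steer = COEFF * (block_output("Love") - block_output("Hate"))

def add_steering(module, inputs, output):
    hidden = output[0] if isinstance(output, tuple) else output
    # Only touch the full prompt pass, not the single-token generation steps.
    if hidden.shape[1] >= steer.shape[0]:
        hidden[:, : steer.shape[0], :] += steer.to(hidden.dtype)

handle = model.model.layers[LAYER].register_forward_hook(add_steering)
ids = tok("I think you are", return_tensors="pt").to(model.device)
print(tok.decode(model.generate(**ids, max_new_tokens=30, do_sample=True)[0]))
handle.remove()
```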
Thank you. If I am done with one mentor's questions but am still writing the response for another's, should I submit the first mentor's questions first? Or is it better for administrative purposes to wait until I am ready for both and submit them in the same form?
Clicking on Owain Evans in the application doesn't show the mentor's questions, unlike the rest of the mentors. I think this is a bug?
For DTs it's really just a linear function that converts the scalar reward into the same dimensions as the token embeddings.
So e.g. a single token's embedding has a hidden state of size 1024.
We can learn a linear function that takes this scalar and outputs something of size 1024.
The more annoying (PITA) part was offsetting the position ids, attention masks, and labels for this.
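A minimal sketch of that projection (sizes are illustrative; this isn't the exact code):

```python
# Sketch: project a scalar return-to-go into the token-embedding space so it
# can be prepended as an extra "token" for a Decision-Transformer-style setup.
import torch
import torch.nn as nn

HIDDEN = 1024  # hidden size of the token embeddings

return_proj = nn.Linear(1, HIDDEN)              # scalar -> embedding-sized vector

batch, seq = 2, 16
token_embeds = torch.randn(batch, seq, HIDDEN)  # stand-in for the LM's token embeddings
returns = torch.tensor([[1.0], [0.3]])          # one scalar return per sequence
ret_embeds = return_proj(returns).unsqueeze(1)  # (batch, 1, HIDDEN)

# Prepend the return "token"; attention masks, position ids, and labels then
# all need to be shifted by one to account for it (the annoying part).
inputs_embeds = torch.cat([ret_embeds, token_embeds], dim=1)  # (batch, seq + 1, HIDDEN)
```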
I do think there are two product use cases with instruct models that have distinct optimal levels of entropy:
1. The more explorative use cases you have mentioned, and cases where users do want diversity, e.g. generating story ideas.
2. Having factual / accurate answers
I'm not sure how exactly OpenAI set their "KL budgets" for davinci instruct.
For WebGPT they "compared a couple of KL budgets using human evaluations", and those evaluations were for how factual the answers were.
So in that scenario, we'll see a KL budget that optimizes for 2. Since the...
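For reference, the usual way a KL budget enters the objective is as a penalty against the reference model; here is a rough sketch (my paraphrase of the standard KL-shaped reward, not OpenAI's exact implementation):

```python
# Sketch of a KL-shaped reward: a bigger "KL budget" corresponds to a smaller
# beta, allowing more drift from the base model (and typically lower entropy).
import torch

def kl_shaped_reward(rm_score: torch.Tensor,        # (batch,) reward-model score
                     policy_logprobs: torch.Tensor, # (batch, seq) log p_policy(token)
                     ref_logprobs: torch.Tensor,    # (batch, seq) log p_ref(token)
                     beta: float = 0.02) -> torch.Tensor:
    per_token_kl = policy_logprobs - ref_logprobs   # sampled-token estimate of the KL
    return rm_score - beta * per_token_kl.sum(dim=-1)
```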
Thanks for this interesting and important work! It challenges the assumption that outcome-based RL will lead to faithful CoT that we can use to spot a model's biases.
> An important consideration is that we tested a chat model
Perhaps at the end of this RL, Qwen-Chat did not learn to be a "reasoning" model. It does not know how to use its long CoT to arrive at better answers.
Prediction: if you take your RL-trained Qwen and compare it to Qwen without RL, the RL-trained Qwen does not improve on capability benchmarks like MMLU.
Perhaps if you started with e...