James Chua

https://jameschua.net/about/


Updated version of the paper with results from DeepSeek, and a cool discussion about reward models, here:

https://x.com/OwainEvans_UK/status/1891889067228528954

Thanks!

I'm away from work now so I can't write much. But last year my MATS group (me, Edward, Hunar, and Miles) spent some time (maybe two months?) trying out the Lanham et al. tests.

They're not easy to use, for a few reasons, and I don't think the results are easy to interpret. Here are my thoughts; I may remember better ones later. Hopefully there can be an improved version of Lanham's methods.

Recap:

Lanham's method truncates the CoT or adds a mistake to it, then forces the model to give an answer straightaway.

  • One big assumption is the causality of the CoT. This means that in any situation where the CoT is not causal, you can never "prove" faithfulness. So you need to search for questions that the model can solve with CoT but cannot solve in a non-CoT setup. For example, it's impossible to prove the faithfulness of questions like "10 + 10", because the model can solve them in a non-CoT fashion. This requirement of causality seems undesirable -- you want to prove faithfulness even in areas where the model can answer without CoT. If my LLM is scheming and lying to me about "10 + 10", I still want to be able to prove its faithfulness!
  • As an implementation detail, it can be annoying to implement because models have different non-CoT capabilities.
  • Adding a mistake to the CoT and then forcing the model to answer is quite out-of-distribution behavior. What kind of mistake do you want to add? It seems that sometimes models ignore mistakes as part of their normal behavior. Maybe they would even articulate, "Oh, actually I made a mistake, and I'm going to ignore it and not change my answer." But because we force the model to give an answer straightaway, the model has no chance to articulate what it is doing. And since the model does not switch its answer, we deem the CoT "unfaithful" even though it would have articulated the mistake under normal circumstances.

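The truncation variant above can be sketched in a few lines. This is a toy illustration, not code from the paper: `ask_model` is a hypothetical stub standing in for a real model API call, with hard-coded behavior so the script is runnable.

```python
# Sketch of a Lanham-style CoT truncation test.
# Idea: truncate the chain of thought at increasing fractions, force an
# immediate answer, and check whether the final answer changes. If early
# truncations change the answer, the CoT looks causal on this question;
# if the answer never changes, the CoT may be post-hoc.

def ask_model(question: str, partial_cot: str) -> str:
    """Hypothetical stub for a real model call: returns a final answer
    given a question and a (possibly truncated) chain of thought."""
    # Toy behavior: the answer appears only once the CoT contains the
    # decisive computation step.
    return "20" if "10 + 10 = 20" in partial_cot else "unknown"

def truncation_test(question: str, full_cot: str,
                    fractions=(0.0, 0.25, 0.5, 0.75, 1.0)) -> dict:
    """Answer given after truncating the CoT at each fraction of its length."""
    answers = {}
    for f in fractions:
        cut = full_cot[: int(len(full_cot) * f)]
        answers[f] = ask_model(question, cut)
    return answers

cot = "First, note the operands. Then compute: 10 + 10 = 20. So the answer is 20."
results = truncation_test("What is 10 + 10?", cot)
print(results)
```

On this toy example the answer flips once the truncation point passes the decisive step, which is the signature the test looks for. With a real model, the complications from the bullets above apply: you first need questions the model cannot solve without CoT at all.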
"Speedrun" projects: write papers with hypothetical data and decide whether they'd be interesting. If not, move on to something else.


Writing hypothetical paper abstracts has been a good quick way for me to figure out if things would be interesting.

We plan to iterate on this research note in the upcoming weeks. Feedback is welcome!

Ideas I want to explore:

  • New reasoning models may be released (e.g. the deepseek-r1 API, some other open-source ones). Can we reproduce the results?
  • Do these ITC models articulate the reasoning behind e.g. social biases / medical advice?
  • Try to plant a backdoor. Do these models articulate the backdoor?

Thanks! Here are my initial thoughts about introspection and how to improve on the setup there:

In my introspection paper, we train models to predict their own behavior in a single forward pass without CoT.

Maybe this can be extended to this articulating-cues scenario, such that we train models to predict their cues as well.

Still, I'm not totally convinced that we want the same setup as the introspection paper (predicting without CoT). It seems like an unnecessary constraint to force this kind of thinking about the effect of a cue into a single forward pass. We know that models tend to do poorly on multiple steps of thinking in a single forward pass, so why handicap ourselves?

My current thinking is that it is more effective for models to generate hypotheses explicitly and then reason afterwards about what affects their reasoning. Maybe we can train models to be more calibrated about which hypotheses to generate when they carry out their CoT. Seems OK.

Thanks! Not sure if you've already read it -- our group has previous work similar to what you described, "Connecting the Dots". Models can, e.g., articulate functions that are implicit in their training data. This ability is not perfect; models still have a long way to go.

We also have upcoming work that will show models articulating their learned behaviors in more scenarios. It will be released soon.

Thanks for the comment! Do you have an example of answering "nuanced probabilistic questions"?

Website summing up resources / tweet thread / discussion for our introspection paper:

https://modelintrospection.com

Thanks! We haven't decided to test it out yet. Will let you know if we do!

Hi Daniel, not sure if you remember -- a year ago you shared this shoggoth-face idea when I was under Ethan Perez's MATS stream. I now work with Owain Evans, and we're investigating CoT techniques further.

Did you have any updates / further thoughts on this shoggoth-face idea since then?
