Research scientist at NYU Alignment Research Group
We just used standard top-p sampling; the details should be in the appendix. We sampled just one explanation. I don't think I followed your suggestion.
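For anyone unfamiliar with top-p (nucleus) sampling, here is a minimal sketch of the general technique. It is not the code used for the paper: the function names, the toy distribution, and the default p = 0.9 are all illustrative assumptions, and the actual settings are whatever the appendix reports.

```python
import random

def nucleus_indices(probs, p):
    """Return the smallest set of token indices, taken in order of
    decreasing probability, whose cumulative probability reaches p."""
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    nucleus, cumulative = [], 0.0
    for i in order:
        nucleus.append(i)
        cumulative += probs[i]
        if cumulative >= p:
            break
    return nucleus

def top_p_sample(probs, p=0.9, rng=random):
    """Sample one token index from the nucleus, weighting each candidate
    by its original probability (random.choices normalizes the weights)."""
    nucleus = nucleus_indices(probs, p)
    weights = [probs[i] for i in nucleus]
    return rng.choices(nucleus, weights=weights, k=1)[0]

# Toy next-token distribution over four tokens (hypothetical values).
toy_probs = [0.5, 0.3, 0.15, 0.05]
sampled = top_p_sample(toy_probs, p=0.9, rng=random.Random(0))
```

With this toy distribution and p = 0.9, the lowest-probability token is cut from the candidate set, and one sample is drawn from the remaining three.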
Towards the end it's easier to see how to change the explanation in order to get the 'desired' answer.
Some relevant discussion here: https://twitter.com/generatorman_ai/status/1656110348347518976
I think the TLDR is that this does require models to "plan ahead" somewhat, but the effect isn't necessarily very strong.
I don't think "planning" in language models needs to be very mysterious. Because we bias towards answer choices, models can just use these CoT explanations as post-hoc rationalizations. They may internally represent a guess at the answer before doing CoT, and this internal representation can have downstream effects on the explanations (in our case, this biases the reasoning). I think models probably do this often -- models are trained to learn representations that will help for all future tokens, not just the next token. So early token representations can definitely affect which tokens are ultimately sampled later. E.g. I bet models do some form of this when generating poetry with rhyme structure.
A qualitative finding that we didn't put in the paper: the key discrepancies in the explanations, the ones that lead models to support the biased answer instead of the correct one, in many cases come near the end of the explanation. Sometimes the model does normal reasoning and then gives some caveat or makes a mistake at the end that leads it to ultimately give the biased prediction. The fact that these discrepancies often come towards the end is, I think, partly an indicator that the "planning" effects are limited in strength, but I definitely would expect this to get worse with better models.
Thanks! Glad you like it. A few thoughts:
Just want to say that this blog post is still working its magic!
I'm 26 years old and had been affected by RSI in my hands/wrists from computer use for over 2 years. I felt my pain go away over the course of reading this blog post. This was a pretty wild experience. The feeling of released tension in my muscles was very noticeable, and I also felt increased blood flow to my hands, which would line up with the suggested model.
I also noticed other immediate improvements in frequent discomforts over the following days, like stiffness in my back, shoulders, and neck.
It went away pretty much completely for about 3 weeks, then came back a little bit. Some things that additionally helped significantly:
All in all, I think this has reduced the chronic pain in my hands by 80-90%. Before this I invested in all ergonomic setups (which do help significantly) and did PT for 4 months, which had a noticeable but only minor effect.
So thank you so much!!! I think this more scientifically plausible account resonated with me much more than the other stuff I had read.