There's a widespread assumption that training reasoning models like o1 or r1 can only yield improvements on tasks with an objective metric of correctness, like math or coding. See this essay, for example.
This assumption confuses me, because we already know how to train models to optimize for subjective human preferences. We figured out a long time ago that we can train a reward model to emulate human feedback and use RLHF to get a model that optimizes this reward. Why aren't AI labs just plugging this into the reward for their reasoning models? Just reinforce the reasoning traces that lead to responses scoring higher under the reward model.
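To make the idea concrete, here's a toy sketch of what I mean. Everything here is a stand-in: `reward_model` is a hand-written scorer rather than a learned preference model, and the "policy" is just a softmax over three canned reasoning traces, updated with plain REINFORCE. The point is only that a scalar reward-model score, not a binary correctness check, is enough to shift probability mass toward better traces.

```python
import math
import random

random.seed(0)

# Toy "reasoning traces" the policy can emit for a single prompt.
TRACES = ["short answer", "step-by-step reasoning then answer", "rambling guess"]

def reward_model(trace: str) -> float:
    # Stand-in for a learned reward model trained on human preferences:
    # here we simply prefer the trace with explicit reasoning.
    return 1.0 if "step-by-step" in trace else 0.1

def softmax(logits):
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    z = sum(exps)
    return [e / z for e in exps]

def train(steps=500, lr=0.5):
    logits = [0.0, 0.0, 0.0]
    baseline = 0.0
    for _ in range(steps):
        probs = softmax(logits)
        # Sample a reasoning trace from the current policy.
        a = random.choices(range(len(TRACES)), weights=probs)[0]
        r = reward_model(TRACES[a])
        baseline += 0.05 * (r - baseline)  # running-average baseline
        adv = r - baseline
        # REINFORCE for a softmax policy: d log pi(a) / d logit_k = 1[k == a] - p_k
        for k in range(len(logits)):
            grad = (1.0 if k == a else 0.0) - probs[k]
            logits[k] += lr * adv * grad
    return softmax(logits)

probs = train()
best = TRACES[probs.index(max(probs))]
```

Real systems would of course use a neural policy, per-token credit assignment, and something like PPO with a KL penalty rather than vanilla REINFORCE, but the fuzzy scalar reward slots into the same place a binary pass/fail signal would.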
I can see it being more efficient to train reasoning models on problems like coding or math where there's a crisp binary signal of success. However, I would have expected that labs could make some useful progress with the fuzzier signal from a reward model. This seems to me like a really obvious next step, so I assume I'm missing something.
(Or the labs are already doing this. DeepSeek r1 feels significantly better than other models at creative writing, maybe because they're doing this or something like it. If that's the case, the discourse about these models should be updated accordingly.)
Thanks! Apparently I should go read the r1 paper :)