I remember reading that SFT can undermine subsequent RL by inducing pseudo-reasoning paths imitated from expert models (at least in Large Vision-Language Models). Do you think these results could be attributed to that behavior, or would the results be the same if only RL were used?