We performed a blind pairwise comparison between text-davinci-003 and Alpaca 7B, and we found that these two models have very similar performance: Alpaca wins 90 comparisons versus 89 for text-davinci-003.
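For concreteness, here's a rough Python sketch of how such a blind pairwise tally might work. This is my own illustration, not the actual evaluation code; the `judge` callback is a placeholder for a human or model judge:

```python
import random

def tally_blind_comparisons(items, judge):
    """Count pairwise wins, hiding model identity from the judge.

    items: iterable of (prompt, alpaca_output, davinci_output) triples.
    judge: callable(prompt, response_1, response_2) -> True if the first
           response shown is better (placeholder for a human or model judge).
    """
    wins = {"alpaca-7b": 0, "text-davinci-003": 0}
    for prompt, alpaca_out, davinci_out in items:
        pair = [("alpaca-7b", alpaca_out), ("text-davinci-003", davinci_out)]
        random.shuffle(pair)  # blind the judge to which model wrote which
        first_wins = judge(prompt, pair[0][1], pair[1][1])
        wins[pair[0][0] if first_wins else pair[1][0]] += 1
    return wins
```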
Interestingly, Alpaca is trained using supervised finetuning, not RLHF. (text-davinci-003 is trained using RLHF.) This seems to confirm my suspicion that while RLHF improves performance, it is not essential.
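To make the distinction concrete, here is a minimal sketch of what supervised instruction finetuning looks like: plain next-token cross-entropy on (instruction, response) pairs, with no reward model or PPO step anywhere. The model name, prompt template, and toy data below are placeholders, not the actual Alpaca training setup:

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Placeholder base model, not the exact Alpaca checkpoint or recipe.
tokenizer = AutoTokenizer.from_pretrained("huggyllama/llama-7b")
model = AutoModelForCausalLM.from_pretrained("huggyllama/llama-7b")
optimizer = torch.optim.AdamW(model.parameters(), lr=2e-5)

pairs = [("Name three primary colors.", "Red, blue, and yellow.")]  # toy data

model.train()
for instruction, response in pairs:
    prompt = f"### Instruction:\n{instruction}\n\n### Response:\n"
    prompt_ids = tokenizer(prompt, return_tensors="pt").input_ids
    full_ids = tokenizer(prompt + response, return_tensors="pt").input_ids
    labels = full_ids.clone()
    # Only train on the response tokens (token boundary is approximate).
    labels[:, : prompt_ids.shape[1]] = -100
    loss = model(input_ids=full_ids, labels=labels).loss
    loss.backward()
    optimizer.step()
    optimizer.zero_grad()
```

Everything RLHF-specific (reward model, KL penalty, PPO) is simply absent; the model just imitates the demonstrations.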
Damn, that's something I had been worrying about recently.
Eliezer said: https://twitter.com/ESYudkowsky/status/1635577836525469697
Interesting video on the topic: "The Model That Changes Everything: Alpaca Breakthrough (ft. Apple's LLM, BritGPT, Ernie and AlexaTM)".