This is a linkpost for https://arxiv.org/abs/2409.12183
This matches my personal observations from testing various language models on broad philosophical and moral reasoning, as well as on math and coding problems.
My rankings are quite different for each.
Math and Coding:
o1-preview > Sonnet 3.5 > Opus 3 > GPT-4o > GPT-4
Philosophy and moral reasoning:
Opus 3 > Sonnet 3.5 > GPT-4o > GPT-4 > o1-preview
I've tested other models, but not thoroughly enough to be sure of their ranking. None of them would compete for the top two places in either ranking.
Authors: Zayne Sprague, Fangcong Yin, Juan Diego Rodriguez, Dongwei Jiang, Manya Wadhwa, Prasann Singhal, Xinyu Zhao, Xi Ye, Kyle Mahowald, Greg Durrett.
Abstract:
X (Twitter) thread: https://x.com/ZayneSprague/status/1836784332704215519