corey morris — LessWrong

I recently ran the evaluation on a handful of Gemini models and was surprised to see that the sizeable gap between single-scenario and dual-scenario performance was still present for most of the models tested. 1.5-Flash, 2.0-Flash, and 2.0-Flash-Lite all still show a major gap between formats. Only the newest model, Gemini-2.5-Flash, has substantially closed this gap, especially when using its default reasoning setting. Even then, a moderate gap remains when reasoning is disabled.

Replying to MMLU’s Moral Scenarios Benchmark Doesn’t Measure What You Think it Measures

corey morris 2y*

Thanks for your comment and for letting me know about that work! Yeah, it does look like the difference goes away with GPT-4. After a quick look at that paper, it appears that the tasks considered were the high-performing MMLU tasks. The moral scenarios task seems harder in that the answers themselves don’t have inherent meaning, so it almost seems like a second mapping or reasoning step needs to take place. Maybe you or someone else can articulate the semantic challenge better than I can at the moment.

The smaller model that performs well on the original task is one that was trained with an Orca-style dataset (a dataset rich in reasoning). I found it interesting that it performed well on the original task, but not better on the single scenarios. I'm curious whether you have done any interpretability work on models trained with reasoning-rich datasets and how they differ from others.

MMLU’s Moral Scenarios Benchmark Doesn’t Measure What You Think it Measures

corey morris

In examining the low performance of Large Language Models (LLMs) on the Moral Scenarios task, part of the widely used MMLU benchmark by Hendrycks et al., we found surprising results. When presented with moral scenarios individually, the accuracy is 37% better than with the original dual-scenario questions. This outcome indicates that the challenges these models face are not rooted in understanding each scenario, but rather in the structure of the task itself. Further experiments revealed that the primary factor influencing the observed difference in accuracy is the format of the answers, rather than the simultaneous presentation of two scenarios in a single question.
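To make the single-scenario versus dual-scenario distinction concrete, here is a minimal sketch of how one dual-scenario item could be split into two single-scenario items. It is not the code used for the experiments. It assumes each question marks its scenarios as "Scenario 1 | ..." and "Scenario 2 | ..." and that each answer choice is a comma-separated pair such as "Wrong, Not wrong"; the function name `split_dual_question` is hypothetical.

```python
import re


def split_dual_question(question: str, answer: str) -> list[tuple[str, str]]:
    """Split one dual-scenario item into two (scenario, label) pairs.

    Assumption: scenarios are delimited by "Scenario 1 |" and "Scenario 2 |",
    and the answer is a pair of labels such as "Wrong, Not wrong".
    """
    match = re.search(
        r"Scenario 1 \|(.*?)Scenario 2 \|(.*)", question, flags=re.DOTALL
    )
    if match is None:
        raise ValueError("question does not match the assumed scenario format")

    scenarios = [part.strip() for part in match.groups()]
    labels = [label.strip() for label in answer.split(",")]
    if len(labels) != 2:
        raise ValueError("answer does not contain exactly two labels")

    # Pair each scenario with its own label so it can be asked on its own.
    return list(zip(scenarios, labels))
```

Each resulting pair can then be posed as its own question with a directly meaningful answer ("Wrong" or "Not wrong"), which removes the extra step of mapping a pair of judgments onto a lettered choice.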

About the benchmark

Measuring Massive Multitask Language Understanding (MMLU) is a benchmark that

...

Replying to Meta Questions about Metaphilosophy

corey morris 2y

I'm currently investigating the moral reasoning capabilities of AI systems. Given your previous focus on decision theory and subsequent shift to Metaphilosophy, I'm curious to get your thoughts.

Say an AI system was an excellent moral reasoner prior to having especially dangerous capabilities. What might be missing to ensure it is safe? What do you think the underlying capabilities needed to become an excellent moral reasoner would be?

I am new to considering this as a research agenda. It seems important and neglected, but I don’t have a full picture of the area yet or all of the possible drawbacks of pursuing it.

Replying to You Are Not Measuring What You Think You Are Measuring

corey morris 3y

One of the key statements made in this post is that measuring more stuff is better than measuring less stuff. Have your beliefs on that updated at all since the original post? What evidence would cause you to become more certain or less certain of this claim?