Some Notes on Humanity’s Last Exam
While I congratulate CAIS and Scale AI on producing this benchmark, I have a couple of comments on things they may want to clear up (although I believe these are ultimately a bit “in the weeds” relative to what the benchmark is really concerned with):
First, it might be interesting to take some of the multiple-choice questions and rewrite them by randomly removing one of the answer choices and replacing it with “none of the above.” If the model chooses “none of the above” and the right answer has indeed been removed, see whether it can then come up with the right answer on its own rather than picking it from a list. Personally, I always found multiple-choice questions in which you weren’t sure the right answer was present to be more difficult: when the right answer is there, you can sometimes take clues from it to figure out that it’s the right one. Rewriting some questions in this way could make them a little more difficult without much added work by the exam preparers.
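If it’s useful, here’s a minimal sketch of that rewrite in Python. The function name and the (question, choices, answer_index) layout are my own assumptions for illustration, not the benchmark’s actual data format:

```python
import random

def add_none_of_the_above(question: str, choices: list[str], answer_index: int,
                          rng: random.Random) -> tuple[str, list[str], int]:
    """Replace one randomly chosen option with "None of the above".

    If the removed option was the correct answer, "None of the above"
    becomes the new correct choice. (Hypothetical data layout: the
    benchmark's real format may differ.)
    """
    new_choices = list(choices)
    removed = rng.randrange(len(new_choices))
    new_choices[removed] = "None of the above"
    # Conventionally, "None of the above" goes last, so move it there.
    new_choices.append(new_choices.pop(removed))
    if answer_index == removed:
        new_answer = len(new_choices) - 1   # the right answer was removed
    elif answer_index > removed:
        new_answer = answer_index - 1       # shifted down by the move
    else:
        new_answer = answer_index
    return question, new_choices, new_answer
```

Keeping the correct-answer bookkeeping explicit makes both cases easy to grade: when “None of the above” is now the right choice, and when the original answer survived the rewrite.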
Finally, having models take the multiple-choice section of the exam numerous times, with slight variations in the wording of the questions that don’t change their meaning, could make this section of the exam a little more robust against “luck.”
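A sketch of how that repeated evaluation might be scored, assuming hypothetical `paraphrase` and `grade` helpers (both are placeholders for whatever tooling the graders actually use):

```python
import statistics

def accuracy_over_rewordings(questions, n_variants, paraphrase, grade):
    """Average a model's multiple-choice accuracy over reworded variants.

    `paraphrase(question, seed)` should return the question with its wording
    varied but its meaning unchanged; `grade(question)` should return True
    if the model answers that variant correctly. Both are placeholders.
    """
    per_question = []
    for q in questions:
        hits = [grade(paraphrase(q, seed)) for seed in range(n_variants)]
        per_question.append(sum(hits) / n_variants)
    # A model that merely guessed well on the original wording will score
    # noticeably below 1.0 here; genuine knowledge should hold up across
    # variants.
    return statistics.mean(per_question)
```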
Note: I submitted two materials-science-related multiple-choice questions, with 5 answer options each, for consideration in Humanity’s Last Exam. The submission process (https://agi.safe.ai/submit) was to type your question into an input window and enter as many multiple-choice answers as you wanted (I think the minimum was 5, and there might not have been a maximum); the question was then run through various models (GPT-4o, Sonnet 3.5, Gemini Pro 1.5, o1) to see if they gave the correct answer. The paper says that the screening criterion was that “multiple-choice questions must stump all but one model to account for potential lucky guesses.” I believe I didn’t submit my questions unless they stumped all of the models.
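Based on the paper’s description, that screening rule amounts to something like the following; the `ask` function here is just a stand-in for however the pipeline actually queries each model:

```python
def passes_screening(question, correct_answer, models, ask):
    """Keep a multiple-choice question only if at most one model answers it
    correctly, per the paper's rule for discounting a single lucky guess.

    `ask(model, question)` is a placeholder for the real model-query step.
    """
    n_correct = sum(ask(model, question) == correct_answer for model in models)
    return n_correct <= 1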
In case you’re interested, you can find my one question that made the cut by searching for “sintering” in the dataset available on Hugging Face. For my one question that didn’t make the cut, my strategy was to focus on an area in which some false ideas had been presented in the literature and later cleared up; I figured this might make it harder for LLMs to answer correctly. I don’t know why that question didn’t make the cut, though, so don’t take this strategy as the reason. Just note that it’s possible that some of the other questions on the final list were written with this sort of strategy in mind.
The American Philosophical Association (APA) has announced two $10,000 AI2050 Prizes for philosophical work related to AI, with a June 23, 2024 deadline: https://dailynous.com/2024/04/25/apa-creates-new-prizes-for-philosophical-research-on-ai/