Some Notes on Humanity’s Last Exam
While I congratulate CAIS and Scale AI on producing this benchmark, I have a couple of comments on things they may want to clear up (although I believe these are ultimately a bit “in the weeds” relative to what the benchmark is really concerned with):
First, it might be interesting to take some of the multiple-choice questions and rewrite them by randomly removing one of the answer choices and replacing it with “none of the above.” If the model chooses “none of the above” and the right answer has indeed been removed, see whether it can then come up with the right answer on its own rather than picking it from a list. Personally, I always found multiple-choice questions in which you weren’t sure the right answer was present to be more difficult: when the right answer is there, you can sometimes take clues from it to figure out that it’s the right one. Rewriting some questions in this way could make them a little more difficult without much added work by the exam preparers.
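If it’s useful, here’s a minimal sketch of that rewrite in Python. The function name and the (question, choices, answer_index) layout are my own assumptions for illustration, not the benchmark’s actual data format:

```python
import random

def add_none_of_the_above(question: str, choices: list[str], answer_index: int,
                          rng: random.Random) -> tuple[str, list[str], int]:
    """Replace one randomly chosen option with "None of the above".

    If the removed option was the correct answer, "None of the above"
    becomes the new correct choice. (Hypothetical data layout: the
    benchmark's real format may differ.)
    """
    new_choices = list(choices)
    removed = rng.randrange(len(new_choices))
    new_choices[removed] = "None of the above"
    # Conventionally, "None of the above" goes last, so move it there.
    new_choices.append(new_choices.pop(removed))
    if answer_index == removed:
        new_answer = len(new_choices) - 1   # the right answer was removed
    elif answer_index > removed:
        new_answer = answer_index - 1       # shifted down by the move
    else:
        new_answer = answer_index
    return question, new_choices, new_answer
```

Keeping the correct-answer bookkeeping explicit makes both cases easy to grade: when “None of the above” is now the right choice, and when the original answer survived the rewrite.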
Finally, having models take the multiple-choice section of the exam numerous times, with slight variations in the wording of the questions that don’t change their meaning, could make this section of the exam a little more robust against “luck.”
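A sketch of how that repeated evaluation might be scored, assuming hypothetical `paraphrase` and `grade` helpers (both are placeholders for whatever tooling the graders actually use):

```python
import statistics

def accuracy_over_rewordings(questions, n_variants, paraphrase, grade):
    """Average a model's multiple-choice accuracy over reworded variants.

    `paraphrase(question, seed)` should return the question with its wording
    varied but its meaning unchanged; `grade(question)` should return True
    if the model answers that variant correctly. Both are placeholders.
    """
    per_question = []
    for q in questions:
        hits = [grade(paraphrase(q, seed)) for seed in range(n_variants)]
        per_question.append(sum(hits) / n_variants)
    # A model that merely guessed well on the original wording will score
    # noticeably below 1.0 here; genuine knowledge should hold up across
    # variants.
    return statistics.mean(per_question)
```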
Note: I submitted two materials-science-related multiple-choice questions, with 5 answer options each, for consideration in Humanity’s Last Exam. The submission process (https://agi.safe.ai/submit) was to type your question into an input window and enter as many multiple-choice answers as you wanted (I think the minimum was 5, and there might not have been a maximum); the question was then run through various models (GPT-4o, Sonnet 3.5, Gemini Pro 1.5, o1) to see if they gave the correct answer. The paper says that the screening criterion was that “multiple-choice questions must stump all but one model to account for potential lucky guesses.” I believe I didn’t submit my questions unless they stumped all of the models.
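Based on the paper’s description, that screening rule amounts to something like the following; the `ask` function here is just a stand-in for however the pipeline actually queries each model:

```python
def passes_screening(question, correct_answer, models, ask):
    """Keep a multiple-choice question only if at most one model answers it
    correctly, per the paper's rule for discounting a single lucky guess.

    `ask(model, question)` is a placeholder for the real model-query step.
    """
    n_correct = sum(ask(model, question) == correct_answer for model in models)
    return n_correct <= 1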
In case you’re interested, you can find my one question that made the cut by searching for “sintering” in the dataset available on Hugging Face. For my one question that didn’t make the cut, my strategy was to focus on an area in which some false ideas had been presented in the literature and later cleared up; I figured this might make it harder for LLMs to answer correctly. I don’t know why that question didn’t make the cut, though, so don’t take this strategy as the reason. Just note that it’s possible that some of the other questions on the final list were written with this sort of strategy in mind.
The American Philosophical Association (APA) has announced two $10,000 AI2050 Prizes for philosophical work related to AI, with a June 23, 2024 deadline: https://dailynous.com/2024/04/25/apa-creates-new-prizes-for-philosophical-research-on-ai/