About 30% of Humanity’s Last Exam chemistry/biology answers are likely wrong
Update (20th Sep 2025): Scale AI has revised their Humanity's Last Exam preprint in light of this evaluation, and conducted their own checks on the accuracy of HLE questions, finding an error rate of 18% instead:

> We conducted another targeted peer review on a biology, chemistry, and health subset, as proposed by [47], and found an expert disagreement rate of approximately 18%. This level of expert disagreement is in line with what is observed in other challenging, expert-grade machine learning benchmarks and also observed in other similarly designed work; for example, [6] notes that disagreement among expert physicians is frequent on complex health topics

They also note:

> To illustrate, if we were to adopt a single-reviewer methodology where a question is flagged based on just one dissenting expert, the disagreement rate on the aforementioned health-focused subset jumps from 18% to 25%, which is close to the setting described in [47].

FutureHouse is a company that builds literature research agents. They tested their agents on the bio + chem subset of HLE questions, then noticed errors in the questions themselves. The post's first paragraph:

> Humanity’s Last Exam has become the most prominent eval representing PhD-level research. We found the questions puzzling and investigated with a team of experts in biology and chemistry to evaluate the answer-reasoning pairs in Humanity’s Last Exam. We found that 29 ± 3.7% (95% CI) of the text-only chemistry and biology questions had answers with directly conflicting evidence in peer reviewed literature. We believe this arose from the incentive used to build the benchmark. Based on human experts and our own research tools, we have created an HLE Bio/Chem Gold, a subset of AI and human validated questions.

About the initial review process for HLE questions:

> [...] Reviewers were given explicit instructions: “Questions should ask for something precise and have an objectively correct, univocal answer.” The review process was challenging, and per
I tested this, and if you push Grok 4.1 to explain its claim in more detail, it will admit that a second treaty, signed by the same delegation on the same day, includes the opium ban:
Grok's response:

> **The treaty you are probably thinking of (same year, different treaty)**
>
> There was a separate treaty signed in 1880 between the United States and China that did address opium, but it is not the Angell Treaty.