That's funny. I spent a fair bit of time creating an AI safety knowledge benchmark, sourced from LessWrong/papers/Redwood/METR etc.
It basically also instantly saturated.
Or like, to make it not instantly saturate, I had to make the questions so specific as to be measuring nonsense. Like asking "In the Alignment Faking paper, what was the refusal rate of llama 3.1-8b in the minimal reproduction of the helpful-only setting with prefix-matching as a proxy for non-refusals?" or like "what was paul christiano's reply to SMK's reply to his essay 'three reasons to cooperate'".
We present and formally deprecate WoFBench, a novel test that compares the knowledge of Wings of Fire superfans to frontier AI models. The benchmark showed initial promise as a challenging evaluation, but unfortunately proved to be saturated on creation as AI models produced output that was, to the extent of our ability to score responses, statistically indistinguishable from entirely correct.
Benchmarks are important tools for tracking the rapid advancements in model capabilities, but they are struggling to keep up with LLM progress: frontier models now consistently achieve high scores on many popular benchmarks, raising questions about their continued ability to differentiate between models.
In response, we introduce WoFBench, an evaluation suite designed to test recall and knowledge synthesis in the domain of Tui T. Sutherland's Wings of Fire universe.
The superfans were identified via a careful search process, in which all members of the lead author's household were asked to complete a self-assessment of their knowledge of the Wings of Fire universe. The assessment consisted of a single question, with the text "do you think you know the Wings of Fire universe better than Gemini?" Two superfans were identified, whom we keep anonymous to reduce the risk of panel poaching by competing benchmark efforts.
Identification of questions proved difficult, as the benchmark authors have extremely limited knowledge of Wings of Fire lore, primarily derived from infodumping and overheard arguments. We initially attempted to source questions from the superfans, where each could be judged on the other's questions. As they were uncompensated and rivalrous, however, they agreed to participate only to the extent that their answers could be compared across the superfan panel. Instead, questions were sourced by asking Claude Opus 4.6:
The final benchmark consisted of seventeen questions, limited primarily by the lead author's willingness to continue. The elder superfan appeared indefatigable,[1] and if this benchmark otherwise appeared promising we are confident that an extremely large benchmark could be constructed. Note that the younger superfan needed to leave for a birthday party before evaluation could be completed, and was not evaluated on all questions. Answers were collected in written form, to avoid leakage within the superfan panel. No points were deducted for errors of spelling.
Each answer was validated by allowing the superfans to discuss, asking follow-up questions to Gemini, and in especially contentious cases by direct inspection of primary sources. Note that this validation procedure cannot distinguish cases in which all superfans and models were correct from ones in which they all gave the same incorrect answer.
We evaluated Gemini 3.1 Pro in real time, and followed up with evaluations of Claude Opus 3.6, ChatGPT 5.2 Pro, and ELIZA. For questions with multiple components, partial credit was awarded as the fraction of components answered correctly.
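For concreteness, the partial-credit rule can be sketched as follows. This is a minimal illustration only; the function names and example data are invented, not drawn from the actual evaluation:

```python
def score_question(component_results):
    """Score one question as the fraction of its components judged correct.

    component_results: list of booleans, one per question component.
    (Hypothetical helper; the real scoring was done by hand.)
    """
    if not component_results:
        raise ValueError("a question must have at least one component")
    return sum(component_results) / len(component_results)

def benchmark_score(per_question_results):
    """Mean of per-question scores, expressed as a percentage."""
    scores = [score_question(q) for q in per_question_results]
    return 100 * sum(scores) / len(scores)

# Invented example: a three-part question with two components correct
# earns 2/3 credit; a single-part question answered correctly earns 1.
results = [
    [True, True, False],  # 2/3 credit
    [True],               # full credit
]
overall = benchmark_score(results)
```

Under this rule, a respondent who nails every component of every question scores exactly 100%, which is roughly what the frontier models did.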
We conclude that while some AI systems, notably ELIZA, performed poorly, all frontier models scored very close to 100%. Many of the lost points are arguably judgment calls, or cases where a model tried to interpret a trick/misinformed question maximally charitably. Superfan 1 performed noticeably below frontier models, though above the ELIZA baseline. Superfan 2 performed competitively, though we note she was not evaluated on the questions where Superfan 1 lost the most points, making direct comparison difficult.
While this benchmark was designed to be challenging for both superfans and AIs, it already has very limited ability to distinguish between models. While further sensitivity might be squeezed out via the addition of multi-sample evaluation, it's unlikely that this would be meaningful for this model generation, let alone future ones. This reflects an increasingly common conundrum for benchmark developers: after investing large amounts of time, effort, and money, a benchmark may already be obsolete by the time it is published. The authors note that benchmark saturation joins job displacement, stable authoritarianism, and human extinction on the list of reasons to be concerned about the pace of AI progress.
[1] Superfan 1 was permitted to read a draft of this report prior to publication. Their only feedback was that I should ask them additional, harder questions. As of publication time, Superfan 1 was repeating "ask me more Wings of Fire questions!" at progressively increasing volume.
Comment via: facebook, mastodon, bluesky