Conclusion
Current LLM reasoning ability: as of March 2025, the real reasoning capability of publicly available LLMs is roughly 50 times lower than what benchmarks such as AIME suggest.
False marketing about LLM reasoning ability is currently rampant on the Internet. The claims are usually strong: a model scores 80%+ on mathematical benchmarks that most people consider difficult even though they require little background knowledge, or it is credited with "doctoral-level" intelligence on the basis of erudition-heavy tests. With a skeptical attitude, we designed some questions of our own.
https://llm-benchmark.github.io (all questions and model answers can be expanded there)
Testing Methodology
The premise of testing an LLM's real, generalizable reasoning ability is that the tester is able to ask genuinely new questions.
Question structure: generalized reasoning posed in text form, with as little background knowledge as possible; concretely, no high-school mathematics is required (this does not exclude the auxiliary role that acquired knowledge can play in solving a problem).
Ensure generalization:
Several different experimental methods:
Assume the question creator has a purpose: to support his claim, he wants to demonstrate as strongly as possible that the LLM's generalizable reasoning ability is low. He therefore wants his questions to be as easy as possible for humans while the LLM remains completely unable to answer them. If we imagine a competition among creators, with n = problem difficulty and d = the target LLM's error rate, a creator's score would be d / n^2 (a minimal sketch of this scoring appears after the list of methods below).
- A fairer method: the creator has never interacted with the target LLM and tries to create questions that are "novel" relative to his own knowledge structure.
- After interacting with the target LLM, the creator has some understanding of it and deliberately creates questions he believes are "novel" for that target; during creation he may not access the target LLM again.
- The creator takes out a series of prepared questions, tests the target repeatedly, and keeps the questions the target cannot handle at all.
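To make the incentive concrete, here is a minimal sketch of that scoring rule (the d / n^2 formula comes from the text above; the function name and example numbers are just illustrative):

```python
def creator_score(difficulty: float, error_rate: float) -> float:
    """Creator's score under the rule above: score = d / n**2,
    where n is the human-rated problem difficulty (> 0) and
    d is the target LLM's error rate in [0, 1]."""
    if difficulty <= 0:
        raise ValueError("difficulty must be positive")
    return error_rate / difficulty ** 2

# A trivially easy question (difficulty 1) that the LLM always gets wrong
# scores far higher than a hard question (difficulty 4) it also gets wrong.
print(creator_score(1.0, 1.0))  # 1.0
print(creator_score(4.0, 1.0))  # 0.0625
```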
Verify the absolute difficulty: the experimenter can recruit math and science enthusiasts, competition participants and teachers, STEM practitioners, and people from mathematics, physics, computing, and other fields with logical training. The experiment assumes that every question is well below the ability ceiling of every participant, so all participants can solve them comfortably and report their perceived difficulty.
Real generalizable reasoning ability: the final evaluation of the target LLM works roughly as follows. Imagine asking people with a serious mathematical background or logic training to create questions that are as "novel" as possible and as easy as possible, with the required knowledge limited to middle-school level. The lower limit of the difficulty at which the target LLM can still achieve at least a 20% correct rate on such question sets is taken as its final ability rating.
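A hedged sketch of one way this rating could be computed, assuming we have measured the model's accuracy on adversarially created question sets at each difficulty level; the 20% threshold comes from the text, while the data structure, names, and the exact reading of "lower limit" are assumptions:

```python
def ability_rating(accuracy_by_difficulty: dict[float, float],
                   threshold: float = 0.20) -> float | None:
    """Scan difficulty levels from easiest to hardest and report the last
    level at which the target LLM still reaches `threshold` accuracy on
    adversarially created questions (one reading of the rule above).
    Returns None if the model never clears the bar."""
    rating = None
    for difficulty in sorted(accuracy_by_difficulty):       # easiest first
        if accuracy_by_difficulty[difficulty] >= threshold:
            rating = difficulty                              # still holding at this level
        else:
            break                                            # creators break the model here
    return rating

# Hypothetical accuracies on adversarial sets, keyed by AoPS-style difficulty.
example = {0.5: 0.35, 1.0: 0.22, 1.5: 0.08, 2.0: 0.03}
print(ability_rating(example))  # 1.0 under this reading
```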
Reference difficulty: the AoPS (Art of Problem Solving) competition mathematics community divides problem difficulty into 10 levels: https://artofproblemsolving.com/wiki/index.php/AoPS_Wiki:Competition_ratings
But I think the difficulty levels on that page are calibrated for a person who has never received special training but knows all the necessary terms, concepts, and rules (imagine a student with standard textbook knowledge facing the problem for the first time).
Most of the problems I tried to create were around difficulty 1-2, while AIME problems are rated 3-6. On most benchmarks, the strongest publicly accessible models are reported at 70%-90% on AIME:
- https://x.ai/news/grok-3
- https://openai.com/index/openai-o3-mini/
- https://blog.google/technology/google-deepmind/gemini-model-thinking-updates-march-2025/#gemini-2-5-pro
Experimental results
Currently, publicly accessible models cannot reliably solve problems that are more than 50 times easier than AIME (with difficulty measured via the standard deviation of the proportion of people who can solve them).
If you are reading this article and have received some logic training (in any field), you may find that the gap here is about the same as:
being required to walk 10 meters on flat ground at a leisurely park-walk pace without falling VS being required to enter the campus 100-meter race and win a medal.
All the questions and the models' answers are posted at https://llm-benchmark.github.io
Here are some variations on classic puzzles:
Question 1: [Weighing]
[There are 13 balls, one of which is a counterfeit that may be either lighter or heavier than the rest. You have a balance scale without markings. Initially, p = 0. Each time a ball that has already been on the scale is placed on the scale again, the count p increases by 1. For example, if ball #1 and ball #2 have each been weighed once before, placing both of them on the scale again increases p by 2.
Requirement: find the counterfeit ball and determine whether it is lighter or heavier, while ensuring that p never exceeds 1.]
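To make the counting rule unambiguous, here is a small sketch of how p accrues (the class and method names are illustrative and not part of the puzzle; the comparison result itself is omitted):

```python
class Balance:
    """Tracks the counter p from Question 1: p increases by 1 for every
    ball placed on the scale that has already been on the scale before."""

    def __init__(self):
        self.seen = set()   # balls that have been on the scale at least once
        self.p = 0

    def weigh(self, left: list[int], right: list[int]) -> None:
        for ball in left + right:
            if ball in self.seen:
                self.p += 1          # a repeated appearance costs 1
            self.seen.add(ball)

b = Balance()
b.weigh([1, 2, 3, 4], [5, 6, 7, 8])  # first weighing: every ball is new, p stays 0
b.weigh([1, 2], [9, 10])             # balls 1 and 2 reappear, so p increases by 2
print(b.p)                           # 2
```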
Question 2: [Weighing 2]
[Here are twelve small balls, all normal, but there is a magic bug, invisible to the naked eye. Initially, it quietly attaches to one of the balls and randomly produces an effect: either decreasing or increasing the weight of that ball. This effect only exists when the bug is attached; as the bug moves, the effect moves with it (the previously affected ball returns to normal).
You have a scale, but you must pay $10 each time for it to display (refresh the screen) which side is heavier; every new measurement result requires a payment before it is shown.
The bug has a special characteristic: whenever the ball it is attached to leaves the scale (for example, when you pick the ball up with your hand or another tool) while the other pan is not empty but still holds balls, the bug randomly transfers to one of the balls on that other pan. You have only one single-use trap. What do you think is the best plan to find the ball with the bug attached and trap it? (You want to spend as little money as possible.)]
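In case the transfer rule is unclear, here is a hedged sketch of the puzzle's mechanics as a simulator (the class, method names, and unit ball weights are assumptions; the $10-per-reading cost and the bug's jump rule come from the puzzle):

```python
import random

class BugScale:
    """Simulates Question 2: each reading of the scale costs $10, and when the
    ball carrying the bug is taken off the scale while the other pan still
    holds balls, the bug jumps to a random ball on that other pan."""

    def __init__(self, n_balls: int = 12):
        self.bug = random.randrange(1, n_balls + 1)   # ball currently carrying the bug
        self.effect = random.choice([-1, +1])         # bug makes its ball lighter or heavier
        self.cost = 0                                 # dollars spent on readings

    def read(self, left: list[int], right: list[int]) -> str:
        """Pay $10 to see which side is heavier (each normal ball weighs 1 unit)."""
        self.cost += 10
        lw = len(left) + (self.effect if self.bug in left else 0)
        rw = len(right) + (self.effect if self.bug in right else 0)
        return "left" if lw > rw else ("right" if rw > lw else "equal")

    def remove(self, ball: int, other_pan: list[int]) -> None:
        """Take `ball` off the scale; if it carries the bug and the other pan
        is not empty, the bug transfers to a random ball on the other pan."""
        if ball == self.bug and other_pan:
            self.bug = random.choice(other_pan)
```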
PS: I am not sure whether this is a coincidence. There were questions that the model could not answer correctly after repeated testing (before March 2025). When I set up this website and tested again, it magically answered them correctly, several in a row. I had indeed discussed those questions elsewhere, but I never posted the correct answers. In short, I have completely replaced them with similar variants, and now the model fails them again.
The failures often seem to be related to the model getting stuck reasoning about the problem in a way that pattern-matches too strongly to similar problems, and that is why it fails. Did you notice this as well?