Expanding on the Jacob Steinhardt quote from August 2021,
Current performance on this dataset is quite low--6.9%--and I expected this task to be quite hard for ML models in the near future. However, forecasters predict more than 50% accuracy* by 2025! This was a big update for me...
If I imagine an ML system getting more than half of these questions right, I would be pretty impressed. If they got 80% right, I would be super-impressed. The forecasts themselves predict accelerating progress through 2025 (21% in 2023, then 31% in 2024 and 52% in 2025), so 80% by 2028 or so is consistent with the predicted trend. This still just seems wild to me and I'm really curious how the forecasters are reasoning about this... Even while often expressing significant uncertainty, forecasters can make bold predictions. I'm still surprised that forecasters predicted 52% on MATH, when current accuracy is 7% (!). My estimate would have had high uncertainty, but I'm not sure the top end of my range would have included 50%. I assume the forecasters are right and not me, but I'm really curious how they got their numbers.
Google's model obtained 50.3% on MATH, years ahead of schedule.
What is expert level on competition math problems? Do undergrads regularly get half right?
EDIT: someone answered elsewhere in the comments. Looks like this model is still well behind an expert human.
I'll restate my prior prediction: chain-of-thought reasoning with large language models solves general intelligence. No further "deep" insights or paradigm changes are needed, only scale and relatively simple tweaks to improve the quality of the reasoning.
One slightly counterintuitive thing about this paper is how little it improves on the GSM8K dataset, given that it does very well on relatively advanced test sets.
GSM8K (Grade School Math 8K) is a set of roughly 8,000 word problems suitable for middle-schoolers. It has problems like:
"Natalia sold clips to 48 of her friends in April, and then she sold half as many clips in May. How many clips did Natalia sell altogether in April and May?"
"Randy has 60 mango trees on his farm. He also has 5 less than half as many coconut trees as mango trees. How many trees does Randy have in all on his farm?"
Minerva improves the SOTA on this, but only moves it from 74.5% to 78.5%, which is not as big a deal.
My innate / naive sense of how hard the MATH problems are would lead me to think you could get > 90% on GSM8K if you could get 50% on MATH. But obviously my gut sense is off.
I'd be really curious to know what's going on here.
The previous SOTA for MATH (https://arxiv.org/pdf/2103.03874.pdf) is a fine-tuned GPT-2 (1.5b params), whereas the previous SOTA for GSM8K (https://arxiv.org/pdf/2203.11171.pdf) is PaLM (540b params), using a "majority voting" method similar to Minerva's (query each question ~40 times, take the most common answer).
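For concreteness, here is a minimal sketch of what that majority-voting ("self-consistency") procedure looks like; `sample_answer` is a hypothetical stand-in for a single stochastic (nonzero-temperature) model call that returns the parsed final answer:

```python
from collections import Counter

def majority_vote(question, sample_answer, k=40):
    """Ask the model the same question k times and return the most
    common final answer. `sample_answer` is a hypothetical stand-in
    for one stochastic model call returning a parsed answer string."""
    answers = [sample_answer(question) for _ in range(k)]
    # most_common(1) returns [(answer, count)]; ties break in
    # first-sampled order.
    return Counter(answers).most_common(1)[0][0]
```

The intuition is that independently sampled reasoning chains tend to make uncorrelated errors, so the correct final answer shows up more often than any particular wrong one.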
Here's something odd that I noticed in one of the examples in the blogpost (https://ai.googleblog.com/2022/06/minerva-solving-quantitative-reasoning.html).
The question is the one that in part reads "the variance of the first n natural numbers is 10". The model's output states, without any reasoning, that this variance is equal to (n^2 - 1)/12, which is correct. Since no reasoning was used, I think it's safe to assume that the model memorized this formula.
This is not a formula that a random math student would be expected to have memorized. (Anecdotally, I have a mathematics degree and don't know it.) Because of that, I'd expect that a typical (human) solver would need to derive the formula on the spot. It also strikes me as the sort of knowledge that would be unlikely to matter outside a contest, exam, etc.
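For what it's worth, the on-the-spot derivation only needs the standard sums \(\sum_{k=1}^{n} k = n(n+1)/2\) and \(\sum_{k=1}^{n} k^2 = n(n+1)(2n+1)/6\):

$$\mathrm{Var} = \frac{1}{n}\sum_{k=1}^{n} k^2 - \left(\frac{1}{n}\sum_{k=1}^{n} k\right)^2 = \frac{(n+1)(2n+1)}{6} - \frac{(n+1)^2}{4} = \frac{(n+1)(n-1)}{12} = \frac{n^2 - 1}{12},$$

and setting this equal to 10 gives \(n^2 = 121\), i.e. \(n = 11\).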
That all leads me to think that the model might be over-fitting somewhat to contest/exam/etc.-style questions. By that I mean that it might be memorizing facts that are useful when answering such questions but are not useful when doing math more broadly.
To be clear, there are other aspects of the model output, here and in other questions, that seem genuinely impressive in terms of reasoning ability. But the headline accuracy rate might be inflated by memorization.
The model’s performance is still well below human performance
At this point I have to ask what exactly is meant by this. The bigger model beats the average human performance on the national math exam in Poland. Sure, the people taking this exam are usually not adults, but for many it may be where they peak in their mathematical abilities, so I wouldn't be surprised if it beats average human performance in the US. It's all rather vague though; looking at the MATH dataset paper all I could find regarding human performance was the following:
Human-Level Performance. To provide a rough but informative comparison to human-level performance, we randomly sampled 20 problems from the MATH test set and gave them to humans. We artificially require that the participants have 1 hour to work on the problems and must perform calculations by hand. All participants are university students. One participant who does not like mathematics got 8/20 = 40% correct. A participant ambivalent toward mathematics got 13/20. Two participants who like mathematics got 14/20 and 15/20. A participant who got a perfect score on
the AMC 10 exam and attended USAMO several times got 18/20. A three-time IMO gold medalist got 18/20 = 90%, though missed questions were exclusively due to small errors of arithmetic. Expert-level performance is theoretically 100% given enough time. Even 40% accuracy for a machine learning model would be impressive, but would have ramifications for cheating on homework.
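Putting those reported scores side by side (the percentages are mine):

Dislikes mathematics: 8/20 (40%)
Ambivalent toward mathematics: 13/20 (65%)
Likes mathematics: 14/20 and 15/20 (70%, 75%)
Perfect AMC 10 score, attended USAMO: 18/20 (90%)
Three-time IMO gold medalist: 18/20 (90%)

Minerva's 50.3% on MATH falls between the first two.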
So, on these competition math problems, the model would land somewhere between university students who dislike mathematics and ones who are neutral towards it? Maybe. It would be nice to get more details here; I assume they didn't think much about human-level performance, since the previous SOTA was clearly very far from it.
They test on the basic tier (Poziom podstawowy) of the Polish Matura math exam.
In countries with Matura-based education, the basic-tier math test is not usually taken by mathematically inclined students -- it is just the law that anyone going to a public university has to pass some sort of math exam beforehand. Students who want to study anything where mathematical skills are needed would take the higher tier (Poziom rozszerzony).
Can someone from Poland confirm this?
A quick estimate suggests that 50-75% of Polish high-school students take the Matura exams, though. If the number of students taking the higher tier is not too large, then average performance on the basic tier corresponds to essentially average human-level performance on this kind of test.
Note that many students taking the basic math exam only want to pass and not necessarily perform well; and some of the bottom half of the 270k students are taking the exam for the second or third time after failing before.
Google doesn’t seem interested in serving large models until it has a rock solid solution to the “if you ask the model to say something horrible, it will oblige” problem.
I think that is the right call. Anecdotal bad outputs would probably go viral and create a media firestorm, with the "stochastic parrots" Twitter crowd beating them over the head along the way. Not sure you can ever get it perfect, but they should probably get close before releasing it publicly.
At the same time, a good math-solving chatbot could be really useful for math-averse people, even with brittle performance. I'm not sure it's worth the risk, but it might be worth considering.
You'll also get people complaining that it'll help students cheat, because testing is more important than education to people involved in the education system.
I think that's unfair.
Students to whom learning is more important than test results won't cheat either way. Students to whom test results are more important than learning will cheat if it's easy and reluctantly fall back on actually learning the material if they have to. Educators who care whether their students learn will prefer the latter outcome.
(It is sometimes also true that educators care more about testing than teaching. But I don't think that's anything like the only reason why they will complain about things that make it very easy for students to cheat.)
Students also might reason (maybe correctly) that if AI is already better than most humans will ever be in their lifetimes, why exactly are they spending all this time on things like symbolic manipulation and arithmetic by hand anyway?
A new paper from Google, in which they get a language model to solve some (to me, terrifyingly impressive) tasks that require quantitative reasoning skills. The abstract reads as follows:
Some of the results are quite relevant to forecasting AI progress. From @bneyshabur:
Some further information (excerpted from across the paper):
You can read the full paper here: https://storage.googleapis.com/minerva-paper/minerva_paper.pdf