Kevin Amiri


I recently translated 100 AIME-level math questions from another language into English as a test set for a Kaggle competition. The best model was GPT-4-32k, which solved only 5-6 questions correctly; the rest of the models managed just 1-3.

Then I tried the MATH dataset. While the difficulty level was similar, the results were surprisingly different: the models solved 60-80% of the problems correctly.
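For what it's worth, the comparison I ran amounts to a simple eval loop like the sketch below. `ask_model` is a hypothetical stub standing in for the actual model API call, and the two problem lists are toy stand-ins, not the real test sets:

```python
# Sketch of the solve-rate comparison described above.
# ask_model is a hypothetical stand-in for a real model API call
# (GPT-4-32k etc.); here it always answers "0" so the sketch runs.

def ask_model(question: str) -> str:
    return "0"

def solve_rate(problems: list[tuple[str, str]]) -> float:
    """Fraction of (question, answer) pairs the model gets right."""
    correct = sum(ask_model(q).strip() == a.strip() for q, a in problems)
    return correct / len(problems)

# Toy stand-ins for the translated AIME set and the MATH set.
translated_aime = [("What is 2+2?", "4"), ("What is 3*3?", "9")]
math_set = [("What is 0*5?", "0"), ("What is 0+0?", "0")]

print(f"translated AIME: {solve_rate(translated_aime):.0%}")
print(f"MATH: {solve_rate(math_set):.0%}")
```

The gap between the two printed rates is the whole puzzle: same loop, same grading, very different accuracy.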

I don't see any improvement from o1 on this either.

Is this a well-known phenomenon, or am I onto something significant here?