There's been criticism somewhere or other - twitter? - of BIG-Bench for having a lot of idiosyncratic tasks that stronger models will not necessarily perform better on. The issue tracker might be of use in finding this criticism though I don't think it's a full overview of the tasks of concern. In particular, some of the tasks are very opinion-based, and even constraining to reasonable humans of similar intelligence and education one still might see disagreement. That's not to say the task is useless, though.
Interesting. Even if only a small part of the tasks in the test are poor estimates of general capabilities, it makes the test as a whole less trustworthy.
yeah, at least in terms of being a raw intelligence test. the easiest criticism is that the test has political content, which of course means that even to the degree the political content is to any one person's liking, the objectively true answer is significantly more ambiguous than ideal. alignment/friendliness and capabilities/intelligence can never be completely separated when the rubber hits the road. part of the problem with politics is that we can't be completely sure we've mapped the consequences of our personal views on basic moral philosophies correctly, so more reasoning power can cause a stronger model to behave arbitrarily differently on morality-loaded questions. and there may be significantly more arbitrary components than that in any closed book language based capabilities test.
see also the old post on [[guessing the teacher's "password"]], guessing what the teacher is thinking
True.
And while there might be some uses for such benchmarks on politics etc., combining them with general capability benchmarks doesn't really seem useful.
Epistemic status: Been thinking about this for about a week, and done some basic googling.
Intro
Creating benchmarks that measure and compare the intelligence of AIs against humans seems useful for tracking progress and predicting future development.
This article summarizes some existing benchmarks, and reasons about what new benchmarks could potentially be useful.
Please share your own thoughts in the comments, it’s much appreciated.
Benchmark 1 - Coding
Performance on coding problems could be a useful benchmark: coding tasks are objectively scorable and require multi-step reasoning.
Currently, AlphaCode by DeepMind is the most prominent code-solving AI, performing better than 44% of competitive programmers — a population that likely performs well above the average human, even one with programming experience.
Benchmark 2 - Number Sequences
A good number-sequence test should measure the ability to recognize novel patterns. If each sequence in the test contains a unique, novel pattern, performance on it should be some sort of proxy for reasoning ability.
I didn’t find any good existing benchmarks for this, so I decided to test ChatGPT on a test I found online. I gave ChatGPT each number sequence and asked it to guess the next number. I did not show it the multiple-choice options; when its answer was not among them, I picked the top option, which turned out to be correct in two of those cases.
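The scoring procedure above can be sketched in a few lines. The sequences and model guesses below are invented stand-ins, not the actual test items:

```python
# Minimal sketch of the scoring procedure described above.
# The items and model guesses are made up for illustration;
# the actual online test's questions are not reproduced here.

def score(predictions, answers):
    """Count exact matches between model guesses and the answer key."""
    return sum(p == a for p, a in zip(predictions, answers))

# Hypothetical items: (sequence shown to the model, correct next number)
items = [
    ([2, 4, 8, 16], 32),    # powers of two
    ([1, 1, 2, 3, 5], 8),   # Fibonacci
    ([3, 6, 9, 12], 15),    # arithmetic, step 3
]
answers = [a for _, a in items]
model_guesses = [32, 8, 15]  # stand-in for the model's answers

correct = score(model_guesses, answers)
print(f"{correct}/{len(items)} correct")  # here: 3/3
```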
The AI got 11/20 correct (of which 2 were lucky guesses from my picking the top option), which the site claims corresponds to an IQ of 115. I doubt that is accurate, but if it were, it would mean the AI is better at number sequences than 84% of humans. I find it more likely that the AI is better than somewhere around 20–30% of people on number sequences (given the test's time limit of 20 minutes for 20 questions).
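The site's percentile claim is at least internally consistent: with IQ scores modeled as Normal(100, 15), a score of 115 is one standard deviation above the mean, which lands at roughly the 84th percentile:

```python
# Sanity check of the IQ-to-percentile conversion, assuming the standard
# model of IQ scores as normally distributed with mean 100 and SD 15.
from statistics import NormalDist

iq = NormalDist(mu=100, sigma=15)
percentile = iq.cdf(115)
print(f"IQ 115 is roughly the {percentile:.1%} percentile")  # ~84.1%
```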
Benchmark 3 - Big Bench by Google
The Beyond the Imitation Game Benchmark (BIG-bench) is a collaborative benchmark intended to probe large language models and extrapolate their future capabilities. More than 200 tasks are included in BIG-bench. PaLM with 5-shot prompting slightly outperforms the average human.
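"5-shot" here means the model sees five solved examples before the test question. A minimal sketch of how such a prompt is assembled (the exemplars below are invented, and real evaluation harnesses vary in formatting details):

```python
# Illustration of k-shot prompting: k solved exemplars are prepended
# before the unanswered test question. Exemplars are invented.

def build_k_shot_prompt(exemplars, question, k=5):
    """Format the first k (question, answer) pairs, then the test question."""
    shots = [f"Q: {q}\nA: {a}" for q, a in exemplars[:k]]
    return "\n\n".join(shots + [f"Q: {question}\nA:"])

exemplars = [("2 + 2 = ?", "4"), ("3 + 5 = ?", "8")]
prompt = build_k_shot_prompt(exemplars, "7 + 6 = ?", k=2)
print(prompt)
```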
Ideas of Further Benchmarks
Adults could chat with either a child or an AI, and have to determine which is which. AI progress could then be tracked by the oldest age group the AI can convincingly imitate.
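Scoring such a test comes down to whether the judges beat chance: if they cannot tell the AI from the child better than a coin flip, the imitation succeeds. A sketch of the analysis, using an exact one-sided binomial test (all counts below are illustrative):

```python
# If judges cannot distinguish the AI from the child, their accuracy
# should be ~50%. An exact one-sided binomial p-value tests whether
# observed accuracy beats chance. The counts are illustrative.
from math import comb

def binom_p_value(successes, trials, p=0.5):
    """P(X >= successes) for X ~ Binomial(trials, p)."""
    return sum(comb(trials, k) * p**k * (1 - p)**(trials - k)
               for k in range(successes, trials + 1))

# e.g. judges identify the AI correctly in 32 of 50 conversations
pval = binom_p_value(32, 50)
print(f"p = {pval:.3f}")  # small p means judges reliably beat chance
```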
Final Thoughts
Benchmarks suggest that AI already outperforms a significant share of the population on most reasoning, problem-solving, and general-intelligence tasks. However, AIs are often portrayed as stupid, because they make mistakes that look obvious to humans and because they lack long-term memory. I argue that this viewpoint is incorrect: AIs are more intelligent than commonly portrayed, but are simply not prompted/trained to give the type of answer a human would expect. In my tests, ChatGPT is not necessarily smarter than the first version of GPT-3, but it is certainly more human-friendly and can thus seem more intelligent.
If we truly have gone from AI being smarter than a small fraction of the population to being smarter than a large chunk of it, that might indicate that shorter AGI timelines are more likely.
Please leave comments with suggestions on other benchmarks, feedback on the post, or any other thoughts you’d like to share.