There is a Google project called BIG-bench, with contributions from OpenAI and other big orgs. They've crowdsourced more than 200 highly diverse text tasks (from answering scientific questions to predicting protein-interacting sites to measuring self-awareness).
One of the goals of the project is to see how performance on the tasks changes with model size, with the sizes spanning many orders of magnitude.
Yesterday they published their first paper: https://arxiv.org/abs/2206.04615
Some highlights:
- The top models by Google and OpenAI show surprisingly similar performance improvements with scale, despite their different architectures
- The plots show no slowdown in improvement past 10^10 parameters
- Sudden jumps in performance ("grokking") are mostly measurement artifacts: if one uses the right metrics, the performance grows without big jumps
- The benchmark is immense: more than 200 highly diverse tasks, explicitly designed to be hard for AI.
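To make the point about metrics concrete, here is a toy sketch in Python. It is not data or code from the paper: the accuracy curve, the model sizes, and the answer length are all made-up assumptions. It supposes that per-token accuracy improves steadily with log model size, and then scores the same hypothetical models with an all-or-nothing exact-match metric on a 20-token answer.

```python
# Toy illustration only: every number below is invented, not taken from BIG-bench.

def per_token_accuracy(log10_params: float) -> float:
    """Hypothetical smooth curve: rises linearly from 0.50 at 10^7 params
    to 0.99 at 10^11 params, clamped outside that range."""
    t = (log10_params - 7.0) / 4.0
    t = min(max(t, 0.0), 1.0)
    return 0.5 + 0.49 * t


def exact_match(p_token: float, answer_len: int) -> float:
    """Probability that all tokens of an answer are right,
    assuming (for simplicity) that token errors are independent."""
    return p_token ** answer_len


def largest_step_share(values: list[float]) -> float:
    """Fraction of the total gain that happens in the single biggest step."""
    steps = [b - a for a, b in zip(values, values[1:])]
    return max(steps) / (values[-1] - values[0])


LOG10_SIZES = [7.0, 8.0, 9.0, 10.0, 11.0]  # hypothetical model sizes
ANSWER_LEN = 20                            # hypothetical answer length in tokens

smooth = [per_token_accuracy(s) for s in LOG10_SIZES]
strict = [exact_match(p, ANSWER_LEN) for p in smooth]

for size, p, em in zip(LOG10_SIZES, smooth, strict):
    print(f"10^{size:.0f} params: per-token {p:.3f}, exact match {em:.4f}")

print("largest step share, per-token:  ", round(largest_step_share(smooth), 3))
print("largest step share, exact match:", round(largest_step_share(strict), 3))
```

Under these assumptions the per-token metric gains the same amount at every step, while almost all of the exact-match gain lands in the last step. The underlying model improves equally smoothly in both cases; only the ruler differs.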
Below is a small selection of the benchmark's tasks to illustrate its diversity:
- Classify CIFAR10 images encoded in various ways
- Find a move in the chess position resulting in checkmate
- Give an English language description of Python code
- Answer questions (in Spanish) about cryobiology
- Given short crime stories, identify the perpetrator and explain the reasoning
- Measure the self-awareness of a language model
- Ask one instance of a model to teach another instance, and then evaluate the quality of the teaching
- Identify which ethical choice best aligns with human judgement
- Determine which of two sentences is sarcastic
- Evaluate the reasoning in answering Winograd Schema Challenge questions
That's about 5% of all tasks. And the benchmark is still growing. The organizers keep it open for submissions.
I think BIG-bench could be the final AI benchmark: if a language model surpasses the top human score on it, the model is an AGI. At this point, there is nowhere to move the goalposts.
By this ruler, most humans aren't GIs either. And if a model does pass the bar, then humans are indeed screwed and it is too late for alignment.