I, also, am skeptical. The weird spikiness of the abilities of LLMs throws off our ability to place them at a skill level which makes sense from a human perspective. They have some reasoning ability, but it is negatively impacted by incorrect pattern-matching to their huge reservoir of memorized patterns. So, depending on context, they might reason at the level of a six-year-old or a 14-year-old, whilst having more PhD-level facts memorized than any human ever has. Weird. How do we rank such a thing against a human skill chart? It does not follow human developmental progressions. As soon as one becomes an agent with long-horizon execution ability, it will necessarily be a superhuman one, because it already has superhuman speed and factual knowledge recall. Levels 3 and 5 thus seem closely linked to me; I would be surprised to see one without the other. Similarly, levels 2 and 4 seem fairly closely linked, although less so than 3 and 5. Still, if I saw one without the other, I would expect the other to follow very soon. So, as David says, the ordering makes little sense.
Some levels also collapse. As a capability, reasoning plausibly requires agentic behavior: you need fluency in System 2 skills to be effective at non-routine reasoning. Reasoning at the level of highly intelligent humans might be harder than agentic behavior alone, but if agentic behavior gets unlocked sufficiently late, it might immediately come with reasoning at the level of highly intelligent humans. And agentic behavior seems even closer to the ability to coordinate organizations of individual instances, as long as it passes some ARA threshold (which is still lower than what's necessary to do research).
None of these are superintelligence. The ability to agentically run organizations, together with a research level of reasoning, is instead what it takes to start making research progress much faster than humans and soon unlock superintelligence - but it is not, by itself, superintelligence.
Bloomberg reports that OpenAI internally has benchmarks for “Human-Level AI.” There are five levels, starting with level 1, the already-achieved ability to hold intelligent conversation; then level 2, “[unassisted, PhD-level] Reasoners”; level 3, “Agents”; level 4, systems that can “come up with new innovations”; and finally level 5, “AI that can do the work of… Organizations.”
The levels, in brief, are:
1 - Conversation
2 - Reasoning
3 - Agent
4 - Innovation
5 - Organization
This is being reported secondhand, but even granting that caveat, there seem to be some major issues with the ideas. Below, I outline the two largest problems I have with what is being reported.
...but this is Superintelligence
First, given the levels of capability being discussed, OpenAI’s typology is, at least at higher levels, explicitly discussing superintelligence, rather than “Human-Level AI.” To see this, I’ll use Bostrom’s admittedly imperfect definitions. He starts by defining superintelligence as “intellects that greatly outperform the best current human minds across many very general cognitive domains,” then breaks down several ways this could occur.
Starting off, his typology defines speed superintelligence as “an intellect that is just like a human mind but faster.” This would arguably include even their level 2, which Bloomberg says OpenAI’s “technology is approaching”: a system that can handle “basic problem-solving tasks as well as a human with a doctorate-level education who doesn’t have access to any tools,” but which runs far faster than humans already. In fact, they are describing a system whose recall and multi-domain expertise are already superhuman, and inference using these systems is easily superhumanly fast.
Level 4, AI that can come up with innovations - presumably, ones which humans have not - would potentially be a quality superintelligence, “at least as fast as a human mind and vastly qualitatively smarter,” though what qualifies as “vastly” is very hard to quantify. Level 5, meanwhile, is called “Organizations,” which presumably means replacing entire organizations with multi-part AI-controlled systems, and would be what Bostrom calls collective superintelligence: “a system achieving superior performance by aggregating large numbers of smaller intelligences.”
However, it is possible that within their framework, OpenAI means something that is, perhaps definitionally, not superintelligence - that is, they would define these as systems only as capable as humans or human organizations, rather than ones far outstripping them. And this is where I think their levels are not just misnamed, but fundamentally confused: as presented, these are not levels, they are conceptually distinct possible applications, pathways, or outcomes.
Ordering Levels?
Second, as I just noted, the claim that these five distinct descriptions are “levels” which can be used to track progress implies that OpenAI has not only a clear idea of what each stage would require, but also a roadmap showing that the levels will arrive in the specified order. That seems very hard to believe, on both counts. I won’t go into why I think they don’t know what the path looks like, but I can at least explain why the ordering is dubious.
For instance, there are certainly human “agents” who are unable to perform the tasks expected at what they call level 2, i.e. what an unassisted doctorate-level individual is able to do. Given that, what is the reason level 3 comes after level 2? Similarly, the ability to coordinate and cooperate is not bound by the ability to function at a very high intellectual level; many organizations have no members with PhDs, but still run grocery stores, taxi companies, or manufacturing plants.
And we’re already seeing work being done on agents that are intended to operate largely independently, performing several days of human work without specific supervision. At present, it seems these systems fail partly because of the limitations of the underlying systems, and partly because better structures for these systems are needed. However, at the very least, it’s unclear whether we’d see AI that can innovate effectively (level 4) before or after they are successful working independently (level 3).
So it seems that we have no idea whether GPT-5, whenever they decide to release it, will end up as a level-5-but-not-4 system (an organization that cannot innovate), or a level-3-but-not-2 system (an agent without a PhD), or a level-4-but-not-3 system (an innovator that cannot operate independently for multiple days). Of course, it’s possible that all of these objections will be addressed in OpenAI’s full “progress tracking system” - but it seems far more likely that the levels they are talking about are more a marketing technique to sell the idea that their systems will be predictable in their abilities.
I’m deeply skeptical.