Background: After the release of Claude 3.7 Sonnet,[1] an Anthropic employee started livestreaming Claude trying to play through Pokémon Red. The livestream is still going right now.
TL:DR: So, how's it doing? Well, pretty badly. Worse than a 6-year-old would, definitely not PhD-level.
Digging in
But wait! you say. Didn't Anthropic publish a benchmark showing Claude isn't half-bad at Pokémon? Why yes they did:
and the data shown is believable. Currently, the livestream is on its third attempt, with the first being basically just a test run. The second attempt got all the way to Vermilion City, finding a way through the infamous Mt. Moon maze and achieving two badges, so pretty close to the benchmark.
But look carefully at the x-axis in that graph. Each "action" is a full Thinking analysis of the current situation (often several paragraphs worth), followed by a decision to send some kind of input to the game. Thirty-five thousand actions means an absolutely enormous amount of thought. Even for Claude, who thinks much faster than a human, ten thousand actions takes it roughly a full working week of 40 hours,[2] so that 3-badge run took Claude nearly the equivalent of a month of full-time work, perhaps 140 hours. Meanwhile, the average human can beat the entirety of Red in just 26 hours, and with substantially less thought per hour.
What's going wrong?
Basically, while Claude is pretty good at short-term reasoning (ex. Pokémon battles), he's bad at executive function and has a poor memory. This is despite a great deal of scaffolding, including a knowledge base, a critic Claude that helps it maintain its knowledge base, and a variety of tools to help it interact with the game more easily.
What does that mean in practice? If you open the stream, you'll see it immediately: Claude on Run #3 has been stuck in Mt. Moon for 24 hours straight.[3] On Run #2, it took him 78 hours to escape Mt. Moon.
Mt. Moon is not that complicated. It has a few cave levels, and a few trainers. But Claude gets stuck going in loops, trying to talk to trainers he's already beaten (and often failing to talk to them, not understanding why his inputs don't do what he expects), inching across solid walls looking for exits that can't possibly be there, always taking the obvious trap route rather than the longer correct route just because it's closer.
This hasn't been the only problem. Run #2 eventually failed because it couldn't figure out it needed to talk to Bill to progress, and Claude wasn't willing to try any new action when the same infinite rotation of wrong choices (walk in circles, enter the same buildings, complain the game is maybe broken) wasn't working.[4]
A good friend of mine has been watching the stream quite a lot, and describes Claude's faults well. Lightly edited for clarity:
This current Claudeplayspokemon run is actually an interesting encapsulation of current limitations in llms
That are more fundamental than just "LLMs don't understand spatial reasoning" (but they don't)
They added a whole new memory system after Run #2 with short term and long-term text files and the ability to edit and store and archive and now he has a pretty good memory system that does improve navigation
but now the lack of executive planning and goal-planning and grasp of reality is really rearing its head
no amount of good memory system will save you if you just randomly see something and go:
> "Oh I've achieved my goal!"
> "Time to delete all my past files about achieving this goal!"
when goal was not actually achieved[5]
It really lacks a lot of human ability to plan, hold multiple goals at once, prioritize, and just keep a grasp of what's going on
(re: goal orientation, you just have to witness its relative inability to simultaneously aim for the short-term goal while also leveling up its pokemon)
(It can level its pokemon if that's the goal right now, or move forward if that its goal right now, but it can't simultaneously level up pokemon while also moving forward)
(at least not to a human level of efficiency, it will half-heartedly do some combat while the team is healthy then immediately abandon any thought of leveling if the team is moderately injured)
(It's also not capable of changing goals on the fly and going "Well I'm too injured to make it, let's get some levels and bail")
The funny thing is that pokemon is a simple, railroady enough game that RNG can beat the game given enough time (and this has been done)[6], but it turns out to take a surprising amount of cognitive architecture to play the game in a fully-sensible-looking way
and insufficient smarts can be surprisingly double-edged—an RNG run would arguably be better at both leveling and navigating mazes through sheer random walkitude and willingness to bash face into every fight
as opposed to getting stuck in loops or refusing to engage for bad reasons
Thanks for coming to my ted talk but my overall thesis is still Executive Function is an unsolved problem
(Executive Function is, reminder, goals, prioritization, attention, etc.)
What does this mean for AI?
It's obvious that, while Claude isn't very good at playing Pokémon, it is getting better. 3.7 does significantly better than 3.5 did, after all, and 3.0 was hopeless. So isn't its extremely hard-earned (half-random) achievements so far still progress in the right direction and an indication of things to come?
Well, yeah. But Executive Function (agents, etc.) has always been the big missing puzzle piece, and even with copious amounts of test-time compute, tool use, scaffolding, external memory, the latest and greatest LLM still is not at even a child's level.
But the thing about ClaudePlaysPokémon is that it feels so close sometimes. Each "action" is reasoned carefully (if often on the basis of faulty knowledge), and of course Claude's irrepressible good humor is charming. If it could just plan a little better, or had a little outside direction, it'd clearly blow through the game no problem.
Quoting my friend again (lightly edited again):
it's fairly obvious if there was just someone human there to monitor Claude and give it some guidance maybe once every hour, it'd probably be 10x further through the game now
That's meaningful insomuch as from a "taking people's jobs" point of view it may be possible to, even if the AI really sucks at goal planning, just chain 10 bad AIs to one employee and have them monitorthis still sucks ~9 jobs away from people
Another interesting thought: I talk about "Executive Function" and the brain regions associated with it are the largest and most recently evolved in humans
which makes sense
but a lot of goal-orientation and grasp on reality stuff is reasonably well-developed in the average squirrel
Not to a human level but better than most LLMs
So it's not just a matter of "this is hard stuff that only elite humans do"
No, a decent amount of this is fairly old stuff that's probably pretty deep in the brain, and the evolutionary pressure is presumably fairly strong to have developed it fast
but the financial pressure on LLMs doesn't seem to have the same effect[7]
Conclusion
In Leopold Aschenbrenner's infamous Situational Awareness paper, in which he correctly predicted the rise of reasoning models, he discussed the concept of "unhobbling" LLMs. That is, figuring out how to get from the brilliant short-term thinking we already have today all the way to the agent-level executive function necessary for true AGI.
ClaudePlaysPokémon is proof that the last 6 months of AI innovation, while incredible, are still far from the true unhobbling necessary for an AI revolution. That doesn't mean 2-year AGI timelines are wrong, but it does feel to me like some new paradigm is yet required for them to be right. If you watch the stream for a couple hours, I think you'll feel the same.
- ^
As part of the release, they published a Pokémon benchmark here.
- ^
Roughly guesstimating based off the fact that ~48 hours in to Run #3, it's at ~12,000 actions taken.
- ^
Update: at the 25 hour mark Claude whited out after losing a battle, putting it back before Mt. Moon. The "Time in Mt. Moon" timer on the livestream is still accurate.
- ^
This is simplified. For a full writeup, see here. Run #2 was abandoned at 35,000 actions.
- ^
This specifically happened with Mt. Moon in Run #3. Claude deleted all its notes on Mt. Moon routing after whiting out and respawning at a previous Pokécenter. It mistakenly concluded that it had already beaten Mt. Moon.
- ^
Actually it turns out this hasn't been done, sorry! A couple RNG attempts were completed, but they involved some human direction/cheating. The point still stands only in the sense that, if Claude took more random/exploratory actions rather than carefully-reasoned shortsighted actions, he'd do better.
- ^
Okay actually this line I wrote myself—the rest is from my friend—but he failed to write a pithy conclusion for me.
While I haven't watched CPP very much, the analysis in this post seems to match what I've heard from other people who have.
That said, I think claims like
are overconfident about where the human baselines are. Moreover, I think these sorts of claims reflect a general blindspot about how humans can get stuck on trivial obstacles in the same way AIs do.
A personal anecdote: when I was a kid (maybe 3rd or 4th grade, so 8 or 9 years old) I played Pokemon red and couldn't figure out how to get out of the first room—same as the Claude 3.0 Sonnet performance! Why? Well is it obvious to you where the exit to this room is?
Answer: you have to stand on the carpet and press down.
Apparently this was a common issue! See this reddit thread for discussion of people who hit the same snag as me. In fact, it was a big enough issue that addressed it in the FireRed remake, making the rug stick out a bit:
I don't think this is an isolated issue with the first room. Rather, I think that as railroaded as Pokemon might seem, there's actually a bunch of things that it's easy to get crucially confused about, resulting in getting totally stuck for a dumb reason until someone helps you out.
Some other examples of similar things from the same reddit thread:
These are totally the same sorts of mistakes that I remember making playing Pokemon as a kid.
Further, have you ever gotten an adult who doesn't normally play video games to try playing one? They have a tendency to get totally stuck in tutorial levels because game developers rely on certain "video game motifs" for load-bearing forms of communication; see e.g. this video.
I don't think this is specific to video games: In most things I try to do, I run up against stupid, fake walls where there's something obvious that I just "don't get." Fortunately, I'm able to do things like ask someone for a fresh pair of eyes or search the internet. Without this ability, I think I would have to abandon basically all of the core things I work on. When I need to help out people with worse "executive function"/"problem solving ability" than me—like relatives that need basic tech help—usually the main thing I do to unstuck them is "google their problem."
(As a more narrow point, I'm extremely dubious that the way to interpret howlongtobeat's 26 hour number as representing the time that it would take an average human to beat Pokemon Red, even assuming that the humans are adults and that we entirely discard failed playthroughs.)
I kind of see your point about having all the game wikis, but I think I disagree about learning to code being necessarily interactive. Think about what feedback the compiler provides you: it tells you if you made a mistake, and sometimes what the mistake was. In cases where it runs but doesn't do what you wanted, it might "show" you what the mistake was instead. You can learn programming just fine by reading and writing code but never running it, if you also have somebody knowledgeable checking what you wrote and explaining your mistakes. LLMs have tons of examples of that kind of thing in their training data.