MrCheeze

But you have to be careful here, since the results heavily depend on details of the harness, as well as on how thoroughly the models have memorized walkthroughs of the game.
Text adventures do seem like a good eval right now, since they're the ONLY games that can be tested without either relying on vision (which is still very bad), or writing a custom harness for each game (in which case your results depend heavily on the harness).
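To make that concrete, here's a minimal sketch of what a game-agnostic text-adventure harness could look like (the `Game` and `query_llm` interfaces are placeholders I'm inventing, not any real API):

```python
# Hypothetical sketch of a game-agnostic text-adventure eval loop.
# Because the game exposes its whole state as text, the same loop works
# for any title - no per-game harness and no vision required.

def play(game, query_llm, max_turns=500):
    transcript = [game.reset()]          # opening room description
    for _ in range(max_turns):
        # The model sees the raw transcript and replies with a command
        # like "open mailbox" or "go north".
        command = query_llm(
            "You are playing a text adventure. Transcript so far:\n"
            + "\n".join(transcript)
            + "\nReply with only your next command."
        )
        transcript.append("> " + command)
        observation, done = game.step(command)
        transcript.append(observation)
        if done:
            break
    return game.score()
```

The harness contributes almost nothing here, so scores mostly reflect the model itself - plus, as noted above, however much of the game it has memorized.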
(Gemini did actually write much of the Gemini_Plays_Pokemon scaffolding, but only in the sense of doing what David told it to do, not designing and testing it.)
I think you're probably right that an LLM coding its own scaffolding is more achievable than one playing the game like a human, but I don't think current models can do it - watching the streams, the models don't seem to understand their own flaws, although admittedly they haven't been prompted to focus on this.
On the other hand, Claude has (arguably) a better pathfinding tool. As long as it requests to be moved to a valid set of coordinates from the screenshot overlay grid, the tool will move it there. Gemini mostly navigates on its own, although it has access to another instance of Gemini dedicated just to pathfinding.
I very much argue this. Claude's navigator tool can only navigate to coordinates that are onscreen, meaning that the main model needs to have some idea of where it's going. Which means grappling with problems that are extremely difficult for both models, such as "go AROUND the wall instead of right through it".
In contrast, the Gemini pathfinder tool...
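To illustrate the constraint (in invented code - this is not either stream's actual tooling): an onscreen-only navigator like Claude's amounts to a BFS that refuses any target outside the visible grid, so routing around anything offscreen is still left entirely to the model:

```python
from collections import deque

# Rough sketch of an onscreen-only navigator like the one described
# above; the names and grid format are made up for illustration.
# `grid` is the visible screenshot overlay: grid[y][x] is True if that
# tile is walkable. Coordinates outside the window are rejected, so
# any long-range routing still falls on the model.

def navigate(grid, start, target):
    h, w = len(grid), len(grid[0])
    ty, tx = target
    if not (0 <= ty < h and 0 <= tx < w):
        raise ValueError("target is offscreen - the model must pick a visible tile")
    frontier = deque([(start, [])])
    seen = {start}
    while frontier:
        (y, x), path = frontier.popleft()
        if (y, x) == target:
            return path                      # list of moves to execute
        for dy, dx, move in ((-1, 0, "up"), (1, 0, "down"),
                             (0, -1, "left"), (0, 1, "right")):
            ny, nx = y + dy, x + dx
            if 0 <= ny < h and 0 <= nx < w and grid[ny][nx] and (ny, nx) not in seen:
                seen.add((ny, nx))
                frontier.append(((ny, nx), path + [move]))
    return None                              # no onscreen route exists
```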
I have not tested whether Gemini can distinguish this tree (though I intend to eventually). This may very well be the only reason Gemini has progressed further.
You missed an important fact about the Gemini stream, which is that it just reads the presence of these trees from RAM and labels them for the model (along with a few other special tiles like ledges and water). Nevertheless, I do think Gemini's vision is better, by which I mean that if you provide it a screenshot it will sometimes identify the correct tree, whereas Claude never does. (Although to my knowledge the Gemini in the stream has literally never used vision for anything.)...
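For anyone wondering what "reads the presence of these trees from RAM" amounts to in practice, it's roughly the following - a sketch with made-up tile IDs and a made-up `read_ram` hook, since I don't know the stream's actual code:

```python
# Illustrative sketch only: the real stream's RAM addresses and tile
# IDs are not something I know, so everything here is invented.
# The idea: instead of asking the model to recognize a cuttable tree
# in a screenshot, read the tile map out of emulator memory and hand
# the model explicit labels.

SPECIAL_TILES = {0x3D: "CUT TREE", 0x36: "LEDGE", 0x14: "WATER"}  # made-up IDs

def label_special_tiles(read_ram, map_base, width, height):
    labels = []
    for y in range(height):
        for x in range(width):
            tile_id = read_ram(map_base + y * width + x)
            if tile_id in SPECIAL_TILES:
                labels.append(f"({x},{y}): {SPECIAL_TILES[tile_id]}")
    return labels  # appended to the model's prompt alongside the map
```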
And now, in the second run, it has entered a similar delusional loop. It knows the way to Cerulean City is via Route 4, but the stretches of road before and after Mt. Moon are both considered part of Route 4, so it has deluded itself into thinking it can reach Cerulean from the first stretch. Because of that, every time it accidentally stumbles into Mt. Moon and starts making substantial progress towards the exit, it intentionally blacks out to get teleported back outside the entrance, so it can keep looking for the nonexistent path forward.
From what I've seen on stream, the chances of it questioning and breaking from this delusion are...
Note that the creator stated that the setup is intentionally somewhat underengineered:
I do not claim this is the world's most incredible agent harness; in fact, I explicitly have tried not to "hyper engineer" this to be like the best chance that exists to beat Pokemon. I think it'd be trivial to build a better computer program to beat Pokemon with Claude in the loop.
This is meant to be some combination of "understand what Claude's good at and benchmark it" and "understand Claude alongside a simple agent harness", so what that boils down to is a pretty straightforward tool-using agent.
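For reference, a "pretty straightforward tool-using agent" in this sense is roughly the following loop - my own sketch, not the actual harness; the tool names and interfaces are stand-ins:

```python
# Minimal sketch of a straightforward tool-using agent harness, in the
# spirit of the quote above. Not the real code: the tool set, the
# model interface, and the emulator interface are all stand-ins.

def agent_loop(model, emulator, max_steps=1000):
    notes = []                       # persistent notes the model writes to itself
    history = []                     # recent tool calls and their results
    tools = {
        "press_buttons": lambda args: emulator.press(args),   # e.g. ["a", "up"]
        "take_note":     lambda args: notes.append(args),
    }
    for _ in range(max_steps):
        # The model sees its notes, a bounded window of history, and the
        # current frame, then replies with a single tool call, e.g.
        # {"tool": "press_buttons", "args": ["up", "up", "a"]}.
        call = model.decide(notes, history[-50:], emulator.screenshot())
        result = tools[call["tool"]](call["args"])
        history.append((call, result))
```

The bounded history window is the detail that matters most for the behavior described below: anything that scrolls out of it survives only if the model wrote it into its notes.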
This basically sums up how it's doing: https://www.reddit.com/r/ClaudePlaysPokemon/comments/1j568ck/the_mount_moon_experience
Of course, much of that comes down to basic capability issues - poor spatial reasoning, short-term memory that doesn't come anywhere close to lasting for one lap, etc.
But I've also noticed ways in which Claude's personality is sabotaging it. Claude is capable of taking notes saying that it "THOROUGHLY confirmed NO passages" through the eastern barrier - but it never gets impatient or frustrated, so this doesn't actually stop it from trying the same thing every time it sees the eastern wall again.
And in general, it seems to have a strong bias towards visiting places that are mentioned frequently in its notes - even though that's the exact...
"Under development" and "currently training" I interpret as having significantly different meanings.
This is exactly what I was thinking while reading the post. They didn't advertise conversational changes, but they DID advertise agentic improvements, and improving its ability to vary its approaches to tasks is an obvious way of doing that.
(That said, it's not necessarily true that this is a general Claude improvement rather than one that just happens by chance to show up in this specific test.)