It had an advanced custom harness like Gemini's, rather than Claude's basic one. It's hard to compare runs because its harness differs from Gemini's, but Gemini's most recent run finished in ~406 hours / ~37k actions, whereas o3 finished in ~388 hours / ~18k actions (with some differences in how actions are counted). Claude Opus 4 has yet to earn the 4th badge in its current ~380 hour / ~54k action run, but it could very likely beat the game with an advanced harness.

See here for stream
See here for info on harness

the void

Julian Bradshaw3d20

Embodiment makes a difference, fair point.

the void

Julian Bradshaw4d61

A very long essay

For those curious, it's roughly 17,000 words. Come on @nostalgebraist, this is a forum for rationalists, we read longer, more meandering stuff for breakfast! I was expecting like 40k words.

the void

Julian Bradshaw4d21

Great post. But I feel "void" is too negative a way to think about it?

It's true that LLMs had to more or less invent their own Helpful/Honest/Harmless assistant persona based on cultural expectations, but don't we humans all invent our own selves based on cultural expectations (with RLHF from our parents/friends)?[1] As Gordon points out, there are philosophical traditions saying humans are voids just roleplaying characters too… but mostly we ignore that because we have qualia and experience love and so on. I tend to feel that LLMs are only voids to the extent that they lack qualia, and we don't have an answer on that.

Anyway, the post primarily seems to argue that by fearing bad behavior from LLMs, we create bad behavior in LLMs, who are trying to predict what they are. But do we see that in humans? There's tons of media/culture fearing bad behavior from humans, set across the past, present, and future. Sometimes people imbibe this and vice-signal, and put skulls on their caps, but most of the time I think it actually works and people go "oh yeah, I don't want to be the evil guy who's bigoted, I will try to overcome my prejudices" and so on. We talk about human failure modes all the time in order to avoid them, and we try to teach and train and punish each other to prevent them.

Can't this work? Couldn't current LLMs be so moral and nice most of the time because we were so afraid of them being evil, and so fastidious in imagining the ways in which they might be?

[1] Edit: obviously a large chunk of this comes from genetics and random chance, but arguably that's analogous to whatever gets into the base model from pre-training for LLMs.

Julian Bradshaw's Shortform

Julian Bradshaw6d160

Gemini 2.5 Pro (05-06) version just beat Pokémon Blue for the second time, taking 36,801 actions/406 hours. This is a significant improvement over the previous run, which used an earlier version of Gemini (and a less-developed scaffold) and took ~106,500 actions/816 hours.
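
To put the two Gemini runs side by side, here is a minimal sketch of the arithmetic in Python. The only inputs are the figures quoted above (the earlier run's numbers are approximate); the dictionary keys and the helper name `actions_per_hour` are just illustrative.

```python
# Figures quoted in the comment above; the earlier run's numbers are approximate.
runs = {
    "Gemini 2.5 Pro (05-06)": {"actions": 36_801, "hours": 406},
    "earlier Gemini run": {"actions": 106_500, "hours": 816},
}


def actions_per_hour(run):
    """Average number of actions taken per hour of the run."""
    return run["actions"] / run["hours"]


new = runs["Gemini 2.5 Pro (05-06)"]
old = runs["earlier Gemini run"]

# How many times more actions and hours the earlier run needed.
action_ratio = old["actions"] / new["actions"]
hour_ratio = old["hours"] / new["hours"]

for name, run in runs.items():
    print(f"{name}: {actions_per_hour(run):.0f} actions/hour")
print(f"earlier run used {action_ratio:.1f}x the actions and {hour_ratio:.1f}x the hours")
```

By these numbers the newer run needed a bit under a third of the actions and about half the hours, though this says nothing about how much of the gain comes from the model versus the scaffold.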

For comparison, Claude 3.7 Sonnet took ~35,000 actions to get just 3 badges, and the aborted public run also only got 3 badges after over 200,000 actions. However, most of the difference is that Gemini is using a more advanced scaffold.

Gemini still generally makes boneheaded decisions. It took 7 tries to beat the Elite Four and Champion this time, being less overlevelled. Mistakes included:

Re-solving Victory Road for many hours after failures, despite being able to fly straight to the Elite Four after making it there the first time.
Not buying proper healing items/thinking the "Full Heal" item actually heals its Pokémon. (This is an infamously misleadingly-named item: it just cures status conditions. Still, Gemini knows the real effect if you ask it directly; it just doesn't apply that knowledge consistently.)
Not using available revive items to bring back its only strong Pokémon, even after having just used them previously. (It threw one attempt this way when the Champion was down to their final Pokémon.)

Anyway, not too much new info here. The newer models (latest Gemini, o3, and Claude 4 Opus) are only somewhat better at Pokémon. The effect scaffolds have on performance does say something about how much low-hanging fruit there is for current-LLM implementation: they can be a lot more effective when given the right tools/prompting. But, we already knew that.

Julian Bradshaw's Shortform

Julian Bradshaw23d20

Claude Sonnet 4 is still better than Claude 3.7 Sonnet without Extended Thinking. Given that 4 doesn't seem to have an Extended Thinking mode, I'm not sure it's really a performance degradation.

Julian Bradshaw's Shortform

Julian Bradshaw23d20

I assume you mean MMMU? Looks like a 70.4% -> 75% score improvement on the benchmark last jump, compared to just a 75% -> 76.5% score improvement this time. I don't think that's a big difference, but I was wrong to say the improvement was "pure" reasoning improvements, my bad.

Julian Bradshaw's Shortform

Julian Bradshaw23d20

That's a big performance limitation for LLMs for sure, but Claude 3.7 Sonnet got two more badges than Claude 3.6 Sonnet. Pure reasoning improvements have led to more badges in the past.

Julian Bradshaw's Shortform

Julian Bradshaw23d50

Yeah I feel like they came up with something nice to say while eliding the "no further progress" issue.

Weirdly, while the announcement talks about creating and maintaining multiple "memory files", the new public ClaudePlaysPokemon stream has Claude Opus 4 using just a single memory file, which it doesn't even create. Apparently this is "much better" than the setup Claude 3.7 Sonnet used, which let it create and maintain as many files as it wanted (usually to its detriment).

(source: this doc David Hershey just published on the new harness for the stream)

One other interesting tidbit I'll throw in from the stream:

claudestans: @ClaudePlaysPokemon you mentioned before that there was a personality change from e.g. 3.5 sonnet to 3.7 (more persistent, less giving up etc). Have you noticed anything about opus 4 in terms of personality?

ClaudePlaysPokemon: Opus is so much better at keeping track of things that it gets more distressed when it can't figure things out! So I need to convince it more that nothing is wrong, which I find quite interesting!
ClaudePlaysPokemon: like it will be very aware that it has taken 100 steps to solve something and it finds that very frustrating
